Big Data: Hive
8.5.5 Comparison Experiment of Mainstream File Storage Formats
The formats are compared from two angles: the compression ratio of the stored files and the query speed.
Compression ratio test of the stored files:
0) Test data
The test data set used throughout this experiment is log.data, a tab-delimited log file of about 18.1 MB (100,000 rows), located at /opt/module/datas/log.data.
1)TextFile
(1) Create a table with TEXTFILE as the storage format
create table log_text(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;
(2) Load data into the table
hive (default)> load data local inpath '/opt/module/datas/log.data' into table log_text;
(3) Check the size of the data in the table
hive (default)> dfs -du -h /user/hive/warehouse/log_text;
18.1 M /user/hive/warehouse/log_text/log.data
2)ORC
(1) Create a table with ORC as the storage format
create table log_orc(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as orc;
(2) Load data into the table
hive (default)> insert into table log_orc select * from log_text;
(3) Check the size of the data in the table
hive (default)> dfs -du -h /user/hive/warehouse/log_orc/;
2.8 M /user/hive/warehouse/log_orc/000000_0
3)Parquet
(1) Create a table with PARQUET as the storage format
create table log_parquet(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;
(2) Load data into the table
hive (default)> insert into table log_parquet select * from log_text;
(3) Check the size of the data in the table
hive (default)> dfs -du -h /user/hive/warehouse/log_parquet/;
13.1 M /user/hive/warehouse/log_parquet/000000_0
Summary of the compression ratio test:
ORC > Parquet > TextFile (ORC produces the smallest files, TextFile the largest).
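A quick back-of-the-envelope check using the sizes measured above: ORC shrinks the 18.1 M text file to 2.8 M, roughly a 6.5:1 ratio, while Parquet's 13.1 M is only about 1.4:1. The gap is largely because this ORC table compresses with ZLIB by default (see 8.6.2), whereas the Parquet table here was created without specifying any compression codec.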
Query speed test of the stored files:
1)TextFile
hive (default)> select count(*) from log_text;
_c0
100000
Time taken: 21.54 seconds, Fetched: 1 row(s)
Time taken: 21.08 seconds, Fetched: 1 row(s)
2)ORC
hive (default)> select count(*) from log_orc;
_c0
100000
Time taken: 20.867 seconds, Fetched: 1 row(s)
Time taken: 22.667 seconds, Fetched: 1 row(s)
3)Parquet
hive (default)> select count(*) from log_parquet;
_c0
100000
Time taken: 22.922 seconds, Fetched: 1 row(s)
Time taken: 21.074 seconds, Fetched: 1 row(s)
Query speed summary: the three formats are close. On a data set this small, a full-scan count(*) is dominated by MapReduce job startup overhead, so the run-to-run variance (each query was executed twice) is larger than any difference between the formats.
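The original experiment stops at count(*). As a hypothetical follow-up, a query that reads only a subset of columns would better expose the strength of columnar formats, since ORC and Parquet can skip the columns a query does not reference. A minimal sketch against the three tables created above:
-- Hypothetical follow-up queries (not part of the original test run):
-- only session_id is read, so columnar formats can skip the other columns.
hive (default)> select count(distinct session_id) from log_text;
hive (default)> select count(distinct session_id) from log_orc;
hive (default)> select count(distinct session_id) from log_parquet;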
8.6 Combining Storage and Compression
8.6.1 Enabling Snappy Compression on the Hadoop Cluster
1) View the usage of the hadoop checknative command
[jduser@hadoop104 hadoop-2.7.2]$ hadoop
  checknative [-a|-h]  check native hadoop and compression libraries availability
2) Check which compression codecs Hadoop supports
[jduser@hadoop104 hadoop-2.7.2]$ hadoop checknative
17/12/24 20:32:52 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
17/12/24 20:32:52 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so
zlib:    true /lib64/libz.so.1
snappy:  false
lz4:     true revision:99
bzip2:   false
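snappy is reported as false because the cluster's native library directory lacks libsnappy. The steps below copy in the shared libraries from a Hadoop build that was recompiled with Snappy support.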
3) Copy the recompiled hadoop-2.7.2.tar.gz package (built with Snappy support) into /opt/software on hadoop102
4) Extract hadoop-2.7.2.tar.gz in the current directory
[jduser@hadoop102 software]$ tar -zxvf hadoop-2.7.2.tar.gz
5) In the /opt/software/hadoop-2.7.2/lib/native directory you can see the shared libraries that provide Snappy support
[jduser@hadoop102 native]$ pwd
/opt/software/hadoop-2.7.2/lib/native
[jduser@hadoop102 native]$ ll
-rw-r--r--. 1 jduser jduser 472950 Sep  1 10:19 libsnappy.a
-rwxr-xr-x. 1 jduser jduser    955 Sep  1 10:19 libsnappy.la
lrwxrwxrwx. 1 jduser jduser     18 Dec 24 20:39 libsnappy.so -> libsnappy.so.1.3.0
lrwxrwxrwx. 1 jduser jduser     18 Dec 24 20:39 libsnappy.so.1 -> libsnappy.so.1.3.0
-rwxr-xr-x. 1 jduser jduser 228177 Sep  1 10:19 libsnappy.so.1.3.0
6) Copy everything in /opt/software/hadoop-2.7.2/lib/native to /opt/module/hadoop-2.7.2/lib/native on the development cluster
[jduser@hadoop102 native]$ cp ../native/* /opt/module/hadoop-2.7.2/lib/native/
7) Distribute the directory to the rest of the cluster
[jduser@hadoop102 lib]$ xsync native/
8) Check the compression types Hadoop supports again
[jduser@hadoop102 hadoop-2.7.2]$ hadoop checknative
17/12/24 20:45:02 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
17/12/24 20:45:02 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so
zlib:    true /lib64/libz.so.1
snappy:  true /opt/module/hadoop-2.7.2/lib/native/libsnappy.so.1
lz4:     true revision:99
bzip2:   false
9) Restart the Hadoop cluster and Hive so the new native libraries are picked up
8.6.2 Testing Storage Combined with Compression
Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
Compression-related table properties for the ORC storage format:
Key                      | Default    | Notes
orc.compress             | ZLIB       | high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size        | 262,144    | number of bytes in each compression chunk
orc.stripe.size          | 67,108,864 | number of bytes in each stripe
orc.row.index.stride     | 10,000     | number of rows between index entries (must be >= 1000)
orc.create.index         | true       | whether to create row indexes
orc.bloom.filter.columns | ""         | comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp     | 0.05       | false positive probability for bloom filter (must be >0.0 and <1.0)
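All of these properties are set through tblproperties at table-creation time. As an illustrative sketch only (the table name, trimmed column list, and choice of bloom-filter column are hypothetical, not part of the experiment below), several properties can be combined in one statement:
-- Illustrative sketch: Snappy compression plus a bloom filter on one column.
create table log_orc_tuned(
    track_time string,
    url string,
    session_id string
)
stored as orc
tblproperties("orc.compress"="SNAPPY", "orc.bloom.filter.columns"="session_id");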
1) Create an ORC table with no compression
(1) Create-table statement
create table log_orc_none(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties("orc.compress"="NONE");
(2) Insert data
hive (default)> insert into table log_orc_none select * from log_text;
(3) Check the data size after the insert
hive (default)> dfs -du -h /user/hive/warehouse/log_orc_none/;
7.7 M /user/hive/warehouse/log_orc_none/000000_0
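Note that even with orc.compress set to NONE, the file (7.7 M) is far smaller than the 18.1 M text original: ORC's columnar encodings (such as run-length and dictionary encoding) already compact the data before any compression codec is applied.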
2) Create an ORC table with SNAPPY compression
(1) Create-table statement
create table log_orc_snappy(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties("orc.compress"="SNAPPY");
(2) Insert data
hive (default)> insert into table log_orc_snappy select * from log_text;
(3) Check the data size after the insert
hive (default)> dfs -du -h /user/hive/warehouse/log_orc_snappy/;
3.8 M /user/hive/warehouse/log_orc_snappy/000000_0
3) The ORC table created with default settings in the previous section is, after loading the data:
2.8 M /user/hive/warehouse/log_orc/000000_0
That is even smaller than the Snappy-compressed table. The reason is that ORC files are compressed with ZLIB by default, and ZLIB compresses more tightly than Snappy (Snappy trades some compression ratio for faster compression and decompression).
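Putting the three ORC measurements above side by side:
orc.compress   | File size
NONE           | 7.7 M
SNAPPY         | 3.8 M
ZLIB (default) | 2.8 M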
4) Storage format and compression summary:
In real-world project development, the data storage format for Hive tables is usually ORC or Parquet, and the compression codec is usually Snappy or LZO.
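An ORC-plus-Snappy table was already shown above. For Parquet, a minimal sketch of the equivalent (assuming a Hive version that honors the parquet.compression table property; the table name is hypothetical) would be:
-- Hypothetical example: Parquet storage with Snappy compression.
-- parquet.compression as a table property is assumed to be supported here;
-- alternatively, run "set parquet.compression=SNAPPY;" before the insert.
create table log_parquet_snappy(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
stored as parquet
tblproperties("parquet.compression"="SNAPPY");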