Hive performance testing
1. Preparing the data

Data-generation script gendata.sh:

#!/bin/bash
# Usage: bash gendata.sh <output file> <string for the second column>
# Writes 10,000,000 lines of the form "<i>,<string><i>".
file=$1
s=$2
touch $file
for((i=0;i<10000000;i++))
do
    str=','$s
    name=${i}${str}${i}
    #echo $name
    echo $name >> $file
done

echo 'show testdata'
head $file
Generate the data. First create ten small files, 10 million records each:

bash gendata.sh name.txt name
bash gendata.sh zhuzhi.txt zhuzhi
bash gendata.sh minzu.txt minzu
bash gendata.sh jg.txt jg
bash gendata.sh gj.txt gj
bash gendata.sh dz.txt dz
bash gendata.sh abcd.txt abcd
bash gendata.sh efgh.txt efgh
bash gendata.sh xyz.txt xyz
bash gendata.sh opq.txt opq
total 1.8G
-rw-r--r-- 1 root root 189M Feb 9 10:35 abcd.txt
-rw-r--r-- 1 root root 170M Feb 9 10:32 dz.txt
-rw-r--r-- 1 root root 189M Feb 9 10:38 efgh.txt
-rw-r--r-- 1 root root 170M Feb 9 10:28 gj.txt
-rw-r--r-- 1 root root 170M Feb 9 10:25 jg.txt
-rw-r--r-- 1 root root 199M Feb 9 10:22 minzu.txt
-rw-r--r-- 1 root root 189M Feb 9 10:08 name.txt
-rw-r--r-- 1 root root 180M Feb 9 10:49 opq.txt
-rw-r--r-- 1 root root 180M Feb 9 10:41 xyz.txt
-rw-r--r-- 1 root root 208M Feb 9 10:19 zhuzhi.txt
Then one large file of 100 million records (same script, presumably with the loop bound raised to 100,000,000):

bash gendata.sh name1000.txt name
-rw-r--r-- 1 root root 2.1G Feb 9 10:50 name1000.txt
2. Test: 10 small files, 10 million records each (~180 MB per file), 100 million records / 1.8 GB in total, no optimization
Create the table in Hive:
create table hyl_test_par(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;
Manually create the partition directory:
hadoop fs -mkdir -p /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003
Upload the data and fix ownership:
hadoop fs -put *.txt /apps/hive/warehouse/hyl_test_par/sys_sj=20170209/sys_type=2003/
hadoop fs -chown -R hive /apps/hive/warehouse/hyl_test_par/
Repair the partition metadata:
0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;
+------------+--+
| partition  |
+------------+--+
+------------+--+
No rows selected (0.125 seconds)

0: jdbc:hive2://cluster09.hzhz.co:10000> msck repair table hyl_test_par;
No rows affected (0.551 seconds)

0: jdbc:hive2://cluster09.hzhz.co:10000> show partitions hyl_test_par;
+--------------------------------+--+
|           partition            |
+--------------------------------+--+
| sys_sj=20170209/sys_type=2003  |
+--------------------------------+--+
1 row selected (0.123 seconds)
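As a side note, the manual hadoop fs -mkdir plus msck repair step can usually be replaced by registering or loading the partition directly; a minimal sketch with the same partition values (the local path in the LOAD DATA line is only illustrative):

-- Register an existing (or new) partition directory without msck:
ALTER TABLE hyl_test_par ADD IF NOT EXISTS PARTITION (sys_sj='20170209', sys_type='2003');
-- Or let Hive create the directory and register the partition while loading
-- (path below is illustrative, not from the original test):
-- LOAD DATA LOCAL INPATH '/tmp/name.txt' INTO TABLE hyl_test_par PARTITION (sys_sj='20170209', sys_type='2003');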
Test query:
select count(*) from hyl_test_par where name <> ' ' ;
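The name <> ' ' filter is there so the count cannot be answered from metadata alone and a real scan has to happen. A quick, hedged way to confirm which plan a query actually gets is EXPLAIN:

explain select count(*) from hyl_test_par where name <> ' ';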
First run: count = 100000000, 189.116 seconds
Second run: count = 100000000, 118.107 seconds
Third run: count = 100000000, 117.551 seconds
Fourth run: count = 100000000, 117.44 seconds
Fifth run: count = 100000000, 113.291 seconds
=======================================================================
Re-ran in the afternoon and the same query was more than 10x faster (cause unexplained in the original notes; heavier cluster load in the morning is a plausible culprit; the summary in section 6 uses these afternoon figures):

First run: count = 100000000, 11.274 seconds
Second run: count = 100000000, 11.525 seconds
Third run: count = 100000000, 11.11 seconds
Fourth run: count = 100000000, 11.722 seconds
3. Test: 10 small files (~180 MB each), 100 million records / 1.8 GB total, table with analyzed partition statistics
In the same way, create a table for the analyze test and load the data:
create table hyl_test_par_ana(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;
hadoop fs -mkdir -p /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003
hadoop fs -put *.txt /apps/hive/warehouse/hyl_test_par_ana/sys_sj=20170209/sys_type=2003/
hadoop fs -chown -R hive /apps/hive/warehouse/hyl_test_par_ana/
show partitions hyl_test_par_ana;
msck repair table hyl_test_par_ana;
show partitions hyl_test_par_ana;
Gather partition statistics:

0: jdbc:hive2://cluster09.hzhz.co:10000> analyze table hyl_test_par_ana partition(sys_sj=20170209,sys_type=2003) compute statistics;
INFO : Session is already open
INFO : Dag name: analyze table hyl_test_par_ana ...statistics(Stage-0)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1486351392526_0021)
. . .
INFO : Partition default.hyl_test_par_ana{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]
No rows affected (126.81 seconds)
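The full compute statistics pass scans the whole partition (about 127 seconds here). A couple of hedged variants of the same statement, not run in this test:

-- Analyze every partition at once by leaving the partition values unspecified:
analyze table hyl_test_par_ana partition(sys_sj, sys_type) compute statistics;
-- Gather only file-level stats (numFiles/totalSize) without scanning the data:
analyze table hyl_test_par_ana partition(sys_sj='20170209', sys_type='2003') compute statistics noscan;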
Test. With statistics in place, a plain count(*) comes straight back from the metastore without launching a job:

select count(*) from hyl_test_par_ana;
count = 100000000, 1 row selected (0.086 seconds)
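This behaviour is governed by hive.compute.query.using.stats, which is evidently enabled on this cluster; a minimal sketch of toggling it and of inspecting the stored statistics:

-- Answer count(*)/min/max from metastore statistics when possible:
set hive.compute.query.using.stats=true;
-- Force a real scan again, e.g. to compare timings:
set hive.compute.query.using.stats=false;
-- Inspect the statistics recorded for the partition:
describe formatted hyl_test_par_ana partition (sys_sj='20170209', sys_type='2003');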
Switch to a query that cannot be answered from statistics alone:

select count(*) from hyl_test_par_ana where name <> ' ';
First run: count = 100000000, 118.239 seconds
Second run: count = 100000000, 121.687 seconds
Third run: count = 100000000, 121.319 seconds
=======================================================================
Afternoon re-run, again more than 10x faster:

First run: count = 100000000, 10.923 seconds
Second run: count = 100000000, 6.058 seconds
Third run: count = 100000000, 6.45 seconds
Fourth run: count = 100000000, 6.218 seconds
4. Test: a single large file, 2.0 GB, 100 million records, no optimization

Create a table with the same structure for the single large file and upload the data:
create table hyl_test_par_big(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;
hadoop fs -mkdir -p /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003
hadoop fs -put name1000.txt /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209/sys_type=2003/
hadoop fs -chown -R hive /apps/hive/warehouse/hyl_test_par_big/
[hdfs@cluster13 tmp]$ hadoop fs -du -h /apps/hive/warehouse/hyl_test_par_big/
2.0 G /apps/hive/warehouse/hyl_test_par_big/sys_sj=20170209
show partitions hyl_test_par_big;
msck repair table hyl_test_par_big;
show partitions hyl_test_par_big;
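Note that a single 2 GB text file does not mean a single mapper: TEXTFILE is splittable, so reads are still parallelised by HDFS block and Tez split grouping rather than one mapper per file. If split sizes ever need tuning, the usual knobs on Hive-on-Tez look like this (a sketch, values only illustrative):

set tez.grouping.min-size=134217728;   -- ~128 MB per grouped split
set tez.grouping.max-size=268435456;   -- ~256 MB per grouped split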
Test query: select count(*) from hyl_test_par_big where name <> ' ';

First run: count = 100000000, 11.356 seconds
Second run: count = 100000000, 11.3 seconds
Third run: count = 100000000, 5.861 seconds
Fourth run: count = 100000000, 5.675 seconds
Fifth run: count = 100000000, 3.814 seconds
=======================================================================
Afternoon test:

First run: count = 100000000, 11.933 seconds
Second run: count = 100000000, 4.435 seconds
Third run: count = 100000000, 5.868 seconds
Fourth run: count = 100000000, 3.403 seconds
Fifth run: count = 100000000, 3.814 seconds
5. Test: 10 small files (~180 MB each), 100 million records total, table populated via insert into
Load data into a new table from the first table (Hive gathers statistics automatically during the write), then test. The first attempt uses a CTAS:
create table hyl_test_par_auto as select * from hyl_test_par distribute by rand(123);
0: jdbc:hive2://cluster09.hzhz.co:10000> create table hyl_test_par_auto as select * from hyl_test_par distribute by rand(123);
INFO : Session is already open
INFO : Dag name: create table hyl_test_par_auto a...rand(123)(Stage-1)
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)
INFO : Map 1: -/- Reducer 2: 0/10
INFO : Map 1: 0/119 Reducer 2: 0/10
. . .
INFO : Moving data to directory hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto from hdfs://myBigdata/apps/hive/warehouse/.hive-staging_hive_2017-02-09_13-05-59_036_2906983779886430780-1/-ext-10001
INFO : Table default.hyl_test_par_auto stats: [numFiles=10, numRows=100000000, totalSize=3327777800, rawDataSize=3227777800]
[hdfs@cluster13 tmp]$ hadoop fs -ls -h hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003
Found 10 items
-rwxrwxrwx 2 hive hdfs 183.9 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000000_0
-rwxrwxrwx 2 hive hdfs 183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000001_0
-rwxrwxrwx 2 hive hdfs 183.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000002_0
-rwxrwxrwx 2 hive hdfs 184.6 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000003_0
-rwxrwxrwx 2 hive hdfs 183.7 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000004_0
-rwxrwxrwx 2 hive hdfs 183.3 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000005_0
-rwxrwxrwx 2 hive hdfs 184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000006_0
-rwxrwxrwx 2 hive hdfs 184.0 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000007_0
-rwxrwxrwx 2 hive hdfs 184.2 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000008_0
-rwxrwxrwx 2 hive hdfs 183.4 M 2017-02-09 13:28 hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/000009_0
Test query: select count(*) from hyl_test_par_auto where name <> ' ';

First run: count = 100000000, 14.653 seconds
Second run: count = 100000000, 13.989 seconds
Third run: count = 100000000, 9.236 seconds
Next, drop the CTAS version and rebuild the table as a partitioned one, loading it with insert into instead:

drop table hyl_test_par_auto;
create table hyl_test_par_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;
insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);
INFO : Session is already open
INFO : Dag name: insert into table hyl_test_par_a...rand(123)(Stage-1)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1486351392526_0022)
INFO : Map 1: 0/119 Reducer 2: 0/10
. . .
INFO : Loading data to table default.hyl_test_par_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_13-26-47_915_3059621581501435386-1/-ext-10000
INFO : Partition default.hyl_test_par_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=10, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]
No rows affected (135.325 seconds)
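The insert produces ten output files of roughly equal size because the job ran with ten reducers and distribute by rand(123) spreads rows evenly across them. As a hedged sketch, the file count can be steered directly by fixing the reducer count, the same trick the single-file test in section 7 uses with mapred.reduce.tasks=1 (the value 20 below is only illustrative):

set mapred.reduce.tasks=20;
insert overwrite table hyl_test_par_auto partition(sys_sj=20170209,sys_type=2003)
select id,name from hyl_test_par distribute by rand(123);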
Test query: select count(*) from hyl_test_par_auto where name <> ' ';

First run: count = 100000000, 11.303 seconds
Second run: count = 100000000, 11.56 seconds
Third run: count = 100000000, 3.446 seconds
Fourth run: count = 100000000, 3.643 seconds
As before, an unfiltered count(*) is answered from the statistics gathered during the insert:

select count(*) from hyl_test_par_auto;
count = 100000000, 1 row selected (0.098 seconds)
=======================================================================
Afternoon test:

First run: count = 100000000, 11.282 seconds
Second run: count = 100000000, 3.414 seconds
Third run: count = 100000000, 6.452 seconds
Fourth run: count = 100000000, 6.24 seconds
6. Summary
| Time (s) \ setup | 10 × 180 MB, none | 10 × 180 MB, analyze | 1 × 2 GB, none | 10 × 180 MB, insert into |
| --- | --- | --- | --- | --- |
| Run 1 | 11.274 | 10.923 | 11.933 | 11.282 |
| Run 2 | 11.525 | 6.058 | 4.435 | 3.414 |
| Run 3 | 11.11 | 6.45 | 5.868 | 6.452 |
| Run 4 | 11.722 | 6.218 | 3.403 | 6.24 |
| Average | 11.408 | 7.412 | 6.410 | 6.847 |
| Warm average (excluding run 1) | 11.452 | 6.242 | 4.569 | 5.369 |
7. Looking for further optimizations
A single-file table populated via insert into (forcing one reducer):
set mapred.reduce.tasks=1;
create table hyl_test_par_big_auto(id int,name string) partitioned by(sys_sj string,sys_type string) row format delimited fields terminated by ',' stored as textfile;
insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);

0: jdbc:hive2://cluster09.hzhz.co:10000> insert into table hyl_test_par_big_auto partition(sys_sj=20170209,sys_type=2003) select id,name from hyl_test_par distribute by rand(123);
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: insert into table hyl_test_par_b...rand(123)(Stage-1)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1486351392526_0023)
INFO : Map 1: -/- Reducer 2: 0/1
INFO : Map 1: 0/119 Reducer 2: 0/1
. . .
INFO : Loading data to table default.hyl_test_par_big_auto partition (sys_sj=20170209, sys_type=2003) from hdfs://myBigdata/apps/hive/warehouse/hyl_test_par_big_auto/sys_sj=20170209/sys_type=2003/.hive-staging_hive_2017-02-09_14-30-05_480_3967529948260649900-1/-ext-10000
INFO : Partition default.hyl_test_par_big_auto{sys_sj=20170209, sys_type=2003} stats: [numFiles=1, numRows=100000000, totalSize=1927777800, rawDataSize=1827777800]
No rows affected (104.637 seconds)
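Forcing mapred.reduce.tasks=1 funnels all 100 million rows through a single reducer just to end up with one output file. A hedged alternative on Hive-on-Tez is to keep the parallel reducers and let Hive merge small output files after the job; the parameter names below are the standard ones, the sizes are only illustrative:

set hive.merge.tezfiles=true;                  -- merge the output files of Tez jobs
set hive.merge.smallfiles.avgsize=256000000;   -- trigger a merge when the average output file is smaller than this
set hive.merge.size.per.task=1024000000;       -- target size of the merged files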
Test query: select count(*) from hyl_test_par_big_auto where name <> ' ';

First run: count = 100000000, 11.744 seconds
Second run: count = 100000000, 4.188 seconds
Third run: count = 100000000, 4.041 seconds
Fourth run: count = 100000000, 5.198 seconds
Fifth run: count = 100000000, 3.788 seconds
| Time (s) \ setup | 10 × 180 MB, none | 10 × 180 MB, analyze | 1 × 2 GB, none | 10 × 180 MB, insert into | 1 × 2 GB, insert into |
| --- | --- | --- | --- | --- | --- |
| Run 1 | 11.274 | 10.923 | 11.933 | 11.282 | 11.744 |
| Run 2 | 11.525 | 6.058 | 4.435 | 3.414 | 4.188 |
| Run 3 | 11.11 | 6.45 | 5.868 | 6.452 | 4.041 |
| Run 4 | 11.722 | 6.218 | 3.403 | 6.24 | 5.198 |
| Average | 11.408 | 7.412 | 6.410 | 6.847 | 6.293 |
| Warm average (excluding run 1) | 11.452 | 6.242 | 4.569 | 5.369 | 4.476 |
Optimization: column-level statistics
analyze table hyl_test_par_ana compute statistics for columns;
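Column statistics mainly feed the cost-based optimizer (join ordering, filter selectivity) rather than a simple filtered count, so little change is expected for this particular query. A minimal sketch of the settings usually paired with column statistics, assuming a Hive version where they are available:

set hive.cbo.enable=true;                    -- use the cost-based optimizer
set hive.stats.fetch.column.stats=true;      -- let the optimizer read column stats from the metastore
set hive.stats.fetch.partition.stats=true;   -- and partition-level stats
-- Column stats can also be gathered per partition:
analyze table hyl_test_par_ana partition(sys_sj='20170209', sys_type='2003') compute statistics for columns;

The filtered count from before, re-run after gathering column statistics: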
select count(*) from hyl_test_par_ana where name <> ' ';
First run: count = 100000000, 11.519 seconds
Second run: count = 100000000, 7.688 seconds
Third run: count = 100000000, 6.456 seconds
Fourth run: count = 100000000, 5.651 seconds
Fifth run: count = 100000000, 3.413 seconds
Sixth run: count = 100000000, 3.62 seconds
Vectorized query execution: rows are processed in batches of 1024 at a time instead of row by row:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
select count(*) from hyl_test_par where name <> ' ';
First run: count = 100000000, 11.54 seconds
Second run: count = 100000000, 3.859 seconds
Third run: count = 100000000, 4.134 seconds
Fourth run: count = 100000000, 11.844 seconds
Fifth run: count = 100000000, 11.089 seconds
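The vectorized runs show no consistent improvement here, which is plausible: in this generation of Hive, vectorized execution effectively requires a columnar input format such as ORC, and these tables are plain TEXTFILE. A hedged sketch of how it would normally be exercised (the table name hyl_test_par_orc is hypothetical, not part of this test):

-- Copy the data into an ORC table so vectorization can actually apply
-- (hyl_test_par_orc is a hypothetical table name):
create table hyl_test_par_orc stored as orc as select * from hyl_test_par;
set hive.vectorized.execution.enabled = true;
select count(*) from hyl_test_par_orc where name <> ' ';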