Hive Indexes
Indexing is a standard database technique; Hive has supported indexes since version 0.7.
The goal of a Hive index is to speed up queries against specified columns of a Hive table.
Without an index, a query such as 'WHERE tab1.col1 = 10' forces Hive to load the entire table or partition and process every row; if an index exists on column col1, only a portion of the file needs to be loaded and processed.
As in other traditional databases, an index speeds up queries at the cost of extra resources to build it and extra disk space to store it.
Hive provides only limited indexing. Unlike traditional relational databases, it has no concept of a "key"; users can create indexes on certain columns to speed up some operations, and the index data for a table is stored in a separate table. Hive's indexing feature is still relatively immature and offers few options, but it is designed around built-in, pluggable Java code, so users can extend it to meet their own needs. Of course, not every query benefits from a Hive index. You can use the EXPLAIN syntax to analyze whether a HiveQL statement can use an index to improve query performance. As with indexes in an RDBMS, you need to evaluate whether creating one is worthwhile: indexes take extra disk space, and building and maintaining them has a cost. You must weigh the benefits an index brings against that cost.
Common index usage
The following shows common index operations:
A. Create/build, show, and drop an index
create index table01_index on table table01(column2) as 'COMPACT' with deferred rebuild;
show index on table01;
drop index table01_index on table01;
B. Create then build, show formatted, and drop an index
create index table02_index on table table02(column3) as 'compact' with deferred rebuild;
alter index table02_index on table02 rebuild;
show formatted index on table02;
drop index table02_index on table02;
C. Create a bitmap index, then build, show, and drop it
create index table03_index on table table03 (column4) as 'bitmap' with deferred rebuild;
alter index table03_index on table03 rebuild;
show formatted index on table03;
drop index table03_index on table03;
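To give an intuition for the bitmap index created in section C, here is a toy Python sketch (not Hive's actual implementation): for each distinct value of the indexed column we keep a bitmap with one bit per row, where bit i is set iff row i holds that value. The data and names are made up for illustration.

```python
# Toy bitmap-index sketch: one integer-backed bitmap per distinct value.
# The sample column data below is hypothetical.
rows = ["red", "blue", "red", "green", "blue", "red"]

bitmaps = {}
for i, v in enumerate(rows):
    bitmaps[v] = bitmaps.get(v, 0) | (1 << i)  # set bit i for value v

def matching_rows(bitmap):
    """Decode a bitmap back into the matching row positions."""
    return [i for i in range(len(rows)) if bitmap >> i & 1]

print(matching_rows(bitmaps["red"]))                     # -> [0, 2, 5]
print(matching_rows(bitmaps["red"] | bitmaps["green"]))  # red OR green -> [0, 2, 3, 5]
```

Bitwise AND/OR of bitmaps answers multi-predicate filters without touching the row data, which is why bitmap indexes suit low-cardinality columns.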
D. Create an index whose data is stored in a new table
create index table04_index on table table04 (column5) as 'compact' with deferred rebuild in table table04_index_table;
E. Create an index stored as RCFile
create index table05_index on table table05 (column6) as 'compact' with deferred rebuild stored as rcfile;
F. Create an index stored as TextFile
create index table06_index on table table06 (column7) as 'compact' with deferred rebuild row format delimited fields terminated by '\t' stored as textfile;
G. Create an index with index properties
create index table07_index on table table07 (column8) as 'compact' with deferred rebuild idxproperties("prop1"="value1", "prop2"="value2");
H. Create an index with table properties
create index table08_index on table table08 (column9) as 'compact' with deferred rebuild tblproperties("prop3"="value3", "prop4"="value4");
I. Drop the index if it exists
drop index if exists table09_index on table09;
J. Rebuild the index on a partition
alter index table10_index on table10 partition (columnx='valueq', columny='valuer') rebuild;
Now let's walk through creating an index.
1. First, create the table:
hive> create table user(id int, name string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
2. Load the data:
hive> load data local inpath '/export1/tmp/wyp/row.txt'
    > overwrite into table user;
3. Test a query before creating the index:
hive> select * from user where id = 500000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 356888890) is larger than
hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_1384246387966_0247, Tracking URL =
http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
Kill Command = /home/q/hadoop/bin/hadoop job -kill job_1384246387966_0247
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2013-11-13 15:09:53,336 Stage-1 map = 0%, reduce = 0%
2013-11-13 15:09:59,500 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.0 sec
2013-11-13 15:10:00,531 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.63 sec
2013-11-13 15:10:01,560 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.63 sec
MapReduce Total cumulative CPU time: 5 seconds 630 msec
Ended Job = job_1384246387966_0247
MapReduce Jobs Launched:
Job 0: Map: 2   Cumulative CPU: 5.63 sec   HDFS Read: 361084006 HDFS Write: 357 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 630 msec
OK
500000  wyp.
Time taken: 14.107 seconds, Fetched: 1 row(s)
The whole query took 14.107 s.
4. Create an index on user:
hive> create index user_index on table user(id)
    > as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > with deferred rebuild
    > IN TABLE user_index_table;
hive> alter index user_index on user rebuild;
hive> select * from user_index_table limit 5;
0       hdfs://mycluster/user/hive/warehouse/table02/000000_0   [0]
1       hdfs://mycluster/user/hive/warehouse/table02/000000_0   [352]
2       hdfs://mycluster/user/hive/warehouse/table02/000000_0   [704]
3       hdfs://mycluster/user/hive/warehouse/table02/000000_0   [1056]
4       hdfs://mycluster/user/hive/warehouse/table02/000000_0   [1408]
Time taken: 0.244 seconds, Fetched: 5 row(s)
This creates an index on the user table.
5. Run the same query on user after creating the index:
hive> select * from user where id = 500000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 356888890) is larger than
hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_1384246387966_0247, Tracking URL =
http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
Kill Command = /home/q/hadoop/bin/hadoop job -kill job_1384246387966_0247
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2013-11-13 15:23:12,336 Stage-1 map = 0%, reduce = 0%
2013-11-13 15:23:53,240 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.0 sec
2013-11-13 15:24:00,253 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.27 sec
2013-11-13 15:24:01,650 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.27 sec
MapReduce Total cumulative CPU time: 5 seconds 630 msec
Ended Job = job_1384246387966_0247
MapReduce Jobs Launched:
Job 0: Map: 2   Cumulative CPU: 5.63 sec   HDFS Read: 361084006 HDFS Write: 357 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 630 msec
OK
500000  wyp.
Time taken: 13.042 seconds, Fetched: 1 row(s)
The query took 13.042 s, which is about the same as without the index.
Creating indexes in Hive also has a known bug: if the table's schema information comes from a SerDe, Hive cannot create the index:
hive> CREATE INDEX employees_index
    > ON TABLE employees (country)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > WITH DEFERRED REBUILD
    > IDXPROPERTIES ('creator' = 'me', 'created_at' = 'some_time')
    > IN TABLE employees_index_table
    > COMMENT 'Employees indexed by country and name.';
FAILED: Error in metadata: java.lang.RuntimeException: \
Check the index columns, they should appear in the table being indexed.
FAILED: Execution Error, return code 1 from \
org.apache.hadoop.hive.ql.exec.DDLTask
This bug affects Hive 0.10.0, 0.10.1, and 0.11.0 and was fixed in Hive 0.12.0; see https://issues.apache.org/jira/browse/HIVE-4251 for details.
Summary: an index table basically contains these columns: 1. the indexed column(s) from the source table; 2. _bucketname, the path of the data file in HDFS; 3. the byte offsets of the indexed column's rows within that HDFS file. The principle is that by recording these offsets, a query can fetch the matching data precisely and avoid a full table scan.
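The offset mechanism described above can be illustrated with a minimal Python sketch. This is not Hive's actual code; the sample file and 100-row dataset are made up. It builds an in-memory stand-in for the compact index table (value -> file path and byte offsets), then answers a lookup by seeking directly to the recorded offsets instead of scanning the whole file:

```python
import os
import tempfile

# Build a tiny tab-delimited data file standing in for a table's HDFS file.
rows = [f"{i}\tuser{i}" for i in range(100)]
path = os.path.join(tempfile.mkdtemp(), "000000_0")
with open(path, "w") as f:
    f.write("\n".join(rows) + "\n")

# Build the "index table": indexed value -> (file path, byte offsets of rows).
index = {}
offset = 0
for line in rows:
    key = line.split("\t")[0]
    index.setdefault(key, (path, []))[1].append(offset)
    offset += len(line) + 1  # +1 for the newline (all-ASCII data assumed)

def lookup(value):
    """Seek straight to the indexed offsets instead of scanning the file."""
    if value not in index:
        return []  # value absent from the index: nothing to read at all
    fpath, offsets = index[value]
    out = []
    with open(fpath, "rb") as f:
        for off in offsets:
            f.seek(off)  # jump directly to the recorded byte offset
            out.append(f.readline().decode().rstrip("\n"))
    return out

print(lookup("42"))  # -> ['42\tuser42']
```

Hive's real compact index records offsets of file blocks rather than individual rows, but the principle is the same: I/O is limited to the regions the index points at.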
Index test
(1) Count the rows in the table
hive (hive)> select count(*) from userbook;
4409365
(2) Query before the index is created
hive (hive)> select * from userbook where book_id = '15999998838';
Query ID = hadoop_20150627165551_595da79a-0e27-453b-9142-7734912934c4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1435392961740_0012, Tracking URL = http://gpmaster:8088/proxy/application_1435392961740_0012/
Kill Command = /home/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1435392961740_0012
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2015-06-27 16:56:04,666 Stage-1 map = 0%, reduce = 0%
2015-06-27 16:56:28,974 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 4.36 sec
2015-06-27 16:56:31,123 Stage-1 map = 78%, reduce = 0%, Cumulative CPU 6.21 sec
2015-06-27 16:56:34,698 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.37 sec
MapReduce Total cumulative CPU time: 7 seconds 370 msec
Ended Job = job_1435392961740_0012
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2   Cumulative CPU: 7.37 sec   HDFS Read: 348355875 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 370 msec
OK
userbook.book_id userbook.book_name userbook.author userbook.public_date userbook.address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken: 45.678 seconds, Fetched: 1 row(s)
(3) Create the index
hive (hive)> create index userbook_bookid_idx on table userbook(book_id) as 'COMPACT' WITH DEFERRED REBUILD;
(4) Run the query again after creating the index
hive (hive)> select * from userbook where book_id = '15999998838';
Query ID = hadoop_20150627170019_5bb5514a-4c8e-4c47-9347-ed0657e1f2ff
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1435392961740_0013, Tracking URL = http://gpmaster:8088/proxy/application_1435392961740_0013/
Kill Command = /home/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1435392961740_0013
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2015-06-27 17:00:30,429 Stage-1 map = 0%, reduce = 0%
2015-06-27 17:00:54,003 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 7.43 sec
2015-06-27 17:00:56,181 Stage-1 map = 78%, reduce = 0%, Cumulative CPU 9.66 sec
2015-06-27 17:00:58,417 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.83 sec
MapReduce Total cumulative CPU time: 10 seconds 830 msec
Ended Job = job_1435392961740_0013
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2   Cumulative CPU: 10.83 sec   HDFS Read: 348356271 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 830 msec
OK
userbook.book_id userbook.book_name userbook.author userbook.public_date userbook.address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken: 40.549 seconds, Fetched: 1 row(s)
As you can see, the query is somewhat faster after the index is created.
In fact, for a simple query like this, Hive can skip launching a Map/Reduce job entirely and start a fetch task instead, filtering the required data directly from the HDFS file. This requires the following setting:
set hive.fetch.task.conversion=more;
hive (hive)> select * from userbook where book_id = '15999998838';
OK
userbook.book_id userbook.book_name userbook.author userbook.public_date userbook.address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken: 0.093 seconds, Fetched: 1 row(s)
As you can see, this is much faster still: skipping MR job startup entirely improves execution efficiency considerably.
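What a fetch task does can be approximated in a few lines of Python: the client simply streams the table's file and applies the WHERE predicate itself, with no job scheduling at all. This is a rough sketch, not Hive's implementation, and the two sample rows mimicking the userbook file are hypothetical:

```python
import io

# Hypothetical sample data standing in for the userbook table's file
# (tab-delimited, book_id in the first column).
data = io.StringIO(
    "15999998837\tbookA\tauthA\n"
    "15999998838\tuviWfFJ\tKwCrDOA\n"
)

def fetch_filter(stream, book_id):
    """Stream the file client-side and keep only matching rows --
    roughly what a fetch task does instead of a MapReduce scan."""
    return [ln.rstrip("\n") for ln in stream
            if ln.split("\t")[0] == book_id]

print(fetch_filter(data, "15999998838"))  # -> ['15999998838\tuviWfFJ\tKwCrDOA']
```

The work is the same full scan a single mapper would do, but for small inputs the savings come from avoiding MR job submission and container startup, which is exactly what the 45 s vs 0.093 s gap above reflects.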