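Hive in Practice
1. Installing Hive
2. Hive in practice
3. Hive storage model
4. A closer look at the HQL query language
5. References and code download
<1>. Installing Hive
Download Hive from http://mirror.bjtu.edu.cn/apache//hive/ and extract the archive: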
xuqiang@ubuntu:~/hadoop/src/hive$ tar zxvf hive-0.7.0-bin.tar.gz
Set the HIVE_HOME environment variable:
xuqiang@ubuntu:~/hadoop/src/hive$ cd hive-0.7.0-bin/
xuqiang@ubuntu:~/hadoop/src/hive/hive-0.7.0-bin$ export HIVE_HOME=`pwd`
Add $HIVE_HOME/bin to the PATH environment variable:
xuqiang@ubuntu:~/hadoop/src/hive$ export PATH=$HIVE_HOME/bin:$PATH;
Before running Hive, make sure the HADOOP_HOME variable is set. If it is not, set it with the export command.
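The check described above can be sketched in a few lines of shell. This is only a minimal sketch: the fallback path is an assumption taken from the Hadoop location that appears in the job log later in this post, and it should be adjusted to match your own installation.

```shell
# Keep HADOOP_HOME if it is already set; otherwise fall back to a default.
# The fallback path is an assumption and must match your own install.
: "${HADOOP_HOME:=$HOME/hadoop/src/hadoop-0.20.2}"
export HADOOP_HOME

# Warn (but do not abort) if the hadoop launcher is not where we expect it.
if [ ! -x "$HADOOP_HOME/bin/hadoop" ]; then
  echo "warning: no hadoop executable under $HADOOP_HOME" >&2
fi

echo "HADOOP_HOME=$HADOOP_HOME"
```

Putting these lines in ~/.bashrc would make the setting survive new shell sessions, since a plain export only lasts for the current one.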
Next, create the following directories on HDFS to hold Hive's data:
xuqiang@ubuntu:~/hadoop/src/hive$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
xuqiang@ubuntu:~/hadoop/src/hive$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
xuqiang@ubuntu:~/hadoop/src/hive$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
xuqiang@ubuntu:~/hadoop/src/hive$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
The environment for running Hive is now ready. Start Hive by entering the following command:
xuqiang@ubuntu:~/hadoop/src/hive/hive-0.7.0-bin$ $HIVE_HOME/bin/hive
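<2>. Hive in practice
In this section we walk through a complete workflow: create a table, load data into it from the local machine, then query the table for the data we are interested in.
First, create the table (the syntax is covered later):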
hive> create table cite(citing INT, cited INT)
> row format delimited
> fields terminated by ','
> stored as textfile;
Once the table is created, the show tables command lists it:
hive> show tables;
OK
cite
Time taken: 1.257 seconds
Inspect the structure of the new table:
hive> describe cite;
OK
citing int
cited int
Time taken: 0.625 seconds
Load the local data file into the table:
hive> load data local inpath '/home/xuqiang/hadoop/data/cite75_99.txt'
> overwrite into table cite;
Copying data from file:/home/xuqiang/hadoop/data/cite75_99.txt
Copying file: file:/home/xuqiang/hadoop/data/cite75_99.txt
Loading data to table default.cite
Deleted hdfs://localhost:9000/user/hive/warehouse/cite
OK
Time taken: 89.766 seconds
Query the first 10 rows:
hive> select * from cite limit 10;
OK
NULL NULL
3858241 956203
3858241 1324234
3858241 3398406
3858241 3557384
3858241 3634889
3858242 1515701
3858242 3319261
3858242 3668705
3858242 3707004
Time taken: 0.778 seconds
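The NULL NULL row at the top of the result is most likely the header line of the data file: text such as "CITING","CITED" cannot be parsed as INT, so Hive returns NULL for both columns. If that row is unwanted, one option is to strip the first line before loading. The sketch below uses a small stand-in file (cite_sample.txt is a hypothetical name, and the header text is an assumption about cite75_99.txt) so that it can be run without the real data set.

```shell
# Build a tiny stand-in for cite75_99.txt: a header line plus two data rows.
workdir=$(mktemp -d)
src="$workdir/cite_sample.txt"
printf '%s\n' '"CITING","CITED"' '3858241,956203' '3858241,1324234' > "$src"

# Drop the first line so that only numeric rows remain.
clean="$workdir/cite_sample_noheader.txt"
tail -n +2 "$src" > "$clean"

rows=$(wc -l < "$clean" | tr -d ' ')
echo "data rows: $rows"
head -n 1 "$clean"
```

Loading the cleaned file instead of the original would also reduce the row count reported by count(1) by one.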
Count how many rows the table contains. This time Hive runs a MapReduce job to compute the value:
hive> select count(1) from cite;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201106150005_0004, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201106150005_0004
Kill Command = /home/xuqiang/hadoop/src/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201106150005_0004
2011-06-15 05:33:20,724 Stage-1 map = 0%, reduce = 0%
2011-06-15 05:33:46,325 Stage-1 map = 2%, reduce = 0%
2011-06-15 05:33:49,827 Stage-1 map = 3%, reduce = 0%
2011-06-15 05:33:53,208 Stage-1 map = 4%, reduce = 0%
2011-06-15 05:33:55,259 Stage-1 map = 7%, reduce = 0%
2011-06-15 05:34:40,450 Stage-1 map = 9%, reduce = 0%
2011-06-15 05:34:52,706 Stage-1 map = 48%, reduce = 0%
2011-06-15 05:34:57,961 Stage-1 map = 50%, reduce = 0%
2011-06-15 05:35:28,420 Stage-1 map = 50%, reduce = 17%
2011-06-15 05:35:36,653 Stage-1 map = 58%, reduce = 17%
2011-06-15 05:35:40,844 Stage-1 map = 61%, reduce = 17%
2011-06-15 05:35:49,131 Stage-1 map = 62%, reduce = 17%
2011-06-15 05:35:56,428 Stage-1 map = 67%, reduce = 17%
2011-06-15 05:36:34,380 Stage-1 map = 90%, reduce = 17%
2011-06-15 05:36:52,601 Stage-1 map = 100%, reduce = 17%
2011-06-15 05:37:10,299 Stage-1 map = 100%, reduce = 67%
2011-06-15 05:37:16,471 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201106150005_0004
OK
16522439
Time taken: 274.531 seconds
Finally, drop the table we just created:
hive> drop table cite;
OK
Time taken: 5.724 seconds
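<3>. Storage model
By default, Hive stores table data on Hadoop under the /user/hive/warehouse directory. Relational databases use indexes to speed up queries, whereas Hive relies on the concept of partition columns. For example, if a table has a column named state, the table can be split into 50 partitions according to the values stored in state. If there is a date column, the table is usually partitioned by date. Queries that filter on a partition column tend to be faster because the data for each partition is stored in its own directory, so Hive only needs to read the directories that match. For the state and date columns above, when the table is partitioned by state and then date, each partition lives under the table's directory in a subdirectory of the form state=<value>/date=<value>.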
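To make the layout concrete, the following sketch recreates it on the local filesystem rather than on HDFS. It is a simulation only: the table name visits, the sample values, and the part-00000 file names are all hypothetical, and a temporary directory stands in for /user/hive/warehouse.

```shell
# Local simulation of the directory layout for a table partitioned by
# (state, date). Nothing here talks to Hive or HDFS.
set -eu

warehouse=$(mktemp -d)            # stands in for /user/hive/warehouse
table="$warehouse/visits"         # hypothetical table name

for state in CA NY; do
  for day in 2011-06-14 2011-06-15; do
    dir="$table/state=$state/date=$day"
    mkdir -p "$dir"
    # One data file per partition, one record per line.
    printf '%s,%s,record\n' "$state" "$day" > "$dir/part-00000"
  done
done

# A query restricted to state = 'CA' only needs the files in this subtree.
ca_files=$(find "$table/state=CA" -type f | wc -l | tr -d ' ')
all_files=$(find "$table" -type f | wc -l | tr -d ' ')
echo "files read for state=CA: $ca_files of $all_files"
```

With two states and two dates, the filter on state touches half of the files. With 50 states the saving is proportionally larger, which is the intuition behind partition columns.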
<4>. A closer look at HQL
We will analyze HQL syntax by working through actual HQL statements.
<5>. References and code download
http://wiki.apache.org/hadoop/Hive/GettingStarted