Hive Study Notes

Source: htsjk.Com, 2019-11-13 23:04

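1. Differences among the sorting clauses

order by: performs a global sort of the query result. In strict mode (set hive.mapred.mode=strict), an order by statement must also include a limit clause, because order by runs on a single reducer and a large result set would take a very long time to sort.

sort by: sorts the data inside each reducer, i.e. a local sort. Each reducer's output is ordered, which makes a later global sort cheaper, but there is no global ordering across reducers.

distribute by: controls how map output is split among the reducers. Rows are distributed by the given key, and rows with the same key are guaranteed to reach the same reducer. Hive distributes rows according to the distribute by columns and the number of reducers, using a hash by default.

cluster by: equivalent to distribute by ... sort by on the same columns. cluster by always sorts in ascending order; a sort direction cannot be specified.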
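The difference between the clauses can be modelled outside Hive. The following is a minimal Python sketch, not Hive's implementation: the reducer count, the sample rows and the use of crc32 as the hash are all assumptions made for illustration.

```python
import zlib

def distribute_by(rows, num_reducers):
    """distribute by key: route each (key, value) row to a reducer by hashing its key."""
    reducers = [[] for _ in range(num_reducers)]
    for key, value in rows:
        idx = zlib.crc32(key.encode("utf-8")) % num_reducers  # assumed hash function
        reducers[idx].append((key, value))
    return reducers

def order_by(rows):
    """order by: one reducer sorts the whole result, giving a global order."""
    return sorted(rows)

def cluster_by(rows, num_reducers):
    """cluster by key: distribute by key, then sort ascending inside each reducer."""
    return [sorted(part) for part in distribute_by(rows, num_reducers)]

rows = [("b", 2), ("a", 1), ("c", 3), ("a", 4), ("b", 5)]
for i, part in enumerate(cluster_by(rows, 2)):
    print("reducer", i, part)   # each list is sorted, but there is no order across lists
print("global", order_by(rows))
```

Each reducer's list is sorted and holds all rows of the keys routed to it, while only order_by yields a single totally ordered list.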

2. Type conversion function

cast(value as type). Example: select name, salary from table1 where cast(salary as float) < 1000.0

3. Creating (temporary) tables

(1) Creating tables

-- Example 1: create a simple table

create table pokes (foo int, bar string);

-- Example 2: create an external table

create external table page_view(viewtime int, userid bigint,

page_url string, referrer_url string,

ip string comment 'ip address of the user',

country string comment 'country of origination')

comment 'this is the staging page view table'

row format delimited fields terminated by '\054'

stored as textfile

location '';

-- Example 3: create an external partitioned table

create external table `default`.`map_client_detail_basemap_reconmmend_online`(

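`act_name` string comment 'key value' sample 'bmappoi.card.caterrecom1.title.click',

-- the map type parameters were lost in the source; <string,string> is an assumption
`act_params` map<string,string> comment 'parameter list' sample '',

`dim1` string comment 'reserved field 1' sample '',

`dim2` string comment 'reserved field 2' sample '')

comment 'online log of user clicks on nearby recommendations on the detail page'

partitioned by (

`event_day` string comment 'daily partition' sample '20180416')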

row format serde

'com.baidu.udw.storage.fileformat.orc.orcserde'

stored as

inputformat

'com.baidu.udw.storage.mapred.orcinputformat'

outputformat

'com.baidu.udw.storage.mapred.orcoutputformat'

location

'afs://xingtian.afs.baidu.com:9902/app/insight/map-client/map_client_detail_basemap_reconmmend_online'

tblproperties (

'bigdata_visible'='1'

,'data_management'='udw')

-- Example 4: create a bucketed table (buckets divide the data at a finer granularity than partitions)

create table par_table(viewtime int, userid bigint,

page_url string, referrer_url string,

ip string comment 'ip address of the user')

comment 'this is the page view table'

partitioned by(date string, pos string)

clustered by(userid) sorted by(viewtime) into 4 buckets -- the CLUSTERED BY clause names the bucketing column(s) and the number of buckets

row format delimited fields terminated by '\t'

lines terminated by '\n'

stored as sequencefile;

-- Query the bucketed table

select * from bucketed_user tablesample(bucket 3 out of 4 on id);

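tablesample makes the query run against a subset of the buckets instead of the whole data set. Bucketing puts all rows for the same user id into the same bucket (one bucket may also hold rows for many user ids), so looking up all the data for one specific id only needs to scan a much smaller range.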
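How bucketing narrows a lookup can be sketched in a few lines of Python. This is an illustrative model only: it assumes an integer id column whose bucket is simply id modulo the bucket count, and the sample rows are made up.

```python
def bucket_of(user_id, num_buckets):
    # Assumption: the bucket is the integer id modulo the number of buckets.
    return user_id % num_buckets

def build_buckets(rows, num_buckets):
    """clustered by(id) into num_buckets buckets: group rows by their bucket index."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row["id"], num_buckets)].append(row)
    return buckets

def tablesample(buckets, x):
    """tablesample(bucket x out of n on id): x is 1-based, n is len(buckets)."""
    return buckets[x - 1]

rows = [{"id": i, "name": "user%d" % i} for i in range(10)]
buckets = build_buckets(rows, 4)
sample = tablesample(buckets, 3)
print([r["id"] for r in sample])  # only the ids that fall into the third bucket
```

A query for one id only has to read the single bucket that id maps to, which is the saving the paragraph above describes.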

(2) Creating temporary tables

session = database 'session:/';

use session;

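Hive provides complex data types:

Structs: fields inside a struct are accessed with dot (.) notation. For example, if column c has type STRUCT{a INT; b INT}, field a is accessed as c.a.

Maps (key-value pairs): an entry is accessed with ['key name']. For example, if map M contains the entry 'group' -> gid, the gid value is obtained with M['group'].

Arrays: all elements of an array have the same type. For example, if array A holds ['a','b','c'], then A[1] is 'b'.

-- 1) Ordinary column types: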

create table tmp_1(os string, act string, pv int) row format delimited fields terminated by '\t' lines terminated by '\n';

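-- 2) struct type (the field list inside struct<...> was lost in the source; a:int, b:int is an assumed placeholder):

create table tmp_2(os string, info struct<a:int, b:int>) row format delimited fields terminated by ',' collection items terminated by ':';

-- FIELDS TERMINATED BY: the separator between fields

-- COLLECTION ITEMS TERMINATED BY: the separator between the items within one field

-- 3) array type (the element type was lost in the source; string is assumed):

create table tmp_3(name string, student_id_list array<string>) row format delimited fields terminated by '\t' collection items terminated by ',';

-- 4) map type (the key and value types were lost in the source; string, string is assumed):

create table tmp_4(id string, pref map<string,string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':';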

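4. Using a few special functions

1) concat_ws(separator, str1, str2,...)

This is a special form of concat(). The first argument is the separator placed between the remaining arguments; it can be any string, like the other arguments. If the separator is null, the result is null. The function skips any null values after the separator argument, but it does not skip empty strings. Simple examples: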

mysql> select concat_ws(",","first name","second name","last name");

-> 'first name,second name,last name'

mysql> select concat_ws(",","first name",null,"last name");

-> 'first name,last name'

mysql> select concat_ws(',','1','2','3','4','5','6') from dual;

-> '1,2,3,4,5,6'
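The null handling described above can be written down as a tiny Python model. It is a sketch of the semantics, with None standing in for SQL null, not the database's own code.

```python
def concat_ws(separator, *args):
    """Join the non-null arguments with the separator; a null separator gives null."""
    if separator is None:
        return None
    # Null arguments are skipped; empty strings are kept and still get a separator.
    return separator.join(a for a in args if a is not None)

print(concat_ws(",", "first name", None, "last name"))
print(concat_ws(",", "a", "", "b"))
print(concat_ws(None, "a", "b"))
```

The second call shows that an empty string still occupies a slot between two separators, unlike a null.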

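2) Hive's string splitting function split(str, regex) splits a string into an array; the separator is a regular expression.

a. Basic usage:

Example 1: split('a,b,c,d',',') returns ["a","b","c","d"]

b. Picking one value out of the string: the result array can be indexed directly

Example 2: split('a,b,c,d',',')[0] returns a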
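Because the second argument is a regular expression, separators such as '.' or '|' have to be escaped. A rough Python analogue using re.split illustrates the pitfall; the helper name hive_split is hypothetical, and Python's regex engine differs from Java's in details such as capture groups and empty-string handling.

```python
import re

def hive_split(s, regex):
    """Rough analogue of split(str, regex): the separator is a regular expression."""
    return re.split(regex, s)

print(hive_split("a,b,c,d", ","))        # plain separator
print(hive_split("a,b,c,d", ",")[0])     # index into the resulting array
print(hive_split("1.2.3", r"\."))        # '.' must be escaped to mean a literal dot
print(hive_split("1.2.3", "."))          # unescaped '.' matches every character
```

The last call returns only empty strings, because every character of the input is consumed as a separator.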

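3) The table-generating function explode turns one row into multiple rows

explode(ARRAY): generates one row for each element of the array

explode(MAP): generates one row for each key-value pair of the map, with the key in one column and the value in another

explode is one of Hive's built-in table-generating functions (UDTF). It addresses the 1-to-n problem: it splits one input row into several output rows, for example one row per array element, and emits them as a virtual table. Points to note:

• a UDTF cannot be mixed with other, non-UDTF expressions in the select list,

• UDTFs cannot be nested,

• group by / cluster by / distribute by / sort by are not supported,

• a UDTF in the select list must be given column aliases, otherwise the query fails: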

select explode(mycol) as mynewcol from mytable;

select explode(mymap) as (mymapkey, mymapvalue) from mymaptable;

select posexplode(mycol) as pos, mynewcol from mytable;

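4) lateral view

lateral view is the construct Hive provides for combining a UDTF with other columns; it solves the problem that a UDTF cannot be accompanied by additional select columns. Suppose we split a column of a Hive table and want to expand the result in 1-to-n fashion, i.e. one row into many rows. On its own, Hive does not allow other select expressions next to the UDTF.

lateral view is meant to be used together with UDTFs such as explode. It puts the UDTF output into a virtual table, and that virtual table (1 to n) is joined with the input row (for example, each game_id) so that columns outside the UDTF can be selected as well (the source row and its expanded virtual rows are joined directly, 1 join n, row by row). This is why the UDTF expression in a lateral view must be followed by a table alias and column aliases.

lateral view can be used in two places:

• in front of the UDTF

• after from basetable

For example:

select pageid, adid from pageads lateral view explode(adid_list) adtable as adid;

A from clause can be followed by multiple lateral views.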

select mycol1, mycol2 from basetable

lateral view explode(col1) mytable1 as mycol1

lateral view explode(col2) mytable2 as mycol2;
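The row-by-row "1 join n" behaviour of lateral view explode can be modelled in Python as follows. This is a sketch under assumptions: rows are plain dicts, the sample pageads data is made up, and the plain (non-outer) form is modelled, where an input row whose array is empty produces no output.

```python
def lateral_view_explode(rows, array_col, alias):
    """Join every input row with the rows explode() generates from its array column."""
    out = []
    for row in rows:
        for item in row[array_col]:      # the virtual table for this one input row
            joined = dict(row)           # keep the other columns of the source row
            joined[alias] = item         # the column alias given after "as"
            out.append(joined)
    return out

pageads = [
    {"pageid": "front_page", "adid_list": [1, 2, 3]},
    {"pageid": "contact_page", "adid_list": [3, 4]},
    {"pageid": "empty_page", "adid_list": []},
]
result = [(r["pageid"], r["adid"])
          for r in lateral_view_explode(pageads, "adid_list", "adid")]
for line in result:
    print(line)
```

Chaining two lateral views corresponds to applying the function twice, each time on the output of the previous call.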

5) The stack function

stack can also be used to turn columns into rows

stack(int n, col1, col2, ..., colm) turns the m values into n rows; n must be a constant.

stack(6,c1,c2,c3,c4,c5,c6)
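What stack produces can be illustrated with a small Python model. It is a sketch that assumes the number of values is an exact multiple of n; it does not model what Hive does when the values do not divide evenly.

```python
def stack(n, *values):
    """Break the values into n rows of len(values) // n columns each."""
    if n <= 0 or len(values) % n != 0:
        raise ValueError("this sketch needs len(values) to be a multiple of n")
    width = len(values) // n
    return [tuple(values[i * width:(i + 1) * width]) for i in range(n)]

print(stack(2, "a", 1, "b", 2))                       # two rows of two columns
print(stack(6, "c1", "c2", "c3", "c4", "c5", "c6"))   # six rows of one column
```

With n equal to the number of values, as in stack(6,c1,...,c6), every value ends up in its own row.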

6) get_json_object

Test data:

first {"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":""} third

first {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":91,"type":"pear"}],"bicycle":{"price":19.952,"color":"red2"}},"email":""} third

first {"store":{"fruit":[{"weight":10,"type":"apple"},{"weight":911,"type":"pear"}],"bicycle":{"price":19.953,"color":"red3"}},"email":""} third

Pretty-printed:

first {

"store": {
"fruit": [{
"weight": 8,
"type": "apple"
}, {
"weight": 9,
"type": "pear"
}],
"bicycle": {
"price": 19.951,
"color": "red1"
}
},
"email": ""
}
third

first {
"store": {
"fruit": [{
"weight": 9,
"type": "apple"
}, {
"weight": 91,
"type": "pear"
}],
"bicycle": {
"price": 19.952,
"color": "red2"
}
},
"email": ""
}
third

first {
"store": {
"fruit": [{
"weight": 10,
"type": "apple"
}, {
"weight": 911,
"type": "pear"
}],
"bicycle": {
"price": 19.953,
"color": "red3"
}
},
"email": ""
}
third

create external table if not exists t_json(f1 string, f2 string, f3 string) row format delimited fields terminated by ' ' location '/test/json';

select get_json_object(t_json.f2, '$.owner') from t_json;

SELECT * from t_json where get_json_object(t_json.f2, '$.store.fruit[0].weight') = 9;

SELECT get_json_object(t_json.f2, '$.non_exist_key') FROM t_json;
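To see what the three queries return for the first test row, the path lookup can be modelled in Python. This is a simplified sketch, not Hive's parser: it only handles paths built from .name and [index] steps, and it returns Python values where Hive returns strings.

```python
import json
import re

def get_json_object(json_str, path):
    """Resolve a path such as $.store.fruit[0].weight; return None when it is absent."""
    try:
        node = json.loads(json_str)
    except (TypeError, ValueError):
        return None                       # invalid JSON behaves like a missing value
    if not path.startswith("$"):
        return None
    for name, index in re.findall(r"\.([^.\[\]]+)|\[(\d+)\]", path[1:]):
        try:
            node = node[name] if name else node[int(index)]
        except (KeyError, IndexError, TypeError):
            return None                   # missing key or index
    return node

f2 = ('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],'
      '"bicycle":{"price":19.951,"color":"red1"}},"email":""}')
print(get_json_object(f2, "$.store.fruit[0].weight"))
print(get_json_object(f2, "$.owner"))
print(get_json_object(f2, "$.non_exist_key"))
```

Since the test data contains no owner key, the $.owner query yields null for every row, just like $.non_exist_key.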

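5. MySQL (Hive) date function: FROM_UNIXTIME

The '%'-style format strings below are MySQL syntax. Hive's from_unixtime takes a Java SimpleDateFormat pattern instead, for example from_unixtime(sys_time, 'yyyyMMdd').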

select sys_time,FROM_UNIXTIME(sys_time, '%y%m%d') from mobile_order_0 limit 10; -- 151121

select sys_time,FROM_UNIXTIME(sys_time, '%Y%m%d') from mobile_order_0 limit 10; -- 20151121

select sys_time,FROM_UNIXTIME(sys_time, '%Y-%m-%d') from mobile_order_0 limit 10; -- 2015-12-21

select sys_time,FROM_UNIXTIME(sys_time, '%Y年%m月%d日') from mobile_order_0 limit 10;

-- 2015年12月21日

select sys_time,FROM_UNIXTIME(sys_time, '%Y%M%d') from mobile_order_0 limit 10; -- 2015December21

select sys_time,FROM_UNIXTIME(sys_time, '%Y%M%D') from mobile_order_0 limit 10; -- 2015December21st

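6. UDF, UDAF and UDTF

File formats: Text File, Sequence File

In-memory data formats: Java int/string, Hadoop IntWritable/Text

1) UDF: user-defined function (1-to-1)

A UDF can be used directly in a select statement to format the query result before it is output. A UDF only supports one row in, one row out.

Usage: a. extend org.apache.hadoop.hive.ql.exec.UDF

b. implement the evaluate function

c. evaluate can be overloaded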

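2) UDAF: user-defined aggregate function (n-to-1)

Some aggregate functions needed in Hive queries are not built into HQL and must be implemented by the user. A UDAF takes many rows in and produces one row out.

Usage: a. two imports are required: org.apache.hadoop.hive.ql.exec.UDAF and org.apache.hadoop.hive.ql.exec.UDAFEvaluator

b. extend the UDAF class, with an inner Evaluator class that implements the UDAFEvaluator interface.

c. the Evaluator must implement init, iterate, terminatePartial, merge and terminate.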

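3) UDTF: user-defined table-generating function (1-to-n); a UDTF covers the case where one input row produces multiple output rows

Usage: a. extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF.

b. implement the three methods initialize, process and close.

Hive first calls initialize, which returns information about the rows the UDTF emits (number of columns and their types). After initialization, process is called for the incoming arguments, and results are emitted through forward(). Finally close() is called to clean up whatever needs cleaning up.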
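The order in which the five UDAF evaluator methods are called is easier to follow in a simulation. The Python sketch below models an average aggregate; it is not the Hive Java API. The class name AvgEvaluator, the snake_case method names and the run_job driver are hypothetical, and the driver only imitates how map-side partial results are merged on the reduce side.

```python
class AvgEvaluator:
    """Models the evaluator lifecycle: init, iterate, terminatePartial, merge, terminate."""

    def __init__(self):
        self.init()

    def init(self):
        # Reset the aggregation state.
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        # Called once per input row on the map side; nulls are ignored.
        if value is not None:
            self.total += value
            self.count += 1
        return True

    def terminate_partial(self):
        # Partial result handed from the map side to the reduce side.
        return (self.total, self.count)

    def merge(self, partial):
        # Called on the reduce side for every partial result.
        if partial is not None:
            self.total += partial[0]
            self.count += partial[1]
        return True

    def terminate(self):
        # Final result; None when no rows were aggregated.
        return self.total / self.count if self.count else None


def run_job(splits):
    """Hypothetical driver: one evaluator per input split, then one final merge."""
    partials = []
    for split in splits:
        evaluator = AvgEvaluator()
        for value in split:
            evaluator.iterate(value)
        partials.append(evaluator.terminate_partial())
    final = AvgEvaluator()
    for partial in partials:
        final.merge(partial)
    return final.terminate()


print(run_job([[1, 2, 3], [4, None], []]))
```

The partial state carries both the sum and the count, because averaging the per-split averages would give the wrong result when splits differ in size.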

7. Conditional functions

1) The if function: if

Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)

Return type: T

Description: returns valueTrue when testCondition is TRUE; otherwise returns valueFalseOrNull

Example:

hive> select if(1=2,100,200) from lxw_dual;

200

hive> select if(1=1,100,200) from lxw_dual;

100

2) First non-null value: COALESCE

Syntax: COALESCE(T v1, T v2, …)

Return type: T

Description: returns the first non-null argument; returns NULL if all arguments are NULL

Example:

hive> select COALESCE(null,'100','50') from lxw_dual;

100

3) Conditional expression: CASE (simple form)

Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END

Return type: T

Description: returns c if a equals b; returns e if a equals d; otherwise returns f

Example:

hive> select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from lxw_dual;

mary

hive> select case 200 when 50 then 'tom' when 100 then 'mary' else 'tim' end from lxw_dual;

tim

4) Conditional expression: CASE (searched form)

Syntax: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END

Return type: T

Description: returns b if a is TRUE; returns d if c is TRUE; otherwise returns e

Example:

hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from lxw_dual;

mary

hive> select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end from lxw_dual;

tom

8. Hive function reference: http://blog.csdn.net/wisgood/article/details/17376393

An introduction to Hive's complex data structures and notes on some related functions: https://my.oschina.net/leejun2005/blog/120463

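9. Version numbers such as 9.6.0 cannot be compared directly with >, =, < in Hive, because they are compared as strings.

1) Comparing 1.0.0, 10.0.0 and 2.0.0 gives

2.0.0>10.0.0>1.0.0

which is not the correct order

2) To select versions 9.6.0 and above, writing >= '9.6.0' directly is also wrong, because versions such as 10.0.0 are excluded. A regular expression can be used instead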

a. '4.6.0 and above'

rlike '[4-9]\\.[1][0-9]\\.[0-9]|[4-9]\\.[6-9]\\.[1][0-9]|4\\.[6-9]\\.[0-9]|[5-9]\\.[0-9]\\.[0-9]|[1-9][0-9](\\.[0-9]{1,2}){2}'

-- [4-9].[10-19].[0-9] [4-9].[6-9].[10-19] 4.6.0-4.9.9 5.0.0-9.9.9 10.0.0-99.99.99

b. 'below 9.6.0'

rlike '^([0-8](\\.[0-9]{1,2}){2}|9\\.[0-5]\\.[0-9]{1,2})$'

-- 0.0.0-8.99.99 9.0.0-9.5.99

c. '9.6.0 and above'

rlike '9\\.([6-9]|[1-9][0-9])\\.[0-9]{1,2}|[1-9][0-9](\\.[0-9]{1,2}){2}'

-- 9.6.0-9.99.99 10.0.0-99.99.99

d. '9.8.0 and above'

rlike '9\\.([89]|[1-9][0-9])\\.[0-9]{1,2}|[1-9][0-9](\\.[0-9]{1,2}){2}'

-- 9.8.0-9.99.99 10.0.0-99.99.99
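Why the direct comparison fails, and what a numeric comparison looks like, can be shown in Python. This is an illustrative sketch that assumes purely numeric, dot-separated version strings; version_key and at_least are hypothetical helper names, not Hive functions.

```python
def version_key(version):
    """Turn '9.6.0' into (9, 6, 0) so the parts are compared as numbers."""
    return tuple(int(part) for part in version.split("."))

def at_least(version, minimum):
    return version_key(version) >= version_key(minimum)

versions = ["1.0.0", "10.0.0", "2.0.0"]
print(sorted(versions))                   # string order: character by character
print(sorted(versions, key=version_key))  # numeric order
print("10.0.0" >= "9.6.0")                # string comparison excludes 10.0.0
print(at_least("10.0.0", "9.6.0"))        # numeric comparison keeps it
```

The string comparison stops at the first character, where '1' sorts before '9', which is exactly why 10.0.0 is dropped by >= '9.6.0'.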

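10. Semi joins in Hive

Background: a SELECT statement nested inside another SQL statement is called a subquery. Older Hive versions (before 0.13) support subqueries only in the FROM clause of a SELECT statement, so an IN subquery is not supported there.

Solution: use a semi join

With an IN subquery the statement would be: select * from A where id in (select id from B)

With a semi join the statement is: select * from A left semi join B ON(A.id = B.id)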
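The equivalence between the IN subquery and the left semi join can be checked with a small Python model. It is a sketch with made-up tables A and B represented as lists of dicts.

```python
def left_semi_join(a_rows, b_rows, key):
    """Keep each row of A at most once if its key occurs in B; no B columns are returned."""
    b_keys = {row[key] for row in b_rows}
    return [row for row in a_rows if row[key] in b_keys]

A = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}, {"id": 3, "name": "z"}]
B = [{"id": 2}, {"id": 2}, {"id": 3}, {"id": 4}]

print(left_semi_join(A, B, "id"))
```

Unlike an inner join, the duplicate id 2 in B does not duplicate the matching row of A, which is the same behaviour as the IN subquery.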

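11. Replacing Hive's invisible \001 delimiter with a visible character

Hive's default field delimiter is \001, a non-printable character that vi displays as ^A

In vi, press Esc and type :%s/

then press Ctrl+V followed by Ctrl+A (this is shown as ^A),

then type /|/g, giving: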

:%s/^A/|/g

The simplest way is sed (note that the ^A here must be entered with Ctrl+V followed by Ctrl+A; typing the two characters ^ and A does not work.)

sed -i 's/^A/|/g' 000000_0

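In a terminal, ^A is usually entered with Ctrl+V followed by Ctrl+A.

Another option is tr

tr '\001' '|' < 000000_0 > 000000_1

which saves the file 000000_0 containing the hidden characters as a new file 000000_1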
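The same replacement can be scripted, which avoids typing the control character at all. The following is a minimal Python sketch; the function names are hypothetical, and the file names in the commented usage line are the ones used above.

```python
def replace_delimiter(data, new=b"|"):
    """Replace every \\001 (Ctrl+A) byte with a visible delimiter."""
    return data.replace(b"\x01", new)

def convert_file(src, dst, new=b"|"):
    """Copy src to dst line by line, replacing the hidden delimiter."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in fin:
            fout.write(replace_delimiter(line, new))

print(replace_delimiter(b"1\x01tom\x0120\n"))
# convert_file("000000_0", "000000_1")  # same effect as the tr command
```

Working on bytes rather than text sidesteps any question about the encoding of the exported file.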

See: https://blog.csdn.net/bon_mot/article/details/72902784

12. Hive syntax tutorial: https://www.cnblogs.com/HondaHsu/p/4346354.html

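13. Differences between Hive internal (managed) tables and external tables

1) When data is loaded into an external table, it is not moved into Hive's warehouse directory, i.e. the data of an external table is not managed by Hive itself; an internal table behaves differently, its data is moved into the warehouse directory;

2) When an internal table is dropped, Hive deletes both the table's metadata and its data; when an external table is dropped, Hive deletes only the metadata and leaves the data in place.

Which kind of table to use makes little difference in most cases. If all processing is done by Hive, create an internal table; otherwise use an external table.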

14. Several ways to run Hive and to save Hive results locally (https://www.cnblogs.com/kouryoushine/p/7808567.html)

Method 1: run hive -e directly from bash and redirect the result to a specified file with >

hive -e "select * from student where sex = '男'" > /tmp/output.txt

Method 2: run hive -f directly from bash to execute one or more SQL statements stored in a file, and redirect the result to a specified file with >

hive -f exer.sql > /tmp/output.txt

-- contents of the file

select * from student where sex = '男';

select count(*) from student;

Method 3: inside Hive, run a Hive SQL statement that writes the result to the local file system (INSERT OVERWRITE LOCAL DIRECTORY) or to HDFS (INSERT OVERWRITE DIRECTORY)

insert overwrite local directory "/tmp/out"

select cno,avg(grade) from sc group by(cno);

Method 4: plain SQL, selecting data from one table and inserting it directly into another. See the SQL syntax for details.

insert overwrite table student3

select sno,sname,sex,sage,sdept from student3 where year='1996';

15. In Hive LIKE queries, '%' and '_' are wildcards, so a literal '%' or '_' in the pattern must be escaped
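Why the escaping matters can be shown by translating a LIKE pattern into a regular expression. This Python sketch assumes backslash as the escape character and models only the pattern semantics, not how backslashes must be written inside a HiveQL string literal; like_to_regex and like are hypothetical helper names.

```python
import re

def like_to_regex(pattern, escape="\\"):
    """Translate a LIKE pattern: % is any sequence, _ is any single character."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern) and pattern[i + 1] in "%_":
            out.append(re.escape(pattern[i + 1]))  # escaped wildcard is a literal
            i += 2
            continue
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out), re.DOTALL)

def like(value, pattern):
    return like_to_regex(pattern).fullmatch(value) is not None

print(like("userXname", "user_%"))    # unescaped _ matches any character
print(like("userXname", r"user\_%"))  # escaped _ must be a literal underscore
print(like("user_name", r"user\_%"))
```

Without the escape, a pattern meant to find names containing an underscore also matches names that have any other character in that position.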

Original article: http://blog.sina.com.cn/s/blog_6ff05a2c0100znp9.html