Hive Functions
Built-in operators
Official Hive documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
You can treat Hive much like MySQL: the vast majority of MySQL's functions have Hive equivalents.
Hive Custom Functions and Transform
When Hive's built-in functions cannot meet your business needs, you can write a user-defined function (UDF, user-defined function).
Custom function categories
UDF development example
1. Code
package cn.feizhou.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {
    // evaluate() must be public; Hive calls it once per row.
    public String evaluate(String field) {
        if (field == null) {
            return null; // pass NULLs through instead of throwing
        }
        return field.toLowerCase();
    }
}
2. Package the class into a jar and upload it to the server
3. Add the jar to Hive's classpath
hive> add JAR /root/udf.jar;
Added [/root/udf.jar] to class path
Added resources: [/root/udf.jar]
4. Create a temporary function bound to the compiled Java class
hive> create temporary function toprovince as 'cn.feizhou.udf.ToLowerCase';
Test
Test data, 1.txt:
1,A
2,B
3,C
4,d
7,y
8,u
create table ttt_t3(id int,name string) row format delimited fields terminated by ',' stored as textfile;
-- load the test data (assuming 1.txt was uploaded to /root, like the other files here)
load data local inpath '/root/1.txt' into table ttt_t3;
Test result
hive> select id,name,toprovince(name) from ttt_t3;
OK
1 A a
2 B b
3 C c
4 d d
7 y y
8 u u
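The per-row logic of the UDF is easy to check outside Hive. The sketch below is a hypothetical Python analogue of the Java evaluate() method, applied to the same test rows (it only mirrors the transformation; in reality Hive invokes the Java class on each row):

```python
# Python analogue of ToLowerCase.evaluate(): lowercase one field, pass NULL through.
def evaluate(field):
    if field is None:
        return None
    return field.lower()

# Same rows as the 1.txt test data above.
rows = [(1, "A"), (2, "B"), (3, "C"), (4, "d"), (7, "y"), (8, "u")]
for rid, name in rows:
    print(rid, name, evaluate(name))
# → 1 A a ... 8 u u, matching the query output above
```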
Transform Implementation
Hive's TRANSFORM keyword lets you call your own script from within SQL.
This suits cases where you need functionality Hive lacks but don't want to write a UDF.
Example
[root@HAT01 ~]# cat 6.txt
1,11,978300760,444
2,22,978300760,444
Create the table and load the data
hive> create table ttt_t7(movieid string,rate int,timestring string,uid string)row format delimited fields terminated by ',';
hive> load data local inpath '/root/6.txt' overwrite into table ttt_t7;
hive> select * from ttt_t7;
1 11 978300760 444
2 22 978300760 444
hive>
Create the Python script
vim weekday_mapper.py
#!/bin/python
import sys
import datetime

# Read tab-separated rows from Hive on stdin, write tab-separated rows to stdout.
for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    # Convert the unix timestamp to an ISO weekday (1 = Monday ... 7 = Sunday).
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([movieid, rating, str(weekday), userid]))
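TRANSFORM simply streams tab-separated rows through the script's stdin and reads tab-separated rows back from its stdout. The mapper logic can therefore be tested in-process without a cluster. A minimal sketch (note that fromtimestamp() interprets the epoch value in the machine's local timezone, which is why the cluster reported weekday 1 for 978300760 while a UTC machine would see 7):

```python
import datetime
import io

def transform(stream):
    """Apply the weekday_mapper logic to tab-separated input lines."""
    out = []
    for line in stream:
        movieid, rating, unixtime, userid = line.strip().split('\t')
        # Local-time conversion, matching the script Hive runs on each node.
        weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
        out.append('\t'.join([movieid, rating, str(weekday), userid]))
    return out

# The two rows of 6.txt, as Hive would feed them to the script.
sample = io.StringIO("1\t11\t978300760\t444\n2\t22\t978300760\t444\n")
for row in transform(sample):
    print(row)
```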
Add the script as a Hive resource so it is shipped to the task nodes
hive> add FILE /root/weekday_mapper.py;
Added resources: [/root/weekday_mapper.py]
Result
hive> SELECT TRANSFORM (movieid, rate, timestring,uid) USING 'python weekday_mapper.py' FROM ttt_t7;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = root_20190309144803_fa0b3901-21fb-404e-a95e-660362bb41a9
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1552027080460_0009, Tracking URL = http://HAT02:8088/proxy/application_1552027080460_0009/
Kill Command = /home/hadoop/app/hadoop-2.6.4/bin/hadoop job -kill job_1552027080460_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-03-09 14:49:14,142 Stage-1 map = 0%, reduce = 0%
2019-03-09 14:50:02,816 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.24 sec
MapReduce Total cumulative CPU time: 1 seconds 240 msec
Ended Job = job_1552027080460_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.24 sec HDFS Read: 4508 HDFS Write: 133 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 240 msec
OK
1 11 1 444
2 22 1 444
Time taken: 120.21 seconds, Fetched: 2 row(s)
Hive Practical Case: Cascading Sum
Implementation
This can be done with a single HQL statement:
select A.username,A.month,max(A.salary) as salary,sum(B.salary) as accumulate
from
(select username,month,sum(salary) as salary from t_access_times group by username,month) A
inner join
(select username,month,sum(salary) as salary from t_access_times group by username,month) B
on
A.username=B.username
where B.month <= A.month
group by A.username,A.month
order by A.username,A.month;
There is another approach; its key idea is the join condition
where B.month = A.month-1
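The self-join above pairs each (user, month) total A with every total B of the same user whose month is less than or equal, so summing B.salary yields the running total. The same two steps, sketched in plain Python over hypothetical sample rows of t_access_times (usernames and values are made up for illustration):

```python
from collections import defaultdict

# Hypothetical t_access_times rows: (username, month, salary)
rows = [
    ("A", "2015-01", 5), ("A", "2015-01", 15),
    ("A", "2015-02", 4), ("A", "2015-02", 6),
    ("B", "2015-01", 5), ("B", "2015-02", 10),
]

# Step 1: monthly totals per user (the inner GROUP BY subquery).
monthly = defaultdict(int)
for user, month, salary in rows:
    monthly[(user, month)] += salary

# Step 2: for each (user, month), sum every month <= it
# (the self-join plus WHERE B.month <= A.month).
for (user, month) in sorted(monthly):
    accumulate = sum(v for (u, m), v in monthly.items()
                     if u == user and m <= month)
    print(user, month, monthly[(user, month)], accumulate)
# → A 2015-01 20 20
#   A 2015-02 10 30
#   B 2015-01 5 5
#   B 2015-02 10 15
```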