Big Data Study Notes, Hive (7): Importing Data with Regular Expressions and UDF Date Conversion

htsjk.Com, 2019-10-07 23:58

Contents

Importing data with regular expressions in Hive
UDF date conversion

Writing the UDF
Adding it to Hive

Importing data with regular expressions in Hive

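Requirement: the fields in the access log are not separated by tabs, so a plain delimited table cannot split each line into columns. A regular expression describes the layout of each row instead.
CREATE TABLE statement: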

create table IF NOT EXISTS db_web_data.baidu_log (
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string,
host string
)
-- This line is fixed: it tells Hive to split each row with a regular expression
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
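-- The regex mirrors the column list above: eleven fields, separated by single spaces
-- Each (...) group is the pattern for one column; [^ ]* matches any run of non-space characters (zero or more); -|[^ ]* matches a hyphen or any run of non-space characters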
"input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")"
)
-- File format Hive uses to store this table's data in the warehouse; TEXTFILE is the default
STORED AS TEXTFILE;
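To see how the pattern splits a line into eleven columns, it can be tried outside Hive with plain java.util.regex. The sketch below is only an illustration: the sample line is hypothetical (the real baidu_access.log is not reproduced here), the class and method names are made up, and the pattern is the article's regex transcribed as a Java string literal, assuming Hive hands it to the regex engine unchanged. A row that does not match the whole pattern yields no columns here; RegexSerDe is generally understood to return NULL columns in that case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for trying the table's input.regex outside Hive.
class RegexSplitDemo {
    // The article's pattern, written as a Java string literal.
    static final String INPUT_REGEX =
        "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") "
        + "(-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")";

    // Made-up log line in the assumed layout: quoted fields separated by spaces.
    static final String SAMPLE_LINE =
        "\"27.38.5.159\" \"-\" \"31/Aug/2015:00:04:37 +0800\" "
        + "\"GET /course/view.php?id=27 HTTP/1.1\" \"303\" \"440\" - "
        + "\"http://www.example.com/user.php?act=mycourse\" "
        + "\"Mozilla/5.0 (Windows NT 6.1)\" \"-\" \"learn.example.com\"";

    // Returns one string per capture group, or null when the line does not match.
    static String[] split(String line) {
        Matcher m = Pattern.compile(INPUT_REGEX).matcher(line);
        if (!m.matches()) {
            return null; // the whole line must match, not just a prefix
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1); // group 0 is the whole match
        }
        return cols;
    }

    public static void main(String[] args) {
        String[] cols = split(SAMPLE_LINE);
        for (int i = 0; i < cols.length; i++) {
            System.out.println((i + 1) + ": " + cols[i]);
        }
    }
}
```

Because the quotes sit inside the capture groups, each captured value keeps its surrounding double quotes, which matters as soon as a column such as time_local is parsed further.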

Further reading: look into Hadoop's compression mechanisms.

Load the data:

load data local inpath '/home/admin/Desktop/baidu_access.log' into table baidu_log;

Query the data:

select * from baidu_log limit 5;

Check the table's files on HDFS.

UDF date conversion

Writing the UDF

DataTransformUDF

package com.z.demo.udf;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;


import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

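// Converts access-log timestamps such as "31/Aug/2015:00:04:37 +0800"
// to "yyyy-MM-dd HH:mm:ss". The trailing offset (+0800) is not parsed.
public class DataTransformUDF extends UDF {
	// Input: four-digit year and English month abbreviations (Aug, Sep, ...)
	private final SimpleDateFormat inputFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
	private final SimpleDateFormat outputFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");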

	public Text evaluate(Text str){
		Text output = new Text();
		
		if(null == str) return null;
		
		if(StringUtils.isBlank(str.toString())) return null;
		
		Date parseDate;
		try {
			parseDate = inputFormat.parse(str.toString().trim());
			String outputDate = outputFormat.format(parseDate);
			
			output.set(outputDate);
		} catch (ParseException e) {
			e.printStackTrace();
		}
		
		return output;
	}
	public static void main(String[] args) {
		System.out.println(new DataTransformUDF().evaluate(new Text("31/Aug/2015:00:04:37 +0800")));
	}

}
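The UDF above stops parsing before the offset, so "+0800" is silently ignored, and SimpleDateFormat instances are not thread-safe. As an alternative sketch, not the article's method, the same conversion can be written with java.time, which parses the offset as well. The class and method names here are hypothetical, and the code is plain Java without the Hive UDF wrapper so that it can be run on its own.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical stand-alone variant of the date conversion, using java.time.
class AccessLogTimeDemo {
    // "Z" parses offsets written as +HHMM, for example +0800.
    private static final DateTimeFormatter INPUT =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
    private static final DateTimeFormatter OUTPUT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Returns the reformatted local time, or null for blank or unparseable input.
    static String convert(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        try {
            return OffsetDateTime.parse(raw.trim(), INPUT).format(OUTPUT);
        } catch (DateTimeParseException e) {
            return null; // a Hive UDF would typically surface this as NULL
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("31/Aug/2015:00:04:37 +0800"));
    }
}
```

Returning null on failure, rather than an empty string, lets bad rows show up as NULL in query results; whether that is preferable depends on how downstream queries treat missing dates.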

Adding it to Hive

Add the packaged jar:

hive (db_web_data)> add jar /home/admin/Desktop/dateformat.jar;

Create a temporary function:

hive (db_web_data)> create temporary function dateformat as 'com.z.demo.udf.DataTransformUDF';

Check that it was registered:

show functions;

Run a test:

select dateformat(time_local) from baidu_log limit 1;

java.text.ParseException: Unparseable date: ""31/Aug/2015:00:04:53 +0800""
        		at java.text.DateFormat.parse(DateFormat.java:366)

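Error: note that the value in the message, ""31/Aug/2015:00:04:53 +0800"", is wrapped in two pairs of quotes. The outer pair is added by the exception message itself; the inner pair is part of the stored column value, because the capture groups in input.regex include the surrounding quotes. The date pattern does not expect a leading quote, so parsing fails.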
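The failure can be reproduced outside Hive. The following is a minimal sketch with hypothetical class and method names, using the same SimpleDateFormat pattern family as the UDF (with Locale.ENGLISH on both formatters to keep the output deterministic): the value with its embedded quotes does not parse, while the same value without them does.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical reproduction of the "Unparseable date" error.
class QuotedDateDemo {
    // Parses an access-log timestamp and reformats it; throws on bad input.
    static String convert(String raw) throws ParseException {
        SimpleDateFormat in = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ENGLISH);
        return out.format(in.parse(raw));
    }

    // True when the value can be parsed as a timestamp.
    static boolean parses(String raw) {
        try {
            convert(raw);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws ParseException {
        String stored = "\"31/Aug/2015:00:04:53 +0800\""; // value as captured by the regex
        System.out.println("with quotes parses: " + parses(stored));
        String bare = stored.replace("\"", "");
        System.out.println("without quotes: " + convert(bare));
    }
}
```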

Fix: define a UDF that strips the double quotes from the data (see RemoveQuotesUDF below), then run the test again:
hive> select dateformat(remove_q(time_local)) date from baidu_log limit 1;
date
2015-08-31 00:04:37
Time taken: 0.196 seconds, Fetched: 1 row(s)

RemoveQuotesUDF

package com.z.demo.udf;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class RemoveQuotesUDF extends UDF {

	public Text evaluate(Text str){
		if(null == str){
			return null;
		}
		
		// validate 
		if(StringUtils.isBlank(str.toString())){
			return null ;
		}
		
		// replaceAll
		return new Text(str.toString().replaceAll("\"", ""));
	}
	
	public static void main(String[] args) {
		System.out.println(new RemoveQuotesUDF().evaluate(new Text("\"GET /course/view.php?id=27 HTTP/1.1\"")));
//		System.out.println(new RemoveQuotesUDF().evaluate(new Text(args[0])));
	}
}
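RemoveQuotesUDF removes every double quote in the value, including any that appear in the middle of a field. If only the enclosing pair should go, a narrower variant is possible. The sketch below is an assumption about what such a variant could look like, not part of the original code; the class and method names are hypothetical, and it is written as plain Java so it runs without Hive on the classpath.

```java
// Hypothetical variant: strip only one enclosing pair of double quotes.
class EnclosingQuotesDemo {
    static String stripEnclosing(String s) {
        if (s == null) {
            return null;
        }
        String t = s.trim();
        boolean quoted = t.length() >= 2
            && t.charAt(0) == '"'
            && t.charAt(t.length() - 1) == '"';
        // Drop the first and last character only when both are quotes.
        return quoted ? t.substring(1, t.length() - 1) : t;
    }

    public static void main(String[] args) {
        System.out.println(stripEnclosing("\"GET /course/view.php?id=27 HTTP/1.1\""));
        System.out.println(stripEnclosing("\"-\""));
    }
}
```

For the time_local column both approaches give the same result; the difference only shows for fields that may legitimately contain quotes of their own.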

Register the quote-removing function:

add jar /home/admin/Desktop/remove_q.jar;
create temporary function remove_q as 'com.z.demo.udf.RemoveQuotesUDF';
show functions;

Query again:

select dateformat(remove_q(time_local)) from baidu_log limit 1;