Hive

和通数据库 htsjk.Com, 2019-07-20 22:47

Creating a table

create table a.b(s string, f string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'; -- '\t' is the field delimiter of the text file

Creating an external table

create external table ibdata(mdn_id string, loc_id string)
row format delimited
fields terminated by '\t' -- field delimiter of the data
stored as textfile -- file type
location 'path'; -- HDFS path

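File types for stored as: sequencefile (a compressed binary format; note that .gz text files can also be read directly by a textfile table) and textfile (plain text).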

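P.S. A problem I ran into recently: if the HDFS path is wrong, the error reported is a permission error.

Both absolute and relative paths can be written, for example hdfs://namenode/user/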
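Since a wrong location surfaces as a permission error, it can help to check the directory before creating the table. A minimal sketch for the Hive CLI, assuming the hdfs://namenode/user/ form above. The ibdata subdirectory and the ibdata_check table name are hypothetical, and namenode stands for the real host.

```sql
-- List the directory from inside the Hive CLI; this fails right away if the
-- path does not exist or is not readable by the current user.
dfs -ls hdfs://namenode/user/ibdata;

-- Hypothetical external table pointing at that directory.
create external table ibdata_check(mdn_id string, loc_id string)
row format delimited
fields terminated by '\t'
stored as textfile
location 'hdfs://namenode/user/ibdata';

-- The Location row of the output shows the path Hive actually resolved.
describe formatted ibdata_check;
```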

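Analyzing DPI data (URL decomposition)

1. Select the values of the target field (my data volume is very large, so I had to set the number of map and reduce tasks and use a partition).

One problem here: is not null may fail. In our data NULL is stored as a string, so it has to be wrapped in quotes ('NULL') and compared as a string.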
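Step 1 might look roughly like the following. This is only a sketch: the table dpi_log, its partition column dt, and the numeric values are assumptions for illustration, not the original setup.

```sql
-- Illustrative values; tune them to the cluster and data size.
set mapreduce.job.reduces=20;                                  -- number of reduce tasks
set mapreduce.input.fileinputformat.split.maxsize=256000000;   -- larger splits mean fewer map tasks

-- Hypothetical table dpi_log(url string) partitioned by dt.
select url
from dpi_log
where dt = '20171205'      -- read a single partition instead of the whole table
  and url is not null      -- filters real SQL NULLs
  and url <> 'NULL'        -- filters the literal string 'NULL' written upstream
  and url <> '';           -- filters empty strings
```

Keeping both the is not null test and the string comparison covers the case where some rows carry a real NULL and others carry the text NULL.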

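2. Find the target URLs and analyze them (that is, fuzzy matching on url). Watch out for escaping in Hive: a backslash in a string literal has to be doubled to reach the regex engine.

select url rlike '^(http)?://(dict)\\.(51ifind)\\.(com)/*' from url;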
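The effect of the escaping can be checked on string literals before running against real data. A small sketch, assuming a Hive version that accepts select without a from clause (0.13 and later); the second test URL is made up to show a non-match.

```sql
-- \\. in the literal reaches the regex engine as \. and matches only a real dot.
select 'http://dict.51ifind.com/abc' rlike '^(http)?://dict\\.51ifind\\.com/';   -- true
select 'http://dictx51ifind.com/abc' rlike '^(http)?://dict\\.51ifind\\.com/';   -- false

-- Used as a filter on the url column of the url table from the query above.
select url
from url
where url rlike '^(http)?://dict\\.51ifind\\.com/';
```

With a single backslash the dot would be left unescaped and match any character, so the second literal would also return true.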

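3. Sometimes the URLs are not that regular and you have to pick them apart yourself; regexp_extract(url, rex, id) can be used for that. For example, for https://mp.csdn.net/postedit/80236627:

select regexp_extract('https://mp.csdn.net/postedit/80236627','[0-9/]{9}',0);

This returns /80236627 (id is the index of the capture group to return: 0 returns the whole match, and the default is 1).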
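To make the role of the third argument concrete, here is the same URL with an explicit capture group. The pattern is an illustrative variant, not the one from the original query.

```sql
-- Index 0: the whole match, including the leading slash.
select regexp_extract('https://mp.csdn.net/postedit/80236627', '/([0-9]{8})$', 0);   -- /80236627

-- Index 1: only what the first pair of parentheses captured.
select regexp_extract('https://mp.csdn.net/postedit/80236627', '/([0-9]{8})$', 1);   -- 80236627
```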

Another case I ran into, for this URL:

http://mnews.gw.com.cn/wap/data/ipad/stock/SH/08/600008/list/1.json

Either (\\w{2})/(\\d{2})/(\\d{6}) or (\\w{2})([0-9/]{10}) works.
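Applied to that URL, the patterns could be used as below. This sketch writes the first pattern with a slash between each group, which is what the SH/08/600008 segment needs in order to match.

```sql
-- Whole match, then the individual groups of the three-group pattern.
select regexp_extract('http://mnews.gw.com.cn/wap/data/ipad/stock/SH/08/600008/list/1.json',
                      '(\\w{2})/(\\d{2})/(\\d{6})', 0);   -- SH/08/600008
select regexp_extract('http://mnews.gw.com.cn/wap/data/ipad/stock/SH/08/600008/list/1.json',
                      '(\\w{2})/(\\d{2})/(\\d{6})', 1);   -- SH
select regexp_extract('http://mnews.gw.com.cn/wap/data/ipad/stock/SH/08/600008/list/1.json',
                      '(\\w{2})/(\\d{2})/(\\d{6})', 3);   -- 600008

-- The looser two-group pattern returns the same overall match.
select regexp_extract('http://mnews.gw.com.cn/wap/data/ipad/stock/SH/08/600008/list/1.json',
                      '(\\w{2})([0-9/]{10})', 0);         -- SH/08/600008
```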

4. Date conversion (mainly converting yyyyMMddHHmmss to yyyy-MM-dd in order to determine the day of the week)

Method 1: from_unixtime + unix_timestamp
-- 20171205 to 2017-12-05 (the month is MM; lowercase mm means minutes)
select from_unixtime(unix_timestamp('20171205','yyyyMMdd'),'yyyy-MM-dd') from dual;

-- 2017-12-05 to 20171205
select from_unixtime(unix_timestamp('2017-12-05','yyyy-MM-dd'),'yyyyMMdd') from dual;

Method 2: substr + concat
-- 20171205 to 2017-12-05
select concat(substr('20171205',1,4),'-',substr('20171205',5,2),'-',substr('20171205',7,2)) from dual;

-- 2017-12-05 to 20171205
select concat(substr('2017-12-05',1,4),substr('2017-12-05',6,2),substr('2017-12-05',9,2)) from dual;
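Both methods carry over to a full 14-digit timestamp. A sketch with a made-up sample value, written without a from clause (accepted by Hive 0.13 and later). Note that the pattern letters are case-sensitive: MM is the month and mm is minutes.

```sql
-- Method 1: parse with the full pattern, then format only the date part.
select from_unixtime(unix_timestamp('20171205103000', 'yyyyMMddHHmmss'), 'yyyy-MM-dd');   -- 2017-12-05

-- Method 2: pure string slicing, no date parsing involved.
select concat(substr('20171205103000', 1, 4), '-',
              substr('20171205103000', 5, 2), '-',
              substr('20171205103000', 7, 2));                                            -- 2017-12-05
```

Method 2 never interprets the value as a date, so it also passes through invalid input such as month 13 unchanged, while method 1 goes through the date parser.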

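Determining the day of the week: 0-6 stand for Monday through Sunday, because the reference date 2018-08-27 is a Monday (use a Sunday such as 2018-08-26 as the reference date if 0 should mean Sunday).

pmod(datediff(date, '2018-08-27'), 7)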
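A worked check of the formula on a literal date. 2017-12-05 was a Tuesday, and the reference dates are a Monday (2018-08-27) and a Sunday (2018-08-26). The sketch assumes select without a from clause is available.

```sql
-- Monday reference: 0 = Monday, so Tuesday gives 1.
select pmod(datediff('2017-12-05', '2018-08-27'), 7);   -- 1

-- Sunday reference: 0 = Sunday, so Tuesday gives 2.
select pmod(datediff('2017-12-05', '2018-08-26'), 7);   -- 2

-- The % operator keeps the sign of the negative day difference.
select datediff('2017-12-05', '2018-08-27') % 7;        -- -6
```

pmod is what keeps the result in the 0-6 range for dates before the reference date.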