Hive Summary

和通数据库 (htsjk.Com), 2019-09-26 22:57. Source: unknown

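Contents

I. What Is Hive
II. Hive Features
III. Hive Architecture
IV. File Formats Supported by Hive
V. Data Model
VI. Shell Interaction
VII. Internal and External Tables
VIII. Bucketed Tables
IX. Partitions
X. Why Bucket and Partition
XI. Adding/Modifying Columns
XII. insert
XIII. Exporting Data
XIV. Sorting
XV. join
XVI. Hive Command Line
XVII. Built-in Functions
XVIII. User-Defined Functions
XIX. Transform
XX. Handling Special Delimiters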

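I. What Is Hive

Hive maps structured data stored in HDFS onto tables and translates SQL queries and analysis tasks into MapReduce (MR) programs for execution.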

II. Hive Features

Scalable clusters, user-definable functions, and fault tolerance

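III. Hive Architecture

1. User interfaces:
CLI: shell commands
JDBC/ODBC: programmatic access (JDBC is the Java interface)
WebGUI: access Hive through a browser
2. Metadata storage: MySQL/Derby
Metadata includes table names, columns, partitions and their properties (such as whether a table is external), and the directory holding the table data
3. SQL execution components: interpreter, compiler, optimizer, executor
Together they generate the query plan, which runs as MR programs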

IV. File Formats Supported by Hive

Text, SequenceFile, ParquetFile, RCFile, and others

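V. Data Model

1. DB:
A folder under the ${hive.metastore.warehouse.dir} directory
2. Table:
A folder under the directory of the database it belongs to
3. External Table:
Its data can be stored at any specified path
4. Partition:
A subdirectory under the table directory
5. Bucket:
One of several files in the same table directory, produced by hashing a column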
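
The layout above can be sketched as simple path arithmetic. This is only an illustration, not Hive code: the helper names are hypothetical, and the `.db` suffix on database folders and the `col=value` naming of partition folders are Hive's usual defaults, assumed here rather than taken from the text.

```python
import posixpath

def table_dir(warehouse_dir, db, table):
    # Assumption: a non-default database is a "<db>.db" folder under the warehouse dir.
    return posixpath.join(warehouse_dir, db + ".db", table)

def partition_dir(table_path, partition_spec):
    # partition_spec is an ordered list of (column, value) pairs;
    # each pair adds one subdirectory level under the table directory.
    parts = ["{}={}".format(col, val) for col, val in partition_spec]
    return posixpath.join(table_path, *parts)

if __name__ == "__main__":
    t = table_dir("/user/hive/warehouse", "school", "student_p")
    print(t)
    print(partition_dir(t, [("part", "a")]))
```

An external table would skip `table_dir` entirely, since its location is whatever path was given at creation time.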

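VI. Shell Interaction

1. bin/hive: starts an interactive shell directly
2. bin/hiveserver2: starts Hive as a service for JDBC connections; connect from a terminal with beeline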
beeline> !connect jdbc:hive2://itcast01:10000

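VII. Internal and External Tables

Internal table: create table
External table: create external table … location
Differences:
1. An internal table's data lives in the table's own directory; an external table's data lives under the specified location
2. Dropping an external table does not delete its data (dropping an internal table does)
Note: load data inpath moves the data within HDFS; it does not copy it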

VIII. Bucketed Tables

Create a bucketed table:

create table t_name … partitioned by (col dataType) clustered by (col) sorted by (col) into num buckets

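Sampling a bucketed table with TABLESAMPLE(BUCKET x OUT OF y):
Select * from student tablesample(bucket 1 out of 2 on id)
First work out how many buckets to sample: total buckets / y
Start at bucket x and take every y-th bucket after it: x, x+y, x+y+y …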
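
The two rules above can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only: `sampled_buckets` is a hypothetical helper, and it assumes y evenly divides the table's bucket count.

```python
def sampled_buckets(num_buckets, x, y):
    """Return the 1-based bucket numbers picked by TABLESAMPLE(BUCKET x OUT OF y)."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if num_buckets % y != 0:
        # Assumption of this sketch: only the evenly divisible case is modelled.
        raise ValueError("this sketch requires y to divide num_buckets")
    count = num_buckets // y                      # how many buckets are sampled
    return [x + i * y for i in range(count)]      # x, x+y, x+y+y, ...

if __name__ == "__main__":
    # A table with 4 buckets, sampled with "bucket 1 out of 2"
    print(sampled_buckets(4, 1, 2))
```

For a table with 4 buckets, `bucket 1 out of 2` therefore reads buckets 1 and 3.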

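IX. Partitions

Define partitions when creating a table: create [external] table … partitioned by (col_name data_type)
Add partitions: alter table student_p add partition(part='a') partition(part='b') [location '/temp/data'];
Drop partitions: alter table student_p drop partition(part='a'), partition(part='b');
Note: if no location is given when adding a partition, the partition directory is created in the default place. For an internal table that is the table's root directory in the database; for an external table it is the location given when the table was created. The directory is named after the partition column and its value (for example part=a).
If a location is given when adding a partition, the partition directory is created at that location, and the partition column value does not appear as a directory name.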

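X. Why Bucket and Partition

1. More efficient queries: when a join is keyed on the partition column, the amount of data scanned can be greatly reduced
2. More efficient sampling: less data is read, so testing is faster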

XI. Adding/Modifying Columns

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], …) 
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]

XII. insert

1. Single insert
insert into table t_name values(,)
insert overwrite table t_name [partition (part1='', part2='')] select …
2. Multiple inserts from one source

FROM from_statement 
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select statement1 
[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] 
Select statement2] ...

3. Dynamic partitions
Use non-strict mode:

set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement
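
To make the idea of a dynamic partition concrete, the sketch below simulates in Python how rows might be routed to partition directories when the partition value comes from the data rather than from the statement. It is not Hive code: `dynamic_partition_insert` is a hypothetical helper, the sample rows are made up, and it assumes the dynamic partition column is the last column of each selected row.

```python
from collections import defaultdict

def dynamic_partition_insert(rows, partition_col):
    """Group rows into 'col=value' partitions using the last field of each row."""
    partitions = defaultdict(list)
    for row in rows:
        *data, part_value = row                       # last field picks the partition
        partitions["{}={}".format(partition_col, part_value)].append(tuple(data))
    return dict(partitions)

if __name__ == "__main__":
    rows = [
        (1, "a", "2019-09-25"),
        (2, "b", "2019-09-26"),
        (3, "c", "2019-09-25"),
    ]
    for name, data in sorted(dynamic_partition_insert(rows, "dt").items()):
        print(name, data)
```

Each distinct value of the partition column ends up in its own partition, without the statement having to list those values in advance.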

XIII. Exporting Data

With LOCAL the data is exported to the local file system; otherwise it is exported to HDFS
1. Single select statement

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...

2. Multiple select statements

FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

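XIV. Sorting

1. order by
Global sort; uses only one reducer
2. sort by
Sorting is done before rows enter each reducer. With N reducers there are N independent sorts, so each reducer's output is sorted but no global order is guaranteed
3. cluster by
Distributes rows to reducers by the hash of the given column and also sorts within each reducer
4. distribute by
Only distributes rows to reducers by the hash of the given column; no sorting is done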
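
The four clauses can be compared with a small Python simulation. This is a sketch under stated assumptions, not Hive's implementation: each row is just an integer key, `key % num_reducers` stands in for Hive's hash partitioning, and round-robin assignment stands in for the unspecified way rows reach reducers under sort by. All function names are hypothetical.

```python
def order_by(rows):
    # A single reducer sees every row, so its output is globally sorted.
    return [sorted(rows)]

def distribute_by(rows, num_reducers):
    # Equal keys always land on the same reducer; nothing is sorted.
    reducers = [[] for _ in range(num_reducers)]
    for key in rows:
        reducers[key % num_reducers].append(key)
    return reducers

def sort_by(rows, num_reducers):
    # Rows are spread over reducers with no rule about which key goes where
    # (round-robin here), then each reducer sorts its own share.
    return [sorted(rows[i::num_reducers]) for i in range(num_reducers)]

def cluster_by(rows, num_reducers):
    # distribute by + sort by on the same column.
    return [sorted(r) for r in distribute_by(rows, num_reducers)]

if __name__ == "__main__":
    rows = [5, 2, 8, 1, 4, 7]
    print("order by     ", order_by(rows))
    print("sort by      ", sort_by(rows, 2))
    print("distribute by", distribute_by(rows, 2))
    print("cluster by   ", cluster_by(rows, 2))
```

Only order by yields one fully sorted list; the others return one list per reducer, which is why their combined output is not globally ordered.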

XV. join

table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition

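1. Equi-join: join
SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
Optimization tips:
a. When every join uses the same key, only one MapReduce job is generated (the query above joins on b.key1 and b.key2, so it needs two)
b. In a multi-table join, the earlier tables are buffered in reducer memory while the last table is streamed through, so put the largest table last
2. left/right/full outer join: rows without a match are kept, and the columns of the missing side are filled with NULL
3. left semi join is an efficient implementation of IN/EXISTS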
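
The difference between the join types is easiest to see on a tiny data set. The Python sketch below mimics the semantics only (it says nothing about how Hive executes joins); the helper names and the sample rows are made up for illustration, and `None` plays the role of NULL.

```python
def inner_join(left, right):
    # Every matching (left, right) pair is emitted.
    return [(l, r) for l in left for r in right if l["key"] == r["key"]]

def left_outer_join(left, right):
    # Like inner join, but left rows without a match are kept with None on the right.
    out = []
    for l in left:
        matches = [r for r in right if r["key"] == l["key"]]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out

def left_semi_join(left, right):
    # Like "WHERE key IN (SELECT key FROM right)": each left row appears at most
    # once, and only left-side columns are returned.
    right_keys = {r["key"] for r in right}
    return [l for l in left if l["key"] in right_keys]

if __name__ == "__main__":
    a = [{"key": 1, "val": "a1"}, {"key": 2, "val": "a2"}]
    b = [{"key": 1, "val": "b1"}, {"key": 1, "val": "b2"}]
    print(len(inner_join(a, b)), len(left_outer_join(a, b)), len(left_semi_join(a, b)))
```

With a duplicate key on the right side, the inner join repeats the left row once per match, while the semi join returns it once.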

XVI. Hive Command Line

hive [-hiveconf x=y]* [<-i filename>]* [<-f filename>|<-e query-string>] [-S]
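1. -i: run an initialization HQL file
2. -e: execute the given HQL from the command line
3. -f: execute an HQL script file
4. -v: echo the executed HQL statements to the console
5. -p: connect to Hive Server on the given port number
6. -hiveconf x=y: set Hive/Hadoop configuration variables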

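XVII. Built-in Functions

XVIII. User-Defined Functions

1. UDF
One row in -> one row out
2. UDAF
Multiple rows in -> one row out
3. Development steps
Extend: UDF
Implement: evaluate
Package the class into a jar
Upload the jar and add it to Hive's classpath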

hive>add JAR /home/hadoop/udf.jar;

Create a temporary function

hive>create temporary function tolowercase as 'cn.itcast.bigdata.udf.ToProvince';

Use it in HQL

Select tolowercase(name),age from t_test;

XIX. Transform

Write a custom script and call it from a SQL statement with the transform keyword
add FILE weekday_mapper.py;


INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (movieid , rate, timestring,uid)
  USING 'python weekday_mapper.py'
  AS (movieid, rating, weekday,userid)
FROM t_rating;
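
Contents of weekday_mapper.py:
#!/usr/bin/env python
# Reads tab-separated rows from stdin and replaces the Unix timestamp
# with the ISO weekday (1 = Monday ... 7 = Sunday).
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    # print() with a single argument works under both Python 2 and Python 3
    print('\t'.join([movieid, rating, str(weekday), userid]))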

XX. Handling Special Delimiters

1. Use a regular expression: RegexSerDe

create table t_bi_reg(id string,name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
'input.regex'='(.*)\\|\\|(.*)',
'output.format.string'='%1$s%2$s'
)
stored as textfile;
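
To see what the input.regex above does to a row, the same pattern can be tried in Python. This is only a sketch of the matching behaviour: `parse_row` is a hypothetical helper, the sample rows are made up, and it assumes the whole row must match the pattern, with `None` standing in for a row that fails to match.

```python
import re

# Same pattern as 'input.regex' once the doubled backslashes are unescaped.
ROW_PATTERN = re.compile(r"(.*)\|\|(.*)")

def parse_row(line):
    """Split one row into (id, name) using the '||' delimiter, or return None."""
    match = ROW_PATTERN.fullmatch(line)
    if match is None:
        return None
    return match.groups()

if __name__ == "__main__":
    print(parse_row("1||zhangsan"))
    print(parse_row("a||b||c"))       # greedy first group keeps the earlier '||'
    print(parse_row("no delimiter"))
```

Because the first group is greedy, a row containing the delimiter more than once is split at the last occurrence.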

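2. Custom InputFormat
While reading each line, replace the multi-character delimiter in the data with a single-character delimiter that Hive can split on (here '|')
1. Extend TextInputFormat to build your own InputFormat, implement RecordReader, and override the next method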

package cn.itcast.bigdata.hive.inputformat;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
public class BiDelimiterInputFormat extends TextInputFormat {
 	@Override
 	public RecordReader<LongWritable, Text> getRecordReader(
 	InputSplit genericSplit, JobConf job, Reporter reporter)
 	throws IOException {
 		reporter.setStatus(genericSplit.toString());
 		MyDemoRecordReader reader = new MyDemoRecordReader(new LineRecordReader(job, (FileSplit) genericSplit));
 		// BiRecordReader reader = new BiRecordReader(job, (FileSplit)genericSplit);
 		return reader;
 	}
 	public static class MyDemoRecordReader implements RecordReader<LongWritable, Text> {
 		LineRecordReader reader;
 		Text text;
 		public MyDemoRecordReader(LineRecordReader reader) {
 			this.reader = reader;
 			text = reader.createValue();
 		}
 		@Override
 		public void close() throws IOException {
 			reader.close();
 		}
 		@Override
 		public LongWritable createKey() {
 			return reader.createKey();
 		}
 		@Override
 		public Text createValue() {
 			return new Text();
 		}
 		@Override
 		public long getPos() throws IOException {
 			return reader.getPos();
 		}
 		@Override
 		public float getProgress() throws IOException {
 			return reader.getProgress();
 		}
 		@Override
 		public boolean next(LongWritable key, Text value) throws IOException {	
 			boolean next = reader.next(key, text);
 			if(next){
 				String replaceText = text.toString().replaceAll("\\|\\|", "\\|");
 				value.set(replaceText);
 			}
 			return next;		
 		}
 	}
}

2. Package the class into a jar and place it under hive/lib
3. Specify the input format when creating the table: stored as inputformat

create table t_lianggang(id string,name string)
row format delimited
fields terminated by '|'
stored as inputformat 'cn.itcast.bigdata.hive.inputformat.BiDelimiterInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
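
The effect of this approach on a single line can be sketched in Python. The helper names and sample rows are hypothetical; the code only mirrors the string replacement done in next() above, followed by the split that `fields terminated by '|'` implies.

```python
def normalize_line(line, multi_delim="||", single_delim="|"):
    # Mirrors the replaceAll call in next(): "||" becomes "|".
    return line.replace(multi_delim, single_delim)

def read_fields(line):
    # After normalization the row is split on the single-character delimiter.
    return normalize_line(line).split("|")

if __name__ == "__main__":
    print(read_fields("1||zhangsan"))
    # Caveat: a field that already contains '|' gets split as well.
    print(read_fields("2||li|si"))
```

The second call shows the trade-off of this design: it works only when the single-character delimiter never occurs inside the data itself.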