Hive Notes

Source: htsjk.Com, 2019-09-22 22:54


How Hive HA works:
HAProxy: sits in front of a pool of Hive instances
Hive data types
Primitive
Complex
array
map
struct
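Creating a table
CREATE [EXTERNAL] TABLE table_name (
col_name data_type, ... )
PARTITIONED BY
CLUSTERED BY
SORTED BY ... ASC|DESC
ROW FORMAT
STORED AS: storage format, e.g. TEXTFILE, SEQUENCEFILE, RCFILE
LOCATION --> HDFS path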
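
As a rough illustration, the clauses above could be combined into a single statement like the one below. The table name, columns, delimiters, bucket count and HDFS path are all made up for the example and should be adjusted to the actual data.

```sql
-- Hypothetical external table; every name and path here is illustrative.
CREATE EXTERNAL TABLE IF NOT EXISTS employee_demo (
  name     STRING,                           -- primitive type
  salary   FLOAT,                            -- primitive type
  skills   ARRAY<STRING>,                    -- complex type: array
  scores   MAP<STRING, INT>,                 -- complex type: map
  address  STRUCT<city:STRING, zip:STRING>   -- complex type: struct
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (name) SORTED BY (name ASC) INTO 4 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE
LOCATION '/tmp/hive_demo/employee_demo';

-- Reading the complex columns: index, key lookup, and field access.
SELECT skills[0], scores['math'], address.city
FROM employee_demo
WHERE dt = '2015-05-01';
```

Because the table is declared EXTERNAL, dropping it would remove only the metadata and leave the files under LOCATION in place.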

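Managed (internal) table: DROP TABLE deletes both the schema metadata and the data on HDFS
External table: DROP TABLE deletes only the schema metadata
Loading from the local filesystem
load data local inpath 'path' [overwrite] into table table_name
Loading from HDFS
load data inpath 'path' [overwrite] into table table_name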

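Differences between Hive (a data warehouse) and SQL databases
Apart from the similar syntax, they have little in common
Hive queries and extracts data stored on Hadoop; it cannot update rows in place
Install Thrift
Starting the remote service
hive --service hiveserver &
Thrift server -> connect to Hive over JDBC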
Common tasks:
Save Hive query results to a local file
hive -S -e "select * from table" >>/tmp/mydir
Look up a Hive setting
hive -S -e "set" | grep warehouse
Run a Hive script file
hive -f /path/Hive.sql

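Managed vs. external tables
When you only have read permission on the data, use an external table: loading into a managed table
moves (mv) the files into the warehouse directory, which fails without write permission.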

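Partitions split a table into directories (acting somewhat like an index). Partition columns are special columns; the larger the data set, the more it benefits from partitioning.
Directory layout: /pub/{dt}/{customer_id}
alter table tb ADD PARTITION(dt='20150111',customer_id='00001') location '/pub/20150111/00001';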

create table partition_table(
name string,
salary float
) -- the clauses below must appear in this order
PARTITIONED BY (dt string,dep string)
ROW FORMAT delimited fields terminated by ','
STORED AS textfile;

load data local inpath '/../xxx' into table partition_table partition(dt='2015-05-01',dep='dev2') --> partition values must not contain special characters
On HDFS, each partition is simply a directory
alter table partition_table ADD PARTITION(dt='2014-05-01',dep='dev1') location '/opt/20140501/pati.log' (LOCATION must be a directory, not a single file)
Querying by partition
select * from fc where dt='2014-05-01' and custom_id= '00001';
// drop a partition
alter table partition_table drop PARTITION(dt='2014-05-01',dep='dev1')
// add a column
alter table partition_table add columns(age string)
// only equi-joins are supported
select a.a1,b.b1 from ta a join tb b on a.id=b.id join c on c.id = a.id; // three-table join
LEFT SEMI JOIN is a more efficient implementation of IN/EXISTS subqueries
UNION ALL: union of the result sets, duplicates kept
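
A small sketch of the LEFT SEMI JOIN point, reusing ta and tb from the join above (column a1 is assumed to exist, as in that query). The IN form is shown only as a comment for comparison.

```sql
-- IN-style subquery (not accepted by older Hive releases):
--   SELECT a.id, a.a1 FROM ta a WHERE a.id IN (SELECT b.id FROM tb b);

-- Equivalent LEFT SEMI JOIN: each row of ta is returned at most once,
-- and columns of tb may appear only in the ON clause.
SELECT a.id, a.a1
FROM ta a
LEFT SEMI JOIN tb b ON a.id = b.id;
```

Unlike a plain JOIN, this does not multiply rows of ta when tb contains several matching ids.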

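**************Hive indexes**********************
An index can be more efficient than partition pruning alone; in these notes the index is built on a partitioned table. (Index support was removed in Hive 3.0.)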
create table index_test(id int,name string) partitioned by (dt string)
row format delimited fields terminated by ',';

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true; // enable dynamic partitioning

// create the table (same definition as above)
create table index_test(id INT, name STRING) partitioned by (dt STRING) row format delimited fields terminated by ',';
// create the index
create index index1_index_test on table index_test(id) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD ;
show index on index_test; // list indexes
show partitions index_test;
drop index index1_index_test on index_test; // drop the index
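
Because the index above is created WITH DEFERRED REBUILD, it stays empty until it is rebuilt explicitly, so a rebuild step belongs between creating the index and using (or dropping) it. A minimal sketch, assuming a Hive release that still supports indexes; the dt value is illustrative.

```sql
-- Build (or refresh) the index for the whole table.
ALTER INDEX index1_index_test ON index_test REBUILD;

-- Or rebuild a single partition only.
ALTER INDEX index1_index_test ON index_test PARTITION (dt = '2014-05-01') REBUILD;
```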
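*****************Buckets************
Buckets are typically layered on top of partitions and split the data at a finer granularity than a partition. They are used for sampling, so results are approximate.
Useful when the data set is large and there is no need to read all of it.
create table tb_tmp(id int,name string,timeflag bigint) clustered by (id) sorted by (id) into 5 buckets;
// clustered by hashes rows into buckets on the given column
// here id is the bucketing key: bucket = hash(id) mod 5
select * from tb_tmp tablesample(bucket 1 out of 5 on id);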
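
To make the sampling concrete, here is a sketch that fills the bucketed table and then samples it. It assumes tb_tmp was created with CLUSTERED BY (id) INTO 5 BUCKETS and that a staging table tb_src with the same columns exists; tb_src is hypothetical. The hive.enforce.bucketing setting is only needed on Hive releases before 2.0.

```sql
-- Make inserts honour the bucket definition (pre-2.0 Hive).
SET hive.enforce.bucketing = true;

-- tb_src is a hypothetical staging table with columns id, name, timeflag.
INSERT OVERWRITE TABLE tb_tmp
SELECT id, name, timeflag FROM tb_src;

-- Read only the first of the 5 buckets, i.e. roughly one fifth of the rows.
SELECT * FROM tb_tmp TABLESAMPLE (BUCKET 1 OUT OF 5 ON id);
```

Loading through INSERT ... SELECT rather than LOAD DATA matters here: LOAD DATA only moves files and does not distribute rows into buckets.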

*****************UDF************

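*****************Tuning************ the focus is on reuse within MapReduce
explain (SQL databases have this too)
explain extended select * from tb;
Queues
set mapred.queue.name=queue3;
set mapred.job.queue.name=queue3;
Setting job priority:
set mapred.job.priority=HIGH;
Local mode and parallel mode
set hive.exec.mode.local.auto=true; // local mode, chosen automatically for small jobs
By default Hive executes only one stage at a time.
hive.exec.parallel defaults to false.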
<property>
<name>hive.exec.parallel</name>
<value>true</value>
</property>
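Setting the number of mappers and reducers
The number of mappers is determined by the number of input splits (InputSplits); the MapReduce default for reducers is 1.
The split size is configurable and defaults to 128 MB.
The mapper count can be changed by overriding the InputFormat/RecordReader, but normally it cannot be set directly.
Number of reducers:
set mapred.reduce.tasks=15;
JVM reuse (very important)
For jobs that read a large number of small files, enable JVM reuse:
set mapred.job.reuse.jvm.num.tasks=20;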
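Indexes
Dynamic partitions
Enabling dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true; // enable dynamic partitioning
Differences from static partitions
A dynamic partition is created only if the SELECT returns at least one row for it; a static partition is created regardless.
Dynamic partitioning allocates reducers for each partition.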
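
A sketch of what a dynamic-partition insert might look like with these settings, targeting the partition_table defined earlier. The source table staging_table and its columns are assumptions made for the example.

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- staging_table is hypothetical, with columns name, salary, dt, dep.
-- The partition columns come last in the SELECT, in PARTITION order.
INSERT OVERWRITE TABLE partition_table PARTITION (dt, dep)
SELECT name, salary, dt, dep
FROM staging_table;

-- Mixed form: dt is static, dep is resolved dynamically per row.
INSERT OVERWRITE TABLE partition_table PARTITION (dt = '2015-05-01', dep)
SELECT name, salary, dep
FROM staging_table
WHERE dt = '2015-05-01';
```

Only (dt, dep) combinations that actually occur in the SELECT output get a partition directory, which is the difference from static partitions noted above.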
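Speculative execution
MapReduce settings, true by default:
set mapred.map.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;
Whichever attempt finishes first has its result used (Spark uses the same mechanism).
If the task runs slowly on every node, the extra attempts can lead to OOM. The fix is to set an upper limit and stop speculating once it is exceeded.
Hive setting:
hive.mapred.reduce.tasks.speculative.execution
Join tuning
Put the small table first and the large table last.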
Data skew
select * from log a left outer join members b on a.memberid = b.memberid;
Fix:
First collect the distinct memberid values from log, map-join them with members to get only the members that have log entries today, then map-join that result with log.
select /*+ MAPJOIN(x) */ a.*, x.*
from log a
left outer join (select /*+ MAPJOIN(c) */ d.*
from (select distinct memberid from log) c
join members d
on c.memberid = d.memberid
) x
on a.memberid = x.memberid;

Debugging Hive

PostgreSQL has strong transaction support and is well suited to accounting, finance and similar workloads.