Hive Notes

和通数据库 (htsjk.Com), 2020-02-06 22:52

Notes only:

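Hive: a mechanism for reading, writing, and managing distributed datasets using SQL. It is data warehouse software focused on analysis rather than low latency; read-only data is loaded into Hive to be analyzed.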

OLAP //online analytical processing

OLTP //online transaction processing, the domain of an RDBMS

Hive supports three data structures: table + partition + bucket (hash)

A Hive table corresponds to a directory in HDFS
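
To make the "bucket (hash)" idea concrete, here is a minimal Python sketch of how a row could be assigned to a bucket. It is an illustration only, not Hive's code: the function name bucket_for is hypothetical, and it assumes the simple rule "non-negative hash modulo the number of buckets" with an integer key hashing to itself. Hive's actual hash function depends on the column type and the Hive version.

```python
def bucket_for(key: int, num_buckets: int) -> int:
    """Return the bucket index for an integer key (illustrative rule only)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    # Mask to a non-negative 31-bit value, then take the remainder.
    return (key & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    # Rows with the same key always land in the same bucket.
    for user_id in [1, 2, 3, 4, 10]:
        print(user_id, "-> bucket", bucket_for(user_id, 2))
```

Because the bucket depends only on the key, two tables bucketed on the same column with the same bucket count keep matching keys in buckets with the same index.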

Hive architecture:

client: web UI / CLI / ...

hive: metastore + Hive process engine

underlying layer: Hadoop MR

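Hive components:

1. metadata: the metadata, stored in an RDBMS

2. HQL: the Hive process engine; queries are written in HQL and run as MR jobs

3. Execute engine: the execution engine runs the query and produces the result

4. hdfs/hbase: the data storage layer that holds the data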

Initialize the metastore schema in MySQL: schematool -initSchema -dbType mysql

Create a table without entering the interactive shell: hive -e "create table myhive.t1(id int, name string);"

Show table details as formatted output: desc formatted users;

Load a local file into Hive with the load command: load data local inpath '/soft/source/user.txt' into table myhive.t1;

Load an HDFS file into Hive with the load command: load data inpath 'hdfs://master:8080/user/hadoop/data/user.txt' into table myhive.t1;

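Altering the table structure:

alter table t1 rename to t2; //rename table t1 to t2

alter table t2 add columns(col1 int); //add column col1

//Hive has no drop-column statement; to remove the married column, use replace columns (last statement in this list) and name only the columns to keep

alter table t2 change married m int; //rename column married to m

alter table t2 replace columns (m int); //keep only the listed columns; the underlying data files are untouched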

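Hive scripts:

1. Create a Hive script: 【h.sql】 drop table myhive.t1;

2. Run the script: $>hive -f h.sql

Without transaction support enabled, Hive does not support update and delete; to delete data, operate on HDFS directly

$hive>dfs -ls -R /; //hadoop commands can be run directly inside the hive shell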

Create a table partitioned first by country and then by state; partitioning changes how Hive organizes the stored data:

create table employee(name string,salary float,subordinates array<string>,deduction map<string,float>,address struct<street:string,city:string,state:string,zip:int>) partitioned by (country string,state string);
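
The change in storage layout can be sketched in Python: each partition becomes a nested column=value subdirectory under the table's directory. This is a small illustration, not Hive's code; the helper name partition_path is hypothetical, and the warehouse location /user/hive/warehouse is an assumption (it is configurable).

```python
from posixpath import join


def partition_path(table_dir: str, partition: dict) -> str:
    """Build the directory for one partition as nested column=value folders."""
    # Partition columns are nested in the order given in "partitioned by".
    parts = [f"{column}={value}" for column, value in partition.items()]
    return join(table_dir, *parts)


if __name__ == "__main__":
    table_dir = "/user/hive/warehouse/employee"  # assumed warehouse location
    print(partition_path(table_dir, {"country": "US", "state": "CA"}))
    print(partition_path(table_dir, {"country": "CN", "state": "hebei"}))
```

A query that filters on country and state then only has to read the files under the matching subdirectory instead of the whole table.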

Show partitions: $hive>show partitions employee; //list the table's partitions

Add a partition: $hive>alter table t4 add partition(province='hebei',city='baoding') location '...';

Drop a partition: $hive>alter table t4 drop partition(province='hebei',city='baoding');

Insert data into a partitioned table: $hive> insert into t4 partition(province='hebei',city='baoding') values(1,'zhang');

Create a table together with its data: $hive>create table user1 as select * from users;

Create a table with the same structure only, without data: $hive>create table user2 like users;

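Start hiveserver2 so that Hive can be used remotely over JDBC: hive --service hiveserver2 &

Check the listening ports: netstat -anop |more

beeline, a shell-style client: beeline -u jdbc:hive2://localhost:10000/myhive

(beeline -u jdbc:hive2:// also works; with the host and port omitted, beeline uses an embedded HiveServer2 on the local machine instead of a network connection)

Three kinds of Hive optimization: bucketed tables, partitioned tables, map-side joins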

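Sorting:

order by performs a total (global) sort

sort by sorts the data only within each reducer, that is, a local sort. Each reducer's output is guaranteed to be ordered, which can make a later global sort more efficient.

distribute by is equivalent to the partitioning step in MR: rows with the same value in the given column are guaranteed to go to the same partition, that is, to the same reducer.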

select * from orders o distribute by o.cid sort by o.id desc;
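
The combined effect of distribute by and sort by in the query above can be simulated in a few lines of Python. This is a sketch under stated assumptions, not how Hive is implemented: the function name distribute_and_sort is hypothetical, the distribution key is assumed to be an integer, and "key modulo reducer count" stands in for the real partitioner.

```python
def distribute_and_sort(rows, num_reducers, dist_key, sort_key, descending=False):
    """Route rows to reducers by dist_key, then sort inside each reducer."""
    outputs = [[] for _ in range(num_reducers)]
    for row in rows:
        # distribute by: equal key values always reach the same reducer.
        outputs[row[dist_key] % num_reducers].append(row)
    for out in outputs:
        # sort by: a local sort per reducer, with no ordering across reducers.
        out.sort(key=lambda r: r[sort_key], reverse=descending)
    return outputs


if __name__ == "__main__":
    orders = [
        {"id": 1, "cid": 1}, {"id": 2, "cid": 2}, {"id": 3, "cid": 1},
        {"id": 4, "cid": 3}, {"id": 5, "cid": 2},
    ]
    result = distribute_and_sort(orders, 2, "cid", "id", descending=True)
    for index, out in enumerate(result):
        print("reducer", index, [(r["cid"], r["id"]) for r in out])
```

Each reducer's list is ordered by id, but concatenating the lists does not give a globally sorted result; that is the difference from order by.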

Create a view: create view v1 as select a.id,a.name,b.id,b.orderno,b.price from customers a left outer join orders b on a.id=b.id where b.cid is not null;

A version with column aliases (needed because a.id and b.id would both be named id) that joins on the customer id: create view v1 as select a.id cid,a.name cname,b.orderno ono,b.price oprice from customers a left outer join orders b on a.id=b.cid;

Exporting data from Hive to the local file system:
1. Use the hadoop get command: hdfs dfs -get ...
2. Specify a local directory: insert overwrite local directory '...' select ...
insert overwrite local directory '/soft/source/orders.daa' select * from orders where cid is not null;

Indexes:

Create an index: create index idx_customers_id on table customers (id) as 'bitmap' with deferred rebuild;
Rebuild the index, which generates the records in the index table: alter index idx_customers_id on customers rebuild;

Drop the index: drop index idx_customers_id on customers;

Show indexes: show formatted index on customers;

Tuning:

Use explain to show a query's execution plan: explain [extended] select sum(id) from customers;

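JVM reuse

set mapred.job.reuse.jvm.num.tasks=5; [not recommended] does not apply on YARN

【yarn】

//mapred-site.xml

mapreduce.job.ubertask.enable=false //whether a small job runs its tasks sequentially in a single JVM; default false

mapreduce.job.ubertask.maxmaps=9 //maximum number of maps for such a job; default 9, can only be lowered

mapreduce.job.ubertask.maxreduces=1 //currently at most 1 reduce is supported

mapreduce.job.ubertask.maxbytes=128m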

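Parallel execution

explain shows the execution plan; tasks that have no dependency on each other can run in parallel.

hive.exec.parallel=true //enable parallel execution of MR jobs; default false

hive.exec.parallel.thread.number=8 //number of jobs that may run in parallel; default 8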
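
The idea behind these two settings can be sketched in Python: stages with no dependency on each other are submitted to a bounded thread pool. This is only an analogy, not Hive's scheduler; the function name run_independent_stages is hypothetical, and max_threads simply mirrors the role of hive.exec.parallel.thread.number.

```python
from concurrent.futures import ThreadPoolExecutor


def run_independent_stages(stages, max_threads=8):
    """Run callables that do not depend on each other; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(stage) for stage in stages]
        # result() blocks until that stage finishes and re-raises its errors.
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Two stand-in "stages", e.g. two subqueries that a later join would combine.
    stages = [lambda: sum(range(1000)), lambda: max(range(1000))]
    print(run_independent_stages(stages))
```

A stage that consumes both results (for example the final join) can only start after this call returns, which is why only dependency-free stages benefit.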

Map-side join:

set hive.auto.convert.join=true;

set hive.mapjoin.smalltable.filesize=600000000; //a map join can be used when the small table's files are <= this size in bytes

set hive.auto.convert.join.noconditionaltask=true; //no need to put a hint such as /*+ streamtable(customers) */ in the select.
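
Why the small table's size matters can be seen in a minimal Python sketch of a map-side (hash) join: the small table is loaded into an in-memory lookup, and the big table is streamed past it, so no shuffle to reducers is needed. This is an illustration under assumptions, not Hive's implementation; the function name map_side_join and the sample rows are hypothetical.

```python
def map_side_join(big_rows, small_rows, big_key, small_key):
    """Inner-join a streamed big table against a small table held in memory."""
    # Build phase: the whole small table must fit in memory.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)
    # Probe phase: each big-table row is matched locally, one at a time.
    for row in big_rows:
        for match in lookup.get(row[big_key], []):
            yield row, match


if __name__ == "__main__":
    customers = [{"id": 1, "name": "tom"}, {"id": 2, "name": "ann"}]
    orders = [{"id": 10, "cid": 1}, {"id": 11, "cid": 3}, {"id": 12, "cid": 1}]
    for order, customer in map_side_join(orders, customers, "cid", "id"):
        print(order["id"], customer["name"])
```

The lookup is the reason for a size threshold: if the small table does not fit in memory, this strategy cannot be used and a regular reduce-side join is needed.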

Bucket map join:

set hive.auto.convert.join=true; --default false in old releases, true since Hive 0.11

set hive.optimize.bucketmapjoin=true; --default false

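Skew join:

set hive.optimize.skewjoin=true; //enable the skew-join optimization

set hive.skewjoin.key=100000; //once a key has more rows than this value, further rows for that key are no longer sent to the same reducer but to a new one, which relieves data skew

set hive.groupby.skewindata=true; //whether group by uses the data-skew optimization; default false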

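analyze: collects statistics at the table, partition, and column level and stores them as metadata. They are passed as input to the CBO (cost-based optimizer), which picks the query plan with the lowest cost.

analyze table customers compute statistics;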

describe extended customers;

Transactions:

Transaction support:

stored as textfile;

orc: Optimized Row Columnar, an optimized columnar file format

create table tx(id int,name string) clustered by (id) into 2 buckets row format delimited fields terminated by '\t' lines terminated by '\n' stored as orc TBLPROPERTIES('transactional'='true');
Enable transaction support:

set hive.support.concurrency=true;

set hive.enforce.bucketing=true;

set hive.exec.dynamic.partition.mode=nonstrict;

set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

set hive.compactor.initiator.on=true;

set hive.compactor.worker.threads=1;