Hive

和通数据库 htsjk.Com, 2019-07-15 22:08

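What is Hive

Hive is data warehouse software that uses SQL to read and write data held in distributed storage. It maps structured data in that storage to Hive tables, and it can be operated from the command-line tool or from a JDBC program that connects to Hive.

In essence, Hive translates HiveQL (HQL) into MapReduce programs and runs them.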

Hive pros and cons

Advantages:

Characteristics:

Disadvantages:

Hive architecture

Installing Hive

1. Extract the uploaded Hive package into the installation path
tar -zxvf apache-hive-1.2.1-bin.tar.gz -C path

2. Change the owner of the Hive installation path to user X (adjust as needed)
sudo chown -R X:X path

3. Rename hive/conf/hive-env.sh.template to hive-env.sh
mv hive-env.sh.template hive-env.sh

4. Configure hive-env.sh
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/opt/app/hadoop-2.7.2

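5. HDFS and YARN must be running before Hive is started (run this from the Hadoop installation directory)
sbin/start-all.sh

6. Create the /tmp and /user/hive/warehouse directories on HDFS and adjust their permissions as needed
# /tmp on HDFS holds Hive's temporary (scratch) data. Hive's own log is written by default to /tmp/<current user>/hive.log on the local filesystem, and that location can be changed.
bin/hdfs dfs -mkdir -p /tmp

# Default location of the Hive warehouse, that is, of the default database. It is documented in hive-default.xml.template and can be changed in a custom hive-site.xml.
bin/hdfs dfs -mkdir -p /user/hive/warehouse

bin/hdfs dfs -chmod g+w /tmp
bin/hdfs dfs -chmod g+w /user/hive/warehouse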

7. Run Hive
bin/hive

Using Hive

Loading a local file into a Hive table:

-- The field delimiter must match the delimiter used in the data file
create table student(id int,name string) row format delimited fields terminated by '\t';
load data local inpath '/data/student.txt' into table student;

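-e: run a SQL statement from the command line without entering the Hive shell

[hadoop@hadoop01 apache-hive-1.2.1-bin]$ bin/hive -e "select * from student;"
Redirect with > to write the query result to a file (> overwrites the file; use >> to append to it)
bin/hive -e "select * from student;" > /path/xx.txt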
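
The difference between > and >> matters when the same query is run repeatedly, for example from a scheduled job. The following is a minimal sketch, not a fixed recipe: HIVE_CMD, run_query and append_query are hypothetical names introduced here, and HIVE_CMD is assumed to point at the hive launcher (bin/hive, as above).

```shell
# HIVE_CMD is a hypothetical override; it defaults to the launcher path used in this article.
HIVE_CMD=${HIVE_CMD:-bin/hive}

# run_query QUERY FILE
# Runs the query and overwrites FILE with its result.
run_query() {
    "$HIVE_CMD" -e "$1" > "$2"
}

# append_query QUERY FILE
# Runs the query and appends its result to FILE.
append_query() {
    "$HIVE_CMD" -e "$1" >> "$2"
}
```

With these helpers, run_query "select * from student;" /path/xx.txt replaces the file contents on every run, while append_query keeps the earlier results and adds the new ones at the end.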

-f: run the statements in a SQL script file

First create a SQL script file, student.txt, with the following content
select * from student;
Then run the script file with -f
bin/hive -f /path/student.txt
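
The two steps can also be combined in one shell snippet. This is only a sketch under assumptions: SQL_FILE and HIVE_CMD are hypothetical variables introduced here, /tmp/student.sql is an arbitrary example path, and the query is run only if the hive launcher can actually be found.

```shell
# Hypothetical locations; adjust both to the local setup.
SQL_FILE=${SQL_FILE:-/tmp/student.sql}
HIVE_CMD=${HIVE_CMD:-bin/hive}

# Write the script file. The quoted 'EOF' keeps the shell from expanding anything inside it.
cat > "$SQL_FILE" <<'EOF'
select * from student;
EOF

# Run the script with -f when the launcher exists; otherwise just report it.
if command -v "$HIVE_CMD" >/dev/null 2>&1; then
    "$HIVE_CMD" -f "$SQL_FILE"
else
    echo "hive launcher not found at $HIVE_CMD, skipping" >&2
fi
```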

Working with the local filesystem and HDFS from inside Hive

-- List the HDFS root directory
hive> dfs -ls /;
-- List the local root directory
hive> ! ls /;

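Hive DDL operations (create, drop, alter)

Managed tables and external tables

-- Drop an empty database
drop database db;
-- Drop a non-empty database
drop database db cascade;

Managed tables: a table created with the default syntax is a managed table, also called an internal table. Hive controls the life cycle of a managed table's data, which means Hive owns that data: dropping a managed table also deletes the data stored in it.
-- Create a managed table
create table t1(id int,name string) row format delimited fields terminated by '\t';
-- Load data into the table
load data local inpath '/path/xx.txt' into table t1;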

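External tables: an external table does not fully own its data. Dropping an external table does not delete the data itself; only the table's metadata, such as its schema, is removed.
Example:
-- 1. Create t2
create external table t2(id int,name string) row format delimited fields terminated by '\t';
-- 2. Load data into the external table
load data local inpath '/path/xx.txt' into table t2;
-- 3. Create t3
create external table t3(id int,name string) row format delimited fields terminated by '\t' location '/user/hive/warehouse/t2';
After these steps, t2 and t3 point to the same data (xx.txt; t3 reaches it through t2's table directory on HDFS). Because both are external tables, dropping one of them does not delete the data, so the other table is unaffected. Since t3 uses t2's directory as its location, the HDFS web UI shows only the t2 directory: tables that map to the same location appear there once, and the data itself does exist in that HDFS directory.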

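Summary: the difference between managed and external tables
Managed table: dropping the table also deletes the table's data
External table: dropping the table does not delete the table's data; only the table's metadata is removed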
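
Before dropping a table it helps to check which kind it is. The sketch below filters the output of desc formatted for the table type. It assumes that output contains a line starting with "Table Type:" followed by MANAGED_TABLE or EXTERNAL_TABLE; table_type is a hypothetical helper name introduced here.

```shell
# table_type
# Reads `desc formatted <table>` output on stdin and prints the table type.
# Assumes a line of the form "Table Type:   MANAGED_TABLE" (or EXTERNAL_TABLE).
table_type() {
    sed -n 's/.*Table Type:[[:space:]]*\([A-Z_]*\).*/\1/p'
}
```

A possible use is bin/hive -e "desc formatted t2;" | table_type, which should print the type of t2 if the output has the assumed layout.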

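Partitioned tables

A partitioned table corresponds to separate directories on HDFS, one per partition, and each directory holds that partition's data files. Partitioning splits a large data set into smaller ones that are stored in separate directories. A query whose where clause selects specific partitions reads only those directories, which improves query efficiency.
Basic operations on partitioned tables:
-- 1. Create a partitioned table
create table dept_partition(deptno int,loc string) partitioned by (day string) row format delimited fields terminated by '\t';
-- Create a table with two partition levels
create table dept_partition2(deptno int,loc string) partitioned by (month string,day string) row format delimited fields terminated by '\t';
-- 2. Load data into a partition
load data local inpath '/path/xx.txt' into table dept_partition partition(day='16');
-- 3. Query the data in a partition
select * from dept_partition where day='16' limit 10;
-- 4. Add a partition
alter table dept_partition add partition(day='17');
-- 5. Drop a partition
alter table dept_partition drop partition(day='17');
-- 6. List the partitions of the table
show partitions dept_partition;
-- 7. Show the detailed structure of the table
desc formatted dept_partition;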

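Example: upload data directly into a partition directory on HDFS, then associate the partitioned table with the data (two ways)

-- 1. Create a directory on HDFS with the following structure:
dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=2/day=16;
-- 2. Upload the data into that directory
dfs -put /data/dept.txt /user/hive/warehouse/dept_partition2/month=2/day=16;
-- 3. Querying the partitioned table at this point returns no data, because the table's metadata does not yet list this partition
select * from dept_partition2 where month='2' and day='16';
-- Note: the HDFS web UI already shows the table directory and the partition directory

Either of the following associates the partitioned table with the data
1. Run the repair command, which registers the partition directories found on HDFS in the table's metadata
msck repair table dept_partition2;
2. Add the partition explicitly
alter table dept_partition2 add partition(month='2',day='16');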
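
The directory in step 1 follows a fixed pattern: the warehouse directory, then the table name, then one key=value directory per partition column in declaration order. A small sketch of that pattern follows; partition_path and WAREHOUSE_DIR are hypothetical names introduced here, and the layout is assumed to be that of a table in the default database under the default warehouse location.

```shell
# WAREHOUSE_DIR defaults to the warehouse location used in this article.
WAREHOUSE_DIR=${WAREHOUSE_DIR:-/user/hive/warehouse}

# partition_path TABLE KEY=VALUE [KEY=VALUE ...]
# Prints the HDFS directory that holds the given partition of TABLE.
partition_path() {
    path="$WAREHOUSE_DIR/$1"
    shift
    for spec in "$@"; do
        path="$path/$spec"
    done
    printf '%s\n' "$path"
}

partition_path dept_partition2 month=2 day=16
```

The printed path can then be passed to hdfs dfs -mkdir -p and hdfs dfs -put before running msck repair table or alter table ... add partition.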

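Adding, changing, and replacing columns

1. Add a column (the new column must not already exist in the table)
alter table t1 add columns(age int);
2. Change a column (rename age to sex and change its type to string)
alter table t1 change column age sex string;
3. Replace all columns
alter table t1 replace columns(sex string,age string);
4. Remove the data in a table while keeping the table structure
truncate table t1;
5. Drop the table
drop table t1;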

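Loading data into a partitioned table automatically with a shell script

Log data on the log server is pulled onto HDFS by Flume. Because logs are produced every day and have to be analyzed in Hive, a partitioned table is the best way to store them. The script below loads each file of the previous day from a local directory, /access_logs/<yyyymmdd>, into the matching day and hour partition.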

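#!/bin/bash

# Yesterday's date as yyyymmdd (GNU date syntax)
YESTERDAY=$(date -d "-1 days" +%Y%m%d)

# Local directory that holds yesterday's log files
ACCESS_LOG_DIR=/access_logs/$YESTERDAY

# Hive installation directory
HIVE_HOME=/opt/app/apache-hive-1.2.1-bin

# Load every file into its partition; file names are expected to start with yyyymmddHH
for LOG_PATH in "$ACCESS_LOG_DIR"/*
do
	# Skip the unexpanded pattern when the directory is empty
	[ -e "$LOG_PATH" ] || continue

	FILE=$(basename "$LOG_PATH")

	# Characters 1-8 of the file name are the day, characters 9-10 the hour
	DAY=${FILE:0:8}
	HOUR=${FILE:8:2}

	# The table web_logs_part must already exist and be partitioned by (day, hour)
	$HIVE_HOME/bin/hive -e "load data local inpath '$LOG_PATH' into table web_logs_part partition(day='$DAY',hour='$HOUR');"
done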
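
The substring expansions ${FILE:0:8} and ${FILE:8:2} used in the loop are a bash feature and are not available in every sh. As a hedged alternative, the same split can be sketched with cut, which also works in a plain POSIX shell. split_log_name is a hypothetical helper, and 2019071423.log is a made-up file name that follows the yyyymmddHH prefix the script relies on.

```shell
# split_log_name FILE
# Sets DAY (characters 1-8) and HOUR (characters 9-10) from a file name
# that starts with yyyymmddHH.
split_log_name() {
    DAY=$(printf '%s\n' "$1" | cut -c1-8)
    HOUR=$(printf '%s\n' "$1" | cut -c9-10)
}

split_log_name "2019071423.log"   # hypothetical file name
echo "day=$DAY hour=$HOUR"        # prints: day=20190714 hour=23
```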

Testing the script (run it with bash, because it uses bash substring expansion such as ${FILE:0:8})

bash web_logs.sh

On dropping managed and external tables and what happens to their data: