A First Look at Hive

htsjk.Com, 2019-11-21 22:08

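1. Getting into Hive

Start ZooKeeper first, then HDFS and YARN. On one node, start the Hive server with the following command:

/export/servers/hive-1.2.2/bin/hiveserver2

On the client machine, edit the configuration file /export/servers/hive-1.2.2/conf/hive-site.xml (create it if it does not exist) and change the value of javax.jdo.option.ConnectionURL so that it uses the host name or IP address of the server node, as shown below. Nothing else needs to be changed.

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://Node2:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>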

Then start the client on another node:

/export/servers/hive-1.2.2/bin/beeline

2. DDL Operations on Hive Tables

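2.1 Format

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]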

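2.2 Notes

The CREATE TABLE statement looks much like ordinary SQL. A few clauses deserve an explanation:

1. PARTITIONED BY defines partitions. Each partition is stored as its own directory, and a query that names the partition reads only that directory, which can speed it up considerably.

2. CLUSTERED BY distributes rows into buckets according to the hash of a column. For example, if the rows "Zhang San, math" and "Zhang San, English" are inserted into a table declared with CLUSTERED BY (name), both rows are kept and both land in the same bucket, because they share the same name. This mechanism also exists to speed up queries.

3. SORTED BY specifies how rows are ordered within a bucket, ascending or descending.

4. BUCKETS specifies the number of buckets. Bucketing is covered separately later and is not used here.

5. ROW FORMAT specifies how a row is parsed, including the field and line delimiters, for example ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'.

6. STORED AS specifies the file storage format. The two basic formats in Hive are TEXTFILE (the default) and SEQUENCEFILE; RCFILE is a columnar format that compresses well and usually gives better performance.

7. LOCATION specifies where the table data is stored on HDFS.

8. With EXTERNAL the table is an external table; without it, the table is an internal (managed) table.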
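To make the idea of bucketing concrete, here is a minimal sketch of how a row could be assigned to a bucket from the hash of its CLUSTERED BY column. It is only an illustration: BucketSketch and bucketFor are hypothetical names, and Java's String.hashCode stands in for whatever hash Hive applies internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: mimics the idea behind CLUSTERED BY ... INTO n BUCKETS.
// BucketSketch and bucketFor are hypothetical names, not Hive classes.
class BucketSketch {

    // Map a column value to a bucket index in [0, numBuckets).
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int bucketFor(String value, int numBuckets) {
        return (value.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"zhangsan", "math"}, {"zhangsan", "english"}, {"lisi", "math"}
        };
        int numBuckets = 4;
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String[] row : rows) {
            int b = bucketFor(row[0], numBuckets);
            buckets.computeIfAbsent(b, k -> new ArrayList<>()).add(row[0] + "," + row[1]);
        }
        // Both zhangsan rows are kept, and they end up in the same bucket.
        buckets.forEach((b, rs) -> System.out.println("bucket " + b + ": " + rs));
    }
}
```

Rows with the same name always get the same bucket index, which is why a lookup or join on that column only needs to touch one bucket.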

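2.3 Internal and external tables: differences and purpose

Internal table:

create table t_2(id int, name string, salary bigint, add string)
row format delimited
fields terminated by ',';

External table:

create external table t_3(id int, name string, salary bigint, add string)
row format delimited
fields terminated by ','
location '/aa/bb';

Differences between internal and external tables:

1. The directory of an internal table is created by Hive under the default directory: /user/hive/warehouse/……

The directory of an external table is chosen by the user with the keyword location '/path'.

2. Dropping an internal table deletes both the table metadata and the table's data directory.

Dropping an external table deletes only the table metadata; the table data is not deleted.

Purpose: in a data warehouse the data always comes from somewhere, usually from other application systems, and the directory those systems write to cannot be dictated by Hive. An external table lets Hive map onto such a directory wherever it is. Even if the table is dropped in Hive, the data is not deleted and the other application systems are not affected.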

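2.4 Importing data

1. Load a file from the local file system. The source file must be on the machine where the Hive server runs. For example, my server is on node2 and my client is on node4, so I create the data file on node2 and run the load from node4:

load data local inpath '/root/user.data' into table t_1;

2. Load a file that is already on HDFS into a table:

load data inpath '/user.data.2' into table t_1;

Without the keyword local, the file is moved from its HDFS path into the table directory.

3. Store the result of a query in a new table:

create table t_1_wz
as
select id,name from t_1;

This puts the rows returned by the query on t_1 into the new table t_1_wz.

4. Insert data into an existing table

First create a new table with the same columns and types as an existing one:

create table t_1_nn like t_1;

Then insert the result of a query into the new table (this is just the SQL insert operation):

insert into table t_1_nn
select * from t_1;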

2.5 Exporting data

1. Export data from a Hive table to an HDFS directory:

insert overwrite directory '/aa/bb'
select *
from t_1
where add='qinzhou';

2. Export data from a Hive table to a local directory:

insert overwrite local directory '/aa/bb'
select *
from t_1
where add='qinzhou';

2.6 Table partitions

Define a partitioned table:

create table t_or (id int, name string, add string)
partitioned by (day string)
row format delimited
fields terminated by ',';

Load data into a partition of the table:

load data local inpath '/root/data.1' into table t_or partition (day= '20171108');

Query by the partition column:

select count(*) from t_or where day = '20171108' ;

The partition column can be used in the where clause like any other column.
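Since each partition is stored as its own directory, it may help to see how a partition value turns into a directory name. The sketch below is only an illustration: PartitionPathSketch is a hypothetical helper, and the warehouse root is the default mentioned in section 2.3, which a real cluster may override.

```java
// Illustrative only: shows how a partition value maps to a directory name.
// PartitionPathSketch is a hypothetical helper, not part of Hive.
class PartitionPathSketch {

    // Build <warehouse>/<table>/<column>=<value>, the layout used for partition directories.
    static String partitionDir(String warehouse, String table, String column, String value) {
        return warehouse + "/" + table + "/" + column + "=" + value;
    }

    public static void main(String[] args) {
        // Assumes the default warehouse root; a real cluster may be configured differently.
        System.out.println(partitionDir("/user/hive/warehouse", "t_or", "day", "20171108"));
        // prints /user/hive/warehouse/t_or/day=20171108
    }
}
```

A query with where day = '20171108' only has to read files under that one directory.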

2.7 Modifying tables

2.7.1 Adding and dropping partitions

Add partitions to table t_5:

alter table t_5 add partition (day='20171110') partition (day='20171111');

List the partitions of t_5:

show partitions t_5;

Drop a partition of t_5:

alter table t_5 drop partition (day='20171111');

2.7.2 Renaming a table

The syntax is: alter table table_name rename to new_table_name. For example:

alter table t_6 rename to t_5;

2.7.3 Replacing and adding columns

First look at the columns defined for the table:

desc t_1;

The output is:

+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| id | int | |
| name | string | |
| add | string | |
+-----------+------------+----------+--+

Now replace the table's columns:

alter table t_1 replace columns (number int, name string, adress string);

Running desc t_1 again shows:

+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| number | int | |
| name | string | |
| adress | string | |
+-----------+------------+----------+--+

The columns of the table have been changed.

Next, add a column to the table:

alter table t_1 add columns (age int);

The table now has the new column:

+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| number | int | |
| name | string | |
| adress | string | |
| age | int | |
+-----------+------------+----------+--+

Now query the contents of the table:

+-------------+--------------+-------------+----------+--+
+-------------+--------------+-------------+----------+--+
+-------------+--------------+-------------+----------+--+

The newly added column is empty.

Change the name and type of an existing column:

alter table t_1 change number num int;

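2.8 Table commands

show tables;

show databases;

Show the partitions of a table: show partitions t_name;

Example: show partitions t_1;

Show Hive's built-in functions: show functions;

Examples of using a built-in function: select num, substr(name,6) from t_1;

select num, substr(name,1,5) from t_1;

Show the definition of a table: desc t_name;

Example: desc t_1;

First way to show detailed table information: desc extended t_name;

Second way to show detailed table information: desc formatted t_name;

The second way is recommended, because its output is laid out more clearly.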
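The two substr calls in the examples above use 1-based positions, which is easy to get wrong when coming from Java. The following sketch reproduces their behavior in plain Java for positive start positions only. HiveSubstrSketch is a hypothetical name, and the input string is made up for illustration.

```java
// Illustrative only: plain-Java equivalents of substr(name,6) and substr(name,1,5),
// valid for start positions >= 1. HiveSubstrSketch is a hypothetical name.
class HiveSubstrSketch {

    // substr(s, pos): from the 1-based position pos to the end of the string.
    static String substr(String s, int pos) {
        int start = Math.min(pos - 1, s.length());
        return s.substring(start);
    }

    // substr(s, pos, len): at most len characters starting at the 1-based position pos.
    static String substr(String s, int pos, int len) {
        int start = Math.min(pos - 1, s.length());
        int end = Math.min(start + len, s.length());
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(substr("zhangsan", 6));    // prints san
        System.out.println(substr("zhangsan", 1, 5)); // prints zhang
    }
}
```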

3. DML Operations on Hive Tables

3.1 Inserting a single row

insert into table t_1 values(15,'xiaohong','beijing');

3.2 Multi-insert

First, look at single inserts.

Create two tables with the same columns and types as t_5:

create table t_lt_5 like t_5;

create table t_gt_5 like t_5;

Insert the rows with id < 5:

insert overwrite table t_lt_5 partition(day='1')
select id,name,add from t_5 where id < 5;

Insert the rows with id > 5:

insert overwrite table t_gt_5 partition(day='1')
select id,name,add from t_5 where id > 5;

Multi-insert, which does both in one statement and scans t_5 only once:

from t_5
insert overwrite table t_lt_5 partition(day='1')
select id,name,add where id < 5
insert overwrite table t_gt_5 partition(day='1')
select id,name,add where id > 5;

3.2 The various joins

(1) Inner join

select t_a.*,t_b.* from t_a join t_b on t_a.id = t_b.id;

The same query can be written with table aliases:

select a.*,b.* from t_a a join t_b b on a.id = b.id;

(2) Left join

select a.*,b.* from t_a a left join t_b b on a.id = b.id;

(3) Right join

select a.*,b.* from t_a a right join t_b b on a.id = b.id;

(4) Full join

select a.*,b.* from t_a a full join t_b b on a.id = b.id;

(5) Left semi join

select a.* from t_a a left semi join t_b b on a.id = b.id;

This is a more efficient way to write the kind of query usually expressed with exists or in.

Aside: set hive.exec.mode.local.auto=true;

This lets Hive run small MapReduce jobs locally.
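To show what a left semi join returns, here is a small in-memory sketch of the same semantics on lists of ids. It only illustrates the expected result, not how Hive executes the join. SemiJoinSketch and the sample ids are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: the result a LEFT SEMI JOIN on id is expected to give,
// computed in memory. SemiJoinSketch is a hypothetical name.
class SemiJoinSketch {

    // Each left row is emitted at most once, however many right rows match it,
    // and only left-side values appear in the output.
    static List<Integer> leftSemiJoin(List<Integer> leftIds, List<Integer> rightIds) {
        Set<Integer> right = new HashSet<>(rightIds);
        return leftIds.stream().filter(right::contains).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 2, 3, 4);
        List<Integer> b = Arrays.asList(2, 2, 4, 5);
        System.out.println(leftSemiJoin(a, b)); // prints [2, 4]
    }
}
```

An inner join on the same data would return id 2 twice, because it matches two rows on the right; the semi join returns it once.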

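3.3 Non-equi joins

Older versions of Hive do not support non-equi joins.

Newer versions, from 1.2.0 on, do support them, but they have to be written like this:

select a.*,b.* from t_a a, t_b b where a.id>b.id;

and not like this:

select a.*,b.* from t_a a join t_b b on a.id>b.id;

Problem: given the IP addresses

220.30.10.50
220.30.10.60
250.30.14.90

and the table of address ranges

220.30.10.1 220.30.10.255 Beijing Telecom
250.30.14.1 250.30.14.255 Shanghai Telecom

find the range that each address falls in.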
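One way to approach the IP problem is to turn each dotted address into a number first, so that "falls in the range" becomes a pair of ordinary comparisons suitable for a non-equi join condition. Comparing the strings directly would not work, since "220.30.10.50" sorts after "220.30.10.255" as text. The sketch below is one possible helper under that assumption; IpRangeSketch, toLong and inRange are hypothetical names.

```java
// Illustrative only: converts a dotted IPv4 address to a number so that
// range checks become plain comparisons. IpRangeSketch is a hypothetical name.
class IpRangeSketch {

    // Treat the four parts as digits of a base-256 number.
    static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        long value = 0;
        for (String part : parts) {
            value = value * 256 + Integer.parseInt(part);
        }
        return value;
    }

    // True if ip lies between start and end, both inclusive.
    static boolean inRange(String ip, String start, String end) {
        long v = toLong(ip);
        return v >= toLong(start) && v <= toLong(end);
    }

    public static void main(String[] args) {
        System.out.println(inRange("220.30.10.50", "220.30.10.1", "220.30.10.255")); // prints true
        System.out.println(inRange("250.30.14.90", "220.30.10.1", "220.30.10.255")); // prints false
    }
}
```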

3.4 User-defined Hive functions

Suppose we have the following raw data:

1,zhangsan|18|male|it,2000
2,lisi|28|fmale|it,4000
3,wangwu|48|male|it,20000

The raw data is produced by some application server in the directory /web/data/20171111.

We can first create an external table over the data:

create external table t_user_info(id int, user_info string, salary int)
row format delimited
fields terminated by ','
location '/web/data/20171111';

This gives the following table:

+-----------------+------------------------+---------------------+--+
| t_user_info.id | t_user_info.user_info | t_user_info.salary |
+-----------------+------------------------+---------------------+--+
| 1 | zhangsan|18|male|it | 2000 |
| 2 | lisi|28|fmale|it | 4000 |
| 3 | wangwu|48|male|it | 20000 |
+-----------------+------------------------+---------------------+--+

This table is awkward for fine-grained analysis: user_info needs to be split into several columns. Hive's built-in functions are not convenient for this, so we write a custom function that does the split.

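Steps for a custom function:

A. Write the function in Java:

package org.hadoop.hive.myfunct;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UserInfoParser extends UDF {
    // Returns the index-th (1-based) field of a '|'-separated string.
    public String evaluate(String field, int index) {
        String replaceAll = field.replaceAll("\\|", ":");
        String[] split = replaceAll.split(":");
        return split[index - 1];
    }
}

B. Package the Java program as a jar and put it on the Hive machine.

C. In the Hive command line, run the following command to add the jar to Hive's classpath:

add jar /root/user.data.jar;

D. Create a function name in Hive that maps to the Java class:

create temporary function udf_parser as 'org.hadoop.hive.myfunct.UserInfoParser';

E. The custom function udf_parser is now ready to use.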

Use the function to split the original column and save the result in a detail table:

create table t_u_info
as
select
id,
udf_parser(user_info,1) as uname,
udf_parser(user_info,2) as age,
udf_parser(user_info,3) as sexual,
udf_parser(user_info,4) as hangye,
salary
from
t_user_info;
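The splitting logic from step A can be tried out without a Hive installation. The sketch below repeats it as a plain static method with no Hive dependency, which is handy for a quick check before packaging the jar. UserInfoFields is a hypothetical name, and the method is not the UDF itself.

```java
// Illustrative only: the same splitting idea as the UDF, without the Hive
// dependency, so it can be run on its own. UserInfoFields is a hypothetical name.
class UserInfoFields {

    // Return the 1-based field of a '|'-separated string.
    static String field(String userInfo, int index) {
        String[] parts = userInfo.split("\\|");
        return parts[index - 1];
    }

    public static void main(String[] args) {
        String userInfo = "zhangsan|18|male|it";
        System.out.println(field(userInfo, 1)); // prints zhangsan
        System.out.println(field(userInfo, 2)); // prints 18
    }
}
```

An index outside 1 to 4 throws ArrayIndexOutOfBoundsException here; a production UDF would probably return null for bad input instead.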