Hive学习笔记（15），

和通数据库htsjk.Com2019-08-24 23:24 来源:未知阅读:2986 评论 309 热度2

标签：

Hive学习笔记（15），

1 Hive 优化

1.1 `hive.fetch.task.conversion`

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have
      any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more    : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
  </property>

1.2 大表拆分（子表）

create [temporary] [external] table [if not EXISTS] [db.name] table_name
[as SELECT_statement];

1.3 外部表，分区表

结合使用
多级分区

1.4 数据

存储格式（textfile, orcfile,parquet）
数据压缩(snappy)

set parquet.compression=SNAPPY;
create table page_views_par_snappy
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as parquet
as SELECT * FROM page_views;

1.5 SQL 优化

优化 SQL 语句

1.5.1 `join`

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization

Common / Shuffle / Reduce join

* 发生在 reduce task
* 大表对大表
* 每个表数据从文件中读取

map join

* 发生在 Map Task
* 小表对大表
* 大表的数据从文件中读取
* 小表的数据在内存
* DistributedCache

smb join(sort-merge-bucket)

1.6 HQL 执行计划 explain

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

1.7 并行执行

hive.exec.parallel.thread.number
hive.exec.parallel

1.8 JVM 重用

mapreduce.job.jvm.numtasks

1.9 Reduce 数目

mapreduce.job.reduce

1.10 推测执行

mapreduce.map.speculative
hive.mapred.reduce.tasks.speculative.execution

1.11 Map 数目

hive.merge.size.per.task

1.12 动态分区

1.13 Strict Mode

对分区表进行查询，在 where 子句中没有加分区过滤，将禁止提交任务（默认：nonstrict）,hive.mapred.mode=strict
使用严格模式可以禁止 3 中类型的查询

Hive学习笔记（15），

1 Hive 优化

1.1 hive.fetch.task.conversion

1.2 大表拆分（子表）

1.3 外部表，分区表

1.4 数据

1.5 SQL 优化

1.5.1 join

1.6 HQL 执行计划 explain

1.7 并行执行

1.8 JVM 重用

1.9 Reduce 数目

1.10 推测执行

1.11 Map 数目

1.12 动态分区

1.13 Strict Mode

1.1 `hive.fetch.task.conversion`

1.5.1 `join`