Hive学习笔记(15),
1 Hive 优化
1.1 hive.fetch.task.conversion
<property>
<name>hive.fetch.task.conversion</name>
<value>more</value>
<description>
Expects one of [none, minimal, more].
Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have
any aggregations or distincts (which incurs RS), lateral views and joins.
0. none : disable hive.fetch.task.conversion
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
</description>
</property>
1.2 大表拆分(子表)
create [temporary] [external] table [if not EXISTS] [db.name] table_name
[as SELECT_statement];
1.3 外部表,分区表
- 结合使用
- 多级分区
1.4 数据
- 存储格式(textfile, orcfile,parquet)
- 数据压缩(snappy)
set parquet.compression=SNAPPY;
create table page_views_par_snappy
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as parquet
as SELECT * FROM page_views;
1.5 SQL 优化
- 优化 SQL 语句
1.5.1 join
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
Common / Shuffle / Reduce join
* 发生在 reduce task
* 大表对大表
* 每个表数据从文件中读取
map join
* 发生在 Map Task
* 小表对大表
* 大表的数据从文件中读取
* 小表的数据在内存
* DistributedCache
smb join(sort-merge-bucket)
1.6 HQL 执行计划 explain
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
1.7 并行执行
hive.exec.parallel.thread.numberhive.exec.parallel
1.8 JVM 重用
mapreduce.job.jvm.numtasks
1.9 Reduce 数目
mapreduce.job.reduce
1.10 推测执行
mapreduce.map.speculativehive.mapred.reduce.tasks.speculative.execution
1.11 Map 数目
hive.merge.size.per.task
1.12 动态分区
1.13 Strict Mode
- 对分区表进行查询,在 where 子句中没有加分区过滤,将禁止提交任务(默认:nonstrict),
hive.mapred.mode=strict - 使用严格模式可以禁止 3 中类型的查询
*
本站文章为和通数据库网友分享或者投稿,欢迎任何形式的转载,但请务必注明出处.
同时文章内容如有侵犯了您的权益,请联系QQ:970679559,我们会在尽快处理。