Hive on Tez pitfalls, part 2: Hive 0.14 on Tez

Testing Hive 0.14.0 on Tez turned up quite a few problems.

1. With CDH 5.2.0 + Hive 0.14.0 + Tez 0.5.0, the first error was:

```
java.lang.NoSuchMethodError: org.apache.tez.dag.api.client.Progress.getFailedTaskAttemptCount()I
        at org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor.printStatusInPlace(TezJobMonitor.java:613)
        at org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor.monitorExecution(TezJobMonitor.java:311)
        at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:167)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1604)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1364)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1177)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1004)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:994)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:247)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:199)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:783)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
```

The stack trace shows the error occurs after the Tez job has been submitted. In org.apache.hadoop.hive.ql.exec.tez.TezTask, once the job is submitted via the submit method, a TezJobMonitor instance is created to track the Tez job's progress:

```java
// submit will send the job to the cluster and start executing
client = submit(jobConf, dag, scratchDir, appJarLr, session,
    additionalLr, inputOutputJars, inputOutputLocalResources);

// finally monitor will print progress until the job is done
TezJobMonitor monitor = new TezJobMonitor();
rc = monitor.monitorExecution(client, ctx.getHiveTxnManager(), conf, dag);
```

In TezJobMonitor.monitorExecution:

```java
boolean isProfileEnabled = conf.getBoolVar(conf, HiveConf.ConfVars.TEZ_EXEC_SUMMARY);
    // hive.tez.exec.print.summary, default false
boolean inPlaceUpdates = conf.getBoolVar(conf, HiveConf.ConfVars.TEZ_EXEC_INPLACE_PROGRESS);
    // hive.tez.exec.inplace.progress, default true
boolean wideTerminal = false;
boolean isTerminal = inPlaceUpdates == true ? isUnixTerminal() : false;

// we need at least 80 chars wide terminal to display in-place updates properly
if (isTerminal) {
  if (getTerminalWidth() >= MIN_TERMINAL_WIDTH) {
    wideTerminal = true;
  }
}

boolean inPlaceEligible = false;
if (inPlaceUpdates && isTerminal && wideTerminal && !console.getIsSilent()) {
  inPlaceEligible = true;
}

// then a while loop checks the job status and calls either printStatusInPlace
// or printStatus (printStatus ultimately calls getReport)
......
case RUNNING:
  if (!running) {
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.TEZ_SUBMIT_TO_RUNNING);
    console.printInfo("Status: Running (" + dagClient.getExecutionContext() + ")\n");
    startTime = System.currentTimeMillis();
    running = true;
  }

  if (inPlaceEligible) {
    printStatusInPlace(progressMap, startTime, false, dagClient);
    // log the progress report to log file as well
    lastReport = logStatus(progressMap, lastReport, console);
  } else {
    lastReport = printStatus(progressMap, lastReport, console);
  }
  break;
```
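The eligibility check above collapses into a single predicate. A minimal restatement in Python (a sketch for illustration; the function and argument names are mine, not Hive's):

```python
MIN_TERMINAL_WIDTH = 80  # Hive needs at least an 80-column terminal for in-place updates


def in_place_eligible(in_place_updates, is_unix_terminal, terminal_width, silent):
    """Mirror TezJobMonitor's decision: use the in-place progress display
    only when updates are enabled, we're on a Unix terminal wide enough
    to render the table, and the console is not in silent mode."""
    is_terminal = in_place_updates and is_unix_terminal
    wide_terminal = is_terminal and terminal_width >= MIN_TERMINAL_WIDTH
    return in_place_updates and is_terminal and wide_terminal and not silent
```

Any condition failing silently drops Hive back to the plain line-by-line printStatus output.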

For example, in printStatusInPlace:

```java
SortedSet<String> keys = new TreeSet<String>(progressMap.keySet());
int idx = 0;
int maxKeys = keys.size();
for (String s : keys) {
  idx++;
  Progress progress = progressMap.get(s);
  final int complete = progress.getSucceededTaskCount();
  final int total = progress.getTotalTaskCount();
  final int running = progress.getRunningTaskCount();
  final int failed = progress.getFailedTaskAttemptCount(); // calls Progress.getFailedTaskAttemptCount to get the failed-attempt count
  final int pending = progress.getTotalTaskCount() - progress.getSucceededTaskCount() -
      progress.getRunningTaskCount();
  final int killed = progress.getKilledTaskCount();
```

In Tez 0.5.0 the org.apache.tez.dag.api.client.Progress class has no getFailedTaskAttemptCount method; it was only added in Tez 0.5.2. So to run Hive 0.14.0, you need Tez 0.5.2 or later.

2. After upgrading to Hive 0.14.0 + Tez 0.5.2, the following error appeared:

```
15/01/13 14:09:21 INFO client.TezClient: The url to track the Tez Session: http://xxxx:8042/proxy/application_1416818587155_0049/
Exception in thread "main" java.lang.RuntimeException: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:457)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:672)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown
        at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:599)
        at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:212)
        at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:122)
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:454)
        ... 7 more
```

The failure happens during session initialization; the exception is thrown from TezSessionState.open:

```java
....
try {
  session.waitTillReady();
} catch (InterruptedException ie) {
  // ignore
}
```

Here session is a TezClient instance. In TezClient.waitTillReady:

```java
public synchronized void waitTillReady() throws IOException, TezException, InterruptedException {
  if (!isSession) {
    // nothing to wait for in non-session mode
    return;
  }
  verifySessionStateForSubmission();
  while (true) {
    // here getAppMasterStatus returned TezAppMasterStatus.SHUTDOWN
    TezAppMasterStatus status = getAppMasterStatus();
    if (status.equals(TezAppMasterStatus.SHUTDOWN)) {
      throw new SessionNotRunning("TezSession has already shutdown");
    }
    if (status.equals(TezAppMasterStatus.READY)) {
      return;
    }
    Thread.sleep(SLEEP_FOR_READY);
  }
}
```
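waitTillReady is a plain poll loop: sleep until the AppMaster reports READY, and fail fast on SHUTDOWN. A minimal Python sketch of the same pattern (status strings and names are mine, not Tez's):

```python
import time


class SessionNotRunning(Exception):
    """Raised when the AppMaster has already shut down."""


def wait_till_ready(get_status, sleep_s=0.5):
    """Poll an AppMaster status supplier until it reports READY.

    Mirrors the loop in TezClient.waitTillReady: a SHUTDOWN status means
    the AM never came up (or died), so waiting any longer is pointless."""
    while True:
        status = get_status()
        if status == "SHUTDOWN":
            raise SessionNotRunning("TezSession has already shutdown")
        if status == "READY":
            return status
        time.sleep(sleep_s)  # any other status: keep waiting
```

This is why the client-side exception says nothing about the root cause: the real error lives wherever the AM failed, i.e. in the YARN container logs examined next.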

The TezClient here was created in session mode, and getAppMasterStatus returned TezAppMasterStatus.SHUTDOWN, so waitTillReady threw the exception: the Tez ApplicationMaster had failed to start properly. The NodeManager log showed the following error:

```
2015-01-13 16:27:58,162 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1416818587155_0060_01_000001 and exit code: 1
ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:196)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
```

So the container launching the AM exited abnormally. The corresponding container log showed:

```
2015-01-13 17:34:59,731 FATAL [main] app.DAGAppMaster: Error starting DAGAppMaster
java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389)
        at java.lang.Class.getConstructor0(Class.java:2699)
        at java.lang.Class.getConstructor(Class.java:1657)
        at org.apache.hadoop.yarn.factories.impl.pb.RecordFactoryPBImpl.newRecordInstance(RecordFactoryPBImpl.java:62)
        at org.apache.hadoop.yarn.util.Records.newRecord(Records.java:36)
        at org.apache.hadoop.yarn.api.records.ApplicationId.newInstance(ApplicationId.java:49)
        at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
        at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
        at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1794)
```

This looks like a protobuf compatibility problem. CDH 5.2.0 ships protobuf-java-2.5.0.jar, Hive 0.14.0 uses protobuf-java-2.5.0.jar, and Tez 0.5.2 is also compiled against protobuf 2.5.0, so in theory there should be no incompatibility. The suspicion was that the Tez AM loaded a 2.4.0a protobuf at startup, so we needed to inspect the launch command and its classpath.

To capture the shell that launches the AM, we patched org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor to add a Thread.sleep, recompiled the CDH 5.2.0 package (this requires Java 7 support, range [1.7.0,1.7.1000}]; skip the native build: mvn package -DskipTests -Pdist -Dtar -e -X), replaced ./share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.5.0-cdh5.2.0.jar, and re-ran the test.
The shell call chain is:

```
default_container_executor.sh --> default_container_executor_session.sh --> launch_container.sh
```

And in the launch_container.sh script:

```shell
export HADOOP_COMMON_HOME="/home/vipshop/platform/hadoop-2.5.0-cdh5.2.0"  # set the relevant variables first
export CLASSPATH="$PWD:$PWD/*:$HADOOP_CONF_DIR:"  # CLASSPATH is reset here
export HADOOP_TOKEN_FILE_LOCATION="/home/vipshop/hard_disk/7/yarn/local/usercache/hdfs/appcache/application_1416818587155_0075/container_1416818587155_0075_01_000001/container_tokens"
....
# symlink the required jars into the container's working directory
ln -sf "/home/vipshop/hard_disk/10/yarn/local/filecache/42/hadoop-yarn-api-2.5.0.jar" "hadoop-yarn-api-2.5.0.jar"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi
.....
exec /bin/bash -c "$JAVA_HOME/bin/java -Xmx819m -server -Djava.net.preferIPv4Stack=true \
  -Dhadoop.metrics.log.level=WARN -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps \
  -XX:+UseNUMA -XX:+UseParallelGC -Dlog4j.configuration=tez-container-log4j.properties \
  -Dyarn.app.container.log.dir=/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001 \
  -Dtez.root.logger=INFO,CLA -Dsun.nio.ch.bugLevel='' \
  org.apache.tez.dag.app.DAGAppMaster --session \
  1>/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001/stdout \
  2>/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001/stderr"
```

The script finally runs java org.apache.tez.dag.app.DAGAppMaster, i.e. the main method of org.apache.tez.dag.app.DAGAppMaster, which starts the DAGAppMaster.

The CLASSPATH points at the directory the launch script runs in, e.g. here:

```
CLASSPATH='/home/vipshop/hard_disk/11/yarn/local/usercache/hdfs/appcache/application_1416818587155_0079/container_1416818587155_0079_01_000001:/home/vipshop/hard_disk/11/yarn/local/usercache/hdfs/appcache/application_1416818587155_0079/container_1416818587155_0079_01_000001/*:/home/vipshop/conf:'
```
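Note the `/*` entry: the JVM expands a classpath wildcard to every jar in that directory, in no guaranteed order. A small Python sketch of what such an entry pulls in (illustrative only, not the JVM's actual code):

```python
import glob
import os


def expand_wildcard_entry(entry):
    """Mimic JVM classpath wildcard expansion: 'dir/*' matches every .jar
    in dir. The real JVM guarantees no particular order, which is how an
    unexpected jar (here, one bundling protobuf 2.4.0a) can end up being
    searched before the intended protobuf-java-2.5.0.jar."""
    if not entry.endswith("/*"):
        return [entry]  # a plain directory or jar entry, used as-is
    directory = entry[:-2] or "."
    return sorted(glob.glob(os.path.join(directory, "*.jar")))  # sorted only for readability
```

So every jar localized into the container's working directory competes for class resolution, which motivates the scan below.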

Searching the script's working directory for jars bundling protobuf turned up a hive-solr jar that embeds protobuf, at version 2.4.0a:

```shell
for i in `find . -name "*.jar"`; do echo $i `jar -tvf $i|grep GeneratedMessage|wc -l`; done | awk '{if($2>0) print}'
./protobuf-java-2.5.0.jar 31  # 2.5.0
./hive-exec-0.14.0-dfffe4217f40bd764977b741ad970a562e07fb99992f0180620bd13f68a2577b.jar 31  # 2.5.0
./hive-solr-0.0.1-SNAPSHOT-jar-with-dependencies.jar  # 2.4.0a
```
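The same scan works without the JDK's jar tool: a jar is just a zip archive, so a portable Python equivalent of the one-liner above can be sketched as follows (the GeneratedMessage marker class is genuine protobuf; which jars it finds depends on what you point it at):

```python
import zipfile
from pathlib import Path


def jars_bundling_protobuf(root):
    """Return {jar_path: entry_count} for every jar under root that
    contains com/google/protobuf/GeneratedMessage* classes -- the same
    marker class the shell one-liner greps for."""
    hits = {}
    for jar in Path(root).rglob("*.jar"):
        try:
            with zipfile.ZipFile(jar) as zf:
                n = sum(1 for name in zf.namelist()
                        if name.startswith("com/google/protobuf/GeneratedMessage"))
        except zipfile.BadZipFile:
            continue  # skip corrupt or non-zip files named *.jar
        if n:
            hits[str(jar)] = n
    return hits
```

Any jar this reports, other than the expected protobuf-java artifact, is a candidate for exactly the kind of shading conflict seen here.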

So when the container started, the classloader picked up the 2.4.0a protobuf classes, and the container failed to launch. After rebuilding that jar against protobuf 2.5.0, Hive on Tez ran normally.



This article was reposted from caiguangguang's 51CTO blog; original link: http://blog.51cto.com/caiguangguang/1604100. For reprints, please contact the original author.
