Hadoop,
Hadoop
Hadoop是什么
Apache Hadoop是一款支持数据密集型分布式应用程序并以Apache 2.0许可协议发布的开源软件框架。 —— [ 维基百科 ]
Hadoop是根据谷歌公司发表的MapReduce和Google文件系统的论文自行实现而成。Hadoop框架透明地为应用提供可靠性和数据移动。它实现了名为MapReduce的编程范式:应用程序被分区成许多小部分,而每个部分都能在集群中的任意节点上运行或重新运行。此外,Hadoop还提供了分布式文件系统,用以存储所有计算节点的数据,这为整个集群带来了非常高的带宽。
Hadoop=HDFS(文件系统,数据存储技术相关)+ Mapreduce(数据处理),Hadoop的数据来源可以是任何形式,在处理半结构化和非结构化数据上与关系型数据库相比有更好的性能,具有更灵活的处理能力,不管任何数据形式最终会转化为key/value,key/value是基本数据单元。用函数式变成Mapreduce代替SQL,SQL是查询语句,而Mapreduce则是使用脚本和代码,而对于适用于关系型数据库,习惯SQL的Hadoop有开源工具hive代替。MapReduce和分布式文件系统的设计,使得整个框架能够自动处理节点故障。它使应用程序与成千上万的独立计算的电脑和PB级的数据。Hadoop也可说是大数据分布式计算的一个解决方案.
Hadoop的组成部分
Hadoop Distributed File System分布式文件系统。 HDFS是一个可以存储极大数据集的文件系统,它是通过向外扩展方式构建的主机集群。它有着独特的设计和性能特点,特别是,HDFS以时延为代价对吞吐量进行了优化,并且通过副本替换冗余达到了高可靠性。
MapReduce并行计算框架。 它是一个编程模型,用以进行大数据量的计算。对于大数据量的计算,通常采用的处理手法就是并行计算。MapReduce的名字源于这个模型中的两项核心操作:Map和Reduce。简单的说来,Map是把一组数据一对一的映射为另外的一组数据,其映射的规则由一个函数来指定,比如对[1,2,3,4]进行乘2的映射就变成了[2,4,6,8]。Reduce是对一组数据进行归约,这个归约的规则由一个函数指定,比如对[1,2,3,4]进行求和的归约得到结果是10,而对它进行求积的归约结果是24。它规范了数据在两个处理阶段(Map和Reduce)的输入和输出,并将其应用于任意规模的大数据集。MapReduce与HDFS紧密结合,确保在任何情况下,MapReduce任务直接在存储所需数据的HDFS节点上运行。
hadoop Common。 支撑Hadoop的公共部分,包括配置相关类、I/O系统、文件系统、远程过程调用和序列化函数库等。详情可参考Hadoop Common 结构学习。
Hadoop子项目
Hadoop环境搭建
打开文件找到下面这行的位置,改成(jdk目录位置,大家根据实际情况修改)
export JAVA_HOME=/home/hadoop/jdk_1.7
在 hadoop-env.sh中加上这句:
export HADOOP_PREFIX=/home/hadoop/hadoop-2.6.0
2)core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp</value>
</property>
</configuration>
<value>/home/hadoop/tmp</value> 应根据自己的情况设置。tmp目录应在服务启动前创建。
3)hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.datanode.ipc.address</name>
<value>0.0.0.0:50020</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:50075</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/data/datanode</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>slave1:9001</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
file:/home/hadoop/data/namenode根据自己情况而定。data目录应在服务启动前创建namenode目录不用创建。
4)mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
</configuration>
5)yarn-site.xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8025</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8040</value>
</property>
</configuration>
6)slaves
slave1
slave2
这样只有slave1,slave2上有datanode进程。
在master上启用 NameNode测试
执行命令
$ hdfs namenode -format
先格式化
15/02/12 21:29:53 INFO namenode.FSImage: Allocated new BlockPoolId: BP-85825581-192.168.187.102-1423747793784
15/02/12 21:29:53 INFO common.Storage: Storage directory /home/hadoop/tmp/dfs/name has been successfully formatted.
等看到执行信息有has been successfully formatted表示格式化ok
最后启动hadoop集群
执行命令 HADOOP_HOME/sbin/start-dfs.sh
启动完成后,输入jps查看进程,执行命令 jps
5161 SecondaryNameNode
4989 NameNode
如果看到上面二个进程,表示master节点成功
执行命令
start-yarn.sh
jps
5161 SecondaryNameNode
5320 ResourceManager
4989 NameNode
如果看到上面3个进程,表示 yarn启动完成。
执行命令
stop-dfs.sh
stop-yarn.sh
保存退出停掉刚才启动的服务
web界面检查hadoop
hdfs管理界面 http://master:50070/
yarn的管理界面不再是原来的50030端口而是8088 http://master:8088/
查看hadoop状态
hdfs dfsadmin -report 查看hdfs的状态报告
yarn node -list 查看yarn的基本信息
具体的集群环境搭建可以参考Hadoop 分布式集群环境搭建