Hadoop Deployment and the Hadoop Distributed File System (HDFS)
I. Introduction to Hadoop
Hadoop is a distributed computing framework developed under the Apache Foundation. It lets users write distributed programs without understanding the low-level details of distribution, harnessing the power of a cluster for high-speed computation and storage.
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and designed to run on low-cost hardware; it provides high-throughput access to application data, which makes it well suited to applications with very large data sets. HDFS relaxes some POSIX requirements to allow streaming access to file system data.
The two core components of the Hadoop framework are HDFS and MapReduce: HDFS provides storage for massive data sets, and MapReduce provides computation over them.
II. Installing and Deploying Hadoop
1. Set up the Java environment and install Hadoop
[root@server1 ~]# ls
hadoop-2.7.3.tar.gz jdk-7u79-linux-x64.tar.gz
[root@server1 ~]# useradd -u 800 hadoop ##create the hadoop user
[root@server1 ~]# id hadoop
uid=800(hadoop) gid=800(hadoop) groups=800(hadoop)
[root@server1 ~]# mv * /home/hadoop/
[root@server1 ~]# su - hadoop
[hadoop@server1 ~]$ tar zfx jdk-7u79-linux-x64.tar.gz
[hadoop@server1 ~]$ tar zfx hadoop-2.7.3.tar.gz
[hadoop@server1 ~]$ ln -s jdk1.7.0_79/ java
[hadoop@server1 ~]$ ln -s hadoop-2.7.3 hadoop
[hadoop@server1 ~]$ cd /home/hadoop/hadoop/etc/hadoop
[hadoop@server1 hadoop]$ vim hadoop-env.sh
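The only change usually needed in hadoop-env.sh is pointing JAVA_HOME at the JDK unpacked above; the path below assumes the java symlink created earlier:

```shell
# hadoop-env.sh: replace the default "export JAVA_HOME=${JAVA_HOME}" line
# with the absolute path of the JDK (the "java" symlink created above).
export JAVA_HOME=/home/hadoop/java
```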
2. Verify the Hadoop installation
[hadoop@server1 hadoop]$ cd /home/hadoop/hadoop
[hadoop@server1 hadoop]$ mkdir input
[hadoop@server1 hadoop]$ cp etc/hadoop/* input/
[hadoop@server1 hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
[hadoop@server1 hadoop]$ cat output/*
The job scans the files copied into input for strings matching the regex dfs[a-z.]+ and writes each match with its count to the output directory. If cat output/* lists those dfs.* entries, the installation succeeded.
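As a quick local cross-check (no Hadoop involved), the same regex can be run over files with plain grep; the sample file below is hypothetical and stands in for the copied config files:

```shell
# Local equivalent of the example job: extract strings matching dfs[a-z.]+
# and count the occurrences of each match.
mkdir -p input
echo "dfs.replication dfs.namenode.name.dir" > input/sample.xml   # hypothetical sample
grep -ho 'dfs[a-z.]\+' input/* | sort | uniq -c | sort -rn
```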
III. The Hadoop Distributed File System (HDFS)
1) NameNode: records how each file is split into blocks and which DataNodes store those blocks.
2) NameNode: holds the running state information of the file system.
3) DataNode: stores the blocks themselves.
4) Secondary NameNode: helps the NameNode consolidate file system state information.
5) JobTracker: runs jobs submitted to the Hadoop cluster, scheduling work across multiple TaskTrackers.
6) TaskTracker: executes an individual map or reduce task.
1. Single data node
1. Edit the Hadoop configuration files
[root@server1 ~]# su - hadoop
[hadoop@server1 ~]$ cd hadoop/etc/hadoop/
[hadoop@server1 hadoop]$ vim core-site.xml ##the IP is the local IP; this sets the NameNode address
[hadoop@server1 hadoop]$ vim slaves ##lists the DataNode addresses
[hadoop@server1 hadoop]$ vim hdfs-site.xml ##sets the number of HDFS block replicas to 1
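For reference, the edits are minimal. With the NameNode on 172.25.68.1 (adjust to your host's IP), the two XML files typically end up as below, and slaves contains just the local IP, one host per line:

```xml
<!-- core-site.xml: address of the NameNode -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://172.25.68.1:9000</value>
    </property>
</configuration>

<!-- hdfs-site.xml: keep one copy of each block (single DataNode) -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```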
2. Set up passwordless SSH login
[hadoop@server1 hadoop]$ cd
[hadoop@server1 ~]$ ssh-keygen
[hadoop@server1 ~]$ cd .ssh/
[hadoop@server1 .ssh]$ ls
id_rsa id_rsa.pub
[hadoop@server1 .ssh]$ cp id_rsa.pub authorized_keys ##authorize the key for passwordless login
Passwordless login now works.
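The same key setup can be scripted non-interactively; this is a sketch using the default paths, with a defensive chmod since sshd rejects a group-writable authorized_keys:

```shell
# Generate a passphrase-less key pair if one does not exist yet,
# then authorize it for login to this same account.
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa -q
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```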
3. Format the NameNode (metadata node) and start DFS
[hadoop@server1 ~]$ cd hadoop
[hadoop@server1 hadoop]$ bin/hdfs namenode -format ##format; the metadata is stored under /tmp/hadoop-hadoop
[hadoop@server1 hadoop]$ sbin/start-dfs.sh ##start DFS
4. Configure environment variables
[hadoop@server1 hadoop]$ cd
[hadoop@server1 ~]$ vim .bash_profile
[hadoop@server1 ~]$ source .bash_profile
[hadoop@server1 ~]$ jps ##list the running Java processes
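The .bash_profile edit typically just appends the JDK and Hadoop bin directories to PATH (paths follow the symlinks created during installation; jps itself comes from java/bin):

```shell
# ~/.bash_profile: make the java, jps, hadoop and hdfs commands available directly
PATH=$PATH:$HOME/java/bin:$HOME/hadoop/bin:$HOME/hadoop/sbin
export PATH
```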
5. View in the browser
[hadoop@server1 ~]$ netstat -antpl ##check the listening ports
In a browser, visit 172.25.68.1:50070 (the NameNode web UI).
6. Create directories and upload files
[hadoop@server1 ~]$ cd hadoop
[hadoop@server1 hadoop]$ ls
bin etc include input lib libexec LICENSE.txt logs NOTICE.txt output README.txt sbin share
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
[hadoop@server1 hadoop]$ bin/hdfs dfs -put input/ ##upload; a relative path lands in /user/hadoop/input
[hadoop@server1 hadoop]$ bin/hdfs dfs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-08-25 21:10 input
[hadoop@server1 hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input output ##run the wordcount example
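What wordcount computes can be mimicked locally with a one-liner, handy for sanity-checking small inputs; the sample text here is hypothetical:

```shell
# Split on whitespace, then tally occurrences of each word --
# the same per-word counts the wordcount example writes to output/.
printf 'dfs block dfs namenode dfs\n' | tr -s ' ' '\n' | sort | uniq -c | sort -rn
# the highest count here is "3 dfs"
```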
2. Multiple data nodes
Lab environment:
server1:172.25.68.1
server2:172.25.68.2
server3:172.25.68.3
1. Remove the single-node test data and stop DFS
[hadoop@server1 hadoop]$ cd
[hadoop@server1 ~]$ cd hadoop
[hadoop@server1 hadoop]$ rm -fr input/ output/
[hadoop@server1 hadoop]$ bin/hdfs dfs -get output
[hadoop@server1 hadoop]$ rm -fr output/
[hadoop@server1 hadoop]$ sbin/stop-dfs.sh
2. Configure server1 (NameNode) ---- done as root
[root@server1 ~]# yum install nfs-utils -y
[root@server1 ~]# /etc/init.d/rpcbind start ##rpcbind must be running before NFS can start
[root@server1 ~]# vim /etc/exports
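The export entry shares the hadoop home directory read-write and maps anonymous access to uid/gid 800, so every node sees the same files under the same hadoop user; the wildcard client range is an assumption, narrow it to your subnet if needed:

```shell
# /etc/exports
/home/hadoop *(rw,anonuid=800,anongid=800)
```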
[root@server1 ~]# /etc/init.d/nfs start
[root@server1 ~]# exportfs -v
[root@server1 ~]# exportfs -rv ##re-export and refresh the shared directories
3. Configure server2 and server3 (DataNodes) ---- the procedure is identical on both nodes
[root@server2 ~]# useradd -u 800 hadoop
[root@server2 ~]# id hadoop
uid=800(hadoop) gid=800(hadoop) groups=800(hadoop)
[root@server2 ~]# yum install -y nfs-utils
[root@server2 ~]# /etc/init.d/rpcbind start
[root@server2 ~]# showmount -e 172.25.68.1
[root@server2 ~]# mount 172.25.68.1:/home/hadoop/ /home/hadoop/
[root@server2 ~]# df
4. Configure HDFS (server1)
[root@server1 ~]# su - hadoop
[hadoop@server1 ~]$ cd hadoop/etc/hadoop/
[hadoop@server1 hadoop]$ vim slaves ##lists the DataNodes
[hadoop@server1 hadoop]$ vim hdfs-site.xml
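With two DataNodes, slaves now lists 172.25.68.2 and 172.25.68.3 (one per line), and the replica count is raised to 2:

```xml
<!-- hdfs-site.xml: two copies of each block, one per DataNode -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>
```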
[hadoop@server1 hadoop]$ cd /tmp/
[hadoop@server1 tmp]$ rm -fr *
5. Test passwordless SSH
[hadoop@server1 tmp]$ ssh server2 ##the home directory is NFS-mounted, so the keys are shared and no password is needed
[hadoop@server1 tmp]$ ssh server3
6. Format the NameNode and start DFS
[hadoop@server1 tmp]$ cd
[hadoop@server1 ~]$ cd hadoop
[hadoop@server1 hadoop]$ bin/hdfs namenode -format
[hadoop@server1 hadoop]$ ls /tmp/
[hadoop@server1 hadoop]$ sbin/start-dfs.sh
[hadoop@server1 hadoop]$ jps ##list the running Java processes
Test:
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
[hadoop@server1 hadoop]$ bin/hdfs dfs -put input ##an input directory must exist in the current local directory
[hadoop@server1 hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input output ##run the wordcount example: count word occurrences in the files under input and write the results to output
[hadoop@server1 hadoop]$ bin/hdfs dfs -ls output
[hadoop@server1 hadoop]$ bin/hdfs dfs -cat output/*
[hadoop@server1 hadoop]$ bin/hdfs dfs -get output
[hadoop@server1 hadoop]$ ls
bin etc include input lib libexec LICENSE.txt logs NOTICE.txt output README.txt sbin share
[hadoop@server1 hadoop]$ cd output/
[hadoop@server1 output]$ ls
part-r-00000 _SUCCESS
[hadoop@server1 output]$ cat part-r-00000
7. Adding a DataNode online