Operating Hadoop from Python (pyarrow + HDFS)
Reference: http://arrow.apache.org/docs/python/filesystems.html#hadoop-file-system-hdfs
Environment:
Python 2.7.14 + pyarrow + Hadoop 2.7
System configuration
File System Interfaces
In this section, we discuss filesystem-like interfaces in PyArrow.
Hadoop File System (HDFS) syntax
PyArrow comes with bindings to a C++-based interface to the Hadoop File System. You connect like so:
import pyarrow as pa

fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
with fs.open(path, 'rb') as f:
    # Do something with f
    ...
By default, pyarrow.hdfs.HadoopFileSystem uses libhdfs, a JNI-based interface to the Java Hadoop client. This library is loaded at runtime (rather than at link / library load time, since the library may not be in your LD_LIBRARY_PATH), and relies on some environment variables.
Environment variable configuration:
HADOOP_HOME: the root of your installed Hadoop distribution. Often has lib/native/libhdfs.so.
JAVA_HOME: the location of your Java SDK installation.
ARROW_LIBHDFS_DIR (optional): explicit location of libhdfs.so if it is installed somewhere other than $HADOOP_HOME/lib/native.
CLASSPATH: must contain the Hadoop jars. You can set these using:
export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`
If CLASSPATH is not set, then it will be set automatically if the hadoop executable is in your system path, or if HADOOP_HOME is set.
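The CLASSPATH setup above can also be done from Python before connecting. Below is a minimal sketch; `build_hadoop_classpath` is a hypothetical helper (not part of pyarrow), and the `run` argument is injectable so the logic can be exercised without a real Hadoop installation:

```python
import os


def build_hadoop_classpath(hadoop_home, run=None):
    """Return the Hadoop jar classpath, as `hdfs classpath --glob` would.

    `run` is a command runner (defaults to subprocess) injected so the
    function can be tested without a Hadoop installation.
    """
    if run is None:
        import subprocess

        def run(cmd):
            return subprocess.check_output(cmd).decode()

    hdfs_bin = os.path.join(hadoop_home, 'bin', 'hdfs')
    return run([hdfs_bin, 'classpath', '--glob']).strip()


# Typical use before calling pa.hdfs.connect (assumes HADOOP_HOME is set):
# os.environ['CLASSPATH'] = build_hadoop_classpath(os.environ['HADOOP_HOME'])
```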
You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal Labs:
fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path, driver='libhdfs3')
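If you are unsure which driver is available at runtime, driver selection can be wrapped in a small hypothetical helper (not part of pyarrow); `connect` is passed in explicitly so the sketch can be exercised without either library installed:

```python
def connect_with_fallback(connect, host, port,
                          drivers=('libhdfs', 'libhdfs3'), **kwargs):
    """Try each HDFS driver in turn, returning the first live connection.

    `connect` is expected to behave like pyarrow.hdfs.connect; it is
    injected so this sketch does not require a Hadoop environment.
    """
    last_error = None
    for driver in drivers:
        try:
            return connect(host, port, driver=driver, **kwargs)
        except Exception as exc:  # e.g. the driver's shared library is missing
            last_error = exc
    raise last_error


# Typical use:
# import pyarrow as pa
# fs = connect_with_fallback(pa.hdfs.connect, 'namenode', 8020, user='hdfs')
```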
Interfaces:
HDFS API
hdfs.connect([host, port, user, …]) | Connect to an HDFS cluster
HadoopFileSystem.cat(path) | Return contents of file as a bytes object
HadoopFileSystem.chmod(self, path, mode) | Change file permissions
HadoopFileSystem.chown(self, path[, owner, …]) | Change file owner and group
HadoopFileSystem.delete(path[, recursive]) | Delete the indicated file or directory
HadoopFileSystem.df(self) | Return free space on disk, like the UNIX df command
HadoopFileSystem.disk_usage(path) | Compute bytes used by all contents under indicated path in file tree
HadoopFileSystem.download(self, path, stream) | Download an HDFS file to a local file-like object
HadoopFileSystem.exists(path) | Return True if the path exists
HadoopFileSystem.get_capacity(self) | Get reported total capacity of file system
HadoopFileSystem.get_space_used(self) | Get space used on file system
HadoopFileSystem.info(self, path) | Return detailed HDFS information for path
HadoopFileSystem.ls(path[, detail]) | Retrieve directory contents and metadata, if requested
HadoopFileSystem.mkdir(path, **kwargs) | Create directory in HDFS
HadoopFileSystem.open(self, path[, mode, …]) | Open HDFS file for reading or writing
HadoopFileSystem.rename(path, new_path) | Rename file, like UNIX mv command
HadoopFileSystem.rm(path[, recursive]) | Alias for FileSystem.delete
HadoopFileSystem.upload(self, path, stream) | Upload file-like object to HDFS path
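As an illustration of how a few of these calls compose, here is a small hypothetical helper that uploads a file-like object, creating the parent directory first if it is missing. The FakeHDFS class below only mimics the exists/mkdir/upload subset of the interface so the sketch can run without a cluster; against a real cluster you would pass the object returned by pa.hdfs.connect(...) as fs:

```python
import posixpath


def upload_to_hdfs(fs, hdfs_path, stream):
    """Upload `stream` to `hdfs_path`, creating the parent dir if missing.

    `fs` is any object exposing the exists/mkdir/upload subset of the
    pyarrow HadoopFileSystem API listed in the table above.
    """
    parent = posixpath.dirname(hdfs_path)
    if parent and not fs.exists(parent):
        fs.mkdir(parent)
    fs.upload(hdfs_path, stream)


# A stand-in filesystem so the helper can be demonstrated without a cluster.
class FakeHDFS(object):
    def __init__(self):
        self.dirs = set()
        self.files = {}

    def exists(self, path):
        return path in self.dirs or path in self.files

    def mkdir(self, path):
        self.dirs.add(path)

    def upload(self, path, stream):
        self.files[path] = stream.read()
```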