Read and Write Files on HDFS with Python

Install the HDFS Python package mtth/hdfs: conda install -c conda-forge python-hdfs

Read and write files:

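from hdfs import InsecureClient

# Connect to the WebHDFS endpoint on the NameNode as the given user.
client = InsecureClient('http://cdh001:50070/', user='cloudera-dev')
# Open the source file for reading and the target file for writing.
with client.read(
        '/user/cloudera-dev/zjkgfalgodata/20170603/YCZ-65-02/part-000000',
        encoding='utf-8') as reader, client.write(
        '/user/cloudera-dev/zjkgfalgodata/20170603/YCZ-65-02/test',
        encoding='utf-8') as writer:
    raw = reader.read()  # decoded text, because encoding is set
    lines = raw.split('\n')
    # Write only the first six lines to the new file.
    first_lines = '\n'.join(lines[:6])
    writer.write(first_lines)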
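
The snippet above reads the whole file into memory just to keep six lines. For large part files, a streaming variant may be preferable. Below is a minimal sketch, assuming the `delimiter` option of `client.read` (which turns the reader into a generator of lines) and the `overwrite` option of `client.write`. The helper name `copy_head` and its parameters are my own and are not part of the hdfs package.

```python
from itertools import islice


def copy_head(client, src, dst, n=6):
    """Copy the first n lines of src to dst using an hdfs-style client.

    `copy_head` is a hypothetical helper, not part of the hdfs package.
    """
    # With `delimiter` set, the reader is a generator over the file's lines,
    # so only the first n lines are pulled from the stream.
    with client.read(src, encoding='utf-8', delimiter='\n') as reader:
        head = list(islice(reader, n))
    # overwrite=True replaces dst if it already exists; without it,
    # writing to an existing path raises an error.
    with client.write(dst, encoding='utf-8', overwrite=True) as writer:
        writer.write('\n'.join(head))
    return head
```

With the `client` created earlier, this would be called as `copy_head(client, src, dst)`, where `src` and `dst` are the two HDFS paths used above.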

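Separately, pyarrow is the work of Wes McKinney, the author of pandas. It is the Python interface to Apache Arrow, a columnar in-memory computing format that unifies the in-memory representation of pandas, HBase, and Spark DataFrames. In the future it should allow distributed computing with pandas DataFrames, which is very appealing. I tried it out, but it kept reporting a file-not-found error, so I gave up: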

conda install -c conda-forge pyarrow
export ARROW_LIBHDFS_DIR=/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib64
ipython
import pyarrow as pa
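
I did not track down the cause of the file-not-found error. One possibility (an assumption, not something verified in this experiment) is that the directory given in ARROW_LIBHDFS_DIR does not actually contain libhdfs.so; pyarrow also expects the Hadoop jars on CLASSPATH. A minimal sketch for checking the first condition before trying to connect; the helper `find_libhdfs` is hypothetical and only inspects the environment and the file system:

```python
import os


def find_libhdfs(env=None):
    """Return the path to libhdfs.so under ARROW_LIBHDFS_DIR, or None.

    `find_libhdfs` is a hypothetical helper for illustration only;
    it is not part of pyarrow.
    """
    env = os.environ if env is None else env
    lib_dir = env.get('ARROW_LIBHDFS_DIR')
    if not lib_dir:
        return None  # variable unset or empty
    candidate = os.path.join(lib_dir, 'libhdfs.so')
    # Only report the path if the shared library really exists there.
    return candidate if os.path.isfile(candidate) else None
```

If this returns None in the shell where ipython is started, the export above is probably pointing at the wrong directory for that installation.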
