Install the HDFS Python package mtth/hdfs:
conda install -c conda-forge python-hdfs
Read and write files:
from hdfs import InsecureClient

# Connect over WebHDFS as user cloudera-dev
client = InsecureClient('http://cdh001:50070/', user='cloudera-dev')

# Copy the first six lines of part-000000 into a new file named test
with client.read(
        '/user/cloudera-dev/zjkgfalgodata/20170603/YCZ-65-02/part-000000',
        encoding='utf-8') as reader, client.write(
        '/user/cloudera-dev/zjkgfalgodata/20170603/YCZ-65-02/test',
        encoding='utf-8') as writer:
    raw = reader.read()  # text rather than bytes, because encoding is set
    lines = raw.split('\n')
    fir = '\n'.join(lines[:6])
    writer.write(fir)
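The snippet above loads the whole file into memory before keeping six lines. A minimal sketch of a streaming alternative follows, assuming the reader can be iterated line by line. The helper name copy_head is hypothetical, and io.StringIO objects stand in for the HDFS reader and writer so the example runs without a cluster.

```python
import io
from itertools import islice


def copy_head(lines, writer, n=6):
    """Write the first n lines from an iterable of lines to writer.

    Returns the number of lines copied. Lines are joined with '\n' and
    no trailing newline is written, matching the snippet above.
    """
    head = [line.rstrip('\n') for line in islice(lines, n)]
    writer.write('\n'.join(head))
    return len(head)


# Local stand-ins for the HDFS reader and writer (assumption: the real
# reader yields one line per iteration, as a text file object does).
source = io.StringIO('a\nb\nc\nd\ne\nf\ng\nh\n')
target = io.StringIO()
copied = copy_head(source, target)
```

Because islice stops after n items, the rest of the input is never read, which matters when the part file is large.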
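Separately, pyarrow is the work of Wes McKinney, the author of pandas. It is the Python interface to Apache Arrow, a columnar in-memory format that unifies the in-memory representation of DataFrames across pandas, HBase and Spark. The prospect of using pandas DataFrames for distributed computing in the future is very attractive. I tried it out, but it kept reporting a file-not-found error, so I gave up: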
conda install -c conda-forge pyarrow
export ARROW_LIBHDFS_DIR=/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib64
ipython
import pyarrow as pa
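One possible cause of the file-not-found error is that the directory named in ARROW_LIBHDFS_DIR does not actually contain libhdfs.so. This is an assumption; the notes above do not confirm it. A minimal sketch of a check follows, using only the standard library. The helper name find_libhdfs is hypothetical, and a temporary directory stands in for the real library directory.

```python
import os
import tempfile


def find_libhdfs(directory, name='libhdfs.so'):
    """Return the full path to the library inside directory, or None if absent."""
    candidate = os.path.join(directory, name)
    return candidate if os.path.isfile(candidate) else None


# Demonstration with a temporary directory standing in for ARROW_LIBHDFS_DIR
with tempfile.TemporaryDirectory() as tmp:
    missing = find_libhdfs(tmp)  # nothing is there yet
    open(os.path.join(tmp, 'libhdfs.so'), 'w').close()  # create a placeholder file
    found = find_libhdfs(tmp)
    found_name = os.path.basename(found)
```

On the real machine, the same check would be find_libhdfs(os.environ.get('ARROW_LIBHDFS_DIR', '')); a result of None would suggest pointing the variable at a different directory.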