DarkMatter in Cyberspace

Read and Write Files on HDFS with Python


Install the HDFS Python package mtth/hdfs:

conda install -c conda-forge python-hdfs

Read and write files:

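from hdfs import InsecureClient

# WebHDFS endpoint of the NameNode and the HDFS user to act as
client = InsecureClient('http://cdh001:50070/', user='cloudera-dev')

# Open the source for reading and the target for writing, both as UTF-8 text
with client.read(
      '/user/cloudera-dev/zjkgfalgodata/20170603/YCZ-65-02/part-000000',
      encoding='utf-8') as reader, client.write(
      '/user/cloudera-dev/zjkgfalgodata/20170603/YCZ-65-02/test',
      encoding='utf-8') as writer:
    raw = reader.read()  # already a str because encoding is set
    lines = raw.split('\n')
    fir = '\n'.join(lines[:6])  # keep only the first six lines
    writer.write(fir)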
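
The example above loads the whole file into memory just to keep six lines. For large part files, a streaming variant may be preferable. The following is a minimal sketch, not a tested recipe for this cluster: `copy_head` is a hypothetical helper name, and it assumes the `delimiter` argument of `client.read` (which yields one line at a time, without the newline) and the `data` and `overwrite` arguments of `client.write` behave as described in the HdfsCLI documentation.

```python
from itertools import islice


def copy_head(client, src, dst, n=6):
    """Copy the first n lines of src to dst without reading the whole file.

    `client` is expected to be an hdfs client such as InsecureClient.
    """
    # With delimiter set, the context manager yields a generator of lines
    # instead of a file-like reader, so only the needed lines are pulled.
    with client.read(src, encoding='utf-8', delimiter='\n') as lines:
        head = list(islice(lines, n))
    # overwrite=True replaces dst if it already exists (assumption: this is
    # the desired behavior; the default is to fail on an existing file).
    client.write(dst, data='\n'.join(head), encoding='utf-8', overwrite=True)
    return head


# Usage against a real cluster (host and user taken from the example above):
#   from hdfs import InsecureClient
#   client = InsecureClient('http://cdh001:50070/', user='cloudera-dev')
#   copy_head(client, src_path, dst_path)
```

Passing the client in as a parameter keeps the helper independent of any particular connection, which also makes it easy to try out with a stand-in object before pointing it at real data.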

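As an alternative, there is pyarrow, by pandas author Wes McKinney. It is the Python interface to Apache Arrow, a columnar in-memory data format. Arrow aims to unify the in-memory representation used by pandas, HBase, and Spark DataFrames, which could eventually make it possible to run distributed computations on pandas DataFrames, a very attractive prospect. I gave it a try, but it kept failing with a file-not-found error, so I gave up: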

conda install -c conda-forge pyarrow
export ARROW_LIBHDFS_DIR=/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib64
ipython
import pyarrow as pa
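
The post does not say which file was reported missing. One possible cause (an assumption, not something confirmed here) is that pyarrow could not locate `libhdfs.so` or the Hadoop jars: its documentation describes `ARROW_LIBHDFS_DIR` as the directory containing `libhdfs.so` and requires `CLASSPATH` to list the Hadoop jars. A small diagnostic sketch, with the hypothetical helper name `check_libhdfs_env`, might look like this:

```python
import os


def check_libhdfs_env(environ=None):
    """Return a list of problems found in the libhdfs-related environment."""
    environ = os.environ if environ is None else environ
    problems = []

    lib_dir = environ.get('ARROW_LIBHDFS_DIR')
    if not lib_dir:
        problems.append('ARROW_LIBHDFS_DIR is not set')
    elif not os.path.isfile(os.path.join(lib_dir, 'libhdfs.so')):
        # The directory exists in the variable but holds no libhdfs.so
        problems.append('no libhdfs.so in ' + lib_dir)

    # The JVM behind libhdfs needs the Hadoop jars on CLASSPATH
    if not environ.get('CLASSPATH'):
        problems.append('CLASSPATH is not set')

    return problems
```

Calling `check_libhdfs_env()` in the same shell session before importing pyarrow returns an empty list when both settings look plausible. It only checks the environment and says nothing about whether the HDFS path itself exists.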


Published

Dec 6, 2018

Last Updated

Dec 6, 2018

Category

Tech

Tags

  • hdfs 1
  • io 1
  • python 136
