DarkMatter in Cyberspace
  • Home
  • Categories
  • Tags
  • Archives

Unicode and File I/O in Python 2.X and 3.X


In Python 2.x, the default string is byte string, which means every byte is convert to a character. If you write a Unicode string, you have to write it as u'...'. On the contrary, in Python 3.x, the default string is Unicode string. If you want a byte string, you have to write it as b'...'.

In Python 3.3:

chn = '将帖子翻译为中\n'
with open('py3uni', 'w') as f:
    f.write(chn)
print(chn.encode('utf-8'))
mygbk = chn.encode('gbk')
with open('py3gbk', 'wb') as f:
  f.write(mygbk)
readgbk = open('py3gbk', encoding='gbk').read()
print(readgbk)
print(type(readgbk))  # <class 'str'>

In Python 2.7:

# -*- encoding: utf-8 -*-
import codecs
inputStr = u'将帖子翻译为中文2015年3月\n'

gbkFn = 'gbkFile'
utf8Fn = 'utf8File'
print('Print original Unicdoe string: ' + inputStr)
print('Print in UTF-8 encoding: ' + inputStr.encode('utf-8'))
with codecs.open(utf8Fn, 'w', 'utf-8') as f:
    f.write(inputStr)
fromUTF8 = codecs.open(utf8Fn, encoding='utf-8').read()
with codecs.open(gbkFn, 'w', 'gbk') as f:
    f.write(inputStr)
fromGBK = codecs.open(gbkFn, encoding='gbk').read()
print('Strings from different encodings are the same?')
print(fromUTF8 == fromGBK)
print("\nString type:")
print(type(fromGBK))

You can convert gbkFile to utf8File in shell with: iconv -f gbk -t utf8 gbkFile > utf8File.

You can also write strings to file in this way:

gbkStr = inputStr.encode('gbk')
with open('gbkFile', 'wb') as f:
    f.write(gbkStr)

While it's not as concise as the previous method.



Published

Feb 17, 2014

Last Updated

Feb 17, 2014

Category

Tech

Tags

  • File 6
  • Python 136
  • Unicode 10

Contact

  • Powered by Pelican. Theme: Elegant by Talha Mansoor