DarkMatter in Cyberspace

Converting a Chinese String to a String of Pinyin Initials


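The Python script below converts any string that mixes Chinese characters, English letters, and digits into a string made up of the pinyin initial of each Chinese character; English letters and digits are passed through unchanged. Note that some characters have no entry in the lookup table, for example 圳 and 莞 in "深圳东莞", and the script returns an empty string for them. The table covers only the GB2312 level-1 characters, which are ordered by pinyin, whereas 圳 and 莞 are level-2 characters, which are ordered by radical and stroke count. To find such cases in the output, search in vi for a lone capital letter with "/\<[A-Z]\>".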

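The approach is as follows: first decode the input to Unicode (trying UTF-8 and then GBK, unless it is already a Unicode string), then encode each character as GBK and look up its initial, relying on the fact that GBK encodes the common (GB2312 level-1) Chinese characters in pinyin order.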
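To make the principle concrete, here is a minimal sketch, assuming Python 3. The helper name `gbk_code` is hypothetical and not part of the script. It computes the two-byte GBK code of a character, which is the value the script compares against its range table after subtracting 65536.

```python
def gbk_code(ch):
    """Return the two-byte GBK code of one Chinese character as an integer."""
    raw = ch.encode('gbk')
    if len(raw) != 2:
        raise ValueError('not a two-byte GBK character')
    return raw[0] * 256 + raw[1]


# 啊 is the first level-1 character (0xB0A1), so its signed offset is the
# lower bound of the 'A' range in the script's table.
first_offset = gbk_code('啊') - 65536

# The script's table ends at offset -10247, i.e. GBK code 0xD7F9.
# Characters above that code, such as 圳, get no initial.
beyond_table = gbk_code('圳') > 0xD7F9
```

Here `first_offset` equals -20319, the first boundary in the table, and `beyond_table` is true, which matches the missing initial for 圳 noted earlier. The complete script follows.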

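#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function


def get_word_initial(inp):
    """Return the pinyin initials of inp; ASCII characters are kept as-is."""
    if isinstance(inp, bytes):
        # Byte string: try UTF-8 first, then fall back to GBK.
        try:
            unicode_str = inp.decode('utf8')
        except UnicodeDecodeError:
            try:
                unicode_str = inp.decode('gbk')
            except UnicodeDecodeError:
                print('unknown coding')
                return
    else:
        # Already decoded text.
        unicode_str = inp

    init_list = [get_char_initial(i) for i in unicode_str]
    return "".join(init_list)


def get_char_initial(unicode1):
    """Return the initial of one character, or '' if the table has no entry."""
    try:
        gbkStr = bytearray(unicode1.encode('gbk'))
    except UnicodeEncodeError:
        return ''  # the character has no GBK encoding
    if len(gbkStr) == 1:
        # A single GBK byte is plain ASCII: return it unchanged.
        return chr(gbkStr[0])
    else:
        # Two-byte code, shifted into the signed 16-bit range used below.
        asc = gbkStr[0] * 256 + gbkStr[1] - 65536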
        if asc >= -20319 and asc <= -20284:
            return 'A'
        if asc >= -20283 and asc <= -19776:
            return 'B'
        if asc >= -19775 and asc <= -19219:
            return 'C'
        if asc >= -19218 and asc <= -18711:
            return 'D'
        if asc >= -18710 and asc <= -18527:
            return 'E'
        if asc >= -18526 and asc <= -18240:
            return 'F'
        if asc >= -18239 and asc <= -17923:
            return 'G'
        if asc >= -17922 and asc <= -17418:
            return 'H'
        if asc >= -17417 and asc <= -16475:
            return 'J'
        if asc >= -16474 and asc <= -16213:
            return 'K'
        if asc >= -16212 and asc <= -15641:
            return 'L'
        if asc >= -15640 and asc <= -15166:
            return 'M'
        if asc >= -15165 and asc <= -14923:
            return 'N'
        if asc >= -14922 and asc <= -14915:
            return 'O'
        if asc >= -14914 and asc <= -14631:
            return 'P'
        if asc >= -14630 and asc <= -14150:
            return 'Q'
        if asc >= -14149 and asc <= -14091:
            return 'R'
        if asc >= -14090 and asc <= -13119:
            return 'S'
        if asc >= -13118 and asc <= -12839:
            return 'T'
        if asc >= -12838 and asc <= -12557:
            return 'W'
        if asc >= -12556 and asc <= -11848:
            return 'X'
        if asc >= -11847 and asc <= -11056:
            return 'Y'
        if asc >= -11055 and asc <= -10247:
            return 'Z'
        return ''

if __name__ == "__main__":
    str_input='广州火车站A2c'
    print(get_word_initial(str_input))

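Here is a complete usage scenario: using the code above to add ID fields to a table of cell locations. First save the code above as chnInit.py, then create a file named addID.py in the same directory: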

import sys

import chnInit

target = sys.argv[1]  # input file
inclID = sys.argv[2]  # output file, with the ID fields added
with open(target, 'r') as src:
    with open(inclID, 'w') as dst:
        for content in src:
            line = content.strip()
            if not line:
                continue  # skip blank lines
            # Each line has the form "<cgi> ,<node> <city>".
            cgi, rest = line.split(' ,')[:2]
            node, city = rest.split(' ')[:2]
            nodeID = chnInit.get_word_initial(node)
            cityID = chnInit.get_word_initial(city)
            dst.write(cgi + ' ,' + city + ' ' + cityID + ' ' + node + ' ' + nodeID + '\n')

Save the input file input.csv in the same directory, then run the script:

$ head input.csv
460010973309433 ,白云国际机场 广州
...

$ python addID.py input.csv loc_map.csv
$ head loc_map.csv
460010973309433 ,广州 GZ 白云国际机场 BYGJJC
...

As section 7.2.1, "Methods of File Objects", of the Python documentation puts it:

For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:

with open(target, 'r') as f:
    for line in f:
        ...

This approach uses little memory and is well suited to large files. For small files, you can instead read the entire content into a single variable:

with open('workfile', 'r') as f:
    read_data = f.read()


Published

Dec 31, 2014

Last Updated

Dec 31, 2014

Category

Tech

Tags

  • convert 7
  • gbk 4
  • pinyin 1
  • initials 1
  • Chinese 5
