将中文字符串转换为拼音首字母串

下面的Python脚本将任意汉字和英文、数字混合字符串转换为拼音首字母组成的字符串，注意：有些汉字不能查找到首字母，例如"深圳东莞"的"圳"和"莞"两个字，原因不明。需要在vi中用"/\<[A-Z]>搜索这种情况。

实现过程是：首先尝试用unicode, utf8和gbk解码字符串，然后用GBK编码字符串，利用GBK汉字是按拼音顺序编码的原理查出其首字母。

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def get_word_initial(inp):
    if isinstance(inp, unicode):
        unicode_str = inp
    else:
        try:
            unicode_str = inp.decode('utf8')
        except:
            try:
                unicode_str = inp.decode('gbk')
            except:
                print 'unknown coding'
                return

    init_list = [get_char_initial(i) for i in unicode_str]
    return "".join(init_list)

def get_char_initial(unicode1):
    gbkStr = unicode1.encode('gbk')
    try:
        ord(gbkStr)
        return gbkStr
    except:
        asc = ord(gbkStr[0]) * 256 + ord(gbkStr[1]) - 65536
        if asc >= -20319 and asc <= -20284:
            return 'A'
        if asc >= -20283 and asc <= -19776:
            return 'B'
        if asc >= -19775 and asc <= -19219:
            return 'C'
        if asc >= -19218 and asc <= -18711:
            return 'D'
        if asc >= -18710 and asc <= -18527:
            return 'E'
        if asc >= -18526 and asc <= -18240:
            return 'F'
        if asc >= -18239 and asc <= -17923:
            return 'G'
        if asc >= -17922 and asc <= -17418:
            return 'H'
        if asc >= -17417 and asc <= -16475:
            return 'J'
        if asc >= -16474 and asc <= -16213:
            return 'K'
        if asc >= -16212 and asc <= -15641:
            return 'L'
        if asc >= -15640 and asc <= -15166:
            return 'M'
        if asc >= -15165 and asc <= -14923:
            return 'N'
        if asc >= -14922 and asc <= -14915:
            return 'O'
        if asc >= -14914 and asc <= -14631:
            return 'P'
        if asc >= -14630 and asc <= -14150:
            return 'Q'
        if asc >= -14149 and asc <= -14091:
            return 'R'
        if asc >= -14090 and asc <= -13119:
            return 'S'
        if asc >= -13118 and asc <= -12839:
            return 'T'
        if asc >= -12838 and asc <= -12557:
            return 'W'
        if asc >= -12556 and asc <= -11848:
            return 'X'
        if asc >= -11847 and asc <= -11056:
            return 'Y'
        if asc >= -11055 and asc <= -10247:
            return 'Z'
        return ''

if __name__ == "__main__":
    str_input='广州火车站A2c'
    print(get_word_initial(str_input))

下面是一个完整的使用场景，利用上面的代码为小区的地理位置表添加ID字段。首先将上面的代码保存在chnInit.py中，然后相同目录下创建一个addID.py文件：

import chnInit
import sys

target = sys.argv[1]
inclID = sys.argv[2]
with open(target, 'r') as src:
    with open(inclID, 'w') as dst:
        for content in src:
            line = content.strip()
            cgi = line.split(' ,')[0]
            node = line.split(' ,')[1].split(' ')[0]
            city = line.split(' ,')[1].split(' ')[1]
            nodeID = chnInit.get_word_initial(node)
            cityID = chnInit.get_word_initial(city)
            dst.write(cgi + ' ,' + city + ' ' + cityID + ' ' + node + ' ' + nodeID + '\n')

相同目录下保存输入文件input.csv，然后运行脚本：

$ head input.csv
460010973309433 ,白云国际机场 广州
...

$ python addID.py input.csv loc_map.csv
$ head loc_map.csv
460010973309433 ,广州 GZ 白云国际机场 BYGJJC
...

根据Python文档7.2.1节："Methods of File Objects"中的描述：

For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code:

with open(target, 'r') as f:
    for line in f:
        ...

这种方式比较节省内存，适于处理大文件。当处理小文件时，可以用下面的方法将完整内容保存在一个变量中：

with open('workfile', 'r') as f:
    read_data = f.read()

将中文字符串转换为拼音首字母串

Published

Last Updated

Category

Tags

Contact