GeorgeYang'Blog

my technology blog

Spelling Correction in 21 Lines of Code

Views: 292 Created: 16-01-27 04:25:05 tags: python, spelling correction

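How it works: the correction is computed with Bayes' theorem. The theoretical basis:

According to the literature on spelling correction, 80-95% of spelling errors are only one edit (edit distance 1) away from the intended word. If a single edit turns up no known word, the same edit step is applied once more to reach candidates at edit distance 2.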
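The Bayesian idea can be sketched roughly like this: for a typed word w, pick the candidate c that maximizes P(c | w), which by Bayes' theorem is proportional to P(w | c) * P(c). The snippet below is only an illustration, not the code from this post. The word counts and error probabilities are made-up numbers, and `p_word`, `p_typo_given` and `pick_correction` are hypothetical names.

```python
# Toy language model: hypothetical word counts standing in for a real corpus.
counts = {"the": 500, "they": 60, "then": 40}
total = sum(counts.values())


def p_word(c):
    """P(c): relative frequency of candidate c in the toy corpus."""
    return counts[c] / total


# Toy error model P(w | c) for the typo w = "thew".
# These values are illustrative assumptions, not measured data.
p_typo_given = {"the": 0.010, "they": 0.020, "then": 0.020}


def pick_correction(candidates):
    """Return the candidate c that maximizes P(w | c) * P(c)."""
    return max(candidates, key=lambda c: p_typo_given[c] * p_word(c))


best = pick_correction(counts)
print(best)
```

The 21-line version in this post uses a much cruder error model: any known word at edit distance 1 is preferred over any word at distance 2, and within the same distance only the word count, i.e. P(c), decides.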

 import re, collections

 def words(text): return re.findall('[a-z]+', text.lower())

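 def train(features):
     # Count how often each word occurs. Unseen words default to 1
     # (simple smoothing), so no word has a count of zero.
     model = collections.defaultdict(lambda: 1)
     for f in features:
         model[f] += 1
     return model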

 # Word counts learned from the training corpus big.txt
 NWORDS = train(words(open('big.txt').read()))

 alphabet = 'abcdefghijklmnopqrstuvwxyz'

 def edits1(word):
     splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
     deletes    = [a + b[1:] for a, b in splits if b]
     transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
     replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
     inserts    = [a + c + b     for a, b in splits for c in alphabet]
     return set(deletes + transposes + replaces + inserts)

 def known_edits2(word):
     return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

 def known(words): return set(w for w in words if w in NWORDS)

 def correct(word):
     candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
     return max(candidates, key=NWORDS.get)
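To try the corrector without downloading big.txt, here is a self-contained variant trained on one short inline sentence. It is a sketch under a few assumptions: `SAMPLE_TEXT`, `WORD_COUNTS` and `ALPHABET` are hypothetical names, the sample sentence is invented, and `collections.Counter` is swapped in for the `defaultdict` used above, so unknown words count as 0 instead of 1.

```python
import re
from collections import Counter

# Hypothetical training text; the post trains on big.txt instead.
SAMPLE_TEXT = "the spelling of a word can be checked against known words the the"
WORD_COUNTS = Counter(re.findall(r"[a-z]+", SAMPLE_TEXT.lower()))
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that occur in the training text."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Prefer the word itself, then distance 1, then distance 2, else give up."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORD_COUNTS[w])


print(correct("speling"))  # one insert away from "spelling"
print(correct("wrod"))     # one transposition away from "word"
print(correct("xyzzy"))    # nothing known within two edits, returned unchanged
```

With such a tiny vocabulary most inputs come back unchanged. A large corpus like big.txt is what makes the word counts meaningful.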

Original articles: http://blog.csdn.net/Pwiling/article/details/50573650 http://norvig.com/spell-correct.html