
Natural Language Processing with Python: Study Notes, Chapter 3 (Part 3)

July 26, 2014

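To the Python interpreter, a regular expression is just like any other string.
If a string contains a backslash followed by certain characters, the interpreter
processes them specially; for example, "\b" is interpreted as a backspace
character. In general, when a regular expression contains backslashes, we should
tell the interpreter not to process the contents of the string at all and to
pass it straight to the re library instead. We do this by prefixing the string
with "r" to mark it as a raw string. For example, the raw string r'\band\b'
contains two "\b" symbols, which the re library interprets as word boundaries
rather than as backspace characters.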
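The difference between the two kinds of string literal can be checked directly. The following is a minimal sketch in Python 3 (the original notes use Python 2.7); the sample text and the variable names are made up for illustration.

```python
import re

# In an ordinary string literal, Python itself turns "\b" into a backspace (U+0008).
plain = '\band\b'
# With the r prefix, the backslashes survive and reach the re library unchanged.
raw = r'\band\b'

text = 'sand and candy'  # hypothetical sample text

plain_matches = re.findall(plain, text)  # searches for backspace + "and" + backspace
raw_matches = re.findall(raw, text)      # searches for the whole word "and"

print(len(plain), len(raw))  # 5 7
print(plain_matches)         # []
print(raw_matches)           # ['and']
```

The plain string is two characters shorter because each "\b" has already collapsed into a single backspace before re ever sees it, which is why it matches nothing in ordinary text.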


The re.findall() ("find all") method returns all (non-overlapping) matches of
the given regular expression. Let's find all the vowels in a word, then count them:
>>> import re
>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
16
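As a small extension of the vowel count, the sketch below tallies each vowel separately and illustrates what "non-overlapping" means. It assumes Python 3 and uses collections.Counter from the standard library, which the original notes do not use; the 'aaaa' string is a made-up example.

```python
import re
from collections import Counter

word = 'supercalifragilisticexpialidocious'

vowels = re.findall(r'[aeiou]', word)
vowel_counts = Counter(vowels)  # tally how often each vowel occurs

# "Non-overlapping" means a character consumed by one match is not reused:
# 'aaaa' contains only two non-overlapping copies of 'aa', not three.
pairs = re.findall(r'aa', 'aaaa')

print(len(vowels))        # 16
print(vowel_counts['i'])  # 7
print(pairs)              # ['aa', 'aa']
```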


Next, find every sequence of two or more vowels in the words of the Treebank
sample and count how often each sequence occurs. Note that re has to be imported
first; in a fresh session without it, the FreqDist line fails with
"NameError: global name 're' is not defined".
>>> import nltk
>>> import re
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.items()
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253), ('ee', 217),
('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95), ('ei', 86), ('oi', 65),
('oa', 59), ('eo', 39), ('iou', 27), ('eu', 18), ('oe', 15), ('iu', 14), ('ae', 11),
('eau', 10), ('uo', 8), ('ao', 6), ('oui', 6), ('eou', 5), ('uou', 5), ('uee', 4),
('aa', 3), ('ieu', 3), ('uie', 3), ('eei', 2), ('aia', 1), ('aii', 1), ('aiia', 1),
('eea', 1), ('iai', 1), ('iao', 1), ('ioa', 1), ('oei', 1), ('ooi', 1), ('ueui', 1),
('uu', 1)]
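The same counting idea can be tried without downloading the Treebank corpus. The sketch below assumes Python 3 and substitutes collections.Counter for nltk.FreqDist, with a short made-up word list standing in for wsj. As far as I know, FreqDist in NLTK 3 is itself a Counter subclass, so most_common() should also be the way to get the frequency-sorted view that fd.items() shows in the NLTK 2.0.4 session above.

```python
import re
from collections import Counter

# Hypothetical stand-in for the sorted Treebank vocabulary.
words = ['beautiful', 'cooperation', 'queue', 'read', 'season', 'the']

# {2,} is greedy, so each run of vowels is captured whole ('eau', not 'ea' + 'u').
counts = Counter(
    vs
    for w in words
    for vs in re.findall(r'[aeiou]{2,}', w)
)

print(counts['ea'])            # 2
print(counts.most_common(1))   # [('ea', 2)]
```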

