
Python Natural Language Processing Study Notes: Chapter 3

July 26, 2014 ⁄ General

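From this chapter onward, the example programs assume that you begin your
interactive session or program with the following import statements: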
>>> from __future__ import division

>>> import nltk, re, pprint

Reading data stored on the web:

>>> from __future__ import division
>>> import nltk,re,pprint
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176893
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
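
The session above uses the Python 2 urllib module. On Python 3, urlopen lives in urllib.request and read() returns bytes, so the result has to be decoded before it can be treated as text. A minimal sketch, assuming the file is UTF-8 encoded (the helper name fetch_text is hypothetical and is not part of NLTK or the original notes):

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download `url` and return its body decoded to a str.

    `encoding` is an assumption; check the server's headers or the file
    itself if the text comes back garbled.
    """
    with urlopen(url) as response:
        data = response.read()      # bytes on Python 3
    return data.decode(encoding)


# Usage (performs a network request, so it is left commented out):
# raw = fetch_text("http://www.gutenberg.org/files/2554/2554.txt")
# print(type(raw), len(raw), raw[:75])
```

The reported length may differ from the Python 2 session, because len() then counts decoded characters rather than bytes.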

If you are behind an Internet proxy that Python cannot detect correctly, you may need to specify the proxy manually as follows:

>>> proxies = {'http': 'http://www.someproxy.com:3128'}
>>> raw = urlopen(url, proxies=proxies).read()
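
The proxies= keyword shown above belongs to the Python 2 urlopen. On Python 3 the usual route is a ProxyHandler installed into an opener. A sketch, reusing the placeholder proxy address from the notes (build_proxy_opener is a hypothetical helper name):

```python
from urllib.request import ProxyHandler, build_opener


def build_proxy_opener(proxies):
    """Return an opener that routes requests through the given proxies.

    `proxies` maps a URL scheme to a proxy URL, e.g.
    {'http': 'http://www.someproxy.com:3128'}.
    """
    return build_opener(ProxyHandler(proxies))


# Placeholder proxy address taken from the notes; replace with a real one.
proxies = {'http': 'http://www.someproxy.com:3128'}
opener = build_proxy_opener(proxies)

# Usage (network request, left commented out):
# raw = opener.open(url).read().decode('utf-8')
```

Building the opener does not contact the proxy; the connection is only attempted when opener.open() is called.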

Processing the downloaded data (tokenization):

>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
244484
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
>>> tokens[:15]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by', 'Fyodor', 'Dostoevsky', 'This', 'eBook', 'is']
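
nltk.word_tokenize() needs NLTK and its tokenizer models installed. To see roughly what tokenization does, here is a simplified regular-expression sketch. It is not the algorithm NLTK uses and will differ on contractions, abbreviations, and other edge cases; simple_tokenize is a hypothetical name.

```python
import re

# A word is a run of word characters, optionally joined by internal
# hyphens or apostrophes; any other non-space character is its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split `text` into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sample = "The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky"
tokens = simple_tokenize(sample)
print(tokens[:10])
```

As in the session above, the comma after "Punishment" comes out as a token of its own.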

Further processing and operations on the resulting text:

>>> text = nltk.Text(tokens)
>>> type(text)
<class 'nltk.text.Text'>
>>> text[1020:1060]
['had', 'successfully', 'avoided', 'meeting', 'his', 'landlady', 'on', 'the', 'staircase.', 'His', 'garret', 'was', 'under', 'the', 'roof', 'of', 'a', 'high', ',', 'five-storied', 'house', 'and', 'was', 'more', 'like', 'a', 'cupboard', 'than', 'a', 'room.',
'The', 'landlady', 'who', 'provided', 'him', 'with', 'garret', ',', 'dinners', ',']
>>> text.collocations()
Building collocations list
Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Marfa Petrovna; Rodion Romanovitch; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; Project Gutenberg; Andrey Semyonovitch; Nikodim Fomitch;
young man; Dmitri Prokofitch; n't know; Ilya Petrovitch; Good heavens
>>> raw.find("PART I")
5338
>>> raw.rfind("End of Project Gutenberg's Crime")
1157743
>>> raw = raw[5303:1157681]
>>> raw.find("PART I")
35
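
The find() / rfind() / slice sequence above can be wrapped in a small helper so that the offsets are not hard-coded. A sketch, assuming both markers occur in the text (trim_between is a hypothetical name); unlike the session above, it cuts exactly at the start marker rather than a few characters before it.

```python
def trim_between(raw, start_marker, end_marker):
    """Return the part of `raw` from the first `start_marker` up to,
    but not including, the last `end_marker`.

    Raises ValueError if either marker is missing.
    """
    start = raw.find(start_marker)     # first occurrence, -1 if absent
    end = raw.rfind(end_marker)        # last occurrence, -1 if absent
    if start == -1 or end == -1:
        raise ValueError("marker not found")
    return raw[start:end]


# Toy stand-in for the downloaded file, not the real text.
sample = "License header\nPART I\nsome chapter text\nEnd of Project Gutenberg's Crime"
body = trim_between(sample, "PART I", "End of Project Gutenberg's Crime")
print(body.find("PART I"))
```

Checking for -1 matters because a slice built from a failed find() would silently return the wrong span instead of raising an error.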

Viewing an HTML file from the web:

>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

Use print html to print the retrieved file.

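Extracting text from HTML is an extremely common task, so NLTK provides a helper function, nltk.clean_html(), which takes an HTML string as its argument and returns the raw text. We can then tokenize the raw text.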

>>> raw = nltk.clean_html(html)          # strip the HTML markup
>>> tokens = nltk.word_tokenize(raw)     # split the text into a list of tokens
>>> tokens                               # display all tokens
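
Note that later NLTK releases (3.x) no longer implement nltk.clean_html(); calling it raises an error that points to BeautifulSoup's get_text() instead. If neither is available, a rough stand-in can be built on the standard library's html.parser. This is a sketch for Python 3, not a reimplementation of what clean_html did; strip_html is a hypothetical name.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string with the markup removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_html("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

The result can then be passed to a tokenizer in the same way as raw in the session above.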

>>> tokens = tokens[96:399]
>>> text = nltk.Text(tokens)
>>> text.concordance('gene')
Building index...
Displaying 4 of 4 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance. They do n't disappear 
des would disappear is if having the gene was a disadvantage and I do not thin
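
To make the concordance output above less opaque, here is a rough sketch of a keyword-in-context display over a token list. It is an illustration only and does not reproduce NLTK's indexing or exact formatting; simple_concordance and its width parameter are hypothetical.

```python
def simple_concordance(tokens, word, width=30):
    """Return one line per occurrence of `word` (case-insensitive),
    showing up to `width` characters of context on each side.

    `width` is assumed to be a positive integer.
    """
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        # `width` tokens always cover at least `width` characters of context.
        left = " ".join(tokens[max(0, i - width):i])[-width:]
        right = " ".join(tokens[i + 1:i + 1 + width])[:width]
        lines.append("{:>{w}} {} {:<{w}}".format(left, token, right, w=width))
    return lines


sample = "it must have the gene on both sides of the family".split()
for line in simple_concordance(sample, "gene", width=20):
    print(line)
```

Padding both context strings to a fixed width is what lines the keyword up in a single column, as in the NLTK output.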

Processing search results:
