


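>>> import re
>>> import nltk
>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)  # find every vowel in the word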
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']


>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                    for vs in re.findall(r'[aeiou]{2,}', word))  # runs of two or more vowels
>>> list(fd.items())
[('iao', 1), ('ioa', 1), ('oa', 59), ('ao', 6), ('uu', 1), ('eou', 5), ('eo', 39), ('aiia', 1),
('uo', 8), ('eea', 1), ('ai', 261), ('ui', 95), ('oei', 1), ('iai', 1), ('oui', 6), ('uie', 3), ('aii', 1), ('ooi', 1), ...]



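>>> regexp = r'^[aeiouAEIOU]+|[aeiouAEIOU]+$|[^aeiouAEIOU]'  # pattern: initial vowel run, final vowel run, or any non-vowel
>>> def compress(word):
...     pieces = re.findall(regexp, word)  # the parts of the word to keep
...     return ''.join(pieces)             # glue them back together with join()
...
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))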
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and
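
The same compression can be tried without downloading the UDHR corpus. A minimal sketch, assuming a short hand-typed sentence (the `sample` string is typed in for illustration, not read from the corpus) and the same pattern as above:

```python
import re

# Keep a vowel run at the start of the word, a vowel run at the end
# of the word, and every non-vowel character.
REGEXP = r'^[aeiouAEIOU]+|[aeiouAEIOU]+$|[^aeiouAEIOU]'

def compress(word):
    """Drop word-internal vowels, keeping initial and final vowel runs."""
    return ''.join(re.findall(REGEXP, word))

# Hand-typed sample text (an assumption for this sketch).
sample = "Universal Declaration of Human Rights"
compressed = ' '.join(compress(w) for w in sample.split())
print(compressed)
```

Because the first two alternatives are anchored with `^` and `$`, short words such as "of" and "the" survive intact while longer words lose only their internal vowels.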


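>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(cvs)  # consonant is the condition, vowel is the sample
>>> cfd.tabulate()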
    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 
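
The table is simply a count of consonant-vowel pairs. For readers who do not have the Rotokas lexicon installed, here is a rough standard-library sketch of the same counting. The word list `words` is invented for illustration, the helper `tabulate` is hypothetical, and `collections.Counter` stands in for NLTK's conditional frequency distribution:

```python
import re
from collections import Counter

# Hypothetical mini word list; the real data comes from the Rotokas lexicon.
words = ['kasuari', 'kapo', 'kairi', 'vira', 'toka']

# Every consonant-vowel pair found in every word.
cvs = [cv for w in words for cv in re.findall(r'[ptksvr][aeiou]', w)]

# Count pairs keyed by (consonant, vowel).
counts = Counter((cv[0], cv[1]) for cv in cvs)

def tabulate(counts, rows='kprstv', cols='aeiou'):
    """Return the counts as a plain-text table, one row per consonant."""
    lines = ['  ' + ' '.join(f'{c:>3}' for c in cols)]
    for r in rows:
        lines.append(f'{r} ' + ' '.join(f'{counts[(r, c)]:>3}' for c in cols))
    return '\n'.join(lines)

print(tabulate(counts))
```

A `Counter` returns 0 for pairs it has never seen, which is why empty cells can be printed without any special handling.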



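>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...                  for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cv_index = nltk.Index(cv_word_pairs)  # map each consonant-vowel pair to the words containing it
>>> cv_index['ri']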
['kaipori', 'kaiporipie', 'kaiporivira', 'kairi', 'kairiro', 'kakiri', 'kapokari', 'kapokarito', 'Karepirie',...]
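
Pairs like these can be turned into a lookup table from each consonant-vowel pair to the words that contain it. A minimal sketch using `collections.defaultdict` on an invented word list (NLTK provides `nltk.Index` for the same purpose):

```python
import re
from collections import defaultdict

# Hypothetical mini word list; the real data comes from the Rotokas lexicon.
words = ['kasuari', 'kaipori', 'kairi', 'kapo']

cv_word_pairs = [(cv, w) for w in words
                 for cv in re.findall(r'[ptksvr][aeiou]', w)]

# Map each consonant-vowel pair to the list of words that contain it.
cv_index = defaultdict(list)
for cv, w in cv_word_pairs:
    cv_index[cv].append(w)

print(cv_index['ri'])
print(cv_index['su'])
```

A word that contains the same pair twice is listed twice under that key, because one entry is appended per match rather than per word.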




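>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]  # strip the first suffix that matches
...     return word                         # no suffix matched: return the word unchanged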






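>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]


>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('processe', 's')]


>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]


>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]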
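
The ('processe', 's') and ('process', 'es') results differ because `*` is greedy: `.*` grabs as much of the word as it can and leaves only `s` for the suffix, whereas the non-greedy `*?` takes as little as possible and lets `es` match. Making the suffix group optional with `?` is what lets a word with no listed suffix, such as language, come back with an empty suffix. A small sketch that puts the three variants side by side (the variable names are only for illustration):

```python
import re

SUFFIXES = 'ing|ly|ed|ious|ies|ive|es|s|ment'

greedy = rf'^(.*)({SUFFIXES})$'        # stem takes as much as possible
non_greedy = rf'^(.*?)({SUFFIXES})$'   # stem takes as little as possible
optional = rf'^(.*?)({SUFFIXES})?$'    # suffix may be absent altogether

greedy_result = re.findall(greedy, 'processes')
non_greedy_result = re.findall(non_greedy, 'processes')
optional_result = re.findall(optional, 'language')

print(greedy_result)
print(non_greedy_result)
print(optional_result)
```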



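>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]  # split the word into stem and optional suffix
...     return stem
...
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
>>> print([stem(t) for t in tokens])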
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']


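A special kind of regular expression can be used to search for sequences of words in a text; for example, '<a><man>' finds every instance of 'a man'. The angle brackets mark token boundaries, and any whitespace between them is ignored (this behaviour applies only to NLTK's findall() method for texts). In the example below, <.*> matches any single token, and enclosing it in parentheses means that only the matched word (e.g. monied) is returned rather than the whole phrase (a monied man).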

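>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a> (<.*>) <man>")  # parentheses: return only the middle word
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> moby.findall(r"<a> <.*> <man>")    # no parentheses: return the whole phrase
a monied man; a nervous man; a dangerous man; a white man; a white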
man; a white man; a pious man; a queer man; a good man; a mature man;
a white man; a Cape man; a great man; a wise man; a wise man; a
butterless man; a white man; a fiendish man; a pale man; a furious
man; a better man; a certain man; a complete man; a dismasted man; a
younger man; a brave man; a brave man; a brave man; a brave man


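>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*> <.*> <bro>")   # three-word phrases ending in 'bro'
you rule bro; telling you bro; u twizted bro
>>> chat.findall(r"<l.*>{3,}")         # three or more consecutive tokens starting with 'l'
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la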
la la; lovely lol lol love; lol lol lol.; la la la; la la la


posted on 2016-10-13 22:24  波比12
