Natural Language Processing with Python Study Notes (21): 3.5 Useful Applications of Regular Expressions
3.5 Useful Applications of Regular Expressions
The previous examples all involved searching for words w that match some regular expression regexp using re.search(regexp, w). Apart from checking whether a regular expression matches a word, we can use regular expressions to extract material from words, or to modify words in specific ways.
Extracting Word Pieces
The re.findall() (“find all”) method finds all non-overlapping matches of the given regular expression. Let’s find all the vowels in a word, then count them:
>>> word = 'supercalifragilisticexpialidocious'
>>> re.findall(r'[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall(r'[aeiou]', word))
16
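As a small extension of my own (not in the book), the same list can feed a frequency distribution, letting us count each vowel separately:
>>> fd = nltk.FreqDist(re.findall(r'[aeiou]', word))
>>> fd['i']
7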
Let’s look for all sequences of two or more vowels in some text, and determine their relative frequency:
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                    for vs in re.findall(r'[aeiou]{2,}', word))
>>> fd.items()
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253),
('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95),
('ei', 86), ('oi', 65), ('oa', 59), ('eo', 39), ('iou', 27), ('eu', 18), ...]
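Note that fd.items() shows raw counts rather than true relative frequencies; to get a proportion, FreqDist also provides freq() (a small aside of my own, output omitted):
>>> fd.freq('io')   # count of 'io' divided by the total number of matched sequences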
Your Turn: In the W3C Date Time Format, dates are represented like this: 2009-12-31. Replace the ? in the following Python code with a regular expression, in order to convert the string '2009-12-31' to a list of integers [2009, 12, 31]:
>>> [int(n) for n in re.findall(?, '2009-12-31')]
[2009, 12, 31]
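One possible answer (my own attempt, not given in the book) is to match maximal runs of digits:
>>> [int(n) for n in re.findall(r'[0-9]+', '2009-12-31')]
[2009, 12, 31]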
Doing More with Word Pieces
Once we can use re.findall() to extract material from words, there are interesting things to do with the pieces, such as glue them back together or plot them.
It is sometimes noted that English text is highly redundant, and it is still easy to read when word-internal vowels are left out. For example, declaration becomes dclrtn, and inalienable becomes inlnble, retaining any initial or final vowel sequences. (Personally, once the vowels are gone I can hardly recognize these words, so I am not convinced they are still easy to read.) The regular expression in our next example matches initial vowel sequences, final vowel sequences, and all consonants; everything else is ignored. This three-way disjunction is processed left-to-right, and if one of the three parts matches the word, any later parts of the regular expression are ignored. We use re.findall() to extract all the matching pieces, and ''.join() to join them together (see Section 3.9 for more about the join operation).
>>> regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)
...
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print nltk.tokenwrap(compress(w) for w in english_udhr[:75])
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and
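As a quick sanity check of my own, the two words from the paragraph above come out exactly as promised:
>>> compress('declaration')
'dclrtn'
>>> compress('inalienable')
'inlnble'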
Next, let’s combine regular expressions with conditional frequency distributions. Here we will extract all consonant-vowel sequences from the words of Rotokas, such as ka and si. Since each of these is a pair, it can be used to initialize a conditional frequency distribution. We then tabulate the frequency of each pair:
>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()
     a    e    i    o    u
k  418  148   94  420  173
p   83   31  105   34   51
r  187   63   84   89   79
s    0    0  100    2    1
t   47    8    0  148   37
v   93   27  105   48   49
Examining the rows for s and t, we see they are in partial “complementary distribution,” which is evidence that they are not distinct phonemes in the language. Thus, we could conceivably drop s from the Rotokas alphabet and simply have a pronunciation rule that the letter t is pronounced s when followed by i. (Note that the single entry having su, namely kasuari, ‘cassowary’, is borrowed from English.)
If we want to be able to inspect the words behind the numbers in that table, it would be helpful to have an index, allowing us to quickly find the list of words that contains a given consonant-vowel pair. For example, cv_index['su'] should give us all words containing su. Here’s how we can do this:
>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...                          for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> cv_index['su']
['kasuari']
>>> cv_index['po']
['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa',
'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', ...]
This program processes each word w in turn, and for each one, finds every substring that matches the regular expression «[ptksvr][aeiou]». In the case of the word kasuari, it finds ka, su, and ri. Therefore, the cv_word_pairs list will contain ('ka', 'kasuari'), ('su', 'kasuari'), and ('ri', 'kasuari'). One further step, using nltk.Index(), converts this into a useful index.
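nltk.Index is essentially a defaultdict(list) whose constructor accepts (key, value) pairs; a minimal pure-Python equivalent (my own sketch, not NLTK's actual implementation) would be:
>>> from collections import defaultdict
>>> def make_index(pairs):
...     index = defaultdict(list)   # missing keys map to an empty list
...     for key, value in pairs:
...         index[key].append(value)
...     return index
...
>>> make_index(cv_word_pairs)['su']
['kasuari']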
Finding Word Stems
When we use a web search engine, we usually don’t mind (or even notice) if the words in the document differ from our search terms in having different endings. A query for laptops finds documents containing laptop and vice versa. Indeed, laptop and laptops are just two forms of the same dictionary word (or lemma). For some language processing tasks we want to ignore word endings, and just deal with word stems. There are various ways we can pull out the stem of a word. Here’s a simple-minded approach that just strips off anything that looks like a suffix:
>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word
...
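A quick check of my own (not in the book) shows the intended behavior, and also a first sign of trouble that resurfaces below:
>>> stem('processing')
'process'
>>> stem('is')
'i'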
Although we will ultimately use NLTK’s built-in stemmers, it’s interesting to see how we can use regular expressions for this task. Our first step is to build up a disjunction of all the suffixes. We need to enclose it in parentheses in order to limit the scope of the disjunction.
>>> re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['ing']
Here, re.findall() just gave us the suffix even though the regular expression matched the entire word. This is because the parentheses have a second function: to select substrings to be extracted. If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of many arcane subtleties of regular expressions. (I tried re.findall(r'^.*ing|ly|ed|ious|ies|ive|es|s|ment$', 'processing') without any parentheses, and the output was ['processing']; so once parentheses are added, only the material matching the parenthesized part is output.) Here’s the revised version:
>>> re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['processing']
However, we’d actually like to split the word into stem and suffix. So we should just parenthesize both parts of the regular expression:
>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]
This looks promising, but still has a problem. Let’s look at a different word, processes:
>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('processe', 's')]
The regular expression incorrectly found an -s suffix instead of an -es suffix. This demonstrates another subtlety: the star operator is “greedy,” and so the .* part of the expression tries to consume as much of the input as possible. If we use the “non-greedy” version of the star operator, written *?, we get what we want:
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]
This works even when we allow an empty suffix, by making the content of the second parentheses optional with the ? operator:
>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]
This approach still has many problems (can you spot them?), but we will move on to define a function to perform stemming, and apply it to a whole text:
>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem
...
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
>>> [stem(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond',
'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']
Notice that our regular expression removed the s from ponds but also from is and basis. It produced some non-words, such as distribut and deriv, but these are acceptable stems in some applications.
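As a preview of the built-in stemmers mentioned earlier (covered in Section 3.6), NLTK ships a Porter stemmer that handles many of these cases more carefully; a minimal sketch, with output omitted since the exact results belong to the next section:
>>> porter = nltk.PorterStemmer()
>>> [porter.stem(t) for t in tokens]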
Searching Tokenized Text
You can use a special kind of regular expression for searching across multiple words in a text (where a text is a list of tokens). For example, "<a> <man>" finds all instances of a man in the text. The angle brackets are used to mark token boundaries, and any whitespace between the angle brackets is ignored (behaviors that are unique to NLTK’s findall() method for texts). In the following example, we include <.*> ①, which will match any single token, and enclose it in parentheses so only the matched word (e.g., monied) and not the matched phrase (e.g., a monied man) is produced. The second example finds three-word phrases ending with the word bro ②. The last example finds sequences of three or more words starting with the letter l ③.
>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a> (<.*>) <man>") ①
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*> <.*> <bro>") ②
you rule bro; telling you bro; u twizted bro
>>> chat.findall(r"<l.*>{3,}") ③
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la
Your Turn: Consolidate your understanding of regular expression patterns and substitutions using nltk.re_show(p, s), which annotates the string s to show every place where pattern p was matched, and nltk.app.nemo(), which provides a (rather nice) graphical interface for exploring regular expressions. For more practice, try some of the exercises on regular expressions at the end of this chapter.
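For instance (my own quick test; as I recall, re_show marks each match with curly braces):
>>> nltk.re_show('ing', 'processing strings')
process{ing} str{ing}s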
It is easy to build search patterns when the linguistic phenomenon we’re studying is tied to particular words. In some cases, a little creativity will go a long way. For instance, searching a large text corpus for expressions of the form x and other ys allows us to discover hypernyms (see Section 2.5):
>>> from nltk.corpus import brown
>>> hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
>>> hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals
With enough text, this approach would give us a useful store of information about the taxonomy of objects, without the need for any manual labor. However, our search results will usually contain false positives, i.e., cases that we would want to exclude. For example, the result demands and other factors suggests that demand is an instance of the type factor, but this sentence is actually about wage demands. Nevertheless, we could construct our own ontology of English concepts by manually correcting the output of such searches.
This combination of automatic and manual processing is the most common way for new corpora to be constructed. We will return to this in Chapter 11.
Searching corpora also suffers from the problem of false negatives, i.e., omitting cases that we would want to include. It is risky to conclude that some linguistic phenomenon doesn’t exist in a corpus just because we couldn’t find any instances of a search pattern. Perhaps we just didn’t think carefully enough about suitable patterns.
Your Turn: Look for instances of the pattern as x as y to discover information about entities and their properties:
>>> hobbies_learned.findall(r"<as> <\w*> <as> <\w*>")