Natural Language Processing (2): Text Corpora

2014-08-27 23:54  追风的蓝宝

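1. Accessing text corpora

The chapter starts from one concrete text corpus, nltk.corpus.gutenberg, and we use that instance to learn how corpora work. First, help() shows what type of object it is: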

  1 >>> import nltk
  2 >>> help(nltk.corpus.gutenberg)
  3 Help on PlaintextCorpusReader in module nltk.corpus.reader.plaintext object:
  4 
  5 class PlaintextCorpusReader(nltk.corpus.reader.api.CorpusReader)
  6  |  Reader for corpora that consist of plaintext documents.  Paragraphs
  7  |  are assumed to be split using blank lines.  Sentences and words can
  8  |  be tokenized using the default tokenizers, or by custom tokenizers
  9  |  specificed as parameters to the constructor.
 10  |  
 11  |  This corpus reader can be customized (e.g., to skip preface
 12  |  sections of specific document formats) by creating a subclass and
 13  |  overriding the ``CorpusView`` class variable.
 14  |  
 15  |  Method resolution order:
 16  |      PlaintextCorpusReader
 17  |      nltk.corpus.reader.api.CorpusReader
 18  |      __builtin__.object
 19  |  
 20  |  Methods defined here:
 21  |  
 22  |  __init__(self, root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object>, para_block_reader=<function read_blankline_block>, encoding=None)
 24  |      Construct a new plaintext corpus reader for a set of documents
 25  |      located at the given root directory.  Example usage:
 26  |      
 27  |          >>> root = '/usr/local/share/nltk_data/corpora/webtext/'
 28  |          >>> reader = PlaintextCorpusReader(root, '.*\.txt')
 29  |      
 30  |      :param root: The root directory for this corpus.
 31  |      :param fileids: A list or regexp specifying the fileids in this corpus.
 32  |      :param word_tokenizer: Tokenizer for breaking sentences or
 33  |          paragraphs into words.
 34  |      :param sent_tokenizer: Tokenizer for breaking paragraphs
 35  |          into words.
 36  |      :param para_block_reader: The block reader used to divide the
 37  |          corpus into paragraph blocks.
 38  |  
 39  |  paras(self, fileids=None, sourced=False)
 40  |      :return: the given file(s) as a list of
 41  |          paragraphs, each encoded as a list of sentences, which are
 42  |          in turn encoded as lists of word strings.
 43  |      :rtype: list(list(list(str)))
 44  |  
 45  |  raw(self, fileids=None, sourced=False)
 46  |      :return: the given file(s) as a single string.
 47  |      :rtype: str
 48  |  
 49  |  sents(self, fileids=None, sourced=False)
 50  |      :return: the given file(s) as a list of
 51  |          sentences or utterances, each encoded as a list of word
 52  |          strings.
 53  |      :rtype: list(list(str))
 54  |  
 55  |  words(self, fileids=None, sourced=False)
 56  |      :return: the given file(s) as a list of words
 57  |          and punctuation symbols.
 58  |      :rtype: list(str)
 59  |  
 60  |  ----------------------------------------------------------------------
 61  |  Data and other attributes defined here:
 62  |  
 63  |  CorpusView = <class 'nltk.corpus.reader.util.StreamBackedCorpusView'>
 64  |      A 'view' of a corpus file, which acts like a sequence of tokens:
 65  |      it can be accessed by index, iterated over, etc.  However, the
 66  |      tokens are only constructed as-needed -- the entire corpus is
 67  |      never stored in memory at once.
 68  |      
 69  |      The constructor to ``StreamBackedCorpusView`` takes two arguments:
 70  |      a corpus fileid (specified as a string or as a ``PathPointer``);
 71  |      and a block reader.  A "block reader" is a function that reads
 72  |      zero or more tokens from a stream, and returns them as a list.  A
 73  |      very simple example of a block reader is:
 74  |      
 75  |          >>> def simple_block_reader(stream):
 76  |          ...     return stream.readline().split()
 77  |      
 78  |      This simple block reader reads a single line at a time, and
 79  |      returns a single token (consisting of a string) for each
 80  |      whitespace-separated substring on the line.
 81  |      
 82  |      When deciding how to define the block reader for a given
 83  |      corpus, careful consideration should be given to the size of
 84  |      blocks handled by the block reader.  Smaller block sizes will
 85  |      increase the memory requirements of the corpus view's internal
 86  |      data structures (by 2 integers per block).  On the other hand,
 87  |      larger block sizes may decrease performance for random access to
 88  |      the corpus.  (But note that larger block sizes will *not*
 89  |      decrease performance for iteration.)
 90  |      
 91  |      Internally, ``CorpusView`` maintains a partial mapping from token
 92  |      index to file position, with one entry per block.  When a token
 93  |      with a given index *i* is requested, the ``CorpusView`` constructs
 94  |      it as follows:
 95  |      
 96  |        1. First, it searches the toknum/filepos mapping for the token
 97  |           index closest to (but less than or equal to) *i*.
 98  |      
 99  |        2. Then, starting at the file position corresponding to that
100  |           index, it reads one block at a time using the block reader
101  |           until it reaches the requested token.
102  |      
103  |      The toknum/filepos mapping is created lazily: it is initially
104  |      empty, but every time a new block is read, the block's
105  |      initial token is added to the mapping.  (Thus, the toknum/filepos
106  |      map has one entry per block.)
107  |      
108  |      In order to increase efficiency for random access patterns that
109  |      have high degrees of locality, the corpus view may cache one or
111  |      more blocks.
112  |      
113  |      :note: Each ``CorpusView`` object internally maintains an open file
114  |          object for its underlying corpus file.  This file should be
115  |          automatically closed when the ``CorpusView`` is garbage collected,
116  |          but if you wish to close it manually, use the ``close()``
117  |          method.  If you access a ``CorpusView``'s items after it has been
118  |          closed, the file object will be automatically re-opened.
119  |      
120  |      :warning: If the contents of the file are modified during the
121  |          lifetime of the ``CorpusView``, then the ``CorpusView``'s behavior
122  |          is undefined.
123  |      
124  |      :warning: If a unicode encoding is specified when constructing a
125  |          ``CorpusView``, then the block reader may only call
126  |          ``stream.seek()`` with offsets that have been returned by
127  |          ``stream.tell()``; in particular, calling ``stream.seek()`` with
128  |          relative offsets, or with offsets based on string lengths, may
129  |          lead to incorrect behavior.
130  |      
131  |      :ivar _block_reader: The function used to read
132  |          a single block from the underlying file stream.
133  |      :ivar _toknum: A list containing the token index of each block
134  |          that has been processed.  In particular, ``_toknum[i]`` is the
135  |          token index of the first token in block ``i``.  Together
136  |          with ``_filepos``, this forms a partial mapping between token
137  |          indices and file positions.
138  |      :ivar _filepos: A list containing the file position of each block
139  |          that has been processed.  In particular, ``_toknum[i]`` is the
140  |          file position of the first character in block ``i``.  Together
141  |          with ``_toknum``, this forms a partial mapping between token
142  |          indices and file positions.
143  |      :ivar _stream: The stream used to access the underlying corpus file.
144  |      :ivar _len: The total number of tokens in the corpus, if known;
145  |          or None, if the number of tokens is not yet known.
146  |      :ivar _eofpos: The character position of the last character in the
147  |          file.  This is calculated when the corpus view is initialized,
148  |          and is used to decide when the end of file has been reached.
149  |      :ivar _cache: A cache of the most recently read block.  It
150  |         is encoded as a tuple (start_toknum, end_toknum, tokens), where
151  |         start_toknum is the token index of the first token in the block;
152  |         end_toknum is the token index of the first token not in the
153  |         block; and tokens is a list of the tokens in the block.
154  |  
155  |  ----------------------------------------------------------------------
156  |  Methods inherited from nltk.corpus.reader.api.CorpusReader:
157  |  
158  |  __repr__(self)
159  |  
160  |  abspath(self, fileid)
161  |      Return the absolute path for the given file.
162  |      
163  |      :type file: str
 165  |      :param file: The file identifier for the file whose path
166  |          should be returned.
167  |      :rtype: PathPointer
168  |  
169  |  abspaths(self, fileids=None, include_encoding=False, include_fileid=False)
170  |      Return a list of the absolute paths for all fileids in this corpus;
171  |      or for the given list of fileids, if specified.
172  |      
173  |      :type fileids: None or str or list
174  |      :param fileids: Specifies the set of fileids for which paths should
175  |          be returned.  Can be None, for all fileids; a list of
176  |          file identifiers, for a specified set of fileids; or a single
177  |          file identifier, for a single file.  Note that the return
178  |          value is always a list of paths, even if ``fileids`` is a
179  |          single file identifier.
180  |      
181  |      :param include_encoding: If true, then return a list of
182  |          ``(path_pointer, encoding)`` tuples.
183  |      
184  |      :rtype: list(PathPointer)
185  |  
186  |  encoding(self, file)
187  |      Return the unicode encoding for the given corpus file, if known.
188  |      If the encoding is unknown, or if the given file should be
189  |      processed using byte strings (str), then return None.
190  |  
191  |  fileids(self)
192  |      Return a list of file identifiers for the fileids that make up
193  |      this corpus.
194  |  
195  |  open(self, file, sourced=False)
196  |      Return an open stream that can be used to read the given file.
197  |      If the file's encoding is not None, then the stream will
198  |      automatically decode the file's contents into unicode.
199  |      
200  |      :param file: The file identifier of the file to read.
201  |  
202  |  readme(self)
203  |      Return the contents of the corpus README file, if it exists.
204  |  
205  |  ----------------------------------------------------------------------
206  |  Data descriptors inherited from nltk.corpus.reader.api.CorpusReader:
207  |  
208  |  __dict__
209  |      dictionary for instance variables (if defined)
210  |  
211  |  __weakref__
212  |      list of weak references to the object (if defined)
213  |  
214  |  root
215  |      The directory where this corpus is stored.
216  |      
217  |      :type: PathPointer

PlaintextCorpusReader defines many of the methods used in the examples in this post, such as fileids() and words().
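The CorpusView docstring in the help output above describes lazy access: a block reader turns the stream into tokens one block at a time, and a toknum/filepos table remembers where each block starts. Here is a minimal sketch of that idea in plain Python. TinyCorpusView is a hypothetical class made up for illustration, not NLTK's StreamBackedCorpusView, and it leaves out caching, length tracking and encodings.

```python
import io
from bisect import bisect_right


def simple_block_reader(stream):
    # The block reader from the docstring: one line in, a list of tokens out.
    return stream.readline().split()


class TinyCorpusView(object):
    """Hypothetical, simplified lazy view over a token stream."""

    def __init__(self, stream, block_reader):
        self._stream = stream
        self._block_reader = block_reader
        self._toknum = [0]   # token index of the first token of each known block
        self._filepos = [0]  # stream position at which each known block starts

    def __getitem__(self, i):
        if i < 0:
            raise IndexError("this sketch only supports non-negative indices")
        # Step 1: closest known block start at or before token i.
        block = bisect_right(self._toknum, i) - 1
        toknum, pos = self._toknum[block], self._filepos[block]
        self._stream.seek(pos)
        # Step 2: read block by block until token i is reached.
        while True:
            tokens = self._block_reader(self._stream)
            new_pos = self._stream.tell()
            if new_pos == pos:  # nothing was consumed: end of stream
                raise IndexError("token index out of range")
            if i < toknum + len(tokens):
                return tokens[i - toknum]
            toknum, pos = toknum + len(tokens), new_pos
            if pos > self._filepos[-1]:  # first time this block start is seen
                self._toknum.append(toknum)
                self._filepos.append(pos)


text = u"the quick brown\nfox jumps\nover the lazy dog\n"
view = TinyCorpusView(io.StringIO(text), simple_block_reader)
print(view[7])        # found by reading three blocks
print(view._toknum)   # block starts learned so far
```

A second lookup such as view[4] no longer starts from the beginning of the stream: it seeks straight to the recorded start of the second block.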

1.1 fileids() returns the file identifiers of the corpus

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

1.2 words() returns the list of words in a file

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> gutenberg.words('austen-emma.txt')
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]
>>> len(gutenberg.words('austen-emma.txt'))
192427
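The tokens in this output come from the reader's default word tokenizer, which the help text earlier shows as a WordPunctTokenizer with the pattern \w+|[^\w\s]+. As a rough sketch of what that pattern does, the same regular expression can be applied with Python's re module. The sample string is made up to resemble the first line of austen-emma.txt, and the extra flags NLTK passes to its tokenizer are left out here.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (that is, punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

sample = "[Emma by Jane Austen 1816]"   # made-up sample line
tokens = WORD_PUNCT.findall(sample)
print(tokens)
print(len(tokens))
```

Punctuation becomes a token of its own, which is why the count returned by len(gutenberg.words(...)) includes brackets, commas and full stops.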

Use concordance() to search for a word in the text:

>>> import nltk
>>> emma = nltk.Text(gutenberg.words('austen-emma.txt'))
>>> emma
<Text: Emma by Jane Austen 1816>
>>> emma.concordance('surperize')
Building index...
No matches
>>> emma.concordance('surprize')
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity ` 
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on 
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
 to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ; 
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
 expected by the best judges , for surprize -- but there was great joy . Mr . 
 sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
 . It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai
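To make the concordance output above less of a black box, here is a minimal sketch of the idea in plain Python: find every token that matches the word (ignoring case) and cut a fixed amount of context from each side. The function concordance_lines and the sample tokens are made up for illustration; this is not how NLTK's ConcordanceIndex is implemented.

```python
def concordance_lines(tokens, word, width=79, lines=25):
    """Return up to `lines` strings showing `word` with context on both sides."""
    half = max(1, (width - len(word) - 2) // 2)   # characters of context per side
    target = word.lower()
    results = []
    for i, tok in enumerate(tokens):
        if tok.lower() != target:                 # matching ignores case
            continue
        # Every token is at least one character, so `half` tokens are enough.
        left = " ".join(tokens[max(0, i - half):i])[-half:]
        right = " ".join(tokens[i + 1:i + 1 + half])[:half]
        results.append("%s %s %s" % (left.rjust(half), tok, right))
        if len(results) == lines:
            break
    return results


tokens = "You surprize me ! Emma was taken by Surprize .".split()
hits = concordance_lines(tokens, "surprize", width=30)
for line in hits:
    print(line)
```

A word that never occurs, such as the misspelled 'surperize', simply yields an empty list, which corresponds to the "No matches" message in the session.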

This example uses the nltk.Text class. Running help() on it as well and reading through its methods shows how useful the class is.

  1 class Text(__builtin__.object)
  2  |  A wrapper around a sequence of simple (string) tokens, which is
  3  |  intended to support initial exploration of texts (via the
  4  |  interactive console).  Its methods perform a variety of analyses
  5  |  on the text's contexts (e.g., counting, concordancing, collocation
  6  |  discovery), and display the results.  If you wish to write a
  7  |  program which makes use of these analyses, then you should bypass
  8  |  the ``Text`` class, and use the appropriate analysis function or
  9  |  class directly instead.
 10  |  
 11  |  A ``Text`` is typically initialized from a given document or
 12  |  corpus.  E.g.:
 13  |  
 14  |  >>> import nltk.corpus
 15  |  >>> from nltk.text import Text
 16  |  >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
 17  |  
 18  |  Methods defined here:
 19  |  
 20  |  __getitem__(self, i)
 21  |  
 22  |  __init__(self, tokens, name=None)
 23  |      Create a Text object.
 24  |      
 25  |      :param tokens: The source text.
 26  |      :type tokens: sequence of str
 27  |  
 28  |  __len__(self)
 29  |  
 30  |  __repr__(self)
 31  |      :return: A string representation of this FreqDist.
 32  |      :rtype: string
 33  |  
 34  |  collocations(self, num=20, window_size=2)
 35  |      Print collocations derived from the text, ignoring stopwords.
 36  |      
 37  |      :seealso: find_collocations
 38  |      :param num: The maximum number of collocations to print.
 39  |      :type num: int
 40  |      :param window_size: The number of tokens spanned by a collocation (default=2)
 41  |      :type window_size: int
 42  |  
 43  |  common_contexts(self, words, num=20)
 44  |      Find contexts where the specified words appear; list
 45  |      most frequent common contexts first.
 46  |      
 47  |      :param word: The word used to seed the similarity search
 48  |      :type word: str
 49  |      :param num: The number of words to generate (default=20)
 50  |      :type num: int
 51  |      :seealso: ContextIndex.common_contexts()
 52  |  
 53  |  concordance(self, word, width=79, lines=25)
 54  |      Print a concordance for ``word`` with the specified context window.
 55  |      Word matching is not case-sensitive.
 56  |      :seealso: ``ConcordanceIndex``
 57  |  
 58  |  count(self, word)
 59  |      Count the number of times this word appears in the text.
 60  |  
 61  |  dispersion_plot(self, words)
 62  |      Produce a plot showing the distribution of the words through the text.
 63  |      Requires pylab to be installed.
 64  |      
 65  |      :param words: The words to be plotted
 66  |      :type word: str
 67  |      :seealso: nltk.draw.dispersion_plot()
 68  |  
 69  |  findall(self, regexp)
 70  |      Find instances of the regular expression in the text.
 71  |      The text is a list of tokens, and a regexp pattern to match
 72  |      a single token must be surrounded by angle brackets.  E.g.
 73  |      
 74  |      >>> from nltk.book import text1, text5, text9
 75  |      >>> text5.findall("<.*><.*><bro>")
 76  |      you rule bro; telling you bro; u twizted bro
 77  |      >>> text1.findall("<a>(<.*>)<man>")
 78  |      monied; nervous; dangerous; white; white; white; pious; queer; good;
 79  |      mature; white; Cape; great; wise; wise; butterless; white; fiendish;
 80  |      pale; furious; better; certain; complete; dismasted; younger; brave;
 81  |      brave; brave; brave
 82  |      >>> text9.findall("<th.*>{3,}")
 83  |      thread through those; the thought that; that the thing; the thing
 84  |      that; that that thing; through these than through; them that the;
 85  |      through the thick; them that they; thought that the
 86  |      
 87  |      :param regexp: A regular expression
 88  |      :type regexp: str
 89  |  
 90  |  generate(self, length=100)
 91  |      Print random text, generated using a trigram language model.
 92  |      
 93  |      :param length: The length of text to generate (default=100)
 94  |      :type length: int
 95  |      :seealso: NgramModel
 96  |  
 97  |  index(self, word)
 98  |      Find the index of the first occurrence of the word in the text.
 99  |  
100  |  plot(self, *args)
101  |      See documentation for FreqDist.plot()
102  |      :seealso: nltk.prob.FreqDist.plot()
103  |  
104  |  readability(self, method)
105  |  
106  |  similar(self, word, num=20)
107  |      Distributional similarity: find other words which appear in the
108  |      same contexts as the specified word; list most similar words first.
109  |      
110  |      :param word: The word used to seed the similarity search
111  |      :type word: str
112  |      :param num: The number of words to generate (default=20)
113  |      :type num: int
114  |      :seealso: ContextIndex.similar_words()
115  |  
116  |  vocab(self)
117  |      :seealso: nltk.prob.FreqDist
118  |  
119  |  ----------------------------------------------------------------------
120  |  Data descriptors defined here:
121  |  
122  |  __dict__
123  |      dictionary for instance variables (if defined)
124  |  
125  |  __weakref__
126  |      list of weak references to the object (if defined)
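Among the methods listed above, similar() is described as finding words that appear in the same contexts as a given word. A minimal sketch of that idea, assuming a context is simply the pair (previous word, next word): the function similar_words and its scoring (number of shared contexts) are a simplification made up for illustration, and NLTK's ContextIndex scores candidates differently.

```python
from collections import defaultdict


def similar_words(tokens, word, num=20):
    """Rank other words by how many (left, right) contexts they share with `word`."""
    lowered = [t.lower() for t in tokens]
    contexts = defaultdict(set)
    for i, tok in enumerate(lowered):
        left = lowered[i - 1] if i > 0 else None
        right = lowered[i + 1] if i + 1 < len(lowered) else None
        contexts[tok].add((left, right))
    target = contexts.get(word.lower(), set())
    shared = {w: len(ctx & target) for w, ctx in contexts.items()
              if w != word.lower() and ctx & target}
    # Most shared contexts first; ties are broken alphabetically.
    return sorted(shared, key=lambda w: (-shared[w], w))[:num]


tokens = "the cat sat . the dog sat . the cat ran . a bird flew .".split()
print(similar_words(tokens, "cat"))
```

In the sample, "dog" is the only word that also occurs between "the" and "sat", so it is the only word reported as similar to "cat".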

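1.3 The difference between raw, sents and words

The following example shows how raw, sents and words differ:

#!/usr/bin/env python
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))      # number of characters (spaces included)
    num_words = len(gutenberg.words(fileid))    # number of word tokens
    num_sents = len(gutenberg.sents(fileid))    # number of sentences
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # number of distinct words
    # this form of print is valid in both Python 2 and Python 3
    print('%d %d %d %s' % (int(num_chars / num_words), int(num_words / num_sents),
                           int(num_words / num_vocab), fileid))

Output (columns: average word length, average number of words per sentence, average number of times each distinct word occurs, file):

4 21 26 austen-emma.txt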
4 23 16 austen-persuasion.txt
4 23 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 18 5 blake-poems.txt
4 17 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 16 12 carroll-alice.txt
4 17 11 chesterton-ball.txt
4 19 11 chesterton-brown.txt
4 16 10 chesterton-thursday.txt
4 17 24 edgeworth-parents.txt
4 24 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 11 8 shakespeare-caesar.txt
4 12 7 shakespeare-hamlet.txt
4 12 6 shakespeare-macbeth.txt
4 35 12 whitman-leaves.txt
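The three views behind these numbers can be seen side by side on a tiny example. The sketch below uses a made-up one-line text and a deliberately naive sentence rule (cut after ".", "!" or "?"), which stands in for the Punkt sentence tokenizer that NLTK uses; only the relationship between raw, words and sents is the point.

```python
import re

raw = "Emma was happy. She smiled! Was she happy?"   # made-up sample text

# words: the raw string split into word and punctuation tokens
words = re.findall(r"\w+|[^\w\s]+", raw)

# sents: the word list cut after each sentence-final token (naive rule)
sents, current = [], []
for w in words:
    current.append(w)
    if w in (".", "!", "?"):
        sents.append(current)
        current = []
if current:
    sents.append(current)

vocab = set(w.lower() for w in words)   # distinct words, ignoring case

print(len(raw), len(words), len(sents), len(vocab))
print(int(len(raw) / len(words)), int(len(words) / len(sents)), int(len(words) / len(vocab)))
```

raw is measured in characters, words in tokens and sents in lists of tokens, so dividing one length by another gives the three averages printed by the script.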

Find and display the longest sentence in shakespeare-macbeth.txt:

#!/usr/bin/env python
from nltk.corpus import gutenberg

macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')   # list of sentences
print(macbeth_sentences)
print(macbeth_sentences[1037])
longest_len = max(len(s) for s in macbeth_sentences)             # length of the longest sentence
print([s for s in macbeth_sentences if len(s) == longest_len])   # the longest sentence(s)

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]

['Good', 'night', ',', 'and', 'better', 'health', 'Attend', 'his', 'Maiesty']

[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]
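As a small aside, the longest sentence can also be picked out with max() and key=len, which returns the sentence itself instead of its length. The sentences list below is a made-up stand-in for gutenberg.sents('shakespeare-macbeth.txt').

```python
sentences = [
    ["Actus", "Primus", "."],
    ["Good", "night", ",", "and", "better", "health"],
    ["Exeunt", "."],
]   # made-up stand-in for the real sentence list

longest = max(sentences, key=len)      # one pass; the first longest sentence wins
longest_len = len(longest)
# The list comprehension keeps every sentence of that length, ties included.
all_longest = [s for s in sentences if len(s) == longest_len]
print(longest_len, longest)
```

The list comprehension form is still the right choice when several sentences may share the maximum length.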

1.4 The NPSChatCorpusReader class

Next comes another reader class. NLTK ships a further corpus instance, nltk.corpus.nps_chat, and again we use help() to inspect it. Its base class, XMLCorpusReader, already suggests that this reader works with XML files.

nps_chat = class NPSChatCorpusReader(nltk.corpus.reader.xmldocs.XMLCorpusReader)
 |  Method resolution order:
 |      NPSChatCorpusReader
 |      nltk.corpus.reader.xmldocs.XMLCorpusReader
 |      nltk.corpus.reader.api.CorpusReader
 |      __builtin__.object
 |  
 |  Methods defined here:
...

>>> from nltk.corpus import nps_chat
>>> nps_chat.fileids()
['10-19-20s_706posts.xml', '10-19-30s_705posts.xml', '10-19-40s_686posts.xml', '10-19-adults_706posts.xml', '10-24-40s_706posts.xml', '10-26-teens_706posts.xml', '11-06-adults_706posts.xml', '11-08-20s_705posts.xml', '11-08-40s_706posts.xml', '11-08-adults_705posts.xml', '11-08-teens_706posts.xml', '11-09-20s_706posts.xml', '11-09-40s_706posts.xml', '11-09-adults_706posts.xml', '11-09-teens_706posts.xml']
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
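To give a feel for what an XML-backed reader has to do, here is a sketch that pulls posts out of an XML string with the standard library. The XML layout, the tag names and the posts() helper are all assumptions made up for this illustration; the real NPS Chat files are richer, and NLTK's reader tokenizes differently (note "n't" in the output above).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, much simplified layout; not the real NPS Chat schema.
SAMPLE = """
<Session>
  <Posts>
    <Post user="User1">hi everyone</Post>
    <Post user="User2">i don't want that, thanks.</Post>
  </Posts>
</Session>
"""


def posts(xml_text):
    """Return each <Post> element as a list of tokens."""
    root = ET.fromstring(xml_text.strip())
    return [re.findall(r"\w+|[^\w\s]+", post.text or "")
            for post in root.iter("Post")]


chatroom = posts(SAMPLE)
print(len(chatroom))
print(chatroom[0])
```

As with nps_chat.posts(), the result is a list of posts, each of which is a list of tokens, so chatroom[i] addresses one post.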

1.5 The CategorizedTaggedCorpusReader class

The brown corpus serves here as the example for the CategorizedTaggedCorpusReader class.

  1 >>> from nltk.corpus import brown
  2 >>> help(brown)
  3 class CategorizedTaggedCorpusReader(nltk.corpus.reader.api.CategorizedCorpusReader, TaggedCorpusReader)
  4  |  A reader for part-of-speech tagged corpora whose documents are
  5  |  divided into categories based on their file identifiers.
  6  |  
  7  |  Method resolution order:
  8  |      CategorizedTaggedCorpusReader
  9  |      nltk.corpus.reader.api.CategorizedCorpusReader
 10  |      TaggedCorpusReader
 11  |      nltk.corpus.reader.api.CorpusReader
 12  |      __builtin__.object
 13  |  
 14  |  Methods defined here:
 15  |  
 16  |  __init__(self, *args, **kwargs)
 17  |      Initialize the corpus reader.  Categorization arguments
 18  |      (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
 19  |      the ``CategorizedCorpusReader`` constructor.  The remaining arguments
 20  |      are passed to the ``TaggedCorpusReader``.
 21  |  
 22  |  paras(self, fileids=None, categories=None)
 23  |  
 24  |  raw(self, fileids=None, categories=None)
 25  |  
 26  |  sents(self, fileids=None, categories=None)
 27  |  
 28  |  tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
 29  |  
 30  |  tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
 31  |  
 32  |  tagged_words(self, fileids=None, categories=None, simplify_tags=False)
 33  |  
 34  |  words(self, fileids=None, categories=None)
 35  |  
 36  |  ----------------------------------------------------------------------
 37  |  Methods inherited from nltk.corpus.reader.api.CategorizedCorpusReader:
 38  |  
 39  |  categories(self, fileids=None)
 40  |      Return a list of the categories that are defined for this corpus,
 41  |      or for the file(s) if it is given.
 42  |  
 43  |  fileids(self, categories=None)
 44  |      Return a list of file identifiers for the files that make up
 45  |      this corpus, or that make up the given category(s) if specified.
 46  |  
 47  |  ----------------------------------------------------------------------
 48  |  Data descriptors inherited from nltk.corpus.reader.api.CategorizedCorpusReader:
 49  |  
 50  |  __dict__
 51  |      dictionary for instance variables (if defined)
 52  |  
 53  |  __weakref__
 54  |      list of weak references to the object (if defined)
 55  |  
 56  |  ----------------------------------------------------------------------
 57  |  Methods inherited from nltk.corpus.reader.api.CorpusReader:
 58  |  
 59  |  __repr__(self)
 60  |  
 61  |  abspath(self, fileid)
 62  |      Return the absolute path for the given file.
 63  |      
 64  |      :type file: str
 65  |      :param file: The file identifier for the file whose path
 66  |          should be returned.
 67  |      :rtype: PathPointer
 68  |  
 69  |  abspaths(self, fileids=None, include_encoding=False, include_fileid=False)
 70  |      Return a list of the absolute paths for all fileids in this corpus;
 71  |      or for the given list of fileids, if specified.
 72  |      
 73  |      :type fileids: None or str or list
 74  |      :param fileids: Specifies the set of fileids for which paths should
 75  |          be returned.  Can be None, for all fileids; a list of
 76  |          file identifiers, for a specified set of fileids; or a single
 77  |          file identifier, for a single file.  Note that the return
 78  |          value is always a list of paths, even if ``fileids`` is a
 79  |          single file identifier.
 80  |      
 81  |      :param include_encoding: If true, then return a list of
 82  |          ``(path_pointer, encoding)`` tuples.
 83  |      
 84  |      :rtype: list(PathPointer)
 85  |  
 86  |  encoding(self, file)
 87  |      Return the unicode encoding for the given corpus file, if known.
 88  |      If the encoding is unknown, or if the given file should be
 89  |      processed using byte strings (str), then return None.
 90  |  
 91  |  open(self, file, sourced=False)
 92  |      Return an open stream that can be used to read the given file.
 93  |      If the file's encoding is not None, then the stream will
 94  |      automatically decode the file's contents into unicode.
 95  |      
 96  |      :param file: The file identifier of the file to read.
 97  |  
 98  |  readme(self)
 99  |      Return the contents of the corpus README file, if it exists.
100  |  
101  |  ----------------------------------------------------------------------
102  |  Data descriptors inherited from nltk.corpus.reader.api.CorpusReader:
103  |  
104  |  root
105  |      The directory where this corpus is stored.
106  |      
107  |      :type: PathPointer

Let's look at the contents of brown, and at how to get the categories and files of the corpus:

>>> from nltk.corpus import brown
>>> brown.categories()   # the categories (genres) of the brown corpus
['adventure', 'belles_lettres', 'editori', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.fileids()[1:10]   # files of the brown corpus (the second through the tenth)
['ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10']
>>> brown.words(categories='news')   # the 'news' category, split into words
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])   # the file cg22, split into words
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])   # several categories, split into sentences
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

Count how often particular words occur within one genre of brown:

from nltk.corpus import brown
import nltk

news_text = brown.words(categories='news')             # the 'news' category, split into words
fdist = nltk.FreqDist(w.lower() for w in news_text)    # frequency distribution of the news words
modals = ['can', 'could', 'may', 'might', 'must', 'will']
# print the count of each modal verb on a single line
print(' '.join('%s: %d' % (m, fdist[m]) for m in modals))

Output:

  can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
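A frequency distribution is essentially a mapping from each word to its count. The sketch below reproduces the same kind of output with collections.Counter from the standard library on a made-up token list; it is an analogy, not a claim about how nltk.FreqDist is built.

```python
from collections import Counter

# Made-up tokens standing in for brown.words(categories='news').
news_text = "He will go . She can go , and he will stay . Will they ?".split()

fdist = Counter(w.lower() for w in news_text)   # word -> number of occurrences
modals = ['can', 'could', 'may', 'might', 'must', 'will']
line = ' '.join('%s: %d' % (m, fdist[m]) for m in modals)
print(line)
```

Lower-casing before counting is what makes "Will" and "will" fall into the same bucket, and a word that never occurs simply counts as 0.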

Tabulate the counts of several words across several genres:

from nltk.corpus import brown
import nltk

cfd = nltk.ConditionalFreqDist(
        (genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

                 can could  may might must will
           news   93   86   66   38   50  389
       religion   82   59   78   12   54   71
        hobbies  268   58  131   22   83  264
science_fiction   16   49    4   12    8   16
        romance   74  193   11   51   45   43
          humor   16   30    8    8    9   13
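A conditional frequency distribution can be pictured as one counter per condition (here, per genre). The following sketch builds that structure from the standard library and prints a small table. The (genre, word) pairs, the tabulate() helper and the 'sports' condition are made up for illustration.

```python
from collections import Counter, defaultdict

pairs = [   # made-up (genre, word) pairs standing in for the brown corpus
    ('news', 'will'), ('news', 'will'), ('news', 'can'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'will'),
]

cfd = defaultdict(Counter)   # condition -> Counter of samples
for genre, word in pairs:
    cfd[genre][word] += 1


def tabulate(cfd, conditions, samples):
    """Return the table as a list of text lines, one per condition."""
    width = max(len(c) for c in conditions)
    lines = [' ' * width + ''.join('%6s' % s for s in samples)]
    for c in conditions:
        counts = cfd.get(c, Counter())   # unknown condition: all zeros
        lines.append(c.rjust(width) + ''.join('%6d' % counts[s] for s in samples))
    return lines


table = tabulate(cfd, ['news', 'romance', 'sports'], ['can', 'could', 'will'])
for line in table:
    print(line)
```

A condition that was never seen, such as the made-up 'sports', produces a row of zeros instead of an error, so a row that is all zeros is worth checking for a misspelled genre name.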

1.6 The CategorizedPlaintextCorpusReader class

Compared with brown (a CategorizedTaggedCorpusReader), reuters (a CategorizedPlaintextCorpusReader) differs mainly in that its categories overlap, since one news story often covers several topics. We can ask for the topics covered by one or more documents, and for the documents that fall into one or more categories.

>>> from nltk.corpus import reuters
>>> reuters.fileids()[1:10]
['test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843']
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.categories('training/9880')
['money-fx']
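The lookups above run in two directions: from files to categories and from categories to files. A minimal sketch of such a two-way mapping, using only the two files shown in the session; the helper functions are made up for illustration and are not the NLTK implementation.

```python
from collections import defaultdict

# File -> categories, copied from the session above (two files only).
file_to_cats = {
    'training/9865': ['barley', 'corn', 'grain', 'wheat'],
    'training/9880': ['money-fx'],
}

# Invert the mapping to get category -> files.
cat_to_files = defaultdict(set)
for fileid, cats in file_to_cats.items():
    for cat in cats:
        cat_to_files[cat].add(fileid)


def categories(fileids):
    """Union of the categories of the given files, sorted."""
    return sorted(set(c for f in fileids for c in file_to_cats.get(f, [])))


def fileids(cats):
    """Union of the files in the given categories, sorted."""
    return sorted(set(f for c in cats for f in cat_to_files.get(c, ())))


print(categories(['training/9865', 'training/9880']))
print(fileids(['corn', 'money-fx']))
print(categories(['corn']))   # a category name is not a file identifier
```

The last call shows why the kind of argument matters: a category name handed to the file-to-category lookup matches no file, so the result is an empty list.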

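For comparison, brown. Both readers inherit categories(fileids) and fileids(categories) from CategorizedCorpusReader, so brown supports the same lookups. The two calls below return empty lists only because their arguments are swapped: categories() receives category names where it expects file identifiers, and fileids() receives file identifiers where it expects category names.

>>> from nltk.corpus import brown
>>> brown.categories(['news', 'reviews'])   # category names passed where file identifiers are expected
[]
>>> brown.fileids(['cr05', 'cr06'])   # file identifiers passed where category names are expected
[]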

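1.7 Basic corpus functions

Example                       Description
fileids()                     the files of the corpus
fileids([categories])         the files of the corpus that belong to these categories
categories()                  the categories of the corpus
categories([fileids])         the categories of the corpus that these files belong to
raw()                         the raw content of the corpus
raw(fileids=[f1,f2,f3])       the raw content of the specified files
raw(categories=[c1,c2])       the raw content of the specified categories
words()                       the words of the whole corpus
words(fileids=[f1,f2,f3])     the words of the specified files
words(categories=[c1,c2])     the words of the specified categories
sents()                       the sentences of the whole corpus
sents(fileids=[f1,f2,f3])     the sentences of the specified files
sents(categories=[c1,c2])     the sentences of the specified categories
abspath(fileid)               the location of the given file on disk
encoding(fileid)              the encoding of the file (if known)
open(fileid)                  open a stream for reading the given corpus file
root                          the path to the root of the locally installed corpus
readme()                      the contents of the README file of the corpus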

1.8 Loading your own corpus

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/Users/rcf/workspace/python/python_test/NLP_WITH_PYTHON/chapter_2'
>>> wordlist = PlaintextCorpusReader(corpus_root, '.*')   # corpus_root: the corpus directory; '.*': a regular expression selecting the file identifiers to include
>>> wordlist.fileids()
['1.py', '2.py', '3.py', '4.py']
>>> wordlist.words('3.py')
['from', 'nltk', '.', 'corpus', 'import', 'brown', ...]
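The second argument to PlaintextCorpusReader is a regular expression over file identifiers, not a file type, as the help text for the fileids parameter earlier in this post states. The sketch below imitates that selection step with the standard library on a throwaway directory. The find_fileids() helper and the file names are made up for illustration, and NLTK's own matching may differ in details.

```python
import os
import re
import tempfile


def find_fileids(root, pattern):
    """Return the paths under `root`, relative to it, that fully match `pattern`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            relpath = os.path.relpath(os.path.join(dirpath, name), root)
            relpath = relpath.replace(os.sep, '/')
            if re.fullmatch(pattern, relpath):
                matches.append(relpath)
    return sorted(matches)


corpus_root = tempfile.mkdtemp()   # throwaway directory for the demo
for name, text in [('1.py', 'import nltk\n'), ('notes.txt', 'hello corpus\n')]:
    with open(os.path.join(corpus_root, name), 'w') as handle:
        handle.write(text)

print(find_fileids(corpus_root, r'.*'))        # every file
print(find_fileids(corpus_root, r'.*\.txt'))   # only the .txt files
```

With '.*' every file in the directory becomes part of the corpus, which is why the session above lists Python scripts as corpus files. A narrower pattern such as '.*\.txt' keeps only text files.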