Code in NLTK
NLTK includes the following software modules (~120k lines of Python code):
- Corpus readers
  - interfaces to many corpora
- Tokenizers
  - whitespace, newline, blankline, word, treebank, sexpr, regexp, Punkt sentence segmenter
- Stemmers
  - Porter, Lancaster, regexp
- Taggers
  - regexp, n-gram, backoff, Brill, HMM, TnT
- Chunkers
  - regexp, n-gram, named-entity
- Parsers
  - recursive descent, shift-reduce, chart, feature-based, probabilistic, dependency, CCG, ...
- Semantic interpretation
  - untyped lambda calculus, first-order models, DRT, glue semantics, hole semantics, parser interface
- WordNet
  - WordNet interface, lexical relations, similarity, interactive browser
- Classifiers
  - decision tree, maximum entropy, naive Bayes, Weka interface, megam
- Clusterers
  - expectation maximization, agglomerative, k-means
- Metrics
  - accuracy, precision, recall, windowdiff, distance metrics, inter-annotator agreement coefficients, word association measures, rank correlation
- Estimation
  - uniform, maximum likelihood, Lidstone, Laplace, expected likelihood, heldout, cross-validation, Good-Turing, Witten-Bell
- Miscellaneous
  - unification, chatbots, many utilities
- NLTK-Contrib (less mature)
  - categorial grammar (Lambek, CCG), finite-state automata, Hadoop (MapReduce), Kimmo, readability, textual entailment, timex, TnT interface, inter-annotator agreement
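The tokenizers and stemmers above can be tried without downloading any corpus data; a minimal sketch (the sample sentence is made up):

```python
from nltk.stem import LancasterStemmer, PorterStemmer
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer

text = "The tokenizers split text; the stemmers strip affixes."

# whitespace tokenizer: split on runs of whitespace (keeps punctuation attached)
print(WhitespaceTokenizer().tokenize(text))

# regexp tokenizer: keep only runs of word characters
words = RegexpTokenizer(r"\w+").tokenize(text)

# Porter and Lancaster stem the same words with different aggressiveness
porter, lancaster = PorterStemmer(), LancasterStemmer()
print([(w, porter.stem(w), lancaster.stem(w)) for w in words])
```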
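The regexp, n-gram, and backoff taggers listed above compose: an n-gram tagger can fall back to a rule-based tagger for unseen words. A toy sketch with one hand-made training sentence (the rules and data are illustrative, not part of NLTK):

```python
from nltk.tag import RegexpTagger, UnigramTagger

# rule-based tagger: the first matching pattern wins (illustrative rules)
regexp = RegexpTagger([
    (r".*ing$", "VBG"),  # gerunds
    (r".*ed$", "VBD"),   # simple past
    (r".*", "NN"),       # default: noun
])

# unigram tagger trained on a single hand-made tagged sentence;
# words it has never seen fall through to the regexp backoff
train = [[("the", "DT"), ("dog", "NN"), ("barked", "VBD")]]
tagger = UnigramTagger(train, backoff=regexp)

print(tagger.tag(["the", "dog", "running"]))
```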
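The classifiers all share one interface: train on (feature dict, label) pairs, then classify feature dicts. A naive Bayes sketch on a made-up miniature of the familiar "gender from last letter of a name" example:

```python
from nltk.classify import NaiveBayesClassifier

# toy training data: guess a label from a single feature (made-up counts)
train = [
    ({"last": "a"}, "female"),
    ({"last": "a"}, "female"),
    ({"last": "k"}, "male"),
    ({"last": "o"}, "male"),
]
classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({"last": "a"}))
```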
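The metrics and estimation modules can be exercised on tiny hand-made data; a sketch combining a distance metric, set-based precision/recall, and two of the frequency estimators (the word counts are made up):

```python
from nltk.metrics import edit_distance, precision, recall
from nltk.probability import FreqDist, LidstoneProbDist, MLEProbDist

# distance metric: Levenshtein edit distance
print(edit_distance("kitten", "sitting"))  # 3 edits

# set-based precision and recall against a reference set
reference, test = {"cat", "dog"}, {"dog", "fox"}
print(precision(reference, test), recall(reference, test))

# estimation: unsmoothed vs. smoothed distributions over the same counts
fd = FreqDist(["the", "the", "cat"])
mle = MLEProbDist(fd)            # maximum likelihood: 2/3 for "the"
lid = LidstoneProbDist(fd, 0.5)  # add-0.5 (Lidstone) smoothing
print(mle.prob("the"), lid.prob("the"))
```

Note the smoothed estimator reserves probability mass for unseen words, so `lid.prob` of an unseen word is nonzero while `mle.prob` is zero.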
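The untyped lambda calculus support can be reached through the logic parser; a minimal beta-reduction sketch (the predicate and constant names are made up):

```python
from nltk.sem import Expression

# parse a lambda term applied to a constant, then beta-reduce it
expr = Expression.fromstring(r"(\x.walk(x))(john)")
print(expr.simplify())  # walk(john)
```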
Browse the source code: https://github.com/nltk/nltk/tree/master/nltk
Status: NLTK is automatically tested with Jenkins: http://build.nltk.org/