Source: http://homepages.inf.ed.ac.uk/lzhang10/slm.html
The goal of statistical language modeling is to build a statistical language model that estimates the distribution of natural language as accurately as possible. A statistical language model (SLM) is a probability distribution P(s) over strings s that attempts to reflect how frequently a string s occurs as a sentence.
By expressing various language phenomena in terms of simple parameters of a statistical model, SLMs provide an easy way to deal with complex natural language computationally.
The original (and still the most important) application of SLMs is speech recognition, but SLMs also play a vital role in other natural language applications as diverse as machine translation, part-of-speech tagging, intelligent input methods, and text-to-speech systems.
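To make the definition of P(s) concrete, a model typically factors the sentence probability by the chain rule, P(s) = P(w1) P(w2|w1) ... P(wn|w1..wn-1). A minimal sketch of that scoring step, using a hypothetical hand-written table of conditional probabilities (the table and token names are illustrative assumptions, not from any real model):

```python
import math

# Hypothetical toy conditional probabilities P(w | history), for illustration only.
# <s> and </s> are assumed sentence-boundary markers.
cond_prob = {
    ("<s>",): {"the": 0.5, "a": 0.5},
    ("<s>", "the"): {"cat": 0.6, "dog": 0.4},
    ("<s>", "the", "cat"): {"</s>": 1.0},
}

def sentence_logprob(words):
    """Score a sentence by the chain rule:
    log P(s) = sum_i log P(w_i | w_1 .. w_{i-1})."""
    history = ("<s>",)
    logp = 0.0
    for w in words + ["</s>"]:
        logp += math.log(cond_prob[history][w])
        history = history + (w,)
    return logp

p = math.exp(sentence_logprob(["the", "cat"]))
# P("the cat") = 0.5 * 0.6 * 1.0 = 0.3
```

Real models cannot store a full table over arbitrary histories, which is why the techniques below approximate the conditional distribution in different ways.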
Common SLM techniques:
- N-gram model and variants
- Structural language model
- Maximum entropy language model
- Whole-sentence exponential model
SLM Software
Here is an (incomplete) list of commonly used SLM software that is freely available to the SLM community:
- CMU-Cambridge Statistical Language Modeling toolkit (has not been updated for years)
- SRI Language Modeling Toolkit (contains up-to-date SLM techniques, well maintained)
- N-gram stat
- Trigger Toolkit
- My N-gram extraction tool
SLM References
Some recommended papers on SLM techniques; only papers with an online electronic version are listed. (TODO: sort papers by category)
- Two Decades Of Statistical Language Modeling: Where Do We Go From Here?
- A Maximum Entropy Approach to Adaptive Statistical Language Modeling
- A Maximum Entropy Language Model Integrating N-Grams And Topic Dependencies For Conversational Speech Recognition
- A Structured Language Model
- Aggregate and mixed-order Markov models for statistical language processing
- Combining Nonlocal, Syntactic And N-Gram Dependencies In Language Modeling
- Exploiting Syntactic Structure for Language Modeling
- Improvement of a Whole Sentence Maximum Entropy Language Model Using Grammatical Features
- Language Modeling By Variable Length Sequences: Theoretical Formulation And Evaluation Of Multigrams
- Structure And Performance Of A Dependency Language Model
- A Neural Probabilistic Language Model
- Factored Language Models and Generalized Parallel Backoff