[Repost] Roundup: LDA Theory, Variants, Optimization, Applications, and Toolkits
Reposted from: http://site.douban.com/204776/widget/notes/12599608/note/287085506/
#LDA Theory
——A roundup of Topic Model papers
http://site.douban.com/204776/widget/notes/12599608/note/286839088/
##Survey:
1. 基于文档主题结构的关键词抽取方法研究 (Keyword Extraction Based on Document Topic Structure)
Liu Zhiyuan's PhD thesis; as I recall, he is the author behind the Weibo keyword-extraction application of that period.
It also proposes some improved methods for short texts.
2. Parameter estimation for text analysis
This one is an absolute heavyweight.
##Short-Text:
1. Automatic Keyphrase Extraction by Bridging Vocabulary Gap
##Practice / In Action (especially in Chinese)
1. A new method of N-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese
2. A Statistical Approach to Extract Chinese Chunk Candidates from Large Corpora
Statistical Substring Reduction in Linear Time
3. The Mathematics of Statistical Machine Translation: Parameter Estimation
##Anecdote:
LDA数学八卦 (LDA Math Gossip)
Written by rickjin, serialized on the COS (Capital of Statistics) site.
http://vdisk.weibo.com/s/qghK5
##LDA variation:
One researcher has recently done extremely strong work summarizing the various LDA variants,
in two of her recent papers:
1. On the design of LDA models for aspect-based opinion mining
2. The FLDA model for aspect-based opinion mining: addressing the cold start problem (WWW'13)
##A bundle of nearly all the LDA papers I have read
Some of them are annotated (marked "-noted"):
It includes some of the papers mentioned above, but far more than that.
You can go straight to the "noted" folder inside; I did not find the un-annotated ones useful.
http://vdisk.weibo.com/s/BA3xC
#LDA Optimization
——A roundup of papers on optimized LDA implementations
http://site.douban.com/204776/widget/notes/12599608/note/286923972/
These are quite valuable in practice: text collections can get very large, so implementation-level optimization becomes essential.
Fast inference algorithms:
Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation
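For context, the baseline that papers like this accelerate is the standard collapsed Gibbs sampler, whose per-token cost is O(K) in the number of topics. Below is a minimal stdlib-only Python sketch of that baseline (the function name, hyperparameter defaults, and toy data are my own illustration, not taken from any of the papers):

```python
import random
from collections import defaultdict

def collapsed_gibbs_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Plain collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of integer word ids.
    Returns the final topic assignment z for every token.
    """
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})  # vocabulary size

    n_dk = [defaultdict(int) for _ in docs]               # doc -> topic counts
    n_kw = [defaultdict(int) for _ in range(num_topics)]  # topic -> word counts
    n_k = [0] * num_topics                                # tokens per topic

    # Random initialization of topic assignments.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment from the counts.
                n_dk[d][t] -= 1; n_kw[t][w] -= 1; n_k[t] -= 1
                # Full conditional p(z = k | rest), up to a constant.
                weights = [(n_dk[d][k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + V * beta)
                           for k in range(num_topics)]
                # Sample a new topic proportional to the weights.
                r = rng.random() * sum(weights)
                new_t = num_topics - 1
                for k, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        new_t = k
                        break
                z[d][i] = new_t
                n_dk[d][new_t] += 1; n_kw[new_t][w] += 1; n_k[new_t] += 1
    return z

# Toy demo: three tiny "documents" over a 4-word vocabulary.
docs = [[0, 0, 1, 1], [2, 2, 3, 3], [0, 1, 0, 1]]
z = collapsed_gibbs_lda(docs, num_topics=2, iters=20)
```

The FastLDA paper above keeps this sampler exact but avoids computing all K weights for most tokens by exploiting the skewness of the count distributions.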
Online learning:
Online Learning for Latent Dirichlet Allocation
http://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf
http://videolectures.net/nips2010_hoffman_oll/
www.ece.duke.edu/~lcarin/Lingbo4.15.2011.pptx
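In rough summary (notation recalled from the Hoffman et al. paper), online LDA replaces batch variational EM with stochastic natural-gradient steps on the topic-word variational parameters \(\lambda\): each minibatch \(B_t\) yields an estimate \(\hat\lambda_t\), computed as if \(B_t\) (suitably rescaled) were the whole corpus, and the new state is a decaying blend of old and new:

```latex
\lambda^{(t)} \leftarrow (1-\rho_t)\,\lambda^{(t-1)} + \rho_t\,\hat\lambda_t,
\qquad \rho_t = (\tau_0 + t)^{-\kappa}, \quad \kappa \in (0.5, 1]
```

The condition on \(\kappa\) keeps the step sizes within the Robbins-Monro conditions, so the procedure converges to a local optimum of the batch variational objective while touching each document only once.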
Inference algorithms for text streams:
Topic models over text streams: a study of batch and online unsupervised learning
Efficient Methods for Topic Model Inference on Streaming Document Collections
Distributed learning:
Distributed Inference for Latent Dirichlet Allocation
PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing
#LDA Applications
——LDA variants by application
http://site.douban.com/204776/widget/notes/12599608/note/286930572/
A few LDA variants across different applications; each makes subtle adjustments, and each brings new problems of its own.
##Sentiment Analysis
Opinion Integration Through Semi-supervised Topic Modeling
Extends the traditional topic model, the archetypal unsupervised method, into a semi-supervised one by adding prior information to the model: for some automobile products, descriptions of each feature are extracted from Wikipedia and then trained into priors.
Jointly Modeling Aspects and Opinions with a MaxEnt-LDA Hybrid
Jointly extracts aspects and opinions. Introduces a supervised (maximum-entropy) component to distinguish aspect words from sentiment words, then uses LDA for further clustering.
##Academic Mining
For example, author modeling (which also appeared at KDD 2013 this year), or detecting emerging research hotspots…
The author-topic model for authors and documents
Models authors and topics jointly: each word in a document is attributed to a single one of its authors, each author is a distribution over topics, and author-topic distributions replace the usual document-topic distributions.
Joint latent topic models for text and citations
Models topics and citations jointly, building links along citation relations.
Detecting Topic Evolution in Scientific Literature: How Can Citations Help?
Uses citation information to build a model of topic evolution.
##Social Media Topics
There is far too much research on Twitter, and the SNA section of this site has already summarized a lot of it, so I will not write more here.
#LDA Toolkits
——A roundup of LDA toolkits
http://site.douban.com/204776/widget/notes/12599608/note/287084873/
(This part is still missing R; I will comment on that once I have tried it myself.)
First, a nicely formatted link (though incomplete):
http://mengjunxie.github.io/ae-lda/topic-modeling.html
####
Latent Dirichlet allocation
http://www.cs.princeton.edu/~blei/lda-c/
This is a C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data. LDA allows you to analyze a corpus and extract the topics that combined to form its documents. For example, the site shows topics estimated from a small corpus of Associated Press documents. LDA is fully described in Blei et al. (2003).
####
Discrete Component Analysis
http://www.nicta.com.au/people/buntinew/discrete_component_analysis
The Discrete Component Analysis (DCA) software is being developed as a stand-alone package, and as a plug-in to the Elefant system, a machine learning toolbox from NICTA. Currently the software is being run in stand-alone mode using the data streaming libraries from the older and now unsupported MPCA system, developed at Helsinki Institute for IT. The software itself is written in the C language and compiles on a Linux and a Mac OS X environment.
The models presented here are known under many names, such as latent Dirichlet allocation, multi-aspect models, multinomial PCA, and non-negative matrix factorisation.
####
Infinite LDA
http://www.arbylon.net/projects/knowceans-ilda/readme.txt
https://bitbucket.org/gchrupala/colada/wiki/Resources
Implementations of Latent Dirichlet Allocation (LDA) and
Hierarchical Dirichlet Processes (HDP)
@author Gregor Heinrich, gregor :: arbylon : net
@version 0.96
@date 1 Mar 2011
- History: ILDA version 0.1: May 2008, LDA version 0.1: Feb. 2005, based
on http://arbylon.net/projects/LdaGibbsSampler.java
- Simple implementations of Gibbs sampling for LDA and HDP
- Scientific documentation: see texts lda.pdf and ilda.pdf
- Technical documentation: see Javadoc and source (packages *.corpus and
*.utils are from knowceans-tools on SourceForge)
- Data documentation: see nips/readme.txt including source references
- License: All code is licensed under GPL v3.0.
- If the code is used in scientific work, please refer to its source
via the URL:
http://arbylon.net/projects/knowceans-ilda.zip
or the documentation of the ILDA or LDA implementations:
G. Heinrich. "Infinite LDA" -- implementing the HDP with minimum code
complexity. TN2011/1, http://arbylon.net/publications/ilda.pdf, 2011
G. Heinrich. Parameter estimation for text analysis. Technical report,
No. 09RP008-FIGD, Fraunhofer IGD, 2009
TODO:
- Diverse checks, e.g., Antoniak distribution sampling, hyperparameter
estimators, general quantitative validation of HDP model
- Output formatting
- Visual matrix implementation for HDP / IldaGibbs
####
MAchine Learning for LanguagE Toolkit
http://mallet.cs.umass.edu/
MALLET is open source software. For research use, please remember to cite MALLET.
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.
####
Multithreaded LDA
https://sites.google.com/site/rameshnallapati/software
Multithreaded extension of Blei's LDA implementation, by Ramesh Nallapati. Speeds up the computation by orders of magnitude depending on the number of processors.
####
GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation
http://gibbslda.sourceforge.net/
GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling technique for parameter estimation and inference. It is very fast and is designed to analyze hidden/latent topic structures of large-scale datasets including large collections of text/Web documents. LDA was first introduced by David Blei et al [Blei03]. There have been several implementations of this model in C (using Variational Methods), Java, and Matlab. We decided to release this implementation of LDA in C/C++ using Gibbs Sampling to provide an alternative to the topic-model community.
GibbsLDA++ is useful for the following potential application areas:
Information retrieval and search (analyzing semantic/latent topic/concept structures of large text collection for a more intelligent information search).
Document classification/clustering, document summarization, and text/web mining community in general.
Content-based image clustering, object recognition, and other applications of computer vision in general.
Other potential applications in biological data.
####
Gensim
http://radimrehurek.com/gensim/
Gensim is a FREE Python library
Scalable statistical semantics
Analyze plain-text documents for semantic structure
Retrieve semantically similar documents
####
Stanford Topic Modeling Toolbox
http://nlp.stanford.edu/software/tmt/tmt-0.4/
The Stanford Topic Modeling Toolbox (TMT) brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features the ability to:
Import and manipulate text from cells in Excel and other spreadsheets.
Train topic models (LDA, Labeled LDA, and PLDA) to create summaries of the text.
Select parameters (such as the number of topics) via a data-driven process.
Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data.
The Stanford Topic Modeling Toolbox was written at the Stanford NLP group by:
Daniel Ramage and Evan Rosen, first released in September 2009.
####
Matlab Topic Modeling Toolbox 1.4
http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
Installation & Licensing
Download the zipped toolbox (18Mb).
NOTE: this toolbox now works with 64 bit compilers. If you are looking for the old version of this toolbox that has the code for 32 bit compilers, download this version
The program is free for scientific use. Please contact the authors, if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author. By using this software, you are agreeing to this license statement.
Type 'help function' at command prompt for more information on each function
Read these notes on data format for a description on the input and output format for the different topic models
Note for MAC and Linux users: some of the Matlab functions are implemented with mex code (C code linked to Matlab). For windows based platforms, the dll's are already provided in the distribution package. For other platforms, please compile the mex functions by executing "compilescripts" at the Matlab prompt
#
Last of all,
here is a Topic Modeling Bibliography:
http://www.cs.princeton.edu/~mimno/topics.html