转:http://blog.csdn.net/pirage/article/details/9467547
LDA理论
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res.,3:993–1022, March 2003.
- 开山之作
Rickjin. LDA数学八卦. 2013.2.8
- 传说中的“上帝掷骰子”的来源之处。这篇文章是一个连载的科普性博客,作者是rickjin,文章分为7个章节,主要5个章节讲得是Gamma函数、Beta/Dirichlet函数、MCMC和Gibbs采样、文本建模、LDA文本建模,对于想要了解和LDA的同学来说,是一篇很好的入门教程,建议结合Blei的开山之作一起看。
LDA优化改进
Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. Fast collapsed gibbs sampling for latent dirichlet allocation. InProceeding of the 14th ACM SIGKDD inter-national conference on Knowledge discovery and data mining, KDD ’08, pages 569–577, New York, NY, USA, 2008. ACM.
- 快速推理算法
Matthew Hoffman, David M. Blei, and Francis Bach. Online learning for latent dirichlet allocation. In NIPS, 2010.
- 在线学习
Arindam Banerjee and Sugato Basu. Topic Models over Text Streams: A Study of Batch and Online Unsupervised Learning. InSDM. SIAM, 2007.
- 文本流推理
Limin Yao, David Mimno, and Andrew McCallum. Efficient methods for topic model inference on stream-ing document collections. In Proceedings of the 15th ACM SIGKDD international conference on Knowl-edge discovery and data mining, KDD ’09, ages 937–946, New York, NY, USA, 2009. ACM.
- 文本流推理
Feng Yan, Ningyi Xu, and Yuan Qi. Parallel inference for latent dirichlet allocation on graphics processing units. InNIPS, 2009.
- 分布式学习
D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed Inference for Latent Dirichlet Allocation. 2007.
- 分布式学习
Zhiyuan Liu, Yuzhou Zhang, Edward Y. Chang, and Maosong Sun. Plda+: Parallel latent dirichlet allocation with data placement and pipeline processing. ACM Trans. Intell. Syst. Technol., 2:26:1–26:18, May 2011.
- 分布式学习
Arthur Asuncion, Padhraic Smyth, and Max Welling. Asynchronous distributed learning of topic models. In NIPS, pages 81–88, 2008.
- 分布式学习
T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1):5228–5235, April 2004.
- 主要介绍LDA的参数优化,经验性alpha,beta取值方法。
LDA变形
1、打破原有可交换的假设
David M. Blei and John D. Lafferty. A correlated topic model of science. AAS, 1(1):17–35, 2007.
- Blei的大作,引入了主题之间的关联
Wei Li and Andrew McCallum. Pachinko allocation: Dag-structured mixture models of topic correlations. InICML, 2006.
- 引入了主题之间的关联
Jonathan Chang and David Blei. Relational topic models for document networks. InAIStats, 2009.
- 引入了文档之间的关联
Xuerui Wang, Andrew McCallum, and Xing Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. InProceedings of the 2007 Seventh IEEE International Conference on Data Mining, pages 697–702, Washington, DC, USA, 2007. IEEE Computer Society
- 考虑了词与词之间的顺序
Yue Lu and Chengxiang Zhai. Opinion integration through semi-supervised topic modeling. InProceeding of the 17th international conference on World Wide Web, WWW ’08, pages 121–130, New York, NY, USA, 2008. ACM
- 在优化公式的基础上的改进,增加先验信息来区别不同的主题
Qiaozhu Mei, Deng Cai, Duo Zhang, and ChengXiang Zhai. Topic modeling with network regularization. InProceeding of the 17th international conference on World Wide Web, WWW ’08, pages 101–110, New York, NY, USA, 2008. ACM.
- 增加规则化因子,引入一些关联信息和验证信息
2、基于非参数贝叶斯方法的变形
Y. W. Teh. Dirichlet processes. InEncyclopedia of Machine Learning. Springer, 2010.
- 基于DIrichlet Process的变形
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
- 基于Dirichlet Process的变形,即HDP模型,可以自动的学习出主题的数目。该方法:① 在一定程度之上解决了主题模型中自动确定主题数目这个问题, ② 代价是必须小心的设定、调整参数的设置, ③ 实际中运行复杂度更高,代码复杂难以维护。 所以在实际中,往往取一个折中,看看自动确定主题数目这个问题对于整个应用的需求到底有多严格,如果经验设定就可以满足的话,就不用采用基于非参数贝叶斯的方法了,但是如果为了引入一些先验只是或者结构化信息,往往非参数是优先选择,例如树状层次的主题模型和有向无环图的主题模型
Yee Whye Teh. A hierarchical bayesian language model based on pitman-yor processes. InProceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Asso-ciation for Computational Linguistics, ACL-44, pages 985–992, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.
- 基于Pitman-Yor Process的非参数贝叶斯方法
Issei Sato and Hiroshi Nakagawa. Topic models with power-law using pitman-yor process. InProceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’10, pages 673–682, New York, NY, USA, 2010. ACM.
- 基于Pitman-Yor Process的非参数贝叶斯方法
3、从无结构化信息到结构化或者半结构化的信息
David M. Blei and Jon D. McAuliffe. Supervised topic models. InNIPS, 2007.
- 提出关联主题学习和响应变量(例如用户评分,文档类标,或者时间标签等)
David Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with dirichlet-multinomial regression. InUAI, 2008.
- 通过引入一个log-linear先验在文档-主题分布上,可以将主题抽取关联到文档的metadata等等多种特征(例如作者、时间等)
LDA应用
1、情感分析
Ivan Titov and Ryan McDonald. Modeling online reviews with multi-grain topic models. In Proceeding of the 17th international conference on World Wide Web, WWW ’08, pages 111–120, New York, NY, USA, 2008. ACM.
- 从用户评论数据中进行无监督主题抽取,考虑了一个多级背景主题模型:词~句子~段落~文档,解决了传统LDA模型提出的主题往往对应品牌而不是可以ratable的主题。
Ivan Titov and Ryan McDonald. A joint model of text and aspect ratings for sentiment summarization. In Proceedings of ACL-08: HLT, pages 308–316, Columbus, Ohio, June 2008. Association for Computational Linguistics.
- 本文将一些具有结构化信息的特征融入到主题模型中,具体来说,我们同时关联两个生成过程,一个就是文档中词的生成,另一个就是这些结构化特征的生成。
Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, and ChengXiang Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. InProceedings of the 16th international conference on World Wide Web, WWW ’07, pages 171–180, New York, NY, USA, 2007. ACM.
- 本文考虑区分情感和主题两种不同类型的词汇,进而同时抽取主题和观点。对所有的主题,设定一系列共有的情感语言模型。
Chenghua Lin and Yulan He. Joint sentiment/topic model for sentiment analysis. InProceeding of the 18th ACM conference on Information and knowledge management, CIKM ’09, pages 375–384, New York, NY, USA, 2009. ACM.
- 在抽取主题词汇和情感词汇之后,通过计算每个文档整体的情感倾向,进行文档情感分类。
Xin Zhao, Jing Jiang, Hongfei Yan, and Xiaoming Li. Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid. InProceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 56–65, Cambridge, MA, October 2010. Association for Computational Linguistics.
- 如何生成基于主题的情感摘要
2、学术文章挖掘
Michal Rosen-Zvi, Tom Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. InUAI, 2004.
- 从作者的角度考虑文档主题的生成。对于每一个作者不再限定该作者只能对应一个主题,而是对应于一个主题上的分布。
Ramesh M. Nallapati, Amr Ahmed, Eric P. Xing, and William W. Cohen. Joint latent topic models for text and citations. InProceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, pages 542–550, New York, NY, USA, 2008. ACM
- 第一个提出同事对于主题和参考引用进行建模的文章。其基本思想就是首先分别对于文本采用之前标准的主题模型的生成方式,然后对于任何一对具有引用关系的文档对,根据文档-主题分布的相似性生成引用链接关系。将文本和链接之间的关联通过主题分布建立对应关系。
T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1):5228–5235, April 2004.
- 基于主题模型来进行学术语料分析。主要分析了哪些主题是热主题,哪些主题是冷主题,然后分析了这些主题随着时间的发展的强度变化,其中主要使用了平均的文档-主题分布来计算强度。
Ding Zhou, Xiang Ji, Hongyuan Zha, and C. Lee Giles. Topic evolution and social interactions: how authors effect research. InProceedings of the 15th ACM international conference on Information and knowledge management, CIKM ’06, pages 248–257, New York, NY, USA, 2006. ACM.
- 分析了作者是如何影响主题进化的。文章认为主题的进化是作者与作者之间的交互带来的,提出了一个马尔科夫模型来对于这种机遇作者交互的话题交互进行建模,并且应用这个模型分析了一些有意思的问题,如对于一个给定的主题进化,是哪些作者主导了变化呢?
Gideon S. Mann, David Mimno, and Andrew McCallum. Bibliometric impact measures leveraging topic analysis. InProceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, JCDL ’06, pages 65–74, New York, NY, USA, 2006. ACM.
- 提出了使用主题模型进行细粒度的文章影响力的分析,主要是提出了一些计算度量来对学术研究进行分析,例如引用数目、主题影响因子等等。
3、社会媒体
Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. Comparing twitter and traditional media using topic models. InECIR, pages 338–349, 2011.
- 提出了一种用于短文本的Twtter-LDA模型
4、时序文本流
David M. Blei and John D. Lafferty. Dynamic topic models. In ICML, 2006.
- Blei先生提出的动态LDA,主要是将时间离散化,采用批处理的方法
Xuerui Wang and Andrew McCallum. Topics over time: a non-markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 424–433, New York, NY, USA, 2006. ACM.
- 本文认为对于一个文本,除了文本信息可见,标签信息也是可见的,然后通过主题分布信息来同时关联起来词汇和时间标签。
Xuanhui Wang, ChengXiang Zhai, Xiao Hu, and Richard Sproat. Mining correlated bursty topic patterns from coordinated text streams. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007
- 提出侦测多个流中的bursty主题,采用两个方法:使用前后时间段内部主题进行平滑;不同流之间主题相互加强。
Qiaozhu Mei and ChengXiang Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. InProceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD ’05, pages 198–207, New York, NY, USA, 2005. ACM.
- 显式的提出了文本流中的主题时序进行分析。本文使用了相对简单的方法,首先划分数据段,然后分段学习得到主题集合,然后根据在连续的两个时间段内的主题相似度对其建立链接关系。
5、网络结构数据
Jonathan Chang and David Blei. Relational topic models for document networks. InAIStats, 2009.
- Blei提出的Relational topic model(RTM)模型。
Qiaozhu Mei, Deng Cai, Duo Zhang, and ChengXiang Zhai. Topic modeling with network regularization. InProceeding of the 17th international conference on World Wide Web, WWW ’08, pages 101–110, New York, NY, USA, 2008. ACM.
- 提出了使用网络规则化因子在主题模型中