Code and Data
2013-11-01 09:37:14
Computer Science Department
University of Massachusetts Amherst
Code:
FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference. It is flexible, supporting multiple modeling and inference paradigms. Its original emphasis was on conditional random fields, undirected graphical models, MCMC inference, online training, and discriminative parameter estimation. However, it now also supports directed generative models (such as latent Dirichlet allocation), and has preliminary support for variational inference, including belief propagation and mean-field methods. It is also scalable, with demonstrated success on problems with many millions of variables and factors, and on models that have changing structure, such as case factor diagrams. It has also been plugged into a database back-end, representing a new approach to probabilistic databases capable of handling billions of variables.
MALLET is a library of Java code for machine learning applied to text. It provides facilities not only for document classification, but also for information extraction, part-of-speech tagging, noun phrase segmentation, and much more. The library is quite mature; however, its front-ends and documentation are not yet as polished as those of rainbow.
Cora HMM is the C implementation of HMMs used for information extraction in Cora. It was written by Kristie Seymore.
RLKIT is a software library that makes it easy to test various reinforcement learning algorithms in different environments with different sensory-motor systems. It is implemented in Objective-C and GNU Guile (Scheme).
Data:
SRAA: Simulated/Real/Aviation/Auto UseNet data [document classification]
73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation. I have often used this data for binary classification (separating real from simulated, and auto from aviation), making the point that the same data can be classified in different ways depending on the user's needs. This is especially interesting for semi-supervised learning. This data was gathered by Andrew McCallum while at Just Research.
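As a rough illustration of classifying the same articles in two different ways, here is a minimal Python sketch. The group keys (`sim_auto`, `sim_aviation`, `real_auto`, `real_aviation`), the helper names, and the sample texts are all hypothetical placeholders, not the actual newsgroup names or the layout of the distributed data.

```python
# Hypothetical group keys; the real newsgroup names are not listed here.
GROUPS = ("sim_auto", "sim_aviation", "real_auto", "real_aviation")


def real_vs_simulated(group):
    """Binary label 1: is the group about real or simulated vehicles?"""
    return "real" if group.startswith("real_") else "simulated"


def auto_vs_aviation(group):
    """Binary label 2: is the group about autos or aviation?"""
    return "auto" if group.endswith("_auto") else "aviation"


def label_articles(articles, labeler):
    """Attach a binary label to each (text, group) pair using the given labeler."""
    return [(text, labeler(group)) for text, group in articles]


# Made-up sample articles, one per group, for illustration only.
articles = [
    ("tuning the steering wheel in my racing sim", "sim_auto"),
    ("best joystick for a flight simulator", "sim_aviation"),
    ("replacing brake pads on an old sedan", "real_auto"),
    ("preparing for my first solo cross-country flight", "real_aviation"),
]

# The same four articles yield two different binary classification tasks.
by_reality = label_articles(articles, real_vs_simulated)
by_vehicle = label_articles(articles, auto_vs_aviation)
```

Both label lists come from the same texts; only the grouping of the four source groups into two classes changes.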
Cora Citation Matching [reference matching, object correspondence]
Text of citations hand-clustered into groups referring to the same paper.
Cora Research Paper Classification [relational document classification]
Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set, because the citations provide relations among papers.
Cora Information Extraction [information extraction]
Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.
Frequently Asked Questions [information extraction]
Several UseNet FAQs segmented into questions and answers. Data gathered and labeled by Dayne Freitag and Andrew McCallum.
CMU Seminar Announcements [information extraction]
48 emailed seminar announcements, with labeled segments for speaker, title, start-time, end-time. Labeled by Dayne Freitag.
Industry Sector [document classification]
Corporate web pages classified into a topic hierarchy with about 70 leaves.
20 Newsgroups [document classification]
About 20,000 UseNet postings from 20 newsgroups. Gathered by Ken Lang at CMU in the mid-1990s. This is the original set, without the various edits made by Jason Rennie and others.
Leda C++
JGraphT, JGraph
LAW
Laboratory for Web Algorithmics provides large graphs compressed using LLP + WebGraph