17 Great Machine Learning Libraries
After wonderful feedback on my previous post on Scikit-learn from the guys at /r/MachineLearning, I decided to collect the list of machine learning libraries into this separate note. Let me know if there’s a library that should be included here.
Update (15 May 2014): thanks to Djalel Benbouzid and Dwayne Campbell for additional suggestions. Sorry it’s taken me so long to add them…
Python
- Scikit-learn: comprehensive and easy to use, I wrote a whole article on why I like this library.
- PyBrain: neural networks are the one thing missing from Scikit-learn, but this module makes up for it.
- nltk: really useful if you’re doing anything NLP or text mining related.
- Theano: efficient computation of mathematical expressions on the GPU. Excellent for deep learning.
- Pylearn2: machine learning toolbox built on top of Theano - in very early stages of development.
- MDP (Modular toolkit for Data Processing): a framework that is useful when setting up workflows.
Java
- Spark: Apache’s new upstart, supposedly up to a hundred times faster than Hadoop. It now includes MLlib, which contains a good selection of machine learning algorithms, including classification, clustering and recommendation generation. Currently undergoing rapid development. Development can be in Python as well as JVM languages.
- Mahout: Apache’s machine learning framework built on top of Hadoop. This looks promising, but comes with all the baggage and overhead of Hadoop.
- Weka: this is a Java based library with a graphical user interface that allows you to run experiments on small datasets. This is great if you restrict yourself to playing around to get a feel for what is possible with machine learning. However, I would avoid using this in production code at all costs: the API is very poorly designed, the algorithms are not optimised for production use and the documentation is often lacking.
- Mallet: another Java based library with an emphasis on document classification. I’m not so familiar with this one, but if you have to use Java this is bound to be better than Weka.
- JSAT: stands for “Java Statistical Analysis Tool”. It was created by Edward Raff and was born out of his frustration with Weka (I know the feeling). Looks pretty cool.
.NET
- Accord.NET: this seems to be pretty comprehensive, and comes recommended by primaryobjects on Reddit. There is perhaps a slight slant towards image processing and computer vision, as it builds on the popular library AForge.NET for this purpose.
- Another option is to use one of the Java libraries compiled to .NET using IKVM - I have used this approach with success in production.
C++
- Vowpal Wabbit: designed for very fast learning and released under a BSD license, this comes recommended by terath on Reddit.
- MultiBoost: a fast C++ framework implementing some boosting algorithms as well as some cascades (like the Viola-Jones cascades). It’s mainly focused on AdaBoost.MH so it is multi-class/multi-label.
- Shogun: large machine learning library with a focus on kernel methods and support vector machines. Bindings to Matlab, R, Octave and Python.
General
- LibSVM and LibLinear: these are C libraries for support vector machines; there are also bindings or implementations for many other languages. These are the libraries used for support vector machine learning in Scikit-learn.
Conclusion
This article is a work in progress, so please send me your comments or criticisms!
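Over the past couple of days I have started experimenting with open-source ML libraries. There are plenty to choose from, such as Torch, MLC, Weka (Java-based), Waffles, Shark, scikit, opencv-ml and so on. After comparing their strengths and weaknesses, I decided to work with the following:
1. Shark, C++-based
2. scikit, Python-based
3. Weka, Java-based
4. opencv-ml, C++-based; widely used in image processing, and I have worked with it before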
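It took me an afternoon to get Shark installed and configured. The library feels quite powerful: it covers most of the commonly used ML algorithms, and because it is written in C++ it is comfortable for me to use.
Environment: Win32, VS10
There are very few articles online about installing Shark. What follows is largely based on these posts (thanks to the author for sharing):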
http://www.cnblogs.com/xiangwengao/archive/2013/05/04/3059632.html
http://www.cnblogs.com/xiangwengao/archive/2013/05/01/3052821.html
http://www.cnblogs.com/xiangwengao/archive/2013/05/01/3052827.html
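Part 1. Shark: Getting the Right Download
Two articles describe installation methods that do not work. The download locations they give for Shark are both problematic: either unusable or unreachable (I have verified this myself).
Bad source 1: http://www.iteye.com/news/27669. This one is seriously wrong: the SVN checkout it points to is the development version, which is sometimes missing files, so the Visual Studio build fails and the library ends up unusable. When I followed the SVN route, the LinAlg files were missing and nothing worked. I strongly advise against it.
Bad source 2: http://shark-project.sourceforge.net/. The files cannot be found there; the address stopped working long ago. The installation and usage instructions later in that article are passable.
Correct download location: https://sourceforge.net/projects/shark-project/files/Shark%20Core/ (download the zip file and install from it).
Version: 2.3.4
Shark is built with CMake and requires the C++ Boost libraries. Details follow.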
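Part 2. Shark: Installation
The home page of the Shark Machine Learning Library is http://shark-project.sourceforge.net/. Shark was developed at Ruhr-Universität Bochum in Germany and won a gold prize in a 2011 worldwide open-source competition. It is built on C++ generic programming and makes heavy use of templates, so its encapsulation and inheritance are very good. Because it is C++, the functions are also reasonably efficient.
The Shark library is divided into four main parts:
- ReClaM: regression and classification module, covering linear methods, neural networks, SVMs, kernels and more
- EALib: evolutionary computation module
- MOO-EALib: multi-objective evolutionary computation
- Fuzzy: fuzzy computation module
OK, let's begin the installation. Shark can be installed on Windows, Linux and Mac; this post covers Microsoft Windows. Note that the downloaded Shark package contains installation notes for the various platforms under Shark/doc/TutorialsOld/, but they are rather old.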
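Step 1: prepare the tools and generate the build files. You need the cross-platform build tool CMake v2.8 and Microsoft Visual Studio 2005 or later. My Shark package is in D:/shark, so in CMake I set the source directory to D:/shark and the build directory to D:/build_shark.
Click the Configure button, select the compiler we need (VS2005), then click Generate and wait for it to finish.
Now look in D:/build_shark: CMake has generated the build files that VS2005 needs.
Step 2: compile and link with VS2005 to obtain the shark.lib static library we are after.
Double-click shark.sln in the build_shark folder to open the solution in VS2005.
Here you can see all the example projects that ship with Shark, plus the project for shark.lib. Choose "Build" -> "Rebuild Solution" from the toolbar and VS2005 will build every example program. There are a lot of examples, so the whole process may take several minutes; go and have a cup of tea and be patient. I chose a full rebuild because I wanted to demonstrate the example programs. You can instead build only the projects you need: for example, building shark.vcproj alone produces shark.lib.
A word of praise for German thoroughness: 70 projects in an open-source library, and all of them compiled on the first try without a single error. Fine workmanship.
OK, once the build is done, the build_shark folder contains several new folders. The examples folder holds all the example programs; they have not been run or debugged yet, so pick whichever you need and try it yourself. The one to watch is the Debug folder, where we finally find what we need: shark.lib
(You can repeat the build for Release as well.)
In the next post I will explain how to bring the shark.lib we just built into our own project and run an example.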
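Part 3. Shark: Running an Example
In the previous post we ended up with shark.lib, the static library of the Shark Machine Learning Library. This post continues from there and uses that library to run one of Shark's bundled examples in VS2005. The example is called "TSP_GA"; as the name suggests, it uses a genetic algorithm to solve the travelling salesman problem (TSP).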
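To give a feel for what "a genetic algorithm for TSP" means, here is a minimal standard-library sketch. It is not the code in TSP_GA.cpp and it does not use Shark's EALib API. The names `City`, `Tour`, `tour_length`, `order_crossover`, `tournament` and `solve_tsp_ga` are hypothetical, and the operators (order crossover, swap mutation, binary tournament selection, one elite per generation) and default parameters are my own assumptions; the Shark example may well make different choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

struct City { double x, y; };
using Tour = std::vector<std::size_t>;  // a permutation of city indices

// Length of the closed tour, including the edge back to the first city.
double tour_length(const std::vector<City>& cities, const Tour& tour) {
    double total = 0.0;
    for (std::size_t i = 0; i < tour.size(); ++i) {
        const City& a = cities[tour[i]];
        const City& b = cities[tour[(i + 1) % tour.size()]];
        total += std::hypot(a.x - b.x, a.y - b.y);
    }
    return total;
}

// Order crossover: keep a random slice of parent a, then fill the remaining
// positions with the missing cities in the order they appear in parent b.
Tour order_crossover(const Tour& a, const Tour& b, std::mt19937& rng) {
    const std::size_t n = a.size();
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    std::size_t lo = pick(rng);
    std::size_t hi = pick(rng);
    if (lo > hi) std::swap(lo, hi);
    Tour child(n);
    std::vector<bool> used(n, false);
    for (std::size_t i = lo; i <= hi; ++i) {
        child[i] = a[i];
        used[a[i]] = true;
    }
    std::size_t pos = (hi + 1) % n;
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t city = b[(hi + 1 + k) % n];
        if (!used[city]) {
            child[pos] = city;
            used[city] = true;
            pos = (pos + 1) % n;
        }
    }
    return child;
}

// Binary tournament: of two randomly chosen tours, the shorter one wins.
const Tour& tournament(const std::vector<Tour>& pop, const std::vector<double>& len,
                       std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, pop.size() - 1);
    const std::size_t i = pick(rng);
    const std::size_t j = pick(rng);
    return len[i] <= len[j] ? pop[i] : pop[j];
}

// Evolve random tours and return the shortest one in the final population.
Tour solve_tsp_ga(const std::vector<City>& cities, std::size_t pop_size = 60,
                  std::size_t generations = 200, double mutation_rate = 0.2,
                  unsigned seed = 42) {
    assert(cities.size() >= 2 && pop_size >= 2);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<std::size_t> pick(0, cities.size() - 1);

    // Initial population: random permutations of the city indices.
    std::vector<Tour> pop(pop_size, Tour(cities.size()));
    for (Tour& t : pop) {
        std::iota(t.begin(), t.end(), std::size_t{0});
        std::shuffle(t.begin(), t.end(), rng);
    }

    std::vector<double> len(pop_size);
    for (std::size_t g = 0; g < generations; ++g) {
        for (std::size_t i = 0; i < pop_size; ++i) len[i] = tour_length(cities, pop[i]);
        const auto best = static_cast<std::size_t>(
            std::min_element(len.begin(), len.end()) - len.begin());

        std::vector<Tour> next;
        next.reserve(pop_size);
        next.push_back(pop[best]);  // elitism: the best tour always survives
        while (next.size() < pop_size) {
            const Tour& mum = tournament(pop, len, rng);
            const Tour& dad = tournament(pop, len, rng);
            Tour child = order_crossover(mum, dad, rng);
            if (coin(rng) < mutation_rate) {  // swap mutation
                const std::size_t i = pick(rng);
                const std::size_t j = pick(rng);
                std::swap(child[i], child[j]);
            }
            next.push_back(std::move(child));
        }
        pop = std::move(next);
    }
    return *std::min_element(pop.begin(), pop.end(), [&cities](const Tour& a, const Tour& b) {
        return tour_length(cities, a) < tour_length(cities, b);
    });
}
```

To try it, call `solve_tsp_ga` on a vector of `City` values from your own `main` and pass the result to `tour_length`. Because the best tour of each generation is copied over unchanged, the best length never gets worse from one generation to the next.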
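OK, let's begin.
Step 1: go to Shark\examples\EALib and find the source file used in this post, TSP_GA.cpp. Create a new project, and in the project directory create two folders, one called include and one called lib, to hold Shark's header files and the link library respectively.
Step 2: add the static library and the header include path to the project. Click "Project" -> "Properties", select "C/C++" -> "General", and add the path to the header files under Additional Include Directories.
Next, click "Linker" -> "General" and add the folder containing shark.lib under Additional Library Directories.
Then click "Linker" -> "Input" and enter the library name (shark.lib).
OK, at this point the project's link library and header files are all set up.
Step 3: run the TSP_GA project. Success! Congratulations, you have successfully installed the Shark library.
One note: because this is a console application, the window may flash and disappear as soon as the program finishes. A small trick is to add getchar(); as the last statement of the program, so that it only exits after you press Enter.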
Summary: the installation went fairly smoothly. Installation on Linux is to be continued.