公开数据来源
- 公开的数据来源:http://www.datatang.com/数据堂的数据非常丰富,包括各种行业数据,电信,零售,金融,银行等等,特别适用于数据挖掘。
- UCI是最经典的,不过也比较古老http://archive.ics.uci.edu/ml/
- 数据堂最近异军突起,非常值得称赞
- 国外还有一些网站,比如http://mlcomp.org/,http://mldata.org/你可以看看
- 另外KDDCUP每年都会针对一个特定的问题进行比赛,数据集也是公开的
- 最近几年,数据挖掘的比赛越来越多了,你可以去PASCAL上看看你感兴趣的领域,自己搜索一下
- http://www.delicious.com/pskomoroch/dataset这个是delicious上面一个人搜集的数据集网站书签,比较杂,或许你能找到你所要的(话说delicious改版之前这个里面的内容比现在的多多了)
- 再有就是看具体的做的内容,然后看相关学者都用什么数据集,除了LDC那种变态组织,其他很多数据都可以通过track论文中的信息或者是作者主页上的信息下载到的
- 做数据挖掘和数据分析都是针对某一个领域或者问题去做,其实也看那个领域会不会有开放的心态去公开数据,前两年在Hans Rosling老先生在TED上公开呼吁之后,很多机构,包括联合国都公开了自己的数据
- Quora上有人问过类似的问题:Where can I get large datasets open to the public?
问题链接:http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
该页面的Answer Wiki列举了数十个数据来源,现在搬运如下:
Cross-disciplinary data repositories, data collections and data search engines:
- http://aws.amazon.com/datasets
- http://databib.org
- http://datacite.org
- http://figshare.com
- http://linkeddata.org
- http://reddit.com/r/datasets
- http://thedatahub.org alias http://ckan.net
Single datasets and data repositories
http://archive.ics.uci.edu/ml/
http://crawdad.org/
http://data.austintexas.gov
http://data.cityofchicago.org
http://data.govloop.com
http://data.gov.uk/
http://data.medicare.gov
http://data.seattle.gov
http://data.sfgov.org
http://data.sunlightlabs.com
https://datamarket.azure.com/
http://developer.yahoo.com/geo/g...
http://econ.worldbank.org/datasets
http://en.wikipedia.org/wiki/Wik...
http://factfinder.census.gov/ser...
http://ftp.ncbi.nih.gov/
http://gettingpastgo.socrata.com
http://googleresearch.blogspot.c...
http://books.google.com/ngrams/
http://medihal.archives-ouvertes.fr
http://public.resource.org/
http://rechercheisidore.fr
http://snap.stanford.edu/data/in...
http://timetric.com/public-data/
https://wist.echo.nasa.gov/~wist...
http://www2.jpl.nasa.gov/srtm
http://www.archives.gov/research...
http://www.bls.gov/
http://www.crunchbase.com/
http://www.dartmouthatlas.org/
http://www.data.gov/
http://www.datakc.org
http://dbpedia.org
http://www.delicious.com/jbaldwi...
http://www.factual.com/
http://research.stlouisfed.org/f...
http://www.freebase.com/
http://www.google.com/publicdata...
http://www.guardian.co.uk/news/d...
http://www.infochimps.com
http://www.kaggle.com/
http://build.kiva.org/
http://www.nationalarchives.gov....
http://www.nyc.gov/html/datamine...
http://www.ordnancesurvey.co.uk/...
http://www.philwhln.com/how-to-g...
http://www.imdb.com/interfaces
http://imat-relpred.yandex.ru/en...
http://www.dados.gov.pt/pt/catal...
http://knoema.com
http://daten.berlin.de/
http://www.qunb.com
http://databib.org/
http://datacite.org/Edit - 1、气候监测数据集 http://cdiac.ornl.gov/ftp/ndp026b
2、几个实用的测试数据集下载的网站
http://www.cs.toronto.edu/~roweis/data.html
http://www.cs.toronto.edu/~roweis/data.html
http://kdd.ics.uci.edu/summary.task.type.html
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
http://www.phys.uni.torun.pl/~duch/software.html
在下面的网址可以找到reuters数据集http://www.research.att.com/~lewis/reuters21578.html
以下网址上有各种数据集:
http://kdd.ics.uci.edu/summary.data.type.html
进行文本分类,还有一个数据集是可以用的,即rainbow的数据集
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
3、找了很多测试数据集,写论文的同志们肯定需要的,至少能用来检验算法的效果
可能有一些不能访问,但是总有能访问的吧:
UCI收集的机器学习数据集
ftp://pami.sjtu.edu.cn/
http://www.ics.uci.edu/~mlearn//MLRepository.htm
statlib
http://liama.ia.ac.cn/SCILAB/scilabindexgb.htm
http://lib.stat.cmu.edu/
样本数据库
http://kdd.ics.uci.edu/
http://www.ics.uci.edu/~mlearn/MLRepository.html
关于基金的数据挖掘的网站
http://www.gotofund.com/index.asp
http://lans.ece.utexas.edu/~strehl/
reuters数据集
http://www.research.att.com/~lewis/reuters21578.html
各种数据集:
http://kdd.ics.uci.edu/summary.data.type.html
http://www.mlnet.org/cgi-bin/mlnetois.pl/?File=datasets.html
http://lib.stat.cmu.edu/datasets/
http://dctc.sjtu.edu.cn/adaptive/datasets/
http://fimi.cs.helsinki.fi/data/
http://www.almaden.ibm.com/software/quest/Resources/index.shtml
http://miles.cnuce.cnr.it/~palmeri/datam/DCI/
进行文本分类&WEB
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
http://www.w3.org/TR/WD-logfile-960221.html
http://www.w3.org/Daemon/User/Config/Logging.html#AccessLog
http://www.w3.org/1998/11/05/WC-workshop/Papers/bala2.html
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
http://www.web-caching.com/traces-logs.html
http://www-2.cs.cmu.edu/webkb
http://www.cs.auc.dk/research/DP/tdb/TimeCenter/TimeCenterPublications/TR-75.pdf
http://www.cs.cornell.edu/projects/kddcup/index.html
时间序列数据的网址
http://www.stat.wisc.edu/~reinsel/bjr-data/
apriori算法的测试数据
http://www.almaden.ibm.com/cs/quest/syndata.html
数据生成器的链接
http://www.cse.cuhk.edu.hk/~kdd/data_collection.html
http://www.almaden.ibm.com/cs/quest/syndata.html
关联:
http://flow.dl.sourceforge.net/sourceforge/weka/regression-datasets.jar
http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html#assocSynData
WEKA:
http://flow.dl.sourceforge.net/sourceforge/weka/regression-datasets.jar
1。A jarfile containing 37 classification problems, originally obtained from the UCI repository
http://prdownloads.sourceforge.net/weka/datasets-UCI.jar
2。A jarfile containing 37 regression problems, obtained from various sources
http://prdownloads.sourceforge.net/weka/datasets-numeric.jar
3。A jarfile containing 30 regression datasets collected by Luis Torgo
http://prdownloads.sourceforge.net/weka/regression-datasets.jar
癌症基因:
http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi
金融数据:
http://lisp.vse.cz/pkdd99/Challenge/chall.htm
另一个人提供的
http://www.cs.toronto.edu/~roweis/data.html
http://kdd.ics.uci.edu/summary.task.type.html
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
http://www.phys.uni.torun.pl/~duch/software.html
在下面的网址可以找到reuters数据集
http://www.research.att.com/~lewis/reuters21578.html
以下网址上有各种数据集:
http://kdd.ics.uci.edu/summary.data.type.html
进行文本分类,还有一个数据集是可以用的,即rainbow的数据集
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
Download the Financial Data (~17.5M zipped file, ~67M unzipped data)
Download the Medical Data (~2M zipped file, ~6M unzipped data)
http://lisp.vse.cz/pkdd99/Challenge/chall.htm
kdnuggets 相关链接数据集(借花献佛了):
http://www.kdnuggets.com/datasets/index.html
你也可以到http://blogger.org.cn/blog/more.asp?name=idmer&id=24017
察看kdnuggets 数据集资源的详细介绍。