Library usage and other notes

  1. https://github.com/donnemartin/data-science-ipython-notebooks
  2. How do you extract features, and how do you select them?

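Answer: Feature extraction: mostly a matter of deriving features that are relevant to the business problem; for text, this may involve some regular expressions

     Feature selection: information gain, the chi-squared test, mutual information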
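The two steps above can be sketched with the standard library alone. This is a minimal illustration, not a production recipe: the messages, the labels, and the feature names has_url and has_digit are all hypothetical, and information gain is computed by hand here rather than through any library routine.

```python
import math
import re
from collections import Counter

# Hypothetical toy messages and labels (1 = spam, 0 = not spam), for illustration only.
messages = [
    "Win $1000 now at http://example.com",
    "Meeting moved to 3pm tomorrow",
    "Claim your prize: http://example.org/prize",
    "Lunch on Friday?",
]
labels = [1, 0, 1, 0]

URL_RE = re.compile(r"https?://\S+")
DIGIT_RE = re.compile(r"\d")


def extract_features(text):
    """Turn raw text into hand-written binary features using regular expressions."""
    return {
        "has_url": int(bool(URL_RE.search(text))),
        "has_digit": int(bool(DIGIT_RE.search(text))),
    }


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def information_gain(feature_values, targets):
    """Entropy of the targets minus their expected entropy after splitting on the feature."""
    total = len(targets)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, targets) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(targets) - remainder


rows = [extract_features(m) for m in messages]
gains = {
    name: information_gain([row[name] for row in rows], labels)
    for name in rows[0]
}
# Rank features from most to least informative about the label.
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked, gains)
```

On this toy data has_url separates the two classes perfectly, so it ranks first, while has_digit carries no information about the label. On real data the ranking would normally be used to keep only the top few features.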

          Model parameters?

Answer: Most models do not have many parameters to set; usually the only extra one is a regularization term.

      3.  Grid search: used mainly to tune hyperparameters

     from sklearn.pipeline import Pipeline
     from sklearn.model_selection import GridSearchCV  # was sklearn.grid_search in old releases
     from sklearn.metrics import f1_score, classification_report
     import mahotas as mh  # mahotas is used to convert images to the same size and color
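As a rough sketch of how these pieces fit together, the snippet below runs a grid search over a two-step pipeline. The corpus, the labels, the step names vect and clf, and the choice of LogisticRegression and of the candidate values are all assumptions made for illustration; they are not taken from the notes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = hockey, 0 = baseball), for illustration only.
docs = [
    "the goalie stopped the puck",
    "a slap shot hit the ice",
    "the puck slid across the ice",
    "the goalie blocked the slap shot",
    "the pitcher threw a fastball",
    "a home run over the fence",
    "the batter hit a home run",
    "the pitcher struck out the batter",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Keys follow the pattern "<step name>__<parameter name>".
# Every combination is evaluated: 2 n-gram ranges x 3 values of C = 6 candidates.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# cv=2 only because the toy corpus is tiny; real data would use more folds.
grid = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=2)
grid.fit(docs, labels)

print(grid.best_params_)
print(grid.best_score_)
```

Putting the vectorizer inside the pipeline means it is refitted on the training part of each fold, so the validation documents never influence the learned vocabulary.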

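          fit, fit_transform and .transform on a model:

fit simply fits the model to the data. fit_transform fits the data and also returns the training data converted into numeric matrix form. Calling .transform on a classifier (to use it as a feature selector) has been deprecated; transformers such as vectorizers still provide it.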

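          A vectorizer is only needed for text classification:

fit_transform learns the vocabulary and returns the document-term matrix; transform converts data directly into a document-term matrix using the vocabulary already learned, without learning anything new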
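The difference is easiest to see on a tiny example. The sketch below uses CountVectorizer instead of TfidfVectorizer so that the matrix entries are plain counts; the sentences are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, for illustration only.
train_docs = ["the puck hit the goal", "the pitcher threw the ball"]
test_docs = ["the goalie stopped the puck"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training documents
# and returns their document-term matrix in one call.
X_train = vectorizer.fit_transform(train_docs)
vocabulary = sorted(vectorizer.vocabulary_)

# transform reuses the learned vocabulary. Words never seen during fitting
# ("goalie", "stopped") are dropped, so the column count does not change.
X_test = vectorizer.transform(test_docs)

print(vocabulary)
print(X_train.shape, X_test.shape)
print(X_test.toarray())
```

Calling fit_transform on the test documents instead would build a second, different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.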

from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import f1_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron

# Load three of the 20 newsgroups, stripping metadata that would give the label away
categories = ['rec.sport.hockey', 'rec.sport.baseball', 'rec.autos']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

# Learn the vocabulary and IDF weights on the training set only, then reuse them for the test set
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)

# max_iter replaces the older n_iter argument; a classifier is trained with fit, not fit_transform
classifier = Perceptron(max_iter=100, eta0=0.1)
classifier.fit(X_train, newsgroups_train.target)
predictions = classifier.predict(X_test)
print(classification_report(newsgroups_test.target, predictions))

 

posted @ 2016-04-26 21:36  hudongni1