Library usage and other notes
- https://github.com/donnemartin/data-science-ipython-notebooks
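- How do you extract features and select features?
A: Feature extraction: mainly extract features that are relevant to the business problem; for text, regular expressions are often useful.
Feature selection: information gain, chi-squared test, mutual information.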
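As a rough sketch of the answer above, the snippet below extracts one regex-based feature and then scores bag-of-words terms with the chi-squared test and mutual information. The toy corpus, the labels, and the phone-number pattern are made up for illustration; `SelectKBest`, `chi2`, and `mutual_info_classif` are scikit-learn functions, but the choice of `k=3` is arbitrary.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Toy corpus and labels, invented for illustration (1 = spam, 0 = not spam).
docs = [
    "win a free prize now, call 555-1234",
    "free cash offer, call 555-9876 today",
    "meeting moved to Monday morning",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

# Feature extraction with a regular expression:
# does the text contain something that looks like a phone number?
phone_pattern = re.compile(r"\d{3}-\d{4}")
has_phone = [int(bool(phone_pattern.search(d))) for d in docs]

# Bag-of-words counts, one column per term.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Feature selection: keep the 3 terms with the highest chi-squared score.
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, labels)
selected_terms = [t for t, keep in zip(terms, selector.get_support()) if keep]

# Mutual information between each term count and the label.
mi_scores = mutual_info_classif(X, labels, discrete_features=True, random_state=0)

print("regex feature:", has_phone)
print("chi2-selected terms:", selected_terms)
```

On real data the selector would normally be fitted on the training split only, so that the test set does not influence which features are kept.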
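Model parameters?
A: Models usually have few parameters to tune; at most there is an extra regularization term.
3. Grid search: mainly used to optimize hyperparameters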
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, classification_report
import mahotas as mh: the mahotas library is used to convert images to the same size and color.
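A minimal sketch of how the `Pipeline` and `GridSearchCV` imports listed above might be combined to tune a hyperparameter. The iris dataset, the `StandardScaler` + `LogisticRegression` pipeline, the step names, and the grid of `C` values are assumptions chosen so the example runs without downloads; they are not taken from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, used here only as a stand-in.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Chain preprocessing and the classifier so both are refit inside each CV fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameters are addressed as <step name>__<parameter name>.
# C is the inverse regularization strength of LogisticRegression.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Try every value in the grid with 5-fold cross-validation.
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

After fitting, `grid` behaves like the best pipeline refit on the whole training set, which is why `grid.predict` can be called directly.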
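fit, fit_transform, and .transform on a model:
fit simply fits the model to the data; fit_transform fits and also converts the training data into a numeric matrix; .transform on predictive models such as classifiers has been deprecated (transformers such as vectorizers still use it)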
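A vectorizer is only needed when working with text, as in text classification:
fit_transform learns the vocabulary and returns the document-term matrix; transform converts the data directly into a document-term matrix using the vocabulary already learned, without learning anything new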
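The difference can be seen on a tiny, made-up corpus. This is only an illustrative sketch using `CountVectorizer`; the documents are invented, and the same fit/transform split applies to `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented documents for illustration.
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat meowed"]

vectorizer = CountVectorizer()

# fit_transform: learn the vocabulary from the training docs
# and return their document-term matrix.
X_train = vectorizer.fit_transform(train_docs)

# transform: reuse the learned vocabulary, learn nothing new.
# "meowed" was never seen during fitting, so it gets no column.
X_test = vectorizer.transform(test_docs)

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X_test.toarray())
```

Because the test matrix has exactly the same columns as the training matrix, a classifier trained on `X_train` can be applied to `X_test` directly.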
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import f1_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron

categories = ['rec.sport.hockey', 'rec.sport.baseball', 'rec.autos']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

# Learn the vocabulary and IDF weights on the training set only, then reuse them for the test set
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)

# A classifier is trained with fit, not fit_transform; max_iter replaces the older n_iter argument
classifier = Perceptron(max_iter=100, eta0=0.1)
classifier.fit(X_train, newsgroups_train.target)
predictions = classifier.predict(X_test)
print(classification_report(newsgroups_test.target, predictions))