Text Classification with scikit-learn
Original article: towardsdatascience.com/a-beginners-guide-to-text-classification-with-scikit-learn-632357e16f3a
Preparing the data
Reading the dataset
import pandas as pd
df_review = pd.read_csv("./IMDB Dataset.csv")  # load the IMDB movie reviews dataset
df_review
                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. The...              positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

50000 rows × 2 columns
In order to train our model faster in the following steps, we are going to take a smaller sample of the dataset. This small sample will contain 9000 positive and 1000 negative reviews, deliberately making the data imbalanced so you can understand the undersampling and oversampling techniques in the next step.
# create a small, deliberately imbalanced sample
df_positive = df_review[df_review['sentiment'] == 'positive'][:9000]
df_negative = df_review[df_review['sentiment'] == 'negative'][:1000]
df_review_imb = pd.concat([df_positive, df_negative], axis=0)
df_review_imb
                                                 review sentiment
0     One of the other reviewers has mentioned that ...  positive
1     A wonderful little production. The...              positive
2     I thought this was a wonderful way to spend ti...  positive
4     Petter Mattei's "Love in the Time of Money" is...  positive
5     Probably my all-time favorite movie, a story o...  positive
...                                                 ...       ...
2000  Stranded in Space (1972) MST3K version - a ver...  negative
2005  I happened to catch this supposed "horror" fli...  negative
2007  waste of 1h45 this nasty little film is one to...  negative
2010  Warning: This could spoil your movie. Watch it...  negative
2013  Quite what the producers of this appalling ada...  negative

10000 rows × 2 columns
Dealing with the imbalanced classes
Imbalanced data
df_review_imb['sentiment'].hist()
[histogram of sentiment counts: 9000 positive vs. 1000 negative reviews]
imblearn library
You can either undersample positive reviews or oversample negative reviews, depending on the data you are working with. In this case, we will use RandomUnderSampler.
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)
# fit_resample returns the resampled X and y; the tuple assignment builds a
# balanced DataFrame and then attaches the resampled labels as a column
df_review_bal, df_review_bal['sentiment'] = rus.fit_resample(df_review_imb[['review']],
                                                             df_review_imb['sentiment'])
df_review_imb['review']
0 One of the other reviewers has mentioned that ...
1 A wonderful little production. The...
2 I thought this was a wonderful way to spend ti...
4 Petter Mattei's "Love in the Time of Money" is...
5 Probably my all-time favorite movie, a story o...
...
2000 Stranded in Space (1972) MST3K version - a ver...
2005 I happened to catch this supposed "horror" fli...
2007 waste of 1h45 this nasty little film is one to...
2010 Warning: This could spoil your movie. Watch it...
2013 Quite what the producers of this appalling ada...
Name: review, Length: 10000, dtype: object
df_review_bal
                                                 review sentiment
0     Basically there's a family where a little boy ...  negative
1     This show was an amazing, fresh & innovative i...  negative
2     Encouraged by the positive comments about this...  negative
3     Phil the Alien is one of those quirky films wh...  negative
4     I saw this movie when I was about 12 when it c...  negative
...                                                 ...       ...
1995  Knute Rockne led an extraordinary life and his...  positive
1996  At the height of the 'Celebrity Big Brother' r...  positive
1997  This is another of Robert Altman's underrated ...  positive
1998  This movie won a special award at Cannes for i...  positive
1999  You'd be forgiven to think a Finnish director ...  positive

2000 rows × 2 columns
We can compare the imbalanced and balanced datasets with the following code:
print(df_review_imb.value_counts('sentiment'))
print(df_review_bal.value_counts('sentiment'))
sentiment
positive 9000
negative 1000
dtype: int64
sentiment
negative 1000
positive 1000
dtype: int64
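We used undersampling above; the oversampling alternative mentioned earlier follows the same API. Here is a minimal sketch with imblearn's RandomOverSampler (the df_review_over name is just for illustration):

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
# duplicates minority-class rows instead of dropping majority-class ones
df_review_over, df_review_over['sentiment'] = ros.fit_resample(df_review_imb[['review']],
                                                               df_review_imb['sentiment'])
print(df_review_over.value_counts('sentiment'))  # expect 9000 reviews per class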
Splitting data into train and test sets
The train dataset will be used to fit the model, while the test dataset will be used to provide an unbiased evaluation of the final model.
from sklearn.model_selection import train_test_split

train, test = train_test_split(df_review_bal, test_size=0.3, random_state=42)
Set the independent and dependent variables within our train and test sets:
train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']
Text Representation (Bag of Words)
CountVectorizer
Counts the number of times each word appears in each document.
TF-IDF
Builds on the per-document word counts, down-weighting each word by the number of documents it appears in, so words common to many documents count less.
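Only TF-IDF is used in the example below; for contrast, here is a minimal sketch of both vectorizers on a made-up two-sentence corpus (assumes a recent scikit-learn with get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['the movie was good', 'the movie was bad']  # hypothetical toy corpus

cv = CountVectorizer()
print(cv.fit_transform(corpus).toarray())  # raw counts of each word per document
print(cv.get_feature_names_out())          # vocabulary learned from the corpus

tv = TfidfVectorizer()
print(tv.fit_transform(corpus).toarray())  # same counts, reweighted by document frequency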
Turning our text data into numerical vectors
Example with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')  # create a new instance of TfidfVectorizer
train_x_vector = tfidf.fit_transform(train_x)  # learn the vocabulary and transform the training data
train_x_vector  # a sparse matrix
# to inspect the sparse matrix as a DataFrame:
# pd.DataFrame.sparse.from_spmatrix(train_x_vector,
#                                   index=train_x.index,
#                                   columns=tfidf.get_feature_names_out())
<1400x21091 sparse matrix of type '<class 'numpy.float64'>'
with 124311 stored elements in Compressed Sparse Row format>
test_x_vector = tfidf.transform(test_x)  # transform (not fit) the test set with the vocabulary learned from training
Model Selection
SVM
from sklearn.svm import SVC
svc = SVC(kernel='linear')
svc.fit(train_x_vector, train_y)
SVC(kernel='linear')
After fitting svc, we can predict whether a review is positive or negative with the .predict() method:
print(svc.predict(tfidf.transform(['A good movie'])))
print(svc.predict(tfidf.transform(['An excellent movie'])))
print(svc.predict(tfidf.transform(['I did not like this movie at all'])))
['positive']
['positive']
['negative']
Decision Tree
To fit a decision tree model, we pass in the same input and output:
from sklearn.tree import DecisionTreeClassifier
dec_tree = DecisionTreeClassifier()
dec_tree.fit(train_x_vector, train_y)
DecisionTreeClassifier()
print(dec_tree.predict(tfidf.transform(['A good movie'])))
print(dec_tree.predict(tfidf.transform(['An excellent movie'])))
print(dec_tree.predict(tfidf.transform(['I did not like this movie at all'])))
['positive']
['positive']
['positive']
Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(train_x_vector.toarray(), train_y)  # GaussianNB requires a dense array
GaussianNB()
print(gnb.predict(tfidf.transform(['A good movie']).toarray()))
print(gnb.predict(tfidf.transform(['An excellent movie']).toarray()))
print(gnb.predict(tfidf.transform(['I did not like this movie at all']).toarray()))
['negative']
['negative']
['negative']
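GaussianNB assumes continuous, Gaussian-distributed features, which is a poor match for sparse TF-IDF vectors and may explain the all-negative predictions here. As a side note beyond the original article, scikit-learn's MultinomialNB is the Naive Bayes variant usually used for text and accepts the sparse matrix directly; a minimal sketch:

from sklearn.naive_bayes import MultinomialNB

# MultinomialNB works on sparse count/TF-IDF features, no .toarray() needed
mnb = MultinomialNB()
mnb.fit(train_x_vector, train_y)
print(mnb.predict(tfidf.transform(['A good movie'])))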
Logistic Regression
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(train_x_vector,train_y)
LogisticRegression()
print(log_reg.predict(tfidf.transform(['A good movie'])))
print(log_reg.predict(tfidf.transform(['An excellent movie'])))
print(log_reg.predict(tfidf.transform(['I did not like this movie at all'])))
['negative']
['positive']
['negative']
Model Evaluation
Mean Accuracy
To obtain the mean accuracy of each model, just use the .score method with the test samples and true labels, as shown below:
print(svc.score(test_x_vector, test_y))
print(dec_tree.score(test_x_vector,test_y))
print(gnb.score(test_x_vector.toarray(), test_y))
print(log_reg.score(test_x_vector,test_y))
0.8333333333333334
0.6416666666666667
0.6166666666666667
0.8233333333333334
F1 score
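The F1 score is the harmonic mean of precision and recall, F1 = 2 * precision * recall / (precision + recall); with average=None it is computed separately for each class.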
from sklearn.metrics import f1_score
f1_score(test_y, svc.predict(test_x_vector),
labels=['positive','negative'],
average=None)
array([0.83606557, 0.83050847])
test_x_vector
<600x21091 sparse matrix of type '<class 'numpy.float64'>'
with 48628 stored elements in Compressed Sparse Row format>
from sklearn.metrics import f1_score
f1_score(test_y, svc.predict(test_x_vector),
labels=['negative','positive'],
average=None)
array([0.83050847, 0.83606557])
dec_tree_f1 = f1_score(test_y, dec_tree.predict(test_x_vector),labels=['negative','positive'], average=None)
gnb_f1 = f1_score(test_y, gnb.predict(test_x_vector.toarray()),labels=['negative','positive'], average=None)
log_reg_F1 = f1_score(test_y, log_reg.predict(test_x_vector),labels=['negative','positive'], average=None)
print(dec_tree_f1)
print(gnb_f1)
print(log_reg_F1)
[0.64811784 0.63497453]
[0.57564576 0.65045593]
[0.81724138 0.82903226]
Classification report
The classification report shows the main classification metrics, including those calculated before.
To obtain the classification report, we need the true labels and the predicted labels: classification_report(y_true, y_pred).
from sklearn.metrics import classification_report
print(classification_report(test_y,
svc.predict(test_x_vector),
labels=['negative','positive']))
precision recall f1-score support
negative 0.85 0.81 0.83 302
positive 0.82 0.86 0.84 298
accuracy 0.83 600
macro avg 0.83 0.83 0.83 600
weighted avg 0.83 0.83 0.83 600
Confusion matrix
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(test_y,
svc.predict(test_x_vector),
labels=['positive','negative'])
conf_matrix
array([[255, 43],
[ 57, 245]])
import matplotlib.pyplot as plt
import seaborn as sn
sn.heatmap(conf_matrix, annot=True, cmap=plt.cm.Blues)
[heatmap of the confusion matrix]
plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Blues)  # alternative: plain matplotlib imshow
<matplotlib.image.AxesImage at 0x7f184deea590>
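The heatmaps above lack class labels on their axes. A minimal sketch of a labeled version; the tick order follows labels=['positive','negative'] as passed to confusion_matrix:

sn.heatmap(conf_matrix, annot=True, fmt='d', cmap=plt.cm.Blues,
           xticklabels=['positive', 'negative'],
           yticklabels=['positive', 'negative'])
plt.xlabel('predicted label')  # columns are predictions
plt.ylabel('true label')       # rows are true classes
plt.show()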

GridSearchCV
This technique consists of an exhaustive search over specified parameter values in order to obtain the optimum values of the hyperparameters.
from sklearn.model_selection import GridSearchCV
# set the parameters to search over
parameters = {'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']}
svc = SVC()
svc_grid = GridSearchCV(svc, parameters, cv=5)
svc_grid.fit(train_x_vector, train_y)
GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']})
Now we can inspect the best parameters and the best estimator found by the search:
print(svc_grid.best_params_)
print(svc_grid.best_estimator_)
{'C': 4, 'kernel': 'rbf'}
SVC(C=4)
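Since GridSearchCV refits the best estimator on the whole training set by default (refit=True), the tuned model can be evaluated directly; a minimal sketch:

# .score() on a fitted GridSearchCV delegates to the best estimator found
print(svc_grid.score(test_x_vector, test_y))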