Text Classification with scikit-learn

Original article: towardsdatascience.com/a-beginners-guide-to-text-classification-with-scikit-learn-632357e16f3a

Preparing the data

Reading the dataset

import pandas as pd

df_review = pd.read_csv("./IMDB Dataset.csv")
df_review

       review                                             sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. The...              positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...    ...                                                ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

50000 rows × 2 columns
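
Before sampling, it is worth a quick look at the raw data. A minimal inspection sketch (the full Kaggle IMDB dataset is balanced, with 25000 reviews per class):

df_review.info()                       # 50000 entries, two object columns: review, sentiment
df_review['sentiment'].value_counts()  # 25000 positive / 25000 negative in the full dataset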

In order to train our model faster in the following steps, we are going to take a smaller sample of the data. This small sample will contain 9000 positive and 1000 negative reviews, making the data deliberately imbalanced so that you can understand the undersampling and oversampling techniques in the next step.

# create a deliberately imbalanced sample
df_positive = df_review[df_review['sentiment'] == 'positive'][:9000]
df_negative = df_review[df_review['sentiment'] == 'negative'][:1000]

df_review_imb = pd.concat([df_positive, df_negative], axis=0)

df_review_imb

      review                                             sentiment
0     One of the other reviewers has mentioned that ...  positive
1     A wonderful little production. The...              positive
2     I thought this was a wonderful way to spend ti...  positive
4     Petter Mattei's "Love in the Time of Money" is...  positive
5     Probably my all-time favorite movie, a story o...  positive
...   ...                                                ...
2000  Stranded in Space (1972) MST3K version - a ver...  negative
2005  I happened to catch this supposed "horror" fli...  negative
2007  waste of 1h45 this nasty little film is one to...  negative
2010  Warning: This could spoil your movie. Watch it...  negative
2013  Quite what the producers of this appalling ada...  negative

10000 rows × 2 columns
Dealing with the imbalanced class

Imbalanced data

df_review_imb['sentiment'].hist()
[Bar chart of sentiment counts: 9000 positive vs. 1000 negative reviews]


The imblearn library

You can either undersample positive reviews or oversample negative reviews (based on the data you are working with). In this case, we will use the RandomUnderSampler.

from imblearn.under_sampling import RandomUnderSampler

# fit_resample returns (X_resampled, y_resampled); the tuple assignment first binds the
# resampled reviews to df_review_bal, then adds the resampled labels as a new column
rus = RandomUnderSampler(random_state=0)
df_review_bal, df_review_bal['sentiment'] = rus.fit_resample(df_review_imb[['review']],
                                                             df_review_imb['sentiment'])

df_review_imb['review']

0       One of the other reviewers has mentioned that ...
1       A wonderful little production. The...
2       I thought this was a wonderful way to spend ti...
4       Petter Mattei's "Love in the Time of Money" is...
5       Probably my all-time favorite movie, a story o...
                              ...
2000    Stranded in Space (1972) MST3K version - a ver...
2005    I happened to catch this supposed "horror" fli...
2007    waste of 1h45 this nasty little film is one to...
2010    Warning: This could spoil your movie. Watch it...
2013    Quite what the producers of this appalling ada...
Name: review, Length: 10000, dtype: object

df_review_bal
      review                                             sentiment
0     Basically there's a family where a little boy ...  negative
1     This show was an amazing, fresh & innovative i...  negative
2     Encouraged by the positive comments about this...  negative
3     Phil the Alien is one of those quirky films wh...  negative
4     I saw this movie when I was about 12 when it c...  negative
...   ...                                                ...
1995  Knute Rockne led an extraordinary life and his...  positive
1996  At the height of the 'Celebrity Big Brother' r...  positive
1997  This is another of Robert Altman's underrated ...  positive
1998  This movie won a special award at Cannes for i...  positive
1999  You'd be forgiven to think a Finnish director ...  positive

2000 rows × 2 columns

We compare the imbalanced and balanced datasets with the following code:

print(df_review_imb.value_counts('sentiment'))
print(df_review_bal.value_counts('sentiment'))

sentiment
positive 9000
negative 1000
dtype: int64
sentiment
negative 1000
positive 1000
dtype: int64
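
If you preferred to keep all 9000 positive reviews, imblearn also provides RandomOverSampler, which duplicates minority-class rows instead. A minimal sketch following the same pattern (df_review_over is a name introduced here for illustration):

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
df_review_over, df_review_over['sentiment'] = ros.fit_resample(df_review_imb[['review']],
                                                               df_review_imb['sentiment'])
print(df_review_over.value_counts('sentiment'))  # both classes now at 9000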

Splitting data into train and test sets

The train dataset will be used to fit the model, while the test dataset will be used to provide an unbiased evaluation of the final model.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df_review_bal, test_size=0.3, random_state=42)

Set the independent and dependent variables within our train and test sets:

train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']
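
As a quick sanity check (a sketch using the names defined above), you can verify the split sizes; note that train_test_split also accepts a stratify argument if you want to preserve the exact class ratio in both sets:

print(train.shape, test.shape)  # (1400, 2) (600, 2) for a 70/30 split of 2000 rows
print(train_y.value_counts())   # the classes should be roughly balanced

# optional: a stratified split keeps the positive/negative ratio identical in both sets
train_s, test_s = train_test_split(df_review_bal, test_size=0.3, random_state=42,
                                   stratify=df_review_bal['sentiment'])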

Text Representation (Bag of Words)

CountVectorizer
counts how many times each word occurs in each document.

TF-IDF
starts from each word's per-document count and divides by the number of documents the word appears in, down-weighting words that are common across the whole corpus.
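
To see the difference on a toy corpus (a hypothetical two-document example, not part of the original article):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy = ['the movie was good', 'the movie was bad']

cv_toy = CountVectorizer()
print(cv_toy.fit_transform(toy).toarray())    # raw counts, one row per document
print(cv_toy.get_feature_names_out())         # scikit-learn >= 1.0; older versions use get_feature_names()

tfidf_toy = TfidfVectorizer()
print(tfidf_toy.fit_transform(toy).toarray()) # "good"/"bad" get higher weight than words shared by both documents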

Turning our text data into numerical vectors

Example with TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')  # create a new instance of TfidfVectorizer
train_x_vector = tfidf.fit_transform(train_x)  # learn the vocabulary and transform the training text
train_x_vector  # a sparse matrix

# to inspect the sparse matrix as a DataFrame:
# pd.DataFrame.sparse.from_spmatrix(train_x_vector,
#                                   index=train_x.index,
#                                   columns=tfidf.get_feature_names())
<1400x21091 sparse matrix of type '<class 'numpy.float64'>'
with 124311 stored elements in Compressed Sparse Row format>

test_x_vector = tfidf.transform(test_x)  # transform only; the vectorizer must not be refitted on test data

Model Selection
SVM
from sklearn.svm import SVC

svc = SVC(kernel='linear')
svc.fit(train_x_vector, train_y)

SVC(kernel='linear')

After fitting svc, we can predict whether a review is positive or negative with the .predict() method:

print(svc.predict(tfidf.transform(['A good movie'])))
print(svc.predict(tfidf.transform(['An excellent movie'])))
print(svc.predict(tfidf.transform(['I did not like this movie at all'])))

['positive']
['positive']
['negative']
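
A side note, not from the original article: for larger text datasets, LinearSVC (liblinear-based) usually trains much faster than SVC(kernel='linear') with similar accuracy; a minimal sketch:

from sklearn.svm import LinearSVC

lin_svc = LinearSVC()
lin_svc.fit(train_x_vector, train_y)         # scales much better with the number of samples
print(lin_svc.score(test_x_vector, test_y))  # typically close to the SVC(kernel='linear') score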

Decision tree

To fit a decision tree model, we provide the same inputs and outputs:

from sklearn.tree import DecisionTreeClassifier

dec_tree = DecisionTreeClassifier()
dec_tree.fit(train_x_vector, train_y)

DecisionTreeClassifier()

print(dec_tree.predict(tfidf.transform(['A good movie'])))
print(dec_tree.predict(tfidf.transform(['An excellent movie'])))
print(dec_tree.predict(tfidf.transform(['I did not like this movie at all'])))

['positive']
['positive']
['positive']

Naive Bayes
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(train_x_vector.toarray(), train_y)  # GaussianNB requires a dense array, hence .toarray()

GaussianNB()

print(gnb.predict(tfidf.transform(['A good movie']).toarray()))
print(gnb.predict(tfidf.transform(['An excellent movie']).toarray()))
print(gnb.predict(tfidf.transform(['I did not like this movie at all']).toarray()))

['negative']
['negative']
['negative']
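
GaussianNB requires dense arrays, which is why .toarray() appears everywhere above. For word-count or tf-idf features, MultinomialNB is the more usual choice and accepts sparse input directly; a minimal sketch:

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(train_x_vector, train_y)  # a sparse matrix is fine here, no .toarray() needed
print(mnb.predict(tfidf.transform(['An excellent movie'])))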

Logistic Regression
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(train_x_vector,train_y)

LogisticRegression()

print(log_reg.predict(tfidf.transform(['A good movie'])))
print(log_reg.predict(tfidf.transform(['An excellent movie'])))
print(log_reg.predict(tfidf.transform(['I did not like this movie at all'])))

['negative']
['positive']
['negative']

Model Evaluation
Mean Accuracy

To obtain the mean accuracy of each model, use the .score method with the test samples and true labels:

print(svc.score(test_x_vector, test_y))
print(dec_tree.score(test_x_vector, test_y))
print(gnb.score(test_x_vector.toarray(), test_y))
print(log_reg.score(test_x_vector, test_y))

0.8333333333333334
0.6416666666666667
0.6166666666666667
0.8233333333333334

F1 score

The F1 score is the harmonic mean of precision and recall; we compute it for each class as below.
from sklearn.metrics import f1_score

f1_score(test_y, svc.predict(test_x_vector),
         labels=['positive', 'negative'],
         average=None)

array([0.83606557, 0.83050847])

test_x_vector

<600x21091 sparse matrix of type '<class 'numpy.float64'>'
with 48628 stored elements in Compressed Sparse Row format>

With the labels in the opposite order, the two scores are simply swapped:

f1_score(test_y, svc.predict(test_x_vector),
         labels=['negative', 'positive'],
         average=None)

array([0.83050847, 0.83606557])
dec_tree_f1 = f1_score(test_y, dec_tree.predict(test_x_vector), labels=['negative', 'positive'], average=None)
gnb_f1 = f1_score(test_y, gnb.predict(test_x_vector.toarray()), labels=['negative', 'positive'], average=None)
log_reg_f1 = f1_score(test_y, log_reg.predict(test_x_vector), labels=['negative', 'positive'], average=None)

print(dec_tree_f1)
print(gnb_f1)
print(log_reg_f1)

[0.64811784 0.63497453]
[0.57564576 0.65045593]
[0.81724138 0.82903226]
Classification report

The classification report shows the main classification metrics, including those calculated before.

To obtain it, we need the true labels and the predicted labels: classification_report(y_true, y_pred).
from sklearn.metrics import classification_report

print(classification_report(test_y,
                            svc.predict(test_x_vector),
                            labels=['negative', 'positive']))

precision recall f1-score support

 negative 0.85 0.81 0.83 302
positive 0.82 0.86 0.84 298

 accuracy 0.83 600
macro avg 0.83 0.83 0.83 600
weighted avg 0.83 0.83 0.83 600
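
As a quick check of how the f1-score column is derived (the harmonic mean of precision and recall), using the positive-class numbers from the report above:

p, r = 0.82, 0.86           # precision and recall of the positive class
print(2 * p * r / (p + r))  # 0.8395..., which rounds to the reported 0.84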

Confusion matrix
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(test_y,
                               svc.predict(test_x_vector),
                               labels=['positive', 'negative'])

conf_matrix

array([[255,  43],
       [ 57, 245]])

import matplotlib.pyplot as plt
import seaborn as sn

sn.heatmap(conf_matrix, annot=True, cmap=plt.cm.Blues)

[Heatmap of the confusion matrix, annotated with the counts above]

# an alternative using matplotlib directly
plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Blues)

<matplotlib.image.AxesImage at 0x7f184deea590>
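
If you are on scikit-learn 0.22 or newer, ConfusionMatrixDisplay offers a labelled plot without manual heatmap code; a minimal sketch:

from sklearn.metrics import ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix,
                              display_labels=['positive', 'negative'])
disp.plot(cmap=plt.cm.Blues)  # labels match the order passed to confusion_matrix above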


GridSearchCV

This technique consists of an exhaustive search over specified parameter values in order to obtain the optimal hyperparameters.
from sklearn.model_selection import GridSearchCV

Set the parameters:

parameters = {'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']}
svc = SVC()
svc_grid = GridSearchCV(svc, parameters, cv=5)

svc_grid.fit(train_x_vector, train_y)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']})

Now we can inspect the parameters of the optimal model:

print(svc_grid.best_params_)
print(svc_grid.best_estimator_)

{'C': 4, 'kernel': 'rbf'}
SVC(C=4)
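
With refit=True (the default), GridSearchCV retrains the best estimator on the full training set, so the grid-search object can be used directly for evaluation and prediction:

print(svc_grid.score(test_x_vector, test_y))                      # mean accuracy of the tuned model
print(svc_grid.predict(tfidf.transform(['An excellent movie'])))  # delegates to best_estimator_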
