Text Sentiment Classification with Bag of Words and Random Forest
1. Reading the Dataset
Use pandas to read the training dataset.
import re
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
# Download the list of common stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# Read train data
path_train_data = "movie_review/labeledTrainData.tsv"
train = pd.read_csv(path_train_data, header=0, delimiter="\t", quoting=3)
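A quick inspection (an optional check, not part of the original walkthrough) confirms the shape and the columns we rely on later:
print(train.shape)           # (25000, 3) for the Kaggle IMDB dataset
print(train.columns.values)  # ['id' 'sentiment' 'review']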
Imported packages:
- re: Python's built-in regular-expression library;
- pandas: library for reading and manipulating data;
- numpy: high-performance array library;
- BeautifulSoup: HTML parser, used to extract content from HTML and XML documents;
- sklearn: widely used machine-learning library;
- CountVectorizer: converts a collection of text documents into a matrix of token counts;
- RandomForestClassifier: random-forest classifier;
- nltk: natural language processing toolkit;
- corpus: the corpora bundled with nltk;
- stopwords: stop words (e.g. is, are, I, you, and...; see the quick check after this list).
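As a quick look at the stopword list mentioned above (exact output depends on your NLTK version):
# Peek at NLTK's English stopword list
print(len(stopwords.words("english")))  # 179 in recent NLTK versions
print(stopwords.words("english")[:8])   # ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']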
2. Preprocessing the Text
Define two functions to clean the text, e.g. stripping HTML tags, punctuation, and uninformative stop words (and, is, are...).
def review_to_words(raw_review):
    # 1. Remove HTML tags
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    # 4. Convert stopwords to a set, because searching a set is
    #    much faster than searching a list
    stops = set(stopwords.words("english"))
    # 5. Remove stopwords
    meaningful_words = [w for w in words if w not in stops]
    # 6. Join the words back into one string, separated by spaces
    return " ".join(meaningful_words)

def get_clean_reviews(reviews):
    num_reviews = len(reviews)
    clean_reviews = []
    for i in range(num_reviews):
        if (i + 1) % 1000 == 0:
            print("Review %d of %d" % (i + 1, num_reviews))
        clean_reviews.append(review_to_words(reviews[i]))
    return clean_reviews
clean_train_reviews = get_clean_reviews(train["review"])
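To see what the cleaning does, here is a made-up review snippet run through review_to_words:
sample_review = "<p>This movie is GREAT!! I loved it...</p>"
print(review_to_words(sample_review))  # -> "movie great loved"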
3. Creating the Bag of Words
# Initialize the CountVectorizer; max_features caps the vocabulary
# at the 5,000 most frequent words, a common choice for this dataset
vectorizer = CountVectorizer(analyzer="word", max_features=5000)
# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)
# Numpy arrays are easy to work with, so convert the result to an array
train_data_features = train_data_features.toarray()
# In scikit-learn >= 1.0, use get_feature_names_out() instead
vocab = vectorizer.get_feature_names()
vectorizer.fit_transform converts a list of strings into a bag-of-words feature matrix: one row per review, one column per word in the learned vocabulary.
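A toy example (two made-up sentences, purely illustrative) makes the matrix layout concrete:
# Each row is a document, each column a vocabulary word (sorted alphabetically)
toy_vectorizer = CountVectorizer()
toy_features = toy_vectorizer.fit_transform(["good movie", "bad movie bad plot"])
print(toy_vectorizer.get_feature_names())  # ['bad', 'good', 'movie', 'plot']
print(toy_features.toarray())
# [[0 1 1 0]
#  [2 0 1 1]]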
4. Training the Model
Train a random forest with 100 trees:
print("Training the random forest...")
# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data_features, train['sentiment'])
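The walkthrough goes straight to the test set; if you want a rough accuracy estimate first, a cross-validation sketch (an optional addition, not from the original) could look like this:
# Optional: 3-fold cross-validation on the training features
from sklearn.model_selection import cross_val_score
scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                         train_data_features, train["sentiment"], cv=3)
print("CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))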
5. Preparing the Test Set
Use the functions defined earlier to compute feature vectors for the test dataset.
print("Testing the model...")
path_test_data = "movie_review/testData.tsv"
test = pd.read_csv(path_test_data, header=0, delimiter="\t", quoting=3)
# Preprocess the test data with the same cleaning pipeline
clean_test_reviews = get_clean_reviews(test["review"])
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()
Note that when computing test_data_features we call transform rather than fit_transform, because the vectorizer has already been fitted on the training data.
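A small example (hypothetical words) shows the consequence: transform keeps the training vocabulary fixed and silently drops words it has never seen:
fitted = CountVectorizer().fit(["good movie"])
print(fitted.transform(["good unseen movie"]).toarray())  # [[1 1]] -- 'unseen' is dropped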
6. Prediction
Write the predictions to a CSV file for easy inspection:
path_result = "movie_review/Bag_of_Words_model.csv"
result = forest.predict(test_data_features)
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv(path_result, index=False, quoting=3)
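To sanity-check the submission file (an optional step), read it back and look at the label distribution:
check = pd.read_csv(path_result, quoting=3)
print(check.shape)                         # one row per test review
print(check["sentiment"].value_counts())   # split between predicted 0s and 1s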