Text Sentiment Classification with Bag of Words and a Random Forest

1. Read the dataset

Use pandas to read the training dataset.

import re
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Download the NLTK stopword list
import nltk
nltk.download('stopwords')  
from nltk.corpus import stopwords

# Read train data
path_train_data = "movie_review/labeledTrainData.tsv"
train = pd.read_csv(path_train_data, header=0, delimiter="\t", quoting=3)

Imported packages:

  • re: Python's built-in regular expression library;
  • pandas: data loading and manipulation library;
  • numpy: high-performance array library;
  • BeautifulSoup: HTML parsing library, used to extract content from HTML and XML documents;
  • sklearn: widely used machine learning library;
    • CountVectorizer: converts a collection of texts into bag-of-words count vectors;
    • RandomForestClassifier: random forest classifier;
  • nltk: natural language processing library;
    • corpus: the corpora bundled with nltk;
    • stopwords: stop words (e.g. is, are, I, you, and, ...); a quick inspection sketch follows this list;
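
As a quick illustration (a minimal sketch; the exact contents and length of the list depend on the installed NLTK data version), the English stopword list can be inspected directly:

# Inspect the NLTK English stopword list
# (requires nltk.download('stopwords') to have run once)
from nltk.corpus import stopwords

stops = stopwords.words('english')
print(len(stops))   # roughly 180-200 words, depending on NLTK version
print(stops[:5])    # ['i', 'me', 'my', 'myself', 'we']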

2. Preprocess the text

Define two functions to clean the text, e.g. stripping HTML tags, punctuation, and meaningless stop words (and, is, are, ...).


def review_to_words(raw_review):
    # 1. Remove HTML tags (an explicit parser avoids a BeautifulSoup warning)
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    # 4. Convert stopwords to a set, because searching a set is 
    #    much faster than searching a list
    stops = set(stopwords.words('english'))
    # 5. Remove stopwords
    meaningful_words = [w for w in words if w not in stops]
    # 6. Join the words back into one space-separated string
    return " ".join(meaningful_words)

def get_clean_reviews(reviews):
    num_reviews = len(reviews)
    clean_reviews = []
    for i in range(num_reviews):
        if ((i+1) % 1000) == 0:  
            print("Review %d of %d" % (i+1, num_reviews))
        clean_reviews.append(review_to_words(reviews[i]))
    return clean_reviews

clean_train_reviews = get_clean_reviews(train["review"])
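
To see what the cleaning does, here is a minimal sketch on a made-up review (the input string is hypothetical; note that NLTK's stopword list also removes words like "this", "was", and even "not"):

print(review_to_words("<p>This movie was NOT good!</p>"))
# -> 'movie good'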

3. Create the bag of words

# Initialize CountVectorizer. max_features=5000 keeps only the 5000
# most frequent words (a common choice for this dataset; tune as needed).
vectorizer = CountVectorizer(analyzer="word", max_features=5000)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# Numpy arrays are easy to work with, so convert the result to an array
train_data_features = train_data_features.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out()

vectorizer.fit_transform turns a list of strings into a bag-of-words feature matrix: one row per review, one column per vocabulary word, as the sketch below illustrates.
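
A minimal sketch on a hypothetical toy corpus, showing the learned vocabulary and the resulting count vectors:

# Toy corpus (hypothetical) to illustrate what CountVectorizer learns
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ["good movie", "bad movie", "good good acting"]
toy_vec = CountVectorizer()
toy_features = toy_vec.fit_transform(toy_reviews)

print(toy_vec.get_feature_names_out())  # ['acting' 'bad' 'good' 'movie']
print(toy_features.toarray())
# [[0 0 1 1]
#  [0 1 0 1]
#  [1 0 2 0]]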

4. Train the model

Train a random forest with 100 trees:

print("Training the random forest...")
# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data_features, train['sentiment'])
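
Before touching the test set, a rough accuracy estimate on the training data can be helpful. This is an optional sketch, not part of the original pipeline; it assumes 3-fold cross-validation is enough for a ballpark figure:

# Optional sanity check: 3-fold cross-validation on the training data
from sklearn.model_selection import cross_val_score

scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                         train_data_features, train["sentiment"], cv=3)
print("CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))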

5. Prepare the test set

Use the functions defined earlier to obtain feature vectors for the test dataset.

print("Testing the model...")
path_test_data = "movie_review/testData.tsv"
test = pd.read_csv(path_test_data, header=0, delimiter="\t", quoting=3)

# processing test data
clean_test_reviews = get_clean_reviews(test["review"])
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

Note that when building test_data_features we call transform rather than fit_transform: the vectorizer has already learned its vocabulary from the training data, and refitting on the test set would produce a different vocabulary and features that no longer line up with the trained model. The sketch below shows the effect.
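
Continuing the toy example from step 3, transform maps new text onto the already-learned vocabulary; words never seen during fitting are simply dropped:

# transform() reuses the vocabulary learned earlier by fit_transform()
print(toy_vec.transform(["good plot"]).toarray())
# [[0 0 1 0]]  - 'good' is counted, 'plot' was not in the vocabulary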

6. Predict

Write the predictions to a CSV file for easy inspection:

path_result = "movie_review/Bag_of_Words_model.csv"
result = forest.predict(test_data_features)
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv(path_result, index=False, quoting=3)