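Kaggle Case Study 3: Bag of Words Meets Bags of Popcorn
Project description: This is a tutorial on sentiment analysis. Google's Word2Vec is a deep-learning-inspired method that focuses on the meaning of words: it attempts to capture the meaning of words and the semantic relationships among them. It works in a way similar to deep approaches such as recurrent neural nets or deep neural nets, but is computationally more efficient. Sentiment analysis is a challenging task in machine learning, because people express their feelings in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which can mislead humans and computers alike. This tutorial focuses on applying Word2Vec to sentiment analysis.
Project dates: 2014/12/9-2015/6/30
Tutorial overview: This tutorial helps us get familiar with Word2Vec in natural language processing. It has two main goals:
Basic natural language processing: Part 1 of the tutorial covers some basic natural language processing techniques to help beginners get started;
Deep learning for text understanding: Parts 2 and 3 explain how to train a model with Word2Vec and how to use the resulting word vectors for sentiment analysis.
The tutorial uses the IMDB sentiment analysis dataset [2], which contains 100,000 movie reviews. The processing pipeline in this article consists of the following steps:
Read the data with pd.read_csv --> strip the HTML tags from the reviews with the BeautifulSoup package --> remove the punctuation from the reviews with regular expressions (re) --> convert all uppercase letters in the reviews to lowercase -->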
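Part 1: For Beginners - Bag of Words
1.1 Reading the Data
The figure below shows part of the training data. The training set is named labeledTrainData.tsv (a csv file separates fields with commas, a tsv file separates them with tabs). It has three columns, id/sentiment/review, which hold the unique ID of each review, the ground-truth sentiment label (1 for a positive review, 0 for a negative one), and the text of the review.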
# Import the pandas package, then use the "read_csv" function to read the labeled training data
import pandas as pd

# Load the data
train = pd.read_csv('labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)
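Kaggle competition data usually comes as .csv or .tsv files, both of which can be read with the read_csv() function in pandas. The first argument is the file name and is required; many optional arguments control other behavior. header gives the row number (or a list of row numbers) to use as the column names, so header=0 means that the first row of the file holds the column names. delimiter is the separator between fields; delimiter="\t" says the file is tab-separated. quoting=3 (the value of csv.QUOTE_NONE) tells the parser to treat double quotes as ordinary characters. The return value train is a DataFrame, and its shape attribute gives the size of the data (25000x3): as the session below shows, the training set has 25,000 rows. Each column of a DataFrame is identified by its name and can be extracted by that name. The column names also become attributes of train, so a column can be accessed either as train.id or as train['id']. For example, train['id'][0] returns the first entry of the 'id' column ('"5814_8"'), and train['id'][0:3] returns its first three entries, as the session below shows: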
In [11]: train['id'][0]
Out[11]: '"5814_8"'

In [12]: train.id[0:3]
Out[12]:
0    "5814_8"
1    "2381_9"
2    "7759_3"
Name: id, dtype: object

In [13]: train.id[0]
Out[13]: '"5814_8"'

In [14]: train['id'][0:3]
Out[14]:
0    "5814_8"
1    "2381_9"
2    "7759_3"
Name: id, dtype: object

In [15]: train.shape
Out[15]: (25000, 3)

In [16]: train.columns.values
Out[16]: array(['id', 'sentiment', 'review'], dtype=object)
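The double quotes kept inside '"5814_8"' are a consequence of quoting=3. As a minimal sketch of what that setting means, the example below uses only the standard-library csv module, whose constant csv.QUOTE_NONE has the value 3. The two-row sample and the read_tsv helper are hypothetical stand-ins made up for illustration; they are not part of the tutorial or of pandas.

```python
import csv
import io

# Hypothetical two-row sample laid out like labeledTrainData.tsv
sample = (
    'id\tsentiment\treview\n'
    '"1_9"\t1\t"Loved it"\n'
    '"2_2"\t0\t"Dull and far too long"\n'
)

def read_tsv(text, quoting):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    header = next(reader)  # first row holds the column names
    return [dict(zip(header, row)) for row in reader]

raw = read_tsv(sample, csv.QUOTE_NONE)          # quotes kept as ordinary characters
stripped = read_tsv(sample, csv.QUOTE_MINIMAL)  # quotes treated as field delimiters

print(csv.QUOTE_NONE)
print(repr(raw[0]["id"]))
print(repr(stripped[0]["id"]))
```

With QUOTE_NONE the quotes stay in the data, which is why they also surround every review in the training set and have to be dealt with during cleaning.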
View the first review:
train['review'][0]
'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci\'s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ\'s music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ\'s bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i\'ve gave this subject....hmmm well i don\'t know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."'
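Looking at the data, we can see that besides ordinary words the review contains HTML tags such as <br />, abbreviations (for example, Michael Jackson shortened to MJ), and assorted punctuation. The next section describes how to clean the data.
1.2 Data Cleaning and Text Preprocessing
1.2.1 Removing HTML tags from the reviews: the BeautifulSoup package
First, we remove the HTML tags from the text using the BeautifulSoup package. BeautifulSoup is a Python library used mainly when writing web crawlers, to extract data from HTML or XML files; here we use it only to strip the HTML tags from the reviews. For more on the package, see the documentation [4]. If BeautifulSoup is not installed, it can be installed with the following command: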
$ sudo pip install BeautifulSoup4
BeautifulSoup is a class with many methods. Let's see how it handles the HTML tags in the text:
# Import the pandas package, then use the "read_csv" function to read the labeled training data
import pandas as pd
from bs4 import BeautifulSoup

# Load the training dataset
train = pd.read_csv('labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)

# Initialize the BeautifulSoup object on a single movie review
# (naming the parser explicitly avoids a bs4 warning)
example1 = BeautifulSoup(train['review'][0], "html.parser")

# Print the raw review and then the output of get_text(), for comparison
print(train['review'][0])
print(example1.get_text())
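The output of example1.get_text() is shown below; the HTML tags have been removed. BeautifulSoup is a powerful library that does far more than this dataset requires. Regular expressions could strip the tags too, but using them on HTML is not recommended: even for a task as simple as this one, BeautifulSoup is the better choice.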
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? 
Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."
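In the cleaned review above, sentences run together (for example "m'kay.Visually") because the <br /> tags are removed without leaving any whitespace behind. The sketch below illustrates the effect and one way around it. To stay runnable without bs4 it swaps in the standard library's html.parser, which is an assumption of this example and not the tutorial's method; TextExtractor and strip_tags are hypothetical names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text found between tags
        self.chunks.append(data)

def strip_tags(html, separator=""):
    """Return the text content of html, joining the pieces with separator."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)

snippet = "drugs are bad m'kay.<br /><br />Visually impressive"
print(strip_tags(snippet))        # tags dropped, sentences glued together
print(strip_tags(snippet, " "))   # a separator keeps the sentences apart
```

BeautifulSoup's get_text() accepts a separator argument that serves the same purpose. In this tutorial the glued sentences are harmless, because the next step replaces the punctuation between them with spaces.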
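1.2.2 Handling punctuation, numbers, and stop words: NLTK and regular expressions
Note that text processing does not always call for removing punctuation, numbers, and similar characters; whether to remove them depends on the task at hand. In sentiment analysis, for example, symbols such as "!!!" or ":-(" carry sentiment and should be treated as words. For simplicity, this tutorial removes all such symbols. Likewise, it removes all numbers, although there are other ways to handle them that keep them meaningful: we could treat them as words, or replace every number with the placeholder string "NUM". To remove punctuation and numbers we use regular expressions through Python's re module, which is built into Python and needs no separate installation. See the official documentation for the re module.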
import re

# Use regular expressions to do a find-and-replace
letters_only = re.sub('[^a-zA-Z]',           # the pattern to search for
                      ' ',                   # the replacement (a space)
                      example1.get_text())   # the text to search
print(letters_only)
The result is shown below: every character other than a-z and A-Z, such as digits and punctuation marks, has been replaced with a space.
With all this stuff going down at the moment with MJ i ve started listening to his music watching the odd documentary here and there watched The Wiz and watched Moonwalker again Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent Moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord Why he wants MJ dead so bad is beyond me Because MJ overheard his plans Nah Joe Pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno maybe he just hates MJ s music Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence Also the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene Bottom line this movie is for people who like MJ on one level or another which i think is most people If not then stay away It does try and give off a wholesome message and ironically MJ s bestest buddy in this movie is a girl Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty Well with all the attention i ve 
gave this subject hmmm well i don t know because people can be different behind closed doors i know this for a fact He is either an extremely nice but stupid guy or one of the most sickest liars I hope he is not the latter
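The output above has lost the "20" in "only on for 20 minutes". As a minimal sketch of the "NUM" placeholder alternative mentioned earlier, the example below replaces each run of digits before stripping the remaining non-letters. The helper names replace_numbers and keep_letters are hypothetical, and treating a number as a plain run of digits is a simplifying assumption.

```python
import re

def replace_numbers(text, placeholder="NUM"):
    """Replace each run of digits with a placeholder token."""
    return re.sub(r"\d+", placeholder, text)

def keep_letters(text):
    """Replace every character that is not an ASCII letter with a space."""
    return re.sub(r"[^a-zA-Z]", " ", text)

sentence = "It is only on for 20 minutes or so!"

without_numbers = keep_letters(sentence).split()
with_placeholder = keep_letters(replace_numbers(sentence)).split()

print(without_numbers)
print(with_placeholder)
```

The placeholder must be substituted before the letters-only step; otherwise the digits are already gone. Note that lowercasing afterwards turns "NUM" into "num".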
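Regular-expression syntax is not covered here; it is easy to look up online. The text still contains uppercase letters, which can all be converted to lowercase: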
lower_case = letters_only.lower()   # Convert to lower case
words = lower_case.split()          # Split into words
Calling lower() on letters_only converts the uppercase letters in the text to lowercase; calling split() on lower_case then extracts each word of the passage, giving words, which is a list.
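Because the regular-expression step replaces each removed character with a space, the text can contain runs of several spaces. The short sketch below, on a made-up fragment, shows why split() is called with no arguments: it splits on any run of whitespace, whereas split(" ") would produce empty strings.

```python
# Hypothetical fragment with a double space left behind by a removed comma
letters_only = "Nah  Joe Pesci s character ranted"

words = letters_only.lower().split()      # splits on runs of whitespace
naive = letters_only.lower().split(" ")   # splits on every single space

print(words)
print(naive)
```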
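Figure 2: Some of the variables
Finally, we need to decide how to handle words that occur frequently but carry little meaning, such as a, and, the, and is. Such words are called "stop words". Although stop words are the most common words in a language, no single stop-word list is shared by all natural language processing tools, and a single tool may even use several lists. The NLTK package (Natural Language Toolkit) includes a list of stop words. After installing nltk, use .download() to install its data packages; running the command opens the downloader interface shown below, and the download may take a while.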
>>> import nltk
>>> nltk.download()
Once the data packages are installed, nltk can be used to view the list of stop words.
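A minimal sketch of loading the stop words and filtering them out of a word list follows. It assumes the NLTK "stopwords" corpus has been downloaded; if nltk or the corpus is unavailable, it falls back to a tiny hand-written set that is purely illustrative and far from complete. load_stop_words and remove_stop_words are hypothetical helper names.

```python
def load_stop_words():
    """Return English stop words from NLTK, or a small fallback set."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # Assumption: illustrative fallback only, not NLTK's actual list
        return {"a", "an", "and", "the", "is", "of", "to", "in"}

def remove_stop_words(words, stop_words):
    """Keep only the words that are not in stop_words, preserving order."""
    return [w for w in words if w not in stop_words]

stop_words = load_stop_words()
words = ["the", "movie", "is", "a", "classic"]
print(remove_stop_words(words, stop_words))
```

Converting the list to a set first makes each membership test fast, which matters when filtering thousands of reviews.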
References:
[1] Bag of Words Meets Bags of Popcorn: https://www.kaggle.com/c/word2vec-nlp-tutorial
[2]Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). "Learning Word Vectors for Sentiment Analysis." The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
[3] https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors
[4] Beautiful Soup 4.2.0 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html