01-NLP-04-04

Predicting Financial Market Movements from Daily News (Advanced)

In this tutorial, we use FastText for classification.

In [53]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from datetime import date
 

Inspecting the data

First we read in the data. I have provided a dataset that is already combined.

In [54]:
data = pd.read_csv('../input/Combined_News_DJIA.csv')
 

Now we can take a look at what the data looks like:

In [55]:
data.head()
Out[55]:
 Date Label Top1 Top2 Top3 Top4 Top5 Top6 Top7 Top8 ... Top16 Top17 Top18 Top19 Top20 Top21 Top22 Top23 Top24 Top25
0 2008-08-08 0 b"Georgia 'downs two Russian warplanes' as cou... b'BREAKING: Musharraf to be impeached.' b'Russia Today: Columns of troops roll into So... b'Russian tanks are moving towards the capital... b"Afghan children raped with 'impunity,' U.N. ... b'150 Russian tanks have entered South Ossetia... b"Breaking: Georgia invades South Ossetia, Rus... b"The 'enemy combatent' trials are nothing but... ... b'Georgia Invades South Ossetia - if Russia ge... b'Al-Qaeda Faces Islamist Backlash' b'Condoleezza Rice: "The US would not act to p... b'This is a busy day: The European Union has ... b"Georgia will withdraw 1,000 soldiers from Ir... b'Why the Pentagon Thinks Attacking Iran is a ... b'Caucasus in crisis: Georgia invades South Os... b'Indian shoe manufactory - And again in a se... b'Visitors Suffering from Mental Illnesses Ban... b"No Help for Mexico's Kidnapping Surge"
1 2008-08-11 1 b'Why wont America and Nato help us? If they w... b'Bush puts foot down on Georgian conflict' b"Jewish Georgian minister: Thanks to Israeli ... b'Georgian army flees in disarray as Russians ... b"Olympic opening ceremony fireworks 'faked'" b'What were the Mossad with fraudulent New Zea... b'Russia angered by Israeli military sale to G... b'An American citizen living in S.Ossetia blam... ... b'Israel and the US behind the Georgian aggres... b'"Do not believe TV, neither Russian nor Geor... b'Riots are still going on in Montreal (Canada... b'China to overtake US as largest manufacturer' b'War in South Ossetia [PICS]' b'Israeli Physicians Group Condemns State Tort... b' Russia has just beaten the United States ov... b'Perhaps *the* question about the Georgia - R... b'Russia is so much better at war' b"So this is what it's come to: trading sex fo...
2 2008-08-12 0 b'Remember that adorable 9-year-old who sang a... b"Russia 'ends Georgia operation'" b'"If we had no sexual harassment we would hav... b"Al-Qa'eda is losing support in Iraq because ... b'Ceasefire in Georgia: Putin Outmaneuvers the... b'Why Microsoft and Intel tried to kill the XO... b'Stratfor: The Russo-Georgian War and the Bal... b"I'm Trying to Get a Sense of This Whole Geor... ... b'U.S. troops still in Georgia (did you know t... b'Why Russias response to Georgia was right' b'Gorbachev accuses U.S. of making a "serious ... b'Russia, Georgia, and NATO: Cold War Two' b'Remember that adorable 62-year-old who led y... b'War in Georgia: The Israeli connection' b'All signs point to the US encouraging Georgi... b'Christopher King argues that the US and NATO... b'America: The New Mexico?' b"BBC NEWS | Asia-Pacific | Extinction 'by man...
3 2008-08-13 0 b' U.S. refuses Israel weapons to attack Iran:... b"When the president ordered to attack Tskhinv... b' Israel clears troops who killed Reuters cam... b'Britain\'s policy of being tough on drugs is... b'Body of 14 year old found in trunk; Latest (... b'China has moved 10 *million* quake survivors... b"Bush announces Operation Get All Up In Russi... b'Russian forces sink Georgian ships ' ... b'Elephants extinct by 2020?' b'US humanitarian missions soon in Georgia - i... b"Georgia's DDOS came from US sources" b'Russian convoy heads into Georgia, violating... b'Israeli defence minister: US against strike ... b'Gorbachev: We Had No Choice' b'Witness: Russian forces head towards Tbilisi... b' Quarter of Russians blame U.S. for conflict... b'Georgian president says US military will ta... b'2006: Nobel laureate Aleksander Solzhenitsyn...
4 2008-08-14 1 b'All the experts admit that we should legalis... b'War in South Osetia - 89 pictures made by a ... b'Swedish wrestler Ara Abrahamian throws away ... b'Russia exaggerated the death toll in South O... b'Missile That Killed 9 Inside Pakistan May Ha... b"Rushdie Condemns Random House's Refusal to P... b'Poland and US agree to missle defense deal. ... b'Will the Russians conquer Tblisi? Bet on it,... ... b'Bank analyst forecast Georgian crisis 2 days... b"Georgia confict could set back Russia's US r... b'War in the Caucasus is as much the product o... b'"Non-media" photos of South Ossetia/Georgia ... b'Georgian TV reporter shot by Russian sniper ... b'Saudi Arabia: Mother moves to block child ma... b'Taliban wages war on humanitarian aid workers' b'Russia: World "can forget about" Georgia\'s... b'Darfur rebels accuse Sudan of mounting major... b'Philippines : Peace Advocate say Muslims nee...

5 rows × 27 columns

 

The data is simple and intuitive. If the label is 1, the DJIA rose or stayed the same that day. If the label is 0, the DJIA fell that day.
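
To make the label definition concrete, here is a minimal sketch of how such a label could be derived from a series of closing prices. The prices are made up, and comparing each close with the previous trading day's close is an assumption; the text above only states that 1 means the DJIA rose or stayed the same and 0 means it fell.

```python
# Hypothetical closing prices, for illustration only (not taken from the dataset).
closes = [100.0, 101.5, 101.5, 99.0]

# Assumption: each day is compared with the previous trading day's close.
# 1 = rose or unchanged, 0 = fell.
labels = [1 if today >= prev else 0 for prev, today in zip(closes, closes[1:])]

print(labels)  # [1, 1, 0]
```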

 

Splitting into training/test sets

Now we can split the data into training and testing data:

In [56]:
train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']
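
The split above compares the Date column as plain strings. That works because the dates are written in ISO `YYYY-MM-DD` form, where alphabetical order and chronological order coincide. A small standard-library sketch, using made-up dates, that checks the string comparison against real date objects:

```python
from datetime import date

# Hypothetical dates in the same format as the Date column.
dates = ["2008-08-08", "2014-12-30", "2014-12-31", "2015-01-02", "2016-07-01"]

# String comparison, as in the cell above.
train = [d for d in dates if d < "2015-01-01"]
test = [d for d in dates if d > "2014-12-31"]

# The same split using parsed dates, as a cross-check.
cutoff = date(2015, 1, 1)
train_check = [d for d in dates if date.fromisoformat(d) < cutoff]
test_check = [d for d in dates if date.fromisoformat(d) >= cutoff]

print(train == train_check, test == test_check)  # True True
```

If the dates were stored in another format (for example `8/8/2008`), string comparison would no longer be safe and the column would need to be parsed first.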
 

Then we turn each news headline into a separate sentence and collect them all together:

In [57]:
X_train = train[train.columns[2:]]
corpus = X_train.values.flatten().astype(str)

X_train = X_train.values.astype(str)
X_train = np.array([' '.join(x) for x in X_train])
X_test = test[test.columns[2:]]
X_test = X_test.values.astype(str)
X_test = np.array([' '.join(x) for x in X_test])
y_train = train['Label'].values
y_test = test['Label'].values
 

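Note that we need three things here:

corpus is all of the text that we can "see". We treat each headline as one sentence; calling flatten() on all of them gives us a list of sentences.

X_train and X_test, on the other hand, must not be flattened carelessly: each row has to stay aligned with its entry in y_train and y_test.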
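
The difference between the two shapes can be seen on a toy example. The sketch below uses plain Python lists and made-up headlines instead of the real DataFrame: flattening gives one entry per headline, while joining keeps one entry per day, so it still lines up with the labels.

```python
# Two hypothetical days with two headlines each (the real data has 25 per day).
rows = [
    ["Markets rally on jobs data", "Oil prices slip"],
    ["Central bank holds rates", "Tech shares tumble"],
]
labels = [1, 0]  # one label per day

# corpus: every headline becomes its own sentence.
corpus = [headline for row in rows for headline in row]

# X: all headlines of a day joined into one document, aligned with labels.
X = [" ".join(row) for row in rows]

print(len(corpus))          # 4
print(len(X), len(labels))  # 2 2
```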

In [58]:
corpus[:3]
Out[58]:
array([ 'b"Georgia \'downs two Russian warplanes\' as countries move to brink of war"',
       "b'BREAKING: Musharraf to be impeached.'",
       "b'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)'"], 
      dtype='<U312')
In [59]:
X_train[:1]
Out[59]:
array([ 'b"Georgia \'downs two Russian warplanes\' as countries move to brink of war" b\'BREAKING: Musharraf to be impeached.\' b\'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)\' b\'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire\' b"Afghan children raped with \'impunity,\' U.N. official says - this is sick, a three year old was raped and they do nothing" b\'150 Russian tanks have entered South Ossetia whilst Georgia shoots down two Russian jets.\' b"Breaking: Georgia invades South Ossetia, Russia warned it would intervene on SO\'s side" b"The \'enemy combatent\' trials are nothing but a sham: Salim Haman has been sentenced to 5 1/2 years, but will be kept longer anyway just because they feel like it." b\'Georgian troops retreat from S. Osettain capital, presumably leaving several hundred people killed. [VIDEO]\' b\'Did the U.S. Prep Georgia for War with Russia?\' b\'Rice Gives Green Light for Israel to Attack Iran: Says U.S. has no veto over Israeli military ops\' b\'Announcing:Class Action Lawsuit on Behalf of American Public Against the FBI\' b"So---Russia and Georgia are at war and the NYT\'s top story is opening ceremonies of the Olympics?  What a fucking disgrace and yet further proof of the decline of journalism." b"China tells Bush to stay out of other countries\' affairs" b\'Did World War III start today?\' b\'Georgia Invades South Ossetia - if Russia gets involved, will NATO absorb Georgia and unleash a full scale war?\' b\'Al-Qaeda Faces Islamist Backlash\' b\'Condoleezza Rice: "The US would not act to prevent an Israeli strike on Iran." 
Israeli Defense Minister Ehud Barak: "Israel is prepared for uncompromising victory in the case of military hostilities."\' b\'This is a busy day:  The European Union has approved new sanctions against Iran in protest at its nuclear programme.\' b"Georgia will withdraw 1,000 soldiers from Iraq to help fight off Russian forces in Georgia\'s breakaway region of South Ossetia" b\'Why the Pentagon Thinks Attacking Iran is a Bad Idea - US News &amp; World Report\' b\'Caucasus in crisis: Georgia invades South Ossetia\' b\'Indian shoe manufactory  - And again in a series of "you do not like your work?"\' b\'Visitors Suffering from Mental Illnesses Banned from Olympics\' b"No Help for Mexico\'s Kidnapping Surge"'], 
      dtype='<U4424')
In [60]:
y_train[:5]
Out[60]:
array([0, 1, 0, 0, 1])
 

Next, we split each text into individual words.

Again, corpus and X_train are handled differently:

In [61]:
from nltk.tokenize import word_tokenize

corpus = [word_tokenize(x) for x in corpus]
X_train = [word_tokenize(x) for x in X_train]
X_test = [word_tokenize(x) for x in X_test]
 

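After tokenizing,

we can see that although corpus and X are both two-dimensional arrays, they mean different things.

In corpus, each inner list is one sentence.

In X, each inner list is one data point (matching one label).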

In [62]:
X_train[:2]
Out[62]:
[['b',
  "''",
  'Georgia',
  "'downs",
  'two',
  'Russian',
  'warplanes',
  "'",
  'as',
  'countries',
  'move',
  'to',
  'brink',
  'of',
  'war',
  "''",
  "b'BREAKING",
  ':',
  'Musharraf',
  'to',
  'be',
  'impeached',
  '.',
  "'",
  "b'Russia",
  'Today',
  ':',
  'Columns',
  'of',
  'troops',
  'roll',
  'into',
  'South',
  'Ossetia',
  ';',
  'footage',
  'from',
  'fighting',
  '(',
  'YouTube',
  ')',
  "'",
  "b'Russian",
  'tanks',
  'are',
  'moving',
  'towards',
  'the',
  'capital',
  'of',
  'South',
  'Ossetia',
  ',',
  'which',
  'has',
  'reportedly',
  'been',
  'completely',
  'destroyed',
  'by',
  'Georgian',
  'artillery',
  'fire',
  "'",
  'b',
  "''",
  'Afghan',
  'children',
  'raped',
  'with',
  "'impunity",
  ',',
  "'",
  'U.N.',
  'official',
  'says',
  '-',
  'this',
  'is',
  'sick',
  ',',
  'a',
  'three',
  'year',
  'old',
  'was',
  'raped',
  'and',
  'they',
  'do',
  'nothing',
  "''",
  "b'150",
  'Russian',
  'tanks',
  'have',
  'entered',
  'South',
  'Ossetia',
  'whilst',
  'Georgia',
  'shoots',
  'down',
  'two',
  'Russian',
  'jets',
  '.',
  "'",
  'b',
  "''",
  'Breaking',
  ':',
  'Georgia',
  'invades',
  'South',
  'Ossetia',
  ',',
  'Russia',
  'warned',
  'it',
  'would',
  'intervene',
  'on',
  'SO',
  "'s",
  'side',
  "''",
  'b',
  "''",
  'The',
  "'enemy",
  'combatent',
  "'",
  'trials',
  'are',
  'nothing',
  'but',
  'a',
  'sham',
  ':',
  'Salim',
  'Haman',
  'has',
  'been',
  'sentenced',
  'to',
  '5',
  '1/2',
  'years',
  ',',
  'but',
  'will',
  'be',
  'kept',
  'longer',
  'anyway',
  'just',
  'because',
  'they',
  'feel',
  'like',
  'it',
  '.',
  "''",
  "b'Georgian",
  'troops',
  'retreat',
  'from',
  'S.',
  'Osettain',
  'capital',
  ',',
  'presumably',
  'leaving',
  'several',
  'hundred',
  'people',
  'killed',
  '.',
  '[',
  'VIDEO',
  ']',
  "'",
  "b'Did",
  'the',
  'U.S.',
  'Prep',
  'Georgia',
  'for',
  'War',
  'with',
  'Russia',
  '?',
  "'",
  "b'Rice",
  'Gives',
  'Green',
  'Light',
  'for',
  'Israel',
  'to',
  'Attack',
  'Iran',
  ':',
  'Says',
  'U.S.',
  'has',
  'no',
  'veto',
  'over',
  'Israeli',
  'military',
  'ops',
  "'",
  "b'Announcing",
  ':',
  'Class',
  'Action',
  'Lawsuit',
  'on',
  'Behalf',
  'of',
  'American',
  'Public',
  'Against',
  'the',
  'FBI',
  "'",
  'b',
  "''",
  'So',
  '--',
  '-Russia',
  'and',
  'Georgia',
  'are',
  'at',
  'war',
  'and',
  'the',
  'NYT',
  "'s",
  'top',
  'story',
  'is',
  'opening',
  'ceremonies',
  'of',
  'the',
  'Olympics',
  '?',
  'What',
  'a',
  'fucking',
  'disgrace',
  'and',
  'yet',
  'further',
  'proof',
  'of',
  'the',
  'decline',
  'of',
  'journalism',
  '.',
  "''",
  'b',
  "''",
  'China',
  'tells',
  'Bush',
  'to',
  'stay',
  'out',
  'of',
  'other',
  'countries',
  "'",
  'affairs',
  "''",
  "b'Did",
  'World',
  'War',
  'III',
  'start',
  'today',
  '?',
  "'",
  "b'Georgia",
  'Invades',
  'South',
  'Ossetia',
  '-',
  'if',
  'Russia',
  'gets',
  'involved',
  ',',
  'will',
  'NATO',
  'absorb',
  'Georgia',
  'and',
  'unleash',
  'a',
  'full',
  'scale',
  'war',
  '?',
  "'",
  "b'Al-Qaeda",
  'Faces',
  'Islamist',
  'Backlash',
  "'",
  "b'Condoleezza",
  'Rice',
  ':',
  '``',
  'The',
  'US',
  'would',
  'not',
  'act',
  'to',
  'prevent',
  'an',
  'Israeli',
  'strike',
  'on',
  'Iran',
  '.',
  "''",
  'Israeli',
  'Defense',
  'Minister',
  'Ehud',
  'Barak',
  ':',
  '``',
  'Israel',
  'is',
  'prepared',
  'for',
  'uncompromising',
  'victory',
  'in',
  'the',
  'case',
  'of',
  'military',
  'hostilities',
  '.',
  "''",
  "'",
  "b'This",
  'is',
  'a',
  'busy',
  'day',
  ':',
  'The',
  'European',
  'Union',
  'has',
  'approved',
  'new',
  'sanctions',
  'against',
  'Iran',
  'in',
  'protest',
  'at',
  'its',
  'nuclear',
  'programme',
  '.',
  "'",
  'b',
  "''",
  'Georgia',
  'will',
  'withdraw',
  '1,000',
  'soldiers',
  'from',
  'Iraq',
  'to',
  'help',
  'fight',
  'off',
  'Russian',
  'forces',
  'in',
  'Georgia',
  "'s",
  'breakaway',
  'region',
  'of',
  'South',
  'Ossetia',
  "''",
  "b'Why",
  'the',
  'Pentagon',
  'Thinks',
  'Attacking',
  'Iran',
  'is',
  'a',
  'Bad',
  'Idea',
  '-',
  'US',
  'News',
  '&',
  'amp',
  ';',
  'World',
  'Report',
  "'",
  "b'Caucasus",
  'in',
  'crisis',
  ':',
  'Georgia',
  'invades',
  'South',
  'Ossetia',
  "'",
  "b'Indian",
  'shoe',
  'manufactory',
  '-',
  'And',
  'again',
  'in',
  'a',
  'series',
  'of',
  '``',
  'you',
  'do',
  'not',
  'like',
  'your',
  'work',
  '?',
  "''",
  "'",
  "b'Visitors",
  'Suffering',
  'from',
  'Mental',
  'Illnesses',
  'Banned',
  'from',
  'Olympics',
  "'",
  'b',
  "''",
  'No',
  'Help',
  'for',
  'Mexico',
  "'s",
  'Kidnapping',
  'Surge',
  "''"],
 ["b'Why",
  'wont',
  'America',
  'and',
  'Nato',
  'help',
  'us',
  '?',
  'If',
  'they',
  'wont',
  'help',
  'us',
  'now',
  ',',
  'why',
  'did',
  'we',
  'help',
  'them',
  'in',
  'Iraq',
  '?',
  "'",
  "b'Bush",
  'puts',
  'foot',
  'down',
  'on',
  'Georgian',
  'conflict',
  "'",
  'b',
  "''",
  'Jewish',
  'Georgian',
  'minister',
  ':',
  'Thanks',
  'to',
  'Israeli',
  'training',
  ',',
  'we',
  "'re",
  'fending',
  'off',
  'Russia',
  '``',
  "b'Georgian",
  'army',
  'flees',
  'in',
  'disarray',
  'as',
  'Russians',
  'advance',
  '-',
  'Gori',
  'abandoned',
  'to',
  'Russia',
  'without',
  'a',
  'shot',
  'fired',
  "'",
  'b',
  "''",
  'Olympic',
  'opening',
  'ceremony',
  'fireworks',
  "'faked",
  "'",
  "''",
  "b'What",
  'were',
  'the',
  'Mossad',
  'with',
  'fraudulent',
  'New',
  'Zealand',
  'Passports',
  'doing',
  'in',
  'Iraq',
  '?',
  "'",
  "b'Russia",
  'angered',
  'by',
  'Israeli',
  'military',
  'sale',
  'to',
  'Georgia',
  "'",
  "b'An",
  'American',
  'citizen',
  'living',
  'in',
  'S.Ossetia',
  'blames',
  'U.S.',
  'and',
  'Georgian',
  'leaders',
  'for',
  'the',
  'genocide',
  'of',
  'innocent',
  'people',
  "'",
  "b'Welcome",
  'To',
  'World',
  'War',
  'IV',
  '!',
  'Now',
  'In',
  'High',
  'Definition',
  '!',
  "'",
  'b',
  "''",
  'Georgia',
  "'s",
  'move',
  ',',
  'a',
  'mistake',
  'of',
  'monumental',
  'proportions',
  '``',
  "b'Russia",
  'presses',
  'deeper',
  'into',
  'Georgia',
  ';',
  'U.S.',
  'says',
  'regime',
  'change',
  'is',
  'goal',
  "'",
  "b'Abhinav",
  'Bindra',
  'wins',
  'first',
  'ever',
  'Individual',
  'Olympic',
  'Gold',
  'Medal',
  'for',
  'India',
  "'",
  'b',
  "'",
  'U.S.',
  'ship',
  'heads',
  'for',
  'Arctic',
  'to',
  'define',
  'territory',
  "'",
  "b'Drivers",
  'in',
  'a',
  'Jerusalem',
  'taxi',
  'station',
  'threaten',
  'to',
  'quit',
  'rather',
  'than',
  'work',
  'for',
  'their',
  'new',
  'boss',
  '-',
  'an',
  'Arab',
  "'",
  "b'The",
  'French',
  'Team',
  'is',
  'Stunned',
  'by',
  'Phelps',
  'and',
  'the',
  '4x100m',
  'Relay',
  'Team',
  "'",
  "b'Israel",
  'and',
  'the',
  'US',
  'behind',
  'the',
  'Georgian',
  'aggression',
  '?',
  "'",
  'b',
  "'",
  "''",
  'Do',
  'not',
  'believe',
  'TV',
  ',',
  'neither',
  'Russian',
  'nor',
  'Georgian',
  '.',
  'There',
  'are',
  'much',
  'more',
  'victims',
  "''",
  "'",
  "b'Riots",
  'are',
  'still',
  'going',
  'on',
  'in',
  'Montreal',
  '(',
  'Canada',
  ')',
  'because',
  'police',
  'murdered',
  'a',
  'boy',
  'on',
  'Saturday',
  '.',
  "'",
  "b'China",
  'to',
  'overtake',
  'US',
  'as',
  'largest',
  'manufacturer',
  "'",
  "b'War",
  'in',
  'South',
  'Ossetia',
  '[',
  'PICS',
  ']',
  "'",
  "b'Israeli",
  'Physicians',
  'Group',
  'Condemns',
  'State',
  'Torture',
  "'",
  'b',
  "'",
  'Russia',
  'has',
  'just',
  'beaten',
  'the',
  'United',
  'States',
  'over',
  'the',
  'head',
  'with',
  'Peak',
  'Oil',
  "'",
  "b'Perhaps",
  '*the*',
  'question',
  'about',
  'the',
  'Georgia',
  '-',
  'Russia',
  'conflict',
  "'",
  "b'Russia",
  'is',
  'so',
  'much',
  'better',
  'at',
  'war',
  "'",
  'b',
  "''",
  'So',
  'this',
  'is',
  'what',
  'it',
  "'s",
  'come',
  'to',
  ':',
  'trading',
  'sex',
  'for',
  'food',
  '.',
  "''"]]
In [63]:
corpus[:2]
Out[63]:
[['b',
  "''",
  'Georgia',
  "'downs",
  'two',
  'Russian',
  'warplanes',
  "'",
  'as',
  'countries',
  'move',
  'to',
  'brink',
  'of',
  'war',
  "''"],
 ["b'BREAKING", ':', 'Musharraf', 'to', 'be', 'impeached', '.', "'"]]
 

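Preprocessing

We apply some preprocessing to make our text more uniform:

  • lowercasing

  • removing stop words

  • removing numbers and symbols

  • lemmatization

We combine these steps into a single function: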

In [64]:
# Stop words (requires the NLTK 'stopwords' corpus to be downloaded)
from nltk.corpus import stopwords
stop = stopwords.words('english')

# Numbers
import re
def hasNumbers(inputString):
    return bool(re.search(r'\d', inputString))

# Special symbols (re.match only tests the first character of the token)
def isSymbol(inputString):
    return bool(re.match(r'[^\w]', inputString))

# Lemmatization
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def check(word):
    """
    Return True if the word should be kept,
    False if it should be removed.
    """
    word = word.lower()
    if word in stop:
        return False
    elif hasNumbers(word) or isSymbol(word):
        return False
    else:
        return True

# Combine the steps above
def preprocessing(sen):
    res = []
    for word in sen:
        if check(word):
            # This only strips the markers left behind when Python bytes were stored as str. The data was not cleaned properly beforehand; other datasets will not have this issue.
            word = word.lower().replace("b'", '').replace('b"', '').replace('"', '').replace("'", '')
            res.append(wordnet_lemmatizer.lemmatize(word))
    return res
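
The `replace` chain above removes the `b'...'` markers token by token, after tokenization has already split them apart. As an alternative sketch (not the approach used in this tutorial), the marker could be stripped from each raw headline before tokenizing. The helper name `strip_bytes_repr` is hypothetical:

```python
import re

# Matches a whole string of the form b'...' or b"...", with the same quote
# character at both ends.
_BYTES_REPR = re.compile(r"""^b(['"])(.*)\1$""", re.DOTALL)

def strip_bytes_repr(headline):
    """Return the text inside a leftover bytes literal, or the input unchanged."""
    match = _BYTES_REPR.match(headline.strip())
    return match.group(2) if match else headline

print(strip_bytes_repr("b'BREAKING: Musharraf to be impeached.'"))
print(strip_bytes_repr('b"No Help for Mexico\'s Kidnapping Surge"'))
print(strip_bytes_repr("Already clean headline"))
```

Cleaning at this stage would keep stray `b` tokens out of the tokenized output; whether that changes the final score was not tested here.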
 

Now we run all three of our datasets through it:

In [65]:
corpus = [preprocessing(x) for x in corpus]
X_train = [preprocessing(x) for x in X_train]
X_test = [preprocessing(x) for x in X_test]
 

Let's look at the data again after processing:

In [66]:
print(corpus[553])
print(X_train[523])
 
['north', 'korean', 'leader', 'kim', 'jong-il', 'confirmed', 'ill']
['two', 'redditors', 'climbing', 'mt', 'kilimanjaro', 'charity', 'bidding', 'peak', 'nt', 'squander', 'opportunity', 'let', 'upvotes', 'something', 'awesome', 'estimated', 'take', 'year', 'clear', 'lao', 'explosive', 'remnant', 'left', 'behind', 'united', 'state', 'bomber', 'year', 'ago', 'people', 'died', 'unexploded', 'ordnance', 'since', 'conflict', 'ended', 'fidel', 'ahmadinejad', 'slandering', 'jew', 'mossad', 'america', 'israel', 'intelligence', 'agency', 'target', 'united', 'state', 'intensively', 'among', 'nation', 'considered', 'friendly', 'washington', 'israel', 'lead', 'others', 'active', 'espionage', 'directed', 'american', 'company', 'defense', 'department', 'australian', 'election', 'day', 'poll', 'rural/regional', 'independent', 'member', 'parliament', 'support', 'labor', 'minority', 'goverment', 'julia', 'gillard', 'prime', 'minister', 'france', 'plan', 'raise', 'retirement', 'age', 'set', 'strike', 'britain', 'parliament', 'police', 'murdoch', 'paper', 'adviser', 'pm', 'implicated', 'voicemail', 'hacking', 'scandal', 'british', 'policeman', 'jailed', 'month', 'cell', 'attack', 'woman', 'rest', 'email', 'display', 'fundemental', 'disdain', 'pluralistic', 'america', 'reveals', 'chilling', 'level', 'islamophobia', 'hatemongering', 'church', 'plan', 'burn', 'quran', 'endanger', 'troop', 'u', 'commander', 'warns', 'freed', 'journalist', 'tricked', 'captor', 'twitter', 'access', 'manila', 'water', 'crisis', 'expose', 'impact', 'privatisation', 'july', 'week-long', 'rationing', 'water', 'highlighted', 'reality', 'million', 'people', 'denied', 'basic', 'right', 'potable', 'water', 'sanitation', 'private', 'firm', 'rake', 'profit', 'expense', 'weird', 'uk', 'police', 'ask', 'help', 'case', 'slain', 'intelligence', 'agent', 'greenpeace', 'japan', 'anti-whaling', 'activist', 'found', 'guilty', 'theft', 'captured', 'journalist', 'trick', 'captor', 'revealing', 'alive', 'creepy', 'biometric', 'id', 'forced', 'onto', 'india', 'billion', 'inhabitant', 'fear', 
'loss', 'privacy', 'government', 'abuse', 'abound', 'india', 'gear', 'biometrically', 'identify', 'number', 'billion', 'inhabitant', 'china', 'young', 'officer', 'syndrome', 'china', 'military', 'spending', 'growing', 'fast', 'overtaken', 'strategy', 'said', 'professor', 'huang', 'jing', 'school', 'public', 'policy', 'young', 'officer', 'taking', 'control', 'strategy', 'like', 'young', 'officer', 'japan', 'mexican', 'soldier', 'open', 'fire', 'family', 'car', 'military', 'checkpoint', 'killing', 'father', 'son', 'death', 'toll', 'continues', 'climb', 'guatemala', 'landslide', 'foreign', 'power', 'stop', 'interfering', 'case', 'iranian', 'woman', 'sentenced', 'death', 'stoning', 'iran', 'foreign', 'ministry', 'said', 'mexican', 'official', 'gunman', 'behind', 'massacre', 'killed', 'tv', 'anchor', 'stabbed', 'death', 'outside', 'kabul', 'home', 'mosque', 'menace', 'confined', 'lower', 'manhattan', 'many', 'european', 'country', 'similar', 'alarm', 'sounded', 'muslim', 'coming', 'french', 'citizen', 'barred', 'american', 'military', 'base', 'dutch', 'neo-nazi', 'donates', 'sperm', 'white', 'dutch', 'neo-nazi', 'offered', 'donate', 'sperm', 'four', 'fertility', 'clinic', 'netherlands', 'effort', 'promote', 'call', 'strong', 'white', 'race']
 

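Training the NLP model

With these clean datasets, we can build our NLP model.

Here we will use FastText.

I have already covered the theory in the course slides; here we look more closely at how to use it in practice.

Because the paper had only just been published, many community contributors were contributing code to get an open-source Python build available as soon as possible (I was one of them).

Of course, since the Facebook team has already released the source code (C++) on GitHub,

we can use a Python wrapper as an interface to call it conveniently.

First, as we discussed, FT treats the label as just another element and feeds it into the word2vec network.

So we need to put the label into our "sentence":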

In [67]:
for i in range(len(y_train)):
    label = '__label__' + str(y_train[i])
    X_train[i].append(label)

print(X_train[49])
 
['the', 'man', 'podium', 'dutch', 'non-profit', 'reproductive', 'health', 'organization', 'sail', 'ship', 'around', 'world', 'anchoring', 'international', 'water', 'provide', 'abortion', 'woman', 'country', 'abortion', 'banned', 'b', 'grand', 'ayatollah', 'issue', 'decree', 'calling', 'muslim', 'defend', 'iraq', 'christian', 'marx', 'da', 'kapital', 'sale', 'soar', 'among', 'young', 'german', 'a', 'man', 'england', 'killed', 'wife', 'changed', 'facebook', 'relationship', 'status', 'single', 'georgia', 'used', 'cluster', 'bomb', 'august', 'war', 'arctic', 'temperature', 'break', 'all-time', 'recorded', 'high', 'reddit', 'please', 'send', 'help', 'uk', 'politician', 'insane', 'apparently', 'monitoring', 'mobile', 'web', 'record', 'would', 'giving', 'licence', 'terrorist', 'kill', 'people', 'wow', 'secret', 'coded', 'message', 'embedded', 'child', 'pornographic', 'image', 'paedophile', 'website', 'exploited', 'secure', 'way', 'passing', 'information', 'terrorist', 'england', 'run', 'honey', 'christmas', 'catastrophic', 'honeybee', 'decline', 'b', 'iran', 'stop', 'executing', 'youth', 'china', 'watch', 'internet', 'caf', 'customer', 'web', 'crackdown', 'china\\', 'medium', 'freedom', 'reduced', 'new', 'measure', 'include', 'camera', 'internet', 'cafe', 'picture', 'taken', 'user', 'bali', 'bombing', 'new', 'suspect', 'hindu', 'american', 'foundation', 'petition', 'ny', 'time', 'focus', 'much', 'activity', 'christian', 'missionary', 'india', 'anti-christian', 'violence', 'a', 'quick', 'overview', 'islamic', 'terror', 'organization', 'get', 'funding', 'last', 'titantic', 'survivor', 'auction', 'memento', 'pay', 'nursing', 'home', 'better', 'hungary', 'get', 'loan', 'avert', 'meltdown', 'sao', 'paolo', 'hundred', 'black-clad', 'military', 'police', 'fired', 'teargas', 'stun', 'grenade', 'rubber', 'bullet', 'striking', 'civilian', 'officer', 'seeking', 'percent', 'pay', 'raise', 'austrailian', 'historian', 'arrested', 'holocaust', 'denial', 'defense', 'secretary', 'gate', 
'said', 'prepared', 'reconciliation', 'taliban', 'part', 'political', 'outcome', 'afghanistan', 'is', 'switzerland', 'next', 'iceland', 'switzerland', 'forced', 'take', 'emergency', 'measure', 'yesterday', 'shore', 'two', 'biggest', 'lender', 'prevent', 'collapse', 'confidence', 'country\\', 'banking', 'system', 'police', 'battle', 'police', 'sao', 'paulo', 'civilian', 'killed', 'nato', 'air', 'strike', 'afghanistan', 'villager', 'the', 'west', 'loss', 'afghanistan', '__label__0']
 

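Next, we save the data to files. The FastText module we use here is only a Python interface; the actual work still goes through the C++ implementation.

We need to save three things:

the training set, with labels

the test set, without labels

the labels, in a separate file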

In [68]:
X_train = [' '.join(x) for x in X_train]

print(X_train[12])
 
north korea halt denuclearisation u fails remove list state sponsoring terrorism child among dead u airstrike afghanistan the russian parliament voted overwhelmingly officially recognize independence abkhazia south ossetia violent animal right activist set fire scientist home little protection available scientist nbc censored olympic champion matthew mitcham gay un say convincing evidence show u airstrike afghanistan killed people including child italy try outlaw islam mystery virus kill israeli group peace say settlement construction occupied west bank nearly doubled since last year b revealed britain secret propaganda war al-qaida b israel settlement surge draw rice criticism solar powered carbon neutral pyramid house million people dubai russia claim proof genocide how nato transformed military alliance quasi-united nation cartwheeling banned school philly-area activist released china jeff said slapped around threatend saying want head cut want shot b vatican describes hindu attack christian orphanage god protester tell tale beijing detention- sleep deprivation threat oh python kill zookeeper kelly murdered say uk intelligence insider b fury image myra hindley appears british film olympics party b north korea suspend nuclear disablement german suspect bayer pesticide beehive collapse research terrorism invaluable fear arrest top u diplomat escape gun attack pakistan __label__1
 

The test set is handled the same way.

In [69]:
X_test = [' '.join(x) for x in X_test]

with open('../input/train_ft.txt', 'w') as f:
    for sen in X_train:
        f.write(sen+'\n')

with open('../input/test_ft.txt', 'w') as f:
    for sen in X_test:
        f.write(sen+'\n')

with open('../input/test_label_ft.txt', 'w') as f:
    for label in y_test:
        f.write(str(label)+'\n')
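
Before training, it can be worth reading a file back to confirm that every line ends with exactly one `__label__` token. A minimal sketch under that assumption, using a temporary directory and made-up tokens; the helper names `to_fasttext_line` and `split_label` are hypothetical:

```python
import os
import tempfile

PREFIX = "__label__"

def to_fasttext_line(tokens, label=None):
    """Join tokens into one line, appending the label token if one is given."""
    parts = list(tokens)
    if label is not None:
        parts.append(PREFIX + str(label))
    return " ".join(parts)

def split_label(line):
    """Split a line back into (words, labels)."""
    tokens = line.split()
    words = [t for t in tokens if not t.startswith(PREFIX)]
    labels = [t[len(PREFIX):] for t in tokens if t.startswith(PREFIX)]
    return words, labels

# Made-up documents and labels, one label per document.
docs = [["north", "korea", "halt"], ["russia", "claim", "proof"]]
ys = [1, 0]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_ft.txt")
    with open(path, "w", encoding="utf-8") as f:
        for tokens, y in zip(docs, ys):
            f.write(to_fasttext_line(tokens, y) + "\n")
    with open(path, encoding="utf-8") as f:
        parsed = [split_label(line) for line in f]

print(parsed)
```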
 

Calling the FastText module

In [95]:
import fasttext

clf = fasttext.supervised('../input/train_ft.txt', 'model', dim=256, ws=5, neg=5, epoch=100, min_count=10, lr=0.1, lr_update_rate=1000, bucket=200000)
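
The `fasttext.supervised` call above comes from an early Python wrapper. More recent releases of the official `fasttext` package expose training as `fasttext.train_supervised`, with camelCase keyword names such as `minCount` and `lrUpdateRate`. The sketch below only translates the keyword names and defers the import, so it runs without the package installed. The helper names `to_current_kwargs` and `train_current` are hypothetical, and the mapping should be checked against the version you actually have:

```python
# Keyword names that differ between the legacy call above and the current
# official bindings (assumption: a recent `fasttext` release).
RENAMED = {"min_count": "minCount", "lr_update_rate": "lrUpdateRate"}

def to_current_kwargs(legacy):
    """Return a copy of `legacy` with the renamed keys translated."""
    return {RENAMED.get(key, key): value for key, value in legacy.items()}

def train_current(train_path, **legacy):
    """Train with the official bindings; the import is deferred until called."""
    import fasttext  # requires the official `fasttext` package
    return fasttext.train_supervised(input=train_path, **to_current_kwargs(legacy))

# The same hyperparameters as in the cell above.
params = dict(dim=256, ws=5, neg=5, epoch=100, min_count=10, lr=0.1,
              lr_update_rate=1000, bucket=200000)

print(to_current_kwargs(params))
```

In those newer bindings, `predict` returns labels that still carry the `__label__` prefix, together with their probabilities, so the direct conversion to `int` used in the next cell would need to strip the prefix first.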
 

With our FT model trained, we can evaluate it on the test set.

In [96]:
y_scores = []

# Use predict to get a label for each test document
labels = clf.predict(X_test)

y_preds = np.array(labels).flatten().astype(int)

# Take a look
print(len(y_test))
print(y_test)
print(len(y_preds))
print(y_preds)

from sklearn import metrics

# Compute the AUC (from the hard 0/1 predictions)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_preds, pos_label=1)
print(metrics.auc(fpr, tpr))
 
378
[1 0 0 1 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 0 1 1 1 1 0 0 1 0 1 1 1 0 0 1 0 1 1
 0 0 1 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 0 0 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1
 0 1 1 1 0 1 0 0 1 1 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1
 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 0 1 0 1 1 1 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0
 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 0 1
 0 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1
 1 0 1 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1 1 0 0 1 1 1 0 0 1 0 0 0 1 0 0 0 1 1
 0 1 0 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 1 1 1 0 0 1 0 1 1 0 0 1 1 1 1 1
 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 1 1 1 0 1 1 1 0 1 0 1 1 0
 0 1 0 0 1 1 0 1 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 1 0 1 1 1 0 0 0 0 0 1 0 1 1
 0 1 0 0 1 1 1 1]
378
[0 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0 0 0 1 1
 1 1 1 0 1 1 0 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0
 1 1 0 1 0 0 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1 0 1 0 1
 1 1 0 0 1 1 1 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 1 1 1 1
 0 0 1 1 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 1 0 1 1
 1 1 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 0 1 0 0 1 0
 1 0 1 0 0 0 1 1 0 0 1 1 1 1 0 1 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 1 1 1
 1 1 0 0 0 1 1 0 1 1 1 0 1 0 1 1 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1
 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 0
 1 1 1 1 1 0 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0
 1 1 1 0 1 1 0 1]
0.463877688172
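
Because `y_preds` holds hard 0/1 predictions rather than scores, the ROC curve has only one interior point, and the area under it reduces to the average of the true positive rate and the true negative rate (balanced accuracy). A value of about 0.46 therefore means the model does slightly worse than random guessing on this test set. A small pure-Python sketch on made-up labels showing the equivalence:

```python
def balanced_accuracy(y_true, y_pred):
    """Average of true positive rate and true negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

def rank_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up ground truth and hard predictions, for illustration only.
y_true = [1, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 1]

print(balanced_accuracy(y_true, y_pred))  # 0.5
print(rank_auc(y_true, y_pred))           # 0.5
```

To get a more informative AUC, the predicted probability of the positive class would be needed in place of the hard label.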
 

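As before, parameter tuning or resampling could improve our result.

That said, FT is itself a word2vec-style model with a binary-tree-like classifier attached on top.

So on a small amount of data it does not produce very good results, and we would do better attaching our own SVM.

With a large amount of data and a large number of labels, however, its strengths show.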

posted on 2018-05-28 11:11  Josie_chen