[Machine Learning] Multi-Class Text Classification

I recently came across a good example of a classification task built with the Scikit-Learn toolkit. I am posting the original article here, together with the results of my own run.

Original article: https://cloud.tencent.com/developer/article/1097919

 

The dataset can be downloaded from data.gov: https://catalog.data.gov/dataset/consumer-complaint-database

It consists of consumer complaints about banks and other financial institutions.

Problem Description

This is a supervised text classification problem, and our goal is to investigate which supervised machine learning method is best suited to it.

Given a complaint, we want to assign it to one of 12 categories. The classifier assumes that each new complaint belongs to one and only one category, which makes this a multi-class text classification problem. I can't wait to see what we can do!

 

1. First, take a good look at the dataset

import pandas as pd
df = pd.read_csv('Consumer_Complaints.csv')
# Keep only the first 40,000 rows: the sparse matrix built later cannot
# handle a very large dimension, so we shrink the data up front.
df = df.iloc[:40000, :]

For this project we only need two of the columns: “Product” and “Consumer complaint narrative”. The narrative serves as our input, and “Product” is the output, i.e. the category of the input.

  • Input: Consumer_complaint_narrative (each consumer complaint narrative is one document). Example: “I have outdated information on my credit report that I have previously disputed; the information is over seven years old and has not been removed, which does not comply with credit reporting requirements.”
  • Output: Product (the category of the input). Example: Credit reporting

We will drop the missing values in the “Consumer complaint narrative” column and add a column that encodes each product as an integer, because models usually handle integer-encoded categories better than strings.

We also create a couple of dictionaries for later use.

Once cleaning is done, we can display the first five rows of data:

2. Data Processing

2.1 Drop rows whose *Consumer complaint narrative* is NaN

df = df[pd.notnull(df['Consumer complaint narrative'])]
df.shape
# -> (12633, 18)

The 40,000 rows shrink to 12,633 once the empty narratives are dropped.

# Inspect all columns
df.info()

2.2 Select and rename the columns

col = ['Product', 'Consumer complaint narrative']
df = df[col]
df.head()
df.columns = ['Product', 'Consumer_complaint_narrative']

2.3 Add a category_id column and build lookup dictionaries

df['category_id'] = df['Product'].factorize()[0]
category_id_df = df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')
# Dictionaries mapping category name <-> integer id
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Product']].values)
df.head()

category_id_df.head()


2.4 View the class distribution

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 6))
# Number of complaints per product category
df.groupby('Product').Consumer_complaint_narrative.count().plot.bar(ylim=0)
plt.show()

 

3. Bag-of-Words: Convert the Text to TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1',
                        ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.Consumer_complaint_narrative)
labels = df.category_id
features.shape

Check the feature shape: (12633, 33769), i.e. 33,769 TF-IDF features.

We can inspect the extracted unigram and bigram features:

tfidf.get_feature_names_out()


Then convert the resulting features to a dense array:

# Note: chi2 below also accepts sparse input, so this step is optional
# and can be memory-hungry for a large vocabulary.
features = features.toarray()
features

4. Feature Selection

from sklearn.feature_selection import chi2
import numpy as np

N = 2
for Product, category_id in sorted(category_to_id.items()):
    # Chi-squared score of every feature against a one-vs-rest target
    # for this category
    features_chi2 = chi2(features, labels == category_id)
    # argsort returns indices in ascending score order
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(Product))
    print("  . Most correlated unigrams:\n       . {}".format('\n       . '.join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n       . {}".format('\n       . '.join(bigrams[-N:])))

This prints the most correlated unigrams and bigrams for each category.

5. A First Classifier: Naive Bayes

 

  • To train a supervised classifier, we first turn each document into a numerical vector. We explored vector representations such as TF-IDF weighted vectors.
  • With this vector representation of the text, we can train a supervised classifier to predict the category (“Product”) of an unseen document (a consumer complaint narrative).

 

With the data transformations above complete, we have features and category labels for every document and can now train a classifier. Many algorithms can solve this kind of problem.

 

  • Naive Bayes classifier: the variant most suitable for word counts is the multinomial one (MultinomialNB).

 

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(df['Consumer_complaint_narrative'], df['Product'], random_state = 0)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Train the model
clf = MultinomialNB().fit(X_train_tfidf, y_train)

Then run inference on a new complaint:

print(clf.predict(tfidf_transformer.transform(count_vect.transform(["I have a paid and satisfied Ga tax state lien that has been released on XX/XX/2017 from the XXXX XXXX XXXX XXXX. I have submitted the information over to all credit bureaus, XXXX, XXXX, and Equifax. All bureaus besides Equifax have released and removed the lien from my credit report. I have called Equifax on several occasions to se reason on why they wont remove this lien from my report."]))))

The printed prediction can then be checked against the true category of this text in the original data:

df[df['Consumer_complaint_narrative'] == "I have a paid and satisfied Ga tax state lien that has been released on XX/XX/2017 from the XXXX XXXX XXXX XXXX. I have submitted the information over to all credit bureaus, XXXX, XXXX, and Equifax. All bureaus besides Equifax have released and removed the lien from my credit report. I have called Equifax on several occasions to se reason on why they wont remove this lien from my report."]
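As an aside, the CountVectorizer → TfidfTransformer → MultinomialNB steps above can be chained into a single scikit-learn Pipeline, so a new complaint needs only one `predict` call. A minimal, self-contained sketch; the four training documents below are toy stand-ins for the real X_train and y_train:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the (X_train, y_train) split built above
X_train = [
    "outdated lien on my credit report not removed",
    "mortgage escrow account payment dispute",
    "credit report contains information over seven years old",
    "escrow shortage on my mortgage statement",
]
y_train = ["Credit reporting", "Mortgage", "Credit reporting", "Mortgage"]

# Chain vectorization, TF-IDF weighting, and the classifier into one object
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(X_train, y_train)

# One call handles counting, weighting, and prediction for new text
print(pipeline.predict(["lien still on my credit report"]))
```

This avoids having to keep `count_vect` and `tfidf_transformer` around and apply them in the right order at inference time.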

 

5.1 Conclusion: the predicted category matches the true label, so the classifier meets our expectations.

 

6. Model Selection

We are now ready to try different machine learning models, evaluate their accuracy, and identify potential problems.

We benchmark the following four models:

  • Logistic Regression
  • (Multinomial) Naive Bayes
  • Linear Support Vector Machine
  • Random Forest
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
import seaborn as sns
sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df, 
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()

 

 

Check the mean accuracy per model:

cv_df.groupby('model_name').accuracy.mean()
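For a fuller picture than the mean alone, the spread of the cross-validation folds can be summarized as well. A small sketch using pandas aggregation on a toy stand-in for the `cv_df` built above:

```python
import pandas as pd

# Toy stand-in for the cv_df built in section 6
# (columns: model_name, fold_idx, accuracy)
cv_df = pd.DataFrame({
    'model_name': ['LinearSVC'] * 3 + ['MultinomialNB'] * 3,
    'fold_idx': [0, 1, 2] * 2,
    'accuracy': [0.80, 0.82, 0.81, 0.70, 0.72, 0.71],
})

# Mean and standard deviation of cross-validation accuracy per model
summary = cv_df.groupby('model_name').accuracy.agg(['mean', 'std'])
print(summary)
```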

 

 

LinearSVC and Logistic Regression seem to have a slight edge.

 

7. Model Evaluation

Continuing with our best model (LinearSVC), we look at the confusion matrix and examine the discrepancies between predicted and actual labels.

model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    features, labels, df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=category_id_df.Product.values,
            yticklabels=category_id_df.Product.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()


The vast majority of predictions fall on the diagonal (predicted label = actual label), which is what we hoped for. There are some misclassifications, however; let's see what caused them:

from IPython.display import display
for predicted in category_id_df.category_id:
    for actual in category_id_df.category_id:
        if predicted != actual and conf_mat[actual, predicted] >= 6:
            print("'{}' predicted as '{}' : {} examples.".format(id_to_category[actual], id_to_category[predicted], conf_mat[actual, predicted]))
            display(df.loc[indices_test[(y_test == actual) & (y_pred == predicted)]][['Product', 'Consumer_complaint_narrative']])
            print('')


8. Top Terms per Category

model.fit(features, labels)
N = 2
for Product, category_id in sorted(category_to_id.items()):
    # Sort features by this category's LinearSVC weight (ascending)
    indices = np.argsort(model.coef_[category_id])
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 1][:N]
    bigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 2][:N]
    print("# '{}':".format(Product))
    print("  . Top unigrams:\n       . {}".format('\n       . '.join(unigrams)))
    print("  . Top bigrams:\n       . {}".format('\n       . '.join(bigrams)))

Using the weights of the fitted LinearSVC (not the chi-squared scores from section 4), we list the terms most indicative of each category:

# 'Bank account or service':
  . Top unigrams:
       . scottrade
       . bank
  . Top bigrams:
       . funds released
       . xxxx 15
# 'Checking or savings account':
  . Top unigrams:
       . bank
       . chase
  . Top bigrams:
       . savings account
       . checking account
# 'Consumer Loan':
  . Top unigrams:
       . car
       . repossessed
  . Top bigrams:
       . loan santander
       . xxxx points
# 'Credit card':
  . Top unigrams:
       . card
       . macy
  . Top bigrams:
       . synchrony bank
       . credit card
# 'Credit card or prepaid card':
  . Top unigrams:
       . card
       . capital
  . Top bigrams:
       . late fee
       . balance transfer
# 'Credit reporting':
  . Top unigrams:
       . experian
       . equifax
  . Top bigrams:
       . xxxx bureaus
       . xxxx bank
# 'Credit reporting, credit repair services, or other personal consumer reports':
  . Top unigrams:
       . report
       . freeze
  . Top bigrams:
       . xxxx xxxx
       . xxxx 2017
# 'Debt collection':
  . Top unigrams:
       . debt
       . collection
  . Top bigrams:
       . card debt
       . account victim
# 'Money transfer, virtual currency, or money service':
  . Top unigrams:
       . coinbase
       . paypal
  . Top bigrams:
       . paypal account
       . pay overdraft
# 'Money transfers':
  . Top unigrams:
       . money
       . google
  . Top bigrams:
       . western union
       . order cancelled
# 'Mortgage':
  . Top unigrams:
       . mortgage
       . escrow
  . Top bigrams:
       . green tree
       . escrow account
# 'Other financial service':
  . Top unigrams:
       . wellsfargo
       . term
  . Top bigrams:
       . money order
       . 10 days
# 'Payday loan':
  . Top unigrams:
       . loan
       . cash
  . Top bigrams:
       . copy attached
       . pay day
# 'Payday loan, title loan, or personal loan':
  . Top unigrams:
       . loan
       . pls
  . Top bigrams:
       . 00 loan
       . line credit
# 'Prepaid card':
  . Top unigrams:
       . rush
       . prepaid
  . Top bigrams:
       . rush card
       . pre paid
# 'Student loan':
  . Top unigrams:
       . navient
       . loans
  . Top bigrams:
       . sallie mae
       . great lakes
# 'Vehicle loan or lease':
  . Top unigrams:
       . ally
       . car
  . Top bigrams:
       . response given
       . payment sent

 

8.1 Classification Report
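The report screenshots from the original post did not survive. A minimal sketch of how such a report can be produced with `sklearn.metrics.classification_report`; the label lists below are toy stand-ins for the hold-out `y_test` and `y_pred` from section 7:

```python
from sklearn.metrics import classification_report

# Toy stand-in for the section-7 hold-out labels and LinearSVC predictions
y_test = ["Mortgage", "Credit reporting", "Mortgage", "Debt collection"]
y_pred = ["Mortgage", "Credit reporting", "Credit reporting", "Debt collection"]

# Per-class precision, recall, and F1, plus overall averages
print(classification_report(y_test, y_pred))
```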
Code on GitHub: https://github.com/harpyestar/sklearn_classify/blob/master/achieve_classify_task.ipynb

posted @ 2019-12-12 12:02 Harp_Yestar