[Machine Learning Classification] Multi-Class Classification Task
I recently came across a nice classification task built on the Scikit-Learn toolkit. I am posting the original article here, along with the results of my own run-through.
Original article: https://cloud.tencent.com/developer/article/1097919
The dataset can be downloaded from data.gov: https://catalog.data.gov/dataset/consumer-complaint-database
The dataset contains consumer complaints about banks.
Problem Description
This is a supervised text classification problem, and our goal is to investigate which supervised machine learning methods are best suited to solving it.
Given a complaint, we want to assign it to one of 12 categories. The classifier assumes that each new complaint is assigned to one and only one category. This is a multi-class text classification problem. I can't wait to see what we can do!
1. First, take a look at the dataset
import pandas as pd
df = pd.read_csv('Consumer_Complaints.csv')
# The sparse matrices built later cannot handle very large dimensions, so slice the data here to keep them small
df = df.iloc[:40000, :]
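As a side note, pandas can also restrict the load itself rather than slicing afterwards; a minimal sketch using an in-memory stand-in for Consumer_Complaints.csv (the toy rows are made up):

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for a slice of Consumer_Complaints.csv
csv_data = StringIO(
    "Product,Consumer complaint narrative,Other column\n"
    "Mortgage,Late fees were charged,x\n"
    "Credit card,Wrong balance reported,y\n"
    "Mortgage,Escrow account issue,z\n"
)

# usecols loads only the needed columns and nrows caps the row count,
# so the full 18-column file never materialises in memory
sample_df = pd.read_csv(csv_data,
                        usecols=['Product', 'Consumer complaint narrative'],
                        nrows=2)
print(sample_df.shape)  # (2, 2)
```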

For this project we only need two columns: "Product" and "Consumer complaint narrative". The "Consumer complaint narrative" is our input, and "Product" is the output, i.e. the category of the input.
- Input: Consumer_complaint_narrative (each consumer complaint narrative is one document). Example: "I have outdated information on my credit report that I previously disputed; that information is more than seven years old and has not been removed, which does not comply with credit reporting requirements."
- Output: Product (the category of the input). Example: Credit reporting
We will drop the missing values in the "Consumer complaint narrative" column and add a column that encodes the product as an integer, because categorical variables are often better represented as integers than as strings.
We also create a couple of dictionaries for later use.
After cleaning, we can display the first five rows of data:
2. Data Preprocessing
2.1 Remove NaN values from *Consumer complaint narrative*
df = df[pd.notnull(df['Consumer complaint narrative'])]
df.shape
--> (12633, 18)
The output has 12633 rows; the original 40000 rows shrink to 12633.
# Inspect all the column information
df.info()

2.2 Rename the columns
col = ['Product', 'Consumer complaint narrative']
df = df[col]
df.head()
df.columns = ['Product', 'Consumer_complaint_narrative']
2.3 Add a category_id column to df and build forward and reverse lookup dictionaries
df['category_id'] = df['Product'].factorize()[0]
category_id_df = df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Product']].values)
df.head()
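The factorize-and-dictionary pattern can be illustrated on a toy frame (hypothetical product names):

```python
import pandas as pd

# Toy frame with made-up products and placeholder narratives
toy_df = pd.DataFrame({
    'Product': ['Mortgage', 'Credit card', 'Mortgage', 'Debt collection'],
    'Consumer_complaint_narrative': ['a', 'b', 'c', 'd'],
})

# factorize assigns an integer id to each distinct product, in order of first appearance
toy_df['category_id'] = toy_df['Product'].factorize()[0]

category_id_df = toy_df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)                              # name -> id
id_to_category = dict(category_id_df[['category_id', 'Product']].values)  # id -> name

print(toy_df['category_id'].tolist())  # [0, 1, 0, 2]
print(id_to_category[2])               # Debt collection
```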

category_id_df.head()



2.4 View statistics of the class distribution
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,6))
df.groupby('Product').Consumer_complaint_narrative.count().plot.bar(ylim=0)
plt.show()

3. Bag-of-Words: Converting to a TF-IDF Representation
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1',
                        ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.Consumer_complaint_narrative)
labels = df.category_id
features.shape
Check the feature shape: (12633, 33769), i.e. 33769 features for 12633 documents.
You can inspect the resulting features (unigrams and bigrams):
tfidf.get_feature_names_out()


Then convert the resulting features to a dense array:
features = features.toarray()
features

4. Feature Selection
from sklearn.feature_selection import chi2
import numpy as np
N = 2
for Product, category_id in sorted(category_to_id.items()):
    # Chi-squared feature selection, run one-vs-rest for each target category
    features_chi2 = chi2(features, labels == category_id)
    # argsort returns the indices ordered from smallest to largest score
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(Product))
    print(" . Most correlated unigrams:\n . {}".format('\n . '.join(unigrams[-N:])))
    print(" . Most correlated bigrams:\n . {}".format('\n . '.join(bigrams[-N:])))
# This yields the most correlated terms for each category


5. Trying Naive Bayes Classification
- To train a supervised classifier, we first transformed each document into a numeric vector. We explored vector representations such as TF-IDF weighted vectors.
- With this vector representation of the text, we can train a supervised classifier to predict the category ("Product") of an unseen document ("a consumer complaint narrative").
After the data transformations above, we now have the features and category labels for all documents, so we can train the classifier. Many algorithms can be used for this kind of problem.
- Naive Bayes classifier: the variant most suitable for word counts is the multinomial one (MultinomialNB)
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(df['Consumer_complaint_narrative'], df['Product'], random_state=0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Train the model
clf = MultinomialNB().fit(X_train_tfidf, y_train)
# Then run inference on an input text
print(clf.predict(tfidf_transformer.transform(count_vect.transform(["I have a paid and satisfied Ga tax state lien that has been released on XX/XX/2017 from the XXXX XXXX XXXX XXXX. I have submitted the information over to all credit bureaus, XXXX, XXXX, and Equifax. All bureaus besides Equifax have released and removed the lien from my credit report. I have called Equifax on several occasions to se reason on why they wont remove this lien from my report."]))))
# Print the predicted category

# Look up the category of this text in the original data
df[df['Consumer_complaint_narrative'] == "I have a paid and satisfied Ga tax state lien that has been released on XX/XX/2017 from the XXXX XXXX XXXX XXXX. I have submitted the information over to all credit bureaus, XXXX, XXXX, and Equifax. All bureaus besides Equifax have released and removed the lien from my credit report. I have called Equifax on several occasions to se reason on why they wont remove this lien from my report."]
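The CountVectorizer, TfidfTransformer, MultinomialNB chain can also be wrapped in a scikit-learn Pipeline, so prediction applies the same transforms automatically; a sketch on toy training data (texts and labels are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

X_train = [
    'late fees charged on my mortgage',
    'mortgage escrow was miscalculated',
    'wrong balance on my credit card',
    'credit card charges I did not make',
]
y_train = ['Mortgage', 'Mortgage', 'Credit card', 'Credit card']

pipe = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
pipe.fit(X_train, y_train)

# The pipeline applies both transforms before predicting
print(pipe.predict(['problem with my mortgage escrow account']))  # ['Mortgage']
```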

5.1 Conclusion: the predicted category matches the actual one, so the classification meets the requirement
6. Model Selection
We are now ready to experiment with different machine learning models, evaluate their accuracy, and identify any potential issues.
We use the following four models as a benchmark:
- Logistic Regression
- (Multinomial) Naive Bayes
- Linear Support Vector Machine
- Random Forest
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
import seaborn as sns
sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df,
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()

# Check the mean accuracy per model
cv_df.groupby('model_name').accuracy.mean()

LinearSVC and Logistic Regression seem to have a slight edge.
7. Model Evaluation
Continuing with our best model (LinearSVC), we look at the confusion matrix and show the discrepancies between predicted and actual labels.
model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=category_id_df.Product.values,
            yticklabels=category_id_df.Product.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

The vast majority of predictions lie on the diagonal (predicted label = actual label), which is what we hope for. However, there are some misclassifications; let us look at what caused them:
from IPython.display import display
for predicted in category_id_df.category_id:
    for actual in category_id_df.category_id:
        if predicted != actual and conf_mat[actual, predicted] >= 6:
            print("'{}' predicted as '{}' : {} examples.".format(id_to_category[actual], id_to_category[predicted], conf_mat[actual, predicted]))
            display(df.loc[indices_test[(y_test == actual) & (y_pred == predicted)]][['Product', 'Consumer_complaint_narrative']])
            print('')


8. Looking Up the Top Terms per Category
model.fit(features, labels)
N = 2
for Product, category_id in sorted(category_to_id.items()):
    # Sort features by the model's coefficient for this category
    indices = np.argsort(model.coef_[category_id])
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 1][:N]
    bigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 2][:N]
    print("# '{}':".format(Product))
    print(" . Top unigrams:\n . {}".format('\n . '.join(unigrams)))
    print(" . Top bigrams:\n . {}".format('\n . '.join(bigrams)))
Using the fitted model's coefficients, we can view the most indicative terms for each category:
# 'Bank account or service':
. Top unigrams:
. scottrade
. bank
. Top bigrams:
. funds released
. xxxx 15
# 'Checking or savings account':
. Top unigrams:
. bank
. chase
. Top bigrams:
. savings account
. checking account
# 'Consumer Loan':
. Top unigrams:
. car
. repossessed
. Top bigrams:
. loan santander
. xxxx points
# 'Credit card':
. Top unigrams:
. card
. macy
. Top bigrams:
. synchrony bank
. credit card
# 'Credit card or prepaid card':
. Top unigrams:
. card
. capital
. Top bigrams:
. late fee
. balance transfer
# 'Credit reporting':
. Top unigrams:
. experian
. equifax
. Top bigrams:
. xxxx bureaus
. xxxx bank
# 'Credit reporting, credit repair services, or other personal consumer reports':
. Top unigrams:
. report
. freeze
. Top bigrams:
. xxxx xxxx
. xxxx 2017
# 'Debt collection':
. Top unigrams:
. debt
. collection
. Top bigrams:
. card debt
. account victim
# 'Money transfer, virtual currency, or money service':
. Top unigrams:
. coinbase
. paypal
. Top bigrams:
. paypal account
. pay overdraft
# 'Money transfers':
. Top unigrams:
. money
. google
. Top bigrams:
. western union
. order cancelled
# 'Mortgage':
. Top unigrams:
. mortgage
. escrow
. Top bigrams:
. green tree
. escrow account
# 'Other financial service':
. Top unigrams:
. wellsfargo
. term
. Top bigrams:
. money order
. 10 days
# 'Payday loan':
. Top unigrams:
. loan
. cash
. Top bigrams:
. copy attached
. pay day
# 'Payday loan, title loan, or personal loan':
. Top unigrams:
. loan
. pls
. Top bigrams:
. 00 loan
. line credit
# 'Prepaid card':
. Top unigrams:
. rush
. prepaid
. Top bigrams:
. rush card
. pre paid
# 'Student loan':
. Top unigrams:
. navient
. loans
. Top bigrams:
. sallie mae
. great lakes
# 'Vehicle loan or lease':
. Top unigrams:
. ally
. car
. Top bigrams:
. response given
. payment sent
8.1 Classification Report
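The report for this subsection can be generated with sklearn's classification_report; a minimal sketch on toy labels standing in for the y_test and y_pred arrays from section 7:

```python
from sklearn.metrics import classification_report

# Toy stand-ins for the y_test / y_pred from section 7
y_test = ['Mortgage', 'Mortgage', 'Credit card', 'Debt collection']
y_pred = ['Mortgage', 'Credit card', 'Credit card', 'Debt collection']

# Per-class precision, recall, F1 and support, plus overall averages
print(classification_report(y_test, y_pred))
```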

Code link:
Github code: https://github.com/harpyestar/sklearn_classify/blob/master/achieve_classify_task.ipynb
