Text Classification with Python

**Text classification with Python can be used to filter spam text.**

Overall workflow:

  1. Sample the data
  2. Manually label the spam messages in the sampled texts
  3. Build a model on the labeled samples
  4. Evaluate the model
  5. Predict labels for new texts

References:

  http://scikit-learn.org/stable/user_guide.html
  Natural Language Processing with Python (the NLTK book; Chinese translation: PYTHON自然语言处理中文翻译)

Main implementation steps:

  1. Word segmentation
  2. Feature word extraction
  3. Build the word-document matrix
  4. Combine the features with the class label
  5. Build the model
  6. Evaluate
  7. Predict on new texts


    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    # Example: spam text classification
    # import MySQLdb   # not used below; the data is read from a CSV file
    import pandas as pd
    import numpy as np
    import jieba
    import jieba.analyse
    import jieba.posseg as pseg
    import nltk
    from sklearn.model_selection import train_test_split
    
    #1. Read the data; 'type' is the class label (0/1)
    df = pd.read_csv(r'F:\csv_test.csv', names=['id', 'cont', 'type'])
    
    #2. Keyword extraction
    cont = df['cont']
    tagall = []
    for t in cont:
        # topK keywords per document; 20 is jieba's default and only an example value
        tags = jieba.analyse.extract_tags(t, topK=20)
        tagall.extend(tags)           # flatten into one list of keywords
    fdist = nltk.FreqDist(tagall)     # keyword frequencies over the whole corpus
    fea_words = [w for w, _ in fdist.most_common(100)]   # top-100 keywords as feature words
    
    #3. Build the word-document features (one dict of word indicators per document)
    def word_features(content, top_words):
        word_set = set(content)
        features = {}
        for w in top_words:
            features["w_%s" % w] = (w in word_set)
        return features
    
    #4. Combine the feature dicts with the class label
    def data_feature(df, fea_words):
        data_set = []
        cont = df['cont']
        for i in range(len(cont)):
            content = jieba.cut(cont[i])      # segment the i-th document
            feat = word_features(content, fea_words)
            category = df.loc[i, 'type']
            tup = (feat, category)
            data_set.append(tup)
        return data_set
    
    data_list = data_feature(df, fea_words)
    #5. Build the classification model
    # Split into a training set and a test set
    train_set, test_set = train_test_split(data_list, test_size=0.5)
    # Naive Bayes model
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    # Alternative model: decision tree (uncomment to use it instead)
    # classifier = nltk.DecisionTreeClassifier.train(train_set)
    
    #6. Evaluate the model's accuracy on the test set
    print(nltk.classify.accuracy(classifier, test_set))
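
    # The Naive Bayes model can also report the most informative feature words;
    # this call is specific to nltk.NaiveBayesClassifier (not the decision tree).
    classifier.show_most_informative_features(10)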
    
    #7. Output predictions for new texts
    # new_data is assumed to be a DataFrame with the same columns as df
    # ('id', 'cont', 'type'); the 'type' values are ignored when predicting.
    pre_set = data_feature(new_data, fea_words)
    pre_result = []
    for item in pre_set:
        result = classifier.classify(item[0])   # item is a (features, label) tuple
        pre_result.append(result)
    # Distribution of the predicted labels
    pre_tab = set(pre_result)
    for p in pre_tab:
        print(p, pre_result.count(p))
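
    # Optionally attach the predictions to the new data and write them out;
    # the output path below is only a placeholder.
    new_data['pred'] = pre_result
    new_data.to_csv(r'F:\pred_result.csv', index=False)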
    
A few notes on the example: the feature-word extraction in step 2 can be done with several different methods, steps 3 and 4 can be reworked to improve performance, and the modeling in step 5 can use additional classifiers such as logistic regression or SVM. Two sketches of these alternatives follow.
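
For the feature-word extraction, jieba also ships a TextRank-based extractor alongside the TF-IDF weighting used above. A minimal sketch (the sample sentence and topK value are only placeholders):

    import jieba.analyse
    
    text = u"这是一条用来演示关键词提取的示例文本"   # placeholder document
    # Default extractor: TF-IDF-weighted keywords
    print(jieba.analyse.extract_tags(text, topK=10))
    # Alternative extractor: TextRank
    print(jieba.analyse.textrank(text, topK=10))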
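
For the word-document matrix and the model itself, scikit-learn (already referenced above) can replace the hand-rolled feature dicts. The sketch below is only an illustration under the same assumptions as the example (a DataFrame with 'cont' and 'type' columns, jieba for segmentation); LinearSVC is shown as a drop-in SVM alternative to logistic regression.

    import jieba
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    
    df = pd.read_csv(r'F:\csv_test.csv', names=['id', 'cont', 'type'])
    
    # Segment each document with jieba and re-join with spaces so the
    # vectorizer can split on whitespace
    docs = [' '.join(jieba.cut(t)) for t in df['cont']]
    
    # Steps 3-4: TF-IDF-weighted word-document matrix plus the class label
    vectorizer = TfidfVectorizer(max_features=1000)
    X = vectorizer.fit_transform(docs)
    y = df['type']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
    
    # Step 5: logistic regression; LinearSVC() is a drop-in SVM alternative
    model = LogisticRegression()
    # model = LinearSVC()
    model.fit(X_train, y_train)
    
    # Step 6: accuracy on the held-out test set
    print(accuracy_score(y_test, model.predict(X_test)))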

  
  

