Sentiment Analysis with ML.NET [Beginner's Edition]

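After publishing 《.NET Core玩转机器学习》 (Playing with Machine Learning on .NET Core) and 《使用ML.NET预测纽约出租车费》 (Predicting New York Taxi Fares with ML.NET), I trust that readers, even without fully understanding what was going on, were able to follow along, run the code, and get results, and so gained a first impression of .NET Core, ML.NET, and what machine learning can do. With that experience in hand, it is time to step back and summarize. This article again builds on a concrete case, sentiment analysis, and takes the perspective of a .NET developer who is new to machine learning, focusing on the basic concepts and steps for getting started with ML.NET.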

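When we realize that a real-world problem is beyond what traditional pattern matching can handle, we turn to simulation: first reproduce the facts that have already occurred as closely as possible (usually called fitting), then reuse that stable simulation process (usually called a model) to estimate, for conditions yet to come, the probability that the same outcome will or will not occur. That is where machine learning is most useful. It is also why machine learning usually depends on large amounts of data: with too little historical data, the simulation reproduces the facts poorly, and the error of the estimates grows accordingly. So besides caring about the accuracy and completeness of the data, we also need to learn to build up its volume.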

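To solve a problem with machine learning, you generally go through the following steps:

1. Describe the scenario in which the problem arises

2. Collect data for the specific scenario

3. Preprocess the data

4. Choose a model (algorithm) and train it

5. Validate and tune the trained model

6. Use the model for prediction

I will walk through each of them with the case study.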

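Describe the scenario in which the problem arises

For sentiment analysis, I assume the simplest possible scenario, a single sentence: when we read a sentence, certain words let us judge whether it expresses a positive or a negative attitude. For example, "My program passed its tests!" is positive, while "The performance of this function is really worrying" is negative. So by identifying the words, we can indirectly infer the emotional reaction of the person who wrote the sentence. (To keep this case easy to understand, factors such as sentence segmentation, stress, and punctuation are ignored for now.)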
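
To make the "judge by specific words" intuition concrete, here is a minimal hand-written sketch. It is not how ML.NET works: the class name `KeywordSentiment` and both word lists are hypothetical, chosen only for illustration. A trained model effectively learns such word weights from labeled data instead of relying on a fixed list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordSentiment
{
    // Hypothetical word lists, chosen only for this illustration.
    static readonly HashSet<string> PositiveWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "wonderful", "great", "best", "passed" };
    static readonly HashSet<string> NegativeWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "bad", "poor", "slow", "wasted" };

    // Returns true when positive words strictly outnumber negative words.
    public static bool IsPositive(string sentence)
    {
        var words = sentence.Split(new[] { ' ', ',', '.', '!', '?' },
                                   StringSplitOptions.RemoveEmptyEntries);
        int score = words.Count(w => PositiveWords.Contains(w))
                  - words.Count(w => NegativeWords.Contains(w));
        return score > 0;
    }

    public static void Main()
    {
        Console.WriteLine(IsPositive("Joe versus the Volcano Coffee Company is a great film.")); // True
        Console.WriteLine(IsPositive("The acting in this movie is very bad"));                   // False
    }
}
```

A fixed list like this breaks down quickly on words it has never seen, which is one reason to learn from data instead.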

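Collect data for the specific scenario

To test the idea above, we first need to collect some useful data. This is in fact where many developers get stuck: apart from crawlers and the historical data in their own systems, they often cannot think of where else to get data at short notice. Many universities, institutions, and even governments publish open datasets on the internet. I recommend two sources of relatively high-quality datasets:

UC Irvine Machine Learning Repository, from the University of California, Irvine

kaggle.com, a well-known data science and machine learning competition site

This time I found a dataset on UCI in which each line holds exactly one sentence plus one label, and the label already marks each sentence as positive or negative. It can be downloaded from the Sentiment Labelled Sentences Data Set page. The format looks like this: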

A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  	0
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  	0
Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.  	0
Very little music or anything to speak of.  	0
The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head.  	1
The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty.  	0
Wasted two hours.  	0
...

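Looking at each line, there are two Tab-separated fields. The first field is the sentence, which we generally call the feature. The second field is a number, 0 for negative and 1 for positive, which we generally call the target or label. Label values are usually annotated by hand; without them, this kind of machine learning, which fits historical data, cannot be used. A high-quality dataset therefore places high demands on manual labeling, which should be as accurate as possible.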
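
The two-field layout can be checked by hand before any library is involved. The following is a minimal sketch, assuming each line is a sentence, a Tab, and a 0/1 label as in the sample above; the class name `LabelledLineParser` is hypothetical, and ML.NET's TextLoader does this work for you in the real pipeline.

```csharp
using System;
using System.Globalization;

public static class LabelledLineParser
{
    // Splits one "sentence<TAB>label" line at its last Tab character.
    public static (string Text, float Label) Parse(string line)
    {
        int tab = line.LastIndexOf('\t');
        if (tab < 0)
            throw new FormatException("Expected a Tab between sentence and label.");

        // The sample lines carry trailing spaces before the Tab, so trim both parts.
        string text = line.Substring(0, tab).Trim();
        float label = float.Parse(line.Substring(tab + 1).Trim(), CultureInfo.InvariantCulture);
        return (text, label);
    }

    public static void Main()
    {
        var row = Parse("Wasted two hours.  \t0");
        Console.WriteLine($"{row.Text} => {row.Label}"); // Wasted two hours. => 0
    }
}
```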

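Preprocess the data

For the steps to create the project, see the two articles mentioned at the beginning; I will not repeat them here. Getting straight to the point: ML.NET's data handling and the training flow that follows are generic, a design that also leaves room for extending to other third-party machine learning packages later. First look at the format of the dataset and create a structure that matches it, which makes the import easier. The LearningPipeline class is the object dedicated to defining the machine learning process, so we create it next. The code is as follows: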

const string _dataPath = @".\data\sentiment labelled sentences\imdb_labelled.txt";
const string _testDataPath = @".\data\sentiment labelled sentences\yelp_labelled.txt";

public class SentimentData
{
    [Column(ordinal: "0")]
    public string SentimentText;
    [Column(ordinal: "1", name: "Label")]
    public float Sentiment;
}

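// The pipeline collects the loading, featurization, and training steps in order.
var pipeline = new LearningPipeline();
// Load the Tab-separated training file; the file has no header row.
pipeline.Add(new TextLoader<SentimentData>(_dataPath, useHeader: false, separator: "tab"));
// Turn the SentimentText column into a numeric vector stored in the "Features" column.
pipeline.Add(new TextFeaturizer("Features", "SentimentText"));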

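SentimentData is the data structure I use for importing. As you can see, the Column attribute indicates the position of the corresponding column in the dataset. In addition, the last column, the field that says positive or negative, is marked as the target value and given the name Label. TextLoader is the class dedicated to importing text data. TextFeaturizer specifies the feature: it converts the SentimentText column into a numeric vector in the Features column. Not every field in a row is suitable as a feature, so when there are many fields you can name the relevant ones here, so that unrelated fields do not affect the result.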
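
To give a feel for what "turning text into features" means, here is a deliberately simplified bag-of-words sketch: each sentence becomes a vector of word counts over a fixed vocabulary. This is an assumption-laden illustration, not a description of what TextFeaturizer does internally (which is richer than plain counting); the class name `BagOfWords` and the tiny corpus are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BagOfWords
{
    static readonly char[] Separators = { ' ', ',', '.', '!', '?' };

    // Lowercases the sentence and splits it into words.
    static string[] Tokenize(string sentence) =>
        sentence.ToLowerInvariant().Split(Separators, StringSplitOptions.RemoveEmptyEntries);

    // Builds a sorted vocabulary from every distinct word in the corpus.
    public static List<string> BuildVocabulary(IEnumerable<string> sentences) =>
        sentences.SelectMany(s => Tokenize(s))
                 .Distinct()
                 .OrderBy(w => w, StringComparer.Ordinal)
                 .ToList();

    // Counts how often each vocabulary word occurs in one sentence.
    public static float[] Vectorize(string sentence, List<string> vocabulary)
    {
        var vector = new float[vocabulary.Count];
        foreach (var word in Tokenize(sentence))
        {
            int index = vocabulary.IndexOf(word);
            if (index >= 0) vector[index] += 1f; // words outside the vocabulary are ignored
        }
        return vector;
    }

    public static void Main()
    {
        var vocabulary = BuildVocabulary(new[] { "good movie", "bad movie" });
        Console.WriteLine(string.Join(", ", vocabulary));                                 // bad, good, movie
        Console.WriteLine(string.Join(", ", Vectorize("Good, good movie!", vocabulary))); // 0, 2, 1
    }
}
```

Once every sentence is a vector of the same length, a classifier can treat each position as one numeric input.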

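Choose a model (algorithm) and train it

The target in this case is a 0/1 value, in other words this is exactly a binary classification problem, so I chose the FastTreeBinaryClassifier class as the model. Readers who know a little machine learning will have heard of logistic regression, which serves roughly the same purpose. When defining the model you also have to specify a structure for predictions, so that the model emits its output in that shape. This output structure should generally contain at least the target field. The code snippet is as follows: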

public class SentimentPrediction
{
    [ColumnName("PredictedLabel")]
    public bool Sentiment;
}

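// A deliberately small tree ensemble: 5 trees, 5 leaves each, at least 2 samples per leaf.
pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

// Train() runs every step of the pipeline and returns the fitted model.
PredictionModel<SentimentData, SentimentPrediction> model = pipeline.Train<SentimentData, SentimentPrediction>();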

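Validate and tune the trained model

Once we have the model, we need to validate it against a test dataset to see whether the fit meets expectations. BinaryClassificationEvaluator is the evaluation class that goes with FastTreeBinaryClassifier, and the results are stored in a BinaryClassificationMetrics object. The code snippet is as follows: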

// The Yelp reviews serve as a separate test set in the same two-column format.
var testData = new TextLoader<SentimentData>(_testDataPath, useHeader: false, separator: "tab");
var evaluator = new BinaryClassificationEvaluator();
BinaryClassificationMetrics metrics = evaluator.Evaluate(model, testData);
Console.WriteLine();
Console.WriteLine("PredictionModel quality metrics evaluation");
Console.WriteLine("------------------------------------------");
Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
Console.WriteLine($"Auc: {metrics.Auc:P2}");
Console.WriteLine($"F1Score: {metrics.F1Score:P2}");

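Accuracy, Auc, and F1Score are common evaluation metrics, scores that reflect how often the model is right and how it errs. If the scores are very low, adjust the parameter values used when defining the model in the previous step. For detailed explanations, see: Machine learning glossary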
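
To show where two of these numbers come from, the sketch below computes accuracy and F1 score (via precision and recall) from a list of actual and predicted labels using their standard definitions. The class name `BinaryMetrics` and the five sample labels are hypothetical; AUC is left out because it needs the model's scores rather than just the final labels. For the sample data the counts are 2 true positives, 1 true negative, 1 false positive, and 1 false negative, so accuracy is 3/5 and precision, recall, and F1 are all 2/3.

```csharp
using System;

public static class BinaryMetrics
{
    public static (double Accuracy, double Precision, double Recall, double F1) Compute(
        bool[] actual, bool[] predicted)
    {
        if (actual.Length == 0 || actual.Length != predicted.Length)
            throw new ArgumentException("Both arrays must be non-empty and of equal length.");

        // Tally the four cells of the confusion matrix.
        int tp = 0, tn = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.Length; i++)
        {
            if (actual[i] && predicted[i]) tp++;
            else if (!actual[i] && !predicted[i]) tn++;
            else if (!actual[i] && predicted[i]) fp++;
            else fn++;
        }

        double accuracy = (double)(tp + tn) / actual.Length;
        // Guard against division by zero when a class is never predicted or never present.
        double precision = tp + fp == 0 ? 0 : (double)tp / (tp + fp);
        double recall = tp + fn == 0 ? 0 : (double)tp / (tp + fn);
        double f1 = precision + recall == 0 ? 0 : 2 * precision * recall / (precision + recall);
        return (accuracy, precision, recall, f1);
    }

    public static void Main()
    {
        var actual = new[] { true, true, false, false, true };
        var predicted = new[] { true, false, false, true, true };
        var m = Compute(actual, predicted);
        Console.WriteLine($"Accuracy: {m.Accuracy:F4}, Precision: {m.Precision:F4}, Recall: {m.Recall:F4}, F1: {m.F1:F4}");
    }
}
```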

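Use the model for prediction

Once you have trained a model you are happy with, you can put it to real use. In essence, you take some data that has no manually labeled result and let the model analyze it and return its prediction of the target value, which in this example is the PredictedLabel, positive or negative. The code snippet is as follows: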

IEnumerable<SentimentData> sentiments = new[]
{
    new SentimentData
    {
        SentimentText = "Contoso's 11 is a wonderful experience",
        Sentiment = 0 // placeholder; the true label is unknown at prediction time
    },
    new SentimentData
    {
        SentimentText = "The acting in this movie is very bad",
        Sentiment = 0
    },
    new SentimentData
    {
        SentimentText = "Joe versus the Volcano Coffee Company is a great film.",
        Sentiment = 0
    }
};
IEnumerable<SentimentPrediction> predictions = model.Predict(sentiments);
Console.WriteLine();
Console.WriteLine("Sentiment Predictions");
Console.WriteLine("---------------------");

var sentimentsAndPredictions = sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));
foreach (var item in sentimentsAndPredictions)
{
    Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
}

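The output shows that the classifications match what a human would judge. The scores in the validation stage are not high, which is quite normal: without any tuning, some neutral or ambiguous sentences interfere with the predictions.

With that, new sentences can be classified automatically by the program. Simple, isn't it? I hope this article makes .NET developers eager to try ML.NET.

By the way, Microsoft Azure also has an online machine learning studio at https://studio.azureml.net/, with a gallery of related AI projects at https://gallery.azure.ai/browse. If you cannot set up a local machine learning environment for now, or cannot find a project to practice on, give it a try.

Finally, here is the complete code of the project: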

using System;
using Microsoft.ML.Models;
using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

namespace SentimentAnalysis
{
    class Program
    {
        const string _dataPath = @".\data\sentiment labelled sentences\imdb_labelled.txt";
        const string _testDataPath = @".\data\sentiment labelled sentences\yelp_labelled.txt";

        public class SentimentData
        {
            [Column(ordinal: "0")]
            public string SentimentText;
            [Column(ordinal: "1", name: "Label")]
            public float Sentiment;
        }

        public class SentimentPrediction
        {
            [ColumnName("PredictedLabel")]
            public bool Sentiment;
        }

        public static PredictionModel<SentimentData, SentimentPrediction> Train()
        {
            var pipeline = new LearningPipeline();
            pipeline.Add(new TextLoader<SentimentData>(_dataPath, useHeader: false, separator: "tab"));
            pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
            pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

            PredictionModel<SentimentData, SentimentPrediction> model = pipeline.Train<SentimentData, SentimentPrediction>();
            return model;
        }

        public static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model)
        {
            var testData = new TextLoader<SentimentData>(_testDataPath, useHeader: false, separator: "tab");
            var evaluator = new BinaryClassificationEvaluator();
            BinaryClassificationMetrics metrics = evaluator.Evaluate(model, testData);
            Console.WriteLine();
            Console.WriteLine("PredictionModel quality metrics evaluation");
            Console.WriteLine("------------------------------------------");
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.Auc:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
        }

        public static void Predict(PredictionModel<SentimentData, SentimentPrediction> model)
        {
            IEnumerable<SentimentData> sentiments = new[]
            {
                new SentimentData
                {
                    SentimentText = "Contoso's 11 is a wonderful experience",
                    Sentiment = 0
                },
                new SentimentData
                {
                    SentimentText = "The acting in this movie is very bad",
                    Sentiment = 0
                },
                new SentimentData
                {
                    SentimentText = "Joe versus the Volcano Coffee Company is a great film.",
                    Sentiment = 0
                }
            };
            IEnumerable<SentimentPrediction> predictions = model.Predict(sentiments);
            Console.WriteLine();
            Console.WriteLine("Sentiment Predictions");
            Console.WriteLine("---------------------");

            var sentimentsAndPredictions = sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));
            foreach (var item in sentimentsAndPredictions)
            {
                Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
            }
            Console.WriteLine();
        }

        static void Main(string[] args)
        {
            var model = Train();
            Evaluate(model);
            Predict(model);
        }
    }
}

posted on 2018-05-10 23:28 Bean.Hsiang