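Understanding the problem

The problem this tutorial addresses is how to take appropriate action based on the opinions expressed in a website's comments.

In the available training dataset, each comment is labeled as either toxic (1) or not toxic (0). A machine learning classification task is the best fit for this scenario.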

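A classification task assigns data to a category, type, or class. Common examples include:

Identifying whether a sentiment is positive or negative
Sorting email into spam and non-spam
Determining whether a patient's laboratory sample is cancerous
Categorizing customers by their propensity to respond to a sales campaign

Classification tasks can be binary or multiclass. The problem here is a binary classification problem.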

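Preparing the data

First, create a .NET Core console application. Once it is set up, add the Microsoft.ML NuGet package. Then create a folder named Data in the project.

Next, download the files wikipedia-detox-250-line-data.tsv and wikipedia-detox-250-line-test.tsv and place them in the Data folder. Note that the Copy to Output Directory property of both files must be changed to Copy if newer, so that the files are available in the output directory at run time.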

Loading the data

Add the following code to the Main method in Program.cs:

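// A fixed seed keeps results repeatable between runs.
MLContext mlContext = new MLContext(seed: 0);

// Describe the file layout: tab-separated, with a header row.
_textLoader = mlContext.Data.TextReader(new TextLoader.Arguments()
{
    Separator = "tab",
    HasHeader = true,
    Column = new[]
    {
        // Column 0: toxic (1) or not toxic (0).
        new TextLoader.Column("Label", DataKind.Bool, 0),
        // Column 1: the comment text.
        new TextLoader.Column("SentimentText", DataKind.Text, 1)
    }
});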

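Its purpose is to prepare for loading the data by using the TextLoader class.

The Column property defines two objects, one for each of the two columns in the dataset. Note, however, that the first column must be named Label here rather than Sentiment.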
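To make the expected file layout concrete, here is a minimal sketch in plain C# (no ML.NET involved) of what the loader settings above describe: a tab-separated file with a header row, a label in column 0, and the comment text in column 1. The ParseLine helper, the TsvSketch class, and the sample lines are all hypothetical, and the sketch assumes the label is written as 0 or 1; it is not how TextLoader itself is implemented.

```csharp
using System;
using System.Collections.Generic;

static class TsvSketch
{
    // Hypothetical helper: splits one tab-separated line into (label, text).
    // Assumes column 0 holds "0" or "1" and column 1 holds the comment text.
    public static (bool Label, string Text) ParseLine(string line)
    {
        string[] parts = line.Split('\t');
        if (parts.Length < 2)
            throw new FormatException("Expected at least two tab-separated columns.");
        return (parts[0].Trim() == "1", parts[1]);
    }

    public static void Main()
    {
        // Made-up sample content; the first line plays the role of the header row.
        var lines = new List<string>
        {
            "Sentiment\tSentimentText",
            "1\tThis comment is rude",
            "0\tThanks for the helpful edit"
        };

        // Skip the header (the equivalent of HasHeader = true) and parse the rest.
        for (int i = 1; i < lines.Count; i++)
        {
            var (label, text) = ParseLine(lines[i]);
            Console.WriteLine($"{(label ? "toxic" : "not toxic")}: {text}");
        }
    }
}
```

In the tutorial itself this work is done by TextLoader; the sketch only illustrates which column ends up as the label and which as the text.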

Extracting features

Create a new file named SentimentData.cs and add the SentimentData and SentimentPrediction classes to it.

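public class SentimentData
{
    // Column 0 of the dataset, exposed under the name "Label".
    [Column(ordinal: "0", name: "Label")]
    public float Sentiment;

    // Column 1 of the dataset: the comment text.
    [Column(ordinal: "1")]
    public string SentimentText;
}

public class SentimentPrediction
{
    // Predicted class: true means toxic.
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }

    // Estimated probability that the comment is toxic.
    [ColumnName("Probability")]
    public float Probability { get; set; }

    // Raw score produced by the model.
    [ColumnName("Score")]
    public float Score { get; set; }
}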

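In the SentimentData class, SentimentText is the feature of the input dataset and Sentiment is its label.

The SentimentPrediction class holds the prediction produced once the model has been trained.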

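Training the model

Add a Train method to the Program class. It first reads the training dataset, then converts the text in the feature column into an array of floats and specifies the trainer to use, here FastTree, a binary classifier based on boosted decision trees. After that comes the actual training of the model.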

public static ITransformer Train(MLContext mlContext, string dataPath)
{
    // Read the training data from the TSV file.
    IDataView dataView = _textLoader.Read(dataPath);

    // Turn SentimentText into a numeric Features column, then append the FastTree trainer.
    var pipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")
        .Append(mlContext.BinaryClassification.Trainers.FastTree(numLeaves: 50, numTrees: 50, minDatapointsInLeaves: 20));

    Console.WriteLine("=============== Create and Train the Model ===============");
    // Fitting the pipeline is what actually trains the model.
    var model = pipeline.Fit(dataView);
    Console.WriteLine("=============== End of training ===============");
    Console.WriteLine();

    return model;
}
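To give a feel for what "converting text into an array of floats" means, here is a deliberately simplified bag-of-words sketch in plain C#. It is an illustration under stated assumptions, not what FeaturizeText does internally: the BagOfWordsSketch class, its methods, and the two-sentence corpus are hypothetical, and splitting on spaces is a simplification.

```csharp
using System;
using System.Collections.Generic;

static class BagOfWordsSketch
{
    // Lowercases the text and splits it into words on spaces (a simplification).
    static string[] Tokenize(string text) =>
        text.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Assigns each distinct word an index in the feature vector, in order of first appearance.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> texts)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (string text in texts)
            foreach (string word in Tokenize(text))
                if (!vocabulary.ContainsKey(word))
                    vocabulary.Add(word, vocabulary.Count);
        return vocabulary;
    }

    // Counts how often each vocabulary word occurs; words outside the vocabulary are ignored.
    public static float[] Featurize(string text, Dictionary<string, int> vocabulary)
    {
        var features = new float[vocabulary.Count];
        foreach (string word in Tokenize(text))
            if (vocabulary.TryGetValue(word, out int index))
                features[index] += 1f;
        return features;
    }

    public static void Main()
    {
        // Made-up corpus; the resulting indices are you=0, are=1, rude=2, kind=3.
        var corpus = new[] { "you are rude", "you are kind" };
        var vocabulary = BuildVocabulary(corpus);

        float[] vector = Featurize("rude rude you", vocabulary);
        Console.WriteLine(string.Join(", ", vector)); // prints "1, 0, 2, 0"
    }
}
```

A trainer such as FastTree can only work with numeric vectors like this one, which is why the featurization step comes before the trainer in the pipeline.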

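Evaluating the model

Add an Evaluate method. This step reads the test dataset, and the data read in must again be transformed by the model into the appropriate form before metrics can be computed.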

public static void Evaluate(MLContext mlContext, ITransformer model)
{
    // Read the test data from the TSV file.
    IDataView dataView = _textLoader.Read(_testDataPath);

    Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
    // Apply the trained model to the test data to obtain predictions.
    var predictions = model.Transform(dataView);

    // Compare the predictions with the true values in the Label column.
    var metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");

    Console.WriteLine();
    Console.WriteLine("Model quality metrics evaluation");
    Console.WriteLine("--------------------------------");
    Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
    Console.WriteLine($"Auc: {metrics.Auc:P2}");
    Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
    Console.WriteLine("=============== End of model evaluation ===============");
}
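As a rough illustration of what two of the printed metrics mean, the following plain C# sketch computes accuracy and F1 score from confusion-matrix counts. The MetricsSketch class is hypothetical, and so are the counts: they were chosen only so that the arithmetic works out to 15/18 (about 83.33%) and 6/7 (about 85.71%), and they are not taken from an actual run. The sketch also assumes none of the denominators is zero.

```csharp
using System;

static class MetricsSketch
{
    // Accuracy: the share of all predictions that were correct.
    public static double Accuracy(int tp, int fp, int tn, int fn) =>
        (double)(tp + tn) / (tp + fp + tn + fn);

    // F1 score: the harmonic mean of precision and recall.
    public static double F1Score(int tp, int fp, int fn)
    {
        double precision = (double)tp / (tp + fp); // how many predicted positives were right
        double recall = (double)tp / (tp + fn);    // how many actual positives were found
        return 2 * precision * recall / (precision + recall);
    }

    public static void Main()
    {
        // Hypothetical counts for 18 test comments (not from a real run):
        // true positives, false positives, true negatives, false negatives.
        int tp = 9, fp = 3, tn = 6, fn = 0;

        Console.WriteLine($"Accuracy: {Accuracy(tp, fp, tn, fn):F4}");
        Console.WriteLine($"F1Score:  {F1Score(tp, fp, fn):F4}");
    }
}
```

With these counts the precision is 9/12 and the recall is 9/9, which shows how a model can find every toxic comment and still be penalized for flagging harmless ones.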

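Using the model

Once the model has been trained and evaluated, it is ready for real use. This requires creating an object for making predictions (a PredictionFunction), whose Predict method takes a SentimentData instance as input and returns a SentimentPrediction.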

private static void Predict(MLContext mlContext, ITransformer model)
{
    // Create a prediction function that maps SentimentData to SentimentPrediction.
    var predictionFunction = model.MakePredictionFunction<SentimentData, SentimentPrediction>(mlContext);

    // A single sample comment to classify.
    SentimentData sampleStatement = new SentimentData
    {
        SentimentText = "This is a very rude movie"
    };

    var resultprediction = predictionFunction.Predict(sampleStatement);

    Console.WriteLine();
    Console.WriteLine("=============== Prediction Test of model with a single sample and test dataset ===============");
    Console.WriteLine();
    Console.WriteLine($"Sentiment: {sampleStatement.SentimentText} | Prediction: {(Convert.ToBoolean(resultprediction.Prediction) ? "Toxic" : "Not Toxic")} | Probability: {resultprediction.Probability} ");
    Console.WriteLine("=============== End of Predictions ===============");
    Console.WriteLine();
}
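Because the prediction carries a probability as well as a label, an application that wants to be more cautious before acting on a comment could apply its own cutoff to that probability. The sketch below shows the idea in plain C#. The ThresholdSketch class, the ToLabel helper, and both cutoff values are assumptions made for illustration; this is not how ML.NET derives PredictedLabel internally.

```csharp
using System;

static class ThresholdSketch
{
    // Hypothetical helper: labels a comment as toxic only when the
    // probability reaches the given cutoff (0.5 by default, an assumed value).
    public static string ToLabel(float probability, float threshold = 0.5f) =>
        probability >= threshold ? "Toxic" : "Not Toxic";

    public static void Main()
    {
        // A probability of the same size as the one a model might report.
        float probability = 0.7387648f;

        Console.WriteLine(ToLabel(probability));       // prints "Toxic"
        Console.WriteLine(ToLabel(probability, 0.8f)); // prints "Not Toxic"
    }
}
```

Raising the cutoff trades missed toxic comments for fewer false alarms, which matters when the action taken on a flagged comment is hard to undo.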

Complete sample code

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Core.Data;
using Microsoft.ML.Runtime.Data;
using Microsoft.ML.Transforms.Text;

namespace SentimentAnalysis
{
    class Program
    {
        static readonly string _trainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "wikipedia-detox-250-line-data.tsv");
        static readonly string _testDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "wikipedia-detox-250-line-test.tsv");
        static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "Model.zip");
        static TextLoader _textLoader;

        static void Main(string[] args)
        {
            MLContext mlContext = new MLContext(seed: 0);

            _textLoader = mlContext.Data.TextReader(new TextLoader.Arguments()
            {
                Separator = "tab",
                HasHeader = true,
                Column = new[]
                {
                    new TextLoader.Column("Label", DataKind.Bool, 0),
                    new TextLoader.Column("SentimentText", DataKind.Text, 1)
                }
            });

            var model = Train(mlContext, _trainDataPath);
            Evaluate(mlContext, model);
            Predict(mlContext, model);

            Console.Read();
        }

        public static ITransformer Train(MLContext mlContext, string dataPath)
        {
            IDataView dataView = _textLoader.Read(dataPath);

            var pipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")
                .Append(mlContext.BinaryClassification.Trainers.FastTree(numLeaves: 50, numTrees: 50, minDatapointsInLeaves: 20));

            Console.WriteLine("=============== Create and Train the Model ===============");
            var model = pipeline.Fit(dataView);
            Console.WriteLine("=============== End of training ===============");
            Console.WriteLine();

            return model;
        }

        public static void Evaluate(MLContext mlContext, ITransformer model)
        {
            IDataView dataView = _textLoader.Read(_testDataPath);

            Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
            var predictions = model.Transform(dataView);
            var metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");

            Console.WriteLine();
            Console.WriteLine("Model quality metrics evaluation");
            Console.WriteLine("--------------------------------");
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.Auc:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
            Console.WriteLine("=============== End of model evaluation ===============");
        }

        private static void Predict(MLContext mlContext, ITransformer model)
        {
            var predictionFunction = model.MakePredictionFunction<SentimentData, SentimentPrediction>(mlContext);

            SentimentData sampleStatement = new SentimentData
            {
                SentimentText = "This is a very rude movie"
            };

            var resultprediction = predictionFunction.Predict(sampleStatement);

            Console.WriteLine();
            Console.WriteLine("=============== Prediction Test of model with a single sample and test dataset ===============");
            Console.WriteLine();
            Console.WriteLine($"Sentiment: {sampleStatement.SentimentText} | Prediction: {(Convert.ToBoolean(resultprediction.Prediction) ? "Toxic" : "Not Toxic")} | Probability: {resultprediction.Probability} ");
            Console.WriteLine("=============== End of Predictions ===============");
            Console.WriteLine();
        }
    }
}

Output displayed when the program runs:

=============== Create and Train the Model ===============
=============== End of training ===============
=============== Evaluating Model accuracy with Test data===============
Model quality metrics evaluation
--------------------------------
Accuracy: 83.33%
Auc: 98.77%
F1Score: 85.71%
=============== End of model evaluation ===============
=============== Prediction Test of model with a single sample and test dataset ===============
Sentiment: This is a very rude movie | Prediction: Toxic | Probability: 0.7387648
=============== End of Predictions ===============

As the output shows, when asked about the comment "This is a very rude movie", the model judged it to be toxic :-)


ML.NET Tutorial: Sentiment Analysis (A Binary Classification Problem)