理解问题

客户细分需要解决的问题是按照客户之间的相似特征区分不同客户群体。这个问题的先决条件中没有可供使用的客户分类列表,只有客户的人物画像。

数据集

已有的数据是公司的历史商业活动记录以及客户的购买记录。

offer.csv:

Offer #,Campaign,Varietal,Minimum Qty (kg),Discount (%),Origin,Past Peak
1,January,Malbec,72,56,France,FALSE
2,January,Pinot Noir,72,17,France,FALSE
3,February,Espumante,144,32,Oregon,TRUE
4,February,Champagne,72,48,France,TRUE
5,February,Cabernet Sauvignon,144,44,New Zealand,TRUE
6,March,Prosecco,144,86,Chile,FALSE
7,March,Prosecco,6,40,Australia,TRUE
8,March,Espumante,6,45,South Africa,FALSE
9,April,Chardonnay,144,57,Chile,FALSE
10,April,Prosecco,72,52,California,FALSE
11,May,Champagne,72,85,France,FALSE
12,May,Prosecco,72,83,Australia,FALSE
13,May,Merlot,6,43,Chile,FALSE
14,June,Merlot,72,64,Chile,FALSE
15,June,Cabernet Sauvignon,144,19,Italy,FALSE
16,June,Merlot,72,88,California,FALSE
17,July,Pinot Noir,12,47,Germany,FALSE
18,July,Espumante,6,50,Oregon,FALSE
19,July,Champagne,12,66,Germany,FALSE
20,August,Cabernet Sauvignon,72,82,Italy,FALSE
21,August,Champagne,12,50,California,FALSE
22,August,Champagne,72,63,France,FALSE
23,September,Chardonnay,144,39,South Africa,FALSE
24,September,Pinot Noir,6,34,Italy,FALSE
25,October,Cabernet Sauvignon,72,59,Oregon,TRUE
26,October,Pinot Noir,144,83,Australia,FALSE
27,October,Champagne,72,88,New Zealand,FALSE
28,November,Cabernet Sauvignon,12,56,France,TRUE
29,November,Pinot Grigio,6,87,France,FALSE
30,December,Malbec,6,54,France,FALSE
31,December,Champagne,72,89,France,FALSE
32,December,Cabernet Sauvignon,72,45,Germany,TRUE

transaction.csv:

Customer Last Name,Offer #
Smith,2
Smith,24
Johnson,17
Johnson,24
Johnson,26
Williams,18
Williams,22
Williams,31
Brown,7
Brown,29
Brown,30
Jones,8
Miller,6
Miller,10
Miller,14
Miller,15
Miller,22
Miller,23
Miller,31
Davis,12
Davis,22
Davis,25
Garcia,14
Garcia,15
Rodriguez,2
Rodriguez,26
Wilson,8
Wilson,30
Martinez,12
Martinez,25
Martinez,28
Anderson,24
Anderson,26
Taylor,7
Taylor,18
Taylor,29
Taylor,30
Thomas,1
Thomas,4
Thomas,9
Thomas,11
Thomas,14
Thomas,26
Hernandez,28
Hernandez,29
Moore,17
Moore,24
Martin,2
Martin,11
Martin,28
Jackson,1
Jackson,2
Jackson,11
Jackson,15
Jackson,22
Thompson,9
Thompson,16
Thompson,25
Thompson,30
White,14
White,22
White,25
White,30
Lopez,9
Lopez,11
Lopez,15
Lopez,16
Lopez,27
Lee,3
Lee,4
Lee,6
Lee,22
Lee,27
Gonzalez,9
Gonzalez,31
Harris,4
Harris,6
Harris,7
Harris,19
Harris,22
Harris,27
Clark,4
Clark,11
Clark,28
Clark,31
Lewis,7
Lewis,8
Lewis,30
Robinson,7
Robinson,29
Walker,18
Walker,29
Perez,18
Perez,30
Hall,11
Hall,22
Young,6
Young,9
Young,15
Young,22
Young,31
Young,32
Allen,9
Allen,27
Sanchez,4
Sanchez,5
Sanchez,14
Sanchez,15
Sanchez,20
Sanchez,22
Sanchez,26
Wright,4
Wright,6
Wright,21
Wright,27
King,7
King,13
King,18
King,29
Scott,6
Scott,14
Scott,23
Green,7
Baker,7
Baker,10
Baker,19
Baker,31
Adams,18
Adams,29
Adams,30
Nelson,3
Nelson,4
Nelson,8
Nelson,31
Hill,8
Hill,13
Hill,18
Hill,30
Ramirez,9
Campbell,2
Campbell,24
Campbell,26
Mitchell,1
Mitchell,2
Roberts,31
Carter,7
Carter,13
Carter,29
Carter,30
Phillips,17
Phillips,24
Evans,22
Evans,27
Turner,4
Turner,6
Turner,27
Turner,31
Torres,8
Parker,11
Parker,16
Parker,20
Parker,29
Parker,31
Collins,11
Collins,30
Edwards,8
Edwards,27
Stewart,8
Stewart,29
Stewart,30
Flores,17
Flores,24
Morris,17
Morris,24
Morris,26
Nguyen,19
Nguyen,31
Murphy,7
Murphy,12
Rivera,7
Rivera,18
Cook,24
Cook,26
Rogers,3
Rogers,7
Rogers,8
Rogers,19
Rogers,21
Rogers,22
Morgan,8
Morgan,29
Peterson,1
Peterson,2
Peterson,10
Peterson,23
Peterson,26
Peterson,27
Cooper,4
Cooper,16
Cooper,20
Cooper,32
Reed,5
Reed,14
Bailey,7
Bailey,30
Bell,2
Bell,17
Bell,24
Bell,26
Gomez,11
Gomez,20
Gomez,25
Gomez,32
Kelly,6
Kelly,20
Kelly,31
Kelly,32
Howard,11
Howard,12
Howard,22
Ward,4
Cox,2
Cox,17
Cox,24
Cox,26
Diaz,7
Diaz,8
Diaz,29
Diaz,30
Richardson,3
Richardson,6
Richardson,22
Wood,1
Wood,10
Wood,14
Wood,31
Watson,7
Watson,29
Brooks,3
Brooks,8
Brooks,11
Brooks,22
Bennett,8
Bennett,29
Gray,12
Gray,16
Gray,26
James,7
James,8
James,13
James,18
James,30
Reyes,9
Reyes,23
Cruz,29
Cruz,30
Hughes,7
Hughes,8
Hughes,13
Hughes,29
Hughes,30
Price,1
Price,22
Price,30
Price,31
Myers,18
Myers,30
Long,3
Long,7
Long,10
Foster,1
Foster,9
Foster,14
Foster,22
Foster,23
Sanders,1
Sanders,4
Sanders,5
Sanders,6
Sanders,9
Sanders,11
Sanders,20
Sanders,25
Sanders,26
Ross,18
Ross,21
Morales,6
Morales,7
Morales,8
Morales,19
Morales,22
Morales,31
Powell,5
Sullivan,8
Sullivan,13
Sullivan,18
Russell,26
Ortiz,8
Jenkins,24
Jenkins,26
Gutierrez,6
Gutierrez,8
Gutierrez,10
Gutierrez,18
Perry,8
Perry,18
Perry,29
Perry,30
Butler,1
Butler,4
Butler,22
Butler,28
Butler,30
Barnes,10
Barnes,21
Barnes,22
Barnes,31
Fisher,1
Fisher,2
Fisher,11
Fisher,22
Fisher,28
Fisher,30
Fisher,31

预处理

需要对两个数据集做关联处理,这样才能得到单一的视图。同时由于需要比较客户所产生的交易,还需要建立一张透视表。行代表客户,列代表商业活动,单元格值则显示是否客户有购买行为。

var offers = Offer.ReadFromCsv(_offersCsv);
var transactions = Transaction.ReadFromCsv(_transactionsCsv); var clusterData = (from of in offers
join tr in transactions on of.OfferId equals tr.OfferId
select new
{
of.OfferId,
of.Campaign,
of.Discount,
tr.LastName,
of.LastPeak,
of.Minimum,
of.Origin,
of.Varietal,
Count = 1,
}).ToArray(); var count = offers.Count();
var pivotDataArray =
(from c in clusterData
group c by c.LastName into gcs
let lookup = gcs.ToLookup(y => y.OfferId, y => y.Count)
select new PivotData()
{
LastName = gcs.Key,
Features = ToFeatures(lookup, count)
}).ToArray();

ToFeatures方法依据商业活动的数量,生成所需的特征数组。

private static float[] ToFeatures(ILookup<string, int> lookup, int count)
{
var result = new float[count];
foreach (var item in lookup)
{
var key = Convert.ToInt32(item.Key) - 1;
result[key] = item.Sum();
} return result;
}

数据视图

取得用于生成视图的数组后,这里使用CreateStreamingDataView方法构建数据视图。而又因为Features属性是一个数组,所以必须声明其大小。

var mlContext = new MLContext();

var schemaDef = SchemaDefinition.Create(typeof(PivotData));
schemaDef["Features"].ColumnType = new VectorType(NumberType.R4, count); var pivotDataView = mlContext.CreateStreamingDataView(pivotDataArray, schemaDef);

PCA

PCA(principal Component Analysis),主成分分析,是为了将过多的维度值减少至一个合适的范围以便于分析,这里是降到二维空间。

new PrincipalComponentAnalysisEstimator(mlContext, "Features", "PCAFeatures", rank: 2)

OneHotEncoding

One Hot Encoding在此处的作用是将LastName从字符串转换为数字矩阵。

new OneHotEncodingEstimator(mlContext, new[] { new OneHotEncodingEstimator.ColumnInfo("LastName", "LastNameKey", OneHotEncodingTransformer.OutputKind.Ind) })

训练器

K-Means是常用的应对聚类问题的训练器,这里假设要分为三类。

mlContext.Clustering.Trainers.KMeans("Features", clustersCount: 3)

训练模型

trainingPipeline.Fit(pivotDataView);

评估模型

var predictions = trainedModel.Transform(pivotDataView);
var metrics = mlContext.Clustering.Evaluate(predictions, score: "Score", features: "Features"); Console.WriteLine($"*************************************************");
Console.WriteLine($"* Metrics for {trainer} clustering model ");
Console.WriteLine($"*------------------------------------------------");
Console.WriteLine($"* AvgMinScore: {metrics.AvgMinScore}");
Console.WriteLine($"* DBI is: {metrics.Dbi}");
Console.WriteLine($"*************************************************");

可得到如下的评估结果。

*************************************************
* Metrics for Microsoft.ML.Trainers.KMeans.KMeansPlusPlusTrainer clustering model
*------------------------------------------------
* AvgMinScore: 2.3154067927599
* DBI is: 2.69100740819456
*************************************************

使用模型

var clusteringPredictions = predictions
.AsEnumerable<ClusteringPrediction>(mlContext, false)
.ToArray();

画图

为了更直观地观察,可以用OxyPlot类库生成结果图片。

添加类库:

dotnet add package OxyPlot.Core

Plot生成处理:

var plot = new PlotModel { Title = "Customer Segmentation", IsLegendVisible = true };

var clusters = clusteringPredictions.Select(p => p.SelectedClusterId).Distinct().OrderBy(x => x);

foreach (var cluster in clusters)
{
var scatter = new ScatterSeries { MarkerType = MarkerType.Circle, MarkerStrokeThickness = 2, Title = $"Cluster: {cluster}", RenderInLegend = true };
var series = clusteringPredictions
.Where(p => p.SelectedClusterId == cluster)
.Select(p => new ScatterPoint(p.Location[0], p.Location[1])).ToArray();
scatter.Points.AddRange(series);
plot.Series.Add(scatter);
} plot.DefaultColors = OxyPalettes.HueDistinct(plot.Series.Count).Colors; var exporter = new SvgExporter { Width = 600, Height = 400 };
using (var fs = new System.IO.FileStream(_plotSvg, System.IO.FileMode.Create))
{
exporter.Export(plot, fs);
}

最后的图片如下所示:

完整示例代码

Program类:

using CustomerSegmentation.DataStructures;
using Microsoft.ML;
using System;
using System.IO;
using System.Linq;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Transforms.Projections;
using Microsoft.ML.Transforms.Categorical;
using Microsoft.ML.Runtime.Data;
using OxyPlot;
using OxyPlot.Series;
using Microsoft.ML.Core.Data; namespace CustomerSegmentation
{
class Program
{
private static float[] ToFeatures(ILookup<string, int> lookup, int count)
{
var result = new float[count];
foreach (var item in lookup)
{
var key = Convert.ToInt32(item.Key) - 1;
result[key] = item.Sum();
} return result;
} static readonly string _offersCsv = Path.Combine(Environment.CurrentDirectory, "assets", "offers.csv");
static readonly string _transactionsCsv = Path.Combine(Environment.CurrentDirectory, "assets", "transactions.csv");
static readonly string _plotSvg = Path.Combine(Environment.CurrentDirectory, "assets", "customerSegmentation.svg"); static void Main(string[] args)
{
var offers = Offer.ReadFromCsv(_offersCsv);
var transactions = Transaction.ReadFromCsv(_transactionsCsv); var clusterData = (from of in offers
join tr in transactions on of.OfferId equals tr.OfferId
select new
{
of.OfferId,
of.Campaign,
of.Discount,
tr.LastName,
of.LastPeak,
of.Minimum,
of.Origin,
of.Varietal,
Count = 1,
}).ToArray(); var count = offers.Count();
var pivotDataArray =
(from c in clusterData
group c by c.LastName into gcs
let lookup = gcs.ToLookup(y => y.OfferId, y => y.Count)
select new PivotData()
{
LastName = gcs.Key,
Features = ToFeatures(lookup, count)
}).ToArray(); var mlContext = new MLContext(); var schemaDef = SchemaDefinition.Create(typeof(PivotData));
schemaDef["Features"].ColumnType = new VectorType(NumberType.R4, count); var pivotDataView = mlContext.CreateStreamingDataView(pivotDataArray, schemaDef); var dataProcessPipeline = new PrincipalComponentAnalysisEstimator(mlContext, "Features", "PCAFeatures", rank: 2)
.Append(new OneHotEncodingEstimator(mlContext,
new[] { new OneHotEncodingEstimator.ColumnInfo("LastName", "LastNameKey", OneHotEncodingTransformer.OutputKind.Ind) })); var trainer = mlContext.Clustering.Trainers.KMeans("Features", clustersCount: 3);
var trainingPipeline = dataProcessPipeline.Append(trainer); ITransformer trainedModel = trainingPipeline.Fit(pivotDataView); var predictions = trainedModel.Transform(pivotDataView);
var metrics = mlContext.Clustering.Evaluate(predictions, score: "Score", features: "Features"); Console.WriteLine($"*************************************************");
Console.WriteLine($"* Metrics for {trainer} clustering model ");
Console.WriteLine($"*------------------------------------------------");
Console.WriteLine($"* AvgMinScore: {metrics.AvgMinScore}");
Console.WriteLine($"* DBI is: {metrics.Dbi}");
Console.WriteLine($"*************************************************"); var clusteringPredictions = predictions
.AsEnumerable<ClusteringPrediction>(mlContext, false)
.ToArray(); var plot = new PlotModel { Title = "Customer Segmentation", IsLegendVisible = true }; var clusters = clusteringPredictions.Select(p => p.SelectedClusterId).Distinct().OrderBy(x => x); foreach (var cluster in clusters)
{
var scatter = new ScatterSeries { MarkerType = MarkerType.Circle, MarkerStrokeThickness = 2, Title = $"Cluster: {cluster}", RenderInLegend = true };
var series = clusteringPredictions
.Where(p => p.SelectedClusterId == cluster)
.Select(p => new ScatterPoint(p.Location[0], p.Location[1])).ToArray();
scatter.Points.AddRange(series);
plot.Series.Add(scatter);
} plot.DefaultColors = OxyPalettes.HueDistinct(plot.Series.Count).Colors; var exporter = new SvgExporter { Width = 600, Height = 400 };
using (var fs = new System.IO.FileStream(_plotSvg, System.IO.FileMode.Create))
{
exporter.Export(plot, fs);
} Console.Read();
}
}
}

Offer类:

using System.Collections.Generic;
using System.IO;
using System.Linq; namespace CustomerSegmentation.DataStructures
{
public class Offer
{
//Offer #,Campaign,Varietal,Minimum Qty (kg),Discount (%),Origin,Past Peak
public string OfferId { get; set; }
public string Campaign { get; set; }
public string Varietal { get; set; }
public float Minimum { get; set; }
public float Discount { get; set; }
public string Origin { get; set; }
public string LastPeak { get; set; } public static IEnumerable<Offer> ReadFromCsv(string file)
{
return File.ReadAllLines(file)
.Skip(1) // skip header
.Select(x => x.Split(','))
.Select(x => new Offer()
{
OfferId = x[0],
Campaign = x[1],
Varietal = x[2],
Minimum = float.Parse(x[3]),
Discount = float.Parse(x[4]),
Origin = x[5],
LastPeak = x[6]
});
}
}
}

Transaction类:

using System.Collections.Generic;
using System.IO;
using System.Linq; namespace CustomerSegmentation.DataStructures
{
public class Transaction
{
//Customer Last Name,Offer #
//Smith,2
public string LastName { get; set; }
public string OfferId { get; set; } public static IEnumerable<Transaction> ReadFromCsv(string file)
{
return File.ReadAllLines(file)
.Skip(1) // skip header
.Select(x => x.Split(','))
.Select(x => new Transaction()
{
LastName = x[0],
OfferId = x[1],
});
}
}
}

PivotData类:

namespace CustomerSegmentation.DataStructures
{
public class PivotData
{
public float[] Features;
public string LastName;
}
}

ClusteringPrediction类:

using Microsoft.ML.Runtime.Api;
using System;
using System.Collections.Generic;
using System.Text; namespace CustomerSegmentation.DataStructures
{
public class ClusteringPrediction
{
[ColumnName("PredictedLabel")]
public uint SelectedClusterId;
[ColumnName("Score")]
public float[] Distance;
[ColumnName("PCAFeatures")]
public float[] Location;
[ColumnName("LastName")]
public string LastName;
}
}

ML.NET教程之客户细分(聚类问题)的更多相关文章

  1. ML.NET 示例:聚类之客户细分

    写在前面 准备近期将微软的machinelearning-samples翻译成中文,水平有限,如有错漏,请大家多多指正. 如果有朋友对此感兴趣,可以加入我:https://github.com/fei ...

  2. 数据挖掘应用案例:RFM模型分析与客户细分(转)

    正好刚帮某电信行业完成一个数据挖掘工作,其中的RFM模型还是有一定代表性,就再把数据挖掘RFM模型的建模思路细节与大家分享一下吧!手机充值业务是一项主要电信业务形式,客户的充值行为记录正好满足RFM模 ...

  3. RFM模型的应用 - 电商客户细分(转)

    RFM模型是网点衡量当前用户价值和客户潜在价值的重要工具和手段.RFM是Rencency(最近一次消费),Frequency(消费频率).Monetary(消费金额) 消费指的是客户在店铺消费最近一次 ...

  4. ML.NET教程之情感分析(二元分类问题)

    机器学习的工作流程分为以下几个步骤: 理解问题 准备数据 加载数据 提取特征 构建与训练 训练模型 评估模型 运行 使用模型 理解问题 本教程需要解决的问题是根据网站内评论的意见采取合适的行动. 可用 ...

  5. ML.NET技术研究系列-2聚类算法KMeans

    上一篇博文我们介绍了ML.NET 的入门: ML.NET技术研究系列1-入门篇 本文我们继续,研究分享一下聚类算法k-means. 一.k-means算法简介 k-means算法是一种聚类算法,所谓聚 ...

  6. ML.NET教程之出租车车费预测(回归问题)

    理解问题 出租车的车费不仅与距离有关,还涉及乘客数量,是否使用信用卡等因素(这是的出租车是指纽约市的).所以并不是一个简单的一元方程问题. 准备数据 建立一控制台应用程序工程,新建Data文件夹,在其 ...

  7. oracle基础教程oracle客户端详解

    oracle基础教程oracle客户端工具详解 参考网址:http://www.oraclejsq.com/article/010100114.html 该教程介绍了oracle自带客户端sqlplu ...

  8. ML.NET 示例:开篇

    写在前面 准备近期将微软的machinelearning-samples翻译成中文,水平有限,如有错漏,请大家多多指正. 如果有朋友对此感兴趣,可以加入我:https://github.com/fei ...

  9. ML.NET 示例:目录

    ML.NET 示例中文版:https://github.com/feiyun0112/machinelearning-samples.zh-cn 英文原版请访问:https://github.com/ ...

随机推荐

  1. Objc的底层并发API

    本文由webfrogs译自objc.io,原文作者Daniel Eggert.转载请注明出处! 小引 本篇英文原文所发布的站点objc.io是一个专门为iOS和OS X开发者提供的深入讨论技术的平台, ...

  2. springboot本地读取resources/images没问题,上传到云服务器打成jar包就读取不到问题

    //String watermarkfileName = this.getClass().getClassLoader().getResource("images/watermark.png ...

  3. SessionFactory 执行原生态的SQL语句

    public List<User> getPageIndex(int pageIndex,int topNum){ String hql = "select top(" ...

  4. SNF快速开发平台WinForm-CS甘特图

    我们在做项目当中会经常用到按时间进度查看任务,其通过条状图来显示项目,进度,和其他时间相关的系统进展的内在关系随着时间进展的情况. 甘特图包含以下三个含义: 1.以图形或表格的形式显示活动: 2.通用 ...

  5. 第三部分:Android 应用程序接口指南---第二节:UI---第三章 菜单

    第3章 菜单 在许多不同类型的应用中,菜单通常是一种用户界面组件.为了提供给用户提供熟悉且一致的体验,你需要使用菜单API来展示用户动作和你Activity中的其他选项. 从安卓3.0系统(API l ...

  6. Features Download Pricing Mind Maps Blog XMind的快捷键

    XMind提供很多快捷键.使用XMind时,在操作的过程中结合快捷键,双手同时操作,将给你带来很大的便利.例如下面一些常用的快捷键: 编辑主题:F2 添加标签:F3 创建一个新的空白工作簿:Ctrl+ ...

  7. [k8s]zookeeper集群在k8s的搭建(statefulset模式)-pod的调度

    之前一直docker-compose跑zk集群,现在把它挪到k8s集群里. docker-compose跑zk集群 zk集群in k8s部署 参考: https://github.com/kubern ...

  8. 详解CUDA编程

    CUDA 是 NVIDIA 的 GPGPU 模型,它使用 C 语言为基础,可以直接以大多数人熟悉的 C 语言,写出在显示芯片上执行的程序,而不需要去学习特定的显示芯片的指令或是特殊的结构.” 编者注: ...

  9. 每日英语:How To Survive The Windows XPiration Date

    The default background for Microsoft's Windows XP operating system -- a perfect blue sky full of cot ...

  10. linux每日命令(21):find命令之exec

    find是我们很常用的一个Linux命令,但是我们一般查找出来的并不仅仅是看看而已,还会有进一步的操作,这个时候exec的作用就显现出来了. 一. exec参数说明: -exec 参数后面跟的是com ...