Handling Discrete Features in Machine Learning

Updated: August 25, 2016

Learning with counts is an efficient way to create a compact set of features for a dataset, based on counts of the values. You can use the modules in this section to build a set of counts and features, and later update the counts and the features to take advantage of new data, or merge two sets of count data.

The basic idea underlying count-based featurization is simple: by calculating counts, you can quickly and easily get a summary of which columns contain the most important information. The module counts the number of times each value appears, and then provides that information as a feature for input to a model.

Example of Count-Based Learning

 

Imagine you’re trying to validate a credit card transaction. One crucial piece of information is where this transaction came from, and one of the most common encodings of that location is the postal code. However, there might be as many as 40,000 postal codes, zip codes, and geographical codes to account for. Does your model have the capacity to learn 40,000 more parameters? If you give it that capacity, do you now have enough training data to prevent it from overfitting?

If you had really good data with lots of samples, such fine-grained location information could be quite powerful. However, if you have only one sample of a fraudulent transaction from a small locality, does it mean that all of the transactions from that place are bad, or that you don’t have enough data?

One solution to this conundrum is to learn with counts. That is, rather than introduce 40,000 more features, you can observe the counts and proportions of fraud for each postal code. By using these counts as features, you gain a notion of the strength of the evidence for each value. Moreover, by encoding the relevant statistics of the counts, the learner can use the statistics to decide when to back off and use other features.

Count-based learning is attractive for several reasons: you have fewer features, which require fewer parameters, which in turn makes for faster learning, faster prediction, smaller predictors, and less potential to overfit.
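As a rough illustration of the postal code example, the sketch below turns each postal code into three numeric features (non-fraud count, fraud count, fraud proportion) instead of one indicator column per code. The data, the function names `fraud_counts` and `featurize`, and the choice of features are all assumptions made for illustration; this is not how the modules are implemented.

```python
from collections import defaultdict

# Hypothetical toy data: (postal_code, is_fraud) pairs.
transactions = [
    ("98052", 0), ("98052", 0), ("98052", 1),
    ("10001", 0), ("10001", 0),
    ("73301", 1),
]

def fraud_counts(rows):
    """Return {postal_code: [non_fraud_count, fraud_count]}."""
    counts = defaultdict(lambda: [0, 0])
    for code, label in rows:
        counts[code][label] += 1
    return dict(counts)

def featurize(code, counts):
    """Map a postal code to three numbers, however many distinct codes exist."""
    non_fraud, fraud = counts.get(code, [0, 0])
    total = non_fraud + fraud
    fraud_rate = fraud / total if total else 0.0
    return [non_fraud, fraud, fraud_rate]

counts = fraud_counts(transactions)
print(featurize("98052", counts))
print(featurize("73301", counts))  # [0, 1, 1.0]
```

Note that "73301" has a fraud proportion of 1.0 but a total count of only 1. Keeping the raw counts next to the proportion is what lets a learner see how weak that evidence is.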

How Counts are Created

 

An example might help to demonstrate how count-based features are created and applied. This example is highly simplified, to give you an idea of the overall process, and how to use and interpret count-based features.

Suppose you have a table like this, with labels and inputs:

| Label column | Input value |
| --- | --- |
| 0 | A |
| 0 | A |
| 1 | A |
| 0 | B |
| 1 | B |
| 1 | B |
| 1 | B |

Here is how count-based features are created:

  1. Each case (or row, or sample) has a set of values in columns.

    Here, the values are A, B, and so forth.

  2. For a particular set of values, you find all the other cases in that dataset that have the same value.

    In this case, there are three instances of A and four of B.

  3. Next, you count their class memberships as features in themselves.

    In this case, you get a small matrix: 2 cases where the value is A and the label is 0, 1 case where the value is A and the label is 1, 1 case where the value is B and the label is 0, and 3 cases where the value is B and the label is 1.
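The three steps above can be sketched in a few lines of Python. This is only a minimal illustration on the seven-row example; the variable names are hypothetical and the code is not taken from the modules themselves.

```python
from collections import Counter

# The seven (label, input value) rows from the example table.
rows = [(0, "A"), (0, "A"), (1, "A"),
        (0, "B"), (1, "B"), (1, "B"), (1, "B")]

# Steps 2 and 3: group cases by input value and count class membership.
class_counts = Counter((value, label) for label, value in rows)

# One row per input value: [count of label 0, count of label 1].
count_matrix = {
    value: [class_counts[(value, 0)], class_counts[(value, 1)]]
    for value in sorted({value for _, value in rows})
}
print(count_matrix)  # {'A': [2, 1], 'B': [1, 3]}
```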

When you create features based on this matrix, you get a variety of count-based features, including a calculation of the log-odds ratio as well as the counts for each target class:

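| Label | 0_0_Class000_Count | 0_0_Class001_Count | 0_0_Class000_LogOdds | 0_0_IsBackoff |
| --- | --- | --- | --- | --- |
| 0 | 2 | 1 | 0.510826 | 0 |
| 0 | 2 | 1 | 0.510826 | 0 |
| 1 | 2 | 1 | 0.510826 | 0 |
| 0 | 1 | 3 | -0.8473 | 0 |
| 1 | 1 | 3 | -0.8473 | 0 |
| 1 | 1 | 3 | -0.8473 | 0 |
| 1 | 1 | 3 | -0.8473 | 0 |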
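The log-odds values in this example can be reproduced with the smoothed formula given under Technical Notes, LogOdds = Log(x_0 + c * p_0) - Log(x_1 + c * p_1). The text does not state which settings produced the example, so the values below are an assumption: a prior coefficient c = 1 and a prior frequency of 0.5 for each class happen to match the listed numbers. The function name `smoothed_log_odds` is hypothetical.

```python
import math

def smoothed_log_odds(x0, x1, p0=0.5, c=1.0):
    """Log(x0 + c*p0) - Log(x1 + c*p1), with p1 = 1 - p0 and natural log.

    The defaults p0=0.5 and c=1.0 are assumptions chosen to match the example.
    """
    p1 = 1.0 - p0
    return math.log(x0 + c * p0) - math.log(x1 + c * p1)

# Class counts per input value: (count of label 0, count of label 1).
count_matrix = {"A": (2, 1), "B": (1, 3)}

for value, (x0, x1) in count_matrix.items():
    print(value, x0, x1, round(smoothed_log_odds(x0, x1), 6))
# A 2 1 0.510826
# B 1 3 -0.847298
```

With c = 0 the same function returns the plain log-odds, for example Log(2 / 1) for value A, and it fails when either count is zero. A positive prior coefficient avoids that case.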

Examples

 

The following article from the Microsoft Machine Learning team provides a detailed walkthrough of how to use counts in machine learning, and compares the efficacy of count-based modeling with other methods.

Using Azure ML to Build Clickthrough Prediction Models

Technical Notes

  • How is the log-odds value calculated?

    The log-odds value reported as a feature is not the plain log-odds; the prior distribution is used to smooth the log-odds computation.

    Suppose you have a dataset used for binary classification. In this dataset, the prior frequency for class 0 is p_0, and the prior frequency for class 1 is p_1 = 1 - p_0. For a given feature value in the training data, the count for class 0 is x_0, and the count for class 1 is x_1.

    Under these assumptions, the log-odds is computed as:

    LogOdds = Log(x_0 + c * p_0) - Log(x_1 + c * p_1)

    Where:

    • c is the prior coefficient, which can be set by the user.

    • Log uses the natural base.

    In other words, for each class i:

    Log_odds[i] = Log( (count[i] + prior_coefficient * prior_frequency[i]) / (sum_of_counts - count[i] + prior_coefficient * (1 - prior_frequency[i])) )

    If the prior coefficient is positive, the log-odds can differ from Log(count[i] / (sum_of_counts - count[i])).

  • Why are the log odds not computed for some items?

    By default, all items with a count less than 10 are collected in a single bucket called the "garbage bin". You can change this threshold by using the Garbage bin threshold option in the Modify Count Table Parameters module.
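The garbage bin behavior can be pictured as pooling rare values before any per-value statistics are computed. The sketch below is one plausible reading, assuming that "count" means the total number of occurrences of a value across both classes. The key name `__garbage_bin__` and the function `apply_garbage_bin` are hypothetical and do not come from the module.

```python
from collections import defaultdict

GARBAGE_BIN = "__garbage_bin__"  # hypothetical key name, for illustration only

def apply_garbage_bin(count_table, threshold=10):
    """Pool every value whose total count is below `threshold` into one bucket."""
    pooled = defaultdict(lambda: [0, 0])
    for value, (x0, x1) in count_table.items():
        key = value if x0 + x1 >= threshold else GARBAGE_BIN
        pooled[key][0] += x0
        pooled[key][1] += x1
    return dict(pooled)

# Hypothetical count table: value -> (count of class 0, count of class 1).
count_table = {"A": (40, 2), "B": (3, 1), "C": (0, 4), "D": (12, 9)}
binned = apply_garbage_bin(count_table)
print(binned)
# {'A': [40, 2], '__garbage_bin__': [3, 5], 'D': [12, 9]}
```

Values B and C each occur fewer than 10 times, so they share one pooled row and get no individual log-odds.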

List of Modules

 

The Learning with Counts category includes the following modules:

| Module | Description |
| --- | --- |
| Build Counting Transform | Creates a count table and count-based features from a dataset, and saves it as a transformation. |
| Export Count Table | Exports a count table from a counting transform. This module supports backward compatibility with experiments that create count-based features using Build Count Table (deprecated) and Count Featurizer (deprecated). |
| Import Count Table | Imports an existing count table. This module supports backward compatibility with experiments that create count-based features using Build Count Table (deprecated) and Count Featurizer (deprecated). It supports conversion of count tables to count transformations. |
| Merge Count Transform | Merges two sets of count-based features. |
| Modify Count Table Parameters | Modifies count-based features derived from an existing count table. |
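Because the underlying statistics are plain counts, merging two sets of count data or updating counts with new data can be pictured as adding the per-class counts for each value. The sketch below shows only that idea, under the assumption that both tables use the same class order. The text does not describe the actual behavior of Merge Count Transform, and `merge_count_tables` is a hypothetical name.

```python
def merge_count_tables(first, second):
    """Add per-class counts value by value; values seen in only one table are kept."""
    merged = {value: list(counts) for value, counts in first.items()}
    for value, (x0, x1) in second.items():
        totals = merged.setdefault(value, [0, 0])
        totals[0] += x0
        totals[1] += x1
    return merged

# Hypothetical count tables: value -> (count of class 0, count of class 1).
old = {"A": (2, 1), "B": (1, 3)}
new = {"B": (4, 0), "C": (0, 2)}
print(merge_count_tables(old, new))
# {'A': [2, 1], 'B': [5, 3], 'C': [0, 2]}
```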
