https://github.com/szilard/benchm-ml/issues/1

glouppe commented on 7 May 2015

Thanks for the benchmarks! Proper handling of categorical variables is not an easy issue anyway.

I would expect faster, lower memory but decrease in AUC (or same in some cases).

When the categories are ordered, it makes more sense indeed to handle them as numerical variables. I dont have a strong argument as to why it may be also better when there is no natural ordering. I guess it could boil down to the fact that one-hot encoding splits are often very unbalanced, while integer encoded splits may be less unbalanced.

Thanks @glouppe. I read somewhere a paper that AFAIR suggested to sort the (non-ordered) categoricals in order of their frequency in the data and encode them as integers as such. Any idea what that paper might be?

glouppe commented on 7 May 2015

Yes, it is Breiman's book :) When your output is binary, this strategy is in fact optimal (it will find the best subset among the values of the categorical variables) and linear.

See section 3.6.3.2 of my thesis if you dont have the CART book.
http://orbi.ulg.ac.be/bitstream/2268/170309/1/thesis.pdf

One-hot encoding could be helpful when the number of categories are small( in level of 10 to 100). In such case one-hot encoding can discover interesting interactions like (gender=male) AND (job = teacher).

While ordering them makes it harder to be discovered(need two split on job). However, indeed there is not a unified way handling categorical features in trees, and usually what tree was really good at was ordered continuous features anyway..

 
 

 

integer encoding vs 1-hot (py)的更多相关文章

  1. [已解决]关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no encoding declared。

    想在python代码中输出汉字.但是老是出现SyntaxError: Non-ASCII character '\xe4' in file test.py on line , but no encod ...

  2. 关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no encoding declared。

    [已解决]关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no enc ...

  3. requests之headers 'Content-Type': 'text/html'误判encoding为'ISO-8859-1'导致中文text解码错误

    0. requests不设置UA 访问baidu 得到 r.headers['Content-Type'] 是text/html  使用chrome UA: Content-Type:text/htm ...

  4. leetCode练题——13. Roman to Integer

    1.题目13. Roman to Integer Roman numerals are represented by seven different symbols: I, V, X, L, C, D ...

  5. 【论文考古】量化SGD QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

    D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-Efficient SGD ...

  6. Python函数信息

    Python函数func的信息可以通过func.func_*和func.func_code来获取 一.先看看它们的应用吧: 1.获取原函数名称: 1 >>> def yes():pa ...

  7. Scrapy学习-23-分布式爬虫

    scrapy-redis分布式爬虫 分布式需要解决的问题 request队列集中管理 去重集中管理 存储管理   使用scrapy-redis实现分布式爬虫 github开源项目: https://g ...

  8. Flask入门系列(转载)

    一.入门系列: Flask入门系列(一)–Hello World 项目开发中,经常要写一些小系统来辅助,比如监控系统,配置系统等等.用传统的Java写,太笨重了,连PHP都嫌麻烦.一直在寻找一个轻量级 ...

  9. 使用Java将搜狗词库文件(文件后缀为.scel)转为.txt文件

    要做一个根据词库进行筛选主要词汇的功能,去搜狗下载专业词汇词库时,发现是.scel文件,且通过转换工具(http://tools.bugscaner.com/sceltotxt/)转换为txt时报错如 ...

随机推荐

  1. 【CSS3】 - 初识CSS3

    .navdemo{ width:560px; height: 50px; font:bold 0/50px Arial; text-align:center; margin:40px auto 0; ...

  2. C# 实现程序只启动一次(实现程序自重启)

    程序运行过程中,不能有多个实例运行,并且需要程序自己可以重启(重新运行),所以代码如果下代码: static void Main() { bool createNew; using (System.T ...

  3. css处理最后一个li

    .relatebar li{width:98px;height:146px;padding:5px;float:left;border-left:1px solid #ccc;} .relatebar ...

  4. TabControl控件用法图解

    1.首先创建一个MFC对话框框架,在对话框资源上从工具箱中添加上一个TabControl控件 2.根据需要修改一下属性,然后右击控件,为这个控件添加一个变量,将此控件跟一个CTabCtrl类变量绑定在 ...

  5. 深入浅出K-Means算法

    在数据挖掘中,K-Means算法是一种cluster analysis的算法,其主要是来计算数据聚集的算法,主要通过不断地取离种子点最近均值的算法. 问题 K-Means算法主要解决的问题如下图所示. ...

  6. Oracle LSNRCTL------监听器的启动和关闭

    对于DBA来说,启动和关闭oracle监听器是很基础的任务,但是Linux系统管理员或者程序员有时也需要在开发数据库中做一些基本的DBA操作,因此了解一些基本的管理操作对他们来说很重要. 本文将讨论用 ...

  7. java代码逆序输出再连篇

    总结:思维方式关键 package com.dfd; import java.util.Scanner; //逆序输出数字 public class fdad { public static void ...

  8. EasyUI 读数据库图

    EasyUI 读数据库图 function edit_dg() { //选中一行,获取这一行的主键值 var idImg = $("#tbDeviceClassBrowstab") ...

  9. NoSuchBeanDefinitionException: No bean named 'shiroFilter' is defined

    以前运行正常的项目,过了一段时间再运行时出问题,打开链接无反应,无法访问Tomcat,空白页面. 经检查发现,在Tomcat log中有报错: NoSuchBeanDefinitionExceptio ...

  10. [新手教程]windows 2003 php环境搭建详细教程(转)

    对于windows服务器的php环境配置一直是是新人朋友的难题,也难倒了很多高手.这里分享一个新手教程,给那些建站新人使用.本教程来自朋友吴文辉的博客,欢迎大家有时间可以访问他的博客:吴文辉博客htt ...