integer encoding vs 1-hot (py)
https://github.com/szilard/benchm-ml/issues/1
glouppe commented on 7 May 2015
Thanks for the benchmarks! Proper handling of categorical variables is not an easy issue anyway.
When the categories are ordered, it makes more sense indeed to handle them as numerical variables. I dont have a strong argument as to why it may be also better when there is no natural ordering. I guess it could boil down to the fact that one-hot encoding splits are often very unbalanced, while integer encoded splits may be less unbalanced. |
Thanks @glouppe. I read somewhere a paper that AFAIR suggested to sort the (non-ordered) categoricals in order of their frequency in the data and encode them as integers as such. Any idea what that paper might be?
glouppe commented on 7 May 2015
Yes, it is Breiman's book :) When your output is binary, this strategy is in fact optimal (it will find the best subset among the values of the categorical variables) and linear. See section 3.6.3.2 of my thesis if you dont have the CART book. |
One-hot encoding could be helpful when the number of categories are small( in level of 10 to 100). In such case one-hot encoding can discover interesting interactions like (gender=male) AND (job = teacher).
While ordering them makes it harder to be discovered(need two split on job). However, indeed there is not a unified way handling categorical features in trees, and usually what tree was really good at was ordered continuous features anyway..
integer encoding vs 1-hot (py)的更多相关文章
- [已解决]关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no encoding declared。
想在python代码中输出汉字.但是老是出现SyntaxError: Non-ASCII character '\xe4' in file test.py on line , but no encod ...
- 关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no encoding declared。
[已解决]关于python无法显示中文的问题:SyntaxError: Non-ASCII character '\xe4' in file test.py on line 3, but no enc ...
- requests之headers 'Content-Type': 'text/html'误判encoding为'ISO-8859-1'导致中文text解码错误
0. requests不设置UA 访问baidu 得到 r.headers['Content-Type'] 是text/html 使用chrome UA: Content-Type:text/htm ...
- leetCode练题——13. Roman to Integer
1.题目13. Roman to Integer Roman numerals are represented by seven different symbols: I, V, X, L, C, D ...
- 【论文考古】量化SGD QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-Efficient SGD ...
- Python函数信息
Python函数func的信息可以通过func.func_*和func.func_code来获取 一.先看看它们的应用吧: 1.获取原函数名称: 1 >>> def yes():pa ...
- Scrapy学习-23-分布式爬虫
scrapy-redis分布式爬虫 分布式需要解决的问题 request队列集中管理 去重集中管理 存储管理 使用scrapy-redis实现分布式爬虫 github开源项目: https://g ...
- Flask入门系列(转载)
一.入门系列: Flask入门系列(一)–Hello World 项目开发中,经常要写一些小系统来辅助,比如监控系统,配置系统等等.用传统的Java写,太笨重了,连PHP都嫌麻烦.一直在寻找一个轻量级 ...
- 使用Java将搜狗词库文件(文件后缀为.scel)转为.txt文件
要做一个根据词库进行筛选主要词汇的功能,去搜狗下载专业词汇词库时,发现是.scel文件,且通过转换工具(http://tools.bugscaner.com/sceltotxt/)转换为txt时报错如 ...
随机推荐
- bzoj 1220 跳蚤
Written with StackEdit. Description \(Z\)城市居住着很多只跳蚤.在\(Z\)城市周六生活频道有一个娱乐节目.一只跳蚤将被请上一个高空钢丝的正中央.钢丝很长,可以 ...
- ACM学习历程—计蒜客15 单独的数字(位运算)
http://nanti.jisuanke.com/t/15 题目要求是求出只出现一次的数字,其余数字均出现三次. 之前有过一个题是其余数字出现两次,那么就是全部亦或起来就得到答案. 这题有些不太一样 ...
- vue 相邻自定义组件渲染错误正确的打开方式
话不多说看问题: 当封装自定义组件时例如(自定义下拉列表)两个相同的组件在多次v-if变化时偶尔会发生渲染错误,明明赋值正确但是组建中的ajax方法可能返回的数据乱掉,或者其他神逻辑错误. 经过查询发 ...
- Aix之 xmanager 2.0连接AIX服务器
xmanager连接AIX服务器可以分为两种情况:1.连接IBM服务器,使用远程桌面功能进行系统维护.要求这台服务器已经安装了图形桌面,如CDE等,并启动到图形界面.在xmanager中的Xbrows ...
- 自动工作负载库理论与操作(Automatic Workload Repository,AWR)
AWR的由来: 10g之前的oracle:用户的连接将产生会话,当前会话记录保存在v$session中:处于等待状态的会话会被复制一份放在v$session_wait中.当该连接 断开后,其原来 ...
- FIREDAC的心得
FIREDAC与UNIDAC有些不同 但大体上是相同的 以下是一些随手笔记: FieldCount是当前FDQuery2所在行里面有多少列 一般用FieldList[X]来代表第几列 str:=FDQ ...
- 2、通过HBase API进行开发
一.将HBase的jar包及hbase-site.xml添加到IDE 1.到安装HBase集群的任意一台机器上找到HBase的安装目录,到lib目录下下载HBase需要的jar包,然后再到conf目录 ...
- C# messagebox 居中父窗体
在右边项目上点击右键->Add->class,添加MessageBoxEx类: using System; using System.Windows.Forms; using System ...
- java中求利息的代码
总结:函数的重要性,懂得用哪一种函数 package com.badu; import java.util.Scanner; //输入存款金额 money.存期 year 和年利率 rate, //根 ...
- 校赛热身赛 Problem D. Unsolved Mystery
Problem D. Unsolved MysteryThe world famous mathematician Wang Hao published a paper recently in the ...