ICDM Winner's Interview: 3rd place, Roberto Diaz
This summer, the ICDM 2015 conference sponsored a competition focused on connecting individual users across multiple digital devices. Top teams were invited to submit a paper for presentation at an ICDM workshop.
Roberto Diaz, competing as team "CookieMonster", took 3rd place. In this blog, he shares how he became a Kaggle addict, what he values in a competition, and most importantly, details on his approach to this unique dataset. Congrats to Roberto for achieving his goal of becoming a top 100 Kaggle user!
407 players on 340 teams competed in ICDM 2015: Drawbridge Cross-Device Connections
The Basics
What was your background prior to entering this challenge?
In addition to being a Kaggle addict, I am a researcher at Treelogic working in the machine learning area. In parallel, I am working on my PhD thesis at the University Carlos III de Madrid, which focuses on the parallelization of Kernel Methods.
Roberto's Kaggle profile
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
I didn't have any knowledge of this domain. The topic is quite new and I couldn't find any papers related to this problem, most probably because there are no public datasets.
How did you get started competing on Kaggle?
I started with the first Facebook competition a long time ago. A friend of mine was taking part in the challenge and he encouraged me to compete. That sparked my curiosity, so I went to the challenge's forum and read a post with a solution that scored quite well on the leaderboard. I thought, "I think I can do better than that". In the end I finished 9th on the leaderboard.
For my second challenge (the EMC Israel Data Science Challenge) I was on a team with my PhD mates. We finished 3rd and received a prize.
After that it was too late for me: I had become an addict.
What made you decide to enter this competition?
The things I value most in a challenge are:
- A conference associated with the challenge: It is a good opportunity to publish your results. For example, my solution in the Higgs Boson Machine Learning Challenge:
Díaz-Morales, R., & Navia-Vázquez, A. (2015, September). Optimization of AMS using Weighted AUC optimized models. In *JMLR: Workshop and Conference Proceedings*, Vol. 42, pp. 109-127.
- A domain unknown to me: It is the best way to learn about how to work with a different kind of data.
- The need to preprocess and extract the features from raw data to build the dataset: It gives you the chance to use your intuition and imagination.
This challenge looked very interesting to me because all the conditions were met.
Let's Get Technical
What preprocessing and supervised learning methods did you use?
In this challenge we had a list of devices and a list of cookies, and we had to determine which cookies belonged to the person using each device.
The most important part was the feature extraction procedure: the features had to capture the relation between devices and cookies (for example, the number of IP addresses visited by each one and by both of them).
Once I had the features I tried both simple and complex supervised machine learning algorithms (my winning methodology was a semi-supervised learning procedure using Gradient Boosting + Bagging), and the score only rose from 0.865 to 0.88.
What was your most important insight into the data?
A key part of the solution was the initial selection of candidates and the post processing:
- Initial selection: It was not possible to create a training set containing every combination of devices and cookies due to the high number of them. In order to reduce the initial complexity of the problem and to create an affordable dataset, some basic rules were created to obtain an initial reduced set of candidate cookies for every device. The rules are based on the IP addresses that both device and cookie have in common and how frequent they are in other devices and cookies.
- Supervised Learning: Every pattern in the training and test set represents a device/candidate cookie pair obtained by the previous step and contains information about the device (Operating System (OS), Country, ...), the cookie (Cookie Browser Version, Cookie Computer OS, ...) and the relation between them (number of IP addresses shared by both device and cookie, number of other cookies with the same handle as the cookie, ...).
- Post Processing: If the initial selection of candidates did not find a candidate with enough likelihood (logistic output of the classifier) we choose a new set of candidate cookies selecting every cookie that shares an IP address with the device and we score them using the classifier.
The initial selection of candidates reduces the complexity of the problem, and the post-processing step recovers most of the device/cookie pairs lost by that initial selection strategy.
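The interplay between the restricted candidate selection and the fallback can be sketched as follows. This is a simplified illustration under stated assumptions: the "crowded IP" rule (`max_cookies_per_ip`), the `min_score` threshold, and the toy `score` function standing in for the classifier are all hypothetical, and the actual selection rules were more elaborate.

```python
from collections import defaultdict

def invert(entity_ips):
    """Turn {entity: set_of_ips} into {ip: set_of_entities}."""
    by_ip = defaultdict(set)
    for entity, ips in entity_ips.items():
        for ip in ips:
            by_ip[ip].add(entity)
    return by_ip

def select_candidates(device, device_ips, cookies_by_ip, max_cookies_per_ip=None):
    """Cookies sharing an IP with the device; optionally skip crowded IPs."""
    candidates = set()
    for ip in device_ips.get(device, ()):
        cookies = cookies_by_ip.get(ip, set())
        if max_cookies_per_ip is not None and len(cookies) > max_cookies_per_ip:
            continue  # assumed rule: ignore IPs shared by too many cookies
        candidates |= cookies
    return candidates

def match_device(device, device_ips, cookies_by_ip, score,
                 max_cookies_per_ip=2, min_score=0.5):
    """Score the restricted candidates; widen the search if none is convincing."""
    candidates = select_candidates(device, device_ips, cookies_by_ip, max_cookies_per_ip)
    scored = {c: score(device, c) for c in candidates}
    if not scored or max(scored.values()) < min_score:
        # Post processing: consider every cookie that shares an IP with the device.
        candidates = select_candidates(device, device_ips, cookies_by_ip, None)
        scored = {c: score(device, c) for c in candidates}
    return max(scored, key=scored.get) if scored else None

device_ips = {"dev_1": {"home", "cafe"}}
cookie_ips = {"ck_1": {"cafe"}, "ck_2": {"cafe"}, "ck_3": {"cafe"}, "ck_4": {"home"}}
cookies_by_ip = invert(cookie_ips)

# Toy stand-in for the classifier's logistic output.
toy_scores = {"ck_1": 0.9, "ck_4": 0.2}
def score(device, cookie):
    return toy_scores.get(cookie, 0.1)

best = match_device("dev_1", device_ips, cookies_by_ip, score)
```

In this toy case the restricted selection only finds `ck_4`, whose score is too low, so the fallback scores all four cookies and recovers `ck_1`.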
Were you surprised by any of your findings?
Yes. When I sorted the scores obtained by the classifier for every candidate, I saw that if the first score is high and the second is very low, it is extremely likely that the first cookie belongs to the device. I used this information to create a semi-supervised learning procedure, updating some features in the training set and retraining the algorithm with this new information to improve the results.
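A minimal sketch of the "high first score, low second score" rule is shown below. The thresholds `high` and `low` are hypothetical values chosen for illustration, and how the resulting confident pairs feed back into the training features is not specified here.

```python
def confident_matches(scores_by_device, high=0.9, low=0.1):
    """Return {device: cookie} where the top score is high and the runner-up is low."""
    matches = {}
    for device, scores in scores_by_device.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if not ranked:
            continue
        best_cookie, best_score = ranked[0]
        # A device with a single candidate has no runner-up; treat it as score 0.
        second_score = ranked[1][1] if len(ranked) > 1 else 0.0
        if best_score >= high and second_score <= low:
            matches[device] = best_cookie
    return matches

scores_by_device = {
    "dev_1": {"ck_1": 0.97, "ck_2": 0.03},  # clear winner
    "dev_2": {"ck_3": 0.95, "ck_4": 0.60},  # runner-up too strong
    "dev_3": {"ck_5": 0.40},                # top score too weak
}
pseudo_labels = confident_matches(scores_by_device)
```

Only `dev_1` passes both conditions, so it is the only pair that would be treated as a confident match.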
This picture shows the F0.5 score and the percentage of devices that fulfill the condition when we match each device with its first candidate cookie, provided the second candidate scores less than a threshold:
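For reference, reading "F05" as the F-beta measure with beta = 0.5, the score for a single device can be computed as in the sketch below. The function name and the toy cookie identifiers are illustrative assumptions.

```python
def f_beta(predicted, actual, beta=0.5):
    """F-beta score between a predicted and an actual set of cookies."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# One wrong extra cookie (precision 0.5, recall 1.0).
extra_cookie = f_beta(["ck_1", "ck_2"], ["ck_1"])
# One missed cookie (precision 1.0, recall 0.5).
missed_cookie = f_beta(["ck_1"], ["ck_1", "ck_2"])
```

With beta = 0.5 precision weighs more than recall, so predicting a wrong extra cookie is penalized more than missing one, which is why a confident single match is attractive.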

Which tools did you use?
This solution was implemented in Python and uses the external software XGBoost.
The Python libraries used were:
How did you spend your time on this competition?
I spent about 20% of the time on feature engineering, 10% on the supervised learning part and 70% eagerly waiting for the results.
What was the run time for both training and prediction of your winning solution?
Too much. The training procedure takes around 9 hours using 12 cores.
The prediction procedure takes around 30 minutes, since it is necessary to extract some features from the relational database.
Words of Wisdom
What have you taken away from this competition?
I was trying to reach the top 100 of the global user ranking, and I finally made it.
Regarding the challenge:
- I have learned how useful it is to save intermediate results in order to not repeat the full training procedure only to change the last steps of the algorithm.
- I will have a paper describing my approach to the problem at the upcoming ICDM 2015 workshop dedicated to the challenge.
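The first takeaway above, saving intermediate results, can be illustrated with a small caching helper. This is a generic sketch using `pickle` from the standard library; the helper name `cached`, the file name, and the toy `expensive_step` function are assumptions for illustration, not part of the original solution.

```python
import os
import pickle
import tempfile

def cached(path, compute):
    """Load a pickled result from path, or compute it once and save it."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result

calls = []

def expensive_step():
    """Stand-in for a long-running stage such as training or scoring."""
    calls.append(1)
    return {"dev_1": {"ck_1": 0.97}}

cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "classifier_scores.pkl")

first = cached(cache_path, expensive_step)   # computes and writes the file
second = cached(cache_path, expensive_step)  # reads the file, no recomputation
```

With the scores cached on disk, changes to the last steps of the pipeline (for example the post processing) can be tried without repeating a multi-hour training run.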
Do you have any advice for those just getting started in data science?
"All hope abandon, ye who enter here".
No, seriously, at the beginning you may feel frustrated because it is a difficult area, but you are in the right place if:
- You love statistics more than other software engineers.
- You love software engineering more than other statisticians.
Bio
Roberto Diaz is a researcher in the R&D department of Treelogic, a Spanish SME focused on Machine Learning, Computer Vision and Big Data that takes part in many EU Research and Innovation programmes. In parallel he is working on his PhD thesis at the University Carlos III de Madrid, which focuses on the parallelization of Kernel Methods.