Liberty Mutual Property Inspection, Winner's Interview: Qingchen Wang
The hugely popular Liberty Mutual Group: Property Inspection Prediction competition wrapped up on August 28, 2015 with Qingchen Wang at the top of a crowded leaderboard. A total of 2,362 players on 2,236 teams competed to predict how many hazards a property inspector would count during a home inspection.
This blog outlines Qingchen's approach, and how a relative newbie to Kaggle competitions learned from the community and ultimately took first place.
The Basics
What was your background prior to entering this challenge?
I did my bachelor’s in computer science. After working for a few months at EA Sports as a software engineer I felt the strong need to learn statistics and machine learning as the problems that interested me the most were about predicting things algorithmically. Since then I’ve earned master’s degrees in machine learning and business and I’ve just started a PhD in marketing analytics.
Qingchen's profile on Kaggle
How did you get started competing on Kaggle?
I had an applied machine learning course during my master’s at UCL, and the course project was to compete on the Heritage Health Prize. Although at the time I didn’t really know what I was doing, it was still a very enjoyable experience. I’ve competed briefly in other competitions since, but this was the first time I’ve been able to take part in a competition from start to finish, and it turned out to be quite a rewarding experience.
What made you decide to enter this competition?
I was in a period of unemployment so I decided to work on data science competitions full-time until I found something else to do. I actually wanted to do the Caterpillar competition at first but decided to give this one a quick go since the data didn’t require any preprocessing to start. My early submissions were not very good so I became determined to improve and ended up spending the whole time doing this.
What made this competition so rewarding was how much I learned. As more or less a Kaggle newbie, I spent the whole two months trying and learning new things. I hadn’t known about methods like gradient boosted trees or tricks like stacking/blending and the variety of ways to handle categorical variables. At the same time, it was probably the intuition I developed through my previous education that set my model apart from some of the other competitors, so I was able to validate my existing knowledge as well.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
I have zero prior experience or domain knowledge for this competition. It’s interesting because during the middle of the competition I hit a wall, and a number of the top-10 ranked competitors had worked in the insurance industry, so I thought maybe they had some domain knowledge which gave them an advantage. It turned out not to be the case. As far as data science competitions go, I think this one was rather straightforward.
Histogram of all fields in the dataset with labels. Script by competition participant, Rajiv Shah
Let's Get Technical
What preprocessing and supervised learning methods did you use?
I used only XGBoost (I tried others, but none of them performed well enough to end up in my ensemble). The key to my result was that I also did a binary transformation of the hazard scores, which turned the regression problem into a set of classification problems. I noticed through the forum thread that some other people also tried this method, but it seems they didn’t go far enough with the binary transformation, as that was the best performing part of my ensemble.
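The interview doesn't spell out the exact transformation, but a minimal sketch of the general idea, with made-up hazard counts and predictions for illustration, might look like this: build one 0/1 target per threshold, train a classifier per target, and recover an expected count from the predicted probabilities.

```python
import numpy as np

# Hypothetical hazard counts for a handful of properties
# (illustrative values, not competition data).
hazards = np.array([1, 3, 2, 7, 1, 4])

# Binary transformation: for each threshold t, create a 0/1 target
# asking "does this property have more than t hazards?"
thresholds = range(1, hazards.max())
binary_targets = {t: (hazards > t).astype(int) for t in thresholds}

# In the real pipeline, one classifier (e.g. XGBoost) would be trained
# per threshold. Given predicted probabilities p_t = P(hazard > t) from
# each model, and hazard counts of at least 1, the expected count is
# E[hazard] = 1 + sum over t of P(hazard > t).
probs = np.array([0.9, 0.6, 0.3])   # made-up predictions for one property
expected_count = 1 + probs.sum()
```

How the per-threshold probabilities are recombined (expected value, weighted sum, or simple ranking) is itself a tunable choice; for the competition's Gini-based metric, only the ranking of predictions mattered.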
I also played with different encodings of categorical variables and interactions, nothing sophisticated, just the standard tricks that many others have used.
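For readers unfamiliar with those standard tricks, here is a small sketch of three common encodings on a hypothetical anonymized column (the values are made up, in the spirit of the competition's anonymous categorical variables):

```python
import numpy as np

# Illustrative categorical column (made-up values).
col = np.array(["B", "A", "C", "B", "B", "A"])

# Trick 1: ordinal encoding -- map each category to an integer code.
cats, ordinal = np.unique(col, return_inverse=True)

# Trick 2: one-hot encoding -- one 0/1 column per category.
one_hot = (col[:, None] == cats).astype(int)

# Trick 3: frequency encoding -- replace each value with its count.
counts = {c: int((col == c).sum()) for c in cats}
freq = np.array([counts[v] for v in col])
```

Tree-based models like XGBoost can often work directly with ordinal codes, while one-hot encoding avoids imposing an arbitrary order; which works better is an empirical question, which fits the interview's "treat it as a parameter to tune" approach.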
Were you surprised by any of your findings?
I’m surprised by how poor our prediction accuracies were. This seemed like a problem that was well suited for data science algorithms and it was both disappointing and exciting to see such high prediction errors. I guess that’s the difference between real life and the toy examples in courses.
Which tools did you use?
I only used XGBoost. It’s really been a learning experience for me, as I entered this competition having no idea what gradient boosted trees were. After throwing random forests at the problem and getting nowhere near the top of the leaderboard, I installed XGBoost and worked really hard on tuning its parameters.
XGBoost fans or those new to boosting, check out this great blog by Jeremy Kun on the math behind boosting and why it doesn't overfit
How did you spend your time on this competition?
Since the variables were anonymous there wasn’t much feature engineering to be done. Instead I treated feature engineering as just another parameter to tune and spent all of my time tuning parameters. My final solution was an ensemble of different specifications so there were a lot of parameters to tune.
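The tuning loop described above can be sketched as a simple grid search. The parameter names and values below are typical XGBoost knobs chosen for illustration, and the scoring function is a runnable placeholder; in practice the score would come from cross-validated training runs.

```python
import itertools

# Assumed grid of common XGBoost parameters (illustrative values).
param_grid = {
    "max_depth": [4, 6, 8],
    "eta": [0.01, 0.05],
    "subsample": [0.8, 1.0],
}

def evaluate(params):
    # Placeholder for a cross-validation score (higher is better).
    # A toy deterministic function so the loop runs end to end.
    return -abs(params["max_depth"] - 6) - params["eta"]

# Exhaustively score every combination and keep the best one.
best_score, best_params = float("-inf"), None
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = evaluate(params)
    if score > best_score:
        best_score, best_params = score, params
```

With an ensemble of many model specifications, each specification gets its own search like this, which is why tuning can consume most of the competition time.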
What was the run time for both training and prediction of your winning solution?
The combination of training and prediction for my winning solution takes about 2 hours on my personal laptop (2.2 GHz Intel i7 processor).
Words of Wisdom
What have you taken away from this competition?
One thing that I learned, which I’ve always overlooked before, is that parameter tuning really goes a long way in performance improvements. While in absolute terms it may not be much, in terms of leaderboard improvement it can be of great value. Of course, without the community and the public scripts I wouldn’t have won and may still not know about gradient boosted trees, so a big thanks to all of the people who shared their ideas and code. I learned so much from both sources, so it’s been a worthwhile experience.
Click through to an animated view of the community's leaderboard progression over time, and the influence of benchmark code sharing. Script by competition participant, inversion
Do you have any advice for those just getting started in data science?
For those who don’t already have an established field, I strongly endorse education. All of my data science experience and expertise came from courses taken during my bachelor’s and master’s degrees. I believe that without already having been so well educated in machine learning I wouldn’t have been able to adapt so quickly to the new methods used in practice and the tricks that people have talked about.
There are now a number of very good education programs in data science which I suggest that everyone who wants to start in data science to look into. For those who already have their own established fields and are doing data science on the side, I think their own approaches could be very useful when combined with the standard machine learning methods. It’s always important to think outside the box and it’s all the more rewarding when you bring in your own ideas and get them to work.
Finally, don’t be afraid to hit walls and grind through long periods of trying out ideas that don’t work. A failed idea gets you one step closer to a successful idea, and many failed ideas can often lead to a string of ideas that work down the road. Throughout this competition I tried every idea I thought of, and only a few worked. It was a combination of patience, curiosity, and optimism that got me through these two months. The same applies to learning the technical aspects of machine learning and data science. I still remember the pain that my classmates and I endured in the machine learning courses.
Just for Fun
If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?
I’m a sports junkie so I’d love to see some competitions on sports analytics. It’s a shame that I missed the one on March Madness predictions earlier this year. Maybe one day I’ll really run a competition on this stuff.
Editor's note: March Machine Learning Mania is an annual competition so you can catch it again in 2016!
What is your dream job?
My dream job is to lead a data science team, preferably in an industry that’s full of new and interesting prediction problems. I’d be just as happy as a data scientist, though it’s always nice to have greater responsibilities.
Bio
Qingchen Wang is a PhD student in marketing analytics at the Amsterdam Business School, VU Amsterdam, and ORTEC. His interests are in applications of machine learning methods to complex real-world problems in all domains. He has a bachelor’s degree in computer science and biology from the University of British Columbia, a master’s degree in machine learning from University College London, and a master’s degree in business administration from INSEAD. In his free time Qingchen competes in data science competitions and reads about sports.