How Much Did It Rain? Winner's Interview: 1st place, Devin Anzelmo
An early insight into the importance of splitting the data on the number of radar scans in each row helped Devin Anzelmo take first place in the How Much Did It Rain? competition. In this blog, he gets into details on his approach and shares key visualizations (with code!) from his analysis.
351 players on 321 teams built models to predict probabilistic distributions of hourly rainfall
The Basics
What was your background prior to entering this challenge?
My background is primarily in cognitive science and biology, but I dabbled in many different areas while in school. My particular interests are in human learning and behavior and how we can use human activity traces to learn to shape future actions.
Devin's profile on Kaggle
My interest in gaming as a means of teaching, together with my competitive nature, has made Kaggle a great fit for my learning style. I started competing seriously on Kaggle in October 2014. I did not have much experience with programming or applied machine learning, and thought entering a competition would provide a structured introduction. Once I started competing I found I had a difficult time stopping.
What made you decide to enter this competition?
I thought there was a decent chance I could get into the top five in the competition, and this drove me to enter. After finishing the BCI competition I had to decide between the Otto Group product classification challenge and this one. I chose How Much Did It Rain? because the dataset was difficult to process and it wasn't obvious how to approach the problem, factors that favored my skills. I didn't feel like I could compete in Otto, where the determining factor was primarily going to be ensembling skill.
Let's Get Technical
What preprocessing and supervised learning methods did you use?
Most of the preprocessing was just feature generation. Like most other competitors I used descriptive statistics and counts of the different error codes. These made up the bulk of my features and turned out to be enough to take first place. We were given QC'd reflectivity data, but instead of using this information to limit the data used in feature generation, I included it as a feature and let the learning algorithm (gradient boosted decision trees) use it as needed.
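A minimal sketch of that kind of feature generation, assuming each raw cell holds one space-separated value per radar scan; the column names, file name, and error-code list are placeholders rather than the exact ones used in the winning solution:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: parse the space-separated scan values in each cell, then
# compute per-row summary statistics and error-code counts as features.
ERROR_CODES = (-99900.0, -99901.0, -99903.0)  # assumed subset of the error codes

def parse_cell(cell):
    """Turn a cell like '24.5 -99903.0 31.0' into a float array (one value per scan)."""
    return np.array([float(v) for v in str(cell).split()])

def row_features(cell):
    values = parse_cell(cell)
    valid = values[~np.isin(values, ERROR_CODES)]
    feats = {
        "n_scans": len(values),
        "mean": valid.mean() if len(valid) else np.nan,
        "std": valid.std() if len(valid) else np.nan,
        "max": valid.max() if len(valid) else np.nan,
    }
    for code in ERROR_CODES:  # how often each error code appears in the hour
        feats["count_%d" % int(code)] = int((values == code).sum())
    return feats

train = pd.read_csv("train_2013.csv")  # assumed file name
reflectivity_feats = pd.DataFrame([row_features(c) for c in train["Reflectivity"]])
```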
The most important decision with regard to supervised learning was how to model the output probability distribution. I decided to transform the problem into a multi-class classification problem with soft output. Since there was not enough data to perform classification with the full 70 classes, the problem had to be reduced further. It turned out there were many different ways people solved this, and I highly recommend reading the end-of-competition forum thread for some other approaches.
See the code on scripts
I ended up using a simple method in which basic component probability distributions were combined using the output of a classification algorithm. For classes that had enough data, a step function was used as the CDF. When there was less data, several labels were combined and replaced by a single value; in that case an estimate of the empirical distribution for the combined class was used as the component CDF. This method worked well and I used it for most of the competition. I did try regression and classification on just the data from the minority classes, but it never performed quite as well as simply using the empirical distribution.
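A small sketch of the general mixture idea (not the exact competition code): the classifier produces soft class probabilities, each class gets a simple component CDF, and the submitted CDF is their probability-weighted combination. The class boundaries and tail labels below are made-up placeholders.

```python
import numpy as np

# The final CDF over 0..69mm is a probability-weighted mixture of component CDFs.
THRESHOLDS = np.arange(70)  # the metric asks for P(rain <= n), n = 0..69

def step_cdf(amount_mm):
    """Step-function CDF for a class that corresponds to a single rain amount."""
    return (THRESHOLDS >= amount_mm).astype(float)

def empirical_cdf(labels_mm):
    """Empirical CDF estimated from the training labels grouped into one class."""
    labels_mm = np.asarray(labels_mm)
    return np.array([(labels_mm <= t).mean() for t in THRESHOLDS])

# Hypothetical setup: classes 0-4 map to exact amounts, class 5 is a grouped
# tail class whose CDF is estimated from its (placeholder) training labels.
tail_labels = [5, 6, 8, 12, 20, 35, 60]
components = np.vstack(
    [step_cdf(a) for a in (0, 1, 2, 3, 4)] + [empirical_cdf(tail_labels)]
)  # shape (n_classes, 70)

def predict_cdf(class_probs):
    """class_probs: soft classifier output with shape (n_rows, n_classes)."""
    return class_probs @ components  # weighted mixture, shape (n_rows, 70)
```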
What was your most important insight into the data?
Early in the competition I discovered that it was helpful to split the data based on the number of radar scans in each row. Each row contains data spanning the hour prior to the rain gauge reading. In some cases there was only one radar scan; in others there were more than 50. There are over one hundred thousand rows in the training set with more than 17 radar scans. For this data I wanted to create features that capture how weather conditions change over time, and in doing so I realized it was not possible to build these features for rows with only 1 or 2 radar scans. This was the initial reason for splitting the dataset. When I started looking for places to split it, I found a strong positive correlation between the number of radar scans and the average rain amount: in rows with a single scan, 95% of the gauge readings were 0mm, while in the subset with 17 or more scans only 48% were 0mm. Interestingly, for the data with few radar scans many of the most important features were the counts of the error codes.
See the code on scripts
In contrast, the most important features in the data with many scans were derived from Reflectivity and HybridScan, which have a physical relationship to rain amount. Splitting the data allowed me to use many more features for the higher-scan data, which gave a large boost to the score. Over 65% of the error came from the data with more than 7 scans; the low-scan data contributed only a very small amount to the final score, so I was able to spend less time modeling those subsets.
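A hedged sketch of the split-by-scan-count idea: count the scans per row (here taken as the number of space-separated values in the TimeToEnd column), bin the rows, and compare how often the gauge read 0mm in each bin. The bin edges and column names are assumptions, not the exact splits used.

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train_2013.csv")  # assumed file name
train["n_scans"] = train["TimeToEnd"].astype(str).str.split().str.len()

# Bin rows by scan count and look at the fraction of 0mm gauge readings per bin.
bins = [0, 2, 7, 17, np.inf]  # placeholder bin edges
train["scan_bin"] = pd.cut(train["n_scans"], bins=bins)
zero_rate = train.groupby("scan_bin", observed=True)["Expected"].apply(
    lambda y: (y == 0).mean()
)
print(zero_rate)  # the 0mm fraction rises sharply as the scan count falls

# Each bin then gets its own feature set and its own model; rows with many scans
# can carry time-evolution features that 1-2 scan rows cannot support.
subsets = {name: df for name, df in train.groupby("scan_bin", observed=True)}
```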
Were you surprised by any of your findings?
The most mysterious aspect of the competition was the roughly 5000 rows in the training data that had an Expected rain amount over 70mm. The competition only asked us to model up to 69mm of rain in an hour, but the evaluation metric punished large errors so severely that I felt compelled to figure out how to predict these large values. A quick calculation showed that, out of the 1.1 million rows in the training set, these 5000 large values, if mispredicted, would account for half of my error.
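A back-of-the-envelope version of that calculation under the competition's CRPS metric, where a heavy-rain row predicted as "no rain" costs close to the maximum per-row penalty of 1; the assumed average error for the remaining rows is a rough placeholder, not a number from the actual solution.

```python
import numpy as np

# CRPS here: mean over rows and thresholds n=0..69 of (P(y <= n) - 1[n >= y_true])^2
def row_crps(pred_cdf, y_true, thresholds=np.arange(70)):
    heaviside = (thresholds >= y_true).astype(float)
    return np.mean((pred_cdf - heaviside) ** 2)

# A >=70mm row predicted as "no rain" (CDF jumps to 1 at 0mm) costs ~1 for that row.
no_rain_cdf = np.ones(70)
worst_case = row_crps(no_rain_cdf, y_true=75)    # == 1.0

n_heavy, n_total = 5_000, 1_100_000
typical_per_row = 0.005                          # assumed average for the other rows
heavy_error = n_heavy * worst_case
other_error = (n_total - n_heavy) * typical_per_row
print(heavy_error / (heavy_error + other_error))  # roughly one half of the total
```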
It turned out that many of the samples with labels above 70mm did not have reflectivity values indicating heavy rain. I was still able to improve my local validation score by treating the large-rain-amount samples as their own class and using an all-zero CDF when generating the final prediction. Unfortunately, this also worsened my public leaderboard score by a large amount.
See the code on scripts
Through leaderboard feedback I was able to determine that there were differences in the distribution of these large values in the 2013 training set and the 2014 test set. Removing the rows with large values from the training set turned out to be the best course of action.
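A small sketch of the two treatments discussed here, using a toy DataFrame in place of the parsed training data: keeping the heavy rows as their own class with an all-zero component CDF, versus simply dropping them, which is what ultimately worked best.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the parsed training data; column names are assumptions.
train = pd.DataFrame({"Expected": [0.0, 2.0, 75.0, 120.0], "class_label": [0, 1, 1, 1]})
HEAVY_CLASS_ID = 99  # placeholder id for a ">69mm" class
heavy_mask = train["Expected"] > 69

# Option 1: keep the heavy rows, but as their own class whose component CDF is
# all zeros, i.e. P(rain <= n) = 0 for every n = 0..69.
train["class_label"] = np.where(heavy_mask, HEAVY_CLASS_ID, train["class_label"])
all_zero_cdf = np.zeros(70)

# Option 2 (what worked best on the 2014 test set): drop the heavy rows from
# training entirely, since their distribution shifted between 2013 and 2014.
train_filtered = train[~heavy_mask].copy()
```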
My hypothesis about the large values is that they were generated by specific rain gauges, which the learning algorithm was able to detect using features based on DistanceToRadar and the -99903 error code. The -99903 error code can correspond to physical blockage of a radar beam by mountains or other objects. Both of these features can help identify specific rain gauges, which would lead to overfitting the training set if the malfunctions were fixed before the start of 2014. As I don't have access to the 2014 labels, this will remain speculation for now.
Which tools did you use?
I used Python for this competition, relying heavily on pandas for data exploration and direct NumPy implementations when I needed things to be fast. This was my first competition using XGBoost, and I was very pleased with its ease of use and speed.
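For reference, a minimal XGBoost setup with soft multi-class output (`multi:softprob`), which is the kind of classifier output the CDF mixture above relies on; the feature matrix, labels, and parameter values here are placeholders rather than tuned settings.

```python
import numpy as np
import xgboost as xgb

# Placeholder data: 1000 rows, 20 features, 6 rain-amount classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 6, size=1000)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "multi:softprob",  # soft class probabilities, not hard labels
    "num_class": 6,
    "eta": 0.1,
    "max_depth": 8,
    "eval_metric": "mlogloss",
}
booster = xgb.train(params, dtrain, num_boost_round=200)
class_probs = booster.predict(xgb.DMatrix(X))  # shape (n_rows, 6), feeds the CDF mixture
```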
How did you spend your time on this competition?
I probably spent 50% of my time coding, and then refactoring when I realized my implementation was not flexible enough to incorporate my new ideas. I also tried several crazy things that required substantial programming time but that I didn't end up using.
The other 50% was split pretty equally between feature engineering, data exploration, and tweaking my classification framework.
Words of Wisdom
What have you taken away from this competition?
I spent many hours coding and refactoring in this competition. Since I had to do nearly the same thing on five different datasets, having to manually code everything made it difficult to try new ideas. Having a flexible framework to try out many ideas is critical, and this is one of the things I spent time learning how to do in this competition. The effort has already paid off in other competitions.
With only one submission a day it was important to try things out in a systematic way. What worked best was changing one aspect of my method and seeing whether it improved my score. I needed to keep records of everything I did, or it was possible to waste time redoing things I had already tried. Having the discipline to stay on track and not try too many things at once is critical for doing well, and this competition put me to the test on this.
Do you have any advice for those just getting started competing on Kaggle?
Read the Kaggle blog post profiling KazAnova for a great high-level perspective on competing. I read it about two weeks before the end of the competition and started saving my models and predictions and automating more of my process, which allowed for some late improvements.
Other than that, I think it's very helpful to read the forums and follow up on hints given by those at the top of the leaderboard. Very often people will give small hints, and I have gotten into the habit of following up on even the smallest clues. This has taught me many new things and helped me find critical insights into problems.
Just for Fun
If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?
With the introduction of Kaggle scripts it seems it will now be possible to have solution code evaluated remotely instead of requiring competitors to submit a CSV submission file. I think having this functionality opens up the possibility of solving new types of problems that were not feasible in the past.
With this in mind I would like to run a problem that favors reinforcement-learning-based solutions. As a simple example, we could teach an agent to explore mazes. The training set would consist of several different mazes (perhaps it would be good to have competitors generate their own training data), and the test set could be another set of unseen mazes hosted on Kaggle. All the training code would be required to run directly on Scripts, making the transition to an evaluation server easy. I don't think this type of problem would have worked without Scripts, and I think it would be fun to see if it is possible to turn agent-learning problems into Kaggle competitions.
Another possibility with remote execution of solutions would be a Rock Paper Scissors programming tournament. There are already some RPS tournaments available online. Perhaps hosting a variant as a knowledge competition would be possible as these types of competitions are really fun.
What is your dream job?
Ideally I would like to work with neural and behavioural data to help improve human performance and alleviate problems related to mental illness. There are many very challenging problems in this area. Unfortunately most of the current classification frameworks for mental illness are deeply flawed. My dream job would allow for the application of diverse descriptions, methods, and sensors, without the need to push a product out immediately.
My sense is that the amount of theoretical upheaval needed is holding back research in academia, and the ineffectiveness of most current techniques is hampering the development of new businesses (plus the legal issues of the health industry). I would be interested in any project that is making progress through this mire.