How Much Did It Rain? Winner's Interview: 1st place, Devin Anzelmo

An early insight into the importance of splitting the data on the number of radar scans in each row helped Devin Anzelmo take first place in the How Much Did It Rain? competition. In this blog, he gets into details on his approach and shares key visualizations (with code!) from his analysis.

351 players on 321 teams built models to predict probabilistic distributions of hourly rainfall

The Basics

What was your background prior to entering this challenge?

My background is primarily in cognitive science and biology, but I dabbled in many different areas while in school. My particular interests are in human learning and behavior and how we can use human activity traces to learn to shape future actions.

Devin's profile on Kaggle

My interest in gaming as a means of teaching, together with my competitive nature, has made Kaggle a great fit for my learning style. I started competing seriously on Kaggle in October 2014. I did not have much experience with programming or applied machine learning, and thought entering a competition would provide a structured introduction. Once I started competing, I found I had a difficult time stopping.

What made you decide to enter this competition?

I thought there was a decent chance I could get into the top five, and this drove me to enter. After finishing the BCI Challenge I had to decide between the Otto Group Product Classification Challenge and this one. I chose How Much Did It Rain? because the dataset was difficult to process and it wasn't obvious how to approach the problem; these factors favored my skills. I didn't feel I could compete in Otto, where the determining factor was primarily going to be ensembling skill.

Let's Get Technical

What preprocessing and supervised learning methods did you use?

Most of the preprocessing was just feature generation. Like most other competitors I used descriptive statistics and counts of the different error codes. These made up the bulk of my features and turned out to be enough to get first place. We were given QC'd reflectivity data, but instead of using this information to limit the data used in feature generation, I included it as a feature and let the learning algorithm (Gradient Boosted Decision Trees) use it as needed.
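As a rough illustration, here is a minimal sketch of this style of feature generation. The raw data packs one value per radar scan into each space-separated cell. The column names and the extra error codes are assumptions for illustration; only -99903 is mentioned explicitly in this interview.

```python
import numpy as np
import pandas as pd

# Error codes embedded in the raw radar fields; -99903 (beam blockage) is the
# one discussed in this interview, the others are assumed for illustration.
ERROR_CODES = (-99900.0, -99901.0, -99903.0)

def parse_cell(cell):
    """Each cell packs one value per radar scan, space-separated."""
    return np.array([float(v) for v in str(cell).split()])

def make_features(row, columns=("Reflectivity", "HybridScan", "DistanceToRadar")):
    """Descriptive statistics plus error-code counts for a single row."""
    feats = {}
    for col in columns:
        values = parse_cell(row[col])
        valid = values[~np.isin(values, ERROR_CODES)]
        feats[col + "_n_scans"] = len(values)
        if len(valid) > 0:
            feats[col + "_mean"] = valid.mean()
            feats[col + "_std"] = valid.std()
            feats[col + "_min"] = valid.min()
            feats[col + "_max"] = valid.max()
        for code in ERROR_CODES:
            feats["%s_count_%d" % (col, int(code))] = int((values == code).sum())
    return pd.Series(feats)

# train = pd.read_csv("train_2013.csv")
# features = train.apply(make_features, axis=1)
```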

The most important decision with regard to supervised learning was how to model the output probability distribution. I decided to transform the problem into a multi-class classification problem with soft output. Since there was not enough data to perform classification with the full 70 classes, the problem had to be reduced further. It turned out there were many different ways that people solved this problem, and I highly recommend reading the end-of-competition forum thread for some other approaches.

See the code on scripts

I ended up using a simple method in which basic component probability distributions were combined using the output of a classification algorithm. For classes that had enough data, a step function was used as the CDF. Where there was less data, several labels were combined and replaced by a single class; in that case an estimate of the empirical distribution for the pooled class was used as its component CDF. This method worked well and I used it for most of the competition. I did try regression and classification on just the data from the minority classes, but it never performed quite as well as simply using the empirical distribution.
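A minimal sketch of the mixing step, under assumed class boundaries (the class definitions and the pooled labels below are illustrative, not the winning setup): each class contributes a component CDF, and the classifier's soft output weights them.

```python
import numpy as np

THRESHOLDS = np.arange(70)  # the competition asked for P(rain <= n mm), n = 0..69

def step_cdf(mm):
    """Step-function CDF: all probability mass at a single rain amount."""
    return (THRESHOLDS >= mm).astype(float)

def empirical_cdf(labels):
    """Empirical CDF of the training labels pooled into one minority class."""
    labels = np.asarray(labels)
    return np.array([(labels <= t).mean() for t in THRESHOLDS])

# Illustrative setup: classes 0..4 are exact amounts with plenty of data,
# class 5 pools everything larger.  component_cdfs has shape (n_classes, 70).
component_cdfs = np.vstack([step_cdf(mm) for mm in range(5)]
                           + [empirical_cdf([6, 8, 11, 15, 23, 40])])

def predict_cdf(class_probs):
    """Mix the component CDFs with the classifier's soft output.

    class_probs: shape (n_samples, n_classes), e.g. from predict_proba().
    Returns one 70-point predicted CDF per sample; a probability-weighted
    mixture of valid CDFs is itself a valid CDF.
    """
    return class_probs @ component_cdfs

# probs = model.predict_proba(X_test)   # any soft-output classifier
# submission_cdfs = predict_cdf(probs)
```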

What was your most important insight into the data?

Early in the competition I discovered that it was helpful to split the data based on the number of radar scans in each row. Each row has data spanning the hour previous to the rain gauge reading. In some cases there was only one radar scan; in others there were more than 50. There are over one hundred thousand rows in the training set with more than 17 radar scans. For this data I wanted to create features that take into account the changing of weather conditions over time, and in doing so I realized it was not possible to make these features for the rows that had only 1 or 2 radar scans. This was the initial reason for splitting the dataset. When I started looking for places to split it, I found there was also a strong positive correlation between the number of radar scans and the average rain amount: 95% of the rows with a single scan had 0mm of rain, while in the subset with 17 or more scans only 48% had 0mm. Interestingly, for the data with few radar scans many of the most important features were the counts of the error codes.

See the code on scripts
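A hedged sketch of that split, assuming TimeToEnd is used to count scans (any of the space-separated radar fields would do) and Expected is the label column; the bucket edges are illustrative:

```python
import pandas as pd

def n_scans(cell):
    """Number of radar scans in the hour = number of space-separated values."""
    return len(str(cell).split())

# train = pd.read_csv("train_2013.csv")
# train["n_scans"] = train["TimeToEnd"].map(n_scans)

def zero_rain_fraction(train):
    """Fraction of rows with 0mm of rain, grouped by scan-count bucket.

    Bucket edges are illustrative; the write-up splits around 1-2 scans,
    ~7 scans, and 17+ scans.
    """
    buckets = pd.cut(train["n_scans"], bins=[0, 1, 2, 7, 16, 1000],
                     labels=["1", "2", "3-7", "8-16", "17+"])
    return train.groupby(buckets)["Expected"].apply(lambda s: (s == 0).mean())
```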

In contrast, the most important features in the data with many scans were derived from Reflectivity and HybridScan, which have a physical relationship to rain amount. Splitting the data allowed me to use many more features for the higher-scan data, which gave a large boost to the score. Over 65% of the error came from the data with more than 7 scans, while the low-scan data contributed a very small amount to the final score, so I was able to spend less time modeling those subsets.

Were you surprised by any of your findings?

The most mysterious aspect of the competition was the 5000 rows in the training data that had an Expected rain amount over 70mm. The requirements of the competition only asked us to model up to 69mm of rain in an hour, but the evaluation metric punished large classification errors so severely that I felt compelled to figure out how to predict these large values. A quick calculation showed that, of the 1.1 million rows in the training set, these 5000 large values would account for about half of my error if mis-predicted.
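To make that back-of-envelope concrete: the competition scored submissions with the Continuous Ranked Probability Score over the 70 thresholds, so a confidently wrong prediction maxes out at a per-row error of 1.0. A small sketch (the "typical" per-row error used for comparison is illustrative):

```python
import numpy as np

def crps(pred_cdfs, actuals, thresholds=np.arange(70)):
    """Continuous Ranked Probability Score as used in the competition:
    mean squared gap between the predicted CDF and the true step function."""
    heaviside = (thresholds[None, :] >= np.asarray(actuals)[:, None]).astype(float)
    return np.mean((pred_cdfs - heaviside) ** 2)

# A confident "no rain" prediction against a >69mm label scores the
# worst possible per-row error of 1.0 ...
all_zero = np.ones((1, 70))        # CDF of "certainly 0mm"
print(crps(all_zero, [70.0]))      # -> 1.0

# ... so 5000 such rows contribute roughly as much total error as a
# million ordinary rows scored at a typical ~0.005 each.
```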

It turned out that many of the samples with labels above 70mm did not have reflectivity values indicating heavy rain. I was still able to improve my local validation score by treating the large rain amount samples as their own class and using an all-zero CDF when generating the final prediction. Unfortunately, this also worsened my public leaderboard score by a large amount.

See the code on scripts

Through leaderboard feedback I was able to determine that there were differences in the distribution of these large values in the 2013 training set and the 2014 test set. Removing the rows with large values from the training set turned out to be the best course of action.

My hypothesis about the large values is that they were generated by specific rain gauges, which the learning algorithm was able to detect using features based on DistanceToRadar and the -99903 error code. The -99903 error code can correspond to physical blockage of the radar beam by mountains or other physical objects. Both of these features can help identify specific rain gauges, which would lead to overfitting the training set if the malfunction was fixed before the start of 2014. As I don't have access to the 2014 labels, this will remain speculation for now.
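One hedged way to probe that hypothesis with the engineered features (column names carried over from the earlier sketch, so purely illustrative) is to check whether a handful of (distance, blockage-count) cells account for most of the >69mm labels:

```python
import pandas as pd

def suspicious_gauge_cells(features, labels):
    """Exploratory check of the gauge hypothesis: do particular combinations
    of DistanceToRadar and -99903 counts account for most >69mm labels?"""
    df = pd.DataFrame({
        "dist": features["DistanceToRadar_mean"].round(),
        "blocked": features["Reflectivity_count_-99903"],
        "huge": labels > 69,
    })
    # Per cell: share of huge labels, their absolute count, and cell size.
    cells = df.groupby(["dist", "blocked"])["huge"].agg(["mean", "sum", "count"])
    return cells.sort_values("sum", ascending=False).head(20)
```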

Which tools did you use?

I used Python for this competition, relying heavily on pandas for data exploration and direct NumPy implementations when I needed things to be fast. This was my first competition using XGBoost, and I was very pleased with its ease of use and speed.
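For readers new to XGBoost, a minimal multi-class setup with soft probability output looks like the following; the hyperparameters and class count are placeholders, not the winning settings:

```python
import xgboost as xgb

params = {
    "objective": "multi:softprob",  # emit a probability per class
    "num_class": 6,                 # e.g. 0..4mm exactly, plus one pooled class
    "eta": 0.1,
    "max_depth": 8,
    "eval_metric": "mlogloss",
}

# dtrain = xgb.DMatrix(X_train, label=y_train)
# dvalid = xgb.DMatrix(X_valid, label=y_valid)
# model = xgb.train(params, dtrain, num_boost_round=500,
#                   evals=[(dvalid, "valid")], early_stopping_rounds=25)
# class_probs = model.predict(xgb.DMatrix(X_test))  # shape (n, num_class)
```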

How did you spend your time on this competition?

I probably spent 50% of my time coding, and then having to refactor when I realized my implementation was not flexible enough to incorporate my new ideas. I also tried several crazy things that required substantial programming time that I didn't end up using.

The other 50% was split pretty equally between feature engineering, data exploration, and tweaking my classification framework.

Words of Wisdom

What have you taken away from this competition?

I spent many hours coding and refactoring in this competition. Since I had to do nearly the same thing on five different datasets, having to code everything manually made it difficult to try new ideas. Having a flexible framework for trying out many ideas is critical, and this is one of the things I spent time learning how to do in this competition. The effort has already paid off in other competitions.

With only one submission a day it was important to try things out in a systematic way. What worked best was changing one aspect of my method and seeing whether it improved my score. I needed to keep records of everything I did, or it was possible to waste time redoing things I had already tried. Having the discipline to stay on track and not try too many things at once is critical for doing well, and this competition put me to the test on this.

Do you have any advice for those just getting started competing on Kaggle?

Read the Kaggle blog post profiling KazAnova for a great high-level perspective on competing. I read it about two weeks before the end of the competition, and I started saving my models and predictions and automating more of my process, which allowed for some late improvements.

Other than this, I think it's very helpful to read the forums and follow up on hints given by those at the top of the leaderboard. Very often people will give small hints, and I have gotten into the habit of following up on even the smallest clues. This has taught me many new things and helped me find critical insights into problems.

Just for Fun

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

With the introduction of Kaggle scripts it seems it will now be possible to have solution code evaluated remotely instead of requiring competitors to submit a CSV submission file. I think having this functionality opens up the possibility of solving new types of problems that were not feasible in the past.

With this in mind, I would like to run a problem that favors reinforcement learning based solutions. As a simple example, we could teach an agent to explore mazes. The training set would consist of several different mazes (perhaps it would be better to have competitors generate their own training data), and the test set could be another set of unseen mazes hosted on Kaggle. All the training code would be required to run directly on Scripts, making the transition to an evaluation server easy. I don't think this type of problem would have worked without Scripts, and I think it would be fun to see whether agent learning problems can be turned into Kaggle competitions.

Another possibility with remote execution of solutions would be a Rock Paper Scissors programming tournament. There are already some RPS tournaments available online; perhaps hosting a variant as a knowledge competition would be possible, as these types of competitions are really fun.

What is your dream job?

Ideally I would like to work with neural and behavioural data to help improve human performance and alleviate problems related to mental illness. There are many very challenging problems in this area. Unfortunately most of the current classification frameworks for mental illness are deeply flawed. My dream job would allow for the application of diverse descriptions, methods, and sensors, without the need to push a product out immediately.

My sense is that the amount of theoretical upheaval needed is holding back research in academia, and the ineffectiveness of most current techniques is hampering the development of new businesses (plus the legal issues of the health industry). I would be interested in any project that is making progress through this mire.
