[Original post: http://engineering.richrelevance.com/recommendations-thompson-sampling/]

[This repost: http://www.cnblogs.com/breezedeus/p/3775339.html; please credit the source when reposting]

Recommendations with Thompson Sampling

06/05/2014 • Topics: Bayesian, Big data, Data Science

by Sergey Feldman

This is the second in a series of three blog posts on bandits for recommendation systems.

If you read the last blog post, you should now have a good idea of the challenges in building a good algorithm for dishing out recommendations in the bandit setting.  The most important challenge is to balance exploitation with exploration.  That is, we have two somewhat conflicting goals: (a) quickly find the best arm to pull and (b) pull the best arm as often as possible.  What I dubbed the naive algorithm in the preceding blog post fulfilled these two goals in a direct way: explore for a while, and then exploit forever.  It was an OK approach, but we found that more sophisticated approaches, like the UCB family of bandit algorithms, had significantly better performance and no parameters to tune.

In this post, we'll introduce another technique: Thompson sampling (also known as probability matching).  This has been well covered elsewhere, but mostly for the binary reward case (zeros and ones).  I'll also go over Thompson sampling in the log-normal reward case, and offer some approximations that can work for any reward distribution.

Before I define Thompson sampling, let's build up some intuition.  If we had an infinite number of pulls, we would know exactly what the expected rewards are, and there would be no reason to ever explore.  With a finite number of pulls, we have to explore, because we are uncertain which arm is best.  The right machinery to quantify uncertainty is the probability distribution.  The UCB algorithms implicitly use a probability distribution, but only one number from it: the upper confidence bound.  In Bayesian thinking, we want to use the entire probability distribution.  In the preceding post I defined \(p_a(r)\), the probability distribution from which rewards are drawn.  That's what controls the bandit.  It would be great if we could estimate this entire distribution, but we don't need to.  Why?  Because all we care about is its mean \( \mu_a \).  What we will do is encode our uncertainty about \( \mu_a \) in the probability distribution \( p(\mu_a | \text{data}_a ) \), and then use that probability distribution to decide when to explore and when to exploit.

You may be confused by now because there are two related probability distributions floating around, so let's review:

  • \(p_a(r)\) - the probability distribution that bandit a uses to generate rewards when it is pulled.
  • \( p(\mu_a | \text{data}_a ) \) - the probability distribution of where we think the mean of \(p_a(r)\) is, after observing some data.

    With Thompson sampling you keep around a probability distribution  \( p(\mu_a | \text{data}_a ) \)  that encodes your belief about where the expected reward \( \mu_a \)  is for arm a .  For the simple coin-flip case, we can use the convenient Beta-Bernoulli model, and the distribution at round t after seeing \(S_{a,t}\) successes and \(F_{a,t}\) failures for arm a is simply:

    \( p(\mu_a|\text{data}_a) = \text{Beta}(S_{a,t} + 1,F_{a,t} + 1) \),

    where the added 1's are convenient priors.

    So now we have a distribution that encodes our uncertainty of where the true expected reward \( \mu_a \)  is.  What's the actual algorithm?  Here it is, in all its simple glory:

    1. Draw a random sample from \( p(\mu_a | \text{data}_a ) \)  for each arm a .
    2. Pull the arm which has the largest drawn sample.
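The two steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration of Beta-Bernoulli Thompson sampling; the success/failure counts are invented for the example:

```python
import random

def thompson_select(successes, failures):
    # Step 1: draw one sample from each arm's Beta posterior;
    # the +1's are the convenient uniform prior from the formula above.
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    # Step 2: pull the arm with the largest drawn sample.
    return max(range(len(samples)), key=lambda a: samples[a])

# Hypothetical counts: arm 1 has by far the best observed success rate.
successes = [2, 50, 4]
failures = [10, 10, 10]
counts = [0, 0, 0]
for _ in range(1000):
    counts[thompson_select(successes, failures)] += 1
# counts[1] should win the vast majority of the 1000 rounds.
```

Note how exploration falls out for free: an arm with a wide posterior occasionally produces the largest draw, so it still gets pulled from time to time.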

    That's it!  It turns out that this approach is a very natural way to balance exploration and exploitation.  Here is the same simulation from last time, comparing the algorithms from the preceding blog post to Thompson Sampling:


    Normal Approximation

    The Bernoulli case is well known and well understood.  But what happens when you want to maximize, say, revenue instead of click-through rate?  To find out, I coded up a log-normal bandit where each arm pays out strictly positive rewards drawn from a log-normal distribution (the code is messy, so I won't be posting it).  For Thompson sampling, I used a full posterior with priors over the log-normal parameters \( \mu \) and \( \sigma \) as described here (note that these are not the mean and standard deviation of the log-normal), and for UCB I used the modified Cox method of computing confidence bounds for the log-normal distribution from here.  The normal approximation to exact Thompson sampling is (using Central Limit Theorem arguments):

    \( p(\mu_a|\text{data}_a) = \mathcal{N}\left(\hat{\mu}_{a,t},\frac{\hat{\sigma}^2_{a,t}}{N_{a,t}} \right) \),

    where \(\hat{\mu}_{a,t}\) and \(\hat{\sigma}^2_{a,t}\) are the sample mean and sample variance, respectively, of the rewards observed from arm a at round t, and \(N_{a,t}\) is the number of times arm a has been pulled at round t.
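As a concrete sketch, here is normal-approximation Thompson sampling in plain Python. The reward histories are invented for illustration, and each arm is assumed to have at least two observations; for the log-normal bandit you would feed in the logs of the observed rewards, as described above:

```python
import math
import random

def normal_thompson_select(rewards_by_arm):
    # Draw one sample per arm from N(sample mean, sample variance / N),
    # then pick the arm with the largest draw.
    best_arm, best_draw = 0, float("-inf")
    for arm, rewards in enumerate(rewards_by_arm):
        n = len(rewards)  # assumed >= 2 for the sample variance
        mean = sum(rewards) / n
        var = sum((r - mean) ** 2 for r in rewards) / (n - 1)
        draw = random.gauss(mean, math.sqrt(var / n))
        if draw > best_draw:
            best_arm, best_draw = arm, draw
    return best_arm

# Invented reward histories: arm 1 pays roughly twice as much as arm 0.
arms = [[1.0, 1.2, 0.9, 1.1, 1.0],
        [2.0, 2.1, 1.9, 2.2, 2.0]]
wins = sum(normal_thompson_select(arms) for _ in range(500))
```

The appeal of this approximation is that each arm only needs three sufficient statistics (count, mean, variance) rather than a full conjugate posterior.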

    The UCB normal approximation is identical to that in the previous simulation.  For all algorithms, I used the log of the observed rewards to compute sufficient statistics.  The results:

    Observations:

    1. Epsilon-Greedy is still the worst performer.
    2. The normal approximations are slightly worse than their hand-designed counterparts - green is worse than orange and gray is worse than purple.
    3. UCB is doing better than Thompson sampling over this horizon, but Thompson sampling is maybe poised to do better in the long run.

    The Trouble with UCB

    In the above experiments UCB beat out Thompson sampling.  UCB sounds like a great algorithm and performs well in simulations, but it has a key weakness once you actually get down to productionizing your bandit algorithm.  Let's say that you aren't Google and you have limited computational resources, which means that you can only update your observed data in batch every 2 hours.  In this delayed-batch setting, UCB will pull the same arm every time for those 2 hours, because it is deterministic in the absence of updates.  Thompson sampling, by contrast, relies on random samples, which will be different every round even if the posteriors aren't updated for a while; UCB needs its distributions to be updated every single round to work properly.
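A tiny illustration of that difference, with made-up posteriors and a generic deterministic UCB-style score (not the exact modified Cox bound from the experiment): if nothing is updated for a whole batch window, a deterministic rule picks the same arm every single round, while Thompson sampling still spreads its pulls across plausible arms.

```python
import random

# Frozen (successes, failures) per arm -- no updates for the whole window.
posteriors = [(12, 5), (10, 4), (30, 20)]

def deterministic_score(s, f):
    # A generic UCB-style score: empirical mean plus an exploration bonus.
    n = s + f
    return s / n + (2.0 / n) ** 0.5

# Same frozen inputs -> the deterministic rule picks one arm, always.
ucb_choices = {max(range(3), key=lambda a: deterministic_score(*posteriors[a]))
               for _ in range(100)}

# Thompson sampling keeps exploring: a fresh random draw each round.
ts_choices = {max(range(3), key=lambda a: random.betavariate(
                  posteriors[a][0] + 1, posteriors[a][1] + 1))
              for _ in range(100)}
# ucb_choices contains a single arm; ts_choices usually contains several.
```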

    Given that the simulated performance differences between Thompson sampling and UCB are small, I heartily recommend Thompson sampling over UCB; it will work in a larger variety of practical cases.

    Avoiding Trouble with Thompson Sampling

    RichRelevance sees gobs of data.  This is usually great, but for Thompson sampling it hides a subtle pitfall.  To understand this point, first note that the variance of a Beta distribution with parameters \( \alpha \) and \( \beta \) is:

    \( \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \).

    For our recommender system, \( \alpha \) is the number of successes (clicks) and \( \beta \) is the number of failures (non-clicks).  As the amount of total data \( \alpha+\beta \) goes up, the variance shrinks, and quickly.  After a while, the posteriors will be so narrow that exploration will effectively cease.  This may sound good - after all, didn't we learn what the best lever to pull is?  In practice, we're dealing with a moving target, so it is a good idea to put an upper bound on \( \alpha+\beta \) so that exploration can continue indefinitely.  For details, see Section 7.2.3 here.
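One simple way to impose such an upper bound (the cap value here is arbitrary, and the referenced section may do this differently) is to rescale the Beta counts whenever their total exceeds the cap. This preserves the posterior mean while keeping the variance bounded away from zero:

```python
def cap_beta_counts(successes, failures, cap=1000.0):
    # If total evidence exceeds the cap, shrink both counts by the same
    # factor: the posterior mean s/(s+f) is unchanged, but the Beta
    # variance stays large enough for continued exploration.
    total = successes + failures
    if total <= cap:
        return successes, failures
    scale = cap / total
    return successes * scale, failures * scale

# A heavily observed arm keeps its click-through estimate (0.9), but
# its effective sample size is clipped to the cap.
s, f = cap_beta_counts(9000.0, 1000.0, cap=1000.0)
```

Applying this after each batch update turns the posterior into a sliding window of sorts: new evidence always carries a minimum share of the total weight.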

    Analogously, if you're optimizing for revenue instead of click-through rate and using a Normal approximation, you can compute sample means and sample variances in an incremental fashion, using decay as per the last page here.  This will ensure that older samples have less influence than newer ones and allow you to track changing means and variances.
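A rough sketch of such a decayed incremental update follows; this is one common exponential-decay scheme, not necessarily the exact one on the linked slides:

```python
def decayed_update(mean, var, weight, reward, gamma=0.99):
    # Exponentially decayed running mean and (population) variance.
    # gamma < 1 downweights old rewards so the estimates track drift;
    # gamma = 1.0 recovers the ordinary running mean and variance.
    weight = gamma * weight + 1.0
    lr = 1.0 / weight
    delta = reward - mean
    mean = mean + lr * delta
    var = (1.0 - lr) * (var + lr * delta * delta)
    return mean, var, weight

# Start from an empty stream and feed rewards one at a time.
mean, var, weight = 0.0, 0.0, 0.0
for r in [1.0, 2.0, 3.0]:
    mean, var, weight = decayed_update(mean, var, weight, r, gamma=1.0)
# With gamma=1.0 this yields mean 2.0 and population variance 2/3.
```

These decayed statistics plug straight into the normal approximation above, so the posterior never collapses even as the raw observation count grows without bound.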

    Coming up next: contextual bandits!

    About Sergey Feldman:

    Sergey Feldman is a data scientist & machine learning cowboy with the RichRelevance Analytics team. He was born in Ukraine, moved with his family to Skokie, Illinois at age 10, and now lives in Seattle. In 2012 he obtained his machine learning PhD from the University of Washington. Sergey loves random forests and thinks the Fourier transform is pure magic.
