[Repost] Bandits for Recommendation Systems (Part I)
[Original post: http://engineering.richrelevance.com/bandits-recommendation-systems/]
[This repost: http://www.cnblogs.com/breezedeus/p/3775316.html. Please credit the source when reposting.]
Bandits for Recommendation Systems
06/02/2014 • Topics: Bayesian, Big data, Data Science
This is the first in a series of three blog posts on bandits for recommendation systems.
In this blog post, we will discuss the bandit problem and how it relates to online recommender systems. Then, we'll cover some classic algorithms and see how well they do in simulation.
A common problem for internet-based companies is: which piece of content should we display? Google has this problem (which ad to show), Facebook has this problem (which friend's post to show), and RichRelevance has this problem (which product recommendation to show). Many of the promising solutions come from the study of the multi-armed bandit problem. A one-armed "bandit" is another way to say slot machine (probably because both will leave you with empty pockets). Here is a description that I hijacked from Wikipedia:
"The multi-armed bandit problem is the problem a gambler faces at a row of slot machines when deciding which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls."
Let's rewrite this in retail language. Each time a shopper looks at a webpage, we have to show them one of K product recommendations. They either click on it or do not, and we log this (binary) reward. Next, we proceed to either the next shopper or the next page view of this shopper and have to choose one of K product recommendations again. (Actually, we have to choose multiple recommendations per page, and our 'reward' could instead be sales revenue, but let's ignore these aspects for now.)
Multi-armed bandits come in two flavors: stochastic and adversarial. The stochastic case is where each bandit doesn't change in response to your actions, while in the adversarial case the bandits learn from your actions and adjust their behavior to minimize your rewards. We care about the stochastic case, and our goal is to find the arm which has the largest expected reward. I will index the arms by \(a\), and the probability distribution over possible rewards \(r\) for each arm \(a\) can be written as \(p_a(r)\). We have to find the arm with the largest mean reward
\( \mu_a = E_a[r] \)
as quickly as possible while accumulating the most rewards along the way. One important point is that in practice the \(p_a(r)\) are non-stationary, that is, rewards change over time, and we have to take that into account when we design our algorithms.
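To make the setup concrete, here is a minimal sketch of a stationary Bernoulli bandit environment in Python (the language of the post's linked simulation code). The class, its name, and the example click-through rates are illustrative assumptions, not code from the original post.

```python
import random

class BernoulliBandit:
    """A stationary K-armed Bernoulli bandit: arm a pays 1 with probability probs[a], else 0."""

    def __init__(self, probs):
        self.probs = probs            # true mean reward of each arm (unknown to the learner)
        self.best_mean = max(probs)   # handy later for computing expected regret

    def pull(self, arm):
        """Return a binary reward for one pull of the given arm."""
        return 1 if random.random() < self.probs[arm] else 0

# Illustrative example: K = 5 arms with made-up click-through rates
bandit = BernoulliBandit([0.02, 0.03, 0.05, 0.04, 0.01])
```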
Approach #1: A Naive Algorithm
We need to figure out the mean reward (expected value) of each arm. So, let's just try each arm 100 times, take the sample mean of the rewards we get back, and then pull the arm with the best sample mean forever more. Problem solved?
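In code, that plan looks something like the following minimal sketch (it reuses the hypothetical `BernoulliBandit` from above, and the 100-samples-per-arm budget is exactly the arbitrary choice discussed next):

```python
def naive(bandit, n_arms, n_rounds, samples_per_arm=100):
    """Explore each arm a fixed number of times, then commit to the best sample mean forever."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    chosen, rewards = [], []
    for t in range(n_rounds):
        if t < n_arms * samples_per_arm:
            arm = t % n_arms                              # still exploring: cycle through the arms
        else:
            means = [sums[a] / counts[a] for a in range(n_arms)]
            arm = means.index(max(means))                 # committed: best sample mean wins forever
        r = bandit.pull(arm)
        counts[arm] += 1
        sums[arm] += r
        chosen.append(arm)
        rewards.append(r)
    return chosen, rewards
```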
Not exactly. This approach will get you in trouble in a few key ways:
- If K is of even moderate size (10-100), you'll spend a long time gathering data before you can actually benefit from feedback.
- Is 100 samples for each arm enough? How many should we use? This is an arbitrary parameter that will require experimentation to determine.
- If after 100 samples (or however many), the arm you settle on is not actually the optimal one, you can never recover.
- In practice, the reward distribution is likely to change over time, and we should use an algorithm that can take that into account.
OK, so maybe the naive approach won't work. Let's move on to a few that are actually used in practice.
Approach #2: The ϵ-Greedy Algorithm
What we'd like to do is start using what we think is the best arm as soon as possible, and adjust course when information to the contrary becomes available. And we don't want to get stuck in a sub-optimal state forever. The "ϵ-greedy" algorithm addresses both of these concerns. Here is how it works: with probability 1−ϵ, pull the arm with the best current sample mean reward, and otherwise pull a random other arm (uniformly). The advantages over the naive approach are:
- Guaranteed to not get stuck in a suboptimal state forever.
- Will use the current best performing arm a large proportion of the time.
But setting ϵ is hard. If it's too small, learning is slow at the start, and you will be slow to react to changes. If we happen to sample, say, the second-best arm the first few times, it may take a long time to discover that another arm is actually better. If ϵ is too big, you'll waste many trials pulling random arms without gaining much. After a while, we'll have enough samples to be pretty sure which arm is best, but we will still be wasting an ϵ fraction of our traffic exploring other options. In short, ϵ is a parameter that gives poor performance at the extremes, and we have little guidance as to how to set it.
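A minimal sketch of ϵ-greedy under the same hypothetical bandit interface (the explore step below samples uniformly over all arms, and the warm-up that tries each arm once is my choice, not something specified in the post):

```python
import random

def epsilon_greedy(bandit, n_arms, n_rounds, eps=0.01):
    """With probability eps pull a uniformly random arm; otherwise pull the best sample mean so far."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    chosen, rewards = [], []
    for t in range(n_rounds):
        if 0 in counts:
            arm = counts.index(0)                         # try every arm once so the means are defined
        elif random.random() < eps:
            arm = random.randrange(n_arms)                # explore
        else:
            means = [sums[a] / counts[a] for a in range(n_arms)]
            arm = means.index(max(means))                 # exploit
        r = bandit.pull(arm)
        counts[arm] += 1
        sums[arm] += r
        chosen.append(arm)
        rewards.append(r)
    return chosen, rewards
```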
Approach #3: Upper Confidence Bound Algorithms
In the world of statistics, whenever you estimate some unknown parameter (such as the mean of a distribution) using random samples, there is a way to quantify the uncertainty inherent in your estimate. For example, the true mean of a fair six-sided die is 3.5. But if you only roll it once and get a 2, your best estimate of the mean is just 2. Obviously that estimate is not very good, and we can quantify just how variable it is. There are confidence bounds which can be written, for example, as: "The mean of this die is 2, with a 95-th percentile lower bound of 1.4 and a 95-th percentile upper bound of 5.2."
The upper confidence bound (UCB) family of algorithms, as its name suggests, simply selects the arm with the largest upper confidence bound at each round. The intuition is this: the more times you roll the die, the tighter the confidence bounds. If you roll the die an infinite number of times then the width of the confidence bound is zero, and before you ever roll it the width of the confidence bound is the largest it will ever be. So, as the number of rolls increases, the uncertainty decreases, and so does the width of the confidence bound.
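To see that tightening in action, here is a small sketch that estimates a one-sided 95% upper bound on a fair die's mean from different numbers of rolls using a Gaussian (central-limit) approximation; the z = 1.645 cutoff and the sample-variance formula are my choices, not something taken from the post.

```python
import math
import random

def gaussian_upper_bound(samples, z=1.645):
    """Approximate one-sided 95% upper confidence bound on the true mean (Gaussian approximation)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / max(n - 1, 1)
    return mean + z * math.sqrt(var / n)

for n in (5, 50, 500, 5000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, round(gaussian_upper_bound(rolls), 3))   # the bound tightens toward 3.5 as n grows
```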
In the bandit case, imagine that you have to introduce a brand new choice to the set of K choices a week into your experiment. The ϵ-greedy algorithm would keep chugging along, showing this new choice rarely (if the initial mean is defined to be 0). But the upper confidence bound of this new choice will be very large because of the uncertainty that results from us never having pulled it. So UCB will choose this new arm until its upper bound is below the upper bound of the more established choices.
So, the advantages are:
- Take uncertainty of sample mean estimate into account in a smart way.
- No parameters to validate.
And the major disadvantage is that the confidence bounds designed in the machine learning literature require heuristic adjustments. One way to get around having to wade through heuristics is to recall the central limit theorem. I'll skip the math but it says that the distribution of the sample mean computed from samples from any distribution converges to a Normal (Gaussian) as the number of samples increases (and fairly quickly). Why does that matter here? Because we are estimating the true expected reward for each arm with a sample mean. Ideally we want a posterior of where the true mean is, but that's hard in non-Bernoulli and non-Gaussian cases. So we will instead content ourselves with an approximation and use a Gaussian distribution centered at the sample mean instead. We can thus always use a, say, 95% upper confidence bound, and be secure in the knowledge that it will become more and more accurate the more samples we get. I will discuss this in more detail in the next blog post.
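As a rough sketch of that idea (not the exact algorithm behind the plot below, whose linked code may differ), here is a UCB rule that plays the arm with the largest Gaussian-approximation 95% upper bound each round; the z-value, the variance floor, and the warm-up are my assumptions.

```python
import math

def ucb_gaussian(bandit, n_arms, n_rounds, z=1.645):
    """Each round, pull the arm with the largest Gaussian-approximation 95% upper confidence bound."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    sq_sums = [0.0] * n_arms
    chosen, rewards = [], []
    for t in range(n_rounds):
        if 0 in counts:
            arm = counts.index(0)                         # pull every arm once so each bound is defined
        else:
            bounds = []
            for a in range(n_arms):
                mean = sums[a] / counts[a]
                var = max(sq_sums[a] / counts[a] - mean ** 2, 1e-12)
                bounds.append(mean + z * math.sqrt(var / counts[a]))
            arm = bounds.index(max(bounds))               # optimism in the face of uncertainty
        r = bandit.pull(arm)
        counts[arm] += 1
        sums[arm] += r
        sq_sums[arm] += r * r
        chosen.append(arm)
        rewards.append(r)
    return chosen, rewards
```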
Simulation Comparison
So how do these three approaches perform? To find out, I ran a simple simulation 100 times with K=5 and binary rewards (aka a Bernoulli bandit). Here are the five algorithms compared:
- Random - just pick a random arm each time without learning anything.
- Naive - with 100 samples of each arm before committing to the best one
- ϵ-Greedy - with ϵ=0.01
- UCB - with (1 - 1/t) bounds (heuristic modification of UCB1)
- UCB - with 95% bounds
The metric used to compare these algorithms is average (over all the trials) expected regret (lower is better), which quantifies how much reward we missed out on by pulling the suboptimal arm at each time step. The Python code is here and the results are in the plot below.
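Concretely, since the hypothetical bandit sketches above return the arm pulled each round, this expected regret can be computed directly from the true arm means, as in the sketch below; this just illustrates the metric, and is not the post's actual evaluation code.

```python
def expected_regret_curve(bandit, chosen_arms):
    """Cumulative expected regret: per round, the best arm's true mean minus the pulled arm's true mean."""
    total, curve = 0.0, []
    for arm in chosen_arms:
        total += bandit.best_mean - bandit.probs[arm]
        curve.append(total)
    return curve

# Illustrative usage with the sketches above (horizon and ϵ are arbitrary choices):
chosen, _ = epsilon_greedy(bandit, n_arms=5, n_rounds=10000, eps=0.01)
regret = expected_regret_curve(bandit, chosen)
```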
What can we conclude from this plot?
- Naive is as bad as random for the first 100·K rounds, but then has effectively flat performance. In the real world, the arms have shifting rewards, so this algorithm is impractical because it over-commits.
- ϵ-greedy is OK, but without a decaying ϵ we're still wasting 1% of traffic on exploration when it may no longer be necessary.
- The UCB algorithms are great. It's not clear which one is the winner in this limited horizon, but both handily beat all of the other algorithms.
Now you know all about bandits, and have a good idea of how they might be relevant to online recommender systems. But there's more to do before we have a system that is really up to the job.
Coming up next: let's get Bayesian with Thompson Sampling!
About Sergey Feldman:
Sergey Feldman is a data scientist & machine learning cowboy with the RichRelevance Analytics team. He was born in Ukraine, moved with his family to Skokie, Illinois at age 10, and now lives in Seattle. In 2012 he obtained his machine learning PhD from the University of Washington. Sergey loves random forests and thinks the Fourier transform is pure magic.