"DRN: A Deep Reinforcement Learning Framework for News Recommendation" (reinforcement learning for recommender systems)
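In news recommendation, news features are highly dynamic (dynamic nature of news features), and some existing models already take this into account.
Second: only some models use user feedback other than click / no click labels (e.g., how frequently the user returns).
To address these problems, the paper proposes a Deep Q-Learning based recommendation framework that explicitly models future reward. To capture more user feedback, the user return pattern is used as a supplement to the click label.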
content based methods [19, 22, 33],
collaborative filtering based methods [11, 28, 34]
hybrid methods [12, 24, 25]
deep learning models [8, 45, 52]
1. The dynamic changes in news recommendation are difficult to handle.
   - News becomes outdated quickly, so the set of news candidates changes rapidly.
2. Current recommendation methods [23, 35, 36, 43] usually only consider the click / no click labels or ratings as users’ feedback.
3. Existing methods tend to keep recommending similar items to users, which might decrease users’ interest in similar topics.
Some methods add randomness (exploration), for example a simple ϵ-greedy strategy [31] or Upper Confidence Bound (UCB) [23, 43] (mainly for Multi-Armed Bandit methods).
The ϵ-greedy strategy may recommend totally unrelated items to the customer.
UCB cannot get a relatively accurate reward estimate for an item until that item has been tried several times.
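The two exploration strategies and their drawbacks can be illustrated with a minimal sketch. The function names, the toy click-rate estimates, and the use of the textbook UCB1 bonus are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random

def epsilon_greedy(estimates, epsilon, rng):
    """With probability epsilon pick a uniformly random item, else the best-estimated one."""
    if rng.random() < epsilon:
        # The random pick ignores relevance entirely, so it can be unrelated to the user.
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

def ucb1_scores(estimates, counts, total_pulls):
    """Textbook UCB1: mean estimate plus a bonus that shrinks as an item is tried more."""
    scores = []
    for mean, n in zip(estimates, counts):
        if n == 0:
            scores.append(float("inf"))  # an untried item has no usable estimate yet
        else:
            scores.append(mean + math.sqrt(2.0 * math.log(total_pulls) / n))
    return scores

# Hypothetical click-rate estimates for three news items.
estimates = [0.30, 0.05, 0.01]
counts = [100, 100, 0]
rng = random.Random(0)
greedy_pick = epsilon_greedy(estimates, 0.0, rng)
scores = ucb1_scores(estimates, counts, sum(counts))
```

With `epsilon = 0.0` the choice is purely greedy. The infinite score for the untried item shows why UCB has to try every item before its estimates mean anything.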
The authors therefore propose their own model. First, a DQN-based framework considers both the current reward and future rewards.
Second, we consider user return as another form of user feedback information, by maintaining an activeness score for each user.
Third, we propose to apply a Dueling Bandit Gradient Descent (DBGD) method [16, 17, 49] for exploration, by choosing random item candidates in the neighborhood of the current recommender.
Environment: users + news
At each time step, when a user requests news, a state (i.e., features of users) and an action set (i.e., features of news candidates) are passed to the agent.
The agent selects the best action (i.e., recommending a list of news to the user) and receives the user's feedback as the reward. Specifically, the reward is composed of the click label and an estimation of user activeness.
3. A more effective exploration mechanism: Dueling Bandit Gradient Descent.
2.1 News recommendation algorithms
Conventional news recommendation methods can be divided into three categories.
1. Content-based methods [19, 22, 33] use news term frequency features (e.g., TF-IDF) and user profiles (based on historical news), then recommend the news most similar to the user profile.
2. Collaborative filtering methods [11] usually make rating predictions using the past ratings of the current user or of similar users [28, 34], or a combination of the two [11].
3. To combine the advantages of the former two groups of methods, hybrid methods [12, 24, 25] are further proposed to improve user profile modeling.
4. Deep learning models [8, 45, 52] have shown much better performance than the previous three categories of models, owing to their ability to model complex user-item relationships.
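The content-based approach in item 1 can be sketched in a few lines. This is a minimal illustration under assumptions: the tokenized toy articles, the averaging of clicked articles into a profile, and all function names are hypothetical rather than taken from the cited methods.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def user_profile(clicked_vectors):
    """Average the vectors of the clicked news into a single profile."""
    profile = Counter()
    for vec in clicked_vectors:
        for t, w in vec.items():
            profile[t] += w / len(clicked_vectors)
    return dict(profile)

# Toy corpus: the user clicked article 0; articles 1 to 3 are candidates.
docs = [
    ["football", "match", "goal"],
    ["football", "league", "transfer"],
    ["stock", "market", "rally"],
    ["election", "vote", "poll"],
]
vecs = tfidf_vectors(docs)
profile = user_profile([vecs[0]])
best = max([1, 2, 3], key=lambda i: cosine(profile, vecs[i]))
```

Only article 1 shares a term with the clicked article, so it is the one recommended. This also shows how such methods drift toward ever more similar items.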
2.2 Reinforcement learning in recommendation
2.2.1 Contextual Multi-Armed Bandit models
The context includes both user features and item features; [23] assumes that the reward is a function of the context.
2.2.2 Markov Decision Process models
MDP models capture not only the reward of the current iteration, but also the potential reward in future iterations.
4.1 Model framework
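Offline: four types of features are extracted (from news and users), and a DQN is used to predict the reward (i.e., a combination of user-news click label and user activeness). The network is trained on offline click logs.
(1) Push
At each timestamp (t1, t2, t3, t4, t5, ...), a user sends a news request. Given the input (user features and news candidates), the agent G generates a top-k list of news L (items chosen by the model plus items chosen by exploration).
(2) Feedback
The user responds to the list L with click feedback B.
(3) Minor update
At each timestamp, given the previous user u, the recommendation list L, and the feedback B, agent G updates its parameters by comparing the performance of Q and Q˜.
If Q˜ gives the better recommendation result, the current network is updated towards Q˜. Otherwise, Q is kept unchanged.
(4) Major update
(5) Repeat steps (1)-(4)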
4.2 Feature construction
• News features include 417-dimensional one-hot features that describe whether certain properties appear in this piece of news,
including headline, provider, ranking, entity name, category, topic category, and click counts in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year respectively.
• User features mainly describe the features (i.e., headline, provider, ranking, entity name, category, and topic category) of the news that the user clicked in the last 1 hour, 6 hours, 24 hours,
1 week, and 1 year respectively. There is also a total click count for each time granularity. Therefore, there are 413 × 5 = 2065 dimensions in total.
• User news features. These 25-dimensional features describe the interaction between user and one certain piece of news,
i.e., the frequency for the entity (also category, topic category and provider) to appear in the history of the user’s readings.
• Context features. These 32-dimensional features describe the context when a news request happens, including time, weekday,
and the freshness of the news (the gap between request time and news publish time).
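The dimension arithmetic of the four feature groups can be checked with a short sketch. Concatenating the four groups into one flat vector is an assumption made here for illustration, as are the constant and function names; the notes do not state how the features are fed to the network.

```python
NEWS_DIM = 417
USER_DIM_PER_WINDOW = 413
TIME_WINDOWS = ("1h", "6h", "24h", "1w", "1y")
USER_NEWS_DIM = 25
CONTEXT_DIM = 32

# 413 dimensions for each of the five time windows.
USER_DIM = USER_DIM_PER_WINDOW * len(TIME_WINDOWS)
TOTAL_DIM = NEWS_DIM + USER_DIM + USER_NEWS_DIM + CONTEXT_DIM

def build_input(news, user, user_news, context):
    """Validate each feature group's length and concatenate them (assumed layout)."""
    parts = (news, user, user_news, context)
    expected = (NEWS_DIM, USER_DIM, USER_NEWS_DIM, CONTEXT_DIM)
    for part, dim in zip(parts, expected):
        if len(part) != dim:
            raise ValueError(f"expected {dim} features, got {len(part)}")
    return [x for part in parts for x in part]

# Placeholder all-zero features, just to exercise the shapes.
vector = build_input(
    [0.0] * NEWS_DIM, [0.0] * USER_DIM, [0.0] * USER_NEWS_DIM, [0.0] * CONTEXT_DIM
)
```

Under this layout one user-news pair is described by 417 + 2065 + 25 + 32 = 2539 values.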
4.3 Deep Reinforcement Recommendation
r_immediate denotes the immediate reward (whether the user clicked this piece of news).
DDQN formula:
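For reference, the standard Double DQN target can be written as follows. The notation is generic and may differ from the paper's: W_t denotes the parameters of the online network, W'_t those of the target network, and γ the discount factor.

```latex
y_t = r_{t+1} + \gamma \, Q\bigl(s_{t+1},\ \arg\max_{a'} Q(s_{t+1}, a'; W_t);\ W'_t\bigr)
```

The online network selects the next action and the target network evaluates it, which reduces the overestimation that comes from using a single network for both steps.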
4.4 User Activeness
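Traditional methods only consider CTR, but user activeness is also important. It is a new metric, proposed in this paper, that can serve as feedback on recommendation results. User activeness can be understood as how often the user uses the app: good recommendations increase that frequency, so it can be used as a feedback signal.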
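A minimal sketch of how an activeness score could be folded into the reward is shown below. It is a toy model, not the paper's activeness model: the exponential decay, the bump on return, the weight `beta`, and every default value are assumptions chosen only to make the idea concrete.

```python
import math

def activeness_after(score, hours_idle, returned, decay_per_hour=0.05, bump=0.3):
    """Toy update: decay while the user is away, bump on return, clamp to [0, 1]."""
    score *= math.exp(-decay_per_hour * hours_idle)
    if returned:
        score += bump
    return min(1.0, max(0.0, score))

def total_reward(clicked, activeness, beta=0.05):
    """Click label plus a weighted activeness term (beta is a hypothetical weight)."""
    return (1.0 if clicked else 0.0) + beta * activeness

# A user who stays away for 10 hours loses activeness; a prompt return restores it.
idle = activeness_after(0.5, hours_idle=10, returned=False)
back = activeness_after(0.5, hours_idle=0, returned=True)
reward = total_reward(clicked=True, activeness=back)
```

The point of the combination is that a recommendation which draws no click but brings the user back sooner still earns some reward.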
4.5 Explore
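Exploration in this paper uses the Dueling Bandit Gradient Descent algorithm.
On top of the DQN network there is an additional exploration network Q˜, whose parameters are produced by adding some noise to the parameters of the current Q network.
When a user request arrives, both networks generate a top-K news list. The two lists are then mixed to some degree, and the user's feedback is collected. If the exploration network Q˜ performs better, the parameters of the current Q network are updated in the direction of the parameters of Q˜.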
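The exploration step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the multiplicative noise with coefficient `alpha`, the random-pick interleaving, the click-count comparison, and the interpolation step `eta` are choices made here, and the names are hypothetical. The paper's exact perturbation and update rules are not reproduced.

```python
import random

def explore_weights(weights, alpha, rng):
    """Build Q~ by perturbing each weight by a random fraction in [-alpha, alpha] of itself."""
    return [w + alpha * rng.uniform(-1.0, 1.0) * w for w in weights]

def interleave(list_q, list_explore, k, rng):
    """Merge two ranked lists into at most k items, recording which network supplied each."""
    merged, owner = [], {}
    sources = {"Q": list(list_q), "Q~": list(list_explore)}
    while len(merged) < k and (sources["Q"] or sources["Q~"]):
        name = rng.choice([n for n in sources if sources[n]])
        item = sources[name].pop(0)
        if item not in owner:  # skip items the other list already contributed
            owner[item] = name
            merged.append(item)
    return merged, owner

def minor_update(weights, explored, clicks, owner, eta):
    """Move the weights toward Q~ only if its items drew more clicks than Q's."""
    wins_q = sum(1 for item in clicks if owner.get(item) == "Q")
    wins_explore = sum(1 for item in clicks if owner.get(item) == "Q~")
    if wins_explore > wins_q:
        return [w + eta * (e - w) for w, e in zip(weights, explored)]
    return list(weights)  # otherwise Q is kept unchanged

rng = random.Random(1)
weights = [1.0, -2.0, 0.5]
explored = explore_weights(weights, alpha=0.1, rng=rng)
merged, owner = interleave(["a", "b", "c"], ["c", "d", "e"], k=4, rng=rng)
```

Because only a small perturbation of the current network is tested, the explored items stay in the neighborhood of what the recommender would have shown anyway, unlike the fully random picks of ϵ-greedy.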
