Attention in Recommender Systems: Alibaba's Deep Interest Network (DIN)

Reference:

https://zhuanlan.zhihu.com/p/51623339

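As the name suggests, an attention mechanism lets the model pay different amounts of attention to different user behaviors at prediction time: behaviors that are relevant to the candidate get more weight, and irrelevant ones can be all but ignored. This idea maps onto the model in a direct way.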

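The earlier approach treats every behavior record equally. In the model, this means an average pooling layer that averages the embedding vectors of all items the user has interacted with to form the user vector. A more careful engineer might add a time decay so that recent behaviors have more influence, which amounts to adjusting the average pooling weights by time.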
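A minimal sketch of the two baselines described above, in plain Python. The exponential form of the decay (`decay ** age`) and the names `average_pooling`, `time_decay_pooling`, and `ages` are assumptions made for illustration; the text only says that recent behaviors should weigh more.

```python
def average_pooling(behavior_embeddings):
    """Plain average of the behavior embeddings: every behavior counts equally."""
    n = len(behavior_embeddings)
    dim = len(behavior_embeddings[0])
    return [sum(e[d] for e in behavior_embeddings) / n for d in range(dim)]


def time_decay_pooling(behavior_embeddings, ages, decay=0.5):
    """Weighted average where a behavior that is `age` steps old gets weight decay**age.

    The exponential decay is an assumed choice, not something the text prescribes.
    """
    weights = [decay ** age for age in ages]
    total = sum(weights)
    dim = len(behavior_embeddings[0])
    return [
        sum(w * e[d] for w, e in zip(weights, behavior_embeddings)) / total
        for d in range(dim)
    ]


history = [[1.0, 0.0], [0.0, 1.0]]               # oldest behavior first
print(average_pooling(history))                  # both behaviors weigh the same
print(time_decay_pooling(history, ages=[1, 0]))  # the newest behavior weighs more
```

In both cases the weights depend only on the history itself, never on the candidate being scored.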

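$V_u = f(V_a) = \sum_i w_i \cdot V_i = \sum_i g(V_i, V_a) \cdot V_i$

In the formula above, $V_u$ is the user's embedding vector, $V_a$ is the embedding vector of the candidate ad, and $V_i$ is the embedding vector of user u's i-th behavior. Since a behavior here is a visit to an item or a shop, the behavior's embedding is the embedding of the item or shop that was visited.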

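With attention, $V_u$ changes from a plain sum of the $V_i$ into a weighted sum of the $V_i$. The weight $w_i$ of $V_i$ is determined by the relationship between $V_i$ and $V_a$, which is the $g(V_i,V_a)$ term in the formula. Speaking loosely, adding this $g(V_i,V_a)$ accounts for about 70% of the paper's value.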
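The weighted sum can be sketched as follows. The relevance function `g` is left pluggable, and the dot product used as its default is only a placeholder assumption; `attention_pooling` and `dot` are illustrative names.

```python
def dot(u, v):
    """Dot product, used here as a stand-in relevance function."""
    return sum(a * b for a, b in zip(u, v))


def attention_pooling(behavior_embeddings, candidate, g=dot):
    """Compute V_u = sum_i g(V_i, V_a) * V_i and return it with the weights."""
    dim = len(candidate)
    weights = [g(v_i, candidate) for v_i in behavior_embeddings]
    user_vector = [
        sum(w * v_i[d] for w, v_i in zip(weights, behavior_embeddings))
        for d in range(dim)
    ]
    return user_vector, weights


history = [[1.0, 0.0], [0.0, 1.0]]
candidate_ad = [1.0, 0.0]
user_vector, weights = attention_pooling(history, candidate_ad)
print(weights)      # the behavior aligned with the candidate gets the larger weight
print(user_vector)
```

Because the weights depend on the candidate, the same user gets a different user vector for each ad being scored.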

So what function should $g(V_i,V_a)$ be? The paper's architecture diagram makes the answer clear.

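Compared with the standard deep recommendation network (the Base model), DIN adds an activation unit layer when building the user embedding vector. This layer produces the weight for each user behavior $V_i$. The rest of this section looks at how that weight is generated, that is, how $g(V_i,V_a)$ is defined.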

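In conventional attention, given two item embeddings u and v, the score is usually the dot product $u \cdot v$ or the bilinear form $u^T W v$, where W is a |u| x |v| weight matrix. The paper goes a step further. Look at the activation unit in the top-right corner of the architecture diagram: it concatenates u, v, and the element-wise difference of u and v, feeds the result into fully connected layers, and outputs the weight. This loses less information than a single dot product. If you just want a convenient way to add attention to your own model, though, the dot product is a reasonable first try, because it introduces no parameters to train.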
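A rough sketch of an activation unit as described above: concatenate the behavior embedding, the candidate embedding, and their element-wise difference, then pass the result through a fully connected layer to get a scalar weight. The single hidden layer, the ReLU activation, and the randomly initialized (untrained) parameters from the hypothetical `init_params` helper are all assumptions for illustration.

```python
import random


def activation_unit(v_i, v_a, w_hidden, b_hidden, w_out, b_out):
    """Score one behavior against the candidate.

    Input is concat(v_i, v_a, v_i - v_a); one hidden layer, then a scalar output.
    """
    diff = [a - b for a, b in zip(v_i, v_a)]
    x = v_i + v_a + diff  # list concatenation, length 3 * dim
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)  # ReLU is an assumption
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


def init_params(dim, hidden_size, seed=0):
    """Random, untrained parameters, only here to make the sketch runnable."""
    rng = random.Random(seed)
    w_hidden = [
        [rng.uniform(-0.1, 0.1) for _ in range(3 * dim)] for _ in range(hidden_size)
    ]
    b_hidden = [0.0] * hidden_size
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
    return w_hidden, b_hidden, w_out, 0.0


params = init_params(dim=2, hidden_size=4)
history = [[1.0, 0.0], [0.0, 1.0]]
candidate_ad = [1.0, 0.0]
weights = [activation_unit(v_i, candidate_ad, *params) for v_i in history]
print(weights)  # one scalar weight per behavior
```

Unlike the dot product, these weights only become meaningful after the parameters are trained together with the rest of the network.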

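Also note the red lines in the architecture diagram. Each ad has two attributes, good_id and shop_id. The ad's shop_id interacts only with the shop_id sequence in the user's history, and its good_id interacts only with the user's good_id sequence. The reason is straightforward: relevance is compared only between features of the same kind.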

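The paper describes the activation unit as follows:

Activation units are applied on the user behavior features, which performs as a weighted sum pooling to adaptively calculate the user representation $v_U$ given a candidate ad A:

$v_U(A) = f(v_A, e_1, e_2, ..., e_H) = \sum_{j=1}^{H} a(e_j, v_A) e_j = \sum_{j=1}^{H} w_j e_j$

where $\{e_1, e_2, ..., e_H\}$ is the list of embedding vectors of behaviors of user $U$ with length H, $v_A$ is the embedding vector of ad A, and $a(\cdot)$ is the activation unit that outputs the weight $w_j$.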

If the part above is 70% of the paper's value, the remaining 30% comes from these points:

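Replacing AUC with GAUC as the offline metric
Replacing the classic PReLU activation function with Dice
Introducing an adaptive regularization method
Introducing Alibaba's X-Deep Learning platform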

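PReLU activation function:

$f(s) = p(s) \cdot s + (1 - p(s)) \cdot \alpha s$

where $p(s) = I(s > 0)$ is the indicator function and $\alpha$ is a learned parameter.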

Dice:

$f(s) = p(s) \cdot s + (1 - p(s)) \cdot \alpha s, \quad p(s) = \frac{1}{1 + e^{-\frac{s - E[s]}{\sqrt{Var[s] + \epsilon}}}}$

Dice can be viewed as a generalization of PReLU. The key idea of Dice is to adaptively adjust the rectified point according to the distribution of the input data, setting it to the mean of the input. Besides, Dice switches smoothly between the two channels. When $E[s] = 0$ and $Var[s] = 0$, Dice degenerates into PReLU.
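The two activations can be compared side by side in a small sketch. Both use the same two-channel form `p(s) * s + (1 - p(s)) * alpha * s` and differ only in how `p(s)` is computed. Here `mean` and `var` are simply passed in by the caller, and the default `eps` is an assumed value.

```python
import math


def prelu(s, alpha):
    """PReLU: hard switch between the two channels, p(s) = I(s > 0)."""
    p = 1.0 if s > 0 else 0.0
    return p * s + (1.0 - p) * alpha * s


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dice(s, alpha, mean, var, eps=1e-8):
    """Dice: smooth switch, p(s) is a sigmoid of the standardized input."""
    p = sigmoid((s - mean) / math.sqrt(var + eps))
    return p * s + (1.0 - p) * alpha * s


for s in (-2.0, 2.0):
    # With mean 0 and variance 0 the sigmoid is almost a step function,
    # so Dice behaves like PReLU.
    print(s, prelu(s, 0.25), dice(s, 0.25, mean=0.0, var=0.0))

# With a non-zero mean the switching point moves away from zero.
print(dice(1.0, 0.25, mean=3.0, var=1.0))
```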

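GAUC:

AUC measures ranking quality over all samples pooled together. In computational advertising, what actually matters is how well the model ranks different ads for the same user, not how samples from different users rank against each other. Group AUC (GAUC) computes an AUC for each user and then takes the weighted average. This reduces the effect of scores that are not comparable across users.

In practice the weight is usually each user's number of views or number of clicks, and users whose samples are all positive or all negative are filtered out, since AUC is undefined for them.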
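A minimal sketch of this computation, independent of the linked implementation. It weights each user by their number of samples (views) and skips users with only one class. The pairwise AUC is quadratic in the number of samples per user and is written for clarity, not speed; the function names are illustrative.

```python
from collections import defaultdict


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(user_ids, labels, scores):
    """View-weighted average of per-user AUC, skipping single-class users."""
    groups = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        groups[u][0].append(y)
        groups[u][1].append(s)
    weighted_sum, total_weight = 0.0, 0
    for ys, ss in groups.values():
        user_auc = auc(ys, ss)
        if user_auc is None:
            continue
        weighted_sum += len(ys) * user_auc
        total_weight += len(ys)
    return weighted_sum / total_weight if total_weight else None


users = ["a", "a", "b", "b", "b", "c", "c"]
labels = [1, 0, 1, 0, 0, 1, 1]          # user c has only positives and is skipped
scores = [0.9, 0.1, 0.2, 0.3, 0.1, 0.8, 0.7]
print(gauc(users, labels, scores))
```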

Implementation: https://github.com/qiaoguan/deep-ctr-prediction/blob/master/DeepCross/metric.py

Notes from reading the paper:

Base model: Embedding & MLP

Embedding layer:

For the $i$-th feature group $t_i$ (a $K_i$-dimensional vector in which one or more entries are 1), let $W_i = [w^i_1, ..., w^i_j, ..., w^i_{K_i}] \in \mathbb{R}^{D \times K_i}$ represent the $i$-th embedding dictionary, where $w^i_j \in \mathbb{R}^D$ is an embedding vector with dimensionality D. The embedding operation follows the table lookup mechanism.

Embedding mechanism:

1. If $t_i$ is a one-hot vector with the $j$-th element $t_i[j] = 1$, the embedded representation of $t_i$ is a single embedding vector $e_i = w^i_j$.

2. If $t_i$ is a multi-hot vector with $t_i[j] = 1$ for $j \in \{i_1, i_2, ..., i_k\}$, the embedded representation of $t_i$ is a list of embedding vectors: $\{e_{i_1}, e_{i_2}, ..., e_{i_k}\} = \{w^i_{i_1}, w^i_{i_2}, ..., w^i_{i_k}\}$.
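The two lookup cases can be sketched with a single helper. As an assumption for readability, the embedding dictionary is stored here as a list of K vectors of length D (so `table[j]` is column j of the D x K matrix), and `embed` is an illustrative name.

```python
def embed(t, table):
    """Return the embedding vectors for the non-zero positions of t.

    A one-hot t yields a list with one vector, a multi-hot t yields several.
    """
    return [table[j] for j, flag in enumerate(t) if flag == 1]


# Embedding dictionary for one feature group: K = 3 ids, D = 2 dimensions.
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

print(embed([0, 1, 0], table))  # one-hot: a single embedding vector
print(embed([1, 0, 1], table))  # multi-hot: a list of embedding vectors
```

Production systems usually store the ids directly instead of materializing the sparse 0/1 vector, but the lookup is the same.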

Pooling layer and Concat layer:

The number of non-zero values for multi-hot behavioral feature vector $t_i$ varies across instances, causing the lengths of the corresponding list of embedding vectors to be variable. As fully connected networks can only handle fixed-length inputs, it is a common practice to transform the list of embedding vectors via a pooling layer to get a fixed-length vector:

$e_i = pooling(e_{i_1}, e_{i_2}, ..., e_{i_k})$

Both embedding and pooling layers operate in a group-wise manner, mapping the original sparse features into multiple fixed-length representation vectors. Then all the vectors are concatenated together to obtain the overall representation vector for the instance.
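A small sketch of why pooling is needed before concatenation: behavior lists of different lengths collapse to the same fixed size, so every instance ends up with a vector of the same length. The `pool` and `concat` helpers and the example feature groups are illustrative assumptions.

```python
def pool(vectors, mode="sum"):
    """Collapse a variable-length list of D-dim vectors into one D-dim vector."""
    summed = [sum(col) for col in zip(*vectors)]
    if mode == "sum":
        return summed
    if mode == "avg":
        return [v / len(vectors) for v in summed]
    raise ValueError(f"unknown pooling mode: {mode}")


def concat(group_vectors):
    """Join the fixed-length vector of every feature group into one instance vector."""
    return [x for vec in group_vectors for x in vec]


# Two instances whose behavior lists have different lengths.
short_history = [[1.0, 2.0]]
long_history = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gender = [0.5, -0.5]  # a one-hot group is already a single vector

a = concat([gender, pool(short_history)])
b = concat([gender, pool(long_history)])
print(len(a), len(b))  # both instance vectors have the same length
```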

MLP:

Given the concatenated dense representation vector, fully connected layers are used to learn the combination of features automatically. Recently developed methods focus on designing structures of MLP for better information extraction.
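For completeness, a forward pass through such fully connected layers might look like the sketch below. The ReLU hidden activation, the single sigmoid output read as a click probability, and the hand-picked weights are assumptions for illustration, not the configuration used in the paper.

```python
import math


def dense(x, weights, bias):
    """One fully connected layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def mlp_forward(x, layers):
    """Apply hidden layers with ReLU, then a single-unit output layer with a sigmoid."""
    *hidden, (w_out, b_out) = layers
    for weights, bias in hidden:
        x = relu(dense(x, weights, bias))
    logit = dense(x, w_out, b_out)[0]
    return 1.0 / (1.0 + math.exp(-logit))


# Concatenated instance vector of length 2, one hidden layer of size 2.
layers = [
    ([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 1.0]], [0.0]),                    # output layer
]
print(mlp_forward([1.0, 2.0], layers))  # a probability between 0 and 1
```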
