Doubly Robust Joint Learning for Recommendation on Data Missing Not at Random

Wang X., Zhang R., Sun Y. and Qi J. Doubly robust joint learning for recommendation on data missing not at random. In International Conference on Machine Learning (ICML), 2019.

To handle the MNAR (Missing Not At Random) setting in recommender systems, the authors combine an error-imputation model with the inverse-propensity-scoring (IPS) estimator to construct a Doubly Robust estimator: the estimator is unbiased as long as either the imputation model or the propensity estimates are accurate.

Notation

  • \(\mathcal{U} = \{u_1, \ldots, u_N\}\), user;

  • \(\mathcal{I} = \{i_1, \ldots, i_M\}\), item;

  • \(R \in \mathbb{R}^{N \times M}\), true rating matrix;

  • \(\hat{R} \in \mathbb{R}^{N \times M}\), predicted rating matrix;

  • \(O \in \{0, 1\}^{N \times M}\), indicator matrix, where \(o_{u,i} = 1\) iff \(r_{u,i}\) is observed;

  • \(R^f = R, R^o = R \odot O\) denote the fully observed and the partially observed rating matrices, respectively;

  • \(\mathcal{O} = \{(u, i)| o_{u, i} = 1\}\);

  • \(\bm{x}_{u, i}\), the features of user \(u\) and item \(i\);

  • the ideal loss (or prediction inaccuracy) \(\mathcal{P}\) is defined as:

    \[\tag{1} \mathcal{P} := \mathcal{P}(\hat{R}, R^f) = \frac{1}{NM} \sum_{u, i} e_{u, i}, \]

    where \(e_{u, i}\) can be

    \[\text{MAE}: e_{u,i} = |\hat{r}_{u, i} - r_{u,i}|, \\ \text{MSE}: e_{u,i} = (\hat{r}_{u, i} - r_{u,i})^2, \\ \]

    or some other metric;

  • with missing data, the loss used in practice is usually:

    \[\tag{1+} \mathcal{E}_N := \mathcal{E}_N (\hat{R}, R^o) = \frac{1}{|\mathcal{O}|} \sum_{u, i \in \mathcal{O}} e_{u,i}; \]
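A tiny numerical check of the gap between (1) and (1+). The sketch below uses hand-picked toy values and a deterministic MNAR mask in which exactly the low-error entries are observed, so \(\mathcal{E}_N\) can be far from \(\mathcal{P}\):

```python
import numpy as np

# True and predicted ratings (toy values).
R = np.array([[5.0, 1.0],
              [4.0, 2.0]])
R_hat = np.array([[5.0, 2.0],
                  [4.0, 4.0]])

# Squared errors e_{u,i} (the MSE choice above).
e = (R_hat - R) ** 2          # [[0, 1], [0, 4]]

# Ideal loss P over the full matrix, eq. (1).
P = e.mean()                  # 1.25

# MNAR mask: only the low-error entries are observed.
O = np.array([[1, 0],
              [1, 0]])

# Naive loss over observed entries, eq. (1+).
E_N = e[O == 1].mean()        # 0.0, far from P = 1.25

print(P, E_N)
```

Under MCAR every entry would be observed with the same probability and \(\mathcal{E}_N\) would be unbiased; here the mask depends on the error itself, which is precisely the MNAR situation.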

Motivation

Ideally, we would obtain a good prediction \(\hat{R}\) by minimizing (1), but because of missing data we can in practice only train via (1+). Unbiasedness of (1+) requires certain conditions (e.g., MCAR), i.e.,

\[\mathbb{E}_{O} [\mathcal{E}_N] = \mathcal{P}. \]

Under general MNAR conditions this property no longer holds, so a model trained via (1+) typically suffers from severe selection bias. Two classical approaches address this problem.

EIB

The error-imputation-based (EIB) estimator uses an imputation model to estimate the errors,

\[\hat{e}_{u,i} = \omega |\hat{r}_{u, i} - \gamma| \text{ or } \hat{e}_{u, i} = \omega (\hat{r}_{u, i} - \gamma)^2, \]

and then replaces (1+) with the following EIB estimator:

\[\tag{2} \mathcal{E}_{EIB} = \mathcal{E}_{EIB}(\hat{R}, \hat{R}^o) = \frac{1}{NM} \sum_{u, i} (o_{u,i} e_{u, i} + (1 - o_{u, i}) \hat{e}_{u, i}). \]

Let \(\delta_{u, i} = e_{u, i} - \hat{e}_{u, i}\) denote the error deviation. Clearly, when \(\delta_{u, i} = 0\) for all \((u, i)\), the EIB estimator is unbiased.

In general, the bias of \(\mathcal{E}_{EIB}\) is

\[\begin{array}{ll} \text{Bias}(\mathcal{E}_{EIB}) &=|\mathbb{E}_O[\mathcal{E}_{EIB}] - \mathcal{P}| \\ &=|\frac{1}{NM} \sum_{u, i} \mathbb{E}_{o_{u,i}} [o_{u,i} e_{u, i} + (1 - o_{u, i})\hat{e}_{u, i}] - \mathcal{P}| \\ &=|\frac{1}{NM} \sum_{u, i} [p_{u,i} e_{u, i} + (1 - p_{u, i})\hat{e}_{u, i}] - \mathcal{P}| \\ &=|\frac{1}{NM} \sum_{u, i} [(p_{u,i} - 1) e_{u, i} + (1 - p_{u, i})\hat{e}_{u, i}]| \\ &=\frac{1}{NM} |\sum_{u, i} (1 - p_{u, i}) \delta_{u, i}|. \\ \end{array} \]
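The closed form above can be checked numerically: taking the expectation over \(O\) entrywise (each \(o_{u,i}\) is Bernoulli with probability \(p_{u,i}\)) gives \(p_{u,i} e_{u,i} + (1 - p_{u,i}) \hat{e}_{u,i}\) per entry. A small sketch with made-up errors, imputations, and propensities:

```python
import numpy as np

e     = np.array([[0.0, 1.0], [0.0, 4.0]])   # true errors (toy values)
e_hat = np.array([[0.5, 1.0], [0.5, 1.0]])   # imputed errors
p     = np.array([[0.9, 0.2], [0.8, 0.1]])   # true propensities
NM = e.size

P = e.mean()

# E_O[E_EIB] entrywise: p*e + (1-p)*e_hat.
expected_eib = (p * e + (1 - p) * e_hat).mean()

# Closed-form bias: |sum (1-p) * delta| / NM with delta = e - e_hat.
delta = e - e_hat
bias_formula = np.abs(((1 - p) * delta).sum()) / NM

# The measured gap matches the closed form.
assert np.isclose(abs(expected_eib - P), bias_formula)
print(abs(expected_eib - P))
```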

IPS

The IPS estimator is defined as follows:

\[\tag{2+} \mathcal{E}_{IPS} = \mathcal{E}_{IPS}(\hat{R}, R^o) = \frac{1}{NM} \sum_{u, i} \frac{o_{u, i} e_{u, i}}{\hat{p}_{u, i}}. \]

Clearly, if

\[\hat{p}_{u, i} = p_{u, i} > 0 \: \forall (u, i) \]

holds, or equivalently

\[\Delta_{u, i} := \frac{\hat{p}_{u, i} - p_{u, i}}{\hat{p}_{u, i}} = 0, \: \forall (u, i) \]

holds, then the IPS estimator is also unbiased.

In general, the bias of \(\mathcal{E}_{IPS}\) is

\[\begin{array}{ll} \text{Bias}(\mathcal{E}_{IPS}) &= |\mathbb{E}_O[\mathcal{E}_{IPS}] - \mathcal{P} | \\ &= |\frac{1}{NM} \sum_{u, i} \mathbb{E}_{o_{u,i}}\frac{o_{u, i} e_{u, i}}{\hat{p}_{u,i}} - \mathcal{P} | \\ &= |\frac{1}{NM} \sum_{u, i} \frac{p_{u, i} e_{u, i}}{\hat{p}_{u,i}} - \mathcal{P} | \\ &= \frac{1}{NM} |\sum_{u, i} \Delta_{u, i} e_{u, i}|. \\ \end{array} \]
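The same entrywise-expectation check works for IPS: since \(\mathbb{E}[o_{u,i}] = p_{u,i}\), each term contributes \(p_{u,i} e_{u,i} / \hat{p}_{u,i}\). A sketch with toy values, verifying unbiasedness under correct propensities and the bias formula under misspecified ones:

```python
import numpy as np

e = np.array([[0.0, 1.0], [0.0, 4.0]])   # true errors (toy values)
p = np.array([[0.9, 0.2], [0.8, 0.1]])   # true propensities
NM = e.size
P = e.mean()

def expected_ips(p_hat):
    # E_O[E_IPS] entrywise: E[o] = p, so each term is p*e/p_hat.
    return (p * e / p_hat).mean()

# Correct propensities: unbiased.
assert np.isclose(expected_ips(p), P)

# Misspecified propensities: bias = |sum Delta * e| / NM.
p_hat = np.full_like(p, 0.5)
Delta = (p_hat - p) / p_hat
assert np.isclose(abs(expected_ips(p_hat) - P),
                  np.abs((Delta * e).sum()) / NM)
```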

Doubly Robust Estimator

The DR estimator is defined as

\[\mathcal{E}_{DR} = \mathcal{E}_{DR} (\hat{R}, R^o) = \frac{1}{NM} \sum_{u, i} \Big(\hat{e}_{u, i} + \frac{o_{u, i} \delta_{u, i}}{\hat{p}_{u, i}} \Big) = \frac{1}{NM} \sum_{u, i} \frac{(\hat{p}_{u, i} - o_{u, i}) \hat{e}_{u, i} + o_{u,i} e_{u,i}}{\hat{p}_{u,i}}. \]

Clearly, when \(\delta_{u, i} = 0\) or \(\Delta_{u,i} = 0\) for all \((u, i)\), we have

\[\mathbb{E}_{O}[\mathcal{E}_{DR}] = \mathcal{P}. \]

In general, the bias is

\[\begin{array}{ll} \text{Bias}(\mathcal{E}_{DR}) &= |\mathbb{E}_{O}[\mathcal{E}_{DR}] - \mathcal{P}| \\ &= |\frac{1}{NM} \sum_{u,i} (\hat{e}_{u,i} + \frac{p_{u, i}}{\hat{p}_{u, i}} \delta_{u, i}) - \mathcal{P}| \\ &= |\frac{1}{NM} \sum_{u,i} (e_{u,i} - \Delta_{u,i} \delta_{u, i}) - \mathcal{P}| \\ &= \frac{1}{NM} |\sum_{u, i} \Delta_{u, i} \delta_{u,i}|. \end{array} \]
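The double robustness can be verified on the same toy values: taking expectations entrywise, \(\mathbb{E}_O[\mathcal{E}_{DR}]\) equals \(\mathcal{P}\) whenever either the imputations or the propensities are correct, and the residual bias is the product form \(\frac{1}{NM}|\sum \Delta_{u,i}\delta_{u,i}|\). A sketch (all values illustrative):

```python
import numpy as np

e = np.array([[0.0, 1.0], [0.0, 4.0]])   # true errors (toy values)
p = np.array([[0.9, 0.2], [0.8, 0.1]])   # true propensities
NM = e.size
P = e.mean()

def expected_dr(e_hat, p_hat):
    # E_O[E_DR] entrywise: e_hat + (p / p_hat) * (e - e_hat).
    return (e_hat + p / p_hat * (e - e_hat)).mean()

bad_e_hat = np.full_like(e, 0.7)    # wrong imputation
bad_p_hat = np.full_like(p, 0.5)    # wrong propensities

# Route 1: correct propensities, wrong imputation -> unbiased.
assert np.isclose(expected_dr(bad_e_hat, p), P)
# Route 2: correct imputation, wrong propensities -> unbiased.
assert np.isclose(expected_dr(e, bad_p_hat), P)

# Both wrong: bias = |sum Delta * delta| / NM.
Delta = (bad_p_hat - p) / bad_p_hat
delta = e - bad_e_hat
assert np.isclose(abs(expected_dr(bad_e_hat, bad_p_hat) - P),
                  np.abs((Delta * delta).sum()) / NM)
```

Note that the bias is a product of the two error terms, which is why it is typically much smaller than either the EIB bias (driven by \(\delta\) alone) or the IPS bias (driven by \(\Delta\) alone).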

Tail Bound and Generalization Bound

Lemma 3.2 (Tail Bound of DR Estimator): Suppose the \(o_{u, i}\) are mutually independent. Then for any given \(\hat{R}\) and \(R\), with probability at least \(1 - \eta\),

\[|\mathcal{E}_{DR} - \mathbb{E}_{O}[\mathcal{E}_{DR}]| \le \sqrt{\frac{\log (2 / \eta)}{2 (NM)^2} \sum_{u, i} (\frac{\delta_{u, i}}{\hat{p}_{u, i}})^2}. \]

Theorem 4.1 (Generalization Bound): For a finite hypothesis space \(\mathcal{H}\), consider

\[\hat{R}^* = \mathop{\text{argmin}} \limits_{\hat{R} \in \mathcal{H}} \mathcal{E}_{DR}(\hat{R}, R^o) \]

The resulting minimizer \(\hat{R}^*\) has prediction inaccuracy \(\mathcal{P}(\hat{R}^*, R^f)\) relative to the true \(R^f\) satisfying, with probability at least \(1 - \eta\),

\[\mathcal{P}(\hat{R}^*, R^f) \le \mathcal{E}_{DR}(\hat{R}^*, R^o) + \sum_{u, i} \frac{|\Delta_{u, i} \delta_{u, i}^*|}{NM} + \sqrt{\frac{\log (2 |\mathcal{H}| / \eta)}{2 (NM)^2} \sum_{u, i} (\frac{\delta_{u, i}'}{\hat{p}_{u, i}})^2}. \]

where \(\delta_{u,i}'\) is the error deviation between \(R^f\) and \(\hat{R}' = \mathop{\text{argmin}} \limits_{\hat{R} \in \mathcal{H}} \sum_{u,i} (\frac{\delta_{u,i}}{\hat{p}_{u,i}})^2\).


Proof:

Note that

\[\begin{array}{ll} \mathcal{P} &= \mathcal{P} - \mathbb{E}_{O}[\mathcal{E}_{DR}] + \mathbb{E}_{O}[\mathcal{E}_{DR}] \\ &\le \text{Bias} (\mathcal{E}_{DR}) + \mathbb{E}_{O}[\mathcal{E}_{DR}]. \\ \end{array} \]

so it suffices to relate \(\mathbb{E}_{O}[\mathcal{E}_{DR}]\) to \(\mathcal{E}_{DR}\). Both this relation and Lemma 3.2 follow mainly from Hoeffding's inequality.
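To make the Hoeffding step explicit (a sketch of the standard argument, not the paper's exact proof): \(\mathcal{E}_{DR}\) is a sum of \(NM\) independent terms, the \((u,i)\)-th of which takes the value \(\hat{e}_{u,i}/NM\) when \(o_{u,i} = 0\) and \((\hat{e}_{u,i} + \delta_{u,i}/\hat{p}_{u,i})/NM\) when \(o_{u,i} = 1\), so its range has width \(|\delta_{u,i}/\hat{p}_{u,i}|/NM\). Hoeffding's inequality then gives

\[\mathbb{P}\Big(|\mathcal{E}_{DR} - \mathbb{E}_{O}[\mathcal{E}_{DR}]| \ge t\Big) \le 2 \exp \Big(- \frac{2 t^2}{\sum_{u,i} \big(\frac{\delta_{u,i}}{NM \hat{p}_{u,i}}\big)^2}\Big), \]

and setting the right-hand side to \(\eta\) and solving for \(t\) yields Lemma 3.2; a union bound over the \(|\mathcal{H}|\) candidates in \(\mathcal{H}\) gives the \(\log(2|\mathcal{H}|/\eta)\) term in Theorem 4.1.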


Training Framework

  1. Inputs: the observed data \(R^o\) and the estimated propensities \(\hat{P}\);

  2. Build the imputation model \(\hat{e}_{u, i} = g_{\phi}(\bm{x}_{u,i})\) and the prediction model \(\hat{r}_{u, i} = f_{\theta}(\bm{x}_{u, i})\);

  3. [multi steps] Optimize the imputation model (i.e., train only \(\phi\)) via:

    \[\mathcal{L}_e (\phi) = \sum_{(u, i) \in \mathcal{O}} \frac{(\hat{e}_{u,i} - e_{u, i})^2}{\hat{p}_{u, i}} + v \|\phi\|_F^2, \]

    where

    \[e_{u,i} = r_{u, i} - f_{\theta}(\bm{x}_{u, i}) \\ \hat{e}_{u,i} = g_{\phi}(\bm{x}_{u,i}); \]

  4. [multi steps] Optimize the prediction model (i.e., train only \(\theta\)) via:

    \[\mathcal{L}_r (\theta) = \sum_{(u, i)} [\hat{e}_{u, i} + \frac{o_{u,i} (e_{u, i} - \hat{e}_{u,i})}{\hat{p}_{u, i}}] + v \|\theta\|_F^2, \]

    where

    \[e_{u, i} = (f_{\theta}(x_{u, i}) - r_{u, i})^2, \\ \hat{e}_{u, i} = (f_{\theta}(x_{u, i}) - g_{\phi}(x_{u,i}) - c)^2, \\ \]

    where \(c\) is a 'constant' (it propagates no gradient) whose value is exactly \(f_{\theta}(\bm{x}_{u,i})\). This trick ensures \(\nabla_{\theta} \hat{e}_{u,i} \neq 0\).

  5. Repeat steps 3 and 4 until convergence.
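The alternating procedure above can be sketched end to end. The sketch below is a deliberately simplified toy, not the paper's implementation: both models are plain per-entry parameter matrices rather than feature-based models, \(\hat{p}\) is taken equal to the true propensity, and the gradients are written out by hand, including the stop-gradient constant \(c\) (whose only numerical effect is the \(-2\hat{g}\) term in \(\nabla_\theta \hat{e}\)).

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 6
R = rng.uniform(1, 5, size=(N, M))          # true ratings (toy)
p = np.full((N, M), 0.5)                    # true propensities
p_hat = p.copy()                            # assume correctly estimated
O = (rng.uniform(size=(N, M)) < p).astype(float)

theta = np.zeros((N, M))   # prediction model: r_hat = f_theta = theta
phi = np.zeros((N, M))     # imputation model: g_phi = phi
lr, v = 0.05, 1e-3

def observed_mse():
    return (O * (theta - R) ** 2).sum() / O.sum()

mse0 = observed_mse()
for _ in range(200):
    # Step 3: update phi on observed entries (here e = r - f).
    e_imp = R - theta
    grad_phi = O * 2 * (phi - e_imp) / p_hat + 2 * v * phi
    phi -= lr * grad_phi

    # Step 4: update theta via the DR loss, with e = (f - r)^2 and
    # e_hat = (f - g - c)^2 where c = stop_grad(f); hence
    # grad_theta e_hat = 2 (f - g - c) = -2 g.
    grad_theta = (-2 * phi
                  + O / p_hat * (2 * (theta - R) + 2 * phi)
                  + 2 * v * theta)
    theta -= lr * grad_theta

# The prediction error on observed entries shrinks.
assert observed_mse() < mse0
```

With one free parameter per matrix entry the gradients are exact, so the sketch only illustrates the alternating schedule and the stop-gradient trick; in the paper both models are parametric functions of \(\bm{x}_{u,i}\) and each step runs several SGD updates.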

Posted @ 2022-07-17 15:43 by 馒头and花卷