Recommender System

Recommender Systems (RSs) are software tools and techniques providing suggestions for items to be of use to a user.

RSs are primarily directed towards individuals who lack sufficient personal experience or competence to evaluate the potentially overwhelming number of alternative items that a Web site, for example, may offer.

Xavier Amatriain discusses the traditional definition and its data mining core.

Traditional definition: a recommender system estimates a utility function that automatically predicts how much a user will like an item.

User interest is implicitly reflected in interaction history, demographics, and context, so estimating it is a typical data mining problem. A recommender system should match a context to a collection of information objects; some methods built for this are called Deep Matching Models for Recommendation. Recommendation is an application of machine learning in the representation + evaluation + optimization form, and we will focus on representation and evaluation.

Evolution of the Recommender Problem
Rating
Ranking
Page Optimization
Context-aware Recommendations

Evaluation of Recommendation System

The evaluation of machine learning algorithms depends on the task. A recommendation system can be evaluated like the machine learning models it builds on, such as regression or classification models. We only take mathematical convenience into consideration in the following methods; the Gini index, coverage rate, and other more realistic factors are not discussed in the following content.

Recommendation Strategies
Collaborative Filtering (CF)
Content-Based Filtering (CBF)
Demographic Filtering (DF)
Knowledge-Based Filtering (KBF)
Hybrid Recommendation Systems

Collaborative Filtering

There are 3 kinds of collaborative filtering: user-based, item-based and model-based collaborative filtering.

The user-based methods are based on the similarity of users. If users ${u}$ and ${v}$ are very similar, we may recommend the items which user ${u}$ bought to user ${v}$, with the explanation that "your friends also bought it".

The item-based methods are based on the similarity of items. If a person adds a toothbrush to a shopping list, it is reasonable to recommend some toothpaste to him or her, with the explanation that "you bought item $X$, and people who bought $X$ also bought $Y$". In the following we focus on model-based collaborative filtering.

Matrix Completion

Matrix completion is to complete the matrix $X$ with missing elements, such as

$$ \min_{Z} Rank(Z) \\ s.t. \sum_{(i,j):Observed}(Z_{(i,j)}-X_{(i,j)})^2\leq \delta $$

Note that the rank of a matrix is neither easy nor robust to compute, so the nuclear norm ${\|Z\|}_{\ast}$ (the sum of the singular values of $Z$) is used as its convex surrogate.

We can apply a customized PPA (proximal point algorithm) to the matrix completion problem

$$ \min_{Z} {\|Z\|}_{\ast} \quad s.t. \quad Z_{\Omega} = X_{\Omega} $$

We let ${Y}\in\mathbb{R}^{n\times n}$ be the Lagrange multiplier for the constraint $Z_{\Omega} = X_{\Omega}$, so the Lagrange function is $$ L(Z,Y) = {\|Z\|}_{\ast} - \langle Y, Z_{\Omega} - X_{\Omega}\rangle. $$

  1. Producing $Y^{k+1}$ by $$Y^{k+1}=\arg\max_{Y} \{L(2Z^k-Z^{k-1},Y)-\frac{s}{2}{\|Y-Y^k\|}^2\};$$
  2. Producing $Z^{k+1}$ by $$Z^{k+1}=\arg\min_{Z} \{L(Z,Y^{k+1}) + \frac{r}{2}{\|Z-Z^k\|}^2\}.$$

Rahul Mazumder, Trevor Hastie, Robert Tibshirani reformulate it as the following:

$$ \min_{Z} f_{\lambda}(Z)=\frac{1}{2}{\|P_{\Omega}(Z-X)\|}_F^2 + \lambda {\|Z\|}_{\ast} $$

where $X$ is the observed matrix, $P_{\Omega}$ is the projector onto the observed entries $\Omega$, and ${\|\cdot\|}_{\ast}$ is the nuclear norm of a matrix.
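
This objective is minimized by the Soft-Impute algorithm of the same paper, which repeatedly fills the missing entries with the current estimate and then soft-thresholds the singular values by $\lambda$. A minimal numpy sketch; the function name and iteration count are illustrative:

```python
import numpy as np

def soft_impute(X, mask, lam, n_iters=100):
    """Soft-Impute sketch: fill missing entries with the current estimate,
    then shrink the singular values by lam (nuclear-norm proximal step).

    X: observed matrix (arbitrary values where mask == 0),
    mask: binary matrix of observed entries, lam: regularization weight.
    """
    Z = np.zeros_like(X, dtype=float)
    for _ in range(n_iters):
        filled = mask * X + (1 - mask) * Z   # keep observations, impute the rest
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)         # soft-threshold the singular values
        Z = (U * s) @ Vt
    return Z
```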

Maximum Margin Matrix Factorization

A novel approach to collaborative prediction is presented, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and present generalization error bounds based on analyzing the Rademacher complexity of low-norm factorizations.

Consider the soft-margin learning, where we minimize a trade-off between the trace norm of $Z$ and its hinge-loss relative to $X_O$: $$ \min_{Z} {\|Z\|}_{\ast} + c \sum_{(u,i)\in O}\max(0, 1 - Z_{ui}X_{ui}). $$

And it can be rewritten as a semi-definite optimization problem (SDP): $$ \min_{A, B} \frac{1}{2}(\operatorname{tr}(A)+\operatorname{tr}(B))+c\sum_{(u,i)\in O}\xi_{ui}, \quad s.t. \quad \begin{bmatrix} A & Z \\ Z^T & B \end{bmatrix} \succeq 0, \quad Z_{ui}X_{ui}\geq 1- \xi_{ui},\ \xi_{ui}\geq 0\ \ \forall (u,i)\in O, $$ where $c$ is a trade-off constant.

Note that this low-norm factorization differs from nonnegative matrix factorization, which constrains the signs of the factors rather than their norms.

**Note:** The data sets we more frequently encounter in collaborative prediction problems contain ordinal ratings $X_{ij} \in \{1, 2, \dots, R\}$, such as $\{1, 2, 3, 4, 5\}$. To relate the real-valued $Z_{ij}$ to the discrete $X_{ij}$, we use $R-1$ thresholds $\theta_{1}, \dots, \theta_{R-1}$.

SVD and Beyond

Suppose we have collected user ${u}$'s explicit evaluation score of item ${i}$, $R_{[u][i]}$, and all such data make up a matrix $R=(R_{[u][i]})$. Since a user cannot evaluate every item, the matrix is incomplete and missing much data. SVD factorizes the matrix into a product of matrices so that $$ \hat{R} = P^{T}Q. $$

And we can predict the score $R_{[u][i]}$ via $$ \hat{R}_{[u][i]} = \hat{r}_{u,i} = \left<P_u, Q_i\right> = \sum_f p_{u,f} q_{i,f} $$

where $P_u, Q_i$ is the ${u}$-th column of ${P}$ and the ${i}$-th column of ${Q}$, respectively. And we can define the cost function

$$ C(P,Q) = \sum_{(u,i):Observed}(r_{u,i}-\hat{r}_{u,i})^{2}=\sum_{(u,i):Observed}(r_{u,i}-\sum_f p_{u,f}q_{i,f})^{2},\quad \arg\min_{P_u, Q_i} C(P, Q) $$


Additionally, we can add a regularization term to the cost function to avoid over-fitting:

$$ C(P,Q) = \sum_{(u,i):Observed}(r_{u,i}-\sum_f p_{u,f}q_{i,f})^{2}+\lambda_u{\|P_u\|}^2+\lambda_i{\|Q_i\|}^2, $$

where $\lambda_u$ is usually set equal to $\lambda_i$.

It is called the regularized singular value decomposition or Regularized SVD.

Funk-SVD considers the users' preferences or biases. It predicts the scores by $$ \hat{r}_{u,i} = \mu + b_u + b_i + \left< P_u, Q_i \right> $$ where $\mu, b_u, b_i$ are the global mean, the user bias, and the item bias, respectively. And the cost function is defined as $$ \min\sum_{(u,i): Observed}(r_{u,i} - \hat{r}_{u,i})^2 + \lambda ({\|P_u\|}^2+{\|Q_i\|}^2+{\|b_i\|}^2+{\|b_u\|}^2). $$
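
A minimal numpy sketch of one stochastic gradient descent update for this biased model; the hyperparameters `lr` and `lam` are illustrative:

```python
import numpy as np

def sgd_step(P, Q, bu, bi, mu, u, i, r, lr=0.01, lam=0.1):
    """One SGD update for r_hat = mu + b_u + b_i + <P_u, Q_i>
    with L2 regularization on all learned parameters."""
    e = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])   # prediction error
    bu[u] += lr * (e - lam * bu[u])
    bi[i] += lr * (e - lam * bi[i])
    P[u], Q[i] = (P[u] + lr * (e * Q[i] - lam * P[u]),
                  Q[i] + lr * (e * P[u] - lam * Q[i]))  # simultaneous update
```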

SVD ++ predicts the scores by

$$ \hat{r}_{u,i} = \mu + b_u + b_i + \left(P_u + |N(u)|^{-0.5}\sum_{j\in N(u)} y_j\right) Q_i^{T} $$ where $y_j$ is the implicit feedback of item ${j}$ and $N(u)$ is user ${u}$'s item set. And it can be decomposed into 3 parts:

  • $\mu + b_u + b_i$ is the base-line prediction;
  • $\left<P_u, Q_i\right>$ is the SVD of the rating matrix;
  • $\left<|N(u)|^{-0.5}\sum_{j\in N(u)} y_j, Q_i\right>$ is the implicit feedback, where $N(u)$ is user ${u}$'s item set and $y_j$ is the implicit feedback of item $j$.

We learn the values of involved parameters by minimizing the regularized squared error function.

Inductive Matrix Completion

One possible improvement of this cost function is to design a more appropriate loss function than the squared error.


Inductive Matrix Completion (IMC) is an algorithm for recommender systems with side-information of users and items. The IMC formulation incorporates features associated with rows (users) and columns (items) in matrix completion, so that it enables predictions for users or items that were not seen during training, and for which only features are known but no dyadic information (such as ratings or linkages).

IMC assumes that the associations matrix is generated by applying feature vectors associated with its rows as well as columns to a low-rank matrix ${Z}$. The goal is to recover ${Z}$ using observations from ${P}$.

The inputs $x_i, y_j$ are feature vectors. The entry $P_{(i, j)}$ of the matrix is modeled as $P_{(i, j)}=x_i^T Z y_j$, and ${Z}$ is recovered in the form $Z=WH^T$.

$$ \min_{W,H} \sum_{(i,j)\in \Omega}\ell(P_{(i,j)}, x_i^T W H^T y_j) + \frac{\lambda}{2}({\|W\|}_F^2+{\|H\|}_F^2) $$ The loss function $\ell$ penalizes the deviation of estimated entries from the observations, and it can take various forms, such as the squared error $\ell(a,b)=(a-b)^2$ or the logistic loss $\ell(a,b) = \log(1 + \exp(-ab))$.
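
As a concrete reading of this objective, here is a numpy sketch that evaluates it with the squared error loss; the function name and argument shapes are illustrative:

```python
import numpy as np

def imc_objective(W, H, X, Y, P, omega, lam):
    """Squared-error IMC objective over the observed entries omega.

    X: (m, d1) row (user) features, Y: (n, d2) column (item) features,
    W: (d1, k), H: (d2, k) factors of Z = W H^T,
    omega: iterable of observed (i, j) index pairs.
    """
    loss = sum((P[i, j] - X[i] @ W @ H.T @ Y[j]) ** 2 for i, j in omega)
    return loss + 0.5 * lam * (np.sum(W ** 2) + np.sum(H ** 2))
```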

Probabilistic Matrix Factorization

In linear regression, the least squares method is equivalent to maximum likelihood estimation when the errors follow a standard normal distribution.

| Model | Formulation |
| --- | --- |
| Regularized SVD | $C(P,Q) = \sum_{(u,i):Observed}(r_{(u,i)}-\sum_f p_{(u,f)} q_{(i,f)})^{2}+\lambda_u\Vert P_u\Vert^2+\lambda_i\Vert Q_i\Vert^2$ |
| Probabilistic model | $r_{u,i}\sim N(\sum_f p_{(u,f)} q_{(i,f)},\sigma^2),\ P_u\sim N(0,\sigma_u^2 I),\ Q_i\sim N(0,\sigma_i^2 I)$ |

And $\sigma_u^2$ and $\sigma_i^2$ are related to the regularization coefficients $\lambda_u$ and $\lambda_i$.

So we can reformulate the optimization problem as maximum a posteriori (MAP) estimation.
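
As a sketch of this equivalence, the negative log-posterior of the probabilistic model above is, up to additive constants,

$$ -\log p(P,Q\mid r) = \frac{1}{2\sigma^2}\sum_{(u,i):Observed}\Big(r_{u,i}-\sum_f p_{u,f}q_{i,f}\Big)^2 + \frac{1}{2\sigma_u^2}\sum_u {\|P_u\|}^2 + \frac{1}{2\sigma_i^2}\sum_i {\|Q_i\|}^2 + \text{const}, $$

which matches the regularized SVD cost with $\lambda_u = \sigma^2/\sigma_u^2$ and $\lambda_i = \sigma^2/\sigma_i^2$.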

Collaborative Less-is-More Filtering (CLiMF)

Sometimes the information we can collect about a user is implicit, such as clicks on items.

In CLiMF the model parameters are learned by directly maximizing the Mean Reciprocal Rank (MRR).

Its objective function is $$ F(U,V)=\sum_{i=1}^{M}\sum_{j=1}^{N} Y_{ij} \Big[\ln g(U_{i}^{T}V_{j})+\sum_{k=1}^{N}\ln \big(1 - Y_{ik}\, g(U_{i}^{T}V_{k}-U_{i}^{T}V_{j})\big)\Big] -\frac{\lambda}{2}({\|U\|}^2 + {\|V\|}^2) $$

where ${M, N}$ are the number of users and items, respectively. Additionally, $\lambda$ denotes the regularization coefficient and $Y_{ij}$ denotes the binary relevance score of item ${j}$ to user ${i}$, i.e., $Y_{ij} = 1$ if item ${j}$ is relevant to user ${i}$, 0 otherwise. The function $g$ is the logistic function $g(x)=\frac{1}{1+\exp(-x)}$. The vector $U_i$ denotes a d-dimensional latent factor vector for user ${i}$, and $V_j$ a d-dimensional latent factor vector for item ${j}$.

| Numbers | Factors | Others |
| --- | --- | --- |
| $M$: the number of users | $U_i$: latent factor vector for user ${i}$ | $Y_{ij}$: binary relevance score |
| $N$: the number of items | $V_j$: latent factor vector for item ${j}$ | $g$: logistic function |

We use stochastic gradient ascent to maximize the objective function.

BellKor's Pragmatic Chaos

Until now, we have considered the recommendation task as a regression prediction process, which is really common in machine learning. Boosting or stacking methods may help us to enhance these models.

A key to achieving highly competitive results on the Netflix data is usage of sophisticated blending schemes, which combine the multiple individual predictors into a single final solution. This significant component was managed by our colleagues at the Big Chaos team. Still, we were producing a few blended solutions, which were later incorporated as individual predictors in the final blend. Our blending techniques were applied to three distinct sets of predictors. First is a set of 454 predictors, which represent all predictors of the BellKor’s Pragmatic Chaos team for which we have matching Probe and Qualifying results. Second, is a set of 75 predictors, which the BigChaos team picked out of the 454 predictors by forward selection. Finally, a set of 24 BellKor predictors for which we had matching Probe and Qualifying results. from Netflix Prize.


Another advantage of collaborative filtering or matrix completion is that it still works when the entries of the matrix are binary or implicit information, as follows.

Recommendation with Implicit Information

Explicit and implicit feedback

WRMF is simply a modification of this loss function: $$ C(P,Q)_{WRMF} = \sum_{(u,i):Observed}c_{u,i}(I_{u,i} - \sum_f p_{u,f}q_{i,f})^{2} + \lambda_u{\|P_u\|}^2 + \lambda_i{\|Q_i\|}^2. $$

We make the assumption that if a user has interacted with an item at all, then $I_{u,i} = 1$; otherwise, $I_{u,i} = 0$. If we take $d_{u,i}$ to be the number of times user ${u}$ has clicked on item ${i}$ on a website, then $$c_{u,i}=1+\alpha d_{u,i}$$ where $\alpha$ is a hyperparameter determined by cross validation. The new term $C=(c_{u,i})$ in the cost function is called the confidence matrix.

WRMF does not make the assumption that a user who has not interacted with an item does not like the item. WRMF does assume that such a user has a negative preference towards that item, but we can choose how confident we are in that assumption through the confidence hyperparameter.

Alternating least squares (ALS) gives an analytic solution to this optimization problem by setting the gradients equal to 0.
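
A numpy sketch of the closed-form user update in one ALS half-sweep (the item update is symmetric); names and shapes are illustrative:

```python
import numpy as np

def als_user_update(P, Q, I, C, lam):
    """Closed-form WRMF update of every user vector, items held fixed.

    P: (n_users, k), Q: (n_items, k), I: binary preference matrix,
    C: confidence matrix with c_{u,i} = 1 + alpha * d_{u,i}.
    """
    k = Q.shape[1]
    for u in range(P.shape[0]):
        Cu = np.diag(C[u])                    # per-item confidences of user u
        A = Q.T @ Cu @ Q + lam * np.eye(k)    # (k, k) normal-equations matrix
        b = Q.T @ Cu @ I[u]                   # (k,) right-hand side
        P[u] = np.linalg.solve(A, b)          # minimizer of the quadratic in P_u
    return P
```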


More on Matrix Factorization

Hyperbolic Recommender Systems

Many well-established recommender systems are based on representation learning in Euclidean space. In these models, matching functions such as the Euclidean distance or inner product are typically used for computing similarity scores between user and item embeddings. This paper investigates the notion of learning user and item representations in hyperbolic space.

Given a user ${u}$ and an item ${v}$ that both lie in the Poincare ball $B^n$, the distance between two points $x, y \in B^n$ is given by $$d_p(x, y)=\cosh^{-1}\left(1+2\frac{{\|x-y\|}^2}{(1-{\|x\|}^2)(1-{\|y\|}^2)}\right).$$
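
A direct numpy transcription of this distance, with a small `eps` guard added for numerical safety:

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance between points x, y inside the Poincare ball."""
    sq_dist = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / max(denom, eps))
```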

Hyperbolic Bayesian Personalized Ranking (HyperBPR) leverages BPR pairwise learning to minimize the pairwise ranking loss between the positive and negative items. Given a user ${u}$ and an item ${v}$ that both lie in the Poincare ball $B^n$, we take: $$\alpha(u, v) = f(d_p(u, v))$$ where $f(\cdot)$ is simply chosen as a linear function $f(x) = \beta x + c$, with scalar parameters $\beta\in\mathbb{R}$ and $c\in\mathbb{R}$ learned along with the network. The objective function is defined as follows: $$\arg\min_{\Theta} \sum_{i, j, k} -\ln\sigma\big(\alpha(u_i, v_j) - \alpha(u_i, v_k)\big) + \lambda {\|\Theta\|}_2^2$$

where $(i, j, k)$ is the triplet that belongs to the set ${D}$ that contains all pairs of positive and negative items for each user; $\sigma$ is the logistic sigmoid function; $\Theta$ represents the model parameters; and $\lambda$ is the regularization parameter.

The parameters of our model are learned by Riemannian SGD (RSGD).

Factorization Machines (FM)

The matrix completion models used in recommender systems, such as regularized SVD, are linear in their latent factors and only model the user-item interaction. Factorization Machines (FM) are inspired by these factorization models. FM represents each feature by an embedding vector and models the second-order feature interactions: $$ \hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i+\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left<v_i, v_j\right> x_i x_j = \underbrace{w_0 + \left<w, x\right>}_{\text{First-order: Linear Regression}} + \underbrace{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left<v_i, v_j\right> x_i x_j}_{\text{Second-order: pair-wise interactions between features}} $$

where the model parameters that have to be estimated are $$ w_0 \in \mathbb{R}, w\in\mathbb{R}^n, V\in\mathbb{R}^{n\times k}. $$

And $\left<\cdot,\cdot\right>$ is the dot (inner) product of two vectors, so that $\left<v_i, v_j\right>=\sum_{f=1}^{k}v_{i,f} \cdot v_{j,f}$. A row $v_i$ within ${V}$ describes the ${i}$-th latent variable with ${k}$ factors for $x_i$.

And the linear regression $w_0 + \sum_{i=1}^{n} w_i x_i$ is called the first-order part; the pair-wise interactions between features $\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left<v_i, v_j\right> x_i x_j$ are called the second-order part.

Why do we call it a factorization machine? Where is the factorization? If we collect the interaction weights into a matrix ${[W]}_{ij}=w_{ij}= \left<v_i, v_j\right>$, then $W=V V^T$, a low-rank factorization.

In order to reduce the computational complexity, the second-order part $\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left<v_i, v_j\right> x_i x_j$ is rewritten in the following form: $$\frac{1}{2}\sum_{l=1}^{k}\left[{\left(\sum_{i=1}^{n}v_{il}x_i\right)}^2-\sum_{i=1}^{n}(v_{il}x_i)^2\right].$$
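
This identity reduces the pairwise term from $O(kn^2)$ to $O(kn)$. A numpy sketch of the full FM prediction using it:

```python
import numpy as np

def fm_predict(w0, w, V, x):
    """FM prediction with the O(kn) form of the second-order term.

    w0: scalar bias, w: (n,) linear weights,
    V: (n, k) feature embeddings, x: (n,) feature vector.
    """
    linear = w0 + w @ x
    s = V.T @ x                    # per-factor sums: sum_i v_{il} x_i
    s2 = (V ** 2).T @ (x ** 2)     # per-factor sums of squares
    return linear + 0.5 * np.sum(s ** 2 - s2)
```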

Field-aware Factorization Machines (FFM)

In FMs, every feature has only one latent vector to learn the latent effect with any other features. In FFMs, each feature has several latent vectors. Depending on the field of the other feature, one of them is used to do the inner product. Mathematically, $$ \hat{y}=\sum_{j_1=1}^{n}\sum_{j_2=j_1+1}^{n}\left<v_{j_1,f_2}, v_{j_2,f_1}\right> x_{j_1} x_{j_2} $$ where $f_1$ and $f_2$ are respectively the fields of $j_1$ and $j_2$.
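
A naive numpy sketch of this field-aware pairwise term, assuming the latent vectors are stored as one vector per (feature, field) pair:

```python
import numpy as np

def ffm_pairwise(x, fields, V):
    """Field-aware pairwise term: feature j1 interacts with feature j2
    through the latent vector it keeps for j2's field, and vice versa.

    x: (n,) feature values, fields: (n,) field index of each feature,
    V: (n, n_fields, k) latent vectors.
    """
    n = len(x)
    y = 0.0
    for j1 in range(n - 1):
        for j2 in range(j1 + 1, n):
            f1, f2 = fields[j1], fields[j2]
            y += (V[j1, f2] @ V[j2, f1]) * x[j1] * x[j2]
    return y
```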

Deep Learning for Recommender System

Deep learning is powerful in processing visual and text information so that it helps to find the interests of users such as Deep Interest Network, xDeepFM and more.

Deep learning models for recommender systems may trace back to the restricted Boltzmann machine. Deep learning models are powerful information extractors, and deep learning is really popular in recommender systems, with libraries such as spotlight.

Restricted Boltzmann Machines for Collaborative Filtering (RBM)

Let ${V}$ be a $K\times m$ observed binary indicator matrix with $v_i^k = 1$ if the user rated item ${i}$ as ${k}$ and ${0}$ otherwise. We also let $h_j$, $j = 1, \dots, F,$ be the binary values of hidden (latent) variables, that can be thought of as representing stochastic binary features that have different values for different users.

We use a conditional multinomial distribution (a "softmax") for modeling each column of the observed "visible" binary rating matrix ${V}$ and a conditional Bernoulli distribution for modeling the "hidden" user features ${h}$: $$ p(v_i^k = 1 \mid h) = \frac{\exp(b_i^k + \sum_{j=1}^{F} h_j W_{i,j}^{k})}{\sum_{l=1}^{K}\exp( b_i^l + \sum_{j=1}^{F} h_j W_{i, j}^{l})}, \quad p( h_j = 1 \mid V) = \sigma\left(b_j + \sum_{i=1}^{m}\sum_{k=1}^{K} v_i^k W_{i,j}^k\right) $$ where $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the logistic function, $W_{i,j}^{k}$ is a symmetric interaction parameter between feature ${j}$ and rating ${k}$ of item ${i}$, $b_i^k$ is the bias of rating ${k}$ for item ${i}$, and $b_j$ is the bias of feature $j$.

The marginal distribution over the visible ratings ${V}$ is $$ p(V) = \sum_{h}\frac{\exp(-E(V,h))}{\sum_{V^{\prime},h^{\prime}} \exp(-E(V^{\prime},h^{\prime}))} $$ with an "energy" term given by:

$$ E(V,h) = -\sum_{i=1}^{m}\sum_{j=1}^{F}\sum_{k=1}^{K}W_{i,j}^{k} h_j v_i^k - \sum_{i=1}^{m}\sum_{k=1}^{K} v_i^k b_i^k -\sum_{j=1}^{F} h_j b_j. $$ The items with missing ratings do not make any contribution to the energy function.

The parameter updates required to perform gradient ascent in the log-likelihood over the visible ratings ${V}$ can be obtained as $$ \Delta W_{i,j}^{k} = \epsilon \frac{\partial\log(p(V))}{\partial W_{i,j}^{k}} $$ where $\epsilon$ is the learning rate. The authors use Contrastive Divergence to approximate the gradient.

We can also model the "hidden" user features ${h}$ as Gaussian latent variables: $$ p(v_i^k = 1 \mid h) = \frac{\exp(b_i^k+\sum_{j=1}^{F}h_j W_{i,j}^{k})}{\sum_{l=1}^{K}\exp(b_i^l+\sum_{j=1}^{F}h_j W_{i,j}^{l})}, \quad p( h_j = h \mid V) = \frac{1}{\sqrt{2\pi}\sigma_j} \exp\left(-\frac{(h - b_j -\sigma_j \sum_{i=1}^{m}\sum_{k=1}^{K} v_i^k W_{i,j}^k)^2}{2\sigma_j^2}\right) $$ where $\sigma_j^2$ is the variance of hidden unit ${j}$.

AutoRec

AutoRec is a novel autoencoder framework for collaborative filtering (CF). Empirically, AutoRec’s compact and efficiently trainable model outperforms state-of-the-art CF techniques (biased matrix factorization, RBMCF and LLORMA) on the Movielens and Netflix datasets.

Formally, the objective function for the Item-based AutoRec (I-AutoRec) model is, for regularisation strength $\lambda > 0$,

$$ \min_{\theta}\sum_{i=1}^{n} {\|r^{i}-h(r^{i}\mid\theta)\|}_{\mathcal{O}}^2 +\frac{\lambda}{2}({\|W\|}_F^{2}+ {\|V\|}_F^{2}) $$

where $r^{i}\in\mathbb{R}^{d}, i=1,2,\dots,n$, are the partially observed vectors and ${\|\cdot\|}_{\mathcal{O}}^2$ means that we only consider the contribution of observed ratings. The function $h(r\mid\theta)$ is the reconstruction of the input $r\in\mathbb{R}^{d}$:

$$ h(r|\theta) = f(W\cdot g(Vr+\mu)+b) $$

for activation functions $f, g$ as described in dimension reduction. Here $\theta = \{W,V,\mu,b\}$.
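
A minimal numpy sketch of this reconstruction, assuming the common choice of a sigmoid $g$ and an identity $f$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autorec_reconstruct(r, W, V, mu, b):
    """I-AutoRec reconstruction h(r|theta) = f(W g(V r + mu) + b),
    here with g = sigmoid and f = identity.

    r: (d,) zero-filled partially observed rating vector,
    V: (k, d), mu: (k,), W: (d, k), b: (d,).
    """
    hidden = sigmoid(V @ r + mu)   # g(V r + mu): hidden code
    return W @ hidden + b          # f(.): reconstructed ratings
```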

Wide & Deep Model

The output of this model is $$ P(Y=1\mid x) = \sigma(W_{wide}^T[x,\phi(x)] + W_{deep}^T a^{(l_f)}+b) $$ where the wide part deals with categorical features such as user demographics, and the deep part deals with continuous features.

Deep FM

DeepFM ensembles FM and DNN to learn both second-order and higher-order feature interactions: $$\hat{y}=\sigma(y_{FM} + y_{DNN})$$ where $\sigma$ is the sigmoid function so that $\hat{y}\in[0, 1]$ is the predicted CTR, $y_{FM}$ is the output of the FM component, and $y_{DNN}$ is the output of the deep component.

The FM component is a factorization machine and the output of FM is the summation of an Addition unit and a number of Inner Product units:

$$ y_{FM} = \left<w, x\right>+\sum_{j_1=1}^{n}\sum_{j_2=j_1+1}^{n}\left<v_{j_1}, v_{j_2}\right> x_{j_1} x_{j_2}. $$

The deep component is a feed-forward neural network, which is used to learn high-order feature interactions. A personal guess: since the exponential in activation functions expands as $e^x=1+x+\frac{x^2}{2!}+\cdots+\frac{x^n}{n!}+\cdots$, it contains feature interactions of every order.

We would like to point out the two interesting features of this network structure:

  1. while the lengths of different input field vectors can be different, their embeddings are of the same size $(k)$;
  2. the latent feature vectors $(V)$ in FM now serve as network weights which are learned and used to compress the input field vectors to the embedding vectors.

It is worth pointing out that FM component and deep component share the same feature embedding, which brings two important benefits:

  1. it learns both low- and high-order feature interactions from raw features;
  2. there is no need for expert feature engineering of the input.

Neural Factorization Machines

$$ \hat{y} = w_0 + \left<w, x\right> + f(x) $$ where the first and second terms are the linear regression part similar to that for FM, which models global bias of data and weight of features. The third term $f(x)$ is the core component of NFM for modelling feature interactions, which is a multi-layered feedforward neural network.

The Bi-Interaction layer, with its Bi-Interaction pooling operation, is the architectural innovation of NFM.
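
For reference, Bi-Interaction pooling (as defined in the NFM paper) maps the set of input embeddings to a single $k$-dimensional vector by summing the element-wise products of all feature pairs, which again admits an $O(kn)$ form:

$$ f_{BI}(\mathcal{V}_x)=\sum_{i=1}^{n}\sum_{j=i+1}^{n} x_i v_i \odot x_j v_j = \frac{1}{2}\Big[\big(\sum_{i=1}^{n} x_i v_i\big)^2 - \sum_{i=1}^{n}(x_i v_i)^2\Big], $$

where the squares are taken element-wise.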

Attentional Factorization Machines

Attentional Factorization Machine (AFM) learns the importance of each feature interaction from data via a neural attention network.

We employ the attention mechanism on feature interactions by performing a weighted sum on the interacted vectors:

$$\sum_{(i, j)} a_{(i, j)}(V_i \odot V_j) x_i x_j$$

where $a_{(i, j)}$ is the attention score of the feature interaction between $i$ and $j$.
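
A naive numpy sketch of this attention-weighted pooling, assuming the one-layer ReLU attention network used in the AFM paper; all names and shapes are illustrative:

```python
import numpy as np

def afm_interaction(V, x, W, b, h):
    """Attention-weighted sum of pairwise interacted vectors.

    V: (n, k) feature embeddings, x: (n,) features,
    W: (t, k), b: (t,), h: (t,): attention-network parameters.
    """
    n = V.shape[0]
    pairs = [(V[i] * V[j]) * (x[i] * x[j])          # element-wise interactions
             for i in range(n - 1) for j in range(i + 1, n)]
    logits = np.array([h @ np.maximum(W @ e + b, 0.0) for e in pairs])
    a = np.exp(logits - logits.max())
    a /= a.sum()                                    # softmax attention scores
    return sum(ai * e for ai, e in zip(a, pairs))   # pooled (k,) vector
```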

xDeepFM

It mainly consists of 3 parts: the Embedding Layer, the Compressed Interaction Network (CIN) and a DNN.



Deep Matrix Factorization

Deep Geometric Matrix Completion

It’s easy to observe how better matrix completions can be achieved by considering the sparse matrix as defined over two different graphs: a user graph and an item graph. From a signal processing point of view, the matrix ${X}$ can be considered as a bi-dimensional signal defined over two distinct domains. Instead of recurring to multigraph convolutions realized over the entire matrix ${X}$, two independent single-graph GCNs (graph convolution networks) can be applied on matrices ${W}$ and ${H}$.

Given the aforementioned multi-graph convolutional layers, the last step that remains concerns the choice of the architecture to use for reconstructing the missing information. Every (user, item) pair in the multi-graph approach and every user/item in the separable one present in this case an independent state, which is updated (at every step) by means of the features produced by the selected GCN.

Collaborative Deep Learning for Recommender Systems

Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendations. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse. To address this problem, we generalize recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose in this paper a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.

Given part of the ratings in ${R}$ and the content information $X_c$, the problem is to predict the other ratings in ${R}$, where row ${j}$ of the content information matrix $X_c$ is the bag-of-words vector $X_{c;j\ast}$ for item ${j}$ based on a vocabulary of size ${S}$.

Stacked denoising autoencoders (SDAE) are feedforward neural networks for learning representations (encodings) of the input data by learning to predict the clean input itself in the output. Using the Bayesian SDAE as a component, the generative process of CDL is defined as follows:

  1. For each layer ${l}$ of the SDAE network,

    • For each column ${n}$ of the weight matrix $W_l$, draw $$W_{l;\ast n} \sim \mathcal{N}(0,\lambda_w^{-1} I_{K_l}).$$
    • Draw the bias vector $$b_l \sim \mathcal{N}(0,\lambda_w^{-1} I_{K_l}).$$
    • For each row ${j}$ of $X_l$, draw $$X_{l;j\ast}\sim \mathcal{N}(\sigma(X_{l-1;j\ast}W_l + b_l), \lambda_s^{-1} I_{K_l}).$$
  2. For each item ${j}$,

    • Draw a clean input $$X_{c;j\ast}\sim \mathcal{N}(X_{L, j\ast}, \lambda_n^{-1} I_{K_l}).$$
    • Draw a latent item offset vector $\epsilon_j \sim \mathcal{N}(0, \lambda_v^{-1} I_{K_l})$ and then set the latent item vector to be: $$v_j=\epsilon_j+X^T_{\frac{L}{2}, j\ast}.$$
  3. Draw a latent user vector for each user ${i}$: $$u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_{K_l}).$$

  4. Draw a rating $R_{ij}$ for each user-item pair $(i; j)$: $$R_{ij}\sim \mathcal{N}(u_i^T v_j, C_{ij}^{-1}).$$

Here $\lambda_w, \lambda_s, \lambda_n, \lambda_u$and $\lambda_v$ are hyperparameters and $C_{ij}$ is a confidence parameter similar to that for CTR ($C_{ij} = a$ if $R_{ij} = 1$ and $C_{ij} = b$ otherwise).

And the joint log-likelihood of these parameters is $$ \begin{aligned} L=&-\frac{\lambda_u}{2}\sum_{i} {\|u_i\|}_2^2-\frac{\lambda_w}{2}\sum_{l} \big[{\|W_l\|}_F^2+{\|b_l\|}_2^2\big] \\ &-\frac{\lambda_v}{2}\sum_{j} {\|v_j - X^T_{\frac{L}{2};j\ast}\|}_2^2-\frac{\lambda_n}{2}\sum_{j} {\|X_{c;j\ast}-X_{L;j\ast}\|}_2^2 \\ &-\frac{\lambda_s}{2}\sum_{l}\sum_{j} {\|\sigma(X_{l-1;j\ast}W_l + b_l)-X_{l;j\ast}\|}_2^2 -\sum_{i,j} \frac{C_{ij}}{2}(R_{ij}-u_i^Tv_j)^2. \end{aligned} $$

It is not easy to prove that it converges.

Deep Matching Models for Recommendation

It is essential for a recommender system to find items that match the users' demands. Its difference from web search is that a recommender system provides item information even if the users' demands or general interests are not explicitly provided. It sounds like a modern crystal ball that reads your mind.

In A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems the authors propose to extract rich features from user’s browsing and search histories to model user’s interests. The underlying assumption is that, users’ historical online activities reflect a lot about user’s background and preference, and therefore provide a precise insight of what items and topics users might be interested in.

Its training data set and test data are $\{(\mathrm{X}_i, y_i, r_i)\mid i =1, 2, \cdots, n\}$ and $(\mathrm{X}_{n+1}, y_{n+1})$, respectively. The matching model is trained using the training data set: a class of matching functions $\mathcal F= \{f(x, y)\}$ is defined, the value of a function $r(\mathrm{X}, y)\in \mathcal F$ is a real number in a set $R$, and $r_{n+1}$ is predicted as $r_{n+1} = r(\mathrm{X}_{n+1}, y_{n+1})$.

The data are assumed to be generated according to the distributions $(x, y) \sim P(X,Y)$, $r \sim P(R \mid X,Y)$. The goal of the learning task is to select a matching function $f(x, y)$ from the class $\mathcal F$ based on the observation of the training data. The learning task then becomes the following optimization problem: $$\arg\min_{r\in \mathcal F}\sum_{i=1}^{n}L(r_i, r(x_i, y_i))+\Omega(r)$$ where $L(\cdot, \cdot)$ denotes a loss function and $\Omega(\cdot)$ denotes regularization.

In fact, the inputs x and y can be instances (IDs), feature vectors, and structured objects, and thus the task can be carried out at instance level, feature level, and structure level.

And $r(x, y)$ is supposed to be non-negative in some cases.

Framework of Matching

| Layer | Choices |
| --- | --- |
| Output | MLP |
| Aggregation | Pooling, Concatenation |
| Interaction | Matrix, Tensor |
| Representation | MLP, CNN, LSTM |
| Input | ID Vectors $\mathrm{X}$, Feature Vectors $y$ |

Sometimes, matching model and ranking model are combined and trained together with pairwise loss. Deep Matching models takes the ID vectors and features together as the input to a deep neural network to train the matching scores including Deep Matrix Factorization, AutoRec, Collaborative Denoising Auto-Encoder, Deep User and Image Feature, Attentive Collaborative Filtering, Collaborative Knowledge Base Embedding.

Ensemble Methods for Recommender System

The RecSys task can be considered as a set of regression or classification tasks, so we can apply ensemble methods to these models, as BellKor's Pragmatic Chaos used a blended solution to win the Netflix Prize. In essence, that is bagging or blending, an ensemble strategy used to avoid over-fitting and reduce variance.

In this section, boosting is the focus; it aims to reduce the error and boost the performance of a weak learner.

There are two common methods to construct a stronger learner from a weaker learner: (1) re-weight the samples and learn from the errors, as in AdaBoost; (2) train another learner to approximate the residual error, as in gradient boosting.

BoostFM

BoostFM integrates boosting into factorization models during the process of item ranking. Specifically, BoostFM is an adaptive boosting framework that linearly combines multiple homogeneous component recommenders, which are repeatedly constructed on the basis of the individual FM model by a re-weighting scheme.

BoostFM

  • Input: The observed context-item interactions, or training data $S =\{(\mathbf{x}_i, y_i)\}$, and parameters $E$ and $T$.
  • Output: The strong recommender $g^{(T)}$.
  • Initialize $Q_{ci}^{(t)}=1/|S|, g^{(0)}=0, \forall (c, i)\in S$.
  • for $t = 1 \to T$ do
    1. Create component recommender $\hat{y}^{(t)}$ with $\mathbf{Q}^{(t)}$ on $\mathbf{S}$, $\forall (c,i) \in \mathbf{S}$, i.e., the Component Recommender Learning Algorithm;
    1. Compute the ranking accuracy $E[\hat{r}(c, i, \hat{y}^{(t)})], \forall (c,i) \in \mathbf{S}$;
    1. Compute the coefficient $\beta_t$, $$ \beta_t = \ln \left(\frac{\sum_{(c,i) \in \mathbf{S}} \mathbf{Q}^{(t)}_{ci}\{1 + E[\hat{r}(c, i, \hat{y}^{(t)})]\}}{\sum_{(c,i) \in \mathbf{S}} \mathbf{Q}^{(t)}_{ci}\{1- E[\hat{r}(c, i, \hat{y}^{(t)})]\}}\right)^{\frac{1}{2}} ; $$
    1. Create the strong recommender $g^{(t)}$, $$ g^{(t)} = \sum_{h=1}^{t} \beta_h \hat{y}^{(h)} ;$$
    1. Update the weight distribution $\mathbf{Q}^{(t+1)}$, $$ \mathbf{Q}^{(t+1)}_{ci} = \frac{\exp(E[\hat{r}(c, i, \hat{y}^{(t)})])}{\sum_{(c,i)\in \mathbf{S}} E[\hat{r}(c, i, \hat{y}^{(t)})]} ; $$
  • end for

Component Recommender

Naturally, it is feasible to exploit the L2R techniques to optimize Factorization Machines (FM). There are two major approaches in the field of L2R, namely, pairwise and listwise approaches. In the following, we demonstrate ranking factorization machines with both pairwise and listwise optimization.

Weighted Pairwise FM (WPFM)

Weighted ‘Listwise’ FM (WLFM)

Gradient Boosting Factorization Machines

The Gradient Boosting Factorization Machine (GBFM) model incorporates a feature selection algorithm with Factorization Machines into a unified framework.

Gradient Boosting Factorization Machine Model

  • Input: Training data $S =\{(\mathbf{x}_i, y_i)\}$.
  • Output: $\hat{y}_S =y_0(\mathrm{x}) + {\sum}_{s=1}^{S}\left<v_{si}, v_{sj}\right>$.
  • Initialize the rating prediction function as $\hat{y}_0(\mathrm{x})$;
  • for $s = 1 \to S$ do
    1. Select interaction features $C_p$ and $C_q$ by the Greedy Feature Selection Algorithm;
    1. Estimate the latent feature matrices $V_p$ and $V_q$;
    1. Update $\hat{y}_s(\mathrm{x}) = \hat{y}_{s-1}(\mathrm{x}) + {\sum}_{i\in C_p}{\sum}_{j\in C_q} \mathbb{I}[i,j\in \mathrm{x}]\left<V_{p}^{i}, V_{q}^{j}\right>$;
  • end for

where ${s}$ is the iteration step of the learning algorithm. At step ${s}$, we greedily select two interaction features $C_p$ and $C_q$; $\mathbb{I}$ is the indicator function, whose value is 1 if the condition holds and 0 otherwise.

Greedy Feature Selection Algorithm

From the view of gradient boosting machines, at each step ${s}$ we would like to find a function ${f}$ in the function space ${F}$ that minimizes the objective function: $$L=\sum_{i}\ell(\hat{y}_s(\mathrm{x}_i), y_i)+\Omega(f)$$

where $\hat{y}_s(\mathrm{x}) = \hat{y}_{s-1}(\mathrm{x}) + \alpha_s f_s(\mathrm{x})$.

We heuristically assume that the function ${f}$ has the following form: $$ f_{\ell}(\mathrm{x})={\prod}_{t=1}^{\ell} q_{C_{i}(t)}(\mathrm{x}) $$ where the function $q$ maps the latent feature vector $\mathrm{x}$ to the real domain: $$ q_{C_{i}(t)}(\mathrm{x})=\sum_{j\in C_{i}(t)}\mathbb{I}[j\in \mathrm{x}]w_{t}. $$

It is hard for a general convex loss function $\ell$ to search function ${f}$ to optimize the objective function: $L=\sum_{i}\ell(\hat{y}_s(\mathrm{x}_i), y_i)+\Omega(f)$.

The most common way is to approximate it by least-squares minimization, i.e., with $\ell={\|\cdot\|}_2^2$. As in XGBoost, we take the second-order Taylor expansion of the loss function $\ell$, and the problem is finalized to find the ${i(t)}$-th feature which minimizes:

$$\arg\min_{i(t)\in \{0, \dots, m\}} \sum_{i=1}^{n} h_i\left(\frac{g_i}{h_i}-f_{t-1}(\mathrm{x}_i)\, q_{C_{i}(t)}(\mathrm{x}_i)\right)^2 + {\|\theta\|}_2^2 $$ where $g_i$ and $h_i$ denote the negative first derivative and the second derivative of the loss at instance ${i}$.

Gradient Boosted Categorical Embedding and Numerical Trees

Gradient Boosted Categorical Embedding and Numerical Trees (GB-CSENT) is to combine Tree-based Models and Matrix-based Embedding Models in order to handle numerical features and large-cardinality categorical features. A prediction is based on:

  • Bias terms from each categorical feature.
  • Dot-product of embedding features of two categorical features,e.g., user-side v.s. item-side.
  • Per-categorical decision trees based on numerical features: an ensemble of numerical decision trees, where each tree is based on one categorical feature.

In detail, it is as follows: $$ \hat{y}(x) = \underbrace{\underbrace{\sum_{i=0}^{k} w_{a_i}}_{bias} + \underbrace{\left(\sum_{a_i\in U(a)} Q_{a_i}\right)^{T}\left(\sum_{a_i\in I(a)} Q_{a_i}\right)}_{factors}}_{CAT-E} + \underbrace{\sum_{i=0}^{k} T_{a_i}(b)}_{CAT-NT}. $$ And it is decomposed as in the following table.


| Ingredients | Formulae | Features |
| --- | --- | --- |
| Factorization Machines | $\underbrace{\underbrace{\sum_{i=0}^{k} w_{a_i}}_{bias} + \underbrace{(\sum_{a_i\in U(a)} Q_{a_i})^{T}(\sum_{a_i\in I(a)} Q_{a_i})}_{factors}}_{CAT-E}$ | Categorical Features |
| GBDT | $\underbrace{\sum_{i=0}^{k} T_{a_i}(b)}_{CAT-NT}$ | Numerical Features |

Adaptive Boosting Personalized Ranking (AdaBPR)

AdaBPR (Adaptive Boosting Personalized Ranking) is a boosting algorithm for top-N item recommendation using users' implicit feedback. In this framework, multiple homogeneous component recommenders are linearly combined to achieve more accurate recommendation. The component recommenders are learned based on a re-weighting strategy that assigns a dynamic weight to each observed user-item interaction.

Here explicit feedback refers to users' ratings to items while implicit feedback is derived from users' interactions with items, e.g., number of times a user plays a song.

The primary idea of applying boosting for item recommendation is to learn a set of homogeneous component recommenders and then create an ensemble of the component recommenders to predict users' preferences.

Here, we use a linear combination of component recommenders as the final recommendation model: $$f=\sum_{t=1}^{T}{\alpha}_t f_{t}.$$

In the training process, AdaBPR runs for ${T}$ rounds, and the component recommender $f_t$ is created at the $t$-th round by $$ \arg\min_{f_t\in\mathbb{H}} \sum_{(u,i)\in\mathbb{O}} {\beta}_{u} \exp\Big(-E\big(\pi(u,i,\sum_{n=1}^{t}{\alpha}_n f_{n})\big)\Big), $$

where the notations are listed as follows:

  • $\mathbb{H}$ is the set of possible component recommenders such as collaborative ranking algorithms;
  • $E(\pi(u,i,f))$ denotes the ranking accuracy associated with each observed interaction pair;
  • $\pi(u,i,f)$ is the rank position of item ${i}$ in the ranked item list of ${u}$, produced by a learned ranking model ${f}$;
  • $\mathbb{O}$ is the set of all observed user-item interactions;
  • ${\beta}_{u}$ is defined as the reciprocal of the number of user $u$'s historical items, ${\beta}_{u}=\frac{1}{|V_{u}^{+}|}$ ($V_{u}^{+}$ is the set of historical items of ${u}$).

Explainable Recommendations

Explainable recommendation and search attempt to develop models or methods that not only generate high-quality recommendation or search results, but also intuitive explanations of the results for users or system designers, which can help improve the system transparency, persuasiveness, trustworthiness, and effectiveness, etc.

Social Recommendation

We present a novel framework for studying recommendation algorithms in terms of the ‘jumps’ that they make to connect people to artifacts. This approach emphasizes reachability via an algorithm within the implicit graph structure underlying a recommender dataset and allows us to consider questions relating algorithmic parameters to properties of the datasets.

User-item and user-user interactions are usually in the form of a graph/network structure. What is more, the graph is dynamic, and we need methods that apply to new nodes without retraining the model.

Knowledge Graph and Recommender System

Reinforcement Learning and Recommender System


| Traditional Approaches | Beyond Traditional Methods |
| --- | --- |
| Collaborative Filtering | Tensor Factorization & Factorization Machines |
| Content-Based Recommendation | Social Recommendations |
| Item-based Recommendation | Learning to rank |
| Hybrid Approaches | MAB Explore/Exploit |

Implementation

Computational Advertising

Online advertising has grown over the past decade to over $26 billion in recorded revenue in 2010. The revenues generated are based on different pricing models that can be fundamentally grouped into two types: cost per (thousand) impressions (CPM) and cost per action (CPA), where an action can be a click, signing up with the advertiser, a sale, or any other measurable outcome. A web publisher generating revenues by selling advertising space on its site can offer either a CPM or CPA contract. We analyze the conditions under which the two parties agree on each contract type, accounting for the relative risk experienced by each party.

The information technology industry relies heavily on online advertising, as at Google, Facebook, or Alibaba. Advertising is nothing but information, though it is not usually accepted gladly. In fact, it is more difficult than recommendation because less is known about the context where the advertisement is placed.

Hongliang Jie shares three challenges of computational advertising at Etsy, which will be the titles of the following subsections.

Click-Through Rate Modeling

GBRT+LR

Given the feature vector ${x}$, GBRT splits the features into leaves; we then transform these leaf assignments and feed them into the logistic regression.

Practical Lessons from Predicting Clicks on Ads at Facebook and the blog use GBRT to select proper features and LR to map these features into the interval $[0,1]$ as a ratio. Once we have the right features and the right model (decision trees plus logistic regression), other factors play small roles (though even small improvements are important at scale).
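
A hedged sketch of this two-stage pipeline with scikit-learn on toy data: the boosted trees emit leaf indices, which are one-hot encoded and fed into a logistic regression; the dataset and hyperparameters are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.random((1000, 10))                   # toy feature matrix
y = (rng.random(1000) > 0.5).astype(int)     # toy click labels

gbrt = GradientBoostingClassifier(n_estimators=50).fit(X, y)
leaves = gbrt.apply(X)[:, :, 0]              # (n_samples, n_trees) leaf indices
enc = OneHotEncoder(handle_unknown="ignore").fit(leaves)
lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaves), y)
ctr = lr.predict_proba(enc.transform(leaves))[:, 1]   # predicted click ratio
```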

Conversion Rate Modeling

Bid Optimization



User Engagement

User engagement measures whether users find value in a product or service. Engagement can be measured by a variety or combination of activities such as downloads, clicks, shares, and more. Highly engaged users are generally more profitable, provided that their activities are tied to valuable outcomes such as purchases, signups, subscriptions, or clicks.

Labs