feat: add support for creating a Matrix Factorization model #1330

Open
wants to merge 29 commits into
base: main
29 commits
1d39560
feat: add support for creating a Matrix Factorization model
rey-esp Jan 27, 2025
e19c262
feat: add support for creating a Matrix Factorization model
rey-esp Jan 27, 2025
1bef4a2
feat: add support for creating a Matrix Factorization model
rey-esp Jan 27, 2025
d157cd7
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Jan 28, 2025
e336bde
Update bigframes/ml/decomposition.py
rey-esp Jan 28, 2025
d5f713a
Update bigframes/ml/decomposition.py
rey-esp Jan 28, 2025
5e3e443
Update bigframes/ml/decomposition.py
rey-esp Jan 28, 2025
34a60bc
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Jan 28, 2025
c116e8a
rating_col
rey-esp Jan 28, 2025
dedef39
(nearly) complete class
rey-esp Jan 28, 2025
e5165a9
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Jan 28, 2025
05eb854
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Jan 28, 2025
2787178
removem print()
rey-esp Jan 28, 2025
8c66e07
removem print()
rey-esp Jan 28, 2025
086b4dd
adding recommend
rey-esp Jan 29, 2025
8ed3ccd
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Jan 29, 2025
1b4eef9
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Jan 29, 2025
7c371ac
remove hyper parameter runing references
rey-esp Jan 30, 2025
7498c8c
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Jan 30, 2025
55ef06a
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Jan 30, 2025
29805b5
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Feb 4, 2025
8de384a
swap predict in _mf for recommend
rey-esp Feb 4, 2025
647532b
recommend -> predict
rey-esp Feb 4, 2025
b340c4f
update predict doc string
rey-esp Feb 4, 2025
580de41
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Feb 4, 2025
29ee357
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Feb 5, 2025
bac2ece
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Feb 6, 2025
3f22c23
Merge branch 'b338873783-matrix-factorization' of github.com:googleap…
rey-esp Feb 6, 2025
213f11d
Merge branch 'main' into b338873783-matrix-factorization
rey-esp Feb 6, 2025
6 changes: 6 additions & 0 deletions bigframes/ml/core.py
@@ -117,6 +117,12 @@ def model(self) -> bigquery.Model:
        """Get the BQML model associated with this wrapper"""
        return self._model

    def recommend(self, input_data: bpd.DataFrame) -> bpd.DataFrame:
        return self._apply_ml_tvf(
            input_data,
            self._model_manipulation_sql_generator.ml_recommend,
        )

    def predict(self, input_data: bpd.DataFrame) -> bpd.DataFrame:
        return self._apply_ml_tvf(
            input_data,
111 changes: 111 additions & 0 deletions bigframes/ml/decomposition.py
@@ -19,6 +19,7 @@

from typing import List, Literal, Optional, Union

import bigframes_vendored.sklearn.decomposition._mf
import bigframes_vendored.sklearn.decomposition._pca
from google.cloud import bigquery

@@ -197,3 +198,113 @@ def score(

        # TODO(b/291973741): X param is ignored. Update BQML supports input in ML.EVALUATE.
        return self._bqml_model.evaluate()


@log_adapter.class_logger
class MatrixFactorization(
    base.UnsupervisedTrainablePredictor,
    bigframes_vendored.sklearn.decomposition._mf.MatrixFactorization,
):
    __doc__ = bigframes_vendored.sklearn.decomposition._mf.MatrixFactorization.__doc__

    def __init__(
        self,
        *,
        feedback_type: Literal["explicit", "implicit"] = "explicit",
        num_factors: int,
        user_col: str,
        item_col: str,
        rating_col: str = "rating",

Comment on lines +215 to +217 (Collaborator):

@GarrettWu @shuoweil I see in #1282 you ended up passing in "id_col" as a separate argument to fit() instead of the class constructor. Is this a pattern you would recommend here?

Note: MatrixFactorization differs somewhat from that application in that normally in scikit-learn one would have a "sparse matrix" data type (e.g. https://docs.scipy.org/doc/scipy/reference/sparse.html) where rows/cols/values would all be bundled up in one object, similar to how we are using the bigframes DataFrame for this purpose.

        # TODO: Add support for hyperparameter tuning.
        l2_reg: float = 1.0,
    ):
        self.feedback_type = feedback_type
        self.num_factors = num_factors
        self.user_col = user_col
        self.item_col = item_col
        self.rating_col = rating_col
        self.l2_reg = l2_reg
        self._bqml_model: Optional[core.BqmlModel] = None
        self._bqml_model_factory = globals.bqml_model_factory()

    @classmethod
    def _from_bq(
        cls, session: bigframes.session.Session, bq_model: bigquery.Model
    ) -> MatrixFactorization:
        assert bq_model.model_type == "MATRIX_FACTORIZATION"

        kwargs = utils.retrieve_params_from_bq_model(
            cls, bq_model, _BQML_PARAMS_MAPPING
        )

        model = cls(**kwargs)
        model._bqml_model = core.BqmlModel(session, bq_model)
        return model

    @property
    def _bqml_options(self) -> dict:
        """The model options as they will be set for BQML"""
        options: dict = {
            "model_type": "matrix_factorization",
            "feedback_type": self.feedback_type,
            "user_col": self.user_col,
            "item_col": self.item_col,
            "rating_col": self.rating_col,
            "l2_reg": self.l2_reg,
        }

        if self.num_factors is not None:
            options["num_factors"] = self.num_factors

        return options

    def _fit(
        self,
        X: utils.ArrayType,
        y=None,
        transforms: Optional[List[str]] = None,
    ) -> MatrixFactorization:
        (X,) = utils.batch_convert_to_dataframe(X)

        self._bqml_model = self._bqml_model_factory.create_model(
            X_train=X,
            transforms=transforms,
            options=self._bqml_options,
        )
        return self

    def predict(self, X: utils.ArrayType) -> bpd.DataFrame:
        if not self._bqml_model:
            raise RuntimeError("A model must be fitted before predict")

        (X,) = utils.batch_convert_to_dataframe(X, session=self._bqml_model.session)

        return self._bqml_model.recommend(X)

    def to_gbq(self, model_name: str, replace: bool = False) -> MatrixFactorization:
        """Save the model to BigQuery.

        Args:
            model_name (str):
                The name of the model.
            replace (bool, default False):
                Whether to replace the model if it already exists. Defaults to False.

        Returns:
            MatrixFactorization: Saved model."""
        if not self._bqml_model:
            raise RuntimeError("A model must be fitted before it can be saved")

        new_model = self._bqml_model.copy(model_name, replace)
        return new_model.session.read_gbq_model(model_name)

    def score(
        self,
        X=None,
        y=None,
    ) -> bpd.DataFrame:
        if not self._bqml_model:
            raise RuntimeError("A model must be fitted before score")

        # TODO(b/291973741): X param is ignored. Update once BQML supports input in ML.EVALUATE.
        return self._bqml_model.evaluate()
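
The `_bqml_options` property in the diff above builds a plain dictionary from the constructor arguments and only adds `num_factors` when it is not `None`. A minimal standalone sketch of that pattern follows. `MatrixFactorizationOptionsSketch` and its `to_bqml_options` method are hypothetical names used for illustration, and allowing `num_factors` to be `None` is an assumption (the diff types it as a required `int`).

```python
from dataclasses import dataclass
from typing import Any, Dict, Literal, Optional


@dataclass
class MatrixFactorizationOptionsSketch:
    """Hypothetical container mirroring the constructor arguments in the diff."""

    user_col: str
    item_col: str
    # Assumption: None simply means the option is left out of the dictionary.
    num_factors: Optional[int] = None
    feedback_type: Literal["explicit", "implicit"] = "explicit"
    rating_col: str = "rating"
    l2_reg: float = 1.0

    def to_bqml_options(self) -> Dict[str, Any]:
        """Build the options dictionary the same way the diff's property does."""
        options: Dict[str, Any] = {
            "model_type": "matrix_factorization",
            "feedback_type": self.feedback_type,
            "user_col": self.user_col,
            "item_col": self.item_col,
            "rating_col": self.rating_col,
            "l2_reg": self.l2_reg,
        }
        # Only forward num_factors when the caller actually set it.
        if self.num_factors is not None:
            options["num_factors"] = self.num_factors
        return options


with_factors = MatrixFactorizationOptionsSketch(
    user_col="user_id", item_col="item_id", num_factors=8
).to_bqml_options()
without_factors = MatrixFactorizationOptionsSketch(
    user_col="user_id", item_col="item_id"
).to_bqml_options()
print(with_factors)
print(without_factors)
```

Because `num_factors` has no default in the PR's constructor, the `is not None` guard only matters if a caller passes `None` explicitly; the sketch makes that case visible.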
2 changes: 2 additions & 0 deletions bigframes/ml/loader.py
@@ -42,6 +42,7 @@
        "LINEAR_REGRESSION": linear_model.LinearRegression,
        "LOGISTIC_REGRESSION": linear_model.LogisticRegression,
        "KMEANS": cluster.KMeans,
        "MATRIX_FACTORIZATION": decomposition.MatrixFactorization,
        "PCA": decomposition.PCA,
        "BOOSTED_TREE_REGRESSOR": ensemble.XGBRegressor,
        "BOOSTED_TREE_CLASSIFIER": ensemble.XGBClassifier,
@@ -82,6 +83,7 @@
def from_bq(
    session: bigframes.session.Session, bq_model: bigquery.Model
) -> Union[
    decomposition.MatrixFactorization,
    decomposition.PCA,
    cluster.KMeans,
    linear_model.LinearRegression,
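
The loader change registers the new class under the `"MATRIX_FACTORIZATION"` key so that a saved model can be mapped back to its wrapper. The sketch below shows that registry lookup in isolation. All class and function names here (`ModelSketch`, `PCASketch`, `MatrixFactorizationSketch`, `load`) are hypothetical; the real `from_bq` takes a session and a `bigquery.Model`, and its body is not shown in this diff.

```python
from typing import Dict, Type


class ModelSketch:
    """Hypothetical base wrapper; the real wrappers hold a session and a BQML model."""

    def __init__(self, model_type: str) -> None:
        self.model_type = model_type

    @classmethod
    def from_bq(cls, model_type: str) -> "ModelSketch":
        return cls(model_type)


class PCASketch(ModelSketch):
    pass


class MatrixFactorizationSketch(ModelSketch):
    pass


# Registry keyed by the model type string, as in the loader mapping.
MODEL_TYPE_MAPPING: Dict[str, Type[ModelSketch]] = {
    "PCA": PCASketch,
    "MATRIX_FACTORIZATION": MatrixFactorizationSketch,
}


def load(model_type: str) -> ModelSketch:
    """Look up the wrapper class for a model type and build an instance."""
    try:
        wrapper_cls = MODEL_TYPE_MAPPING[model_type]
    except KeyError:
        raise NotImplementedError(f"Unsupported model type: {model_type}") from None
    return wrapper_cls.from_bq(model_type)


loaded = load("MATRIX_FACTORIZATION")
print(type(loaded).__name__)
```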
5 changes: 5 additions & 0 deletions bigframes/ml/sql.py
@@ -299,6 +299,11 @@ def alter_model(
        return "\n".join(parts)

    # ML prediction TVFs
    def ml_recommend(self, source_sql: str) -> str:
        """Encode ML.RECOMMEND for BQML"""
        return f"""SELECT * FROM ML.RECOMMEND(MODEL {self._model_ref_sql()},
({source_sql}))"""

    def ml_predict(self, source_sql: str) -> str:
        """Encode ML.PREDICT for BQML"""
        return f"""SELECT * FROM ML.PREDICT(MODEL {self._model_ref_sql()},
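
Taken together, the `core.py` and `sql.py` changes pass a SQL-generator method into a shared helper, so `recommend` and `predict` differ only in which table-valued function they name. The following is a rough standalone sketch of that flow. `SqlGeneratorSketch`, `apply_tvf`, the backtick quoting in `_model_ref_sql`, and the model and table names are all assumptions made for illustration; the real helper runs the query rather than returning text.

```python
from typing import Callable


class SqlGeneratorSketch:
    """Hypothetical stand-in for the SQL generator touched in sql.py."""

    def __init__(self, model_ref: str) -> None:
        # Assumption: the model is referenced by a fully qualified name.
        self._model_ref = model_ref

    def _model_ref_sql(self) -> str:
        # Assumption: backtick quoting; the real method body is not in the diff.
        return f"`{self._model_ref}`"

    def ml_recommend(self, source_sql: str) -> str:
        return (
            f"SELECT * FROM ML.RECOMMEND(MODEL {self._model_ref_sql()},\n"
            f"  ({source_sql}))"
        )

    def ml_predict(self, source_sql: str) -> str:
        return (
            f"SELECT * FROM ML.PREDICT(MODEL {self._model_ref_sql()},\n"
            f"  ({source_sql}))"
        )


def apply_tvf(source_sql: str, tvf: Callable[[str], str]) -> str:
    """Return the generated SQL text; a real wrapper would execute it instead."""
    return tvf(source_sql)


generator = SqlGeneratorSketch("my-project.my_dataset.my_model")
source = "SELECT user_id, item_id FROM ratings"
recommend_sql = apply_tvf(source, generator.ml_recommend)
predict_sql = apply_tvf(source, generator.ml_predict)
print(recommend_sql)
```

Passing the bound method as a callable keeps the wrapper methods to a few lines each, which is why adding `recommend` needed only one new generator method and one new wrapper method.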
89 changes: 89 additions & 0 deletions third_party/bigframes_vendored/sklearn/decomposition/_mf.py
@@ -0,0 +1,89 @@
""" Matrix Factorization.
"""

# Author: Alexandre Gramfort <alexandre.gramfort@inria.fr>
# Olivier Grisel <olivier.grisel@ensta.org>
# Mathieu Blondel <mathieu@mblondel.org>
# Denis A. Engemann <denis-alexander.engemann@inria.fr>
# Michael Eickenberg <michael.eickenberg@inria.fr>
# Giorgio Patrini <giorgio.patrini@anu.edu.au>
#
# License: BSD 3 clause

from abc import ABCMeta

from bigframes_vendored.sklearn.base import BaseEstimator

from bigframes import constants


class MatrixFactorization(BaseEstimator, metaclass=ABCMeta):
    """Matrix Factorization (MF).

    **Examples:**

        >>> import bigframes.pandas as bpd
        >>> from bigframes.ml.decomposition import MatrixFactorization
        >>> X = bpd.DataFrame({
        ...     "user": [0, 0, 1, 1, 2, 2],
        ...     "item": [0, 1, 0, 1, 0, 1],
        ...     "rating": [1.0, 1.0, 2.0, 1.0, 3.0, 1.2],
        ... })
        >>> model = MatrixFactorization(
        ...     num_factors=2,
        ...     user_col="user",
        ...     item_col="item",
        ...     rating_col="rating",
        ... )
        >>> model = model.fit(X)
        >>> predicted = model.predict(X)

    Args:
        feedback_type ("explicit" or "implicit", default "explicit"):
            Specifies the feedback type for the model.
        num_factors (int):
            Specifies the number of latent factors to use.
        user_col (str):
            The user column name.
        item_col (str):
            The item column name.
        rating_col (str, default "rating"):
            The rating column name.
        l2_reg (float, default 1.0):
            A floating point value for L2 regularization.
    """

    def fit(self, X, y=None):
        """Fit the model according to the given training data.

        Args:
            X (bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series):
                Series or DataFrame of shape (n_samples, n_features). Training vector,
                where `n_samples` is the number of samples and `n_features` is
                the number of features.

            y (default None):
                Ignored.

        Returns:
            bigframes.ml.decomposition.MatrixFactorization: Fitted estimator.
        """
        raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE)

    def score(self, X=None, y=None):
        """Calculate evaluation metrics of the model.

        .. note::

            Output matches that of the BigQuery ML.EVALUATE function.
            See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate#matrix_factorization_models
            for the outputs relevant to this model type.

        Args:
            X (default None):
                Ignored.

            y (default None):
                Ignored.

        Returns:
            bigframes.dataframe.DataFrame: DataFrame that represents model metrics.
        """
        raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE)

    def predict(self, X):
        """Generate a predicted rating for every user-item row combination for a matrix factorization model.

        Args:
            X (bigframes.dataframe.DataFrame or bigframes.series.Series or pandas.core.frame.DataFrame or pandas.core.series.Series):
                Series or a DataFrame to predict.

        Returns:
            bigframes.dataframe.DataFrame: Predicted DataFrames."""
        raise NotImplementedError(constants.ABSTRACT_METHOD_ERROR_MESSAGE)
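
As general background on the technique rather than anything this PR states: a matrix factorization model approximates each rating as the dot product of a user vector and an item vector, each of length `num_factors`, and the training data is a long-form table with one row per (user, item, rating) triple, which is what `user_col`, `item_col`, and the rating column name. The sketch below uses hand-picked factors, not values trained by BQML, and every name in it is hypothetical.

```python
from typing import Dict, List, Tuple

# Long-form training rows: one (user, item, rating) triple per observation.
ratings: List[Tuple[str, str, float]] = [
    ("alice", "book", 5.0),
    ("alice", "film", 4.0),
    ("bob", "film", 4.0),
]

# Hand-picked latent factors with num_factors = 2 (illustrative, not trained).
user_factors: Dict[str, List[float]] = {
    "alice": [1.0, 2.0],
    "bob": [2.0, 0.0],
}
item_factors: Dict[str, List[float]] = {
    "book": [1.0, 2.0],
    "film": [2.0, 1.0],
}


def predict_rating(user: str, item: str) -> float:
    """Predicted rating is the dot product of the user and item factor vectors."""
    return sum(u * v for u, v in zip(user_factors[user], item_factors[item]))


# Score every user-item combination, including pairs with no observed rating.
predictions = {
    (user, item): predict_rating(user, item)
    for user in user_factors
    for item in item_factors
}
for pair, value in sorted(predictions.items()):
    print(pair, value)
```

Here the pair `("bob", "book")` has no observed rating, yet the factors still yield a prediction for it, which is the sense in which the model scores every user-item combination.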