16 changes: 13 additions & 3 deletions README.md
Expand Up @@ -8,7 +8,7 @@
This is a Python package designed to facilitate correcting for distributional bias during cross-validation. It was recently shown that removing a fraction of a dataset into a testing fold can artificially create a shift in label averages across training folds that is inversely correlated with that of their corresponding test folds. We have demonstrated that most machine learning models' results suffer from this bias, which this package resolves by subsampling points from within the training set to remove any differences in label average across training folds. To begin using RebalancedCV, we recommend reading its [documentation pages](https://korem-lab.github.io/RebalancedCV/).
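The bias described above can be illustrated in a few lines of NumPy (a standalone sketch, not part of the package): holding out one sample shifts the training-fold label mean in the direction opposite to the held-out label.

```python
import numpy as np

# Illustration of distributional bias in leave-one-out splitting:
# removing one sample shifts the training-fold label mean in the
# direction opposite to the held-out label.
y = np.array([0, 0, 1, 1, 0, 1])
for i in range(len(y)):
    train_mean = np.delete(y, i).mean()
    shift = train_mean - y.mean()
    # held-out label 1 -> training mean drops; held-out 0 -> it rises
    assert (shift < 0) == (y[i] == 1)
```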


All classes from this package provide train/test indices to split data into train/test sets while rebalancing the training set to account for distributional bias. This package is designed to enable automated rebalancing for the cross-validation implementations in formats similar to scikit-learn's `LeaveOneOut`, `StratifiedKFold`, and `LeavePOut`, through the `RebalancedCV` classes `RebalancedLeaveOneOut`, `RebalancedLeaveOneOutRegression`, `RebalancedKFold`, and `RebalancedLeavePOut`. These Rebalanced classes are designed to work in the exact same code structure and implementation use cases as their scikit-learn equivalents, with the only difference being a subsampling within the provided training indices.
All classes from this package provide train/test indices to split data into train/test sets while rebalancing the training set to account for distributional bias. This package is designed to enable automated rebalancing for the cross-validation implementations in formats similar to scikit-learn's `LeaveOneOut`, `StratifiedKFold`, `LeavePOut`, and `LeaveOneGroupOut`, through the `RebalancedCV` classes `RebalancedLeaveOneOut`, `RebalancedLeaveOneOutRegression`, `RebalancedKFold`, `RebalancedLeavePOut`, and `RebalancedLeaveOneGroupOut`. These Rebalanced classes are designed to work in the exact same code structure and implementation use cases as their scikit-learn equivalents, with the only difference being a subsampling within the provided training indices.

For any support using RebalancedCV, please use our <a href="https://github.com/korem-lab/RebalancedCV/issues">issues page</a> or email: gia2105@columbia.edu.

Expand Down Expand Up @@ -114,7 +114,17 @@ Provides train/test indices to split data in train/test sets with rebalancing to
##### **Parameters**
p : int
Size of the test sets. Must be strictly less than one half of the number of samples.


### RebalancedLeaveOneGroupOut

Provides train/test indices to split data in train/test sets with rebalancing when splitting by **groups**. Each fold holds out one group as the test set and uses the rest for training; the training set is then subsampled so that every fold has the same number of samples per class (avoiding distributional bias). The test set is never subsampled (full left-out group). The `groups` parameter is **required** (same as sklearn's LeaveOneGroupOut). At least two groups are needed.

**When to use rebalancing:** Use **RebalancedLeaveOneGroupOut** when you want comparable training conditions across folds (e.g., when reporting an average over groups, or when comparing per-group performance on an even footing), or when groups are merely a blocking factor and you care about unbiased overall or class-wise metrics. **When not to:** Use plain **LeaveOneGroupOut** when you only care about performance on each left-out group and are not aggregating in a way that is sensitive to train-fold balance, or when you prefer a realistic training composition per fold. If groups already have similar class distributions, rebalancing is optional but does not hurt.

See `sklearn.model_selection.LeaveOneGroupOut` for leave-one-group-out cross-validation.

##### **Parameters**
No parameters are used for this class. `groups` must be passed to `split(X, y, groups)` and `get_n_splits(groups=groups)`.
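A standalone NumPy sketch (not the package's internal code) of the rebalancing target described above: for each class, every training fold keeps the total class count minus the largest per-group class count, so all folds end up with identical training class counts.

```python
import numpy as np

# Sketch of the per-class training count that makes every fold balanced:
# total class count minus the largest count held by any single group.
y = np.array([0, 0, 1, 1, 0, 1])
groups = np.array([1, 1, 1, 2, 2, 2])

labels, y_enc = np.unique(y, return_inverse=True)
total = np.bincount(y_enc)                                  # [3, 3]
per_group = np.array([np.bincount(y_enc[groups == g], minlength=len(labels))
                      for g in np.unique(groups)])          # [[2, 1], [1, 2]]
min_train = total - per_group.max(axis=0)                   # [1, 1]
```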

### RebalancedLeaveOneOutRegression

Expand All @@ -131,7 +141,7 @@ All of this package's classes use the `split` method, which takes the following parameters:
`y` : array-like of shape (n_samples,); The target variable for supervised learning problems. At least two observations per class are needed for RebalancedLeaveOneOut

`groups` : array-like of shape (n_samples,), default=None; Group labels for the samples used while splitting the dataset into
train/test set.
train/test set. Required for RebalancedLeaveOneGroupOut; optional (and ignored) for other classes.

`seed` : Integer, default=None; can be specified to enforce consistency in the subsampling

Expand Down
8 changes: 7 additions & 1 deletion rebalancedcv/__init__.py
@@ -1,3 +1,9 @@
__version__ = "0.0.1"
from .classification import RebalancedLeaveOneOut, RebalancedKFold, RebalancedLeavePOut, MulticlassRebalancedLeaveOneOut
from .classification import (
RebalancedLeaveOneOut,
RebalancedKFold,
RebalancedLeavePOut,
MulticlassRebalancedLeaveOneOut,
RebalancedLeaveOneGroupOut,
)
from .regression import RebalancedLeaveOneOutRegression
224 changes: 208 additions & 16 deletions rebalancedcv/classification.py
Expand Up @@ -2,6 +2,7 @@
from sklearn.model_selection import BaseCrossValidator
from sklearn.utils.validation import _num_samples, check_array, column_or_1d, indexable
from sklearn.utils.multiclass import type_of_target
import numpy as np
import warnings

import numbers
from sklearn.utils.validation import _deprecate_positional_args
Expand Down Expand Up @@ -656,20 +657,211 @@ def get_n_splits(self, X, y, groups=None):
if X is None:
raise ValueError("The 'X' parameter should not be None.")
return _num_samples(X)


















class RebalancedLeaveOneGroupOut(BaseCrossValidator):
"""Rebalanced Leave-One-Group-Out cross-validator.

Provides train/test indices to split data such that each training set is
comprised of all samples except ones belonging to one specific group,
with subsampling so that every training fold has the same number of
samples per class (avoiding distributional bias). Rebalancing is
applied only to the training set; the test set is always the full
left-out group. Arbitrary domain-specific group information is provided
as an array of integers that encodes the group of each sample. For
instance the groups could be the year of collection of the samples and
thus allow for cross-validation against time-based splits.

The ``groups`` parameter is required (same as sklearn's
``LeaveOneGroupOut``). At least two groups are required. For
rebalancing to be non-degenerate, every class should appear in at least
two groups; if a class has no samples in a training fold, it is omitted
from that fold's training set and a warning is issued.

Notes
-----
Splits are ordered according to the index of the group left out. The
first split has testing set consisting of the group whose index in
``groups`` is lowest, and so on.

Use this class when you want leave-one-group-out *and* need to remove
training-fold label imbalance (e.g. comparing models or tuning
hyperparameters). Use plain ``LeaveOneGroupOut`` when you only care
about generalization to a new group.

See Also
--------
sklearn.model_selection.LeaveOneGroupOut : Leave-one-group-out without
training rebalancing.
sklearn.model_selection.GroupKFold : K-fold variant with
non-overlapping groups.

Examples
--------
>>> import numpy as np
>>> from rebalancedcv import RebalancedLeaveOneGroupOut
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
>>> y = np.array([0, 0, 1, 1, 0, 1])
>>> groups = np.array([1, 1, 1, 2, 2, 2])
>>> rlogo = RebalancedLeaveOneGroupOut()
>>> rlogo.get_n_splits(groups=groups)
2
>>> print(rlogo)
RebalancedLeaveOneGroupOut()
>>> for i, (train_index, test_index) in enumerate(rlogo.split(X, y, groups, seed=42)):
... print(f"Fold {i}:")
... print(f" Train: index={train_index}")
... print(f" Test: index={test_index}")
Fold 0:
Train: index=[4 5]
Test: index=[0 1 2]
Fold 1:
Train: index=[0 2]
Test: index=[3 4 5]
"""

def _iter_test_masks(self, X, y, groups):
if groups is None:
raise ValueError("The 'groups' parameter should not be None.")
# We make a copy of groups to avoid side-effects during iteration
groups = check_array(
groups, input_name="groups", copy=True, ensure_2d=False, dtype=None
)
unique_groups = np.unique(groups)
if len(unique_groups) <= 1:
raise ValueError(
"The groups parameter contains fewer than 2 unique groups "
"(%s). RebalancedLeaveOneGroupOut expects at least 2."
% unique_groups
)
for i in unique_groups:
yield groups == i

def split(self, X, y, groups=None, seed=None):
"""Generate indices to split data into training and test set.

Parameters
----------
X : array-like of shape (n_samples, n_features)
Training data, where `n_samples` is the number of samples and
`n_features` is the number of features.

y : array-like of shape (n_samples,)
The target variable for supervised learning problems.

groups : array-like of shape (n_samples,)
Group labels for the samples used while splitting the dataset
into train/test set. Must be specified.

seed : int or None, default=None
Random seed for subsampling reproducibility.

Yields
------
train : ndarray
The training set indices for that split (subsampled for
consistent class balance).

test : ndarray
The testing set indices for that split (full left-out group).
"""
if groups is None:
raise ValueError("The 'groups' parameter should not be None.")
if seed is not None:
np.random.seed(seed)

X, y, groups = indexable(X, y, groups)
n_samples = _num_samples(X)
groups = np.asarray(groups)
y = np.asarray(y)
type_of_target_y = type_of_target(y)
if type_of_target_y not in ("binary", "multiclass"):
raise ValueError(
"Supported target types are: binary, multiclass. Got {!r}."
.format(type_of_target_y)
)
y = column_or_1d(y)

# Encode labels as 0, 1, ... for bincount (works for any dtype, binary or multiclass)
unique_labels, y_encoded = np.unique(y, return_inverse=True)
n_classes = len(unique_labels)
total_count = np.bincount(y_encoded, minlength=n_classes) # (n_classes,) note that y_encoded are indices, not actual labels

unique_groups = np.unique(groups)
# Per-group class counts: (n_groups, n_classes)
group_class_count = np.zeros((len(unique_groups), n_classes), dtype=int)
for ig, g in enumerate(unique_groups):
mask = groups == g
group_class_count[ig] = np.bincount(y_encoded[mask], minlength=n_classes)

# Minimum training samples per class across all folds (same in every fold after subsample)
# For fold leaving out group g: train count for class k = total_count[k] - group_class_count[g,k]
min_train_count = total_count - group_class_count.max(axis=0) # (n_classes,)

# Warn if any class has no samples in the training set for at least one fold (all in one group)
omitted = np.where(min_train_count <= 0)[0]
if len(omitted) > 0:
omitted_labels = [unique_labels[k] for k in omitted]
warnings.warn(
"The following classes have no samples in the training set for at least one fold "
"(all samples belong to a single group) and are omitted from the rebalanced "
"training set: {}. Consider checking group/label alignment.".format(omitted_labels),
UserWarning,
stacklevel=2,
)

indices = np.arange(n_samples)
for test_mask in self._iter_test_masks(X, y, groups):
train_mask = ~test_mask
train_index = indices[train_mask]
test_index = indices[test_mask]

# Subsample training set so each class has min_train_count[k] samples
train_parts = []
for k in range(n_classes):
n_k = int(min_train_count[k])
if n_k <= 0: # if the class has no samples in the training set, skip it
continue
train_k = train_index[y_encoded[train_index] == k] # indices of the samples of class k in the training set
if len(train_k) < n_k:
class_label = unique_labels[k]
raise ValueError(
"Fold has {} samples of class '{}' in train but need {} (rebalancing impossible)."
.format(len(train_k), class_label, n_k)
)
train_parts.append(
np.random.choice(train_k, size=n_k, replace=False)
)
if train_parts:
train_index = np.sort(np.concatenate(train_parts))
else:
train_index = np.array([], dtype=int)

yield train_index, test_index

def get_n_splits(self, X=None, y=None, groups=None):
"""Returns the number of splitting iterations in the cross-validator.

Parameters
----------
X : array-like of shape (n_samples, n_features), default=None
Always ignored, exists for API compatibility.

y : array-like of shape (n_samples,), default=None
Always ignored, exists for API compatibility.

groups : array-like of shape (n_samples,), default=None
Group labels for the samples used while splitting the dataset
into train/test set. This 'groups' parameter must always be
specified to calculate the number of splits, though the other
parameters can be omitted.

Returns
-------
n_splits : int
Returns the number of splitting iterations in the cross-validator.
"""
if groups is None:
raise ValueError("The 'groups' parameter should not be None.")
groups = check_array(groups, input_name="groups", ensure_2d=False, dtype=None)
return len(np.unique(groups))
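The per-fold subsampling loop above can be sketched outside the class as follows (an illustration under the same logic, not the package's own code): for each left-out group, draw the same number of training samples of each class without replacement, so every fold's training class counts are identical.

```python
import numpy as np

# Sketch of per-fold rebalanced subsampling for leave-one-group-out:
# each fold holds out one group and draws min_train[k] samples of class k
# from the remaining groups, without replacement.
rng = np.random.default_rng(0)
y = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0, 1, 2])
groups = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

labels, y_enc = np.unique(y, return_inverse=True)
total = np.bincount(y_enc)
per_group = np.array([np.bincount(y_enc[groups == g], minlength=len(labels))
                      for g in np.unique(groups)])
min_train = total - per_group.max(axis=0)   # same training count per class in every fold

fold_counts = []
for g in np.unique(groups):
    train = np.where(groups != g)[0]        # all samples outside the left-out group
    picked = np.concatenate([
        rng.choice(train[y_enc[train] == k], size=min_train[k], replace=False)
        for k in range(len(labels))
    ])
    fold_counts.append(np.bincount(y_enc[picked], minlength=len(labels)).tolist())
```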
47 changes: 45 additions & 2 deletions rebalancedcv/tests/test_rebalancedcv.py
Expand Up @@ -6,7 +6,7 @@
from sklearn.model_selection import LeaveOneOut
from rebalancedcv import RebalancedLeaveOneOut, RebalancedKFold, \
RebalancedLeavePOut, RebalancedLeaveOneOutRegression, \
MulticlassRebalancedLeaveOneOut
MulticlassRebalancedLeaveOneOut, RebalancedLeaveOneGroupOut

from sklearn.metrics import roc_auc_score

Expand Down Expand Up @@ -121,7 +121,50 @@ def run_regression_cv(self,
def test_all_classification_cvs(self):
for cv in [RebalancedLeaveOneOut, RebalancedKFold, RebalancedLeavePOut, MulticlassRebalancedLeaveOneOut]:
self.run_classification_cv(cv)


def test_rebalanced_leave_one_group_out(self):
rlogo = RebalancedLeaveOneGroupOut()

## --- API: groups required ---
with self.assertRaises(ValueError):
rlogo.get_n_splits(groups=None)
with self.assertRaises(ValueError):
list(rlogo.split(np.random.rand(6, 2), np.array([0, 0, 1, 1, 0, 1]), groups=None))

## --- Binary: 6 samples, 2 groups of 3 ---
np.random.seed(1)
n_samples, n_features = 6, 2
X = np.random.rand(n_samples, n_features)
y = np.array([0, 0, 1, 1, 0, 1])
groups = np.array([1, 1, 1, 2, 2, 2])
self.assertEqual(rlogo.get_n_splits(groups=groups), 2)
train_means = []
for train_index, test_index in rlogo.split(X, y, groups, seed=1):
self.assertEqual(len(np.unique(groups[test_index])), 1,
"each test set should be exactly one group")
train_means.append(y[train_index].mean())
self.assertTrue(np.max(train_means) == np.min(train_means),
"train class balance should be identical across folds (binary)")

## --- Multi-class: 12 samples, 3 classes, 3 groups ---
np.random.seed(2)
X = np.random.rand(12, 3)
y = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0, 1, 2]) # 4 per class
groups = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
self.assertEqual(rlogo.get_n_splits(groups=groups), 3)
train_class_counts = []
for train_index, test_index in rlogo.split(X, y, groups, seed=2):
self.assertEqual(len(np.unique(groups[test_index])), 1,
"each test set should be exactly one group")
## rebalancing: same number of each class in train every fold
counts = np.bincount(y[train_index], minlength=3)
train_class_counts.append(tuple(counts))
self.assertEqual(len(set(train_class_counts)), 1,
"train class counts should be identical across folds (multi-class)")
## sanity: every class should appear in each fold's rebalanced training set
for counts in train_class_counts:
self.assertTrue(all(c > 0 for c in counts),
"every class should have at least one training sample per fold")

def test_all_regression_cvs(self):
for cv in [RebalancedLeaveOneOutRegression,
]:
Expand Down