
Feat: implement Random Projections #332

Merged
merged 8 commits into rust-ml:master on Mar 30, 2024

Conversation

@GBathie (Contributor) commented Feb 17, 2024

This PR implements random projection techniques for dimensionality reduction, as found in the sklearn.random_projection module of scikit-learn.

It contains two algorithms based on variants of the Johnson-Lindenstrauss lemma (a sketch of both matrix constructions follows the list):
- Random projections with Gaussian coefficients
- Sparse random projections with +/- 1 coefficients (multiplied by a scaling factor)
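
For intuition, here is a minimal, self-contained sketch of how the two projection matrices can be generated. This is illustrative only and not the PR's code: it assumes the ndarray, rand, and rand_distr crates, and it uses the density-1/sqrt(d) scheme of Ping Li et al. for the sparse variant (also scikit-learn's default):

use ndarray::Array2;
use rand::Rng;
use rand_distr::{Distribution, Normal};

/// Gaussian variant: i.i.d. entries from N(0, 1/k), where k is the target
/// dimension, so that squared distances are preserved in expectation.
fn gaussian_matrix(d: usize, k: usize, mut rng: impl Rng) -> Array2<f64> {
    let normal = Normal::new(0.0, (1.0 / k as f64).sqrt()).unwrap();
    Array2::from_shape_fn((d, k), |_| normal.sample(&mut rng))
}

/// Sparse variant: with density q = 1/sqrt(d), each entry is +/- sqrt(1/(q*k))
/// with probability q/2 each and 0 otherwise, so the matrix is mostly zeros
/// and cheap to apply.
fn sparse_matrix(d: usize, k: usize, mut rng: impl Rng) -> Array2<f64> {
    let q = 1.0 / (d as f64).sqrt();
    let scale = (1.0 / (q * k as f64)).sqrt();
    Array2::from_shape_fn((d, k), |_| match rng.gen::<f64>() {
        u if u < q / 2.0 => scale,
        u if u < q => -scale,
        _ => 0.0,
    })
}

fn main() {
    // Project 100-dimensional data down to 10 dimensions: an (n x 100) data
    // matrix is multiplied on the right by either (100 x 10) matrix below.
    let mut rng = rand::thread_rng();
    let g = gaussian_matrix(100, 10, &mut rng);
    let s = sparse_matrix(100, 10, &mut rng);
    println!("{:?} {:?}", g.dim(), s.dim());
}
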
@codecov-commenter commented Feb 17, 2024

Codecov Report

Attention: Patch coverage is 8.79121%, with 83 lines in your changes missing coverage. Please review.

Project coverage is 35.87%. Comparing base (4e40ce6) to head (6b9c2a4).

Files Patch % Lines
...nfa-reduction/src/random_projection/hyperparams.rs 3.44% 28 Missing ⚠️
...s/linfa-reduction/src/random_projection/methods.rs 0.00% 26 Missing ⚠️
...infa-reduction/src/random_projection/algorithms.rs 21.87% 25 Missing ⚠️
...ms/linfa-reduction/src/random_projection/common.rs 0.00% 4 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##           master     #332      +/-   ##
==========================================
- Coverage   36.18%   35.87%   -0.32%     
==========================================
  Files          92       96       +4     
  Lines        6218     6303      +85     
==========================================
+ Hits         2250     2261      +11     
- Misses       3968     4042      +74     


@quietlychris (Member)

I've done a quick review; this looks good to me, but have requested @bytesnake also give it a look as he is probably more familiar with the algorithm side of things.

@bytesnake (Member)

> I've done a quick review; this looks good to me, but have requested @bytesnake also give it a look as he is probably more familiar with the algorithm side of things.

Thank you for reviewing @relf @quietlychris

RNG defaults to Xoshiro256Plus if not provided by user.
Also added tests for minimum dimension using values from scikit-learn.
@GBathie (Contributor, Author) commented Mar 1, 2024

Thank you for the reviews, and @relf for the suggestions, I have implemented them.

Changes:

  • The rng field of both random projection structs is no longer optional; it defaults to Xoshiro256Plus with a fixed seed if not provided by the user.
  • Renamed the precision parameter to eps, as increasing this parameter results in a lower-dimensional embedding and often yields lower classification performance.
  • Added a check that the projection actually reduces the dimension of the data, returning an error otherwise, plus tests for this behavior.
  • Added a test for the function that computes the embedding dimension from a given epsilon, against values from scikit-learn (see the sketch below).
  • Fixed a reference issue in the docs and other minor issues.
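
For reference, the embedding dimension mentioned in the last two points comes from the Johnson-Lindenstrauss lemma. Below is a minimal sketch of that computation; the function name is illustrative, but the formula and its truncation behavior match scikit-learn's johnson_lindenstrauss_min_dim, which is where the reference values come from:

/// Minimum embedding dimension k that preserves the pairwise distances of
/// n_samples points up to a factor of (1 +/- eps):
///     k >= 4 ln(n_samples) / (eps^2 / 2 - eps^3 / 3)
/// Like scikit-learn, this truncates rather than rounds the result.
fn jl_min_dim(n_samples: usize, eps: f64) -> usize {
    let denom = eps.powi(2) / 2.0 - eps.powi(3) / 3.0;
    (4.0 * (n_samples as f64).ln() / denom) as usize
}

fn main() {
    // Reference values reported by scikit-learn's johnson_lindenstrauss_min_dim.
    assert_eq!(jl_min_dim(1_000_000, 0.5), 663);
    assert_eq!(jl_min_dim(1_000_000, 0.1), 11_841);
}

Note that a larger eps yields a smaller embedding dimension, which is why precision was a misleading name for this parameter.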

@relf (Member) left a comment

Thanks for your contribution and the changes. Now, the Gaussian and sparse random projection code looks very much alike; I wonder if you could refactor even further by using zero-sized types and a single generic RandomProjection type, something like:

struct Gaussian;
struct Sparse;

pub struct RandomProjectionValidParams<RandomMethod, R: Rng + Clone> {
    pub params: RandomProjectionParamsInner,
    pub rng: Option<R>,
    pub method: std::marker::PhantomData<RandomMethod>,
}

pub struct RandomProjectionParams<RandomMethod, R: Rng + Clone>(
    pub(crate) RandomProjectionValidParams<RandomMethod, R>,
);

pub struct RandomProjection<RandomMethod, F: Float> {
    projection: Array2<F>,
    method: std::marker::PhantomData<RandomMethod>,
}

pub type GaussianRandomProjection<F> = RandomProjection<Gaussian, F>;
pub type SparseRandomProjection<F> = RandomProjection<Sparse, F>;

impl<F, Rec, T, R> Fit<Rec, T, ReductionError> for RandomProjectionValidParams<Gaussian, R>
where
    F: Float,
    Rec: Records<Elem = F>,
    StandardNormal: Distribution<F>,
    R: Rng + Clone,
{
    type Object = RandomProjection<Gaussian, F>;

    fn fit(&self, dataset: &linfa::DatasetBase<Rec, T>) -> Result<Self::Object, ReductionError> {...}
}

impl<F, Rec, T, R> Fit<Rec, T, ReductionError> for RandomProjectionValidParams<Sparse, R>
where
    F: Float,
    Rec: Records<Elem = F>,
    StandardNormal: Distribution<F>,
    R: Rng + Clone,
{
    type Object = RandomProjection<Sparse, F>;

    fn fit(&self, dataset: &linfa::DatasetBase<Rec, T>) -> Result<Self::Object, ReductionError> {...}
}

...

What do you think?

@GBathie
Copy link
Contributor Author

GBathie commented Mar 2, 2024

> Thanks for your contribution and the changes. Now, the Gaussian and sparse random projection code looks very much alike; I wonder if you could refactor even further by using zero-sized types and a single generic RandomProjection type […] What do you think?

I think that's a very good suggestion: it will be easier to maintain than the previous approach, which used a macro to avoid code duplication. 6b9c2a4 implements a variation of this idea: all the logic has been refactored, and the behavior that depends on the projection method has been encapsulated in the ProjectionMethod trait. This also makes implementing other projection methods significantly easier.
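
For readers following along, here is a rough, self-contained sketch of what a ProjectionMethod-based refactor can look like. The trait, method, and struct names are illustrative and need not match the code merged in 6b9c2a4:

use std::marker::PhantomData;

use ndarray::Array2;
use rand::Rng;
use rand_distr::{Distribution, Normal};

/// Method-specific behavior lives behind a single trait; the zero-sized
/// marker types carry no data and only select an implementation.
pub trait ProjectionMethod {
    /// Build the (d x k) projection matrix for this method.
    fn generate_matrix<R: Rng>(d: usize, k: usize, rng: R) -> Array2<f64>;
}

pub struct Gaussian;

impl ProjectionMethod for Gaussian {
    fn generate_matrix<R: Rng>(d: usize, k: usize, mut rng: R) -> Array2<f64> {
        let normal = Normal::new(0.0, (1.0 / k as f64).sqrt()).unwrap();
        Array2::from_shape_fn((d, k), |_| normal.sample(&mut rng))
    }
}

// A `Sparse` marker would implement the same trait with the +/-1 scheme
// sketched earlier in the thread; the shared logic below never changes.

/// The model type is written once, generic over the method marker.
pub struct RandomProjection<M: ProjectionMethod> {
    pub projection: Array2<f64>,
    method: PhantomData<M>,
}

impl<M: ProjectionMethod> RandomProjection<M> {
    pub fn fit<R: Rng>(d: usize, k: usize, rng: R) -> Self {
        RandomProjection {
            projection: M::generate_matrix(d, k, rng),
            method: PhantomData,
        }
    }
}

Adding another projection method then only requires a new marker type and one trait impl, which matches the maintainability argument above.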

@quietlychris merged commit 2eaa686 into rust-ml:master on Mar 30, 2024
19 of 20 checks passed