Release 0.4.0

@bytesnake released this 28 Apr 20:49 · ce8a815

Linfa's 0.4.0 release introduces four new algorithms, improves the documentation of the ICA and K-means implementations, adds more benchmarks to K-means and updates to ndarray 0.14.

New algorithms

The Partial Least Squares Regression model family is added in this release (thanks to @relf). It projects the observed as well as the predicted variables to a latent space and maximizes the correlation between them. For problems with a large number of targets or with collinear predictors it performs better than standard regression. For more information, look into the documentation of linfa-pls.
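
As a rough sketch of how fitting such a model can look (assuming the linnerud dataset feature of linfa-datasets and the PlsRegression::params entry point from the linfa-pls documentation):

use linfa::prelude::*;
use linfa_pls::PlsRegression;

// load the linnerud exercise dataset (three features, three targets)
let ds = linfa_datasets::linnerud();

// fit a PLS regression with two latent components
let pls = PlsRegression::params(2).fit(&ds)?;

// predict the targets for the training records
let predictions = pls.predict(&ds);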

A wrapper for Barnes-Hut t-SNE is also added in this release. The t-SNE algorithm is often used for data visualization; it projects data from a high-dimensional space to a similar representation in two or three dimensions. It does so by minimizing the Kullback-Leibler divergence between the high-dimensional source distribution and the low-dimensional target distribution. The Barnes-Hut approximation improves the runtime drastically while retaining the quality of the embedding. Kudos to @frjnn for providing an implementation!
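
A minimal sketch of embedding a dataset into two dimensions (the parameter values are arbitrary and the TSneParams builder is assumed from the linfa-tsne documentation):

use linfa::traits::Transformer;
use linfa_tsne::TSneParams;

// embed the four iris features into a two-dimensional space
let ds = linfa_datasets::iris();
let ds = TSneParams::embedding_size(2)
    .perplexity(10.0)
    .approx_threshold(0.5)
    .transform(ds)?;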

A new preprocessing crate makes working with textual data and data normalization easy (thanks to @Sauro98). It implements a count-vectorizer and TF-IDF normalization for text pre-processing. Normalizations for signals include linear scaling, norm scaling and whitening with PCA/ZCA/Cholesky. An example with a Naive Bayes model achieves an 84% F1 score for predicting the categories alt.atheism, talk.religion.misc, comp.graphics and sci.space on a news dataset.
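
For instance, scaling a dataset to zero mean and unit variance could look like this (a sketch assuming the LinearScaler type and module path from the linfa-preprocessing documentation):

use linfa::traits::{Fit, Transformer};
use linfa_preprocessing::linear_scaling::LinearScaler;

// learn per-feature offsets and scales from the training data
let ds = linfa_datasets::diabetes();
let scaler = LinearScaler::standard().fit(&ds)?;

// apply the learned standard scaling to the dataset
let ds = scaler.transform(ds);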

Platt scaling calibrates a real-valued classification model to probabilities over two classes. This is used for SVM classification when probabilities are required. Further, a multi-class model, which combines multiple binary models (e.g. calibrated SVM models) into a single multi-class model, is also added. These composing models are moved to the linfa/src/composing/ subfolder.
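
For intuition, Platt scaling fits a sigmoid over the raw decision values of the binary model; the mapping itself is simple (the helper and parameter values below are hypothetical, not the linfa API):

// Platt scaling maps a raw decision value f to a probability via
// P(y = 1 | f) = 1 / (1 + exp(a * f + b)), with a and b learned
// during calibration (hypothetical standalone helper)
fn platt_probability(f: f64, a: f64, b: f64) -> f64 {
    1.0 / (1.0 + (a * f + b).exp())
}

// e.g. a raw SVM score of 1.2 with calibrated a = -1.5, b = 0.1
let p = platt_probability(1.2, -1.5, 0.1);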

Improvements

Numerous improvements are added to the K-means implementation, thanks to @YuhanLiin. The implementation is optimized for offline training, an incremental training model is added, and KMeans++/KMeans|| initialization gives good initial cluster means for medium and large datasets.
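
Basic usage stays concise; a sketch of fitting and assigning clusters (dataset and number of clusters chosen for illustration):

use linfa::prelude::*;
use linfa_clustering::KMeans;

// fit three cluster centroids on the iris features
let ds = linfa_datasets::iris();
let model = KMeans::params(3).fit(&ds)?;

// assign each observation to its closest centroid
let labels = model.predict(ds.records());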

We also moved to ndarray version 0.14 and introduced F::cast for simpler floating point casting. The trait signature of linfa::Fit is changed such that it always returns a Result, and error handling is added to the linfa-logistic and linfa-reduction subcrates.
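
As a small illustration of F::cast in generic code (the helper below is hypothetical):

use linfa::Float;

// generic over linfa's Float trait: F::cast converts a primitive
// number into the generic float type F
fn scale<F: Float>(x: F, factor: f64) -> F {
    x * F::cast(factor)
}

let a = scale(1.0f32, 0.5); // works for f32
let b = scale(1.0f64, 0.5); // and for f64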

You often have to compare several model parametrizations with k-folding. For this, a new function cross_validate is added which takes the number of folds, the model parameters and a closure for the evaluation metric. It automatically performs the k-folding and averages the metric over the folds. To compare different L1 ratios of an elastic net model, you can use it in the following way:

// L1 ratios to compare
let ratios = vec![0.1, 0.2, 0.5, 0.7, 1.0];

// create a model for each parameter
let models = ratios
    .iter()
    .map(|ratio| ElasticNet::params().penalty(0.3).l1_ratio(*ratio))
    .collect::<Vec<_>>();

// get the mean r2 validation score across 5 folds for each model
let r2_values =
    dataset.cross_validate(5, &models, |prediction, truth| prediction.r2(&truth))?;

// show the mean r2 score for each parameter choice
for (ratio, r2) in ratios.iter().zip(r2_values.iter()) {
    println!("L1 ratio: {}, r2 score: {}", ratio, r2);
}

Other changes

  • fix for border points in the DBSCAN implementation
  • improved documentation of the ICA subcrate
  • prevent overflowing code example in website