Changing how we deal with Regularization (#55)
* change Matrix inversions for linear system

* correct docs to new inversion

* updated deprecated examples

* works well, but why

* allow users to invert input regularization

* format

* nicer formatting

* test pass, resolve typo, and format

* remove comment blocks
odunbar authored Apr 13, 2024
1 parent c399a6f commit 08ebbc6
Showing 13 changed files with 215 additions and 288 deletions.
6 changes: 3 additions & 3 deletions docs/src/parallelism.md
@@ -3,16 +3,16 @@
### Explicit bottlenecks
- By far the highest computational demand for high-dimensional and/or large data systems is the building of the system matrix, particularly the multiplication ``\Phi^T\Phi`` (for scalar output) and ``\Phi_{n,i,m}\Phi_{n,j,m}`` (for vector output). These can be accelerated by multithreading (see below).
- For a large number of features, the inversion `inv(factorization(system_matrix))` is noticeable, though typically still small in the regime `n_features << n_samples * output_dim`.
- For the vector case, the use of non-diagonal ``\Lambda_{i,j}`` for regularization cannot take advantage of certain matrix-tensor identities and may also cause substantial slow-down. (currently)
- For the vector case, the square output-dimensional regularization matrix ``\Lambda`` must be inverted. For high-dimensional output spaces, a diagonal approximation avoids this.
- Prediction bottlenecks are largely due to allocations and matrix multiplications. Please see our `predict!()` methods, which allow users to pass in preallocated arrays. This is very beneficial for repeated predictions.
- For systems without enough regularization, positive definiteness may need to be enforced. If done too often, this has a non-negligible cost, as it involves calculating eigenvalues of the non-p.d. matrix (in particular `abs(min(eigval))`, which is then added to the diagonal). It is better to add more regularization into ``\Lambda``; a sketch of the eigenvalue shift is given after this list.
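
A minimal sketch of that eigenvalue shift, assuming a plain symmetric matrix (the function `enforce_pd` and the tolerance `tol` are illustrative names, not part of the package API):

```julia
using LinearAlgebra

# Illustration only: make a symmetric matrix positive definite by shifting its
# diagonal by abs(min(eigval)) plus a small tolerance. Computing the smallest
# eigenvalue is the costly step referred to in the bullet above.
function enforce_pd(A::AbstractMatrix; tol = 1e-12)
    λ_min = eigmin(Symmetric(A))
    return λ_min > 0 ? A : A + (abs(λ_min) + tol) * I
end
```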

### Implicit bottlenecks
- The optimization of hyperparameters is a costly operation that may require construction and evaluation of thousands of `RandomFeatureMethod`s. The dimensionality (i.e. complexity) of this task will depend on how many free parameters are taken to be within a distribution though. ``\mathcal{O}(1000s)`` parameters may take several hours to optimize (on multiple threads).
- The optimization of hyperparameters is a costly operation that may require construction and evaluation of thousands of `RandomFeatureMethod`s. The dimensionality (i.e., complexity) of this task depends, however, on how many free parameters are taken to be within a distribution. ``\mathcal{O}(1000s)`` parameters may take hours to optimize, even on multiple threads.

# Parallelism/memory
- We make use of [`Tullio.jl`](https://github.com/mcabbott/Tullio.jl) which comes with in-built memory management. We are phasing out our own batches in favour of using this for now.
- [`Tullio.jl`(https://github.com/mcabbott/Tullio.jl) comes with multithreading routines, Simply call the code with `julia --project -t n_threads` to take advantage of this. Depending on problem size you may wish to use your own external threading, Tullio will greedily steal threads in this case. To prevent this interference we provide a keyword argument:
- [`Tullio.jl`](https://github.com/mcabbott/Tullio.jl) comes with multithreading routines. Simply call the code with `julia --project -t n_threads` to take advantage of this. Depending on problem size you may wish to use your own external threading; Tullio will greedily steal threads in this case. To prevent this interference we provide a keyword argument:
```julia
RandomFeatureMethod(... ; tullio_threading=false) # serial threading during the build and fit! methods
predict(...; tullio_threading=false) # serial threading for prediction
4 changes: 2 additions & 2 deletions docs/src/setting_up_scalar.md
@@ -111,9 +111,9 @@ The keyword `feature_parameters = Dict("sigma" => a)`, can be included to set th

The `RandomFeatureMethod` sets up the training problem to learn coefficients ``\beta\in\mathbb{R}^m`` from input-output training data ``(x,y)=\{(x_i,y_i)\}_{i=1}^n``, ``y_i \in \mathbb{R}`` and parameters ``\theta = \{\theta_j\}_{j=1}^m``:
```math
(\frac{1}{m}\Phi^T(x;\theta) \Phi(x;\theta) + \lambda I) \beta = \Phi^T(x;\theta)y
(\frac{1}{\lambda m}\Phi^T(x;\theta) \Phi(x;\theta) + I) \beta = \frac{1}{\lambda}\Phi^T(x;\theta)y
```
Where ``\lambda I`` is a regularization.
Here ``\lambda > 0`` is a regularization parameter; the system has been divided through by ``\lambda``, which turns the regularization term into the ``+I``. The equation is written in this way to be consistent with the vector-output case.
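
As a rough illustration only (assumed shapes, not the package's internal solver): with a feature matrix ``\Phi`` of size `n × m`, the system above could be solved directly as follows.

```julia
using LinearAlgebra

# Sketch: solve (Φ'Φ/(λm) + I) β = Φ'y / λ for the coefficients β.
# Φ is n × m (samples × features), y has length n, λ > 0.
function solve_scalar_coeffs(Φ::AbstractMatrix, y::AbstractVector, λ::Real)
    m = size(Φ, 2)
    lhs = (Φ' * Φ) ./ (λ * m) + I      # m × m, symmetric positive definite for λ > 0
    rhs = (Φ' * y) ./ λ
    return cholesky(Symmetric(lhs)) \ rhs
end
```
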
```julia
rfm = RandomFeatureMethod(
sf;
18 changes: 5 additions & 13 deletions docs/src/setting_up_vector.md
@@ -115,20 +115,11 @@ The keyword `feature_parameters = Dict("sigma" => a)`, can be included to set th

## Method

The `RandomFeatureMethod` sets up the training problem to learn coefficients ``\beta\in\mathbb{R}^m`` from input-output training data ``(x,y)=\{(x_i,y_i)\}_{i=1}^n``, ``y_i \in \mathbb{R}^p`` and parameters ``\theta = \{\theta_j\}_{j=1}^m``. Regularization is provided through ``\Lambda = \lambda \otimes I_{n\times n}`` from a user-provided `p-by-p` positive-definite regularization matrix ``\lambda``. In Einstein summation notation the method solves the following system
The `RandomFeatureMethod` sets up the training problem to learn coefficients ``\beta\in\mathbb{R}^m`` from input-output training data ``(x,y)=\{(x_i,y_i)\}_{i=1}^n``, ``y_i \in \mathbb{R}^p`` and parameters ``\theta = \{\theta_j\}_{j=1}^m``. Regularization is provided through ``\Lambda``, a user-provided `p-by-p` positive-definite regularization matrix (or scaled identity). In Einstein summation notation the method solves the following system
```math
(\frac{1}{m}\Phi_{n,i,p}(x;\theta) \Phi_{n,j,p}(x;\theta) + R_{i,j}) \beta_j = \Phi(x;\theta)_{n,i,p}y_{n,p}
(\frac{1}{m}\Phi_{n,i,p}(x;\theta) \, [I_{n,m} \otimes \Lambda^{-1}_{p,q}]\, \Phi_{n,j,q}(x;\theta) + I_{i,j}) \beta_j = \Phi_{n,i,p}(x;\theta)\,[ I_{n,m} \otimes \Lambda^{-1}_{p,q}]\, y_{n,q}
```
and where the regularization ``R`` is built from ``\Lambda`` by solving the non-square system
```math
\Phi_{m,j,q} R_{i,j} = \Phi_{n,i,p} \Lambda_{p,q,n,m}
```
In the case the ``\lambda`` is provided as a `Real` or `UniformScaling` then ``R`` is (consistently) replaced with ``\lambda I_{m\times m}``. The authors typically recommend replacing non-diagonal ``\lambda`` with ``\mathrm{det}(\lambda)^{\frac{1}{p}}I``, which often provides a reasonable approximation, as there is additional computational expense through solving this un-batched system of equations.

!!! note "The nonsquare system"
If one chooses to solve for ``R``, note that it is also only defined for dimensions ``m < np``. For stability we also perform the following: we use a truncated SVD for when the rank of the system is less than ``m``, and we also symmetrize the resulting matrix.


So long as ``\Lambda`` is easily invertible, this system is solved efficiently.
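
As a rough illustration (assumed array shapes; not the package implementation), the system above could be assembled as follows when ``\Lambda`` is cheap to invert:

```julia
using LinearAlgebra

# Sketch: Φ is n × m × p (samples × features × outputs), y is n × p, Λ is p × p p.d.
# Build the m × m system (1/m) Σ Φ[n,i,p] Λ⁻¹[p,q] Φ[n,j,q] + I and its right-hand side.
function solve_vector_coeffs(Φ::AbstractArray{<:Real,3}, y::AbstractMatrix, Λ::AbstractMatrix)
    n, m, p = size(Φ)
    Λinv = inv(Λ)                    # cheap when Λ is small or diagonal
    lhs = zeros(m, m) + I
    rhs = zeros(m)
    for k in 1:n
        Φk = Φ[k, :, :]              # m × p block for sample k
        lhs .+= (Φk * Λinv * Φk') ./ m
        rhs .+= Φk * (Λinv * y[k, :])
    end
    return Symmetric(lhs) \ rhs      # coefficients β
end
```
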

```julia
rfm = RandomFeatureMethod(
@@ -141,7 +132,8 @@ One can select batch sizes to balance the space-time (memory-process) trade-off.

!!! warning "Conditioning"
The problem is ill-conditioned without regularization.
If you encounters a Singular or Positive-definite exceptions, try increasing the constant scaling `regularization`
If you encounter Singular or Positive-definite exceptions or warnings, try increasing the constant scaling `regularization`.
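
For example (values purely illustrative, and `vff` standing in for an already-constructed vector feature object):

```julia
using LinearAlgebra: I
using RandomFeatures  # assumed module name for this package

# Retry the fit with a larger scaled-identity regularization, e.g. 1e-4 → 1e-2
rfm = RandomFeatureMethod(vff, regularization = 1e-2 * I)
```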




11 changes: 6 additions & 5 deletions examples/Learn_hyperparameters/1d_to_1d_regression_direct.jl
@@ -86,7 +86,7 @@ function calculate_mean_and_coeffs(
batch_inputs = batch_generator(itest, test_batch_size, dims = 2) # input_dim x batch_size

#we want to calc lambda/m * coeffs^2 in the end
pred_mean = predictive_mean(rfm, fitted_features, DataContainer(itest))
pred_mean, features = predictive_mean(rfm, fitted_features, DataContainer(itest))
scaled_coeffs = sqrt(1 / n_features) * get_coeffs(fitted_features)
return pred_mean, scaled_coeffs

@@ -152,13 +152,13 @@
## Begin Script, define problem setting

println("starting script")
date_of_run = Date(2022, 9, 14)
date_of_run = Date(2024, 4, 10)


# Target function
ftest(x::AbstractVecOrMat) = exp.(-0.5 * x .^ 2) .* (x .^ 4 - x .^ 3 - x .^ 2 + x .- 1)

n_data = 20 * 4
n_data = 20 * 2
noise_sd = 0.1

x = rand(rng, Uniform(-3, 3), n_data)
@@ -285,7 +285,7 @@ for (idx, sd, feature_type) in zip(collect(1:length(σ_c)), σ_c, feature_types)
end

push!(rfms, RandomFeatureMethod(sf, batch_sizes = batch_sizes, regularization = regularizer))
push!(fits, fit(rfms[end], io_pairs_test, decomposition_type = "qr"))
push!(fits, fit(rfms[end], io_pairs_test))
end

if PLOT_FLAG
@@ -311,7 +311,8 @@
for (idx, rfm, fit, feature_type, clr) in zip(collect(1:length(σ_c)), rfms, fits, feature_types, clrs)

pred_mean, pred_cov = predict(rfm, fit, DataContainer(xplt))
pred_cov = max.(pred_cov, 0.0)
pred_cov = pred_cov[1, 1, :] #just variances
pred_cov = max.(pred_cov, 0.0) #not normally needed..
plot!(
xplt',
pred_mean',
@@ -157,7 +157,7 @@ end
## Begin Script, define problem setting

println("starting script")
date_of_run = Date(2022, 9, 14)
date_of_run = Date(2024, 4, 10)


# Target function
@@ -254,7 +254,7 @@ for (idx, type) in enumerate(feature_types)
string(i) *
", Error: " *
string(err[i]) *
", with parameter mean" *
", with parameter mean " *
string(mean(transform_unconstrained_to_constrained(priors, get_u_final(ekiobj[1])), dims = 2)[:, 1]),
" and sd ",
string(sqrt.(var(transform_unconstrained_to_constrained(priors, get_u_final(ekiobj[1])), dims = 2))[:, 1]),
@@ -298,7 +298,7 @@ for (idx, sd, feature_type) in zip(collect(1:length(σ_c)), σ_c, feature_types)
end

push!(rfms, RandomFeatureMethod(sf, batch_sizes = batch_sizes, regularization = regularizer))
push!(fits, fit(rfms[end], io_pairs_test, decomposition_type = "qr"))
push!(fits, fit(rfms[end], io_pairs_test))
end

if PLOT_FLAG
11 changes: 6 additions & 5 deletions examples/Learn_hyperparameters/nd_to_1d_regression_direct.jl
@@ -93,13 +93,13 @@ function calculate_mean_and_coeffs(

# build and fit the RF
rfm = RFM_from_hyperparameters(rng, l, regularizer, n_features, batch_sizes, input_dim)
fitted_features = fit(rfm, io_train_cost, decomposition_type = "qr")
fitted_features = fit(rfm, io_train_cost)

test_batch_size = get_batch_size(rfm, "test")
batch_inputs = batch_generator(itest, test_batch_size, dims = 2) # input_dim x batch_size

#we want to calc 1/var(y-mean)^2 + lambda/m * coeffs^2 in the end
pred_mean = predictive_mean(rfm, fitted_features, DataContainer(itest))
pred_mean, features = predictive_mean(rfm, fitted_features, DataContainer(itest))
scaled_coeffs = sqrt(1 / n_features) * get_coeffs(fitted_features)
return pred_mean, scaled_coeffs

@@ -169,7 +169,7 @@

## Begin Script, define problem setting
println("Begin script")
date_of_run = Date(2022, 9, 14)
date_of_run = Date(2024, 4, 10)
input_dim = 6
println("Number of input dimensions: ", input_dim)

@@ -344,7 +344,7 @@ sff = ScalarFourierFeature(n_features_test, feature_sampler)
#second case with batching

rfm_batch = RandomFeatureMethod(sff, batch_sizes = batch_sizes, regularization = regularizer)
fitted_batched_features = fit(rfm_batch, io_pairs_test, decomposition_type = "qr")
fitted_batched_features = fit(rfm_batch, io_pairs_test)

if PLOT_FLAG
#plot slice through one dimensions, others fixed to 0
@@ -362,7 +362,8 @@
yslice = ftest_nd_to_1d(xslicenew)

pred_mean_slice, pred_cov_slice = predict(rfm_batch, fitted_batched_features, DataContainer(xslicenew))
pred_cov_slice = max.(pred_cov_slice, 0.0)
pred_cov_slice = max.(pred_cov_slice[1, 1, :], 0.0)

plt = plot(
xrange,
yslice',
@@ -93,7 +93,7 @@ function calculate_mean_cov_and_coeffs(

# build and fit the RF
rfm = RFM_from_hyperparameters(rng, l, regularizer, n_features, batch_sizes, input_dim)
fitted_features = fit(rfm, io_train_cost, decomposition_type = "svd")
fitted_features = fit(rfm, io_train_cost)

test_batch_size = get_batch_size(rfm, "test")
batch_inputs = batch_generator(itest, test_batch_size, dims = 2) # input_dim x batch_size
@@ -174,7 +174,7 @@

## Begin Script, define problem setting
println("Begin script")
date_of_run = Date(2022, 9, 22)
date_of_run = Date(2024, 4, 10)
input_dim_list = [8]

for input_dim in input_dim_list
@@ -368,7 +368,7 @@ for input_dim in input_dim_list
#second case with batching

rfm_batch = RandomFeatureMethod(sff, batch_sizes = batch_sizes, regularization = regularizer)
fitted_batched_features = fit(rfm_batch, io_pairs_test, decomposition_type = "svd")
fitted_batched_features = fit(rfm_batch, io_pairs_test)

if PLOT_FLAG
#plot slice through one dimensions, others fixed to 0
@@ -93,7 +93,7 @@ function calculate_mean_cov_and_coeffs(

# build and fit the RF
rfm = RFM_from_hyperparameters(rng, l, regularizer, n_features, batch_sizes, input_dim)
fitted_features = fit(rfm, io_train_cost, decomposition_type = "qr")
fitted_features = fit(rfm, io_train_cost)

test_batch_size = get_batch_size(rfm, "test")
batch_inputs = batch_generator(itest, test_batch_size, dims = 2) # input_dim x batch_size
@@ -173,7 +173,7 @@

## Begin Script, define problem setting
println("Begin script")
date_of_run = Date(2022, 9, 14)
date_of_run = Date(2024, 4, 10)
input_dim = 8
println("Number of input dimensions: ", input_dim)

@@ -352,7 +352,7 @@ sff = ScalarFourierFeature(n_features_test, feature_sampler)
#second case with batching

rfm_batch = RandomFeatureMethod(sff, batch_sizes = batch_sizes, regularization = regularizer)
fitted_batched_features = fit(rfm_batch, io_pairs_test, decomposition_type = "svd")
fitted_batched_features = fit(rfm_batch, io_pairs_test)

if PLOT_FLAG
#plot slice through one dimensions, others fixed to 0
@@ -368,9 +368,8 @@
xslicenew[direction, :] = xrange

yslice = ftest_nd_to_1d(xslicenew)

pred_mean_slice, pred_cov_slice = predict(rfm_batch, fitted_batched_features, DataContainer(xslicenew))
pred_cov_slice[1, 1, :] = max.(pred_cov_slice[1, 1, :], 0.0)
pred_cov_slice = max.(pred_cov_slice[1, 1, :], 0.0)
plt = plot(
xrange,
yslice',
@@ -384,7 +383,7 @@
plot!(
xrange,
pred_mean_slice',
ribbon = [2 * sqrt.(pred_cov_slice[1, 1, :]); 2 * sqrt.(pred_cov_slice[1, 1, :])]',
ribbon = [2 * sqrt.(pred_cov_slice); 2 * sqrt.(pred_cov_slice)]',
label = "Fourier",
color = "blue",
)
