Commit 2ce91a6

Update documentations
1 parent: ab16f3f

File tree: 2 files changed (+52, -48 lines)

DOCUMENTATION.md

Lines changed: 51 additions & 47 deletions
@@ -67,7 +67,7 @@ class YourTask(Task):
     ) -> torch.Tensor:
         # TODO: Complete this method.

-    def tracked_modules(self) -> Optional[List[str]]:
+    def get_influence_tracked_modules(self) -> Optional[List[str]]:
         # TODO: [Optional] Complete this method.
         return None  # Compute influence scores on all available modules.

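For orientation, here is a minimal sketch of how the renamed hook can be filled in. The returned module names are hypothetical placeholders, and the example assumes the `Task` base class is imported from `kronfluence.task` as in the quickstart; real names come from inspecting your own model (see the FAQ further down).

```python
from typing import List, Optional

from kronfluence.task import Task  # Assumed import path for the Task base class.


class YourTask(Task):
    # ... the loss / measurement methods from the template above are omitted here ...

    def get_influence_tracked_modules(self) -> Optional[List[str]]:
        # Return None to track every supported module, or list specific names.
        # The names below are hypothetical; take real ones from `model.named_modules()`.
        return ["fc1", "fc2"]
```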
@@ -89,7 +89,7 @@ model = prepare_model(model=model, task=task)
 ...
 ```

-If you have specified specific module names in `Task.tracked_modules`, `TrackedModule` will only be installed for these modules.
+If you have specified specific module names in `Task.get_influence_tracked_modules`, `TrackedModule` will only be installed for these modules.

 **\[Optional\] Create a DDP and FSDP Module.**
 After calling `prepare_model`, you can create [DistributedDataParallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or
@@ -140,7 +140,7 @@ Try rewriting the model so that it uses supported modules (as done for the `conv
 Alternatively, you can create a subclass of `TrackedModule` to compute influence scores for your custom module.
 If there are specific modules you would like to see supported, please submit an issue.

-**How should I write task.tracked_modules?**
+**How should I write task.get_influence_tracked_modules?**
 We recommend using all supported modules for influence computations. However, if you would like to compute influence scores
 on subset of the modules (e.g., influence computations only on MLP layers for transformer or influence computation only on the last layer),
 inspect `model.named_modules()` to determine what modules to use. You can specify the list of module names you want to analyze.
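As a quick illustration of the advice above, one way to assemble that list is to walk `model.named_modules()` and keep only the layer types you care about. This is a sketch: it assumes `model` is the `nn.Module` you pass to Kronfluence, and the `nn.Linear` filter is just one example of restricting influence to MLP layers.

```python
import torch.nn as nn

# List every module name so you can decide which ones to track.
for name, module in model.named_modules():
    print(name, "->", type(module).__name__)

# Example filter: track only Linear (MLP) layers.
tracked_module_names = [
    name for name, module in model.named_modules() if isinstance(module, nn.Linear)
]
```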
@@ -183,7 +183,7 @@ def forward(x: torch.Tensor) -> torch.Tensor:
 > [!WARNING]
 > The default arguments assume the module is used only once during the forward pass.
 > If your model shares parameters (e.g., the module is used in multiple places during the forward pass), set
-> `shared_parameters_exist=True` in `FactorArguments`.
+> `has_shared_parameters=True` in `FactorArguments`.

 **Why are there so many arguments?**
 Kronfluence was originally developed to compute influence scores on large-scale models, which is why `FactorArguments` and `ScoreArguments`
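For example, with the renamed flag, opting in for a weight-tied model looks roughly like this (a sketch; the other arguments are left at their defaults):

```python
from kronfluence.arguments import FactorArguments

# The module is used more than once per forward pass, so declare shared parameters.
factor_args = FactorArguments(strategy="ekfac", has_shared_parameters=True)
```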
@@ -204,14 +204,13 @@ from kronfluence.arguments import FactorArguments
 factor_args = FactorArguments(
     strategy="ekfac",  # Choose from "identity", "diagonal", "kfac", or "ekfac".
     use_empirical_fisher=False,
-    distributed_sync_steps=1000,
     amp_dtype=None,
-    shared_parameters_exist=False,
+    has_shared_parameters=False,

     # Settings for covariance matrix fitting.
     covariance_max_examples=100_000,
-    covariance_data_partition_size=1,
-    covariance_module_partition_size=1,
+    covariance_data_partitions=1,
+    covariance_module_partitions=1,
     activation_covariance_dtype=torch.float32,
     gradient_covariance_dtype=torch.float32,

@@ -220,10 +219,10 @@ factor_args = FactorArguments(

     # Settings for Lambda matrix fitting.
     lambda_max_examples=100_000,
-    lambda_data_partition_size=1,
-    lambda_module_partition_size=1,
-    lambda_iterative_aggregate=False,
-    cached_activation_cpu_offload=False,
+    lambda_data_partitions=1,
+    lambda_module_partitions=1,
+    use_iterative_lambda_aggregation=False,
+    offload_activations_to_cpu=False,
     per_sample_gradient_dtype=torch.float32,
     lambda_dtype=torch.float32,
 )
@@ -237,7 +236,7 @@ You can change:
 - `use_empirical_fisher`: Determines whether to use the [empirical Fisher](https://arxiv.org/abs/1905.12558) (using actual labels from batch)
 instead of the true Fisher (using sampled labels from model's predictions). It is recommended to be `False`.
 - `amp_dtype`: Selects the dtype for [automatic mixed precision (AMP)](https://pytorch.org/docs/stable/amp.html). Disables AMP if set to `None`.
-- `shared_parameters_exist`: Specifies whether the shared parameters exist in the forward pass.
+- `has_shared_parameters`: Specifies whether the shared parameters exist in the forward pass.

 ### Fitting Covariance Matrices

@@ -254,13 +253,13 @@ covariance_matrices = analyzer.load_covariance_matrices(factors_name="initial_fa
 This step corresponds to **Equation 16** in the paper. You can tune:
 - `covariance_max_examples`: Controls the maximum number of data points for fitting covariance matrices. Setting it to `None`,
 Kronfluence computes covariance matrices for all data points.
-- `covariance_data_partition_size`: Number of data partitions to use for computing covariance matrices.
-For example, when `covariance_data_partition_size = 2`, the dataset is split into 2 chunks and covariance matrices
+- `covariance_data_partitions`: Number of data partitions to use for computing covariance matrices.
+For example, when `covariance_data_partitions=2`, the dataset is split into 2 chunks and covariance matrices
 are separately computed for each chunk. These chunked covariance matrices are later aggregated. This is useful with GPU preemption as intermediate
 covariance matrices will be saved in disk. It can be also helpful when launching multiple parallel jobs, where each GPU
 can compute covariance matrices on some partitioned data (you can specify `target_data_partitions` in the parameter).
-- `covariance_module_partition_size`: Number of module partitions to use for computing covariance matrices.
-For example, when `covariance_module_partition_size = 2`, the module is split into 2 chunks and covariance matrices
+- `covariance_module_partitions`: Number of module partitions to use for computing covariance matrices.
+For example, when `covariance_module_partitions=2`, the module is split into 2 chunks and covariance matrices
 are separately computed for each chunk. This is useful when the available GPU memory is limited (e.g., the total
 covariance matrices cannot fit into GPU memory). However, this will require multiple iterations over the dataset and can be slow.
 - `activation_covariance_dtype`: `dtype` for computing activation covariance matrices. You can also use `torch.bfloat16`
@@ -271,7 +270,7 @@ or `torch.float16`.
 **Dealing with OOMs.** Here are some steps to fix Out of Memory (OOM) errors.
 1. Try reducing the `per_device_batch_size` when fitting covariance matrices.
 2. Try using lower precision for `activation_covariance_dtype` and `gradient_covariance_dtype`.
-3. Try setting `covariance_module_partition_size > 1`.
+3. Try setting `covariance_module_partitions > 1`.

 ### Performing Eigendecomposition

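To make the OOM checklist above concrete, a memory-lean covariance configuration might look like the sketch below. The specific values (`torch.bfloat16`, two module partitions) are illustrative choices, not recommendations from the documentation.

```python
import torch

from kronfluence.arguments import FactorArguments

# Memory-lean covariance fitting: lower-precision covariance dtypes and
# module partitioning so the full covariance matrices never sit on one GPU at once.
factor_args = FactorArguments(
    strategy="ekfac",
    activation_covariance_dtype=torch.bfloat16,
    gradient_covariance_dtype=torch.bfloat16,
    covariance_module_partitions=2,
)
```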
@@ -301,22 +300,22 @@ lambda_matrices = analyzer.load_lambda_matrices(factors_name="initial_factor")

 This corresponds to **Equation 20** in the paper. You can tune:
 - `lambda_max_examples`: Controls the maximum number of data points for fitting Lambda matrices.
-- `lambda_data_partition_size`: Number of data partitions to use for computing Lambda matrices.
-- `lambda_module_partition_size`: Number of module partitions to use for computing Lambda matrices.
-- `cached_activation_cpu_offload`: Computing the per-sample-gradient requires saving the intermediate activation in memory.
-You can set `cached_activation_cpu_offload=True` to cache these activations in CPU. This is helpful for dealing with OOMs, but will make the overall computation slower.
-- `lambda_iterative_aggregate`: Whether to compute the Lambda matrices with for-loops instead of batched matrix multiplications.
+- `lambda_data_partitions`: Number of data partitions to use for computing Lambda matrices.
+- `lambda_module_partitions`: Number of module partitions to use for computing Lambda matrices.
+- `offload_activations_to_cpu`: Computing the per-sample-gradient requires saving the intermediate activation in memory.
+You can set `offload_activations_to_cpu=True` to cache these activations in CPU. This is helpful for dealing with OOMs, but will make the overall computation slower.
+- `use_iterative_lambda_aggregation`: Whether to compute the Lambda matrices with for-loops instead of batched matrix multiplications.
 This is helpful for reducing peak GPU memory, as it avoids holding multiple copies of tensors with the same shape as the per-sample-gradient.
 - `per_sample_gradient_dtype`: `dtype` for computing per-sample-gradient. You can also use `torch.bfloat16`
 or `torch.float16`.
 - `lambda_dtype`: `dtype` for computing Lambda matrices. You can also use `torch.bfloat16`
-or `torch.float16`. Recommended to use `torch.float32`.
+or `torch.float16`.

 **Dealing with OOMs.** Here are some steps to fix Out of Memory (OOM) errors.
 1. Try reducing the `per_device_batch_size` when fitting Lambda matrices.
-2. Try setting `lambda_iterative_aggregate=True` or `cached_activation_cpu_offload=True`. (Try out `lambda_iterative_aggregate=True` first.)
+2. Try setting `use_iterative_lambda_aggregation=True` or `offload_activations_to_cpu=True`. (Try out `use_iterative_lambda_aggregation=True` first.)
 3. Try using lower precision for `per_sample_gradient_dtype` and `lambda_dtype`.
-4. Try using `lambda_module_partition_size > 1`.
+4. Try using `lambda_module_partitions > 1`.

 ### FAQs

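Similarly, the Lambda-fitting OOM steps can be combined in one `FactorArguments`, as sketched below with illustrative values:

```python
import torch

from kronfluence.arguments import FactorArguments

# Memory-lean Lambda fitting: try iterative aggregation first; add CPU offload if still OOM.
factor_args = FactorArguments(
    strategy="ekfac",
    use_iterative_lambda_aggregation=True,
    offload_activations_to_cpu=True,
    per_sample_gradient_dtype=torch.bfloat16,
    lambda_dtype=torch.float32,
)
```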
@@ -339,21 +338,24 @@ import torch
 from kronfluence.arguments import ScoreArguments

 score_args = ScoreArguments(
-    damping=1e-08,
-    cached_activation_cpu_offload=False,
-    distributed_sync_steps=1000,
+    damping_factor=1e-08,
     amp_dtype=None,
+    offload_activations_to_cpu=False,

     # More functionalities to compute influence scores.
-    data_partition_size=1,
-    module_partition_size=1,
-    per_module_score=False,
+    data_partitions=1,
+    module_partitions=1,
+    compute_per_module_scores=False,
+    compute_per_token_scores=False,
     use_measurement_for_self_influence=False,
+    aggregate_query_gradients=False,
+    aggregate_train_gradients=False,

     # Configuration for query batching.
-    query_gradient_rank=None,
+    query_gradient_low_rank=None,
+    use_full_svd=False,
     query_gradient_svd_dtype=torch.float32,
-    num_query_gradient_accumulations=1,
+    query_gradient_accumulation_steps=1,

     # Configuration for dtype.
     score_dtype=torch.float32,
@@ -362,23 +364,25 @@ score_args = ScoreArguments(
 )
 ```

-- `damping`: A damping factor for the damped inverse Hessian-vector product (iHVP). Uses a heuristic based on mean eigenvalues
+- `damping_factor`: A damping factor for the damped inverse Hessian-vector product (iHVP). Uses a heuristic based on mean eigenvalues
 `(0.1 x mean eigenvalues)` if `None`, as done in [this paper](https://arxiv.org/abs/2308.03296).
-- `cached_activation_cpu_offload`: Whether to offload cached activations to CPU.
 - `amp_dtype`: Selects the dtype for [automatic mixed precision (AMP)](https://pytorch.org/docs/stable/amp.html). Disables AMP if set to `None`.
-- `data_partition_size`: Number of data partitions for computing influence scores.
-- `module_partition_size`: Number of module partitions for computing influence scores.
-- `per_module_score`: Whether to return a per-module influence scores. Instead of summing over influences across
+- `offload_activations_to_cpu`: Whether to offload cached activations to CPU.
+- `data_partitions`: Number of data partitions for computing influence scores.
+- `module_partitions`: Number of module partitions for computing influence scores.
+- `compute_per_module_scores`: Whether to return a per-module influence scores. Instead of summing over influences across
 all modules, this will keep track of intermediate module-wise scores.
-- - `use_measurement_for_self_influence`: Whether to use the measurement (instead of the loss) when computing self-influence scores.
-- `query_gradient_rank`: The rank for the query batching (low-rank approximation to the preconditioned query gradient; see **Section 3.2.2**). If `None`, no query batching will be used.
+- `compute_per_token_scores`: Whether to return a per-token influence scores. Only applicable to transformer-based models.
+- `aggregate_query_gradients`: Whether to use the summed query gradient instead of per-sample query gradients.
+- `aggregate_train_gradients`: Whether to use the summed training gradient instead of per-sample training gradients.
+- `use_measurement_for_self_influence`: Whether to use the measurement (instead of the loss) when computing self-influence scores.
+- `query_gradient_low_rank`: The rank for the query batching (low-rank approximation to the preconditioned query gradient; see **Section 3.2.2**). If `None`, no query batching will be used.
 - `query_gradient_svd_dtype`: `dtype` for performing singular value decomposition (SVD) for query batch. You can also use `torch.float64`.
-- `num_query_gradient_accumulations`: Number of query gradients to accumulate over. For example, when `num_query_gradient_accumulations=2` with
+- `query_gradient_accumulation_steps`: Number of query gradients to accumulate over. For example, when `query_gradient_accumulation_steps=2` with
 `query_batch_size=16`, a total of 32 query gradients will be stored in memory when computing dot products with training gradients.
 - `score_dtype`: `dtype` for computing influence scores. You can use `torch.bfloat16` or `torch.float16`.
 - `per_sample_gradient_dtype`: `dtype` for computing per-sample-gradient. You can use `torch.bfloat16` or `torch.float16`.
-- `precondition_dtype`: `dtype` for performing preconditioning. You can use `torch.bfloat16` or `torch.float16`,
-but `torch.float32` is recommended.
+- `precondition_dtype`: `dtype` for performing preconditioning. You can use `torch.bfloat16` or `torch.float16`.

 ### Computing Influence Scores

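To show the renamed query-batching knobs in context, the sketch below enables low-rank query batching; `64` is just one of the ranks suggested later in the OOM tips, and the remaining arguments keep their defaults.

```python
import torch

from kronfluence.arguments import ScoreArguments

# Low-rank query batching: approximate each preconditioned query gradient with
# rank 64 and accumulate two query batches before taking dot products.
score_args = ScoreArguments(
    damping_factor=None,  # Fall back to the 0.1 x mean-eigenvalue heuristic.
    query_gradient_low_rank=64,
    query_gradient_accumulation_steps=2,
    query_gradient_svd_dtype=torch.float32,
)
```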
@@ -409,12 +413,12 @@ vector will correspond to `g_m^T ⋅ H^{-1} ⋅ g_l`, where `g_m` is the gradien

 **Dealing with OOMs.** Here are some steps to fix Out of Memory (OOM) errors.
 1. Try reducing the `per_device_query_batch_size` or `per_device_train_batch_size`.
-2. Try setting `cached_activation_cpu_offload=True`.
+2. Try setting `offload_activations_to_cpu=True`.
 3. Try using lower precision for `per_sample_gradient_dtype` and `score_dtype`.
 4. Try using lower precision for `precondition_dtype`.
-5. Try setting `query_gradient_rank > 1`. The recommended values are `16`, `32`, `64`, `128`, and `256`. Note that query
+5. Try setting `query_gradient_low_rank > 1`. The recommended values are `16`, `32`, `64`, `128`, and `256`. Note that query
 batching is only supported for computing pairwise influence scores, not self-influence scores.
-6. Try setting `module_partition_size > 1`.
+6. Try setting `module_partitions > 1`.

 ### FAQs

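And the scoring-time OOM steps map onto `ScoreArguments` roughly as follows (again a sketch with illustrative values, not settings prescribed by the documentation):

```python
import torch

from kronfluence.arguments import ScoreArguments

# Memory-lean influence scoring: offload activations, partition modules, and
# lower per-sample-gradient/score precision while keeping preconditioning in fp32.
score_args = ScoreArguments(
    offload_activations_to_cpu=True,
    module_partitions=2,
    per_sample_gradient_dtype=torch.bfloat16,
    score_dtype=torch.bfloat16,
    precondition_dtype=torch.float32,
)
```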
README.md

Lines changed: 1 addition & 1 deletion
@@ -182,7 +182,7 @@ Please address any reported issues before submitting your PR.
 ## Acknowledgements

 [Omkar Dige](https://github.com/xeon27) contributed to the profiling, DDP, and FSDP utilities, and [Adil Asif](https://github.com/adil-a/) provided valuable insights and suggestions on structuring the DDP and FSDP implementations.
-I also thank Hwijeen Ahn, Sang Keun Choe, Youngseog Chung, Minsoo Kang, Lev McKinney, Laura Ruis, Andrew Wang, and Kewen Zhao for their feedback.
+I also thank Hwijeen Ahn, Sang Keun Choe, Youngseog Chung, Minsoo Kang, Sophie Liao, Lev McKinney, Laura Ruis, Andrew Wang, and Kewen Zhao for their feedback.

 ## License
