@@ -143,6 +143,60 @@ video = pipe("<my-awesome-prompt>").frames[0]
export_to_video(video, "output.mp4", fps=8)
```

+### Memory Usage
+
+LoRA with rank 128, batch size 1, gradient checkpointing, the AdamW optimizer, and `49x512x768` resolution, **without precomputation**:
+
+```
+Training configuration: {
+  "trainable parameters": 117440512,
+  "total samples": 69,
+  "train epochs": 1,
+  "train steps": 10,
+  "batches per device": 1,
+  "total batches observed per epoch": 69,
+  "train batch size": 1,
+  "gradient accumulation steps": 1
+}
+```
+
+| stage                    | memory_allocated (GB) | max_memory_reserved (GB) |
+|:------------------------:|:---------------------:|:------------------------:|
+| before training start    | 13.486                | 13.879                   |
+| before validation start  | 14.146                | 17.623                   |
+| after validation end     | 14.146                | 17.623                   |
+| after epoch 1            | 14.146                | 17.623                   |
+| after training end       | 4.461                 | 17.623                   |
+
+Note: requires about `18` GB of VRAM without precomputation.
+
+LoRA with rank 128, batch size 1, gradient checkpointing, the AdamW optimizer, and `49x512x768` resolution, **with precomputation**:
+
+```
+Training configuration: {
+  "trainable parameters": 117440512,
+  "total samples": 1,
+  "train epochs": 10,
+  "train steps": 10,
+  "batches per device": 1,
+  "total batches observed per epoch": 1,
+  "train batch size": 1,
+  "gradient accumulation steps": 1
+}
+```
+
+| stage                          | memory_allocated (GB) | max_memory_reserved (GB) |
+|:------------------------------:|:---------------------:|:------------------------:|
+| after precomputing conditions  | 8.880                 | 8.920                    |
+| after precomputing latents     | 9.684                 | 11.613                   |
+| before training start          | 3.809                 | 10.010                   |
+| after epoch 1                  | 4.260                 | 10.916                   |
+| before validation start        | 4.260                 | 10.916                   |
+| after validation end           | 13.924                | 17.262                   |
+| after training end             | 4.260                 | 14.314                   |
+
+Note: requires about `17.5` GB of VRAM with precomputation. If validation is not performed, the memory usage is reduced to about `11` GB.
+
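The GB figures in these tables are byte counts divided by `1024**3`. As a minimal sketch of how such per-stage snapshots can be collected, assuming PyTorch's CUDA memory APIs (`bytes_to_gb` and `snapshot` are hypothetical helpers, not part of this repo):

```python
def bytes_to_gb(num_bytes: int) -> float:
    """Convert a byte count to GiB, rounded to three decimals as in the tables."""
    return round(num_bytes / 1024**3, 3)


def snapshot(stage: str) -> dict:
    """Record memory_allocated / max_memory_reserved for one training stage."""
    import torch  # deferred import; requires a CUDA-enabled PyTorch build

    return {
        "stage": stage,
        "memory_allocated": bytes_to_gb(torch.cuda.memory_allocated()),
        "max_memory_reserved": bytes_to_gb(torch.cuda.max_memory_reserved()),
    }
```

Calling `snapshot("before training start")`, `snapshot("after epoch 1")`, and so on at the corresponding points of the training loop yields rows like those above.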
</details>

<details>
@@ -169,8 +223,7 @@ OUTPUT_DIR="/path/to/models/hunyuan-video/hunyuan-video-loras/hunyuan-video_caki

# Model arguments
model_cmd="--model_name hunyuan_video \
-  --pretrained_model_name_or_path tencent/HunyuanVideo
-  --revision refs/pr/18"
+  --pretrained_model_name_or_path hunyuanvideo-community/HunyuanVideo"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
@@ -252,7 +305,7 @@ import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

-model_id = "tencent/HunyuanVideo"
+model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
@@ -272,10 +325,70 @@ output = pipe(
export_to_video(output, "output.mp4", fps=15)
```

+### Memory Usage
+
+LoRA with rank 128, batch size 1, gradient checkpointing, the AdamW optimizer, and `49x512x768` resolution, **without precomputation**:
+
+```
+Training configuration: {
+  "trainable parameters": 163577856,
+  "total samples": 69,
+  "train epochs": 1,
+  "train steps": 10,
+  "batches per device": 1,
+  "total batches observed per epoch": 69,
+  "train batch size": 1,
+  "gradient accumulation steps": 1
+}
+```
+
+| stage                    | memory_allocated (GB) | max_memory_reserved (GB) |
+|:------------------------:|:---------------------:|:------------------------:|
+| before training start    | 38.889                | 39.020                   |
+| before validation start  | 39.747                | 56.266                   |
+| after validation end     | 39.748                | 58.385                   |
+| after epoch 1            | 39.748                | 40.910                   |
+| after training end       | 25.288                | 40.910                   |
+
+Note: requires about `59` GB of VRAM without precomputation.
+
+LoRA with rank 128, batch size 1, gradient checkpointing, the AdamW optimizer, and `49x512x768` resolution, **with precomputation**:
+
+```
+Training configuration: {
+  "trainable parameters": 163577856,
+  "total samples": 1,
+  "train epochs": 10,
+  "train steps": 10,
+  "batches per device": 1,
+  "total batches observed per epoch": 1,
+  "train batch size": 1,
+  "gradient accumulation steps": 1
+}
+```
+
+| stage                          | memory_allocated (GB) | max_memory_reserved (GB) |
+|:------------------------------:|:---------------------:|:------------------------:|
+| after precomputing conditions  | 14.232                | 14.461                   |
+| after precomputing latents     | 14.717                | 17.244                   |
+| before training start          | 24.195                | 26.039                   |
+| after epoch 1                  | 24.830                | 42.387                   |
+| before validation start        | 24.842                | 42.387                   |
+| after validation end           | 39.558                | 46.947                   |
+| after training end             | 24.842                | 41.039                   |
+
+Note: requires about `47` GB of VRAM with precomputation. If validation is not performed, the memory usage is reduced to about `42` GB.
+
</details>

If you would like to use a custom dataset, refer to the dataset preparation guide [here](./assets/dataset.md).

+> [!NOTE]
+> To lower memory requirements:
+> - Pass `--precompute_conditions` when launching training.
+> - Pass `--gradient_checkpointing` when launching training.
+> - Do not perform validation/testing. Skipping it saves a significant amount of memory, letting you focus solely on training on smaller-VRAM GPUs.
+
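Putting the memory-saving flags above together, a hypothetical launch might look like the following sketch (the `train.py` entrypoint and paths are placeholders; adapt them to the actual training commands shown earlier):

```shell
# Sketch only: combine the memory-saving flags with your usual launch command.
# Omitting any validation arguments skips validation, which saves the most memory.
accelerate launch train.py \
  --model_name hunyuan_video \
  --pretrained_model_name_or_path hunyuanvideo-community/HunyuanVideo \
  --precompute_conditions \
  --gradient_checkpointing
```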
## Memory requirements

<table align="center">