Reduce the device bubble introduced by heavy loop synchronization in coalesced fetch/release(z3_leaf_module) #6694
base: master
Conversation
…pSpeed into z3_coalesced_fetch (d9c3687 to 663e637)
Hi, @tjruwase. Some models use a large number of MoE experts while the total number of MoE parameters stays unchanged. This PR handles such cases as a whole, optimizing memory management and reducing the bubbles caused by long loops. Do you have any suggestions?
```python
for part_to_copy in partitions:
    if not get_accelerator().is_synchronized_device():
        part_to_copy.record_stream(get_accelerator().current_stream())
if handle_dependency:
```
Would the following not work?
```diff
- if handle_dependency:
+ if handle_dependency and not get_accelerator().is_synchronized_device():
```
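The rationale behind merging the two conditions can be sketched as follows. On a synchronized device (e.g. CPU), `record_stream` bookkeeping is unnecessary, so the dependency-handling branch can be short-circuited by the same check. The accelerator classes below are illustrative stand-ins, not DeepSpeed's actual accelerator API:

```python
# Minimal sketch of the merged guard from the review suggestion.
# SyncAccelerator / AsyncAccelerator are hypothetical stand-ins for
# get_accelerator() on CPU-like and CUDA-like devices respectively.

class SyncAccelerator:
    def is_synchronized_device(self):
        return True   # e.g. CPU: kernels complete in order, no stream tracking needed

class AsyncAccelerator:
    def is_synchronized_device(self):
        return False  # e.g. CUDA: async streams require lifetime bookkeeping

def needs_dependency_handling(accel, handle_dependency=True):
    # Combined condition: skip dependency handling entirely when the
    # device is synchronized, since record_stream would be a no-op.
    return handle_dependency and not accel.is_synchronized_device()

print(needs_dependency_handling(SyncAccelerator()))   # False: skip on synchronized devices
print(needs_dependency_handling(AsyncAccelerator()))  # True: keep for async devices
```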
Depends on #6649.
When performing fetch/release operations on Z3 leaf modules, the loop time is excessively long for fine-grained modules. Compared to non-leaf modules, a Z3 leaf module may contain a much larger number of parameters. Although each loop iteration is cheap, the total loop length can be significant.
The fetch time is impacted by:
- Post-allgather operations (narrow, slice, cat; difficult to avoid)
- Memory pressure (record_stream / fetch event creation & sync)
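The post-allgather cost can be sketched as follows: each rank holds an equally sized (possibly padded) partition, and recovering a parameter requires a cat across partitions followed by a narrow/slice to drop the padding. Plain Python lists stand in for tensors here; this is an illustrative model, not DeepSpeed's actual implementation:

```python
# Hypothetical sketch of post-allgather reassembly. Each per-rank
# partition is the same length, with the last one zero-padded so the
# total is divisible by the world size.

def reassemble(partitions, param_numel):
    flat = []
    for part in partitions:       # 'cat': stitch per-rank partitions together
        flat.extend(part)
    return flat[:param_numel]     # 'narrow'/'slice': trim alignment padding

# A 3-element parameter sharded across 2 ranks (second partition padded):
print(reassemble([[1, 2], [3, 0]], 3))  # [1, 2, 3]
```

Per parameter this is cheap, but a leaf module with hundreds of small expert parameters repeats it hundreds of times on the CPU, which is where the bubble comes from.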
The release time is impacted by:
- slice
- record_stream on freed parameters
For fine-grained leaf modules, where each parameter is relatively small, we can treat the parameters within each leaf module as a single unified entity when handling memory pressure. This approach roughly halves the CPU time required for fetch/release operations.
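The coalescing idea can be sketched as follows, under the assumption that a leaf module's parameters share one contiguous backing buffer: stream-lifetime bookkeeping is recorded once for the buffer instead of once per parameter. The classes and counts below are illustrative, not DeepSpeed's actual code:

```python
# Sketch: per-parameter vs. coalesced stream bookkeeping.
# FakeStream is a hypothetical stand-in for
# get_accelerator().current_stream(); it only counts record calls.

class FakeStream:
    def __init__(self):
        self.records = 0
    def record(self):
        self.records += 1

def release_per_param(params, stream):
    # Baseline: one bookkeeping call per parameter, so CPU time grows
    # linearly with the number of parameters in the leaf module.
    for _ in params:
        stream.record()

def release_coalesced(params, stream):
    # Coalesced: a single call covers the whole leaf module's buffer.
    stream.record()

s1, s2 = FakeStream(), FakeStream()
params = list(range(256))        # e.g. 256 expert parameters in one leaf module
release_per_param(params, s1)
release_coalesced(params, s2)
print(s1.records, s2.records)    # 256 vs 1 bookkeeping calls
```

The same collapsing applies on the fetch side (one event create/sync per leaf module rather than per parameter), which is what yields the roughly halved CPU time described above.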