
adds training rules update #552

Merged (6 commits, Feb 20, 2025)

Conversation

@Elnifio (Contributor) commented Feb 7, 2025

No description provided.

@Elnifio Elnifio requested review from a team as code owners February 7, 2025 07:15

github-actions bot commented Feb 7, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@ShriyaPalsamudram (Contributor) commented Feb 7, 2025

@Elnifio Can we not delete the gpt3 rows, and instead add a separate row for llama3.1 405b in all the tables? You could move the gpt3 rows to the bottom section of the tables and leave their latest available version as 4.1, whereas llama3.1 405b and most others would then become v5.0.

@Elnifio (Contributor, Author) commented Feb 7, 2025

@ShriyaPalsamudram Thanks for the suggestion! I have added back GPT3 in this commit.

* GBS in (3072,4096) - opt_base_learning_rate=2.0e-5 or opt_base_learning_rate=3.0e-5
* GBS in [4096,8192] - opt_base_learning_rate=3.0e-5
* GBS<1152 or GBS>9216 - a new RCP needs to be generated; reach out to the task force
* GBS within range but not listed above - opt_base_learning_rate = 8e-5 * (GBS / 1152)
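The selection rules above can be sketched as a small helper. The function name is hypothetical, and where the rules allow a choice (the (3072,4096) band permits either value) this sketch arbitrarily picks 3.0e-5:

```python
def opt_base_learning_rate(gbs: int) -> float:
    """Pick opt_base_learning_rate for a given global batch size (GBS),
    per the rules listed in the comment above. Illustrative sketch only."""
    if gbs < 1152 or gbs > 9216:
        # Outside the covered range: a new RCP must be generated.
        raise ValueError("new RCP needed; reach out to the task force")
    if 3072 < gbs < 4096:
        return 3.0e-5  # the rules also permit 2.0e-5 here
    if 4096 <= gbs <= 8192:
        return 3.0e-5
    # GBS within [1152, 9216] but not in a listed band: scaling formula.
    return 8e-5 * (gbs / 1152)
```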
Review comment (Contributor):

Should we also have a rounding rule that specifies the number of decimal places to use for the LR?

@@ -576,7 +573,7 @@ Reference Convergence Points are used to ensure that the convergence of the subm
* For GPT3 where there are two reference implementations which have been verified to be equivalent with minimum variance, each reference implementation should provide at least N epoch convergence numbers for each RCP.
* After a set of Reference Convergence Points is gathered, we find the minimal set of these points that are needed for the fastest possible convergence. For example, if the RCP for batch size 128 is at 10 epochs, the RCP for batch size 256 is at 20 epochs, and the RCP for batch size 512 is also at 20 epochs, then we prune the RCP at the 256 batch size. Based on the assumption that convergence increases with batch size, we expect to be able to converge faster than 20 epochs at batch size 256. In practice we prune ALL RCP points that have slower convergence than the linear interpolation at the same batch size of any two surrounding points. Eventually we end up with a pruned set of RCPs which defines the fastest possible convergence of the reference code as a function of batch size.
* A potential submitter can request generation of new RCPs by suggesting a better set of hparams to the WG or generate new RCPs by running the reference themselves. A request for a new RCP run should be backed by at least one run on either the submitter’s code or the reference code proving faster convergence. A request to generate RCPs should be made in the Training WG meeting at least 8 weeks before submission deadline and the reference owner (or a volunteer appointed by WG) should provide the RCP at least 4 weeks before submission deadline. Subject to WG's approval, requester's set of convergence points (2N runs) may act as temporary RCPs for that round if the RCP request is not met by a timely response.
* For GPT3, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before submission deadline and both reference owners (NV and Google) should provide RCPs (N runs each) at least 5 weeks before submission deadline so that all submitters have enough time to train with the new hparams. The RCP requests should be handled in FCFS order and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.
* For Llama31_405B, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before submission deadline and the reference owner (NVIDIA) should provide RCPs (N runs each) at least 5 weeks before submission deadline so that all submitters have enough time to train with the new hparams. The RCP requests should be handled in FCFS order and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.
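The RCP pruning step described in the hunk above (drop any point whose convergence is slower than the linear interpolation of its surrounding points) can be sketched as follows. This is a simplified illustration, not the official RCP checker; the function name is an assumption:

```python
def prune_rcps(rcps):
    """rcps: list of (batch_size, epochs_to_converge) pairs.
    Repeatedly remove any interior point whose epoch count is slower
    (higher) than the linear interpolation between its neighbors,
    leaving the fastest-possible-convergence frontier."""
    pts = sorted(rcps)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(pts) - 1):
            (b0, e0), (b1, e1), (b2, e2) = pts[i - 1], pts[i], pts[i + 1]
            # Epochs predicted by linear interpolation at batch size b1.
            interp = e0 + (e2 - e0) * (b1 - b0) / (b2 - b0)
            if e1 > interp:  # slower than interpolation: prune it
                del pts[i]
                changed = True
                break
    return pts
```

On the worked example from the text, (128, 10 epochs), (256, 20), (512, 20), the point at batch size 256 is pruned because 20 epochs is slower than the roughly 13.3 epochs interpolated between its neighbors.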
Review comment (Contributor):

should provide RCPs (N runs each) -> should provide RCPs (2N runs each)

@@ -576,7 +610,7 @@ Reference Convergence Points are used to ensure that the convergence of the subm
* For GPT3 where there are two reference implementations which have been verified to be equivalent with minimum variance, each reference implementation should provide at least N epoch convergence numbers for each RCP.
* After a set of Reference Convergence Points is gathered, we find the minimal set of these points that are needed for the fastest possible convergence. For example, if the RCP for batch size 128 is at 10 epochs, the RCP for batch size 256 is at 20 epochs, and the RCP for batch size 512 is also at 20 epochs, then we prune the RCP at the 256 batch size. Based on the assumption that convergence increases with batch size, we expect to be able to converge faster than 20 epochs at batch size 256. In practice we prune ALL RCP points that have slower convergence than the linear interpolation at the same batch size of any two surrounding points. Eventually we end up with a pruned set of RCPs which defines the fastest possible convergence of the reference code as a function of batch size.
* A potential submitter can request generation of new RCPs by suggesting a better set of hparams to the WG or generate new RCPs by running the reference themselves. A request for a new RCP run should be backed by at least one run on either the submitter’s code or the reference code proving faster convergence. A request to generate RCPs should be made in the Training WG meeting at least 8 weeks before submission deadline and the reference owner (or a volunteer appointed by WG) should provide the RCP at least 4 weeks before submission deadline. Subject to WG's approval, requester's set of convergence points (2N runs) may act as temporary RCPs for that round if the RCP request is not met by a timely response.
* For GPT3, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before submission deadline and both reference owners (NV and Google) should provide RCPs (N runs each) at least 5 weeks before submission deadline so that all submitters have enough time to train with the new hparams. The RCP requests should be handled in FCFS order and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.
* For GPT3 and Llama31_405B, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before submission deadline and the reference owner (NVIDIA) should provide RCPs (N runs each) at least 5 weeks before submission deadline so that all submitters have enough time to train with the new hparams. The RCP requests should be handled in FCFS order and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.
Review comment (Contributor):

Propose to separate gpt3 and llama3.1 405b as follows:

  • For GPT3, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before submission deadline and both reference owners (NV and Google) should provide RCPs (N runs each) at least 5 weeks before submission deadline so that all submitters have enough time to train with the new hparams. The RCP requests should be handled in FCFS order and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.
  • For Llama31_405B, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before submission deadline and the reference owner (NVIDIA) should provide RCPs (2N runs each) at least 5 weeks before submission deadline so that all submitters have enough time to train with the new hparams. The RCP requests should be handled in FCFS order and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.

@ShriyaPalsamudram ShriyaPalsamudram merged commit b55dde3 into mlcommons:master Feb 20, 2025
1 check passed
github-actions bot locked and limited conversation to collaborators Feb 20, 2025