adds training rules update #552
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
@Elnifio Can we not delete the gpt3 rows and instead add a separate row in all the tables for llama3.1 405b? You can move the gpt3 rows to the bottom section of the tables and leave their latest available version as v4.1, whereas llama3.1 405b and most others would then become v5.0.
@ShriyaPalsamudram Thanks for the suggestion! I have added back GPT3 in this commit.
training_rules.adoc
Outdated
* GBS (3072,4096) - opt_base_learning_rate=2.0e-5 or opt_base_learning_rate=3.0e-5
* GBS [4096,8192] - opt_base_learning_rate=3.0e-5
* GBS<1152 or GBS>9216 - a new RCP needs to be generated; reach out to the task force
* GBS within range but not listed - opt_base_learning_rate = 8e-5 * (GBS / 1152)
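To make the selection rule above concrete, here is a minimal Python sketch. This is a hypothetical helper, not part of the reference code, and the handling of the interval endpoints is an assumption where the bullet text is ambiguous:

```python
def opt_base_learning_rate(gbs: int) -> float:
    """Pick opt_base_learning_rate from the global batch size (GBS),
    mirroring the proposed bullets above. Hypothetical helper; the
    endpoint handling of the intervals is an assumption."""
    if gbs < 1152 or gbs > 9216:
        raise ValueError("Out of range: a new RCP must be generated; "
                         "reach out to the task force.")
    if 3072 < gbs < 4096:
        return 2.0e-5  # the rule also allows 3.0e-5 for this range
    if 4096 <= gbs <= 8192:
        return 3.0e-5
    # GBS within [1152, 9216] but not covered by a listed range:
    return 8e-5 * (gbs / 1152)
```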
Should we also have a rounding rule and specify number of decimal places to use for LR?
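If such a rule were adopted, one illustrative option is rounding to a fixed number of significant digits. The choice of two digits below is an assumption, not something the current text specifies:

```python
def round_lr(lr: float, sig_digits: int = 2) -> float:
    """Round a learning rate to `sig_digits` significant digits by
    round-tripping through scientific notation."""
    return float(f"{lr:.{sig_digits - 1}e}")

# e.g. the interpolation rule at GBS=2304 gives 8e-5 * 2 = 1.6e-4:
assert round_lr(8e-5 * (2304 / 1152)) == 1.6e-4
```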
training_rules.adoc
Outdated
@@ -576,7 +573,7 @@ Reference Convergence Points are used to ensure that the convergence of the subm
* For GPT3, where there are two reference implementations that have been verified to be equivalent with minimal variance, each reference implementation should provide at least N epoch convergence numbers for each RCP.
* After a set of Reference Convergence Points is gathered, we find the minimal set of these points that are needed for the fastest possible convergence. For example, if the RCP for batch size 128 is at 10 epochs, the RCP for batch size 256 is at 20 epochs, and the RCP for batch size 512 is also at 20 epochs, then we prune the RCP at batch size 256. Based on the assumption that the number of epochs to converge increases with batch size, we expect to be able to converge faster than 20 epochs at batch size 256. In practice we prune ALL RCP points that have slower convergence than the linear interpolation, at the same batch size, between any two surrounding points. Eventually we end up with a pruned set of RCPs which defines the fastest possible convergence of the reference code as a function of batch size.
* A potential submitter can request generation of new RCPs by suggesting a better set of hparams to the WG, or generate new RCPs by running the reference themselves. A request for a new RCP run should be backed by at least one run on either the submitter's code or the reference code proving faster convergence. A request to generate RCPs should be made in the Training WG meeting at least 8 weeks before the submission deadline, and the reference owner (or a volunteer appointed by the WG) should provide the RCP at least 4 weeks before the submission deadline. Subject to the WG's approval, the requester's set of convergence points (2N runs) may act as temporary RCPs for that round if the RCP request is not met by a timely response.
* For GPT3, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before the submission deadline, and both reference owners (NV and Google) should provide RCPs (N runs each) at least 5 weeks before the submission deadline so that all submitters have enough time to train with the new hparams. RCP requests should be handled in FCFS order, and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.
* For Llama31_405B, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before the submission deadline, and the reference owner (NVIDIA) should provide RCPs (N runs each) at least 5 weeks before the submission deadline so that all submitters have enough time to train with the new hparams. RCP requests should be handled in FCFS order, and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.
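The pruning step described in the diff above amounts to a simple interpolation check. Here is a sketch under assumed names and data layout; the actual RCP checker lives in the MLCommons logging tooling and may differ in detail:

```python
def prune_rcps(rcps: dict[int, float]) -> dict[int, float]:
    """Prune RCPs (batch_size -> epochs to converge) that converge more
    slowly than the linear interpolation, in batch size, between any two
    surrounding points. Illustrative only."""
    points = sorted(rcps.items())
    kept = []
    for i, (bs, epochs) in enumerate(points):
        pruned = False
        # Compare against the interpolation of every surrounding pair.
        for j in range(i):
            for k in range(i + 1, len(points)):
                bs_lo, ep_lo = points[j]
                bs_hi, ep_hi = points[k]
                interp = ep_lo + (ep_hi - ep_lo) * (bs - bs_lo) / (bs_hi - bs_lo)
                if epochs > interp:  # slower than interpolated convergence
                    pruned = True
        if not pruned:
            kept.append((bs, epochs))
    return dict(kept)

# The example from the rules: the 256-batch RCP (20 epochs) is pruned
# because interpolating between 128 (10 epochs) and 512 (20 epochs)
# predicts ~13.3 epochs at batch size 256.
assert prune_rcps({128: 10, 256: 20, 512: 20}) == {128: 10, 512: 20}
```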
should provide RCPs (N runs each) -> should provide RCPs (2N runs each)
training_rules.adoc
Outdated
@@ -576,7 +610,7 @@ Reference Convergence Points are used to ensure that the convergence of the subm
* For GPT3, where there are two reference implementations that have been verified to be equivalent with minimal variance, each reference implementation should provide at least N epoch convergence numbers for each RCP.
* After a set of Reference Convergence Points is gathered, we find the minimal set of these points that are needed for the fastest possible convergence. For example, if the RCP for batch size 128 is at 10 epochs, the RCP for batch size 256 is at 20 epochs, and the RCP for batch size 512 is also at 20 epochs, then we prune the RCP at batch size 256. Based on the assumption that the number of epochs to converge increases with batch size, we expect to be able to converge faster than 20 epochs at batch size 256. In practice we prune ALL RCP points that have slower convergence than the linear interpolation, at the same batch size, between any two surrounding points. Eventually we end up with a pruned set of RCPs which defines the fastest possible convergence of the reference code as a function of batch size.
* A potential submitter can request generation of new RCPs by suggesting a better set of hparams to the WG, or generate new RCPs by running the reference themselves. A request for a new RCP run should be backed by at least one run on either the submitter's code or the reference code proving faster convergence. A request to generate RCPs should be made in the Training WG meeting at least 8 weeks before the submission deadline, and the reference owner (or a volunteer appointed by the WG) should provide the RCP at least 4 weeks before the submission deadline. Subject to the WG's approval, the requester's set of convergence points (2N runs) may act as temporary RCPs for that round if the RCP request is not met by a timely response.
* For GPT3, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before the submission deadline, and both reference owners (NV and Google) should provide RCPs (N runs each) at least 5 weeks before the submission deadline so that all submitters have enough time to train with the new hparams. RCP requests should be handled in FCFS order, and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.
* For GPT3 and Llama31_405B, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before the submission deadline, and the reference owner (NVIDIA) should provide RCPs (N runs each) at least 5 weeks before the submission deadline so that all submitters have enough time to train with the new hparams. RCP requests should be handled in FCFS order, and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.
I propose separating GPT3 and Llama3.1 405B as follows:
- For GPT3, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before submission deadline and both reference owners (NV and Google) should provide RCPs (N runs each) at least 5 weeks before submission deadline so that all submitters have enough time to train with the new hparams. The RCP requests should be handled in FCFS order and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.
- For Llama31_405B, a request to generate RCPs should be made in the Training WG meeting at least 9 weeks before submission deadline and the reference owner (NVIDIA) should provide RCPs (2N runs each) at least 5 weeks before submission deadline so that all submitters have enough time to train with the new hparams. The RCP requests should be handled in FCFS order and if there are more than 5 RCP requests, the WG should decide if the requester's set of convergence points (2N runs) can be used as temporary RCPs.
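Purely as an illustration of the two timelines in this proposal (the function name and the example date below are hypothetical, and both benchmarks share the same 9-week/5-week offsets):

```python
from datetime import date, timedelta

def rcp_deadlines(submission_deadline: date, benchmark: str) -> tuple[date, date]:
    """Return (latest RCP request date, latest RCP delivery date) for the
    large-LLM benchmarks under the proposed text: requests at least 9 weeks
    out, RCP delivery at least 5 weeks out. Hypothetical helper."""
    assert benchmark in {"gpt3", "llama31_405b"}
    request_by = submission_deadline - timedelta(weeks=9)
    deliver_by = submission_deadline - timedelta(weeks=5)
    return request_by, deliver_by

# e.g. a hypothetical 2025-05-30 deadline implies requests by 2025-03-28
# and RCP delivery by 2025-04-25.
print(rcp_deadlines(date(2025, 5, 30), "llama31_405b"))
```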