TransformerEngine : Add test with FSDP (and updates to ddp_wrapper in test_ddp.py)
#143
Annotations
2 errors and 1 warning
auto-cc
Resource not accessible by integration
{
name: 'HttpError',
id: '8645562921',
status: 403,
response: {
url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/142',
status: 403,
headers: {
'access-control-allow-origin': '*',
'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset',
connection: 'close',
'content-encoding': 'gzip',
'content-security-policy': "default-src 'none'",
'content-type': 'application/json; charset=utf-8',
date: 'Thu, 11 Apr 2024 10:55:01 GMT',
'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
server: 'GitHub.com',
'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
'transfer-encoding': 'chunked',
vary: 'Accept-Encoding, Accept, X-Requested-With',
'x-accepted-github-permissions': 'pull_requests=write',
'x-content-type-options': 'nosniff',
'x-frame-options': 'deny',
'x-github-api-version-selected': '2022-11-28',
'x-github-media-type': 'github.v3; format=json',
'x-github-request-id': 'A86A:97451:28C0F5:432E78:6617C185',
'x-ratelimit-limit': '15000',
'x-ratelimit-remaining': '14994',
'x-ratelimit-reset': '1712836500',
'x-ratelimit-resource': 'core',
'x-ratelimit-used': '6',
'x-xss-protection': '0'
},
data: {
message: 'Resource not accessible by integration',
documentation_url: 'https://docs.github.com/rest/pulls/pulls#update-a-pull-request'
}
},
request: {
method: 'PATCH',
url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/142',
headers: {
accept: 'application/vnd.github.v3+json',
'user-agent': 'probot/12.2.5 octokit-core.js/3.6.0 Node.js/16.20.2 (linux; x64)',
authorization: 'token [REDACTED]',
'content-type': 'application/json; charset=utf-8'
},
body: '{"body":"This PR adds test for using TE executor in FSDP and verifies it against Eager + TE. Also we update the `ddp_wrapper` to allow wrapping with different pytest-fixture besides `bucket_size_in_mb` (which errored when I tried to add a different pytest-fixture).\\r\\n\\r\\nPR https://github.com/Lightning-AI/lightning-thunder/pull/80 description details of how TE automatically takes care of syncing FP8 meta-data in distributed setting. \\r\\n\\r\\nAlso, I have verified it on a larger model using the available benchmarking script\\r\\ncmd for benchmark: \\r\\n```\\r\\ntorchrun --nproc-per-node=2 thunder/benchmarks/benchmark_litgpt.py --compile thunder+nvfuser+transformerengine+cudnn --n_layers=10 --distributed_mode=fsdp\\r\\n```\\r\\n\\r\\nNumbers are on RTX 6000\\r\\n\\r\\nWithout TE\\r\\n```\\r\\niter 41: loss 4.6562, iter time: 3180.77ms, t: 4096\\r\\niter 42: loss 4.6250, iter time: 3202.35ms, t: 4096\\r\\niter 43: loss 4.6562, iter time: 3172.88ms, t: 4096\\r\\niter 44: loss 4.6562, iter time: 3181.55ms, t: 4096\\r\\nModel name: Llama-2-7b-hf\\r\\nSeq Length: 4096\\r\\nMicro BS: 1\\r\\nGlobal BS: 2\\r\\nNumber of Layers: 10\\r\\nNumber of parameters: 1.14B\\r\\nDistributed Mode: fsdp\\r\\nSharding Mode: zero2\\r\\nBucketing: none\\r\\nCompiler: thunder+nvfuser+cudnn\\r\\nAverage iter time: 3187.63 ms\\r\\nMemory used: 30.56 GB\\r\\nTokens/s: 2570.17\\r\\nTokens/s/GPU: 1285.09\\r\\nTFLOP/s: 38.40\\r\\n```\\r\\n\\r\\nWith TE\\r\\n```\\r\\niter 42: loss 4.6562, iter time: 3025.66ms, t: 4096\\r\\niter 43: loss 4.6562, iter time: 3030.40ms, t: 4096\\r\\niter 44: loss 4.6562, iter time: 3018.83ms, t: 4096\\r\\nModel name: Llama-2-7b-hf\\r\\nSeq Length: 4096\\r\\nMicro BS: 1\\r\\nGlobal BS: 2\\r\\nNumber of Layers: 10\\r\\nNumber of parameters: 1.14B\\r\\nDistributed Mode: fsdp\\r\\nSharding Mode: zero2\\r\\nBucketing: none\\r\\nCompiler: thunder+nvfuser+transformerenginevfu
|
auto-cc
HttpError: Resource not accessible by integration
at /home/runner/work/_actions/Lightning-AI/probot/v5/node_modules/@octokit/core/node_modules/@octokit/request/dist-node/index.js:86:21
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at async Job.doExecute (/home/runner/work/_actions/Lightning-AI/probot/v5/node_modules/bottleneck/light.js:405:18)
{
name: 'AggregateError',
event: {
id: '8645562921',
name: 'pull_request',
payload: {
action: 'labeled',
label: {
color: '3855E2',
default: false,
description: '',
id: 6781712626,
name: 'distributed',
node_id: 'LA_kwDOLiCyD88AAAABlDi48g',
url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/labels/distributed'
},
number: 142,
organization: {
avatar_url: 'https://avatars.githubusercontent.com/u/58386951?v=4',
description: 'Turn ideas into AI, Lightning fast. Creators of PyTorch Lightning, Lightning AI Studio, TorchMetrics, Fabric, Lit-GPT, Lit-LLaMA',
events_url: 'https://api.github.com/orgs/Lightning-AI/events',
hooks_url: 'https://api.github.com/orgs/Lightning-AI/hooks',
id: 58386951,
issues_url: 'https://api.github.com/orgs/Lightning-AI/issues',
login: 'Lightning-AI',
members_url: 'https://api.github.com/orgs/Lightning-AI/members{/member}',
node_id: 'MDEyOk9yZ2FuaXphdGlvbjU4Mzg2OTUx',
public_members_url: 'https://api.github.com/orgs/Lightning-AI/public_members{/member}',
repos_url: 'https://api.github.com/orgs/Lightning-AI/repos',
url: 'https://api.github.com/orgs/Lightning-AI'
},
pull_request: {
_links: {
comments: {
href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/issues/142/comments'
},
commits: {
href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/142/commits'
},
html: {
href: 'https://github.com/Lightning-AI/lightning-thunder/pull/142'
},
issue: {
href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/issues/142'
},
review_comment: {
href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/comments{/number}'
},
review_comments: {
href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/142/comments'
},
self: {
href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/142'
},
statuses: {
href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/statuses/5553215b8b383ca46f3e2f380e4a64efb8257258'
}
},
active_lock_reason: null,
additions: 170,
assignee: null,
assignees: [],
author_association: 'COLLABORATOR',
auto_merge: null,
base: {
label: 'Lightning-AI:main',
ref: 'main',
repo: {
allow_auto_merge: true,
allow_forking: true,
allow_merge_commit: false,
allow_rebase_merge: false,
allow_squash_merge: true,
allow_update_branch: true,
archive_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/{archive_format}{/ref}',
archived: false,
assignees_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/assignees{/user}',
blobs_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/git/blobs{/sha}',
branches_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/branches{/branch}',
clone_url: 'https://github.com/Lightning-AI/lightning-thunder.git',
collaborators_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/collaborators{/collaborator}',
comments_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/comments{/number}',
|
auto-cc
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: Lightning-AI/probot@v5. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
|
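Note on the 403 above: the response carries an `x-accepted-github-permissions` header with the value `pull_requests=write`. That suggests the token used by `auto-cc` lacked write access to pull requests when it sent the `PATCH`. The sketch below parses that header into the permission sets it names. The helper name `parse_accepted_permissions` is hypothetical, and the separator convention (commas within one set, semicolons between alternative sets) is an assumption about the header format, not something shown in this log.

```python
def parse_accepted_permissions(header_value: str) -> list[dict[str, str]]:
    """Split an x-accepted-github-permissions value into alternative permission sets.

    Assumed format: "name=level, name=level; name=level", where ';' separates
    alternative sets and ',' separates permissions required together.
    """
    alternatives = []
    for group in header_value.split(";"):
        perms = {}
        for item in group.split(","):
            item = item.strip()
            if not item:
                continue
            name, _, level = item.partition("=")
            perms[name.strip()] = level.strip()
        if perms:
            alternatives.append(perms)
    return alternatives


# Header value copied from the 403 response in the first annotation.
headers = {"x-accepted-github-permissions": "pull_requests=write"}
print(parse_accepted_permissions(headers["x-accepted-github-permissions"]))
# [{'pull_requests': 'write'}]
```

If the job runs with the default `GITHUB_TOKEN`, one likely remedy is to grant `pull-requests: write` under `permissions:` in the workflow file. Whether that applies here depends on how the workflow is triggered, which this log does not show.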