LargeModel

Infra

Imbue: from bare metal to a 70B model

Transformer

Parameter math: Llama 2 13B example
Annotated Transformer
Illustration: https://jalammar.github.io/illustrated-transformer/
Model Flops calculation: https://zhuanlan.zhihu.com/p/624740065
Model/token/communication estimation: https://www.53ai.com/news/qianyanjishu/303.html
collective communication: https://zhuanlan.zhihu.com/p/435438871
GPU capability/bottleneck: https://mp.weixin.qq.com/s/S7lxmi_Q_Uq23mtMus4KSQ
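
The parameter math for Llama 2 13B can be sketched as below. This is a minimal sketch, not code from this repo: the config values (vocab 32000, hidden 5120, FFN 13824, 40 layers) are assumed from the published 13B configuration, and `LlamaConfig` / `count_params` are hypothetical names. It assumes bias-free linear layers, untied input/output embeddings, and standard multi-head attention (no GQA).

```python
from dataclasses import dataclass


@dataclass
class LlamaConfig:
    # Assumed values for Llama 2 13B; adjust for other model sizes.
    vocab_size: int = 32000
    hidden_size: int = 5120
    intermediate_size: int = 13824
    num_layers: int = 40


def count_params(cfg: LlamaConfig) -> dict:
    """Count parameters of a Llama-style decoder (no biases, untied head)."""
    h = cfg.hidden_size
    embed = cfg.vocab_size * h            # token embedding table
    lm_head = cfg.vocab_size * h          # output projection (untied)
    attn = 4 * h * h                      # Wq, Wk, Wv, Wo
    mlp = 3 * h * cfg.intermediate_size   # gate, up, down projections
    norms = 2 * h                         # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    final_norm = h
    total = embed + lm_head + cfg.num_layers * per_layer + final_norm
    return {"embed": embed, "lm_head": lm_head,
            "per_layer": per_layer, "total": total}


if __name__ == "__main__":
    for name, value in count_params(LlamaConfig()).items():
        print(f"{name:>10}: {value:,}")
```

Under these assumptions the total comes to 13,015,864,320, i.e. the "13B" in the model name. The MLP accounts for roughly two thirds of each layer.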

Training from scratch

https://www.youtube.com/watch?v=ZLbVdvOoTKM

Training framework performance

Megatron analysis: https://www.high-flyer.cn/blog/model_parallel-1/inidex/

Fine-tune with single node

Inference explanation

Performance projection

https://mp.weixin.qq.com/s/ftF3YRXPZ5mjVfqzGCDNYQ

Model reference

https://github.com/HabanaAI/Model-References

Tracing

https://github.com/pytorch/kineto

Profiling

Performance diagnosis toolkit: https://arxiv.org/pdf/2205.02473.pdf
nsys profile CLI (2023.3) guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-options
https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/index.html#profiling_pytorch_pyprof
https://github.com/NVIDIA/PyProf
https://jingchaozhang.github.io/DLProf-Demo/

Trace analysis: https://github.com/facebookresearch/HolisticTraceAnalysis/tree/main/examples

https://github.com/microsoft/DeepSpeed/tree/master/deepspeed/profiling/flops_profiler

Rewrite

Model Visualization

Training time, Flops estimation
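
A rough first-order estimate uses the common approximation of about 6 FLOPs per parameter per training token (forward plus backward). The sketch below is an assumption-laden back-of-envelope, not this repo's method: `training_flops` and `training_days` are hypothetical helper names, and the GPU count, peak throughput, and MFU (model FLOPs utilization) in the example are illustrative inputs, not measurements.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs: ~6 * parameters * tokens.

    Ignores the attention-score term, which grows with sequence length.
    """
    return 6.0 * n_params * n_tokens


def training_days(n_params: float, n_tokens: float, n_gpus: int,
                  peak_flops_per_gpu: float, mfu: float) -> float:
    """Wall-clock days, assuming a constant MFU across the whole run."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    sustained = n_gpus * peak_flops_per_gpu * mfu  # FLOP/s actually achieved
    return training_flops(n_params, n_tokens) / sustained / 86400.0


if __name__ == "__main__":
    # Illustrative inputs: 13B params, 2T tokens, 1024 GPUs at
    # 312 TFLOP/s peak (A100 BF16 dense), 40% MFU assumed.
    days = training_days(13e9, 2e12, 1024, 312e12, 0.40)
    print(f"estimated training time: {days:.1f} days")
```

The estimate is most sensitive to MFU, which depends on the parallelism strategy and interconnect, so it is worth measuring on a short run before trusting the projection.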

GPU benchmarks

git clone https://github.com/te42kyfo/gpu-benches.git
cd gpu-benches/gpu-stream/
/usr/local/cuda/bin/nvcc -o stream main.cu
./stream
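
Assuming the stream run above reports achieved device memory bandwidth, that number can be plugged into a simple roofline model to see whether a kernel is memory-bound or compute-bound. This is a minimal sketch: `ridge_point` and `attainable_flops` are hypothetical helper names, and the bandwidth and peak values in the example are placeholders to be replaced with your own measurements and the vendor's peak figure.

```python
def ridge_point(peak_flops: float, bandwidth_bytes: float) -> float:
    """Arithmetic intensity (FLOP/byte) where compute and memory limits meet."""
    return peak_flops / bandwidth_bytes


def attainable_flops(intensity: float, peak_flops: float,
                     bandwidth_bytes: float) -> float:
    """Roofline bound: min(peak compute, intensity * memory bandwidth)."""
    return min(peak_flops, intensity * bandwidth_bytes)


if __name__ == "__main__":
    measured_bw = 1.5e12   # bytes/s, placeholder for the stream result
    peak = 312e12          # FLOP/s, placeholder for the vendor peak
    print(f"ridge point: {ridge_point(peak, measured_bw):.0f} FLOP/byte")
    for ai in (1, 16, 256, 1024):
        bound = attainable_flops(ai, peak, measured_bw)
        kind = "memory-bound" if bound < peak else "compute-bound"
        print(f"AI={ai:>5} FLOP/byte -> {bound / 1e12:7.1f} TFLOP/s ({kind})")
```

Kernels whose arithmetic intensity falls below the ridge point are limited by memory bandwidth, so the stream result, not the peak FLOP/s, sets their ceiling.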

GPU fundamentals

How GEMM works: https://siboehm.com/articles/22/CUDA-MMM

Compilation

TVM to custom ML hardware: https://www.youtube.com/watch?v=FBdW1gJGx0M

Chip architecture

Groq TSP video intro
