llama-from-scratch

Here is 1 public repository matching this topic...

KrishChordiya / nano-llama

A 110M-parameter Llama-style transformer trained from scratch on the TinyStories dataset, optimized for high-throughput training on consumer GPUs with 4 GB of VRAM. The project features a custom asynchronous CUDA-stream prefetcher and KV-cache inference, achieving 10k+ tokens per second (TPS) on an RTX 3050.

nlp deep-learning transformers pytorch llama efficient-training tinystories cuda-optimization llama-from-scratch

Updated Apr 8, 2026
Python
