Assembling the best SotA AI techniques into a unified model
- 13B-parameter BitNet + Infini-Attention + DenseFormer + MoD +
  In-Context Pretraining + 2-stage pretraining (sketched below)
- upcycle with c-BTX to an 8-expert sparse MoE + MoA
https://twitter.com/winglian/status/1778675583817326842
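A config-style sketch of the two phases above: a dense 13B base combining the listed architectural techniques, then an upcycling pass into an 8-expert sparse MoE. Field names and default values are illustrative assumptions, not settings from the linked thread.

```python
from dataclasses import dataclass

@dataclass
class DenseBaseConfig:
    """Phase 1: dense 13B base combining the architectural techniques listed above."""
    n_params: str = "13B"
    bitnet_linear: bool = True           # BitNet: 1-bit weight linear layers
    infini_attention: bool = True        # Infini-Attention: compressive long-context memory
    denseformer_dwa: bool = True         # DenseFormer: depth-weighted averaging across blocks
    mixture_of_depths: bool = True       # MoD: per-block top-k token routing
    in_context_pretraining: bool = True  # pack related documents into the same context
    two_stage_pretraining: bool = True   # MiniCPM-style stable + decay data/LR phases

@dataclass
class UpcycleConfig:
    """Phase 2: branch, train domain experts, and mix back into a sparse MoE."""
    n_experts: int = 8                   # c-BTX: experts from unsupervised domain discovery
    moe_top_k: int = 2                   # assumed router top-k (not specified above)
    mixture_of_attention: bool = True    # MoA alongside the FFN MoE
```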
BitNet: Scaling 1-bit Transformers for Large Language Models
- arXiv: https://arxiv.org/abs/2310.11453
- reference implementations:
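A minimal BitLinear sketch along the lines of the paper's recipe: weights binarized to ±1 around their mean with a mean-absolute-value scale, 8-bit absmax activation quantization, and a straight-through estimator. The paper's SubLN normalization before quantization is omitted; this is illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Binarize weights to ±1 around their mean; beta = mean|W| restores scale
        alpha = w.mean()
        w_q = torch.sign(w - alpha) * w.abs().mean()
        # 8-bit absmax quantization of activations
        Qb = 127.0
        gamma = x.abs().max().clamp(min=1e-5)
        x_q = (x * Qb / gamma).round().clamp(-Qb, Qb) * (gamma / Qb)
        # Straight-through estimator: quantized forward pass, full-precision gradients
        w_q = w + (w_q - w).detach()
        x_q = x + (x_q - x).detach()
        return F.linear(x_q, w_q, self.bias)
```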
DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
- arXiv: https://arxiv.org/abs/2402.02622
- reference implementations:
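A minimal sketch of DenseFormer's depth-weighted averaging (DWA): after each block, a learned weighted sum over the embedded input and all block outputs so far becomes the input to the next block. The shapes and identity initialization are assumptions about a reasonable setup, not the authors' code.

```python
import torch
import torch.nn as nn

class DWATransformer(nn.Module):
    """Stack of blocks with DenseFormer-style depth-weighted averaging (DWA)."""
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        # alphas[i] weights the embedding stream plus the outputs of blocks 0..i;
        # identity init (weight 1 on the newest output) recovers a vanilla stack.
        self.alphas = nn.ParameterList(
            [nn.Parameter(torch.zeros(i + 2)) for i in range(len(blocks))]
        )
        for a in self.alphas:
            a.data[-1] = 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        streams = [x]                                  # index 0: embedded input
        h = x
        for block, alpha in zip(self.blocks, self.alphas):
            streams.append(block(h))                   # raw output of this block
            stacked = torch.stack(streams, dim=0)      # (depth_so_far, B, T, D)
            h = (alpha.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # DWA feeds next block
        return h
```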
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
- arXiv: https://arxiv.org/abs/2404.02258
- reference implementations:
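A minimal Mixture-of-Depths sketch, assuming a fixed per-block capacity: a scalar router picks the top-k tokens to run through the wrapped block and gates their update, while the remaining tokens skip the block on the residual path. The capacity fraction and sigmoid gate are illustrative choices.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Wraps a transformer block with Mixture-of-Depths top-k token routing."""
    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block                   # any (B, T, D) -> (B, T, D) block
        self.router = nn.Linear(d_model, 1)  # scalar routing logit per token
        self.capacity = capacity             # fraction of tokens that get compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)                        # (B, T)
        idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values   # keep token order
        gate = torch.sigmoid(torch.gather(scores, 1, idx)).unsqueeze(-1)  # (B, k, 1)
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)
        selected = torch.gather(x, 1, gather_idx)                  # routed tokens
        processed = self.block(selected)
        # Router-gated update for routed tokens; the rest pass through untouched.
        out = x.clone()
        out.scatter_(1, gather_idx, selected + gate * (processed - selected))
        return out
```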
In-Context Pretraining: Language Modeling Beyond Document Boundaries
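A minimal sketch of the core idea, assuming precomputed document embeddings: greedily chain each document to its most similar unused neighbor so related documents get packed into the same context window. The paper does this with scalable retrieval; this greedy loop is only for illustration.

```python
import numpy as np

def order_documents(doc_embeddings: np.ndarray) -> list[int]:
    """Greedy nearest-neighbor ordering over unit-normalized document embeddings."""
    n = doc_embeddings.shape[0]
    emb = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    unused = set(range(n))
    order = [unused.pop()]            # start from an arbitrary document
    while unused:
        last = emb[order[-1]]
        candidates = list(unused)
        sims = emb[candidates] @ last  # cosine similarity to the last placed doc
        nxt = candidates[int(np.argmax(sims))]
        unused.remove(nxt)
        order.append(nxt)
    return order  # concatenate documents in this order, then chunk into contexts
```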
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
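MiniCPM's two-stage recipe pairs a warmup-stable-decay (WSD) learning-rate schedule with higher-quality data mixed in during the decay stage. A sketch of such a schedule, with the phase fractions and decay form as assumptions:

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           final_lr_ratio: float = 0.1) -> float:
    """Warmup-Stable-Decay learning rate; fractions here are illustrative."""
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:           # warmup: linear ramp to the peak LR
        return peak_lr * step / max(1, warmup_steps)
    if step < stable_end:             # stable stage: hold the peak LR
        return peak_lr
    # decay stage: anneal the LR while blending higher-quality / SFT-like data
    # into the pretraining mix (the "second stage")
    progress = (step - stable_end) / max(1, decay_steps)
    return peak_lr * (final_lr_ratio ** progress)
```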
Scaling Expert Language Models with Unsupervised Domain Discovery
- arXiv: https://arxiv.org/abs/2303.14177
- reference implementations:
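A c-BTM-style sketch of unsupervised domain discovery and sparse expert ensembling: k-means clusters over document representations define each expert's training split, and at inference a query weights its top-k nearest experts. The embedding source, temperature, and top-k here are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_domains(doc_embeddings: np.ndarray, n_experts: int = 8) -> KMeans:
    """Cluster the corpus; each cluster's documents train one expert LM."""
    return KMeans(n_clusters=n_experts, n_init=10).fit(doc_embeddings)

def expert_weights(query_embedding: np.ndarray, kmeans: KMeans,
                   top_k: int = 2, temperature: float = 0.1) -> np.ndarray:
    """Sparse ensemble weights over experts for one query context."""
    dists = np.linalg.norm(kmeans.cluster_centers_ - query_embedding, axis=1)
    scores = np.exp(-dists / temperature)
    # keep only the top-k experts, renormalize, zero out the rest
    keep = np.argsort(scores)[-top_k:]
    weights = np.zeros_like(scores)
    weights[keep] = scores[keep] / scores[keep].sum()
    return weights  # used to mix the experts' next-token distributions
```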
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- arXiv: https://arxiv.org/abs/2403.07816
- reference implementations:
- errata:
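A sketch of the BTX merge step, assuming a generic transformer state dict: each branched expert's feed-forward weights become one expert of an MoE layer (with a freshly initialized router learned during the subsequent finetuning), and all other weights are averaged. The state-dict key patterns are hypothetical.

```python
import torch

def btx_merge(expert_state_dicts: list[dict]) -> dict:
    """Fold N separately trained dense experts into one MoE state dict (sketch)."""
    merged = {}
    for key in expert_state_dicts[0]:
        tensors = [sd[key] for sd in expert_state_dicts]
        if ".mlp." in key or ".feed_forward." in key:
            # FFN weights: keep every expert's copy as a separate MoE expert
            for i, t in enumerate(tensors):
                new_key = key.replace(".mlp.", f".moe.experts.{i}.") \
                             .replace(".feed_forward.", f".moe.experts.{i}.")
                merged[new_key] = t.clone()
        else:
            # Attention, norms, embeddings: simple parameter average
            merged[key] = torch.stack(tensors).mean(dim=0)
    # Router weights do not exist in the branched experts; they are freshly
    # initialized in the MoE model and learned during the finetuning stage.
    return merged
```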