Releases: flagos-ai/FlagScale
v1.0.0-alpha.0
- Major updates: Refactored the codebase by moving hardware-specific (multi-chip) support into plugin repositories such as TransformerEngine-FL and vllm-plugin-FL. These plugins build on top of FlagOS, a unified open-source AI system software stack. We also re-initialized the Git commit history to reduce repository size.
- Compatibility (legacy users): If you are using or upgrading from a version earlier than v1.0.0-alpha.0, please use the main-legacy branch. It will continue to receive critical bug fixes and minor updates for a period of time.
v0.9.0
- Training & Finetuning: Added LoRA for efficient finetuning, improved the autotuner for cross-chip heterogeneous training, and enabled distributed RWKV training.
- Inference & Serving: Introduced DiffusionEngine for FLUX.1-dev, Qwen-Image, and Wan2.1-T2V, supporting multi-model automatic orchestration and dynamic scaling.
- Embodied AI: Full lifecycle support for RoboBrain, Robotics, and PI0, plus semantic retrieval of MCP-based skills for RoboOS.
- Elasticity & Fault Tolerance: Automatically detects task status (errors, hangs, etc.) and records it periodically.
- Hardware & System: Broader chip support, an upgraded patch mechanism with file-level diffs, and enhanced CI/CD across different chips.
v0.8.0
- Introduced a new flexible and robust multi-backend mechanism and updated vendor adaptation methods.
- Enabled heterogeneous prefill-decoding disaggregation across vendor chips within a single instance via FlagCX (beta).
- Upgraded DeepSeek-v3 pre-training with the new Megatron-LM and added heterogeneous pre-training across different chips for MoE models like DeepSeek-v3.
v0.6.5
- Added support for DeepSeek-V3 distributed pre-training (beta) and DeepSeek-V3/R1 serving across multiple chips.
- Introduced an auto-tuning feature for serving and a new CLI feature for one-click deployment.
- Enhanced the CI/CD system to support more chips and integrated the workflow of FlagRelease.
v0.6.0
- Introduced general multi-dimensional heterogeneous parallelism and CPU-based communication between different chips.
- Added comprehensive support for data processing and faster distributed training of LLaVA-OneVision, achieving SOTA results on the Infinity-MM dataset.
- Open-sourced the optimized CFG implementation and accelerated the generation and understanding tasks for Emu3.
- Implemented the auto-tuning feature to simplify large-scale distributed training, making it more accessible for users with less expertise.
- Enhanced the CI/CD system to enable more efficient unit testing across different backends and to perform loss checks for the various parallel strategies.
v0.3
v0.2
- Provided the training scheme actually used for Aquila2-70B-Expr, including the parallel strategies, optimizations, and hyperparameter settings.
- Supported heterogeneous training on chips of different generations with the same or compatible architectures, including NVIDIA GPUs and Iluvatar CoreX chips.
- Supported training on Chinese domestic hardware, including Iluvatar CoreX and Baidu KUNLUN chips.