What's Changed
- add issue template by @liwenchangbdbz in #1
- add cutlass submodule and patches by @liwenchangbdbz in #2
- All gather and reduce scatter on SM80 by @zheng-ningxin in #3
- Reorganize and deduplicate files by @wenlei-bao in #4
- Add arXiv paper link by @wenlei-bao in #5
- Update BibTex by @wenlei-bao in #6
- Support IPC && SM90 version of AG-GEMM, GEMM-RS by @zheng-ningxin in #9
- fix the _allgather_base backend issue (issue11) by @zheng-ningxin in #12
- using c10::intrusive_ptr<c10d::ProcessGroup> as argument from python by @houqi in #13
- Add more device types for the time estimation. by @zheng-ningxin in #15
- Update README.md by @zheng-ningxin in #16
- zero out all the allocated shm buffer by @zheng-ningxin in #18
- feat: fix tuning for the all-gather gemm && move the reset-signal() to the forward critical path by @zheng-ningxin in #19
- Tune the AG performance for the llama-8b by @zheng-ningxin in #21
- Remove pynvshmem import in gemm_rs_80.py by @tlrmchlsmth in #22
- Support performance tuning for gemm-rs kernel on sm80 by @zheng-ningxin in #23
- add torch version to the whl name by @zheng-ningxin in #24
Full Changelog: https://github.com/bytedance/flux/commits/v1.0.0