Description
I've been giving some more thought to improving constensor's performance for LLMs. We previously discussed optimizing the CPU backend, but to reach the goal you mentioned of "run an LLM at very competitive speeds on any device" (which sounds a lot like what TVM aims for), it seems to me we may need a more sophisticated compilation architecture, perhaps something akin to TVM's multi-level IR: a high-level graph IR plus a low-level operator IR. That would enable more powerful graph optimizations and operator fusion, and make it easier to extend to more backends in the future.
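To make the idea concrete, here is a rough sketch (in Rust, since constensor is written in Rust) of what such a two-level design could look like: a graph-level IR where a fusion pass merges an Add immediately followed by a Gelu into one node, which is then lowered to kernel-level ops. All of the names here (`GraphOp`, `KernelOp`, `fuse`, `lower`) are made up for illustration and don't correspond to anything in constensor today:

```rust
// Purely illustrative sketch -- none of these types exist in constensor.

// High-level graph IR: whole-tensor ops referencing abstract buffer ids.
#[derive(Debug, Clone)]
enum GraphOp {
    MatMul { lhs: u32, rhs: u32, out: u32 },
    Add { lhs: u32, rhs: u32, out: u32 },
    Gelu { input: u32, out: u32 },
    // Produced by the graph-level fusion pass below.
    FusedAddGelu { lhs: u32, rhs: u32, out: u32 },
}

// Low-level operator IR: what a backend would actually codegen or dispatch.
#[derive(Debug)]
enum KernelOp {
    Gemm { lhs: u32, rhs: u32, out: u32 },
    // One elementwise loop over the whole fused region instead of two passes.
    Elementwise { inputs: Vec<u32>, out: u32, body: String },
}

// Graph-level pass: an Add immediately consumed by a Gelu becomes one node,
// so the intermediate buffer is never materialized.
fn fuse(graph: &[GraphOp]) -> Vec<GraphOp> {
    let mut fused = Vec::new();
    let mut i = 0;
    while i < graph.len() {
        if let (
            GraphOp::Add { lhs, rhs, out: add_out },
            Some(GraphOp::Gelu { input, out: gelu_out }),
        ) = (&graph[i], graph.get(i + 1))
        {
            if input == add_out {
                fused.push(GraphOp::FusedAddGelu { lhs: *lhs, rhs: *rhs, out: *gelu_out });
                i += 2;
                continue;
            }
        }
        fused.push(graph[i].clone());
        i += 1;
    }
    fused
}

// Lowering: each graph node maps onto a low-level kernel.
fn lower(graph: &[GraphOp]) -> Vec<KernelOp> {
    graph
        .iter()
        .map(|op| match op {
            GraphOp::MatMul { lhs, rhs, out } => KernelOp::Gemm { lhs: *lhs, rhs: *rhs, out: *out },
            GraphOp::Add { lhs, rhs, out } => KernelOp::Elementwise {
                inputs: vec![*lhs, *rhs],
                out: *out,
                body: "a + b".into(),
            },
            GraphOp::Gelu { input, out } => KernelOp::Elementwise {
                inputs: vec![*input],
                out: *out,
                body: "gelu(a)".into(),
            },
            GraphOp::FusedAddGelu { lhs, rhs, out } => KernelOp::Elementwise {
                inputs: vec![*lhs, *rhs],
                out: *out,
                body: "gelu(a + b)".into(),
            },
        })
        .collect()
}

fn main() {
    // x @ w, then + bias, then gelu, over toy buffer ids.
    let graph = vec![
        GraphOp::MatMul { lhs: 0, rhs: 1, out: 2 },
        GraphOp::Add { lhs: 2, rhs: 3, out: 4 },
        GraphOp::Gelu { input: 4, out: 5 },
    ];
    // Prints a Gemm followed by a single fused Elementwise kernel.
    println!("{:#?}", lower(&fuse(&graph)));
}
```

The point of the extra IR level is that decisions like the Add+Gelu fusion above are made on the graph before any kernel is chosen or code is generated, which is roughly the separation TVM's graph-level and operator-level IRs give you.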
However, I also notice that constensor's current design seems to lean more towards a runtime that executes directly, with graph optimizations triggered implicitly, without the upfront complexity of a multi-level IR system. Introducing such an architecture would be a significant undertaking.
I'd be really interested to hear your thoughts on the long-term positioning of constensor. Do you envision it evolving into a general-purpose compiler framework like TVM (perhaps with differentiators like its Rust implementation or a focus on JIT capabilities), or is the focus more on it being a lightweight, intelligent runtime optimized for specific scenarios (like efficient LLM inference)?