Maybe I should learn more about ML to have a better instinct for optimization methods in general, so I can actually build AI optimizers like these.
The system started without paged attention, and recreated its own paged attention implementation automatically once it realized it was a bottleneck.
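For anyone unfamiliar with the idea, here's a toy sketch of what paged attention means (hypothetical code, not the system's actual implementation): the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand instead of reserving max-sequence-length per request.

```python
BLOCK_SIZE = 4  # tokens per physical block (toy value)

class PagedKVCache:
    """Toy block allocator: maps each sequence's logical KV positions to
    physical blocks, grabbing a block only when a sequence grows into it."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of free physical block ids
        self.block_tables = {}               # seq_id -> [physical block ids]
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:              # current block full (or none yet)
            table.append(self.free.pop())    # allocate one more physical block
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id, pos):
        """Translate a logical position into (physical block id, offset)."""
        table = self.block_tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

The win is that a sequence of 6 tokens holds 2 blocks instead of a full max-length slab, and freed blocks are immediately reusable by other sequences, which is exactly what relieves the memory bottleneck.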
Pretty cool!
We have considered open-sourcing some of our optimized inference libraries in the future, but have not yet come to a decision on this.
Also, if you want a rough intuition for why this is possible: this entire inference stack was built for exactly one model, so we can really tune the whole framework accordingly.
What's the POST request latency for this part? What's the TTFT?
We believe our improvements would hold on BF16, but let me check.
It is also a bit weird that they are not incorporating speculative decoding, that seems like a critical performance optimization, especially for decode heavy workloads.
You don’t even get that with GPUs in general, or really floating point in general.
The Art of Computer Programming, Volume 2: Seminumerical Algorithms, section 4.2.2 explains where floating-point addition loses the associativity property.
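The classic demonstration, in IEEE 754 doubles: rounding happens after every operation, so regrouping a sum changes the result.

```python
# Floating-point addition is not associative (cf. Knuth, TAOCP Vol. 2, §4.2.2).
# At magnitude 1e16 the spacing between adjacent doubles is 2.0, so adding 1.0
# to -1e16 rounds straight back to -1e16 and the +1 is lost in one grouping.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # a + b cancels exactly, then + 1.0 -> 1.0
right = a + (b + c)  # b + c rounds to -1e16, then cancels -> 0.0
print(left, right)   # prints "1.0 0.0"
```

This is why reduction order (e.g. across GPU threads) changes the bits of the answer even when every individual operation is correctly rounded.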
Apartness relations are another possible lens.
Wouldn’t speculative decoding decrease overall throughput, but optimise (perceived) responsiveness?
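Roughly, yes: the trade is fewer target-model steps per emitted token (better latency) against wasted draft work when proposals are rejected (worse aggregate throughput at high batch sizes). A toy greedy version, with hypothetical stand-in `target_next`/`draft_next` functions instead of real models:

```python
def speculative_step(target_next, draft_next, prefix, k=4):
    """One round of greedy speculative decoding (toy sketch).

    target_next / draft_next: fn(token_list) -> next token, greedy stand-ins
    for the big target model and the cheap draft model. Returns the tokens
    the target model commits to this round."""
    # 1) Draft model proposes k tokens autoregressively (cheap).
    seq = list(prefix)
    draft = []
    for _ in range(k):
        t = draft_next(seq)
        draft.append(t)
        seq.append(t)

    # 2) Target model verifies the proposals; in a real system this is ONE
    #    batched forward pass over all k positions, not k separate calls.
    accepted = []
    seq = list(prefix)
    for t in draft:
        if target_next(seq) == t:   # target agrees with the draft
            accepted.append(t)
            seq.append(t)
        else:
            break                   # first disagreement invalidates the rest
    accepted.append(target_next(seq))  # target always contributes one token
    return accepted
```

When the draft agrees, one verification pass yields up to k+1 tokens, which is the latency win; when it disagrees, you paid for k draft tokens plus verification and keep only one, which is the throughput cost the comment is asking about.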