Cutting inference cold starts by 40x with LP, FUSE, C/R, and CUDA-checkpoint
32 points by charles_irl 3 hours ago | 6 comments

kgeist 41 minutes ago
So, does this snapshotting optimization support arbitrary containers?

I'm currently planning to deploy using Amazon SageMaker, but a cold start takes a whopping ~9 minutes: 6 minutes for instance provisioning + 3 minutes for PyTorch initialization. My Docker image is ~14 GB, and the weights are a few GB. How long would it take to cold start this configuration on Modal?

SageMaker's performance makes it pretty much useless without many warm instances around (= tens of thousands of dollars per month), because users won't be happy if they randomly have to wait 9 minutes.

iLoveOncall 2 hours ago
What is "cutting by 40x" supposed to mean?
charles_irl 2 hours ago
Cutting latencies by 40x! Unfortunately couldn't fit the whole title in the character limit :<
aaronblohowiak 43 minutes ago
How can you cut latency by more than 1x? I am not intending to be snarky; it just doesn’t fit my brain how you can reduce a measured time by more than the original starting time.
aaronblohowiak 40 minutes ago
Put differently, 1/40 is not the same as 1x - 40x. I’d phrase it as "reduced by 97.5%" or "reduced by 0.975x".
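
For anyone who wants to check the arithmetic in this sub-thread, here is a minimal sketch. It assumes "cut by 40x" means the new latency is the old latency divided by 40; the function name is hypothetical and nothing here comes from the linked article.

```python
def reduction_from_speedup(speedup: float) -> float:
    """Fraction of the original latency removed by a given speedup ratio.

    Assumption: a "40x" claim means new_latency = old_latency / 40.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# A 40x speedup leaves 1/40 of the original latency,
# i.e. a 97.5% reduction, not a "4000%" one.
remaining = 1.0 / 40
removed = reduction_from_speedup(40)
print(f"remaining: {remaining:.3f}x, removed: {removed:.1%}")
```

Under that reading, the speedup ratio can grow without bound, while the percentage reduction can only approach 100%.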
bfeynman 40 minutes ago
Probably just AI slop using the wrong semantics; they mean speedup ratio.