Cutting inference cold starts by 40x with LP, FUSE, C/R, and CUDA-checkpoint
32 points by charles_irl 3 hours ago | 6 comments
iLoveOncall 2 hours ago
What is "cutting by 40x" supposed to mean?
replycharles_irl 2 hours ago
Cutting latencies by 40x! Unfortunately couldn't fit the whole title in the character limit :<
replyaaronblohowiak 43 minutes ago
How can you cut latency by more than 1x? I am no intending to be snarky, it just doesn’t fit my brain how you can reduce a measure time by more than the original starting time.
replyaaronblohowiak 40 minutes ago
Put differently, 1/40 is not the same as 1x - 40x. I’d phrase as Reduced by 97.5% or 0.975x
reply
I'm currently planning to deploy using Amazon SageMaker, but a cold start takes a whopping ~9 minutes: 6 minutes for instance provisioning + 3 minutes for PyTorch initialization. My Docker image is ~14 GB, and the weights are a few GB. How long would it take to cold start this configuration on Modal?
SageMaker's performance makes it pretty much useless without many warm instances around (= tens of thousands of dollars per month), because users won't be happy if they have to randomly wait 9 minutes