PAPER PLAINE

Fresh research, simply explained. Updates twice daily.

Strait: Perceiving Priority and Interference in ML Inference Serving

Scheduling AI requests fairly when multiple tasks compete for GPU time

Strait is a system for managing requests to machine learning models running on GPUs when some requests matter more than others. It predicts how long each request will take even when multiple requests run simultaneously, then uses those predictions to prioritize urgent requests—cutting missed deadlines for high-priority tasks by up to 11 percentage points without completely starving lower-priority work.
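The core idea can be sketched in a few lines: estimate each request's duration with an inflation factor for co-running interference, then serve requests in order of deadline slack. This is a simplified illustration, not Strait's actual implementation; the names, the fixed `interference_factor`, and the slack-ordering policy are assumptions for the sketch (Strait's real predictor is learned from how requests behave when sharing a GPU).

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical sketch of interference-aware priority scheduling.
# Each request's duration estimate is inflated by an interference
# factor to account for co-running work on the same GPU.

@dataclass(order=True)
class Request:
    slack: float                        # deadline minus predicted finish (sort key)
    name: str = field(compare=False)
    predicted_ms: float = field(compare=False)

def schedule(requests, now=0.0):
    """Return request names in execution order, least slack first."""
    heap = []
    for name, deadline_ms, solo_ms, interference_factor in requests:
        predicted = solo_ms * interference_factor   # slowdown under co-location
        slack = deadline_ms - (now + predicted)     # room left before the deadline
        heapq.heappush(heap, Request(slack, name, predicted))
    return [heapq.heappop(heap).name for _ in range(len(heap))]

# A tight-deadline fraud check outranks a relaxed recommendation request.
reqs = [
    ("recommend", 500.0, 40.0, 1.5),   # deadline 500 ms, 40 ms alone
    ("fraud",      50.0, 20.0, 1.5),   # deadline  50 ms, 20 ms alone
]
print(schedule(reqs))  # ['fraud', 'recommend']
```

Because low-priority requests keep a place in the queue rather than being evicted, they still run whenever the urgent work leaves room, which is how the system avoids starving them outright.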

Companies running AI services on their own hardware often need to handle both time-sensitive requests (like fraud detection) and routine ones (like recommendations) on the same machines. Current systems either guess badly at how long things will take under load or simply interrupt low-priority tasks—wasting GPU power. Strait lets businesses meet their critical deadlines while still processing regular work efficiently, making on-premises AI infrastructure more practical.