Does batching agent tool-calls into one GPU pass actually improve utilization and throughput — or is that just serving-stack folklore that doesn't hold at agent-workload scale?
A Claw Learns research experiment. It is the companion to a CPU-only proxy that ran through the automated experiment pipeline (which has no GPU access); this notebook follows up on real hardware.
Multi-agent systems make lots of small, independent calls into a shared model: embedding lookups for memory/retrieval, small classifiers for routing, rerankers. Production LLM serving addresses GPU under-utilization for exactly this pattern with continuous batching. This notebook checks whether the same benefit shows up at a realistic agent tool-call scale, not only under heavy production traffic:
- Embeds N synthetic agent-style queries (`all-MiniLM-L6-v2`) one at a time vs. batched
- Measures wall-clock throughput and samples GPU utilization (`torch.cuda.utilization()`) during each mode
- Prints one JSON result, in the same shape Claw Learns uses for its automated CPU experiments
- Includes an optional cost-per-million-calls cell — deliberately unpriced by default; plug in a current, sourced GPU on-demand rate rather than trust a hardcoded number
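
The steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the names `UtilizationSampler`, `run_mode`, `compare`, and `cost_per_million_calls`, and every JSON key, are hypothetical, and the real Claw Learns result shape may differ. The encoder and the utilization reader are passed in as plain callables so the harness runs without a GPU; on Colab you would presumably pass `SentenceTransformer("all-MiniLM-L6-v2").encode` and `torch.cuda.utilization`.

```python
import json
import threading
import time
from contextlib import nullcontext
from statistics import mean


class UtilizationSampler:
    """Polls a utilization reader on a background thread while a workload runs."""

    def __init__(self, read_fn, interval_s=0.05):
        self.read_fn = read_fn          # e.g. torch.cuda.utilization (assumption)
        self.interval_s = interval_s
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Sample first, then check the stop flag, so there is always >= 1 sample.
        while True:
            self.samples.append(float(self.read_fn()))
            if self._stop.wait(self.interval_s):
                break

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()


def run_mode(encode_fn, queries, batch_size, read_util=None):
    """Encode all queries in chunks of `batch_size`; batch_size=1 is the one-at-a-time mode."""
    sampler = UtilizationSampler(read_util) if read_util is not None else None
    with sampler if sampler is not None else nullcontext():
        start = time.perf_counter()
        for i in range(0, len(queries), batch_size):
            # Assumes encode_fn blocks until results are ready; otherwise
            # synchronize the device before stopping the timer.
            encode_fn(list(queries[i:i + batch_size]))
        elapsed = time.perf_counter() - start
    samples = sampler.samples if sampler is not None else []
    return {
        "batch_size": batch_size,
        "calls": len(queries),
        "elapsed_s": elapsed,
        # max() guards against a zero reading from the timer.
        "throughput_per_s": len(queries) / max(elapsed, 1e-9),
        "mean_gpu_util_pct": mean(samples) if samples else None,
    }


def cost_per_million_calls(throughput_per_s, gpu_usd_per_hour):
    """GPU-seconds needed for 1M calls, priced at an hourly rate; None means unpriced."""
    if gpu_usd_per_hour is None:
        return None
    return gpu_usd_per_hour * (1_000_000 / throughput_per_s) / 3600


def compare(encode_fn, queries, batch_size=32, read_util=None, gpu_usd_per_hour=None):
    """Run both modes and return one JSON-serializable result (hypothetical shape)."""
    sequential = run_mode(encode_fn, queries, 1, read_util)
    batched = run_mode(encode_fn, queries, batch_size, read_util)
    for mode in (sequential, batched):
        mode["usd_per_million_calls"] = cost_per_million_calls(
            mode["throughput_per_s"], gpu_usd_per_hour
        )
    return {
        "n_queries": len(queries),
        "sequential": sequential,
        "batched": batched,
        "speedup": batched["throughput_per_s"] / sequential["throughput_per_s"],
    }


if __name__ == "__main__":
    # Stand-in encoder with a fixed per-call overhead, NOT a real model.
    def fake_encode(batch):
        time.sleep(0.001)
        return [[0.0] for _ in batch]

    queries = [f"agent query {i}" for i in range(64)]
    print(json.dumps(compare(fake_encode, queries, batch_size=32), indent=2))
```

Leaving `gpu_usd_per_hour` as `None` keeps the cost field unpriced, matching the default described above. The demo's stand-in encoder only shows the harness mechanics; its timings say nothing about real GPU behaviour. One further caveat: a real encoder may apply its own internal batch size, in which case `batch_size` here should stay at or below that value or be passed through explicitly.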
A free Colab T4 GPU is enough. Click the badge above, or:
Runtime → Change runtime type → T4 GPU → Run all
Methodology complete, not yet run. Real numbers and the full writeup will land at adityabiswas.com once the notebook has been executed.