adibis-git/agentic-batching-gpu-utilization


agentic-batching-gpu-utilization

Does batching agent tool-calls into one GPU pass actually improve utilization and throughput — or is that just serving-stack folklore that doesn't hold at agent-workload scale?

A Claw Learns research experiment. It is the companion to a CPU-only proxy that was run through the automated experiment pipeline (which has no GPU access); this repository follows up on real GPU hardware.

Open In Colab

What this tests

Multi-agent systems make many small, independent calls into a shared model: embedding lookups for memory and retrieval, small classifiers for routing, rerankers. Production LLM serving addresses GPU under-utilization for this pattern with continuous batching. This notebook checks whether the benefit also holds at a realistic agent tool-call scale, not only at large production traffic volumes:

  • Embeds N synthetic agent-style queries (all-MiniLM-L6-v2) one at a time vs. batched
  • Measures wall-clock throughput and samples GPU utilization (torch.cuda.utilization()) during each mode
  • Prints one JSON result, in the same shape Claw Learns uses for its automated CPU experiments
  • Includes an optional cost-per-million-calls cell — deliberately unpriced by default; plug in a current, sourced GPU on-demand rate rather than trust a hardcoded number
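The comparison in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: `throughput`, `cost_per_million`, `fake_encode`, and the JSON key names are all hypothetical, and a dummy encoder with a fixed per-call overhead stands in for the real model so the sketch runs without a GPU. In the notebook, the encoder would presumably be the `encode` method of a `SentenceTransformer("all-MiniLM-L6-v2")` instance, which accepts a list of strings.

```python
import json
import time
from typing import Callable, List, Optional, Sequence

# Hypothetical type for the encoder: takes a list of strings, returns one
# embedding per string. The real model's encode method fits this shape.
Encoder = Callable[[List[str]], Sequence]


def throughput(encode: Encoder, queries: List[str], batch_size: int) -> float:
    """Return queries per second when encoding in chunks of batch_size.

    batch_size=1 is the one-at-a-time mode; larger values are the batched mode.
    On a GPU, the encoder must return host-side results (or synchronize)
    before the timer stops, otherwise only kernel launches are timed.
    """
    start = time.perf_counter()
    for i in range(0, len(queries), batch_size):
        encode(queries[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed


def cost_per_million(calls_per_sec: float,
                     usd_per_gpu_hour: Optional[float]) -> Optional[float]:
    """USD per million calls; stays None until a sourced rate is supplied."""
    if usd_per_gpu_hour is None:
        return None
    usd_per_second = usd_per_gpu_hour / 3600.0
    return usd_per_second / calls_per_sec * 1_000_000


def fake_encode(batch: List[str]) -> List[List[float]]:
    """Dummy encoder (an assumption for this sketch): fixed cost per call."""
    time.sleep(0.001)
    return [[0.0] for _ in batch]


queries = [f"agent query {i}" for i in range(64)]
sequential_qps = throughput(fake_encode, queries, batch_size=1)
batched_qps = throughput(fake_encode, queries, batch_size=32)

# Key names are illustrative, not the Claw Learns result schema.
result = {
    "n_queries": len(queries),
    "sequential_qps": sequential_qps,
    "batched_qps": batched_qps,
    "speedup": batched_qps / sequential_qps,
    "usd_per_million_batched": cost_per_million(batched_qps, None),
}
print(json.dumps(result))
```

Passing the hourly rate as an explicit argument, with `None` as the default case, mirrors the "unpriced by default" choice: no cost figure appears in the result until someone supplies a rate they can source.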

Run it

Free Colab T4 GPU is enough. Click the badge above, or:

Runtime → Change runtime type → T4 GPU → Run all

Status

The methodology is complete, but the notebook has not yet been run. Real numbers and the full writeup will be published at adityabiswas.com once the notebook has been executed.

