fix(vllm): stop forcing --enforce-eager on all LLM endpoints by olliestanley · Pull Request #851 · scaleapi/llm-engine

olliestanley · 2026-06-30T10:01:55Z

Summary

Removes the code that unconditionally adds --enforce-eager to the vLLM startup command for every endpoint deployed via the llmengine provider pathway.

In _create_vllm_bundle_command (llm_model_endpoint_use_cases.py) we had:

if vllm_args.gpu_memory_utilization is not None:
    vllm_args.enforce_eager = True

Because infer_addition_engine_args_from_model_name() always seeds a default gpu_memory_utilization (0.9, or 0.95 for ≥70B models), this condition was effectively always true. The net effect:

Every llmengine vLLM endpoint was launched with --enforce-eager.
A user explicitly passing enforce_eager: false had it silently clobbered back to true — there was no way to turn eager mode off.
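
The interaction above can be reproduced in isolation. This is a minimal sketch, not the actual llm-engine code: `VLLMArgs`, `infer_default_args`, and `apply_old_coupling` are hypothetical stand-ins, and the ≥70B check is reduced to a parameter-count argument for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VLLMArgs:
    """Hypothetical stand-in for the vLLM argument DTO."""
    gpu_memory_utilization: Optional[float] = None
    enforce_eager: Optional[bool] = None


def infer_default_args(params_billions: float, user_args: VLLMArgs) -> VLLMArgs:
    """Simplified stand-in for the default-seeding helper: mem-util is always populated."""
    default_util = 0.95 if params_billions >= 70 else 0.9
    util = user_args.gpu_memory_utilization
    return VLLMArgs(
        gpu_memory_utilization=default_util if util is None else util,
        enforce_eager=user_args.enforce_eager,
    )


def apply_old_coupling(args: VLLMArgs) -> VLLMArgs:
    """The removed behaviour: any non-None mem-util forces eager mode."""
    if args.gpu_memory_utilization is not None:
        args.enforce_eager = True
    return args


# A small model whose caller explicitly asked for CUDA graphs to stay on.
requested = VLLMArgs(enforce_eager=False)
resolved = apply_old_coupling(infer_default_args(7, requested))
print(resolved)  # VLLMArgs(gpu_memory_utilization=0.9, enforce_eager=True)
```

Because the default is seeded before the coupling runs, the caller's `enforce_eager=False` never survives, which matches the clobbering described above.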

--enforce-eager disables CUDA graphs and forces eager-mode PyTorch, which massively slows down decode. We were paying that cost globally.

Historical context

#524 "Hardcode llama 3 70b endpoint param" introduced the flag, hardcoded only for llama-3-70b, as a coupled pair: --gpu-memory-utilization 0.95 --enforce-eager. The intent was a targeted OOM workaround: at very high memory utilization there is little headroom for vLLM's CUDA-graph capture, so eager mode was forced alongside the high mem-util for that one large model.
#637 "Refactor client data types + add vllm arg passthrough" generalized the coupling to "whenever gpu_memory_utilization is set, force enforce_eager." Combined with the always-on default mem-util, a 70B-specific safety hack quietly became a global default that disabled CUDA graphs on all endpoints.

Change

Remove the auto-injection. Endpoints now use vLLM's default hybrid eager + CUDA graph mode (better decode throughput). Models that genuinely need eager mode (e.g. several vision / qwen entries in the internal model zoo) already set enforce_eager: true explicitly via additional_args and are unaffected.

Risk

Re-enabling CUDA graphs requires a little extra VRAM for graph capture. At the default gpu_memory_utilization of 0.9–0.95 this is the scenario the original workaround guarded against, so watch for startup OOM on memory-tight models after rollout. Mitigations if needed: lower the default mem-util, or have callers set enforce_eager: true for specific models.

Testing

Updated test_update_vllm_force_bundle_recreation_preserves_legacy_vllm_args to reflect that enforce_eager is no longer auto-added.
pytest tests/unit/domain/test_llm_use_cases.py — 74 passed.

Greptile Summary

This PR removes a historical accident where --enforce-eager was silently injected on every vLLM endpoint because any non-None gpu_memory_utilization (which is always populated by the default inference helper) triggered the flag. The fix narrows the auto-injection to only the >= 0.95 utilization band — where CUDA-graph capture is genuinely VRAM-tight — and skips it entirely when the caller has already supplied an explicit enforce_eager value.

The condition in _create_vllm_bundle_command now guards on enforce_eager is None and gpu_memory_utilization >= 0.95, preserving the original 70B OOM workaround while re-enabling CUDA graphs for all lower-utilization endpoints.
The existing bundle-recreation test is updated to reflect that gpu_memory_utilization=0.75 no longer produces --enforce-eager, with the explicit enforce_eager: True fixture value removed accordingly.
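
The narrowed guard described above could be expressed roughly as follows. This is a sketch under stated assumptions rather than the code in the PR: `resolve_enforce_eager` and `EAGER_UTIL_THRESHOLD` are hypothetical names, and only the 0.95 threshold and the "respect an explicit value" rule come from the summary.

```python
from typing import Optional

EAGER_UTIL_THRESHOLD = 0.95  # threshold quoted in the summary; the constant name is hypothetical


def resolve_enforce_eager(
    enforce_eager: Optional[bool],
    gpu_memory_utilization: Optional[float],
) -> Optional[bool]:
    """Return the enforce_eager value to launch with.

    None means "do not pass --enforce-eager", leaving vLLM's default in effect.
    """
    if enforce_eager is not None:
        # Caller set it explicitly (True or False): respect it.
        return enforce_eager
    if (
        gpu_memory_utilization is not None
        and gpu_memory_utilization >= EAGER_UTIL_THRESHOLD
    ):
        # Little headroom for CUDA-graph capture: default to eager mode.
        return True
    return None


print(resolve_enforce_eager(None, 0.9))    # None  (CUDA graphs stay on)
print(resolve_enforce_eager(None, 0.95))   # True  (auto-injected guard)
print(resolve_enforce_eager(False, 0.95))  # False (explicit value wins)
```

Returning `None` instead of `False` in the fall-through case keeps "unset" distinguishable from "explicitly disabled", which is what lets an explicit caller value take precedence.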

Confidence Score: 5/5

Safe to merge; the change is a targeted reduction of scope that restores vLLM default hybrid CUDA-graph mode for endpoints below 0.95 GPU utilization while keeping the existing OOM guard for large models.

The logic change is small and well-reasoned: one compound condition replaces an over-broad one, and the new condition preserves the original safety guard exactly at the threshold it was designed for. Models that explicitly opt into eager mode via additional_args are unaffected.

No files require special attention; both changed files are straightforward and internally consistent.

Important Files Changed

Filename	Overview
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py	Replaces the unconditional `gpu_memory_utilization is not None → enforce_eager = True` with a targeted check that only auto-enables eager mode when the caller didn't set an explicit value and utilization is ≥ 0.95, restoring CUDA-graph mode for all endpoints below that threshold.
model-engine/tests/unit/domain/test_llm_use_cases.py	Updates the existing bundle-recreation test to drop `enforce_eager: True` from the fixture input and flips the assertion from `in` to `not in` for `--enforce-eager`, correctly reflecting that a 0.75 utilization no longer triggers eager mode.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_create_vllm_bundle_command] --> B{is_worker?}
    B -- Yes --> Z[Skip vllm_args tuning]
    B -- No --> C[Set tensor_parallel_size]
    C --> D{enforce_eager is None?}
    D -- No\ncaller set it explicitly --> E[Respect caller's value]
    D -- Yes --> F{gpu_memory_utilization\nis not None?}
    F -- No --> G[Leave enforce_eager = None\nCUDA graphs ON]
    F -- Yes --> H{gpu_memory_utilization\n>= 0.95?}
    H -- No\n< 0.95 e.g. default 0.9 --> G
    H -- Yes\ne.g. 70B default 0.95 --> I[enforce_eager = True\nCUDA graphs OFF]
    G --> J[Build vllm_cmd]
    I --> J
    E --> J

Reviews (2): Last reviewed commit: "fix(vllm): only force --enforce-eager at..."

diazagasatya

Reviewed against the code — solid fix, nice catch on the global regression. A couple of things verified + one thing to confirm before I approve.

Verified

The removed block is the only place enforce_eager is force-set — the only other references are the DTO default in common/dtos/llms/vllm.py and explicit model configs (e.g. inference/vllm/examples/v2/llama-3.2-vision/config.json sets enforce_eager: true). So removal cleanly hands control back to explicit additional_args / vLLM's default. 👍
The "always true" analysis checks out: infer_addition_engine_args_from_model_name seeds gpu_memory_utilization = 0.9 (0.95 for ≥70B), so gpu_memory_utilization is not None was effectively always true → eager forced on every endpoint. Accurate.
Test update is correct (not-set → --enforce-eager absent → vLLM hybrid/CUDA-graph default).

One thing to confirm before approval (the OOM risk)
#524 originally forced eager specifically for llama-3-70b at 0.95 mem-util, because at that utilization there's little headroom for CUDA-graph capture → OOM. ≥70B models still default to 0.95 (infer_addition_engine_args_from_model_name), so after this change they re-enable CUDA graphs at 0.95 and could OOM on startup unless they explicitly set enforce_eager: true. You've confirmed the vision/qwen entries set it — can you confirm the ≥70B models (esp. llama-3-70b) also set enforce_eager: true explicitly (or lower their mem-util)? Otherwise this re-introduces exactly the OOM #524 guarded against for the large models. Since the change only takes effect on bundle recreation, rollout is staggered — worth watching ≥70B endpoints as they recycle.

Minor nit
Consider adding a test that an explicit enforce_eager: true still emits --enforce-eager — current test only covers the auto-add-removed (not-present) path, and that explicit path is now the supported way for models that need eager.
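
A test along these lines might look like the sketch below. It runs against a hypothetical, heavily simplified flag builder (`build_vllm_flags`), not the real `_create_vllm_bundle_command`, whose signature and fixtures are not shown here, so it illustrates only the two assertions worth covering.

```python
from typing import List, Optional


def build_vllm_flags(
    gpu_memory_utilization: Optional[float],
    enforce_eager: Optional[bool],
) -> List[str]:
    """Hypothetical, simplified flag builder used only for this illustration."""
    flags: List[str] = []
    if gpu_memory_utilization is not None:
        flags += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    if enforce_eager:
        flags.append("--enforce-eager")
    return flags


def test_explicit_enforce_eager_is_emitted() -> None:
    # Explicit opt-in must still produce the flag.
    flags = build_vllm_flags(gpu_memory_utilization=0.75, enforce_eager=True)
    assert "--enforce-eager" in flags


def test_unset_enforce_eager_is_absent() -> None:
    # Leaving it unset must not produce the flag at a low utilization.
    flags = build_vllm_flags(gpu_memory_utilization=0.75, enforce_eager=None)
    assert "--enforce-eager" not in flags


# Run directly so the sketch is checkable without pytest.
test_explicit_enforce_eager_is_emitted()
test_unset_enforce_eager_is_absent()
```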

Net: approve once the ≥70B enforce_eager question is confirmed — the perf win (global eager was disabling CUDA graphs / hurting decode everywhere) is well worth it.

The vLLM bundle command set enforce_eager=True whenever gpu_memory_utilization was set. Since infer_addition_engine_args_from_model_name always returns a default gpu_memory_utilization (0.9, or 0.95 for >=70B models), this forced eager mode on every llmengine vLLM endpoint and silently overrode any explicit enforce_eager=False from a user. Eager mode disables CUDA graphs, which massively slows down decode.

The override originated as a targeted workaround in #524 (hardcoded for llama-3-70b, paired with --gpu-memory-utilization 0.95 to avoid OOM during CUDA graph capture at high memory utilization) and was unintentionally generalized to all models in the #637 client-data-types refactor.

Restrict the auto-injection to the regime it was actually meant for: only default eager mode on when gpu_memory_utilization >= 0.95, where there is little headroom for CUDA graph capture. Below that (the default 0.9 for <70B models), CUDA graphs stay enabled for faster decode. An explicit enforce_eager from the caller (True or False) is always respected.

olliestanley · 2026-06-30T18:40:17Z

@diazagasatya I'm not sure the OOM issue would even be relevant with modern vLLM versions and more recent large models. ~Nobody should really be deploying models without graph capture in 2026. In the interest of backwards compatibility, I'll re-enable this guard by default when the utilization target is >= 0.95, but this time explicit arguments can override it.

diazagasatya reviewed Jun 30, 2026

olliestanley force-pushed the ollie/remove-auto-enforce-eager branch from 5ca087c to e4a588c Compare June 30, 2026 18:39

olliestanley force-pushed the ollie/remove-auto-enforce-eager branch from e4a588c to 38e8621 Compare June 30, 2026 18:40

diazagasatya approved these changes Jul 2, 2026

olliestanley wants to merge 1 commit into main from ollie/remove-auto-enforce-eager
