fix(vllm): stop forcing --enforce-eager on all LLM endpoints #851

Open
olliestanley wants to merge 1 commit into main from ollie/remove-auto-enforce-eager

@olliestanley olliestanley commented Jun 30, 2026

Member

Summary

Removes the code that unconditionally adds --enforce-eager to the vLLM startup command for every endpoint deployed via the llmengine provider pathway.

In _create_vllm_bundle_command (llm_model_endpoint_use_cases.py) we had:

if vllm_args.gpu_memory_utilization is not None:
    vllm_args.enforce_eager = True

Because infer_addition_engine_args_from_model_name() always seeds a default gpu_memory_utilization (0.9, or 0.95 for ≥70B models), this condition was effectively always true. The net effect:

  • Every llmengine vLLM endpoint was launched with --enforce-eager.
  • A user explicitly passing enforce_eager: false had it silently clobbered back to true — there was no way to turn eager mode off.

--enforce-eager disables CUDA graphs and forces eager-mode PyTorch, which massively slows down decode. We were paying that cost globally.
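
To make the failure mode concrete, here is a minimal sketch of the old behavior. `VLLMArgsSketch` and `seed_default_mem_util` are hypothetical stand-ins for the real DTO and for `infer_addition_engine_args_from_model_name`; only the two-line condition in `old_behavior` is taken from the code quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VLLMArgsSketch:
    """Hypothetical stand-in for the real vLLM args DTO."""

    gpu_memory_utilization: Optional[float] = None
    enforce_eager: Optional[bool] = None


def seed_default_mem_util(args: VLLMArgsSketch, is_70b_or_larger: bool) -> None:
    """Assumed shape of the default seeding: 0.9, or 0.95 for >=70B models."""
    if args.gpu_memory_utilization is None:
        args.gpu_memory_utilization = 0.95 if is_70b_or_larger else 0.9


def old_behavior(args: VLLMArgsSketch) -> None:
    """The condition removed by this PR, as quoted in the summary."""
    if args.gpu_memory_utilization is not None:
        args.enforce_eager = True


# A caller explicitly asks for CUDA graphs on a small model.
args = VLLMArgsSketch(enforce_eager=False)
seed_default_mem_util(args, is_70b_or_larger=False)
old_behavior(args)
print(args)
```

Because the default is always seeded before the check runs, the caller's `enforce_eager=False` never survives.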

Historical context

  • #524 "Hardcode llama 3 70b endpoint param" introduced the flag, hardcoded only for llama-3-70b, as a coupled pair: --gpu-memory-utilization 0.95 --enforce-eager. The intent was a targeted OOM workaround: at very high memory utilization there is little headroom for vLLM's CUDA-graph capture, so eager mode was forced alongside the high mem-util for that one large model.
  • #637 "Refactor client data types + add vllm arg passthrough" generalized the coupling to "whenever gpu_memory_utilization is set, force enforce_eager." Combined with the always-on default mem-util, a 70B-specific safety hack quietly became a global default that disabled CUDA graphs on all endpoints.

Change

Remove the auto-injection. Endpoints now use vLLM's default hybrid eager + CUDA graph mode (better decode throughput). Models that genuinely need eager mode (e.g. several vision / qwen entries in the internal model zoo) already set enforce_eager: true explicitly via additional_args and are unaffected.

Risk

Re-enabling CUDA graphs requires a little extra VRAM for graph capture. At the default gpu_memory_utilization of 0.9–0.95 this is the scenario the original workaround guarded against, so watch for startup OOM on memory-tight models after rollout. Mitigations if needed: lower the default mem-util, or have callers set enforce_eager: true for specific models.

Testing

  • Updated test_update_vllm_force_bundle_recreation_preserves_legacy_vllm_args to reflect that enforce_eager is no longer auto-added.
  • pytest tests/unit/domain/test_llm_use_cases.py — 74 passed.

Greptile Summary

This PR removes a historical accident where --enforce-eager was silently injected on every vLLM endpoint because any non-None gpu_memory_utilization (which is always populated by the default inference helper) triggered the flag. The fix narrows the auto-injection to only the >= 0.95 utilization band — where CUDA-graph capture is genuinely VRAM-tight — and skips it entirely when the caller has already supplied an explicit enforce_eager value.

  • The condition in _create_vllm_bundle_command now guards on enforce_eager is None and gpu_memory_utilization >= 0.95, preserving the original 70B OOM workaround while re-enabling CUDA graphs for all lower-utilization endpoints.
  • The existing bundle-recreation test is updated to reflect that gpu_memory_utilization=0.75 no longer produces --enforce-eager, with the explicit enforce_eager: True fixture value removed accordingly.

Confidence Score: 5/5

Safe to merge; the change is a targeted reduction of scope that restores vLLM default hybrid CUDA-graph mode for endpoints below 0.95 GPU utilization while keeping the existing OOM guard for large models.

The logic change is small and well-reasoned: one compound condition replaces an over-broad one, and the new condition preserves the original safety guard exactly at the threshold it was designed for. Models that explicitly opt into eager mode via additional_args are unaffected.

No files require special attention; both changed files are straightforward and internally consistent.

Important Files Changed

| Filename | Overview |
| --- | --- |
| model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py | Replaces the unconditional gpu_memory_utilization is not None → enforce_eager = True with a targeted check that only auto-enables eager mode when the caller didn't set an explicit value and utilization is ≥ 0.95, restoring CUDA-graph mode for all endpoints below that threshold. |
| model-engine/tests/unit/domain/test_llm_use_cases.py | Updates the existing bundle-recreation test to drop enforce_eager: True from the fixture input and flips the assertion from in to not in for --enforce-eager, correctly reflecting that a 0.75 utilization no longer triggers eager mode. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_create_vllm_bundle_command] --> B{is_worker?}
    B -- Yes --> Z[Skip vllm_args tuning]
    B -- No --> C[Set tensor_parallel_size]
    C --> D{enforce_eager is None?}
    D -- No\ncaller set it explicitly --> E[Respect caller's value]
    D -- Yes --> F{gpu_memory_utilization\nis not None?}
    F -- No --> G[Leave enforce_eager = None\nCUDA graphs ON]
    F -- Yes --> H{gpu_memory_utilization\n>= 0.95?}
    H -- No\n< 0.95 e.g. default 0.9 --> G
    H -- Yes\ne.g. 70B default 0.95 --> I[enforce_eager = True\nCUDA graphs OFF]
    G --> J[Build vllm_cmd]
    I --> J
    E --> J

Reviews (2): Last reviewed commit: "fix(vllm): only force --enforce-eager at..."

@diazagasatya diazagasatya left a comment

Collaborator

Reviewed against the code — solid fix, nice catch on the global regression. A couple of things verified + one thing to confirm before I approve.

Verified

  • The removed block is the only place enforce_eager is force-set — the only other references are the DTO default in common/dtos/llms/vllm.py and explicit model configs (e.g. inference/vllm/examples/v2/llama-3.2-vision/config.json sets enforce_eager: true). So removal cleanly hands control back to explicit additional_args / vLLM's default. 👍
  • The "always true" analysis checks out: infer_addition_engine_args_from_model_name seeds gpu_memory_utilization = 0.9 (0.95 for ≥70B), so gpu_memory_utilization is not None was effectively always true → eager forced on every endpoint. Accurate.
  • Test update is correct (not-set → --enforce-eager absent → vLLM hybrid/CUDA-graph default).

One thing to confirm before approval (the OOM risk)
#524 originally forced eager specifically for llama-3-70b at 0.95 mem-util, because at that utilization there's little headroom for CUDA-graph capture → OOM. ≥70B models still default to 0.95 (infer_addition_engine_args_from_model_name), so after this change they re-enable CUDA graphs at 0.95 and could OOM on startup unless they explicitly set enforce_eager: true. You've confirmed the vision/qwen entries set it — can you confirm the ≥70B models (esp. llama-3-70b) also set enforce_eager: true explicitly (or lower their mem-util)? Otherwise this re-introduces exactly the OOM #524 guarded against for the large models. Since the change only takes effect on bundle recreation, rollout is staggered — worth watching ≥70B endpoints as they recycle.

Minor nit
Consider adding a test that an explicit enforce_eager: true still emits --enforce-eager — current test only covers the auto-add-removed (not-present) path, and that explicit path is now the supported way for models that need eager.
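
A rough sketch of what such a test could look like. `build_vllm_flags` is a hypothetical, simplified flag builder with no auto-injection, standing in for the real `_create_vllm_bundle_command`, whose signature and fixtures are not shown in this thread; only the two flag names are taken from the PR.

```python
from typing import List, Optional


def build_vllm_flags(
    enforce_eager: Optional[bool],
    gpu_memory_utilization: Optional[float],
) -> List[str]:
    """Hypothetical, simplified builder: emits flags only for values that are set."""
    flags: List[str] = []
    if gpu_memory_utilization is not None:
        flags += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    # No auto-injection here: the flag appears only on an explicit True.
    if enforce_eager:
        flags.append("--enforce-eager")
    return flags


def test_explicit_enforce_eager_true_is_emitted() -> None:
    assert "--enforce-eager" in build_vllm_flags(True, 0.75)


def test_unset_enforce_eager_is_not_emitted() -> None:
    assert "--enforce-eager" not in build_vllm_flags(None, 0.75)
```

The real test would go through the existing fixtures in test_llm_use_cases.py; the point is only that both the explicit path and the not-present path get an assertion.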

Net: approve once the ≥70B enforce_eager question is confirmed — the perf win (global eager was disabling CUDA graphs / hurting decode everywhere) is well worth it.

@olliestanley olliestanley force-pushed the ollie/remove-auto-enforce-eager branch from 5ca087c to e4a588c Compare June 30, 2026 18:39
The vLLM bundle command set enforce_eager=True whenever gpu_memory_utilization
was set. Since infer_addition_engine_args_from_model_name always returns a
default gpu_memory_utilization (0.9, or 0.95 for >=70B models), this forced
eager mode on every llmengine vLLM endpoint and silently overrode any explicit
enforce_eager=False from a user.

Eager mode disables CUDA graphs, which massively slows down decode. The override
originated as a targeted workaround in #524 (hardcoded for llama-3-70b, paired
with --gpu-memory-utilization 0.95 to avoid OOM during CUDA graph capture at high
memory utilization) and was unintentionally generalized to all models in the #637
client-data-types refactor.

Restrict the auto-injection to the regime it was actually meant for: only default
eager mode on when gpu_memory_utilization >= 0.95, where there is little headroom
for CUDA graph capture. Below that (the default 0.9 for <70B models), CUDA graphs
stay enabled for faster decode. An explicit enforce_eager from the caller (True or
False) is always respected.
@olliestanley olliestanley force-pushed the ollie/remove-auto-enforce-eager branch from e4a588c to 38e8621 Compare June 30, 2026 18:40
@olliestanley

Member Author

@diazagasatya I'm not sure the OOM issue is even relevant with modern vLLM versions and more recent large models. Almost nobody should be deploying models without graph capture in 2026. In the interest of backwards compatibility, I'll re-enable this guard by default when the utilization target is >= 0.95, but this time allow arguments to explicitly override it.
