[torch.compile] Bunch of small changes needed for enabling torch.compile by pggPL · Pull Request #3130 · NVIDIA/TransformerEngine

pggPL · 2026-06-15T14:41:43Z

Description

Small standalone fixes extracted from a larger torch.compile branch and based directly on main. The two main changes are independent of each other: making Userbuffers pybind11 queries compile-friendly, and freeing quantized grad_output early for column-parallel SP. The PR also includes a custom-recipe caching fix, a split-accumulator refactor, and a CI test hook-up.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Userbuffers pybind11 queries under torch.compile

is_fp8_ubuf() / with_cublasmp() are compile-time constants but graph-break when traced. At the nn.Module.forward boundary (where no UB communicator object is in hand yet) they go through get_ub_is_fp8(name, use_fp8), wrapped in torch.compiler.assume_constant_result — only plain (str, bool) args are baked, so guards are well-defined and don't rely on pybind-object identity.
In the hot forward/backward implementation paths the UB communicator is already fetched, so those call ub_obj.is_fp8_ubuf() / ub_obj.with_cublasmp() directly — no wrapper, no string concatenation, no redundant registry lookup. Eager speed is preserved.

Free quantized grad_output early for column-parallel SP

Row-parallel SP already called clear_tensor_data(grad_output) on the gathered tensor. Column-parallel SP quantizes grad_output to a Float8TensorStorage (an internal tensor) but never freed it. Under torch.compile reduce-overhead this left live pool tensors at recording end ("Detected N tensor(s) in the cudagraph pool not tracked as outputs"). The free now covers row-SP and column-SP-FP8 (column-SP non-FP8 is a no-op view, so it's excluded).
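
The resulting condition can be sketched as a small predicate. This is an illustration under stated assumptions, not the actual guard in linear.py: the function name and parameters are hypothetical, and applying the delayed-wgrad exclusion to both parallel modes is an assumption made for simplicity.

```python
def should_free_grad_output(
    parallel_mode: str,
    sequence_parallel: bool,
    fp8: bool,
    delay_wgrad: bool = False,
) -> bool:
    """Hypothetical predicate: may grad_output be freed after the wgrad GEMM?"""
    if not sequence_parallel:
        return False
    if delay_wgrad:
        # Assumption: with delayed wgrad the tensor is still needed later.
        return False
    if parallel_mode == "row":
        # Row-SP: grad_output is a gathered tensor, freed before this PR too.
        return True
    if parallel_mode == "column":
        # Column-SP: only the FP8 path allocates a quantized copy;
        # the non-FP8 path is a view, so there is nothing to free.
        return fp8
    return False
```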

Replace fp8_recipe in LinearBwdArgs with pre-resolved split-accumulator booleans

LinearBwdArgs no longer carries the recipe object (which holds process-group references and is compile-unfriendly). dgrad_use_split_accumulator / wgrad_use_split_accumulator are resolved once in Linear.forward (reusing the existing get_fp8_recipe() call) and threaded through as plain booleans.

Custom-recipe quantizer caching fix

CustomRecipeState early-exit was missing an identity check, so quantizers were rebuilt on every forward even when the recipe was unchanged. Added if recipe_state.recipe is recipe: return.
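
The effect of the identity check can be illustrated with a toy cache. `ToyRecipeState`, `ToyModule`, and `set_recipe` are hypothetical names chosen for this sketch; only the `recipe_state.recipe is recipe` comparison mirrors the line quoted above.

```python
class ToyRecipeState:
    """Toy stand-in for a recipe state that owns (expensive) quantizers."""

    builds = 0  # counts how often quantizers were (re)built

    def __init__(self, recipe):
        self.recipe = recipe
        ToyRecipeState.builds += 1


class ToyModule:
    def __init__(self):
        self.recipe_state = None

    def set_recipe(self, recipe) -> None:
        recipe_state = self.recipe_state
        # Identity check: skip the rebuild when the very same recipe
        # object is passed again on a later forward call.
        if recipe_state is not None and recipe_state.recipe is recipe:
            return
        self.recipe_state = ToyRecipeState(recipe)


module = ToyModule()
recipe_a = {"name": "custom"}
module.set_recipe(recipe_a)
module.set_recipe(recipe_a)            # same object: cached, no rebuild
builds_after_reuse = ToyRecipeState.builds
module.set_recipe({"name": "custom"})  # equal but distinct object: rebuild
builds_after_new_object = ToyRecipeState.builds
```

Because the comparison uses `is` rather than `==`, an equal but distinct recipe object still triggers a rebuild in this sketch.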

Test hook-up

Added test_torch_compile.py to L0_pytorch_unittest.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

…stants; fix SP memory leak; test suite hook-up

Wrap CommOverlapCore pybind11 methods that return compile-time constants so torch.compile(fullgraph=True) can trace through them without graph breaks:
- `is_fp8_ubuf()` → `ub_is_fp8()` / `get_ub_is_fp8()` in base.py; `_ub_is_fp8()` in gemm.py
- `with_cublasmp()` → `ub_is_cublasmp()` in base.py

All callers in linear.py, layernorm_linear.py, layernorm_mlp.py, base.py, gemm.py, userbuffers_backward_linear.py and userbuffers_forward_linear.py updated.

Fix quantized grad_output not being freed early for column-parallel SP backward. Row-parallel SP already called clear_tensor_data(grad_output) to release the gathered tensor; column-parallel SP quantizes grad_output to Float8TensorStorage but never freed it before returning. Under torch.compile reduce-overhead this leaves 3 live pool tensors at recording end and triggers "Detected 3 tensor(s) in the cudagraph pool not tracked as outputs". Extend the existing clear_tensor_data guard to cover both parallel modes.

Fix custom-recipe quantizer state being re-initialised on every forward call even when the recipe object has not changed. The existing early-exit for CustomRecipeState was missing an identity check on the recipe object, so any repeated call with the same recipe would bypass the early-return and rebuild quantizers unnecessarily. Add `if recipe_state.recipe is recipe: return` to restore the intended caching behaviour.

Add test_torch_compile.py to L0_pytorch_unittest so the autocast and existing compile tests run in CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…-accumulator booleans

LinearBwdArgs stored the entire FP8 recipe object so the backward could extract fp8_gemm_dgrad.use_split_accumulator and fp8_gemm_wgrad.use_split_accumulator at GEMM time. Recipe objects hold process-group references and are not serialisable as compile-time constants, making them incompatible with torch.compile custom-op paths.

Replace fp8_recipe with two plain bool fields:
- dgrad_use_split_accumulator (default _2X_ACC_DGRAD)
- wgrad_use_split_accumulator (default _2X_ACC_WGRAD)

These are resolved once in _linear_setup_ctx and passed into the args struct, so the backward consumes scalars instead of a live recipe object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-06-15T14:49:03Z

Greptile Summary

This PR makes five targeted changes to enable torch.compile compatibility: wrapping UB is_fp8_ubuf() queries in @assume_constant_result, freeing the quantized grad_output in column-parallel SP FP8 backward, threading split-accumulator booleans from forward to backward (removing the fp8_recipe object from LinearBwdArgs), fixing an identity check that caused CustomRecipeState to rebuild quantizers on every forward, and adding the compile test to the L0 CI suite.

get_ub_is_fp8 + @assume_constant_result: New wrapper function is called only at the nn.Module.forward boundary; inner backward hot-paths call ub_obj.is_fp8_ubuf() directly. destroy_ub() now calls torch.compiler.reset() to invalidate baked constants if UB is torn down and re-initialized.
Column-SP FP8 memory fix: clear_tensor_data(grad_output) is now called after the wgrad GEMM for parallel_mode == "column" and fp8, matching the existing row-SP behavior; the condition correctly skips the non-FP8 column-SP path (which is a view, not an internal allocation) and the delayed-wgrad path (where the tensor is still needed).
Split-accumulator refactor: LinearBwdArgs.fp8_recipe is replaced by two plain bool fields (dgrad_use_split_accumulator, wgrad_use_split_accumulator) resolved once at forward time, eliminating process-group–holding recipe objects from the backward compiled graph.

Confidence Score: 5/5

Safe to merge — all five changes are well-scoped, behavior-equivalent where intended, and correctly fix live-tensor leaks and stale cache hits.

The split-accumulator refactor is a pure structural change (booleans resolved at the same point in time, same recipe object, just threaded as plain values instead of the recipe itself). The column-SP FP8 clear_tensor_data is called only after the wgrad GEMM completes, so no use-after-free risk. The CustomRecipeState identity fix prevents spurious quantizer rebuilds without changing correctness. torch.compiler.reset() in destroy_ub() intentionally nukes all compiled caches on teardown to avoid stale assume_constant_result constants, which is the right trade-off for a function that is almost always called at training end or test boundary.

No files require special attention. layernorm_linear.py and layernorm_mlp.py still carry ctx.fp8_recipe in their backward paths, but that is explicitly out of scope for this PR.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/module/base.py	Adds get_ub_is_fp8 with @assume_constant_result, calls torch.compiler.reset() in destroy_ub(), and adds identity check to CustomRecipeState early-exit — all correct and well-commented.
transformer_engine/pytorch/module/linear.py	Adds dgrad_use_split_accumulator/wgrad_use_split_accumulator to LinearFwdArgs and LinearBwdArgs, resolves them at forward time, removes fp8_recipe from backward args, and extends column-SP FP8 clear_tensor_data — behavior-equivalent refactor.
transformer_engine/pytorch/module/layernorm_linear.py	Switches UB is_fp8 check to get_ub_is_fp8 and extends column-SP FP8 clear_tensor_data to match linear.py; ctx.fp8_recipe split-accumulator refactor not applied here (intentional per PR scope).
transformer_engine/pytorch/module/layernorm_mlp.py	Single-line change switching get_ub(...).is_fp8_ubuf() to get_ub_is_fp8(...) — minimal and correct.
tests/pytorch/test_torch_compile.py	Updates _make_qfactory to dispatch on QuantizerRole.tensor_type, adds get_quantizer_roles to ToyLinear — aligns with the CustomRecipeState caching fix.
qa/L0_pytorch_unittest/test.sh	Adds test_torch_compile.py to the L0 CI test suite — straightforward hook-up.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant M as Linear.forward (nn.Module)
    participant GUF as get_ub_is_fp8() @assume_constant_result
    participant UB as UB Communicator
    participant FWD as _Linear.forward
    participant CTX as LinearFwdArgs / LinearBwdArgs
    participant BWD as _linear_backward

    M->>GUF: get_ub_is_fp8(name, is_fp8_enabled())
    GUF->>UB: is_fp8_ubuf()
    UB-->>GUF: bool
    GUF-->>M: fp8_output / fp8_grad

    M->>M: resolve dgrad/wgrad_use_split_accumulator from recipe
    M->>FWD: LinearFwdArgs(dgrad_use_split_accumulator, wgrad_use_split_accumulator, ...)
    FWD->>CTX: _linear_setup_ctx transfers booleans to LinearBwdArgs

    BWD->>BWD: "use_split_accumulator = bwd_args.dgrad_use_split_accumulator"

    Note over BWD: Column-SP FP8 path
    BWD->>BWD: "grad_output = quantizer(grad_output) [Float8TensorStorage]"
    BWD->>BWD: wgrad_gemm(inputmat_total, grad_output)
    BWD->>BWD: clear_tensor_data(grad_output) [NEW: free pool tensor]

Reviews (4): Last reviewed commit: "Merge branch 'main' into torch_compile_s..."

pggPL · 2026-06-16T11:18:31Z

/te-ci pytorch L1

…t_result

get_ub_is_fp8 bakes is_fp8_ubuf() as a compile-time constant; without a reset, destroy_ub + re-init with different FP8 settings would read stale values until recompile. Only affects in-memory caches, not disk.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

ToyLinear now overrides get_quantizer_roles so CustomRecipeState doesn't hit the no-roles warning, which graph-breaks under fullgraph=True. qfactory dispatches on role.tensor_type instead of a pre-baked string key. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

IvanYashchuk · 2026-06-22T10:35:04Z

+    # Compiled graphs may have baked is_fp8_ubuf() via assume_constant_result;
+    # reset so re-init with different settings doesn't read stale constants.
+    torch.compiler.reset()


The current helper call sites are all inside @no_torch_dynamo() forwards, and the added test_torch_compile.py coverage does not exercise user buffers. Or is that covered implicitly in the test?

Is it possible to avoid a process-wide compiler reset on UB teardown, or to add a targeted compiled UB test that proves the stale-constant case and justifies this global invalidation?

IvanYashchuk · 2026-06-22T10:38:29Z

 python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_nvfp4.xml $TE_PATH/tests/pytorch/nvfp4 || test_fail "test_nvfp4"
 python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_mxfp8.xml $TE_PATH/tests/pytorch/mxfp8 || test_fail "test_mxfp8"
 python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_quantized_tensor.xml $TE_PATH/tests/pytorch/test_quantized_tensor.py || test_fail "test_quantized_tensor.py"
+python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_torch_compile.xml $TE_PATH/tests/pytorch/test_torch_compile.py || test_fail "test_torch_compile.py"


That file only compiles a local ToyLinear helper and torch.nn.Linear under te.autocast. It does not instantiate te.Linear, LayerNormLinear, or LayerNormMLP, which are changed in this PR, and it does not cover UB or sequence_parallel/parallel_mode.

Which tests would fail without the changes to the layernorm_linear and layernorm_mlp files?

pggPL · Jun 22, 2026

This PR fixes the issue that the test was not connected to CI.
Currently it only tests whether te.autocast() can be traced inside torch.compile.

This is the first in a series of PRs; here I change only small things to make the following PRs cleaner.

ksivaman

LGTM

ksivaman · 2026-06-25T20:38:44Z

/te-ci pytorch L0 L1

pggPL and others added 2 commits June 15, 2026 16:40

pggPL requested a review from ksivaman as a code owner June 15, 2026 14:41

[pre-commit.ci] auto fixes from pre-commit.com hooks

afe364b

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jun 15, 2026

Comment thread transformer_engine/pytorch/module/base.py

pggPL added 2 commits June 16, 2026 14:05

pggPL mentioned this pull request Jun 17, 2026

[PyTorch][torch.compile] Decouple amax reduction group from the quantizer #3104

Open

13 tasks

IvanYashchuk reviewed Jun 22, 2026

ksivaman approved these changes Jun 25, 2026

Merge branch 'main' into torch_compile_small_fixes

6910743

pggPL wants to merge 6 commits into NVIDIA:main from pggPL:torch_compile_small_fixes

3 participants
