Skip to content

fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context#849

Merged
sayakmaity merged 8 commits into
scaleapi:mainfrom
sayakmaity:sayakmaity/model-engine-dd-env-per-cluster
Jul 1, 2026
Merged

fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context#849
sayakmaity merged 8 commits into
scaleapi:mainfrom
sayakmaity:sayakmaity/model-engine-dd-env-per-cluster

Conversation

@sayakmaity

@sayakmaity sayakmaity commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Problem

model-engine's Datadog env tag is derived from the build-time Helm value .Values.context, so every deployment reports the same env (e.g. production) regardless of which cluster/environment it actually runs in. This affects three surfaces, all keyed off the single overloaded context:

  1. Control-plane pods (gateway, builder, cacher, celery autoscaler) — DD_ENV and the tags.datadoghq.com/env label both come straight from .Values.context.
  2. Runtime-launched inference endpoints / batch jobsDD_ENV and tags.datadoghq.com/env are baked into the rendered service_template_config_map from .Values.context at helm template time, so they're frozen in the image and identical for every cluster (unlike the neighboring DD_SERVICE/DD_VERSION, which are already ${...} runtime-substituted).
  3. Gateway custom metricsDatadogMonitoringMetricsGateway tags every metric with env:{infra_config().env}, and the chart wires the infra config's env from .Values.context too (service_config_map.yaml), so these are build-time-static as well.

Approach

Introduce a dedicated, optional datadog.env value, decoupled from context (which is overloaded — it also selects the service_template_config_map variant, drives a context == "circleci" conditional, sets non-DD labels, and seeds the infra config's env, so it must not be repurposed). The fix mirrors the existing GIT_TAG/DD_VERSION split:

  • Control-planeDD_ENV / env label = {{ .Values.datadog.env | default .Values.context }} (Helm value, backwards-compatible: falls back to context when unset).
  • Launched resources → templates emit the literal ${DD_ENV}, resolved centrally at endpoint-creation time in load_k8s_yaml via a single filtered_kwargs.setdefault("DD_ENV", DD_ENV). The value is sourced from the gateway/builder's own DD_ENV env var (which the chart sets to datadog.env | default context), so launched pods inherit the gateway's per-cluster env. This one hook covers endpoints, batch jobs, cron jobs, image-cache, LWS, and all auxiliary resources (service/virtual-service/destination-rule/hpa/keda/vpa/pdb) without threading DD_ENV through every argument class.
  • Gateway custom metrics → tag with the same DD_ENV, so metrics, traces, and labels all report a consistent per-cluster env.

Changes

Chart (charts/model-engine)

  • values.yaml: add optional datadog.env (default "").
  • _helpers.tpl:
    • baseLabels (control-plane object labels): tags.datadoghq.com/envdatadog.env | default context.
    • baseTemplateLabels (launched service + job templates): tags.datadoghq.com/env${DD_ENV}.
    • serviceEnvBase: remove the hardcoded DD_ENV (moved into the wrappers, like GIT_TAG).
    • serviceEnvGitTagFromHelmVar (control-plane): add DD_ENV = datadog.env | default context.
    • serviceEnvGitTagFromPythonReplace (endpoints): add DD_ENV = ${DD_ENV}.
    • baseServiceTemplateEnv / baseForwarderTemplateEnv (legacy endpoint paths): DD_ENV${DD_ENV}.
  • service_template_config_map.yaml: ad.datadoghq.com/main.logs job annotation env:{{ $env }}env:${DD_ENV}; drop the now-unused $env.
  • celery_autoscaler_stateful_set.yaml: $env (drives both the DD_ENV env var and the env label) → datadog.env | default context.
  • Chart.yaml: 0.2.60.2.7.

Server (model-engine)

  • common/env_vars.py: add exported DD_ENV = os.environ.get("DD_ENV") or infra_config().env (falls back to the infra config env locally/CI where the env var isn't set).
  • infra/gateways/resources/k8s_endpoint_resource_delegate.py: in load_k8s_yaml, filtered_kwargs.setdefault("DD_ENV", DD_ENV) before safe_substitute — the single central injection that resolves ${DD_ENV} for every launched resource. (setdefault lets a caller override.)
  • infra/gateways/datadog_monitoring_metrics_gateway.py:
    • self.tags env tag: infra_config().envDD_ENV, so gateway custom metrics report the same per-cluster env as traces/labels.
    • Drive-by bug fix: _format_call_tags and emit_http_call_error_metrics previously did tags = self.tags; tags.extend(...), mutating self.tags in place — leaking per-call model_name/endpoint_name/error_code tags across every subsequent emission and growing the list unboundedly. Now build a fresh list ([*self.tags, ...]).

Generated fallback

  • infra/gateways/resources/templates/service_template_config_map_circleci.yaml: regenerated via just autogen-templates. Picks up the ${DD_ENV} substitutions, plus pre-existing chart-source drift the stale fallback was missing (the type: AverageValue HPA change and the 7th max_concurrency=${FORWARDER_MAX_CONCURRENCY} --set arg, both from [MLI-6891] feat: wire forwarder_max_concurrency through K8s forwarder template #836), and the helm.sh/chart stamp (0.2.4 → 0.2.7).

Validation

  • helm template across values_sample (datadog.env unset → falls back to context), --set datadog.env=sgp-dev, and values_circleci: control-plane DD_ENV/labels resolve to the Helm value; the launched-resource ConfigMap keeps the literal ${DD_ENV} for Python substitution. No <no value> / unrendered {{ }}.
  • The regenerated circleci fallback is byte-identical to helm template --show-only (modulo the autogen header + cosmetic trailing whitespace).
  • ${DD_ENV} resolves at endpoint creation via load_k8s_yaml. Existing test_k8s_endpoint_resource_delegate.py tests render the real chart and assert tags.datadoghq.com/env == "circleci" — DD_ENV resolves to infra_config().envcircleci under CI (env var unset), so they remain green. The metrics-gateway tests assert only that statsd is called, not tag values, so the env-source swap and tag-copy refactor don't break them.
  • Did not run the full model-engine suite locally (heavy service deps) — CI should confirm.

Known limitations / rollout notes

  • Monitor migration: once an operator sets datadog.env, the gateway's custom metrics (model_engine.* / scale_launch.*) move from env:<context> to env:<datadog.env>. Dashboards/monitors filtering on the old env value must be updated. (No-op while datadog.env is unset — both equal context. The infra config's env field, used for DB/celery routing, is intentionally still keyed off context.)
  • Rollout skew: the launched-resource ConfigMap is a Helm pre-upgrade hook, so its new ${DD_ENV} data can be synced into still-running old builder pods before they roll. An old builder (without the central setdefault) would leave the literal ${DD_ENV} in a label, and the endpoint/HPA/service create would be rejected by the API server (${DD_ENV} is an invalid label value). Loud, infrequent, and self-heals once the new builder is Ready — but expect a brief window during upgrade where endpoint creates may fail.
  • Downstream: consumers that bake the rendered service_template_config_map_<env>.yaml into an image (e.g. model-engine-internal's just autogen-templates + image build) should regenerate; the rendered DD_ENV line becomes a runtime placeholder. This is not the production read path (the chart-mounted ConfigMap is), so it mainly matters for local/CI.

Greptile Summary

This PR updates the forwarder worker path and Firehose logging client behavior. The main changes are:

  • Adds an opt-in gevent worker pool for the Celery forwarder.
  • Adds prefork-only worker recycling through CELERY_WORKER_MAX_TASKS_PER_CHILD.
  • Caches the Firehose client and refreshes it before assumed-role credentials expire.
  • Adds Firehose connection pool sizing for shared gevent use.
  • Adds unit and local integration tests for the new worker-pool and Firehose paths.

Confidence Score: 5/5

No blocking correctness or security issues were identified in the reviewed changes.

The changes are focused, covered by targeted validation, and no accepted review findings remain.

T-Rex T-Rex Logs

What T-Rex did

  • Compared the baseline rendering of Datadog environment handling against the current head to verify how --set datadog.env and unset values affect the rendered service templates and Helm outputs.
  • Reviewed tag handling by comparing base and head tag initialization and formatting behavior, confirming that the head uses an env:dd-env-runtime tag and does not mutate internal self.tags, with per-call route/error tags and metrics emitted accordingly.

View all artifacts

T-Rex Ran code and verified through T-Rex

Comments Outside Diff (1)

  1. charts/model-engine/templates/service_template_config_map.yaml, line 1127 (link)

    P1 Preserve DD_ENV quotes

    The batch-job env vars are parsed through fromYaml and re-emitted with toYaml, which drops the quotes around ${DD_ENV} in the rendered ConfigMap. load_k8s_yaml substitutes before yaml.safe_load, so a valid cluster env like 123, true, or null becomes a YAML int, bool, or null for env.value. Kubernetes then rejects the Job because env var values must be strings. This path needs to keep DD_ENV explicitly quoted after the toYaml round trip.

    Artifacts

    Repro: targeted script using the real chart helper and batch-job template path

    • Contains supporting evidence from the run (text/x-python; charset=utf-8).

    Repro: command output showing unquoted DD_ENV substitution parses to non-string types

    • Keeps the command output available without making the summary code-heavy.

    View artifacts

    T-Rex Ran code and verified through T-Rex

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: charts/model-engine/templates/service_template_config_map.yaml
    Line: 1127
    
    Comment:
    **Preserve DD_ENV quotes**
    
    The batch-job env vars are parsed through `fromYaml` and re-emitted with `toYaml`, which drops the quotes around `${DD_ENV}` in the rendered ConfigMap. `load_k8s_yaml` substitutes before `yaml.safe_load`, so a valid cluster env like `123`, `true`, or `null` becomes a YAML int, bool, or null for `env.value`. Kubernetes then rejects the Job because env var values must be strings. This path needs to keep `DD_ENV` explicitly quoted after the `toYaml` round trip.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

Reviews (3): Last reviewed commit: "Merge branch 'main' into sayakmaity/mode..." | Re-trigger Greptile

@sayakmaity

Copy link
Copy Markdown
Collaborator Author

SGP consumer wiring that populates engine.datadog.env from the per-cluster info object: scaleapi/sgp#3730 (draft).

@sayakmaity

Copy link
Copy Markdown
Collaborator Author

Follow-up commit (ad9c421): tag gateway custom metrics with DD_ENV instead of infra_config().env.

DatadogMonitoringMetricsGateway tagged its statsd metrics with env:{infra_config().env} — the build-time deployment class, not the cluster's Datadog env. Switched to the new per-cluster DD_ENV. (celery_autoscaler.py already reads os.getenv("DD_ENV"), so it's covered by the chart change.) This pairs with scaleapi/sgp#3730, which removes a GCP-only infra.env=dev override; without this commit, that removal would have regressed GCP-dev custom metrics to env:production.

@sayakmaity

Copy link
Copy Markdown
Collaborator Author

Review follow-up (64db255): addresses both findings via central injection at the load_k8s_yaml substitution chokepoint.

Root cause confirmed for both:

  • P1set_main_container_datadog_env re-derives the user container's DD_ENV from the pod label tags.datadoghq.com/env (k8s_endpoint_resource_delegate.py), and that label (baseTemplateLabels) was baked context, so runnable-image endpoints lost the new env.
  • P2${DD_ENV} in $service_env is consumed by the batch-job templates (service_template_config_map.yaml:1124/1251), whose arg builders (live_batch_job_orchestration_gateway.py, live_docker_image_batch_job_gateway.py) never supply DD_ENV, so safe_substitute left the literal.

Fix: load_k8s_yaml is the single substitution chokepoint for every launched resource (endpoints, batch, cron, image-cache, all sub-resources). Inject DD_ENV (the gateway's per-cluster env from env_vars) there once: filtered_kwargs.setdefault("DD_ENV", DD_ENV). Then:

  • baseTemplateLabels label tags.datadoghq.com/env${DD_ENV} (so the delegate reads a per-cluster label for the user container), and batch log annotations → env:${DD_ENV}.
  • This covers labels, forwarder env, batch env, and log configs uniformly — no per-builder threading and no risk of a literal ${DD_ENV} in a k8s label (which would be rejected). I reverted the earlier per-constructor DD_ENV plumbing in k8s_resource_types.py (now net-zero) in favor of this.

Verified with helm template: control-plane labels/DD_ENV render datadog.env (Helm), inference labels/env/log render ${DD_ENV} (runtime), and substitution simulation resolves ${DD_ENV}sgp-dev. Full model-engine test suite not run locally (mypy/CI to confirm).

@sayakmaity

Copy link
Copy Markdown
Collaborator Author

Round 3 review follow-up:

[P3] Regenerated the checked-in fallback template (5064a96). service_template_config_map_circleci.yaml is the default for LAUNCH_SERVICE_TEMPLATE_CONFIG_MAP_PATH (local/CI); prod uses the Helm-mounted configmap via LAUNCH_SERVICE_TEMPLATE_FOLDER, and the unit tests render the chart fresh — so this is a local/CI consistency gap, not a prod bug. Regenerated from the updated gotemplate: 41 launched-pod tags.datadoghq.com/env${DD_ENV}, 14 DD_ENV env values → ${DD_ENV}, 3 log annotations → env:${DD_ENV}; the ConfigMap's own metadata label correctly stays baked (baseLabels). Heads-up: the file was already stale at chart 0.2.4, so regenerating at 0.2.7 also syncs 3 unrelated pre-existing drift lines (HPA ValueAverageValue, forwarder max_concurrency) — not my logic changes. (Regenerated with helm v4 + trailing-whitespace strip to match repo style; maintainers may re-run just autogen-templates with the pinned toolchain.)

[Note] Fixed the mutable self.tags leak (fa9ed71) in DatadogMonitoringMetricsGateway while here: _format_call_tags and emit_http_call_error_metrics did tags = self.tags (alias) then .extend(...), permanently appending per-call model_name/endpoint_name/error_code to the shared base list. Now copy via [*self.tags, ...].

[Note] k8s_resource_types.py is intentionally net-zero — the earlier per-constructor plumbing was reverted in favor of the central load_k8s_yaml injection.

@sayakmaity sayakmaity left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review. Traced every path and rendered the chart across value profiles (values_sample, --set datadog.env=sgp-dev, values_circleci). The change is correct and ready to ship: the two-tier split — Helm value for control-plane pods, literal ${DD_ENV} for launched resources resolved centrally in load_k8s_yaml — is clean, all 58 ${DD_ENV} sites are covered by the single setdefault, the regenerated circleci fallback is byte-identical to helm template --show-only, and the existing tests stay green (DD_ENVinfra_config().envcircleci under CI, so the tags.datadoghq.com/env == "circleci" assertions still hold).

The self.tags mutation fix in datadog_monitoring_metrics_gateway.py is the highest-value part: pre-fix, _format_call_tags / emit_http_call_error_metrics did tags = self.tags; tags.extend(...), mutating self.tags in place — so every call permanently appended model_name / endpoint_name / error_code tags (unbounded growth + cross-call leakage that also polluted the self.tags-only emitters like emit_attempted_build_metric). The [*self.tags, ...] copies fix it completely; no in-place mutation remains.

@sayakmaity

Copy link
Copy Markdown
Collaborator Author

Local test run (Python 3.13 venv, matching .circleci server job):

  • pytest tests/unit777 passed, 8 skipped, 0 failed
  • Targeted (test_k8s_endpoint_resource_delegate.py + test_datadog_monitoring_metrics_gateway.py) → 63 passed, incl. test_resource_arguments_type_and_add_datadog_env_to_main_container parametrized over every resource type (batch-job orchestration, docker-image batch gpu/cpu, image-cache, cron, all endpoint variants) — exercises the DD_ENV injection on all of them.
  • ruff check ✅ · black --check ✅ (3 files unchanged) · isort --check ✅ · mypy ✅ (no issues, v1.11.2, all 3 changed files).

Not reproduced locally: diff-cover --fail-under=50 (coverage-delta gate, not correctness).

@sayakmaity sayakmaity marked this pull request as ready for review June 29, 2026 21:27
sayakmaity and others added 2 commits June 29, 2026 18:24
The baseTemplateLabels `tags.datadoghq.com/env` was an unquoted scalar
(`${DD_ENV}`). load_k8s_yaml substitutes before yaml.safe_load, so a
scalar-looking datadog.env (e.g. 123, true, null) would parse as a
non-string and be rejected as a Kubernetes label value. Quote it so the
substituted value is always a string. Regenerated the circleci fallback
to match (label lines only).

Addresses the Greptile review on scaleapi#849. The batch-job env *values*
(`value: ${DD_ENV}`) go through a `$service_env | fromYaml ... | toYaml`
round-trip that strips quotes regardless, and match the pre-existing
`${GIT_TAG}` treatment, so they are left as-is.

@ishanatscale ishanatscale left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Clean, backwards-compatible decoupling of the Datadog env tag from context via the new optional datadog.env, with the central load_k8s_yaml setdefault("DD_ENV", ...) as a single injection point. The self.tags in-place mutation fix in the metrics gateway is a nice correctness catch. CI is green.

Non-blocking: the Greptile P1 about the unquoted ${DD_ENV} in the batch-job fromYaml/toYaml round trip is a real but low-risk edge case (only bites if datadog.env is ever set to a YAML-scalar-like value such as 123/true/null; real Datadog env names are always plain strings). Worth a follow-up to re-quote it, but not a blocker.

@sayakmaity sayakmaity enabled auto-merge (squash) July 1, 2026 20:51
@sayakmaity sayakmaity merged commit a823696 into scaleapi:main Jul 1, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants