fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context#849
Conversation
|
SGP consumer wiring that populates |
…infra_config().env
|
Follow-up commit (
|
|
Review follow-up ( Root cause confirmed for both:
Fix:
Verified with |
… leakage across emissions
|
Round 3 review follow-up: [P3] Regenerated the checked-in fallback template ( [Note] Fixed the mutable [Note] |
There was a problem hiding this comment.
Self-review. Traced every path and rendered the chart across value profiles (values_sample, --set datadog.env=sgp-dev, values_circleci). The change is correct and ready to ship: the two-tier split — Helm value for control-plane pods, literal ${DD_ENV} for launched resources resolved centrally in load_k8s_yaml — is clean, all 58 ${DD_ENV} sites are covered by the single setdefault, the regenerated circleci fallback is byte-identical to helm template --show-only, and the existing tests stay green (DD_ENV → infra_config().env → circleci under CI, so the tags.datadoghq.com/env == "circleci" assertions still hold).
The self.tags mutation fix in datadog_monitoring_metrics_gateway.py is the highest-value part: pre-fix, _format_call_tags / emit_http_call_error_metrics did tags = self.tags; tags.extend(...), mutating self.tags in place — so every call permanently appended model_name / endpoint_name / error_code tags (unbounded growth + cross-call leakage that also polluted the self.tags-only emitters like emit_attempted_build_metric). The [*self.tags, ...] copies fix it completely; no in-place mutation remains.
|
Local test run (Python 3.13 venv, matching
Not reproduced locally: |
The baseTemplateLabels `tags.datadoghq.com/env` was an unquoted scalar
(`${DD_ENV}`). load_k8s_yaml substitutes before yaml.safe_load, so a
scalar-looking datadog.env (e.g. 123, true, null) would parse as a
non-string and be rejected as a Kubernetes label value. Quote it so the
substituted value is always a string. Regenerated the circleci fallback
to match (label lines only).
Addresses the Greptile review on scaleapi#849. The batch-job env *values*
(`value: ${DD_ENV}`) go through a `$service_env | fromYaml ... | toYaml`
round-trip that strips quotes regardless, and match the pre-existing
`${GIT_TAG}` treatment, so they are left as-is.
ishanatscale
left a comment
There was a problem hiding this comment.
Approved. Clean, backwards-compatible decoupling of the Datadog env tag from context via the new optional datadog.env, with the central load_k8s_yaml setdefault("DD_ENV", ...) as a single injection point. The self.tags in-place mutation fix in the metrics gateway is a nice correctness catch. CI is green.
Non-blocking: the Greptile P1 about the unquoted ${DD_ENV} in the batch-job fromYaml/toYaml round trip is a real but low-risk edge case (only bites if datadog.env is ever set to a YAML-scalar-like value such as 123/true/null; real Datadog env names are always plain strings). Worth a follow-up to re-quote it, but not a blocker.
Problem
model-engine's Datadog
envtag is derived from the build-time Helm value.Values.context, so every deployment reports the sameenv(e.g.production) regardless of which cluster/environment it actually runs in. This affects three surfaces, all keyed off the single overloadedcontext:DD_ENVand thetags.datadoghq.com/envlabel both come straight from.Values.context.DD_ENVandtags.datadoghq.com/envare baked into the renderedservice_template_config_mapfrom.Values.contextathelm templatetime, so they're frozen in the image and identical for every cluster (unlike the neighboringDD_SERVICE/DD_VERSION, which are already${...}runtime-substituted).DatadogMonitoringMetricsGatewaytags every metric withenv:{infra_config().env}, and the chart wires the infra config'senvfrom.Values.contexttoo (service_config_map.yaml), so these are build-time-static as well.Approach
Introduce a dedicated, optional
datadog.envvalue, decoupled fromcontext(which is overloaded — it also selects theservice_template_config_mapvariant, drives acontext == "circleci"conditional, sets non-DD labels, and seeds the infra config'senv, so it must not be repurposed). The fix mirrors the existingGIT_TAG/DD_VERSIONsplit:DD_ENV/ env label ={{ .Values.datadog.env | default .Values.context }}(Helm value, backwards-compatible: falls back tocontextwhen unset).${DD_ENV}, resolved centrally at endpoint-creation time inload_k8s_yamlvia a singlefiltered_kwargs.setdefault("DD_ENV", DD_ENV). The value is sourced from the gateway/builder's ownDD_ENVenv var (which the chart sets todatadog.env | default context), so launched pods inherit the gateway's per-cluster env. This one hook covers endpoints, batch jobs, cron jobs, image-cache, LWS, and all auxiliary resources (service/virtual-service/destination-rule/hpa/keda/vpa/pdb) without threadingDD_ENVthrough every argument class.DD_ENV, so metrics, traces, and labels all report a consistent per-cluster env.Changes
Chart (
charts/model-engine)values.yaml: add optionaldatadog.env(default"")._helpers.tpl:baseLabels(control-plane object labels):tags.datadoghq.com/env→datadog.env | default context.baseTemplateLabels(launched service + job templates):tags.datadoghq.com/env→${DD_ENV}.serviceEnvBase: remove the hardcodedDD_ENV(moved into the wrappers, likeGIT_TAG).serviceEnvGitTagFromHelmVar(control-plane): addDD_ENV=datadog.env | default context.serviceEnvGitTagFromPythonReplace(endpoints): addDD_ENV=${DD_ENV}.baseServiceTemplateEnv/baseForwarderTemplateEnv(legacy endpoint paths):DD_ENV→${DD_ENV}.service_template_config_map.yaml:ad.datadoghq.com/main.logsjob annotationenv:{{ $env }}→env:${DD_ENV}; drop the now-unused$env.celery_autoscaler_stateful_set.yaml:$env(drives both theDD_ENVenv var and the env label) →datadog.env | default context.Chart.yaml:0.2.6→0.2.7.Server (
model-engine)common/env_vars.py: add exportedDD_ENV=os.environ.get("DD_ENV") or infra_config().env(falls back to the infra config env locally/CI where the env var isn't set).infra/gateways/resources/k8s_endpoint_resource_delegate.py: inload_k8s_yaml,filtered_kwargs.setdefault("DD_ENV", DD_ENV)beforesafe_substitute— the single central injection that resolves${DD_ENV}for every launched resource. (setdefaultlets a caller override.)infra/gateways/datadog_monitoring_metrics_gateway.py:self.tagsenv tag:infra_config().env→DD_ENV, so gateway custom metrics report the same per-cluster env as traces/labels._format_call_tagsandemit_http_call_error_metricspreviously didtags = self.tags; tags.extend(...), mutatingself.tagsin place — leaking per-callmodel_name/endpoint_name/error_codetags across every subsequent emission and growing the list unboundedly. Now build a fresh list ([*self.tags, ...]).Generated fallback
infra/gateways/resources/templates/service_template_config_map_circleci.yaml: regenerated viajust autogen-templates. Picks up the${DD_ENV}substitutions, plus pre-existing chart-source drift the stale fallback was missing (thetype: AverageValueHPA change and the 7thmax_concurrency=${FORWARDER_MAX_CONCURRENCY}--setarg, both from [MLI-6891] feat: wire forwarder_max_concurrency through K8s forwarder template #836), and thehelm.sh/chartstamp (0.2.4 → 0.2.7).Validation
helm templateacrossvalues_sample(datadog.env unset → falls back tocontext),--set datadog.env=sgp-dev, andvalues_circleci: control-planeDD_ENV/labels resolve to the Helm value; the launched-resource ConfigMap keeps the literal${DD_ENV}for Python substitution. No<no value>/ unrendered{{ }}.helm template --show-only(modulo the autogen header + cosmetic trailing whitespace).${DD_ENV}resolves at endpoint creation viaload_k8s_yaml. Existingtest_k8s_endpoint_resource_delegate.pytests render the real chart and asserttags.datadoghq.com/env == "circleci"— DD_ENV resolves toinfra_config().env→circleciunder CI (env var unset), so they remain green. The metrics-gateway tests assert only that statsd is called, not tag values, so the env-source swap and tag-copy refactor don't break them.model-enginesuite locally (heavy service deps) — CI should confirm.Known limitations / rollout notes
datadog.env, the gateway's custom metrics (model_engine.*/scale_launch.*) move fromenv:<context>toenv:<datadog.env>. Dashboards/monitors filtering on the oldenvvalue must be updated. (No-op whiledatadog.envis unset — both equalcontext. The infra config'senvfield, used for DB/celery routing, is intentionally still keyed offcontext.)${DD_ENV}data can be synced into still-running old builder pods before they roll. An old builder (without the centralsetdefault) would leave the literal${DD_ENV}in a label, and the endpoint/HPA/service create would be rejected by the API server (${DD_ENV}is an invalid label value). Loud, infrequent, and self-heals once the new builder is Ready — but expect a brief window during upgrade where endpoint creates may fail.service_template_config_map_<env>.yamlinto an image (e.g. model-engine-internal'sjust autogen-templates+ image build) should regenerate; the renderedDD_ENVline becomes a runtime placeholder. This is not the production read path (the chart-mounted ConfigMap is), so it mainly matters for local/CI.Greptile Summary
This PR updates the forwarder worker path and Firehose logging client behavior. The main changes are:
CELERY_WORKER_MAX_TASKS_PER_CHILD.Confidence Score: 5/5
No blocking correctness or security issues were identified in the reviewed changes.
The changes are focused, covered by targeted validation, and no accepted review findings remain.
What T-Rex did
Comments Outside Diff (1)
charts/model-engine/templates/service_template_config_map.yaml, line 1127 (link)The batch-job env vars are parsed through
fromYamland re-emitted withtoYaml, which drops the quotes around${DD_ENV}in the rendered ConfigMap.load_k8s_yamlsubstitutes beforeyaml.safe_load, so a valid cluster env like123,true, ornullbecomes a YAML int, bool, or null forenv.value. Kubernetes then rejects the Job because env var values must be strings. This path needs to keepDD_ENVexplicitly quoted after thetoYamlround trip.Artifacts
Repro: targeted script using the real chart helper and batch-job template path
Repro: command output showing unquoted DD_ENV substitution parses to non-string types
Prompt To Fix With AI
Reviews (3): Last reviewed commit: "Merge branch 'main' into sayakmaity/mode..." | Re-trigger Greptile