Sampling

Sampling is the observability practice of keeping enough telemetry to understand behavior while dropping a controlled portion of high-volume data. In Azure Monitor Application Insights, it usually applies to traces, requests, dependencies, exceptions, and logs from instrumented applications. Good sampling is not random blindness. It preserves useful diagnostic patterns, keeps complete traces where possible, and lets operators estimate the original event volume from the itemCount value recorded on each retained item. The goal is to control ingestion cost and noise without losing the evidence needed to investigate failures.

Aliases: Application Insights sampling, telemetry sampling, OpenTelemetry sampling, fixed-rate sampling, rate-limited sampling
Difficulty: fundamentals
CLI mappings: 3
Last verified: 2026-05-22

Microsoft Learn

Microsoft Learn explains that Application Insights sampling with OpenTelemetry reduces telemetry volume, lowers ingestion cost, and keeps diagnostic data useful. It supports fixed-rate and rate-limited strategies, relies on the Azure Monitor sampler for complete traces, and should be configured intentionally rather than assumed enabled.

Microsoft Learn: Sampling in Azure Monitor Application Insights with OpenTelemetry (2026-05-22)

Technical context

In Azure architecture, sampling sits between application instrumentation and Azure Monitor ingestion. The application or OpenTelemetry collector makes a sampling decision before data reaches Application Insights, then the retained telemetry lands in the connected Log Analytics workspace. Metrics are treated differently from sampled traces and should remain the primary alerting signal. Sampling also interacts with Live Metrics, distributed tracing, data caps, workbooks, and KQL queries. The design choice belongs with observability owners, not only with developers changing SDK settings.

Why it matters

Sampling matters because production telemetry volume grows faster than most teams expect. A chatty dependency, a noisy background job, or a popular endpoint can push Application Insights ingestion costs up in a single release. Poor sampling creates a different failure: broken traces, missing failure context, and dashboards that hide rare but serious incidents. A good sampling strategy lets teams keep visibility during traffic spikes, preserve enough data for performance analysis, and avoid emergency data-retention cuts. It also forces healthy discipline: metrics for alerts, sampled traces for diagnosis, and documented exceptions for workloads where full fidelity is required. That balance keeps cost control from weakening real incident response.

Where you see it

Signals, screens, and Azure surfaces where this term usually becomes operational.

Signal 01

In Application Insights Usage and estimated costs, the Data Sampling screen shows retained percentages and helps operators spot whether ingestion sampling is still reducing telemetry because source-level sampling failed to control the volume.

Signal 02

In Log Analytics queries, itemCount values above one reveal sampled telemetry and let engineers estimate original request, dependency, trace, or exception volume by table and hour.

Signal 03

In OpenTelemetry configuration files, sampling ratios or rate limits appear beside exporter settings, making release pipelines the place where telemetry volume can drift silently with each release.

Signal 04

In Azure Monitor workbooks, sudden gaps in dependency traces or unusually low retained percentages signal that sampling changed before the incident review could explain it.
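
The itemCount arithmetic behind Signal 02 can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for the KQL query: the function name and the sample rows are hypothetical, and it assumes each retained record carries an itemCount equal to the number of original events it represents.

```python
from collections import defaultdict


def summarize_retained(rows):
    """Estimate original volume and retained percentage per item type.

    Each row is a dict with 'itemType' and 'itemCount'. itemCount is the
    number of original events that one retained record stands for.
    """
    visible = defaultdict(int)
    estimated = defaultdict(int)
    for row in rows:
        visible[row["itemType"]] += 1
        estimated[row["itemType"]] += row["itemCount"]
    return {
        item_type: {
            "visible_rows": visible[item_type],
            "estimated_original": estimated[item_type],
            # Equivalent to 100 / avg(itemCount) for the group.
            "retained_percentage": 100.0 * visible[item_type] / estimated[item_type],
        }
        for item_type in visible
    }


# Hypothetical sample: requests sampled at 25 percent, exceptions kept in full.
sample_rows = (
    [{"itemType": "request", "itemCount": 4}] * 50
    + [{"itemType": "exception", "itemCount": 1}] * 3
)
summary = summarize_retained(sample_rows)
for item_type, stats in summary.items():
    print(item_type, stats)
```

In this sample, 50 visible request rows stand for an estimated 200 original requests, which is the gap a plain row count would hide.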

When this becomes relevant

Specific situations where this term helps solve real Azure design, operations, migration, security, reliability, cost, or governance problems.

  • Control a sudden Application Insights ingestion spike from a high-traffic endpoint without disabling the traces engineers need for root-cause analysis.
  • Preserve end-to-end trace shape during an OpenTelemetry migration by using the Azure Monitor sampler instead of unrelated downstream dropping.
  • Set rate-limited sampling for noisy background jobs so nightly processing stays visible without dominating the observability budget.
  • Validate that metrics still drive alerts while sampled traces remain useful for diagnosis after a new SDK or collector release.
  • Compare retained percentages across microservices to find one app that is sampling too aggressively and breaking distributed trace investigations.
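
The idea behind rate-limited sampling for noisy workloads can be sketched as follows. This is a simplified model under stated assumptions, not the Azure Monitor sampler's implementation: the function name is hypothetical, and a real sampler would smooth the observed rate over time rather than react to a single reading.

```python
def rate_limited_percentage(observed_per_second, target_per_second):
    """Return the sampling percentage that keeps roughly target_per_second traces."""
    if target_per_second <= 0:
        raise ValueError("target_per_second must be positive")
    if observed_per_second <= target_per_second:
        return 100.0  # Traffic is under the limit, so keep everything.
    return 100.0 * target_per_second / observed_per_second


# Hypothetical traffic levels against a target of 5 traces per second.
for observed in (2, 5, 50, 500):
    pct = rate_limited_percentage(observed, target_per_second=5)
    print(f"{observed:>4} traces/s -> keep {pct:.1f}%")
```

The useful property is that quiet periods keep full fidelity while bursts are trimmed toward the same ceiling, so a nightly job cannot dominate the budget.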

Real-world case studies

Different enterprise-style examples that show the term being used to hit measurable objectives.

Case study 01

SaaS support platform stops telemetry from overrunning its incident budget

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

A SaaS support platform saw Application Insights costs jump after a new message timeline feature emitted dependency spans for every sidebar refresh. Incident responders also complained that the traces they did keep were fragmented.

Business/Technical Objectives
  • Cut telemetry ingestion by at least 35 percent without losing failure investigations.
  • Keep request and dependency traces connected for the busiest customer workflows.
  • Create a repeatable evidence query for FinOps and SRE review.
  • Avoid changing production alert logic during the cost-control effort.
Solution Using Sampling

The observability team moved from portal-side ingestion sampling to source-level sampling with the Azure Monitor OpenTelemetry distro. They applied fixed-rate sampling to high-volume successful traces, kept metrics as the alerting source, and kept a higher sampling percentage for exception-heavy paths. A scheduled Azure CLI job ran a daily Log Analytics query that calculated retained percentages from itemCount by table, role name, and hour. The pipeline stored the sampling configuration beside the service deployment manifest, so every release showed whether sampling had drifted. Workbooks were updated to show both visible events and estimated original volume.

Results & Business Impact
  • Telemetry ingestion fell 42 percent in the first billing cycle.
  • Trace completeness for top support workflows rose from 61 percent to 93 percent.
  • FinOps review time dropped from three meetings to one evidence export.
  • No production alert thresholds changed during the transition.
Key Takeaway for Glossary Readers

Sampling works best when it is treated as an engineered observability control, not a panic button for an expensive workspace.

Case study 02

Game studio keeps launch telemetry useful during a global event

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

A multiplayer game studio expected a tenfold traffic burst during a weekend tournament. Previous launches flooded Application Insights with successful matchmaking traces while rare payment failures became difficult to find.

Business/Technical Objectives
  • Keep tournament diagnostics queryable during peak traffic.
  • Reduce successful trace volume without hiding payment or login failures.
  • Give live operations staff one check for sampling health.
  • Preserve enough data to tune matchmaking latency after the event.
Solution Using Sampling

The platform team configured rate-limited sampling for matchmaking and lobby spans while keeping failure, payment, and login paths at a higher retention rate. Application metrics stayed unsampled and powered the war-room dashboard. Before the tournament, engineers used Azure CLI to run Log Analytics checks against a staging load test, confirming itemCount behavior and retained percentages. During the event, the same query ran every hour and posted summarized output to the release channel. After the tournament, sampled traces were combined with unsampled metrics to isolate one regional dependency bottleneck.

Results & Business Impact
  • The workspace avoided estimated overage charges of 28,000 dollars.
  • Payment-failure investigation time fell from 47 minutes to 11 minutes.
  • Live operations kept KQL query latency under 20 seconds at peak.
  • Matchmaking latency tuning used representative traces from every region.
Key Takeaway for Glossary Readers

Sampling lets high-volume events stay observable when the team decides in advance which signals deserve full fidelity.

Case study 03

Industrial IoT integrator separates useful traces from machine noise

Scenario, objectives, solution, measured impact, and takeaway.

Scenario

An industrial IoT integrator connected factory gateways to Azure-hosted APIs that produced millions of dependency traces per shift. Maintenance engineers needed failure evidence, but normal polling drowned everything else.

Business/Technical Objectives
  • Lower normal polling telemetry while retaining fault diagnostics.
  • Show original event volume for each factory without storing every record.
  • Protect maintenance dashboards from misleading sampled log alerts.
  • Document the sampling model for plant auditors.
Solution Using Sampling

The Azure team configured sampling in the OpenTelemetry collector used by gateway-facing services. Normal polling traces were sampled at a conservative fixed rate, while exception traces and health metrics stayed available for alerting. Each retained record carried factory, gateway, and firmware dimensions. Operators used Azure CLI to query retained percentage by factory and to compare Application Insights ingestion before and after the rollout. The runbook explained that metrics, not sampled logs, triggered equipment alerts. Auditors received exported evidence showing sampling rates, workspace retention, and the untouched alert source.

Results & Business Impact
  • Daily telemetry ingestion dropped 51 percent across 18 factories.
  • Maintenance dashboards removed three false alert patterns caused by sampled logs.
  • Average fault triage time improved from 36 minutes to 14 minutes.
  • Audit evidence was prepared in two hours instead of two days.
Key Takeaway for Glossary Readers

Sampling can reduce machine-generated noise while still preserving the operational trail needed to repair real failures.

Why use Azure CLI for this?

After ten years of Azure engineering, I use Azure CLI around sampling because the important question is usually not where the portal toggle lives; it is whether the data proves sampling is behaving. CLI lets me query Application Insights or the Log Analytics workspace, export retained percentages, compare ingestion before and after a release, and capture evidence for cost reviews. Sampling configuration often lives in code or OpenTelemetry settings, so CLI is the inspection and validation tool. It gives operators repeatable checks that survive handoffs, incident bridges, and FinOps conversations better than screenshots. It also exposes drift when one service changes telemetry settings unnoticed.

CLI use cases

  • Run KQL through Azure CLI to calculate retained telemetry percentages by table, service, and hour before and after a sampling change.
  • Inventory Application Insights components and connected workspaces so ownership, retention, and sampling validation are tied to the right resources.
  • Export ingestion and metric evidence for a FinOps review without relying on a portal-only screenshot of the Usage blade.

Before you run CLI

  • Confirm tenant, subscription, resource group, Application Insights component, connected workspace, time range, permissions, and output format before running evidence queries.
  • Check whether sampling is configured at source, collector, or ingestion; changing the wrong layer can break traces or hide useful failures.
  • Agree on cost risk, alerting signals, security-sensitive telemetry, and rollback settings before reducing retained trace or log volume in production.

What output tells you

  • KQL output showing itemCount and retained percentage tells whether sampled records represent more original events than the visible row count suggests.
  • Component output gives the Application Insights resource ID, linked workspace resource ID, connection string, and location needed for repeatable diagnostics.
  • Metric output shows ingestion, request rate, failures, or availability trends that should remain reliable even when trace records are sampled.

Mapped Azure CLI commands

Sampling evidence queries

operational
az monitor app-insights component show --app <app-insights-name> --resource-group <resource-group>
az monitor log-analytics query --workspace <workspace-id> --analytics-query "union AppRequests, AppDependencies, AppTraces, AppExceptions | where TimeGenerated > ago(1d) | summarize RetainedPercentage = 100/avg(ItemCount) by Type"
az monitor metrics list --resource <app-insights-resource-id> --metric requests/count --interval PT1H
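
For scheduled evidence runs, the Log Analytics query command can be wrapped in a small script. The sketch below is illustrative only. It assumes a workspace-based query that groups by the Type column, and it assumes the CLI returns a JSON list of row objects whose values arrive as strings. The function names and the sample output are hypothetical, and run_query needs a signed-in Azure CLI to do anything useful.

```python
import json
import subprocess


def build_query_command(workspace_id, kql):
    """Build the argument list for az monitor log-analytics query."""
    return [
        "az", "monitor", "log-analytics", "query",
        "--workspace", workspace_id,
        "--analytics-query", kql,
        "--output", "json",
    ]


def parse_retained(output_text, group_column):
    """Map each group (for example a table name) to its retained percentage."""
    rows = json.loads(output_text)
    return {row[group_column]: float(row["RetainedPercentage"]) for row in rows}


def run_query(workspace_id, kql, group_column="Type"):
    """Run the query through Azure CLI and parse the result."""
    result = subprocess.run(
        build_query_command(workspace_id, kql),
        capture_output=True, text=True, check=True,
    )
    return parse_retained(result.stdout, group_column)


# Hypothetical output shape, used here so the parsing step can be checked offline.
sample_output = json.dumps([
    {"Type": "AppRequests", "RetainedPercentage": "25", "TableName": "PrimaryResult"},
    {"Type": "AppExceptions", "RetainedPercentage": "100", "TableName": "PrimaryResult"},
])
print(parse_retained(sample_output, "Type"))
```

Keeping command construction and parsing separate means the parsing logic can be tested without credentials, which helps when the check moves into a release pipeline.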

Architecture context

Architecturally, sampling is an observability control, not a substitute for good instrumentation. I design it as part of the telemetry pipeline: application code emits spans and logs, the Azure Monitor OpenTelemetry distro or collector makes sampling decisions, retained telemetry reaches Application Insights, and Log Analytics stores queryable records. Services that participate in one distributed trace should use compatible sampling so traces do not fragment across boundaries. Alerts should rely on metrics or unsampled health signals, while sampled traces support root-cause investigation. For regulated workloads, the design must document what telemetry may be dropped and which failure classes remain fully observable.
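
Why compatible sampling keeps traces whole is easier to see in code. The sketch below is a generic illustration of deterministic, trace-ID-based sampling. It is not the algorithm used by the Azure Monitor sampler: SHA-256 and the function name are choices made for this example, and the trace IDs are synthetic.

```python
import hashlib


def keep_trace(trace_id: str, percentage: float) -> bool:
    """Decide from the trace ID alone, so every service reaches the same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Compare the first 8 bytes of the hash against a threshold in [0, 2**64].
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(percentage / 100.0 * 2**64)
    return bucket < threshold


# Synthetic 32-character hex trace IDs.
trace_ids = [f"{i:032x}" for i in range(10_000)]

# Two services using the same rate make identical decisions per trace.
frontend = {t for t in trace_ids if keep_trace(t, 25.0)}
backend = {t for t in trace_ids if keep_trace(t, 25.0)}
print("same decisions:", frontend == backend)
print("kept fraction:", len(frontend) / len(trace_ids))
```

Because the decision depends only on the trace ID and the rate, a trace is either kept by every participating service or dropped by all of them. If services pick independent random numbers instead, each hop drops a different subset and traces fragment.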

Security

Security impact is indirect but real. Sampling does not grant access, encrypt data, or change network boundaries, yet it changes what evidence security teams can review after an incident. If authentication failures, suspicious dependencies, or unusual IP patterns are sampled too aggressively, investigation becomes weaker. The opposite risk also exists: retained telemetry can still contain user identifiers, request paths, headers, or payload fragments, so sampling must not be treated as a privacy control. Secure designs sanitize sensitive fields before export, keep workspace access least-privilege, retain security-relevant metrics, and document which traces are reduced. Access reviews should include who can change sampling and workspace export.
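
Sanitizing sensitive fields before export can be sketched as a small attribute filter. This is a minimal example under stated assumptions: the attribute keys, the sensitive-key list, and the function names are hypothetical, and a real deployment would hook equivalent logic into its telemetry pipeline rather than call it by hand.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list; a real one should come from a reviewed data-handling policy.
SENSITIVE_KEYS = {"authorization", "cookie", "set-cookie", "x-api-key", "user.email"}


def strip_query(url):
    """Drop the query string and fragment, which often carry tokens or identifiers."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def sanitize_attributes(attributes, url_keys=("url",)):
    """Return a copy of the attributes with sensitive values redacted."""
    clean = {}
    for key, value in attributes.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif key in url_keys and isinstance(value, str):
            clean[key] = strip_query(value)
        else:
            clean[key] = value
    return clean


span_attributes = {
    "Authorization": "Bearer abc123",
    "url": "https://api.example.com/orders/42?token=abc123",
    "http.status_code": 200,
}
print(sanitize_attributes(span_attributes))
```

Redaction runs regardless of the sampling decision, which is the point: sampling reduces how many records are stored, not what each retained record contains.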

Cost

Sampling has direct cost impact because Application Insights and Log Analytics charges are driven heavily by ingested data volume and retention. Keeping every request, dependency, trace, and log from a busy application can turn observability into a surprise monthly bill. Sampling reduces ingestion and storage while preserving representative diagnostic value. FinOps teams should not set the rate blindly; they should compare retained percentages, data by table, peak traffic, alert needs, and investigation quality. Cost savings are strongest when sampling is combined with log-level hygiene, data caps, retention policy, and ownership tags. Chargeback reports should show savings beside any diagnostic tradeoffs accepted.
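
A rough way to reason about the savings is simple arithmetic on ingested volume. The sketch below is illustrative only. The price per GB is a placeholder and not an Azure price, the volumes are hypothetical, and it assumes all of the counted volume is subject to sampling, which is not true of metrics.

```python
def estimate_monthly_ingestion(daily_gb, retained_percentage, price_per_gb, days=30):
    """Estimate retained volume, dropped volume, and cost for one month."""
    total_gb = daily_gb * days
    retained_gb = total_gb * retained_percentage / 100.0
    return {
        "retained_gb": retained_gb,
        "dropped_gb": total_gb - retained_gb,
        "estimated_cost": retained_gb * price_per_gb,
    }


# Hypothetical workload: 40 GB per day, 50 percent retained, placeholder price.
estimate = estimate_monthly_ingestion(daily_gb=40, retained_percentage=50, price_per_gb=2.0)
print(estimate)
```

The number is only a starting point for a FinOps review; it says nothing about whether the retained half still supports investigations.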

Reliability

Reliability impact is indirect because sampling does not make an application more available. It affects how quickly teams detect and explain reliability problems. If sampling drops too much data, a one-percent failure rate can disappear from traces while users still feel it. If sampling keeps complete traces and metrics remain unsampled, engineers can investigate latency, dependency errors, and retries without drowning in records. Reliable sampling strategies test new settings in staging, keep exception visibility high, watch ingestion throttling, and verify that alerting uses signals that sampling will not distort. That evidence prevents teams from mistaking missing traces for healthy systems.
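
The risk that a low failure rate vanishes from sampled traces can be put into numbers. The sketch below is a simplified model: it assumes each failing trace is kept or dropped independently at a uniform rate, with no special handling for failures, and the function name and inputs are hypothetical.

```python
def failure_visibility(total_requests, failure_rate, sampling_percentage):
    """Return (expected retained failures, probability that none are retained)."""
    failures = total_requests * failure_rate
    keep = sampling_percentage / 100.0
    expected_retained = failures * keep
    prob_none_retained = (1.0 - keep) ** failures
    return expected_retained, prob_none_retained


# Hypothetical window: 1,000 requests with a one-percent failure rate.
for pct in (5, 25, 50):
    expected, prob_none = failure_visibility(1000, 0.01, pct)
    print(f"{pct:>3}% sampling: expect {expected:.1f} failing traces, "
          f"P(no failing trace kept) = {prob_none:.3f}")
```

Under these assumptions, 5 percent sampling leaves roughly a 60 percent chance that none of the ten failing requests appears in traces at all, which is why failure counts for alerting should come from unsampled metrics.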

Performance

Performance impact is mostly in the telemetry pipeline, not the business transaction itself. Source-level sampling can reduce CPU, memory, network egress, exporter queue pressure, and ingestion latency for high-volume services. It also makes KQL queries, workbooks, and incident reviews faster because fewer records must be scanned. Bad sampling can hurt diagnostic performance by forcing engineers to chase incomplete traces. Teams should measure application latency, exporter backlog, ingestion delay, query duration, and retained trace completeness before declaring success. Metrics remain essential because sampled traces alone do not prove runtime health. Those checks keep sampling from becoming a hidden source of troubleshooting delay.

Operations

Operators manage sampling by validating configuration, querying retained percentages, watching ingestion volume, and comparing telemetry quality before and after releases. They inspect Application Insights usage, Log Analytics tables, itemCount values, workspace cost, data caps, and KQL results for each service. When a team changes an SDK or OpenTelemetry distro, operators confirm that traces remain connected and Live Metrics still works. Good runbooks explain which applications sample at source, which rely on ingestion sampling as a temporary fallback, and who approves changes that affect incident evidence. Runbooks should record approved rates, owner, rollback command, and the expected retained percentage for each workload.
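
A runbook's expected retained percentages can be checked against observed values with a short comparison. This is a hedged sketch: the workload names, the tolerance, and the function name are hypothetical, and the observed values would normally come from a retained-percentage query rather than a literal dictionary.

```python
def find_sampling_drift(expected, observed, tolerance=5.0):
    """Compare observed retained percentages with runbook values.

    Returns a list of (workload, expected_pct, observed_pct) tuples for
    workloads that are missing or outside the tolerance, sorted by name.
    """
    drift = []
    for workload, expected_pct in sorted(expected.items()):
        observed_pct = observed.get(workload)
        if observed_pct is None:
            drift.append((workload, expected_pct, None))  # No telemetry found.
        elif abs(observed_pct - expected_pct) > tolerance:
            drift.append((workload, expected_pct, observed_pct))
    return drift


# Hypothetical runbook values and observations.
runbook = {"checkout-api": 100.0, "matchmaking": 10.0, "lobby": 25.0}
observed = {"checkout-api": 100.0, "matchmaking": 42.0}

for workload, want, got in find_sampling_drift(runbook, observed):
    print(f"{workload}: expected {want}%, observed {got}")
```

A missing workload is reported as drift on purpose, since a service that stopped sending telemetry entirely is at least as important as one whose rate changed.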

Common mistakes

  • Assuming sampling is enabled in every OpenTelemetry deployment and discovering after a release that telemetry volume doubled overnight.
  • Using ingestion sampling as the permanent design, then receiving broken distributed traces that no longer explain dependency failures.
  • Building alerts from sampled logs instead of metrics, causing low-volume but serious failures to disappear from alert calculations.