Part 3 Chapter 24 Last verified 2026-06-14 Fresh

Observability: Seeing What the Agent Did

Observability is four instrumentation surfaces stacked on one ground truth — the session-log transcript. Logging persists it, OpenTelemetry GenAI conventions trace it, attribution ties a diff back to it, and cost-surfacing shows the price. The chapter holds two boundaries — attribution is a provenance hook not an approval gate, and surfacing a cost number is not modeling the economics.

Volatility: feature-surface

Tools compared: claude-code, cross-tool

Before you start: ch21's frame — observability is the 'see' surface, downstream of an eval target. ch23 scores whether a run was correct; this chapter only records what ran. Vol 1's permission model is assumed; the oversight workflow on top of it is ch26.

You will learn

Why observability is four surfaces over one ground truth — the session-log transcript — and why everything else derives from it
The two retention stories in logging — the local 30-day sweep versus the SDK’s external mirror — and why conflating them loses run history
Why you instrument to the OpenTelemetry GenAI convention, not a vendor schema — and why those names are still moving
The two boundaries that keep the surface honest: attribution is a provenance hook, not an approval gate, and surfacing a cost number is not modeling the economics

ch21 placed observability as the see surface: once an eval defines what “good” means, observability shows what actually happened against it. This chapter takes that surface apart. The thesis is that all of agent observability is four instrumentation surfaces stacked on one ground truth — the session log — and that getting the layering right (what is the record, what derives from it, and what each derived surface does and does not claim) is the whole discipline. Two of those surfaces are easy to over-read, so the chapter spends its honesty budget keeping their boundaries crisp.

Four surfaces over one ground truth

An agent run produces exactly one authoritative record, and everything you later want to see is a different view of it. Claude Code writes a session transcript for each run — “every message, tool call, and tool result” [Official] Explore the .claude directory · AnthropicT1-official original — as a per-session JSONL file, by default under ~/.claude/projects/, one JSON-safe object per line. [Official] Explore the .claude directory · AnthropicT1-official original That transcript is the ground truth. It is not a summary, not a dashboard, not a metric — it is the literal sequence the agent emitted and received, persisted to disk.
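
A minimal sketch of reading such a transcript, assuming only what is stated above: one JSON object per line. The field name `type` and the sample entries are hypothetical and chosen for illustration; the real entry schema is not specified here.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_transcript(lines: Iterable[str], kind_field: str = "type") -> Counter:
    """Count transcript entries by kind, reading one JSON object per line.

    `kind_field` is a hypothetical field name; adjust it to the real schema.
    """
    counts: Counter = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:  # tolerate blank lines
            continue
        entry = json.loads(raw)
        counts[entry.get(kind_field, "unknown")] += 1
    return counts


# Hypothetical sample entries, not copied from a real session.
sample = [
    '{"type": "message", "text": "fix the failing test"}',
    '{"type": "tool_call", "name": "run_tests"}',
    '{"type": "tool_result", "ok": false}',
    '{"type": "message", "text": "patched the assertion"}',
]
print(summarize_transcript(sample))
```

Because the file is line-delimited, the same function can be fed an open file handle and will stream a long transcript without loading it whole.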

The other three things people mean by “observability” — tracing, attribution, cost-surfacing — are not separate sources of truth. They are surfaces derived from that one log, or pointers back to it. A trace re-renders the run as spans; an attributed commit points back to the run that produced it; a cost figure is computed from the tokens the run consumed. So “what did the agent do?” is, in order, first a logging question (is the transcript captured and kept?), then a tracing / attribution / surfacing question (how do I view it, link to it, and price it?). Skip the log and the other three have nothing underneath them.

Logging: two records, two retention stories

The single most common logging error is treating “the transcript” as one thing. There are two records, with two different retention owners, and conflating them is how teams lose run history they assumed was safe.

The first is the CLI local record: the JSONL files under ~/.claude/projects/. These are swept automatically — the cleanupPeriodDays setting deletes local transcript files older than a threshold whose “default is 30 days.” [Official] Explore the .claude directory · AnthropicT1-official original That sweep is a feature, not a bug: it keeps a developer’s disk from filling with months of transcripts. But it means the local files are not a durable archive. Run history older than the window is gone unless something else kept it.
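
To see what the window puts at risk, a small sketch can list the local transcript files that are already older than it. It assumes file age is approximated by modification time, which is an assumption of this sketch; how the real sweep measures age is not stated here. The function name is hypothetical.

```python
import time
from pathlib import Path
from typing import List, Optional


def transcripts_past_window(root: Path, period_days: int = 30,
                            now: Optional[float] = None) -> List[Path]:
    """List *.jsonl files under `root` whose mtime is older than the window.

    Assumption: age is approximated by file modification time.
    """
    now = time.time() if now is None else now
    cutoff = now - period_days * 86_400  # seconds per day
    return sorted(p for p in root.rglob("*.jsonl") if p.stat().st_mtime < cutoff)


if __name__ == "__main__":
    projects = Path.home() / ".claude" / "projects"
    if projects.is_dir():
        for path in transcripts_past_window(projects):
            print(path)
```

The script only reads and prints; it deletes nothing. Anything it lists is history that exists nowhere else unless it was mirrored.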

The second is the SDK record. From the Agent SDK, transcripts are still written to JSONL by default, but the SessionStore interface lets a deployment mirror those entries — “JSON-safe objects, one per line in the local JSONL” [Official] Persist sessions to external storage · AnthropicT1-official original — to external storage such as S3, Redis, or a database. The retention of that mirror is the adapter’s responsibility, not Claude Code’s. So a production deployment that needs durable run history cannot lean on the local files and their 30-day sweep; it must mirror via SessionStore and own the retention itself.
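
The ownership split can be sketched with a toy mirror. The class below is hypothetical and is not the SDK's SessionStore interface, whose method signatures are not reproduced here; it only illustrates entries being stored as JSON lines and retention being an explicit decision of whoever owns the mirror.

```python
import json
from typing import Dict, List


class InMemoryMirror:
    """Hypothetical stand-in for an external store (S3, Redis, a database).

    This is NOT the SDK's SessionStore interface; it only illustrates that
    the mirror, not the local sweep, owns retention.
    """

    def __init__(self) -> None:
        self._lines: Dict[str, List[str]] = {}

    def append(self, session_id: str, entry: dict) -> None:
        # Store each entry as one JSON line, matching the local JSONL shape.
        self._lines.setdefault(session_id, []).append(json.dumps(entry))

    def load(self, session_id: str) -> List[dict]:
        return [json.loads(line) for line in self._lines.get(session_id, [])]

    def drop(self, session_id: str) -> None:
        # Retention is an explicit choice made by the mirror's owner.
        self._lines.pop(session_id, None)


mirror = InMemoryMirror()
mirror.append("session-1", {"type": "message", "text": "hello"})
mirror.append("session-1", {"type": "tool_call", "name": "run_tests"})
print(len(mirror.load("session-1")))  # prints 2
```

Nothing in this mirror expires on its own: a session stays until `drop` is called, which is the sense in which the adapter owns retention.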

OpenTelemetry GenAI conventions as the substrate

When you trace an agent — turn the transcript into spans and metrics a backend can query — the design question is which vocabulary do you instrument to? The answer is a vendor-neutral convention, not a vendor-specific schema.

Claude Code exports three OpenTelemetry signals — “metrics as time series data via the standard metrics protocol, events via the logs/events protocol, and optionally distributed traces” [Official] Monitoring · AnthropicT1-official original — and in the trace tree, “each user prompt starts a” [Official] Monitoring · AnthropicT1-official original claude_code.interaction root span, with API calls, tool calls, and hook executions as its children. Crucially, the per-LLM-request span’s attributes align to the “OpenTelemetry GenAI semantic convention.” [Official] Monitoring · AnthropicT1-official original That alignment is the whole point: Claude Code’s span tree is one realization of a standard the spec defines independently.

On the spec side, the OpenTelemetry GenAI semantic conventions define the same vocabulary from the other direction. The standard token-usage metric is gen_ai.client.token.usage, documented as the “Number of input and output tokens used.” Semantic conventions for generative AI metrics · OpenTelemetry AuthorsT1-official original The agent-span operation names are create_agent and invoke_agent. Semantic Conventions for GenAI agent and framework spans · OpenTelemetry AuthorsT1-official original Instrument to those names and your backend (any OpenTelemetry collector) reads the run without knowing it came from Claude Code. The vendor’s spans are swappable; the convention is the contract.
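
As an illustration of instrumenting to the convention's names rather than a vendor's, the sketch below builds plain-dictionary data points for the token-usage metric without depending on any SDK. The metric name and operation names come from the text above; the attribute keys `gen_ai.operation.name` and `gen_ai.token.type` are my reading of the convention at the time of writing and should be treated as current-as-of. The function name and dictionary shape are hypothetical.

```python
from typing import Dict, List

TOKEN_USAGE_METRIC = "gen_ai.client.token.usage"


def token_usage_points(operation: str, input_tokens: int,
                       output_tokens: int) -> List[Dict[str, object]]:
    """Build two data points for the token-usage metric, one per token type.

    Attribute keys follow the GenAI convention as understood at the time of
    writing; the convention is Development-status, so recheck before relying
    on them.
    """
    return [
        {
            "metric": TOKEN_USAGE_METRIC,
            "value": count,
            "attributes": {
                "gen_ai.operation.name": operation,
                "gen_ai.token.type": token_type,
            },
        }
        for token_type, count in (("input", input_tokens), ("output", output_tokens))
    ]


for point in token_usage_points("invoke_agent", 1200, 300):
    print(point)
```

Keeping the names in one constant and one function means a rename in the convention is a one-place change, which suits a vocabulary that is still moving.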

Two caveats are load-bearing, and both move on a release cadence. First, the OpenTelemetry GenAI semantic conventions carry Status: Development — the span and metric names above (gen_ai.client.token.usage, create_agent, invoke_agent) may still change before the convention stabilizes, so treat them as current-as-of, not final. Second, of Claude Code’s three signals, metrics and events are GA while distributed traces are beta; [Official] Monitoring · AnthropicT1-official original a team relying on the claude_code.interaction span tree should track the beta-to-GA transition. Both warrant a recheck after 2026-08-25.

Attribution is the provenance hook, not the approval gate

The third surface ties an agent’s output back to the run that produced it. In Claude Code, attribution to git commits and pull requests is a configurable setting [Official] Claude Code settings · AnthropicT1-official original — by default, commits carry a Co-Authored-By git trailer “which can be customized or disabled.” [Official] Claude Code settings · AnthropicT1-official original The commit itself becomes the handle: from a merged diff you can walk back to the session log and trace that produced it. In CI the same hook holds — a @claude mention triggers an Action so that “Claude can analyze your code, create pull requests, implement features, and fix bugs,” [Official] Claude Code GitHub Actions · AnthropicT1-official original and the commit is stamped with the GitHub-App actor identity rather than the generic Actions user, which is why CI must run “using the GitHub App or custom app (not Actions user)” [Official] Claude Code GitHub Actions · AnthropicT1-official original for those commits to be attributable.
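
Walking back from a commit starts with reading the trailer. The sketch below extracts Co-Authored-By values from a commit message string. The commit message and the trailer value are hypothetical; the exact default trailer text is not quoted here, and the parsing is deliberately simpler than git's own trailer rules.

```python
from typing import List


def co_authors(commit_message: str) -> List[str]:
    """Return the values of Co-Authored-By trailers in a commit message.

    Simplification: scans every line rather than only the final trailer
    block, and matches the key case-insensitively.
    """
    key = "co-authored-by:"
    found = []
    for line in commit_message.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith(key):
            found.append(stripped[len(key):].strip())
    return found


# Hypothetical commit message; the trailer value is illustrative only.
message = """Fix flaky retry test

Widen the timeout and assert on the final state.

Co-Authored-By: Example Agent <agent@example.com>
"""
print(co_authors(message))
```

The message text for a real commit can be obtained with `git log -1 --format=%B <commit>`. The function only reports the label it finds; it makes no statement about whether the commit was reviewed.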

Here is the boundary the chapter will not let blur: attribution is a provenance hook, not an approval gate. It records which run produced this diff — it does not decide whether the diff may be merged. That decision — the human-in-the-loop review and the gate before an irreversible action — is the oversight workflow, and ch26 owns it. The two are easy to conflate because they touch the same pull request, but they answer different questions: provenance is “where did this come from?”, approval is “may this proceed?”. Read a Co-Authored-By trailer as a gate and you have mistaken a label for a checkpoint.

Surfacing cost at three altitudes — surfacing is not modeling

The fourth surface is the cost and usage a team actually watches, and it lives at three altitudes. The most local is the in-CLI /usage command, whose Session block “shows detailed token usage statistics for your current session” [Official] Manage costs effectively · AnthropicT1-official original — what one developer reads mid-run. Above that is the Team/Enterprise analytics dashboard, which surfaces usage and adoption metrics behind a viewer-role gate — “Admins and Owners can view the dashboard.” [Official] Track team usage with analytics · AnthropicT1-official original And at the top is the Console spend view, which surfaces “daily API costs in dollars alongside user count.” [Official] Track team usage with analytics · AnthropicT1-official original

But the surfaced number carries a caveat that defines the boundary. The dollar figure in /usage “is an estimate computed locally from token counts and may differ from your actual bill” [Official] Manage costs effectively · AnthropicT1-official original and the authoritative figure lives in the Console. That single sentence draws the line: surfacing shows the number and points to where the authoritative one lives; it does not model the economics. The per-developer dollar-per-day modeling, the token-reduction tactics, and the question of which lever actually moves the bill (including the input-context cost driver) are ch25’s subject. Observability tells you what a run cost as a local estimate; cost modeling tells you how to make it cost less. Mistake the surfaced estimate for the bill, or for an economic model, and you will optimize against a number that was never authoritative.
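
A sketch of what "computed locally from token counts" can look like makes the caveat concrete. The formula, rates, and counts below are hypothetical and chosen only to make the arithmetic visible; how /usage actually computes its figure is not specified here.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_usd_per_mtok: float,
                      output_usd_per_mtok: float) -> float:
    """Estimate spend from token counts and per-million-token rates."""
    return (input_tokens / 1_000_000 * input_usd_per_mtok
            + output_tokens / 1_000_000 * output_usd_per_mtok)


# Hypothetical rates and counts, not real prices.
estimate = estimate_cost_usd(1_000_000, 500_000, 3.0, 10.0)
print(estimate)  # 3.0 + 5.0 = 8.0
```

An estimate of this kind is only as good as the rate table and counts it was handed, which is one way to read why the Console, not the local figure, is treated as authoritative.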

Four observability surfaces over one ground truth. At the base, the session log — the per-session JSONL transcript of every message, tool call, and tool result. Deriving from it: tracing (OTel GenAI spans and metrics), attribution (diff/PR back to the run), cost-surfacing (three altitudes — /usage, team dashboard, Console spend), and logging/retention (the local 30-day sweep versus the SDK SessionStore). The four surfaces are built on the one log.

Worked example: Routing an observability question to its surface

A team says: “We shipped a bad PR last week. We can’t reconstruct what the agent was reasoning about, the trace backend doesn’t recognize our spans, and the bill looks high — what failed and where?”

Route each part to its surface before fixing anything:

“Can’t reconstruct what it was reasoning about” is a logging question. The transcript is the ground truth. A week-old run is still inside the default 30-day cleanupPeriodDays window, so at the default setting the sweep is unlikely to be what removed it; the deeper issue is whether the run was ever mirrored. If the local files are unavailable and there is no SessionStore mirror, the record may simply not exist to reconstruct from. The fix is durable logging, not a better dashboard.
“Trace backend doesn’t recognize our spans” is a tracing-convention question. If the spans were emitted under vendor-specific names, a backend that expects the standard vocabulary cannot interpret them. Instrument to the OTel GenAI convention (gen_ai.*, create_agent/invoke_agent) so any collector reads the run, while remembering those names are still Development-status.
“We shipped a bad PR” has an attribution part and a non-attribution part. Attribution can tell you which run produced the diff (the Co-Authored-By trailer, the GitHub-App identity) — that is provenance. Whether a human gate should have stopped it is not an observability question at all; it is ch26’s oversight workflow.
“The bill looks high” is a cost-surfacing question only up to the number. /usage surfaces a local estimate that “may differ from your actual bill”; the Console holds the authoritative figure. Why it is high and how to reduce it is ch25’s modeling — surfacing stops at showing the number.

Four surfaces turned one panicked failure into four located questions — and pulled the two that aren’t observability (the approval gate, the cost model) out to their real owners.
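
The routing in this example can be written down as a small lookup table. The question strings and the function name are hypothetical; the owners follow the chapter references used above.

```python
from typing import Dict, Tuple

# Maps a question to (surface or workflow, owner). Two entries route out
# of observability entirely.
ROUTES: Dict[str, Tuple[str, str]] = {
    "cannot reconstruct the run": ("logging", "this chapter"),
    "backend does not recognize spans": ("tracing", "this chapter"),
    "which run produced this diff": ("attribution", "this chapter"),
    "should a human have stopped it": ("oversight workflow", "ch26"),
    "what did the run cost": ("cost-surfacing", "this chapter"),
    "why is the bill high": ("cost modeling", "ch25"),
}


def route(question: str) -> Tuple[str, str]:
    """Return the owning surface and chapter, or flag the question as unrouted."""
    return ROUTES.get(question, ("unrouted", "triage manually"))


print(route("should a human have stopped it"))  # ('oversight workflow', 'ch26')
```

The table matters less as code than as a checklist: two of its six rows point away from observability, which is the boundary the example is drawing.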

Quick reference

One ground truth: the session log — the per-session JSONL transcript of every message, tool call, and result — is the record; tracing, attribution, and cost-surfacing all derive from it. Explore the .claude directory · AnthropicT1-official original
Two retention stories: local files swept on a 30-day default (cleanupPeriodDays) versus the SDK SessionStore mirror whose retention the adapter owns — don’t rely on the local files for durable history. Persist sessions to external storage · AnthropicT1-official original
Trace to the convention: instrument to the OTel GenAI names (gen_ai.client.token.usage, create_agent, invoke_agent), not a vendor schema; Claude Code’s claude_code.interaction tree is one realization. Semantic conventions for generative AI metrics · OpenTelemetry AuthorsT1-official original
Moving target: the GenAI conventions are Status: Development and Claude Code’s traces signal is beta — recheck names after 2026-08-25. Monitoring · AnthropicT1-official original
Attribution = provenance, not approval: the Co-Authored-By trailer ties a diff to its run; the gate is ch26. Claude Code settings · AnthropicT1-official original
Surfacing ≠ modeling: /usage shows a local estimate that “may differ from your actual bill”; modeling the economics is ch25. Manage costs effectively · AnthropicT1-official original

Practice

Practice ◆◆◇◇

Take an agent you run or have read about that writes code or files. For each of the four surfaces, write down the current state of your instrumentation: (1) logging — is the transcript captured, and where is it durably kept beyond the local 30-day sweep? (2) tracing — are you emitting to the OTel GenAI convention, a vendor schema, or nothing? (3) attribution — can you walk from a merged diff back to the run that produced it? (4) cost-surfacing — where do you watch the cost, and do you know which displayed number is an estimate versus authoritative? Then mark which gap would hurt most during an incident — and note explicitly any place where you were about to write down “the gate” (ch26) or “reduce the bill” (ch25), because those are not observability gaps.

Exercise solutions

Solution ↑ Exercise

The ground truth is the session log — the per-session JSONL transcript Claude Code writes for each run, recording every message, tool call, and tool result. The four surfaces are logging (capturing and retaining that transcript), tracing (re-rendering the run as OTel GenAI spans and metrics), attribution (tying a commit/PR back to the run that produced it), and cost-surfacing (showing token/dollar usage at the CLI, team-dashboard, and Console altitudes). The ground truth is primary because it is the literal, authoritative record of what the agent did; the other three are views — a trace re-renders it, an attributed commit points back to it, a cost figure is computed from the tokens in it — so each is only as durable and authoritative as the log beneath it. On attribution: it claims provenance — which run produced this diff, via the Co-Authored-By trailer and the GitHub-App commit identity — and it does not claim approval, i.e. it does not decide whether the diff may be merged; that gate is the human-in-the-loop oversight workflow, which ch26 owns. Conflating the two mistakes a label for a checkpoint.

Solution ↑ Exercise

A worked example. Take a documentation-writing agent that opens PRs.

Logging: “Transcripts are written to the local ~/.claude/projects/ JSONL, but we never set up a SessionStore mirror, so anything past the 30-day cleanupPeriodDays sweep is gone. That is our durable-history gap.”
Tracing: “We enabled telemetry but pointed it at a vendor-named dashboard; a standard OTel collector wouldn’t recognize our spans. Re-instrumenting to the GenAI convention (gen_ai.*, with Claude Code’s claude_code.interaction span tree aligning to it) would make the backend swappable, though the names are Development-status, so I’ll pin a recheck.”
Attribution: “Commits carry the default Co-Authored-By trailer, so I can walk from a merged doc-PR back to the run. Provenance is fine.”
Cost-surfacing: “I watch /usage mid-run and the Console for the monthly figure, but I’d been treating the /usage dollar number as the bill. It’s a local estimate that ‘may differ from your actual bill.’”

Most painful during an incident: the logging gap. If a bad PR shipped and the run is older than 30 days with no mirror, there is no transcript to reconstruct from, and every other surface is a view of a record that no longer exists, so I’d mirror via SessionStore first.

Pulled out as non-observability: “add a human gate before the PR merges” is ch26 (oversight), not a logging/tracing gap; and “the bill is too high, reduce it” is ch25 (cost modeling), not cost-surfacing, because surfacing only shows the number.

The exercise’s value is feeling that two of the four surfaces are easy to over-read into decisions they don’t make.