Part 3 Chapter 24 Last verified 2026-06-14 Fresh

Observability: Seeing What the Agent Did

Observability is four instrumentation surfaces stacked on one ground truth — the session-log transcript. Logging persists it, OpenTelemetry GenAI conventions trace it, attribution ties a diff back to it, and cost-surfacing shows the price. The chapter holds two boundaries — attribution is a provenance hook not an approval gate, and surfacing a cost number is not modeling the economics.

Volatility: feature-surface
Tools compared: claude-code, cross-tool
On this page
  1. Four surfaces over one ground truth
  2. Logging: two records, two retention stories
  3. OpenTelemetry GenAI conventions as the substrate
  4. Attribution is the provenance hook, not the approval gate
  5. Surfacing cost at three altitudes — surfacing is not modeling
  6. Quick reference
  7. Practice

ch21 placed observability as the see surface: once an eval defines what “good” means, observability shows what actually happened against it. This chapter takes that surface apart. The thesis is that all of agent observability is four instrumentation surfaces stacked on one ground truth — the session log — and that getting the layering right (what is the record, what derives from it, and what each derived surface does and does not claim) is the whole discipline. Two of those surfaces are easy to over-read, so the chapter spends its honesty budget keeping their boundaries crisp.

Four surfaces over one ground truth

An agent run produces exactly one authoritative record, and everything you later want to see is a different view of it. Claude Code writes a session transcript for each run, covering “every message, tool call, and tool result” [Official: Explore the .claude directory · Anthropic · T1-official], as a per-session JSONL file, by default under ~/.claude/projects/, one JSON-safe object per line. [Official: Explore the .claude directory · Anthropic · T1-official] That transcript is the ground truth. It is not a summary, not a dashboard, not a metric; it is the literal sequence the agent emitted and received, persisted to disk.
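Because the transcript is plain JSONL, a first look at a run needs nothing beyond a JSON parser. The sketch below is a minimal illustration, not Claude Code’s own tooling. It tallies entries by a type key, which is an assumed field name (the source only guarantees one JSON-safe object per line), and the sample lines are invented for the demo.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_transcript(lines: Iterable[str]) -> Counter:
    """Tally transcript entries by an assumed 'type' key."""
    counts: Counter = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between entries
        entry = json.loads(raw)  # one JSON object per line
        # 'type' is an assumption; anything without it is grouped as 'unknown'.
        kind = entry.get("type", "unknown") if isinstance(entry, dict) else "unknown"
        counts[str(kind)] += 1
    return counts


# Invented sample lines standing in for a real per-session JSONL file.
sample = [
    '{"type": "user", "text": "fix the failing test"}',
    '{"type": "tool_call", "name": "Bash"}',
    '{"type": "tool_result", "ok": true}',
    '{"note": "entry without a type"}',
]
summary = summarize_transcript(sample)
print(dict(summary))
```

To run it against a real session, pass an open file handle instead of the sample list; the function only needs an iterable of lines.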

The other three things people mean by “observability” — tracing, attribution, cost-surfacing — are not separate sources of truth. They are surfaces derived from that one log, or pointers back to it. A trace re-renders the run as spans; an attributed commit points back to the run that produced it; a cost figure is computed from the tokens the run consumed. So “what did the agent do?” is, in order, first a logging question (is the transcript captured and kept?), then a tracing / attribution / surfacing question (how do I view it, link to it, and price it?). Skip the log and the other three have nothing underneath them.

Logging: two records, two retention stories

The single most common logging error is treating “the transcript” as one thing. There are two records, with two different retention owners, and conflating them is how teams lose run history they assumed was safe.

The first is the CLI local record: the JSONL files under ~/.claude/projects/. These are swept automatically: the cleanupPeriodDays setting deletes local transcript files older than a threshold whose “default is 30 days.” [Official: Explore the .claude directory · Anthropic · T1-official] That sweep is a feature, not a bug, because it keeps a developer’s disk from filling with months of transcripts. But it means the local files are not a durable archive. Run history older than the window is gone unless something else kept it.
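One way to make the sweep visible before it bites is to list the transcripts that are close to the threshold. The sketch below is a hedged illustration: it assumes the sweep keys off file modification time (the source does not say how age is measured), the function name and the warn_days parameter are hypothetical, and the demo runs against a throwaway directory rather than a real ~/.claude/projects/.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 86400


def transcripts_at_risk(root: Path, cleanup_days: int = 30, warn_days: int = 5,
                        now: Optional[float] = None) -> List[Path]:
    """Return .jsonl files whose age is within warn_days of the sweep threshold.

    Assumption: age is judged by modification time.
    """
    now = time.time() if now is None else now
    cutoff_age = (cleanup_days - warn_days) * SECONDS_PER_DAY
    return sorted(p for p in root.rglob("*.jsonl")
                  if now - p.stat().st_mtime >= cutoff_age)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fresh, stale = root / "fresh.jsonl", root / "stale.jsonl"
    fresh.write_text("{}\n")
    stale.write_text("{}\n")
    old = time.time() - 27 * SECONDS_PER_DAY
    os.utime(stale, (old, old))  # pretend this transcript is 27 days old
    at_risk_names = [p.name for p in transcripts_at_risk(root)]

print(at_risk_names)
```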

The second is the SDK record. From the Agent SDK, transcripts are still written to JSONL by default, but the SessionStore interface lets a deployment mirror those entries, described as “JSON-safe objects, one per line in the local JSONL” [Official: Persist sessions to external storage · Anthropic · T1-official], to external storage such as S3, Redis, or a database. The retention of that mirror is the adapter’s responsibility, not Claude Code’s. So a production deployment that needs durable run history cannot lean on the local files and their 30-day sweep; it must mirror via SessionStore and own the retention itself.
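What “the adapter owns the retention” means in practice can be sketched with a toy mirror. The class below is explicitly not the SDK’s SessionStore interface: DirectoryMirror, append, and prune are hypothetical names, and a local directory stands in for S3, Redis, or a database. The only point it illustrates is that the retention window is a parameter the deployment chooses and enforces itself.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Dict, Optional

SECONDS_PER_DAY = 86400


class DirectoryMirror:
    """Hypothetical append-only mirror; NOT the SDK's SessionStore interface."""

    def __init__(self, root: Path, retention_days: int) -> None:
        self.root = root
        self.retention_days = retention_days  # chosen by the deployment
        self.root.mkdir(parents=True, exist_ok=True)

    def append(self, session_id: str, entry: Dict[str, Any]) -> None:
        # Same shape as the local record: one JSON object per line.
        path = self.root / f"{session_id}.jsonl"
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    def prune(self, now: Optional[float] = None) -> int:
        """Delete mirrored sessions older than retention_days; return the count."""
        now = time.time() if now is None else now
        removed = 0
        for path in list(self.root.glob("*.jsonl")):
            if now - path.stat().st_mtime > self.retention_days * SECONDS_PER_DAY:
                path.unlink()
                removed += 1
        return removed


with tempfile.TemporaryDirectory() as tmp:
    mirror = DirectoryMirror(Path(tmp) / "mirror", retention_days=365)
    mirror.append("session-a", {"type": "user", "text": "hello"})
    mirror.append("session-a", {"type": "assistant", "text": "hi"})
    line_count = len((mirror.root / "session-a.jsonl").read_text().splitlines())
    removed_now = mirror.prune()
    removed_later = mirror.prune(now=time.time() + 400 * SECONDS_PER_DAY)

print(line_count, removed_now, removed_later)
```

A real adapter would implement whatever methods the SDK documentation specifies; the retention policy, however it is expressed, stays on the deployment’s side of the line.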

OpenTelemetry GenAI conventions as the substrate

When you trace an agent — turn the transcript into spans and metrics a backend can query — the design question is which vocabulary do you instrument to? The answer is a vendor-neutral convention, not a vendor-specific schema.

Claude Code exports three OpenTelemetry signals: “metrics as time series data via the standard metrics protocol, events via the logs/events protocol, and optionally distributed traces” [Official: Monitoring · Anthropic · T1-official]. In the trace tree, “each user prompt starts a” [Official: Monitoring · Anthropic · T1-official] claude_code.interaction root span, with API calls, tool calls, and hook executions as its children. Crucially, the per-LLM-request span’s attributes align to the “OpenTelemetry GenAI semantic convention.” [Official: Monitoring · Anthropic · T1-official] That alignment is the whole point: Claude Code’s span tree is one realization of a standard the spec defines independently.

On the spec side, the OpenTelemetry GenAI semantic conventions define the same vocabulary from the other direction. The standard token-usage metric is gen_ai.client.token.usage, documented as the “Number of input and output tokens used” [Semantic conventions for generative AI metrics · OpenTelemetry Authors · T1-official], and the agent-span operation names are create_agent and invoke_agent [Semantic Conventions for GenAI agent and framework spans · OpenTelemetry Authors · T1-official]. Instrument to those names and your backend (any OpenTelemetry collector) reads the run without knowing it came from Claude Code. The vendor’s spans are swappable; the convention is the contract.
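To make “instrument to those names” concrete, the sketch below builds plain dictionaries shaped like convention-named telemetry, using only the standard library rather than an OpenTelemetry SDK. The metric name and the two operation names come from the text above. The attribute keys (gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.token.type, gen_ai.agent.name), the {token} unit, and the span-name pattern are my reading of the Development-status convention and should be checked against the current spec. The helper names, model string, and agent name are hypothetical.

```python
from typing import Any, Dict, List

AGENT_OPERATIONS = ("create_agent", "invoke_agent")


def token_usage_points(model: str, provider: str, input_tokens: int,
                       output_tokens: int) -> List[Dict[str, Any]]:
    """Build one data point per token type for gen_ai.client.token.usage."""
    def point(token_type: str, value: int) -> Dict[str, Any]:
        return {
            "name": "gen_ai.client.token.usage",
            "unit": "{token}",  # assumed unit string
            "value": value,
            "attributes": {  # assumed keys, see the note above
                "gen_ai.operation.name": "chat",
                "gen_ai.provider.name": provider,
                "gen_ai.request.model": model,
                "gen_ai.token.type": token_type,
            },
        }
    return [point("input", input_tokens), point("output", output_tokens)]


def agent_span(operation: str, agent_name: str) -> Dict[str, Any]:
    """Describe an agent span named '<operation> <agent name>'."""
    if operation not in AGENT_OPERATIONS:
        raise ValueError(f"unknown agent operation: {operation}")
    return {
        "name": f"{operation} {agent_name}",
        "attributes": {
            "gen_ai.operation.name": operation,
            "gen_ai.agent.name": agent_name,
        },
    }


points = token_usage_points("example-model", "example-provider", 1200, 300)
span = agent_span("invoke_agent", "docs-writer")
print(span["name"], [p["attributes"]["gen_ai.token.type"] for p in points])
```

Nothing in either structure names the vendor’s own span tree, which is the property the convention is meant to buy: a backend that understands these keys does not need to know which tool emitted them.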

Two caveats are load-bearing, and both move on a release cadence. First, the OpenTelemetry GenAI semantic conventions carry Status: Development, so the span and metric names above (gen_ai.client.token.usage, create_agent, invoke_agent) may still change before the convention stabilizes; treat them as current-as-of, not final. Second, of Claude Code’s three signals, metrics and events are GA while distributed traces are beta [Official: Monitoring · Anthropic · T1-official]; a team relying on the claude_code.interaction span tree should track the beta-to-GA transition. Both warrant a recheck after 2026-08-25.

Attribution is the provenance hook, not the approval gate

The third surface ties an agent’s output back to the run that produced it. In Claude Code, attribution to git commits and pull requests is a configurable setting [Official: Claude Code settings · Anthropic · T1-official]: by default, commits carry a Co-Authored-By git trailer “which can be customized or disabled.” [Official: Claude Code settings · Anthropic · T1-official] The commit itself becomes the handle: from a merged diff you can walk back to the session log and trace that produced it. In CI the same hook holds. A @claude mention triggers an Action so that “Claude can analyze your code, create pull requests, implement features, and fix bugs,” [Official: Claude Code GitHub Actions · Anthropic · T1-official] and the commit is stamped with the GitHub-App actor identity rather than the generic Actions user, which is why CI must run “using the GitHub App or custom app (not Actions user)” [Official: Claude Code GitHub Actions · Anthropic · T1-official] for those commits to be attributable.
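Reading the trailer back out of a commit message is the first step of that walk. The sketch below is a simplification of git’s trailer rules (git interpret-trailers is the authoritative parser): it only inspects the last paragraph of the message and matches the key case-insensitively. The function name is hypothetical, and the author value in the sample is a placeholder, not the trailer Claude Code actually writes.

```python
import re
from typing import List

_TRAILER = re.compile(r"^co-authored-by:\s*(.+)$", re.IGNORECASE)


def co_authors(commit_message: str) -> List[str]:
    """Return Co-Authored-By values found in the message's last paragraph."""
    paragraphs = [p for p in commit_message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return []
    found = []
    for line in paragraphs[-1].splitlines():
        match = _TRAILER.match(line.strip())
        if match:
            found.append(match.group(1).strip())
    return found


# Placeholder identity; substitute whatever your configuration stamps.
message = """Fix flaky retry test

Widen the backoff window so the test no longer races the timer.

Co-Authored-By: Example Agent <agent@example.invalid>
"""
authors = co_authors(message)
print(authors)
```

Note what the function returns: a label describing where the commit came from. It has no opinion on whether the commit should merge.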

Here is the boundary the chapter will not let blur: attribution is a provenance hook, not an approval gate. It records which run produced this diff — it does not decide whether the diff may be merged. That decision — the human-in-the-loop review and the gate before an irreversible action — is the oversight workflow, and ch26 owns it. The two are easy to conflate because they touch the same pull request, but they answer different questions: provenance is “where did this come from?”, approval is “may this proceed?”. Read a Co-Authored-By trailer as a gate and you have mistaken a label for a checkpoint.

Surfacing cost at three altitudes — surfacing is not modeling

The fourth surface is the cost and usage a team actually watches, and it lives at three altitudes. The most local is the in-CLI /usage command, whose Session block “shows detailed token usage statistics for your current session” [Official: Manage costs effectively · Anthropic · T1-official]; this is what one developer reads mid-run. Above that is the Team/Enterprise analytics dashboard, which surfaces usage and adoption metrics behind a viewer-role gate: “Admins and Owners can view the dashboard.” [Official: Track team usage with analytics · Anthropic · T1-official] And at the top is the Console spend view, which surfaces “daily API costs in dollars alongside user count.” [Official: Track team usage with analytics · Anthropic · T1-official]

But the surfaced number carries a caveat that defines the boundary. The dollar figure in /usage “is an estimate computed locally from token counts and may differ from your actual bill” [Official: Manage costs effectively · Anthropic · T1-official]; the authoritative figure lives in the Console. That single sentence draws the line: surfacing shows the number and points to where the authoritative one lives; it does not model the economics. The per-developer dollar-per-day modeling, the token-reduction tactics, and the question of which lever actually moves the bill are all ch25’s subject, including the input-context cost driver it unpacks. Observability tells you what a run cost as a local estimate; cost modeling tells you how to make it cost less. Mistake the surfaced estimate for the bill, or for an economic model, and you will optimize against a number that was never authoritative.
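The phrase “computed locally from token counts” amounts to simple arithmetic, which a short sketch makes visible. Everything numeric here is made up: the per-million-token rates are hypothetical, the token counts are invented, and the formula deliberately considers nothing except raw input and output tokens. It illustrates the shape of a local estimate, not how /usage computes its figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in dollars per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def estimate_cost(input_tokens: int, output_tokens: int, rates: Rates) -> float:
    """Local estimate: tokens times rate, summed over input and output."""
    total = (input_tokens * rates.input_per_mtok
             + output_tokens * rates.output_per_mtok)
    return total / 1_000_000


rates = Rates(input_per_mtok=2.0, output_per_mtok=10.0)  # made-up numbers
estimate = estimate_cost(input_tokens=400_000, output_tokens=50_000, rates=rates)
print(f"local estimate: ${estimate:.2f}")  # prints "local estimate: $1.30"
```

With these numbers the input side contributes $0.80 and the output side $0.50. The result is a number to watch, and the authoritative figure remains the one in the Console.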

Figure: Four observability surfaces over one ground truth. At the base, the session log: the per-session JSONL transcript of every message, tool call, and tool result. Deriving from it: tracing (OTel GenAI spans and metrics), attribution (diff/PR back to the run), cost-surfacing (three altitudes: /usage, team dashboard, Console spend), and logging/retention (the local 30-day sweep versus the SDK SessionStore). The four surfaces are built on the one log. In the diagram, four boxes sit above a wide base box for the session log, each with an arrow pointing down to it; a caption strip beneath reads “four surfaces over one ground truth: instrument what the agent did, not just whether it finished.”

Quick reference

  • One ground truth: the session log (the per-session JSONL transcript of every message, tool call, and result) is the record; tracing, attribution, and cost-surfacing all derive from it. [Explore the .claude directory · Anthropic · T1-official]
  • Two retention stories: local files swept on a 30-day default (cleanupPeriodDays) versus the SDK SessionStore mirror whose retention the adapter owns; don’t rely on the local files for durable history. [Persist sessions to external storage · Anthropic · T1-official]
  • Trace to the convention: instrument to the OTel GenAI names (gen_ai.client.token.usage, create_agent, invoke_agent), not a vendor schema; Claude Code’s claude_code.interaction tree is one realization. [Semantic conventions for generative AI metrics · OpenTelemetry Authors · T1-official]
  • Moving target: the GenAI conventions are Status: Development and Claude Code’s traces signal is beta; recheck names after 2026-08-25. [Monitoring · Anthropic · T1-official]
  • Attribution = provenance, not approval: the Co-Authored-By trailer ties a diff to its run; the gate is ch26. [Claude Code settings · Anthropic · T1-official]
  • Surfacing ≠ modeling: /usage shows a local estimate that “may differ from your actual bill”; modeling the economics is ch25. [Manage costs effectively · Anthropic · T1-official]

Practice

Exercise solutions

Solution 1

The ground truth is the session log: the per-session JSONL transcript Claude Code writes for each run, recording every message, tool call, and tool result. The four surfaces are logging (capturing and retaining that transcript), tracing (re-rendering the run as OTel GenAI spans and metrics), attribution (tying a commit/PR back to the run that produced it), and cost-surfacing (showing token/dollar usage at the CLI, team-dashboard, and Console altitudes). The ground truth is primary because it is the literal, authoritative record of what the agent did; the other three are views (a trace re-renders it, an attributed commit points back to it, a cost figure is computed from the tokens in it), so each is only as durable and authoritative as the log beneath it. On attribution: it claims provenance, meaning which run produced this diff, via the Co-Authored-By trailer and the GitHub-App commit identity. It does not claim approval; that is, it does not decide whether the diff may be merged. That gate is the human-in-the-loop oversight workflow, which ch26 owns. Conflating the two mistakes a label for a checkpoint.

Solution 2

A worked example. Take a documentation-writing agent that opens PRs.

  • Logging: “Transcripts are written to the local ~/.claude/projects/ JSONL, but we never set up a SessionStore mirror, so anything past the 30-day cleanupPeriodDays sweep is gone. That is our durable-history gap.”
  • Tracing: “We enabled telemetry but pointed it at a vendor-named dashboard; a standard OTel collector wouldn’t recognize our spans. Re-instrumenting to the GenAI convention (the gen_ai.* names, with the claude_code.interaction tree aligning to them) would make the backend swappable, though the names are Development-status, so I’ll pin a recheck.”
  • Attribution: “Commits carry the default Co-Authored-By trailer, so I can walk from a merged doc-PR back to the run. Provenance is fine.”
  • Cost-surfacing: “I watch /usage mid-run and the Console for the monthly figure, but I’d been treating the /usage dollar number as the bill. It’s a local estimate that ‘may differ from your actual bill.’”

Most painful during an incident: the logging gap. If a bad PR shipped and the run is older than 30 days with no mirror, there is no transcript to reconstruct from, and every other surface is a view of a record that no longer exists, so I’d mirror via SessionStore first.

Pulled out as non-observability: “add a human gate before the PR merges” is ch26 (oversight), not a logging/tracing gap; and “the bill is too high, reduce it” is ch25 (cost modeling), not cost-surfacing, because surfacing only shows the number.

The exercise’s value is feeling that two of the four surfaces are easy to over-read into decisions they don’t make.