07 · Observability
Observability & System-State Clarity
If you can't tell what to do next, you don't have observability. You have telemetry.
By John Wright-Nyingifa · Product Designer building infrastructure for DeFi, DePIN, and autonomous agents.

Live Signal · March 2026
Starknet suffered a ~9-hour outage (Sept 2025) and a ~4-hour outage (Jan 2026). Linea had a 46-minute unexpected pause. Kroma shut down permanently. Base processes 60%+ of all L2 transactions through a single centralized sequencer. The tooling gap: Dune, Nansen, Arkham track analytics. L2Beat tracks risk. Sentio monitors dApps. Nobody provides unified cross-L2 observability. The "Datadog for rollups" doesn't exist.
In distributed systems (especially rollups and cross-chain systems), "it's down" is rarely true. Most incidents are partial failures: some paths work, some don't, and users get stuck in ambiguous states.
Observability is the product layer that turns raw metrics into system truth, and system truth into actionable guidance. This page frames observability as pipelines and state machines, not just dashboards. 36% of organizations report alert fatigue. 55% use too many monitoring tools with poor integration. The blockchain monitoring stack is worse than where DevOps was 10 years ago.
Pipelines vs Dashboards
Dashboards are great for trends, aggregations, and capacity planning. But they often fail in incidents because they lack causal structure, hide dependencies, and don't map to user-facing states.
A pipeline view tracks an entity through phases, preserves causality, and supports deterministic debugging. If you can't tell what to do next, you don't have observability. You have telemetry.
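To make that concrete, here is a minimal sketch (TypeScript, hypothetical names) of what a pipeline-first trace stores per entity: phase history with causality, plus the dependency it is currently blocked on, so "what to do next" falls out of the record itself rather than being inferred from aggregate charts.

// Minimal sketch of a pipeline-first trace (hypothetical names).
// Each tracked entity carries its phase history and the dependency it is
// currently blocked on, rather than only feeding aggregate metrics.

interface PhaseEvent {
  phase: string;        // e.g. "included", "proof submitted"
  at: string;           // ISO timestamp when this phase was reached
  causedBy?: string;    // id of the upstream event (preserves causality)
}

interface PipelineTrace {
  entityId: string;     // tx hash, message id, batch id, ...
  history: PhaseEvent[];
  blockedOn?: string;   // e.g. "prover queue", "DA publication", "relayer"
}

// "What should happen next?" can be read directly off the trace.
function nextStep(trace: PipelineTrace): string {
  const last = trace.history[trace.history.length - 1];
  if (!last) return "Not yet submitted";
  return trace.blockedOn
    ? `Waiting on ${trace.blockedOn} since ${last.at}`
    : `Progressing normally (last phase: ${last.phase})`;
}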
WHAT EXISTS (March 2026)

Analytics & Intelligence
├─ Dune: SQL-based dashboards, community-driven
├─ Nansen: 500M+ labeled wallets, Smart Money tracking
├─ Arkham: Entity-based intelligence
└─ DeFiLlama: TVL and protocol metrics

Developer Observability
├─ Tenderly: Contract debugging, simulation, gas profiling
├─ Sentio: dApp monitoring, 60+ chains
└─ Shadow: Offchain event logs, private gasless logs

WHAT'S MISSING
├─ Unified cross-L2 observability (no "Datadog for rollups")
├─ Real-time sequencer health metrics (most L2s don't expose this)
├─ Proving pipeline monitoring (no standard tooling)
├─ Cross-chain bridge health (highest-risk, least-monitored)
└─ Middleware-level debugging (gap between APM and infra)
System Map Patterns
Five patterns for making distributed systems legible:
Transaction lifecycle map: Transaction → inclusion → execution → receipts. Show edges for dependencies (approvals, nonce, balance) and ordering constraints.
Cross-chain message pipeline: Source emit → Relay transport → Destination verification → Destination execution. Key output: "Message exists on source, not yet verified on destination."
Proving pipeline: Batch selected → Prover job created → Proof generated → Proof submitted → Proof verified. Key output: "In prover queue" vs "waiting for onchain inclusion."
Rollup state machine: A user action maps to Submitted → Included → Available (DA) → Verified → Settled → Finalized (product guarantee). Six states, not "pending/done" (see the sketch after this list).
Actor map: Make the actors explicit: Sequencer, Prover, DA publisher, Relayer, Verifier contract. Map what each did, what effect it had, and what dependency it's waiting on.
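A minimal sketch of the rollup state machine above (TypeScript, hypothetical names): six states, explicit transitions, and the actor responsible for each edge, so a stalled transaction can be labeled by exactly what it is waiting on.

// Sketch of the rollup transaction state machine (hypothetical names).
// Six states, explicit legal transitions, and the actor responsible for
// moving the entity to the next state.

type TxState =
  | "submitted"   // accepted by the sequencer mempool
  | "included"    // ordered into an L2 block
  | "available"   // batch data published to the DA layer
  | "verified"    // validity/fraud proof checked
  | "settled"     // state root accepted on L1
  | "finalized";  // product-level guarantee: safe to act on

const NEXT: Record<TxState, { to: TxState; actor: string } | null> = {
  submitted: { to: "included",  actor: "Sequencer" },
  included:  { to: "available", actor: "DA publisher" },
  available: { to: "verified",  actor: "Prover / Verifier contract" },
  verified:  { to: "settled",   actor: "Settlement contract" },
  settled:   { to: "finalized", actor: "L1 finality" },
  finalized: null,
};

// Label a stalled transaction by the edge it is stuck on.
function stuckOn(state: TxState): string {
  const edge = NEXT[state];
  return edge
    ? `Waiting for ${edge.actor} to reach "${edge.to}"`
    : "Finalized: nothing pending";
}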
Health & Reliability Models
Liveness: is the system making progress? Correctness: is the progress valid? Many incidents are liveness failures with correctness intact. These need different indicators and different response protocols.
Node-level signals: sync height/lag, RPC latency, error rates, mempool intake. Useful for operators, not end users.
Network-level signals: connectivity failures show up as delayed propagation, inconsistent views, and partial regional outages.
Pipeline failure signals: detect proving timeouts, missing DA publication, settlement transactions that never land, and rejected verifications early. Label "bad batches" before user-facing uncertainty spreads.
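A sketch of how liveness and correctness can be kept as separate indicators (TypeScript; thresholds and field names are illustrative, not any chain's real API), with the pipeline-failure signals above surfaced before users see uncertainty.

// Sketch: separate liveness (is the system making progress?) from
// correctness (is the progress valid?). Thresholds and field names are
// illustrative, not taken from any real chain's API.

interface ChainSignals {
  secondsSinceLastBlock: number;   // block production lag
  proverQueueAgeSeconds: number;   // age of the oldest unproven batch
  daPublicationMissing: boolean;   // batch data not yet on the DA layer
  lastProofRejected: boolean;      // verifier contract rejected a proof
}

type Indicator = "ok" | "degraded" | "failed";

function liveness(s: ChainSignals): Indicator {
  if (s.secondsSinceLastBlock > 600) return "failed";        // halted
  if (s.secondsSinceLastBlock > 60 || s.proverQueueAgeSeconds > 1800)
    return "degraded";                                       // progressing slowly
  return "ok";
}

function correctness(s: ChainSignals): Indicator {
  if (s.lastProofRejected) return "failed";      // invalid progress: stop and page
  if (s.daPublicationMissing) return "degraded"; // data at risk, not yet invalid
  return "ok";
}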
Starknet Outages, Sept 2025 & Jan 2026
September: Grinta v0.14.0 upgrade caused ~9-hour outage. Block production halted. Two chain reorgs erased ~1 hour of history. January: ~4-hour outage, block production halted again. Communication was Twitter/Discord only: no status page, no health dashboard, no automated user notifications. Users had to actively search for information about whether their funds were safe.
Status & Incident Communication
Keep user-facing states to a small set: Healthy (progress and correctness normal), Degraded (progress slower, correctness intact), Paused (safety risk, finalization halted). Tie each state to feature impact: deposits enabled, withdrawals enabled, bridging enabled.
"Withdrawals delayed" beats "Prover queue length high." Alert on what matters to the person, not what matters to the system.
A good incident story: what happened, what is currently true, what will happen next, what users should do (if anything).
"Sequencer inclusion time is above normal." "Verification delayed due to proving backlog." "DA unavailable: finalization paused." Translate metrics into guidance.
UX Implications
→ Pipeline-first: show data flowing through execution, sequencing, DA, verification, and settlement. When something stalls, the stuck point is immediately visible.
→ Sequencer health as ambient indicator: Normal / Degraded / Down with estimated inclusion time. Visible on every L2 interface.
→ Failure storytelling over error codes. "Sequencer paused during upgrade. Your transaction is safe and will process when block production resumes (~2 hours)."
→ The "Datadog for rollups" opportunity: unified health across L2s. Enterprise operators (Robinhood, Kraken) will demand this.
Glossary
Telemetry · Raw signals: metrics, logs, traces.
Observability · The ability to explain system state and predict behavior.
Pipeline view · Phase-based tracking with causal dependencies.
Liveness · Making progress.
Correctness · Making valid progress.