07 · Observability
Observability & System-State Clarity
If you can't tell what to do next, you don't have observability. You have telemetry.
By John Wright-Nyingifa · Product Designer building infrastructure for DeFi, DePIN, and autonomous agents.

Live Signal · March 2026
Starknet suffered a ~9-hour outage (Sept 2025) and a ~4-hour outage (Jan 2026). Linea had a 46-minute unexpected pause. Kroma shut down permanently. Base processes 60%+ of all L2 transactions through a single centralized sequencer. The tooling gap: Dune, Nansen, Arkham track analytics. L2Beat tracks risk. Sentio monitors dApps. Nobody provides unified cross-L2 observability. The "Datadog for rollups" doesn't exist.
In distributed systems (especially rollups and cross-chain systems), "it's down" is rarely true. Most incidents are partial failures: some paths work, some don't, and users get stuck in ambiguous states.
Observability is the product layer that turns raw metrics into system truth, and system truth into actionable guidance. This page frames observability as pipelines and state machines, not just dashboards. 36% of organizations report alert fatigue. 55% use too many monitoring tools with poor integration. The blockchain monitoring stack is worse than where DevOps was 10 years ago.
Pipelines vs Dashboards
Dashboards are great for trends, aggregations, and capacity planning. But they often fail in incidents because they lack causal structure, hide dependencies, and don't map to user-facing states.
A pipeline view tracks an entity through phases, preserves causality, and supports deterministic debugging. If you can't tell what to do next, you don't have observability. You have telemetry.
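To make that concrete, here is a minimal sketch (TypeScript, hypothetical names) of what a pipeline-first trace stores per entity: phase history with causality, plus the dependency it is currently blocked on, so "what to do next" falls out of the record itself rather than being inferred from aggregate charts.

// Minimal sketch of a pipeline-first trace (hypothetical names).
// Each tracked entity carries its phase history and the dependency it is
// currently blocked on, rather than only feeding aggregate metrics.

interface PhaseEvent {
  phase: string;        // e.g. "included", "proof submitted"
  at: string;           // ISO timestamp when this phase was reached
  causedBy?: string;    // id of the upstream event (preserves causality)
}

interface PipelineTrace {
  entityId: string;     // tx hash, message id, batch id, ...
  history: PhaseEvent[];
  blockedOn?: string;   // e.g. "prover queue", "DA publication", "relayer"
}

// "What should happen next?" can be read directly off the trace.
function nextStep(trace: PipelineTrace): string {
  const last = trace.history[trace.history.length - 1];
  if (!last) return "Not yet submitted";
  return trace.blockedOn
    ? `Waiting on ${trace.blockedOn} since ${last.at}`
    : `Progressing normally (last phase: ${last.phase})`;
}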
WHAT EXISTS (March 2026)

Analytics & Intelligence
├─ Dune: SQL-based dashboards, community-driven
├─ Nansen: 500M+ labeled wallets, Smart Money tracking
├─ Arkham: Entity-based intelligence
└─ DeFiLlama: TVL and protocol metrics

Developer Observability
├─ Tenderly: Contract debugging, simulation, gas profiling
├─ Sentio: dApp monitoring, 60+ chains
└─ Shadow: Offchain event logs, private gasless logs

WHAT'S MISSING
├─ Unified cross-L2 observability (no "Datadog for rollups")
├─ Real-time sequencer health metrics (most L2s don't expose this)
├─ Proving pipeline monitoring (no standard tooling)
├─ Cross-chain bridge health (highest-risk, least-monitored)
└─ Middleware-level debugging (gap between APM and infra)
System Map Patterns
Five patterns for making distributed systems legible:
Transaction lifecycle map: Transaction → inclusion → execution → receipts. Show edges for dependencies (approvals, nonce, balance) and ordering constraints.
Cross-chain message pipeline: Source emit → Relay transport → Destination verification → Destination execution. Key output: "Message exists on source, not yet verified on destination."
Proving pipeline: Batch selected → Prover job created → Proof generated → Proof submitted → Proof verified. Key output: "In prover queue" vs "waiting for onchain inclusion."
Rollup state machine: A user action maps to Submitted → Included → Available (DA) → Verified → Settled → Finalized (product guarantee). Six states, not "pending/done" (see the sketch after this list).
Actor map: Make the actors explicit: Sequencer, Prover, DA publisher, Relayer, Verifier contract. Map what each did, what effect it had, and what dependency it's waiting on.
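A minimal sketch of the rollup state machine above (TypeScript, hypothetical names): six states, explicit transitions, and the actor responsible for each edge, so a stalled transaction can be labeled by exactly what it is waiting on.

// Sketch of the rollup transaction state machine (hypothetical names).
// Six states, explicit legal transitions, and the actor responsible for
// moving the entity to the next state.

type TxState =
  | "submitted"   // accepted by the sequencer mempool
  | "included"    // ordered into an L2 block
  | "available"   // batch data published to the DA layer
  | "verified"    // validity/fraud proof checked
  | "settled"     // state root accepted on L1
  | "finalized";  // product-level guarantee: safe to act on

const NEXT: Record<TxState, { to: TxState; actor: string } | null> = {
  submitted: { to: "included",  actor: "Sequencer" },
  included:  { to: "available", actor: "DA publisher" },
  available: { to: "verified",  actor: "Prover / Verifier contract" },
  verified:  { to: "settled",   actor: "Settlement contract" },
  settled:   { to: "finalized", actor: "L1 finality" },
  finalized: null,
};

// Label a stalled transaction by the edge it is stuck on.
function stuckOn(state: TxState): string {
  const edge = NEXT[state];
  return edge
    ? `Waiting for ${edge.actor} to reach "${edge.to}"`
    : "Finalized: nothing pending";
}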
Health & Reliability Models
Liveness: is the system making progress? Correctness: is the progress valid? Many incidents are liveness failures with correctness intact. These need different indicators and different response protocols.
Node-level signals: sync height/lag, RPC latency, error rates, mempool intake. Useful for operators, not end users.
Network-level signals: connectivity failures show up as delayed propagation, inconsistent views, and partial regional outages.
Pipeline failure signals: detect proving timeouts, missing DA publication, settlement transactions that never land, and rejected verifications early. Label "bad batches" before user-facing uncertainty spreads.
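A sketch of how liveness and correctness can be kept as separate indicators (TypeScript; thresholds and field names are illustrative, not any chain's real API), with the pipeline-failure signals above surfaced before users see uncertainty.

// Sketch: separate liveness (is the system making progress?) from
// correctness (is the progress valid?). Thresholds and field names are
// illustrative, not taken from any real chain's API.

interface ChainSignals {
  secondsSinceLastBlock: number;   // block production lag
  proverQueueAgeSeconds: number;   // age of the oldest unproven batch
  daPublicationMissing: boolean;   // batch data not yet on the DA layer
  lastProofRejected: boolean;      // verifier contract rejected a proof
}

type Indicator = "ok" | "degraded" | "failed";

function liveness(s: ChainSignals): Indicator {
  if (s.secondsSinceLastBlock > 600) return "failed";        // halted
  if (s.secondsSinceLastBlock > 60 || s.proverQueueAgeSeconds > 1800)
    return "degraded";                                       // progressing slowly
  return "ok";
}

function correctness(s: ChainSignals): Indicator {
  if (s.lastProofRejected) return "failed";      // invalid progress: stop and page
  if (s.daPublicationMissing) return "degraded"; // data at risk, not yet invalid
  return "ok";
}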
Starknet Outages, Sept 2025 & Jan 2026
September: Grinta v0.14.0 upgrade caused ~9-hour outage. Block production halted. Two chain reorgs erased ~1 hour of history. January: ~4-hour outage, block production halted again. Communication was Twitter/Discord only: no status page, no health dashboard, no automated user notifications. Users had to actively search for information about whether their funds were safe.
Status & Incident Communication
Keep user-facing states to a small set: Healthy (progress and correctness normal), Degraded (progress slower, correctness intact), Paused (safety risk, finalization halted). Tie each state to feature impact: deposits enabled, withdrawals enabled, bridging enabled.
"Withdrawals delayed" beats "Prover queue length high." Alert on what matters to the person, not what matters to the system.
A good incident story: what happened, what is currently true, what will happen next, what users should do (if anything).
"Sequencer inclusion time is above normal." "Verification delayed due to proving backlog." "DA unavailable: finalization paused." Translate metrics into guidance.
UX Implications
→ Pipeline-first: show data flowing through execution, sequencing, DA, verification, and settlement. When something stalls, the stuck point is immediately visible.
→ Sequencer health as ambient indicator: Normal / Degraded / Down with estimated inclusion time. Visible on every L2 interface.
→ Failure storytelling over error codes. "Sequencer paused during upgrade. Your transaction is safe and will process when block production resumes (~2 hours)."
→ The "Datadog for rollups" opportunity: unified health across L2s. Enterprise operators (Robinhood, Kraken) will demand this.
Glossary
Telemetry · Raw signals: metrics, logs, traces.
Observability · The ability to explain system state and predict behavior.
Pipeline view · Phase-based tracking with causal dependencies.
Liveness · Making progress.
Correctness · Making valid progress.