Multi-Chain Failure Taxonomy

Failure Modes · Recovery Patterns · UX Language · Post-Mortem Design
Mar 2026

By John Wright-Nyingifa · Product Designer building infrastructure for DeFi, DePIN, and autonomous agents.

Statuspage: system health communication

Live Signal · March 2026

$3B stolen in 119 hacks by mid-2025 (50%+ increase over 2024). Bybit: $1.5B, the largest crypto theft ever (Safe{Wallet} JS injection). 88% of stolen funds came from private key compromises. Oracle attacks: 13% of DeFi exploits. Starknet: 9-hour and 4-hour outages. Kroma: permanent shutdown. Only 4.6% of stolen bridge funds were voluntarily returned.

Multi-chain systems fail in ways that are hard to describe because execution, verification, and settlement are separated. Liveness failures look like safety failures in the UI. And "rollback" is usually economic compensation, not state rewind.

This page is a taxonomy of failures and the UX responses that keep users safe, informed, and unblocked.

Failure Categories

Ten categories of multi-chain failure, each with distinct user symptoms:

1. Verification delay

System executed but proof/finality is not complete. User symptom: "I can't withdraw yet." Signals: prover queue, congestion, slow finality.

2. Remote execution failure

Destination action failed after source succeeded. User symptom: "Funds left, nothing arrived." Causes: contract revert, slippage, out-of-gas, destination congestion.

3. Settlement mismatch

Different layers disagree temporarily about what is final. User symptom: status flips or remains ambiguous. Causes: reorgs, delayed inclusion, dispute windows.

4. Wrong chain execution

Action executed on an unintended domain (chain). User symptom: asset appears but is unusable. Causes: UI ambiguity, wallet mis-selection, misconfigured routing.

5. Timeouts

Deadline exceeded for some phase. User symptom: "stuck" status. Causes: relayer outage, prover backlog, gas spikes.

6. Sequencer stalls

Inclusion halts or slows materially. User symptom: transactions not included. Causes: outage, leader failover, censorship.

7. DA failure

Data not available, verification cannot proceed. User symptom: finalization paused. Causes: DA network outage, publishing failure.

8. Chain-specific reorgs

Previously seen inclusion is undone. User symptom: "it confirmed, then unconfirmed." Causes: probabilistic finality, congestion.

9. Gas exhaustion

Execution/submission becomes uneconomical or stuck. User symptom: long delay or failure. Causes: sudden market activity, auction dynamics.

10. Permission rejection

Blocked by policy, allowlist, or limits. User symptom: "action not allowed." Causes: spending limits, contract denylist, missing approvals.
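
The ten categories above can be kept as structured data so that support tooling, logs, and UI copy all draw from one source. The following is a minimal sketch, not a reference implementation: the `FailureCategory` type, the dictionary keys, and the `describe` helper are hypothetical names introduced for illustration, and the symptom and cause text is copied from the list above (the "signals" for verification delay are stored as causes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureCategory:
    """One row of the taxonomy: what the user sees and what usually causes it."""
    title: str
    user_symptom: str
    causes: tuple[str, ...]


# Keys are hypothetical identifiers; the text comes from the categories above.
TAXONOMY: dict[str, FailureCategory] = {
    "verification_delay": FailureCategory(
        "Verification delay", "I can't withdraw yet.",
        ("prover queue", "congestion", "slow finality")),
    "remote_execution_failure": FailureCategory(
        "Remote execution failure", "Funds left, nothing arrived.",
        ("contract revert", "slippage", "out-of-gas", "destination congestion")),
    "settlement_mismatch": FailureCategory(
        "Settlement mismatch", "Status flips or remains ambiguous.",
        ("reorgs", "delayed inclusion", "dispute windows")),
    "wrong_chain_execution": FailureCategory(
        "Wrong chain execution", "Asset appears but unusable.",
        ("UI ambiguity", "wallet mis-selection", "misconfigured routing")),
    "timeout": FailureCategory(
        "Timeouts", "Stuck status.",
        ("relayer outage", "prover backlog", "gas spikes")),
    "sequencer_stall": FailureCategory(
        "Sequencer stalls", "Transactions not included.",
        ("outage", "leader failover", "censorship")),
    "da_failure": FailureCategory(
        "DA failure", "Finalization paused.",
        ("DA network outage", "publishing failure")),
    "reorg": FailureCategory(
        "Chain-specific reorgs", "It confirmed, then unconfirmed.",
        ("probabilistic finality", "congestion")),
    "gas_exhaustion": FailureCategory(
        "Gas exhaustion", "Long delay or failure.",
        ("sudden market activity", "auction dynamics")),
    "permission_rejection": FailureCategory(
        "Permission rejection", "Action not allowed.",
        ("spending limits", "contract denylist", "missing approvals")),
}


def describe(key: str) -> str:
    """Render a one-line summary for a support tool or incident log."""
    c = TAXONOMY[key]
    return f'{c.title}: user sees "{c.user_symptom}" (typical causes: {", ".join(c.causes)})'
```

Calling `describe("da_failure")` yields a single line that pairs the symptom with its typical causes, which is the pairing a support agent needs when a user reports only the symptom.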

INCIDENT TIMELINE (2025-2026)

  Feb 2025   Bybit          $1.5B    Safe{Wallet} JS injection
  Nov 2025   Moonwell       $1M      Chainlink oracle malfunction
  Feb 2026   CrossCurve     $3M      Spoofed cross-chain messages
  Feb 2026   IoTeX          $8.8M    Private key compromise

  L2 OUTAGES
  Jun 2025   Kroma          Permanent  Shut down, funds at risk
  Sep 2025   Starknet       9 hours    Grinta upgrade, chain reorgs
  Jan 2026   Starknet       4 hours    Block production halt

Recovery Patterns

Seven response patterns, each for a different failure context:

A. Deterministic fallback

When route A fails predictably: auto-switch to route B within same constraints. Show what changed and why.

B. Retry with context

Transient liveness issues: retry with backoff. Show retry count and next attempt time window.

C. Re-route

Dependency is down (relayer, bridge, sequencer): choose alternative path. Show tradeoff (time, fee, trust).

D. Safe claim

Remote execution failed but funds recoverable: provide "Claim refund" action. Show eligibility conditions and expected time.

E. Explicit cancellation

System cannot safely continue: cancel pending steps, settle to stable state. Show what was executed and what was not.

F. Meaningful notifications

Notify on: phase changes, long delays beyond estimate, action required (claim, approve, re-confirm).

G. Post-mortem explanation

After resolution: what happened, impact, how the user was protected, what changed to prevent recurrence. Bybit set the standard with two public forensic reports.
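
Pattern B (retry with context) asks the UI to show the retry count and the window for the next attempt. One way to compute that is sketched below, assuming exponential backoff with a symmetric jitter band. The names `RetryNotice` and `plan_retry`, the default values, and the wording of the copy are all illustrative assumptions, not part of any specific bridge or relayer API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryNotice:
    attempt: int        # number of the upcoming attempt (1-based)
    max_attempts: int
    earliest_s: float   # start of the retry window, in seconds from now
    latest_s: float     # end of the retry window, in seconds from now

    def copy(self) -> str:
        """User-facing text: which attempt is next and when to expect it."""
        return (f"Retry {self.attempt} of {self.max_attempts} "
                f"in {self.earliest_s:.0f}-{self.latest_s:.0f}s")


def plan_retry(failed_attempts: int, max_attempts: int = 5,
               base_s: float = 2.0, cap_s: float = 60.0,
               jitter: float = 0.25) -> Optional[RetryNotice]:
    """Return the next retry window, or None once the retry budget is spent."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts counts failures so far and must be >= 1")
    if failed_attempts >= max_attempts:
        return None  # caller should escalate, e.g. to cancellation or a refund claim
    # Exponential backoff: base, 2x base, 4x base, ... capped at cap_s.
    delay = min(cap_s, base_s * 2 ** (failed_attempts - 1))
    # The jitter band is shown to the user as a window rather than a single time.
    return RetryNotice(failed_attempts + 1, max_attempts,
                       delay * (1 - jitter), delay * (1 + jitter))
```

With the defaults, `plan_retry(3).copy()` produces "Retry 4 of 5 in 6-10s". A `None` result is the signal to stop retrying and move to another pattern, such as explicit cancellation or a safe claim.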

FAILURE → RECOVERY MAPPING

  Verification delay    →  Phase UI + "usable soon" / "final later"
  Remote exec failure   →  Safe claim (refund) or checkpoint resume
  Settlement mismatch   →  Clear status + wait for resolution
  Wrong chain           →  Re-route with explicit tradeoffs
  Timeout               →  Retry with backoff → Cancel → Refund
  Sequencer stall       →  "Stuck" state + forced inclusion if available
  DA failure            →  Queue actions + incident banner
  Reorg                 →  Re-confirm state + explain what changed
  Gas exhaustion        →  Wait or re-submit at new price
  Permission rejected   →  Clear reason + next action to resolve

  PRINCIPLE
  ┌─────────────────────────────────────────────┐
  │  "Safely halted" requires PATIENCE          │
  │  "Failed" requires ACTION                   │
  │  Most UX treats both the same.              │
  └─────────────────────────────────────────────┘
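
The mapping and the principle above can be expressed as a lookup table plus a renderer that refuses to treat "safely halted" and "failed" the same way. This is a sketch under stated assumptions: the keys, the `Outcome` enum, the `render` function, and the example copy are hypothetical, and only the recovery strings are taken from the mapping.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Recovery responses copied from the mapping above; the keys are hypothetical.
RECOVERY = {
    "verification_delay": 'Phase UI + "usable soon" / "final later"',
    "remote_exec_failure": "Safe claim (refund) or checkpoint resume",
    "settlement_mismatch": "Clear status + wait for resolution",
    "wrong_chain": "Re-route with explicit tradeoffs",
    "timeout": "Retry with backoff → Cancel → Refund",
    "sequencer_stall": '"Stuck" state + forced inclusion if available',
    "da_failure": "Queue actions + incident banner",
    "reorg": "Re-confirm state + explain what changed",
    "gas_exhaustion": "Wait or re-submit at new price",
    "permission_rejected": "Clear reason + next action to resolve",
}


class Outcome(Enum):
    SAFELY_HALTED = "safely_halted"  # the user only needs to wait
    FAILED = "failed"                # the user has to do something


@dataclass(frozen=True)
class StatusMessage:
    headline: str
    action: Optional[str]  # button label, or None when patience is enough


def render(outcome: Outcome, reason: str,
           action: Optional[str] = None) -> StatusMessage:
    """Build distinct copy for the two outcomes instead of one generic error."""
    if outcome is Outcome.SAFELY_HALTED:
        return StatusMessage(f"Paused safely: {reason}. No action needed.", None)
    if action is None:
        # A failed state without a next step leaves the user blocked.
        raise ValueError("a failed state must offer the user a next action")
    return StatusMessage(f"Failed: {reason}.", action)
```

For example, `render(Outcome.FAILED, "destination swap reverted", "Claim refund")` carries an action label, while `render(Outcome.SAFELY_HALTED, "data availability is degraded")` carries none. Raising on a failed state with no action makes the principle a constraint the code enforces rather than a guideline.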

UX Language Guide

Avoid ambiguity in status copy:

Prefer explicit states

"Included by sequencer," "Data available," "Verified onchain," "Finalized for withdrawal." Each state maps to a system phase.

Avoid overloaded terms

Avoid "Confirmed" without saying what was confirmed, and "Completed" while finality is still pending. Both create false confidence.

Status vocabulary map: system phase → user copy for each failure type
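
A status vocabulary of this kind can be sketched as an enum of system phases mapped to explicit copy, with a small check that flags the overloaded terms named above. The `Phase` enum, its ordering, and the `lint_copy` helper are assumptions made for illustration; the copy strings are the explicit states listed in this guide.

```python
import re
from enum import Enum


class Phase(Enum):
    # Order follows the list of explicit states above; treat it as an assumption.
    INCLUDED = 1
    DATA_AVAILABLE = 2
    VERIFIED = 3
    FINALIZED = 4


# Each system phase maps to exactly one user-facing label.
COPY = {
    Phase.INCLUDED: "Included by sequencer",
    Phase.DATA_AVAILABLE: "Data available",
    Phase.VERIFIED: "Verified onchain",
    Phase.FINALIZED: "Finalized for withdrawal",
}

# Terms that hide which phase was actually reached.
OVERLOADED = ("confirmed", "completed")


def status_copy(phase: Phase) -> str:
    """Return the explicit label for a system phase."""
    return COPY[phase]


def lint_copy(text: str) -> list[str]:
    """Return the overloaded terms that appear as whole words in a status string."""
    words = re.findall(r"[a-z]+", text.lower())
    return [term for term in OVERLOADED if term in words]
```

Running `lint_copy("Transaction confirmed")` returns `["confirmed"]`, while every label in `COPY` passes with an empty list. A check like this could run in a copy review or a test suite so that ambiguous status strings are caught before they ship.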

Glossary

Liveness failure

System is not making progress (stall, timeout).

Safety failure

System produced an invalid result (exploit, reorg).

Economic reversal

Refund/compensation, not literal state rewind.

Deterministic fallback

Auto-switching to a predefined alternative on failure.

Post-mortem

Explanation of what happened after an incident resolves.