Building the Cardano Governance Indexer — A Work in Progress

By Emmanuel Titi

I'm building 1694.io — a place to make Cardano's on-chain governance legible to everyone. At the core of it is an indexer: read every block from a Cardano node, extract governance events (DRep registrations, delegations, votes, proposals), write them to Postgres, serve them to the frontend.

Simple enough in theory. In practice, it's been a few rounds of "this doesn't work the way I thought."

Where it started

First version was written fast. Single async loop. Connect to a relay, request blocks epoch by epoch, decode each one with Pallas, extract governance certs, insert to Postgres. Everything inline, no abstraction. It worked until it didn't.

No retry logic. If the relay dropped, the process died. If Postgres was slow, it backed up the whole chain sync. I parallelized it — multiple epoch ranges running at once with tokio::spawn — but all workers shared a global Mutex<LruCache> to track each delegator's previous DRep. Every delegation event from every worker was blocking every other worker on one lock.

No real checkpointing. I wrote "last completed slot" to a file on SIGTERM. If the pod got OOMKilled, it started from slot 0. Every crash was a full resync.

Silent data bugs. The previous_drep field on delegation events — which DRep a staker was at before they changed — was computed by looking up their last delegation in the DB. Except under the parallel workers, a worker on epoch 400 could cache a DRep ID before the epoch 380 worker had finished writing, so the field ended up wrong in unpredictable ways. I ran this in production for weeks before I noticed.

There was also a bug in how vote events stored the voter's DRep ID. Cardano's CIP-1694 defines two built-in delegation options every ADA holder can choose: AlwaysAbstain (your stake is excluded from quorum calculations) and AlwaysNoConfidence (your stake permanently votes "no confidence"). These are real, first-class governance choices — not placeholder values. What they aren't is voters: they have no keys and cannot sign transactions, so they can never cast a vote. The bug was that my vote parsing blurred that distinction, so vote events could end up carrying DRep IDs that can never actually vote.

The patch phase

I didn't rewrite; I patched. Added retry loops around the relay connection. Added a proper sync_checkpoint table in Postgres — now checkpoints survived crashes. Moved slot_to_timestamp and slot_to_epoch out of the duplicated copies they lived in and into a shared utils.rs. Fixed rollback handling — I used to just log a warning and move on. Now it's DELETE FROM drep_timeline_event WHERE slot > rollback_slot, and shallow forks actually clean up after themselves.
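
That rollback cleanup is a single statement against the events table. A minimal sketch of the handler, assuming sqlx for Postgres access — the table and column names are the ones above; everything else is illustrative:

use sqlx::PgPool;

// Undo a shallow fork: drop every timeline event recorded past the rollback point.
pub async fn handle_shallow_fork(pool: &PgPool, rollback_slot: i64) -> anyhow::Result<()> {
    sqlx::query("DELETE FROM drep_timeline_event WHERE slot > $1")
        .bind(rollback_slot)
        .execute(pool)
        .await?;
    Ok(())
}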

But the architecture was the same messy loop. Scanner calling parser calling db calling checkpoint, all inline. Retry logic written three times in three different places. The historical sync and continuous sync were completely separate code paths that did the same thing — two independent implementations of "fetch block, parse, persist" with two separate bugs waiting to happen. At 2,000 lines it was getting hard to hold in your head.

Getting to tip in hours, not days

Before getting into the rewrite, it's worth explaining the fundamental sync problem and how I'm solving it — because it shapes every architectural decision downstream.

Cardano mainnet has been running since 2017. By the time you're reading this, there are 500+ epochs of blocks. If you sync sequentially — one block at a time, stream forward from genesis — you're talking days of wall-clock time even on fast hardware. That's not acceptable for a production indexer that needs to come back up quickly after a restart.

Mithril solves a similar problem for Cardano nodes: instead of syncing from genesis, a new node downloads a cryptographically-certified snapshot of the node's full state (a RocksDB database) and starts from there. A node that would take 3 days to sync takes 2 hours instead. I can't use Mithril directly — it gives you a node database snapshot in RocksDB format, which is the internal representation the node uses. I need raw block CBOR bytes to run through the parser and extract governance events. Mithril doesn't expose that.

But the idea is the same: don't crawl from genesis when you don't have to. Pre-compute the routing information needed to fetch in parallel, and let N workers race to the finish line simultaneously.

The key constraint is the blockfetch protocol. To fetch a range of blocks from a Cardano relay, you need a concrete (slot, hash) pair at both the start and end of the range — not just a slot number. The Cardano N2N protocol identifies blocks by both slot and hash together. You can't say "give me slots 45,000,000 to 50,000,000." You have to say "give me blocks from (45000000, abc123...) to (50000000, def456...)." For a sequential chainsync, the node handles this — find_intersect negotiates a known (slot, hash) point with the node, and sync proceeds from there. But for parallel fetching, you need to know the hashes upfront to partition the work. That's what slot_hashes.csv is. It's a pre-computed list of epoch boundary hashes:


45158400,a3d6a2...
45590400,b1c4f9...
46022400,e7a012...
...


One entry per epoch boundary. With this file, I can split the chain into N slices and hand each worker a concrete (start_slot, start_hash) → (end_slot, end_hash) range. Workers connect to independent relay nodes and fetch their slices simultaneously. Instead of one thread crawling 500 epochs sequentially, 8 or 16 workers split the work and merge their output into a single parser and sink.

  slot_hashes.csv ──┬──► Worker 0  [epochs   0– 62] ──┐
                    ├──► Worker 1  [epochs  63–125] ──┼──► GovernanceParser ──► EventSink
                    ├──► Worker 2  [epochs 126–188] ──┤
                    └──► Worker N  [epochs 441–503] ──┘

In practice this gets from genesis to chain tip in 2–3 hours instead of what would otherwise be the better part of a day.
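
The partitioning itself is mechanical once the CSV is loaded. A rough sketch of how the boundary entries could be split into per-worker ranges — the types and function here are hypothetical, not the actual code:

// One (slot, hash) pair per epoch boundary, as read from slot_hashes.csv.
#[derive(Clone)]
struct BoundaryPoint {
    slot: u64,
    hash: String,
}

// A contiguous slice of the chain for one worker: fetch from `start` up to `end`.
struct WorkerRange {
    start: BoundaryPoint,
    end: BoundaryPoint,
}

// Split the epoch-to-epoch segments into chunks of roughly equal size, one per worker.
// Each range ends on the boundary where the next one starts, so the slices cover the
// whole CSV with no gaps or overlaps.
fn partition(boundaries: &[BoundaryPoint], workers: usize) -> Vec<WorkerRange> {
    let segments = boundaries.len().saturating_sub(1);
    let per_worker = segments.div_ceil(workers.max(1)).max(1);
    (0..segments)
        .step_by(per_worker)
        .map(|i| WorkerRange {
            start: boundaries[i].clone(),
            end: boundaries[(i + per_worker).min(segments)].clone(),
        })
        .collect()
}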

The CSV is not truth — it's routing. The actual block data comes from the relay. The CSV just tells each worker where to start and stop. If the CSV is wrong (bad hash), the relay rejects the range and the worker retries with the next relay in the list. The blocks themselves are verified by the protocol.

Keeping the CSV up to date. New epochs arrive every ~5 days on Cardano mainnet. The CSV doesn't need to be updated constantly — when historical workers finish at the CSV's last entry, the continuous sync (ChainFollower) takes over and streams forward from that point. The gap between the CSV tip and the actual chain tip is covered automatically. You only need to update the CSV if you want future re-syncs to parallelize that gap rather than stream it sequentially.

To add new entries, you need the epoch boundary slot and block hash:


# From cardano-db-sync:
psql cexplorer -c "
  SELECT slot_no, encode(hash, 'hex')
  FROM block
  WHERE epoch_no = 465
  ORDER BY block_no DESC
  LIMIT 1;
"


# From Blockfrost (epochs/{n}/blocks returns only block hashes, so a second
# call to /blocks/{hash} is needed to get the slot):
HASH=$(curl -s "https://cardano-mainnet.blockfrost.io/api/v0/epochs/465/blocks?count=1&order=desc" \
  -H "project_id: YOUR_KEY" | jq -r '.[0]')
curl -s "https://cardano-mainnet.blockfrost.io/api/v0/blocks/$HASH" \
  -H "project_id: YOUR_KEY" \
  | jq -r '"\(.slot),\(.hash)"'

Then append the line to slot_hashes.csv and redeploy. I'll eventually automate this as part of CI — it's a straightforward script, just hasn't been a priority yet.

The rewrite — where I am now

I rebuilt it around Gasket, the same pipeline framework Oura uses. Gasket implements a staged, event-driven architecture — each stage runs in its own thread, communicates through typed channels, and declares a retry policy. The framework handles backpressure, graceful shutdown, and metrics.
The pipeline for historical sync looks like this:

  BlockFetcher 0 ──┐
  BlockFetcher 1 ──┼──► GovernanceParser ──► EventSink ──┬──► Postgres
  BlockFetcher N ──┘                                     └──► Webhook

Continuous sync after historical is done:

  Cardano Node ──► ChainFollower ──► GovernanceParser ──► EventSink ──┬──► Postgres
                                                                      └──► Webhook

Each stage has one job. BlockFetcher connects to a relay and requests blocks. GovernanceParser decodes CBOR and extracts governance events. EventSink writes to whatever sinks are configured (Postgres, webhook, or both). They don't call each other — typed channels only.
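
Gasket has its own stage and port types, which I won't reproduce here; the sketch below uses plain tokio channels purely to show the shape — a stage owns a receiver for its input and a sender for its output, and never calls the next stage directly:

use tokio::sync::mpsc;

// Purely illustrative — not Gasket's API. The stage loops over its input channel
// and pushes results into its output channel; that's the whole contract.
struct RawBlock(Vec<u8>);
struct GovernanceEvent; // stand-in for the real parsed event type

async fn parser_stage(
    mut blocks_in: mpsc::Receiver<RawBlock>,
    events_out: mpsc::Sender<Vec<GovernanceEvent>>,
) {
    while let Some(_block) = blocks_in.recv().await {
        // decode CBOR and extract governance certs, votes, and proposals here
        let events = vec![GovernanceEvent];
        if events_out.send(events).await.is_err() {
            break; // the downstream stage has shut down
        }
    }
}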

Retry that never gives up

Every network stage declares a policy with max_retries: usize::MAX and exponential backoff capped at 120 seconds. If the relay goes down for 30 minutes, the stage backs off and keeps retrying. When the relay comes back, it reconnects in bootstrap() and picks up exactly where it left off. The node going down doesn't kill the indexer anymore.
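
The schedule itself is just arithmetic. Something like this, stated outside Gasket's own policy types — the 2-second base delay is my assumption; the 120-second cap is the real number:

use std::time::Duration;

// Retry forever, doubling the delay each attempt, never waiting more than 120s.
fn backoff_for(attempt: u32) -> Duration {
    let base = Duration::from_secs(2); // assumed base delay
    let cap = Duration::from_secs(120); // cap from the stage policy
    base.checked_mul(1u32 << attempt.min(16))
        .map(|d| d.min(cap))
        .unwrap_or(cap)
}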

Historical fetcher uses tick_timeout: None because fetching a full epoch range can take hours and I don't want a timeout cutting it short.

Checkpoint resume that works

Each historical worker writes a checkpoint after completing its epoch range, keyed by worker ID like historical_3. On restart, I query all historical_* checkpoints and skip any range already completed. A crash partway through the sync means that, on restart, every worker whose range is already done skips immediately and only the interrupted range gets re-fetched. Not from slot 0.
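
A sketch of that resume check, assuming the sync_checkpoint table has a text worker_id column — the schema details are my guess, the behaviour is as described:

use std::collections::HashSet;
use sqlx::PgPool;

// On startup, load every completed historical checkpoint so finished ranges can be skipped.
async fn completed_workers(pool: &PgPool) -> anyhow::Result<HashSet<String>> {
    let keys: Vec<String> =
        sqlx::query_scalar("SELECT worker_id FROM sync_checkpoint WHERE worker_id LIKE 'historical_%'")
            .fetch_all(pool)
            .await?;
    Ok(keys.into_iter().collect())
}

// When spawning worker i: if completed.contains(&format!("historical_{i}")), skip its range.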

previous_drep fixed properly

The old bug — querying the DB for previous_drep and getting the value from two delegations back — is fixed. The query now reads drep_id from the most recent delegation row ordered by slot. The race condition between parallel workers is also gone: EventSink is a single stage that processes blocks in order, so there's no shared state to race on. The DB query is the source of truth.
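
The lookup is a single query. Roughly this shape — the table and column names are my guesses from how the data is described above; the ORDER BY slot DESC LIMIT 1 part is the actual fix:

use sqlx::PgPool;

// Fetch the DRep a staker was delegated to immediately before the delegation being
// processed. Column names (stake_key, event_type) are assumptions for illustration.
async fn previous_drep(
    pool: &PgPool,
    stake_key: &str,
    before_slot: i64,
) -> anyhow::Result<Option<String>> {
    let drep: Option<String> = sqlx::query_scalar(
        "SELECT drep_id FROM drep_timeline_event \
         WHERE stake_key = $1 AND slot < $2 AND event_type = 'delegation' \
         ORDER BY slot DESC LIMIT 1",
    )
    .bind(stake_key)
    .bind(before_slot)
    .fetch_optional(pool)
    .await?;
    Ok(drep)
}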

Pluggable sinks

The event sink is now a trait:

#[async_trait]
pub trait EventSink: Send + Sync {
    async fn handle_block(&self, block: &ParsedBlock) -> anyhow::Result<()>;
    async fn handle_rollback(&self, to_slot: u64) -> anyhow::Result<()>;
}

Postgres sink and webhook sink ship by default. Enable via env vars:

SINK_POSTGRES_ENABLED=true
SINK_WEBHOOK_ENABLED=true
SINK_WEBHOOK_URL=https://your-api.example.com/events
SINK_WEBHOOK_SECRET=your-hmac-secret

This makes the indexer useful outside my own stack. Anyone running their own Cardano node can pipe governance events to their own systems without touching a database.
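
To make that concrete: a third sink is just another implementation of the trait. A toy example that only logs what it receives — it assumes ParsedBlock derives Debug, which isn't shown above:

use async_trait::async_trait;

// The smallest possible EventSink: print blocks and rollbacks to stdout.
pub struct StdoutSink;

#[async_trait]
impl EventSink for StdoutSink {
    async fn handle_block(&self, block: &ParsedBlock) -> anyhow::Result<()> {
        println!("block received: {:?}", block); // assumes ParsedBlock: Debug
        Ok(())
    }

    async fn handle_rollback(&self, to_slot: u64) -> anyhow::Result<()> {
        println!("rollback to slot {to_slot}");
        Ok(())
    }
}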

What's still being figured out

The current checkpoint resume logic works, but there's an edge case I haven't fully solved: what happens if the slot_hashes CSV is updated (say I extend it to cover more epochs) but the historical worker IDs shift? Worker historical_3 in the old CSV might cover a different range than historical_3 in the new one. For now I handle this by being careful with CSV updates, but a content-addressed checkpoint key (based on start/end slot rather than index) would be cleaner.
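
If I do switch to content-addressed keys, the key derivation is a one-liner — a sketch, keyed by what the range covers rather than where it sits in the CSV:

// Checkpoint key derived from the range itself, so reordering or extending the CSV
// can't make a worker claim another range's completed work.
fn checkpoint_key(start_slot: u64, end_slot: u64) -> String {
    format!("historical_{start_slot}_{end_slot}")
}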

The continuous sync transitions from historical seamlessly — when all historical workers complete, I read the last checkpoint and start `ChainFollower` from that slot. But if the historical sync was run months ago and the continuous sync is starting fresh now, it jumps straight to the tip without scanning anything in between. I'm relying on the historical CSV covering that gap. If the CSV is outdated, events in the gap get missed. I haven't built tooling to detect or report on this yet.

The LRU cache for delegation lookups is still there as a performance optimization but it's no longer load-bearing for correctness — the DB query is. I should probably remove it and measure whether the query latency matters at the volumes I'm seeing.

The deployment side

The indexer builds to a Docker image, pushed to my GitLab registry with two tags per commit: a mutable branch tag (:dev, :latest) and an immutable SHA tag (:dev-abc1234). The 1694.io Helm chart pulls the branch tag. When CI triggers a deploy, it passes the SHA as a pod annotation — that annotation changing on every deploy is what forces Kubernetes to actually pull the new image even when the tag hasn't changed.

The deployment is fully automatic between the two repos: indexer CI builds the image and triggers 1694.io CI, which runs `helm upgrade --reuse-values` with the new tag and SHA. I never have to touch the Helm chart for routine indexer deploys.

Where it's going

The architecture is stable enough now that the main thing left is data quality validation — confirming that what I'm indexing matches on-chain reality for a sample of known DReps and votes. I'm also looking at whether the webhook sink is useful enough for external users to justify documenting it properly, or if it stays as an internal feature for now.


This was written while the rewrite was still in progress. Some of what's described above is working well. Some of it I'll probably change. That's the nature of building something against a live chain.

