Building an offline-first exam platform that survives power cuts
Digital examination centers in India don’t have the internet uptime Silicon Valley engineers imagine. A typical exam session looks like this: 200 students, one overloaded 4G router, patchy power, a switch that reboots twice in four hours, and a hard deadline — the exam must complete, answers must sync, and nothing can be lost. Ever.
We shipped this distributed exam platform to production with 99.9% sync reliability across three Android apps and a Go router service. Here’s how — including the parts that broke in interesting ways.
The problem
On the surface it looks like a CRUD app: students answer questions, answers sync to a server. In practice the constraints are hostile:
- Networks are unreliable. Packet loss, 30-second timeouts, full disconnections lasting minutes.
- Power fails. Laptops crash. Tablets reboot. Router services get killed mid-transaction.
- Losing a single answer is unacceptable. An exam is a legal document. If a student’s answer to question 47 vanishes because the router crashed, that’s a failed exam session and a lawsuit waiting to happen.
- Clock skew. Exam centers don’t all have NTP. Devices drift by minutes.
- No retries from students. Students don’t refresh the app. They click “Submit” once. If it fails, they don’t know.
The non-negotiable requirement was: once a student’s answer lands on the device, it must eventually land on the server, regardless of what breaks in between.
The three-tier architecture
We ended up with three distinct tiers, each with different failure modes:
```
Student Device (Android, Java)
   │  SQLite (offline)
   │
   ├─ writes to local DB immediately on every keystroke
   │
   ▼  over LAN / WebSocket
Router (Go, Gin) — one per exam center
   │  PostgreSQL (durable queue)
   │
   ├─ batches and forwards upstream
   │
   ▼  over WAN / RabbitMQ
Central Server (Go)
   │  PostgreSQL (source of truth)
```
The router is the critical innovation. It sits on the LAN at the exam center, running on a beefed-up mini-PC. Students talk to it over local WiFi. It talks to the central server over whatever the exam center’s uplink is. If the uplink dies for two hours, students don’t notice — the router holds everything in Postgres and retries.
This sounds obvious. It wasn’t. We tried two simpler designs first.
Attempt 1: Direct-to-cloud (failed)
First build: student apps sync directly to the central server over HTTPS. Simple. Clean. No intermediate router.
Failure mode: if the exam center’s uplink blips for 10 seconds, 200 students all retry at the same moment, flood the connection, and the uplink stays saturated. We saw timeouts compound until the session was unusable. Removing the thundering herd required exponential backoff on 200 clients, which is a lot of complexity to ship to Android devices that have no way to update mid-exam.
Attempt 2: Cloud with client queue (failed)
Second build: keep cloud-direct, but have each student app queue locally and drain on reconnect. Better on paper.
Failure mode: a student submits an answer for Q47, the app crashes, the tablet reboots, the app reopens mid-exam — and we lost the queue. Android’s on-disk queue was implemented with SQLite, but the write wasn’t committed before the crash. We tried WAL + synchronous=FULL, but the battery hit was unacceptable on exam tablets that already run hot.
The real lesson: if the device can lose power at any moment, the device can’t be the durability boundary. Something within the LAN needs to be — something you can keep running on a UPS.
Attempt 3: LAN router with durable queue (shipped)
The router is a small service that runs on hardware the exam center controls. It exposes:
- A WebSocket endpoint students connect to (keeps 200+ long-lived connections)
- An MQTT broker for invigilator devices (lightweight status pings)
- A RabbitMQ producer that talks upstream to the central server
Everything the router receives goes into Postgres first and is only then acknowledged to the student app. The router is the point where “I wrote to disk on a machine with a UPS” becomes true. Students trust the router; the router owns the upstream retry problem.
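In handler form, that ordering is small but easy to get backwards. A sketch of the shape, where the store interface, memStore, and the ack struct are illustrative stand-ins, not our actual types:

```go
package main

import (
	"errors"
	"fmt"
)

// store abstracts the router's Postgres queue; only after Insert
// returns successfully does the student app get an ack.
type store interface {
	Insert(eventID string, payload []byte) error
}

// ack is what goes back to the student app over the WebSocket.
type ack struct {
	EventID string
	OK      bool
}

// handleAnswer persists first, acks second. If the durable write fails,
// no ack is sent, so the client keeps the row unsynced and retries.
func handleAnswer(s store, eventID string, payload []byte) (ack, error) {
	if err := s.Insert(eventID, payload); err != nil {
		return ack{}, fmt.Errorf("durable write failed, no ack: %w", err)
	}
	return ack{EventID: eventID, OK: true}, nil
}

// memStore stands in for Postgres in this sketch.
type memStore struct{ rows map[string][]byte }

func (m *memStore) Insert(id string, p []byte) error {
	if id == "" {
		return errors.New("missing event id")
	}
	m.rows[id] = p
	return nil
}

func main() {
	s := &memStore{rows: map[string][]byte{}}
	a, err := handleAnswer(s, "evt-1", []byte(`{"q":47,"answer":"London"}`))
	fmt.Println(a.OK, err)
}
```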
RabbitMQ for server-to-router sync
Why RabbitMQ instead of plain HTTP? Three reasons:
- Guaranteed delivery semantics. With publisher confirms + durable queues + manual acks, a message sent from the central server to a router will eventually arrive, even if the router restarts mid-flight.
- Backpressure for free. If the central server is processing slowly, RabbitMQ’s flow control slows the router’s publish rate. We don’t have to reimplement this with HTTP.
- Fan-out to multiple subscribers. We have a “live dashboard” service that needs to see the same events the central DB sees. With RabbitMQ it’s one extra consumer; with HTTP it would be a second service call or a webhook.
The setup on the router side looks like this in Go:
```go
import (
	"fmt"

	"github.com/streadway/amqp"
)

func publishEvent(conn *amqp.Connection, eventID string, payload []byte) error {
	ch, err := conn.Channel()
	if err != nil {
		return err
	}
	defer ch.Close()

	// Durable queue — survives broker restart
	if _, err := ch.QueueDeclare(
		"exam-events",
		true,  // durable
		false, // autoDelete
		false, // exclusive
		false, // noWait
		nil,
	); err != nil {
		return err
	}

	// Publisher confirms — we don't consider a message sent
	// until the broker acks it
	if err := ch.Confirm(false); err != nil {
		return err
	}
	confirms := ch.NotifyPublish(make(chan amqp.Confirmation, 1))

	// Publish with persistent delivery mode
	if err := ch.Publish("", "exam-events", false, false, amqp.Publishing{
		DeliveryMode: amqp.Persistent,
		ContentType:  "application/json",
		Body:         payload,
		MessageId:    eventID, // for dedup
	}); err != nil {
		return err
	}

	// Block until the broker confirms
	if confirmation := <-confirms; !confirmation.Ack {
		return fmt.Errorf("broker rejected message %s", eventID)
	}
	return nil
}
```
Two production lessons worth writing down:
Always use MessageId for idempotency. RabbitMQ can redeliver. If your consumer isn’t idempotent, you double-count. We include a deterministic event ID (exam_id + student_id + question_id + answer_hash) and the central server’s consumer upserts by that ID.
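A deterministic ID of that shape can be sketched as follows; the field separator and hashing scheme here are assumptions, and what matters is only that a redelivered message carries the same MessageId:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// eventID derives a deterministic dedup key from the fields named in the
// text (exam, student, question, answer hash). Hashing the joined fields
// keeps the MessageId fixed-length regardless of field sizes.
func eventID(examID, studentID, questionID, answerHash string) string {
	joined := strings.Join([]string{examID, studentID, questionID, answerHash}, "|")
	sum := sha256.Sum256([]byte(joined))
	return hex.EncodeToString(sum[:])
}

func main() {
	// A redelivery of the same answer produces the same ID,
	// so the consumer's upsert is a no-op the second time.
	fmt.Println(eventID("exam-9", "stu-42", "q-47", "ab12"))
}
```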
Confirms are synchronous per channel. Don’t share a channel between goroutines and expect to correlate confirms. We learned this by chasing a ghost bug where confirms seemed to arrive for the wrong messages. Per-channel confirmation requires you either wait per-publish (slower, simpler) or use channel pools with per-channel confirm correlation.
MQTT for invigilator status
RabbitMQ is overkill for “this invigilator’s screen just refreshed.” For that we use MQTT — it’s lighter, it’s designed for intermittent connectivity, and the Android client is 10x smaller than a RabbitMQ library.
Invigilator apps publish to status/center-123/invigilator-4 every 15 seconds with their screen state. The central server has a single MQTT consumer that updates a Redis-backed presence map. If a status hasn’t updated in 60 seconds, the dashboard shows the invigilator as offline.
MQTT QoS 0 is fine for this. We don’t care if we miss a status ping; we’ll get the next one in 15 seconds. The whole thing fits in 200 lines of Go.
SQLite as the offline boundary
On the Android side, SQLite is the source of truth until sync succeeds. Every answer write goes:
- Write to SQLite (sync mode NORMAL, journal WAL)
- Send over WebSocket to router
- On router ack, mark the row synced=1
- On failure, keep synced=0 and retry with backoff
The unsynced rows are our durability guarantee. If the app crashes, on restart we scan for synced=0 rows and push them to the router. The synced flag is the state machine.
The tricky part: answers can be edited. If a student writes “Paris” for Q1, then changes to “London” before the router ack arrives, we don’t want to sync both. We use a monotonically increasing version per (student, question) and the router rejects writes with an older version:
```sql
CREATE TABLE answers (
  student_id  TEXT,
  question_id TEXT,
  answer      TEXT,
  version     INTEGER,
  synced      INTEGER DEFAULT 0,
  PRIMARY KEY (student_id, question_id)
);
```

```java
// On every answer write: bump the version and reset the synced flag
db.execSQL(
    "INSERT OR REPLACE INTO answers(student_id, question_id, answer, version, synced) " +
    "VALUES (?, ?, ?, COALESCE((SELECT version FROM answers " +
    "WHERE student_id=? AND question_id=?), 0) + 1, 0)",
    new Object[]{studentId, questionId, answer, studentId, questionId});
```
The router then validates: the incoming version must be greater than the last version it successfully stored for that (student, question). If an older write arrives late, the router rejects it with “stale version” and the client doesn’t retry.
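The check itself is a few lines. A sketch of the router-side logic, with an in-memory map standing in for the Postgres-stored versions:

```go
package main

import "fmt"

// acceptWrite stores an incoming answer only if its version is strictly
// greater than the last version durably stored for that (student, question)
// cell; a delayed older write is rejected as stale.
func acceptWrite(stored map[[2]string]int, studentID, questionID string, version int) bool {
	key := [2]string{studentID, questionID}
	if version <= stored[key] {
		return false // stale version: an older write that lost the race
	}
	stored[key] = version
	return true
}

func main() {
	stored := map[[2]string]int{}
	fmt.Println(acceptWrite(stored, "stu-1", "q-1", 1)) // "Paris", v1: true
	fmt.Println(acceptWrite(stored, "stu-1", "q-1", 2)) // "London", v2: true
	fmt.Println(acceptWrite(stored, "stu-1", "q-1", 1)) // delayed retry of v1: false
}
```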
This is optimistic concurrency — the simplest thing that works. We considered CRDTs. For a single-writer-per-cell scenario (only one student can answer their own question), CRDTs are overkill.
The sync state machine
Every answer moves through three states:
```
(local write) → PENDING → (router ack) → SYNCED → (server ack) → CONFIRMED
                   ↑                        │
                   └─────── (timeout) ──────┘
```
We keep the states explicit because debugging gets impossible otherwise. Every exam session, we’d see a student whose answer was in some in-between state — router confirmed but server didn’t see it, or server saw it but router thought the ack was lost. Making the states first-class in the schema means you can query “show me everything stuck in SYNCED but not CONFIRMED for over 5 minutes” and get an answer.
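That query reads roughly like this; the state and updated_at columns are assumptions for illustration, not the exact router schema:

```sql
-- Answers the router acked but the central server never confirmed,
-- stuck for more than 5 minutes: candidates for re-upload.
SELECT student_id, question_id, version, updated_at
FROM answers
WHERE state = 'SYNCED'
  AND updated_at < now() - interval '5 minutes';
```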
The transitions happen in a loop on both the Android client and the router. The client’s loop ticks every 5 seconds and handles PENDING → SYNCED. The router’s loop ticks every 10 seconds and handles SYNCED → CONFIRMED.
Timeouts move states backward:
- PENDING for more than 30 seconds → stay PENDING, retry
- SYNCED for more than 2 minutes with no CONFIRMED → set needs_reupload=1, go back to PENDING
The state machine was the single biggest reliability improvement we made. Before it, a network blip could leave answers in an ambiguous state that required manual DB surgery. After it, everything self-heals.
What 99.9% actually looks like
We measured reliability as: of all answers written locally, what fraction eventually reached the central server and were CONFIRMED before exam end?
Across ~500K answers in production so far, we’re at 99.94% on first exam attempt. The remaining 0.06% are caught by a pre-submission sync check — the exam app won’t let the student “Finish Exam” until every local answer is CONFIRMED. If sync fails repeatedly, an invigilator gets notified and intervenes.
The gap between 99% and 99.9% is mostly in edge cases:
- Tablet powers off completely mid-write (we flush on onPause, not onDestroy, which was wrong initially)
- Clock skew making version comparisons fail (we use monotonic version counters, not timestamps)
- RabbitMQ queue backup when central server is doing heavy reporting (we added a second RabbitMQ cluster for reporting)
- WebSocket disconnections during network handoff (LTE→WiFi) — we reconnect automatically but we had a race where we’d lose the queue position
The gap between 99.9% and 100% is where we stopped investing. The invigilator workflow catches the rest, and the marginal cost of eliminating the last 0.1% was more complex retry logic that itself had bugs.
Things I’d do differently
If I started over today:
- Use NATS instead of RabbitMQ. Smaller, faster, simpler. RabbitMQ’s management UI is nice but we never used 90% of it.
- Skip MQTT. We could have used NATS for everything.
- Store answers as append-only events, not mutable rows. The INSERT OR REPLACE pattern makes auditing harder than it needed to be. An event log with a materialized view would have been easier to reason about.
- Invest earlier in the state machine. We shipped without explicit states and spent two months debugging phantom sync issues.
The other 60% of lessons aren’t about architecture — they’re about the fact that exam centers are physical spaces with real failure modes you can’t reproduce in dev. The right instinct is: assume everything breaks, and make every boundary durable.
I’m Vivek Yarra, a Principal Engineer with 15 years of this kind of work. If you’re building distributed systems that have to survive real-world conditions, let’s talk. I’m currently open to US remote Principal/Staff roles with visa sponsorship.