30 Android apps from one codebase: what we got wrong for three years
In 2011 we started building E70 — a K-12 education platform that ended up running on Samsung tablets in ~100,000 Indian classrooms with zero internet connectivity required. By 2014 we had shipped 30+ Android apps for Samsung, TCS, Cambridge, Britannica, Byju’s, and Pearson, all sharing one library called EdutorLibrary.
This post is what I wish I’d known on day one. Some of these decisions saved us months. Others set us back a year. I’ll tell you which was which.
The starting constraints
Before any code:
- Samsung-branded Android tablets distributed to schools. Mixed hardware — Galaxy Tab 2, 3, 4, various OEM boards. Android 2.3 to 4.4 in the same deployment.
- No reliable internet. Some schools had WiFi two hours a day. Some had none. Data sync had to work in hostile conditions.
- Publisher content worth millions — Cambridge, Britannica, Pearson, Byju’s. DRM was non-negotiable; an unencrypted PDF on a student tablet would destroy the business model.
- One codebase, many skins. Samsung wanted their branding. TCS wanted theirs. Each publisher wanted their content-delivery app to feel native. Nobody wanted to pay for 30 separate codebases.
- Budget for maybe 4 Android developers at peak.
These constraints pointed toward a shared-library architecture. The question was how aggressive to make it.
EdutorLibrary: the shared spine
We settled on a library pattern where every app included EdutorLibrary as an Android Library Project (this was pre-AAR days). The library owned:
- DRM / content decryption
- SQLite schema + migrations for the student database
- MQTT sync engine
- The cross-app ContentProvider boundary
- Common UI widgets (player, reader, quiz runner)
Each individual app — e70.samsung, e70.tcs, e70.cambridge, etc. — owned its Activity stack, theming, and per-client customizations. The library became ~70% of the code; each app was ~30%.
What worked: Bug fixes propagated to 30 apps with one rebuild. Security patches went out across the fleet. Adding a new content type meant updating the library once.
What almost killed us: version drift. An app built against EdutorLibrary v1.3 couldn’t read data written by an app built against v1.4 without a migration. Early on, we let each app update on its own schedule. Six months in, we had three incompatible database schemas in production and a cross-app ContentProvider that would silently return nulls when two apps with different schemas shared the tablet. Recovery was painful.
Lesson: if you ship a shared library across apps that share data, you need a hard rule that the whole fleet updates together. We eventually enforced this with a version-check at boot — if EdutorLibrary version mismatched between apps, the older ones refused to start until updated. Painful for users but the alternative was silent corruption.
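The gate itself is simple. Here's a minimal sketch of the boot-time version check — names and the dotted-version scheme are hypothetical; the real check ran against versions advertised by peer E70 apps:

```java
// Sketch of the boot-time library-version gate (names are hypothetical).
// Each app embeds its EdutorLibrary version; on startup it compares against
// the versions carried by peer apps on the tablet and refuses to start if it
// is behind, forcing the fleet to upgrade together.
final class LibraryVersionGate {

    // Compare dotted versions like "1.3" vs "1.4": negative if a < b.
    static int compare(String a, String b) {
        String[] xs = a.split("\\."), ys = b.split("\\.");
        int n = Math.max(xs.length, ys.length);
        for (int i = 0; i < n; i++) {
            int x = i < xs.length ? Integer.parseInt(xs[i]) : 0;
            int y = i < ys.length ? Integer.parseInt(ys[i]) : 0;
            if (x != y) return Integer.compare(x, y);
        }
        return 0;
    }

    // An app may start only if no peer carries a newer library version.
    static boolean mayStart(String ownVersion, String[] peerVersions) {
        for (String peer : peerVersions) {
            if (compare(ownVersion, peer) < 0) return false;
        }
        return true;
    }
}
```

The crucial property is that the comparison is numeric per segment, not lexicographic — `"1.10"` must beat `"1.9"`, or the gate misfires exactly when it matters most.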
DRM in the JNI layer
Content protection was the most complex part. A student tablet had to:
- Download an encrypted content package (PDF, video, interactive HTML) from the server
- Verify the license was valid for this student + this tablet + this time window
- Decrypt pages / video frames on-demand
- Never, ever materialize plaintext on disk
We decided early: decryption lives in a native library, not Java. Decompiling Java DEX with apktool was trivial; anyone could grep for “AES” and find the keys. Moving the crypto into a JNI .so didn’t make it impossible to reverse-engineer, but it raised the bar from “curious student” to “determined attacker.”
The architecture:
Java layer (CipherInputStream)
↓ streams encrypted bytes, one page at a time
JNI bridge (AES.cpp)
↓ calls openssl EVP_* APIs
Decryption returns plaintext ONLY to the rendering surface
(never to disk, never back to Java as String)
We used VideoCipherPlayer and CipherInputStream wrappers that intercepted reads, called into JNI to decrypt a block, and passed the plaintext directly to the renderer. The Java heap never saw decrypted bytes.
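A stream wrapper of that shape can be sketched as follows. In the real system the per-block decrypt lived behind JNI; here `javax.crypto` AES/CTR stands in for the native call, and the block size and class name are illustrative:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of a CipherInputStream-style wrapper: pull encrypted bytes one block
// at a time, decrypt the block, hand plaintext straight to the consumer.
final class BlockDecryptingStream extends InputStream {
    static final int BLOCK = 64 * 1024;      // decrypt 64KB at a time
    private final InputStream encrypted;
    private final Cipher cipher;
    private byte[] plain = new byte[0];
    private int pos = 0;

    BlockDecryptingStream(InputStream encrypted, byte[] key, byte[] iv) throws Exception {
        this.encrypted = encrypted;
        this.cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    }

    @Override public int read() throws IOException {
        if (pos == plain.length && !refill()) return -1;
        return plain[pos++] & 0xFF;
    }

    // Pull the next encrypted block and decrypt it. (Here the buffer is an
    // ordinary array; the production code kept plaintext off the Java heap.)
    private boolean refill() throws IOException {
        byte[] buf = new byte[BLOCK];
        int n = encrypted.read(buf);
        if (n <= 0) return false;
        plain = cipher.update(buf, 0, n);
        pos = 0;
        return plain.length > 0;
    }

    // Convenience helper: decrypt a whole buffer through the stream.
    static byte[] decryptAll(byte[] ciphertext, byte[] key, byte[] iv) throws Exception {
        BlockDecryptingStream s =
            new BlockDecryptingStream(new ByteArrayInputStream(ciphertext), key, iv);
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        int b;
        while ((b = s.read()) != -1) out.write(b);
        return out.toByteArray();
    }
}
```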
What we got right
Block-level decryption, not whole-file. Videos are gigabytes. Decrypting the whole file would’ve needed gigabytes of plaintext somewhere. Block-level meant we decrypted 64KB at a time into a native buffer, rendered it, and discarded.
Per-tablet keys derived from a master license. The license file contained a key wrapped with the tablet’s hardware identifier + a server-issued nonce. If you copied the encrypted content to a different tablet, it wouldn’t decrypt. Not unbreakable — hardware identifiers can be spoofed — but it stopped casual piracy (which was 99% of the threat).
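One way to realize that wrapping scheme — the derivation function and names here are my reconstruction, not the shipped code — is to derive a wrapping key from the hardware ID and server nonce, then AES-wrap the content key with it:

```java
import javax.crypto.Cipher;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.util.Arrays;

// Sketch of per-tablet key wrapping (details hypothetical): derive a wrapping
// key via HMAC-SHA256 over the hardware ID keyed with a server-issued nonce,
// then AES-wrap the content key. Copy the content to a tablet with a different
// hardware ID and the key unwraps to garbage.
final class TabletKeyWrap {
    static byte[] wrappingKey(String hardwareId, byte[] serverNonce) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(serverNonce, "HmacSHA256"));
        return Arrays.copyOf(mac.doFinal(hardwareId.getBytes("UTF-8")), 16); // AES-128 key
    }

    static byte[] wrap(byte[] contentKey, String hardwareId, byte[] nonce) throws Exception {
        Cipher c = Cipher.getInstance("AES/ECB/NoPadding"); // one 16-byte block, sketch only
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(wrappingKey(hardwareId, nonce), "AES"));
        return c.doFinal(contentKey);
    }

    static byte[] unwrap(byte[] wrapped, String hardwareId, byte[] nonce) throws Exception {
        Cipher c = Cipher.getInstance("AES/ECB/NoPadding");
        c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(wrappingKey(hardwareId, nonce), "AES"));
        return c.doFinal(wrapped);
    }
}
```

Note that unwrapping on the wrong tablet doesn't fail loudly — it silently yields a wrong key, which is exactly the behavior you want against casual copying.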
License expiry checks in native code. The license file said “valid until 2013-06-15.” If the license check lived in Java, a clever student could call the JNI decrypt function directly and bypass it entirely. So we moved the time check into the same JNI call that returned decrypted bytes. If the license was expired, decrypt returned zeros. There was no way to reach the decrypt function without going through the check.
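The native-side rule is sketched below in Java for readability (the cipher is a trivial XOR stand-in, not the real one). The point is structural: the expiry check and the decrypt are one call, so there is no decrypt entry point that skips the check, and failure yields zeroed output rather than an exception an attacker could catch and route around:

```java
// Gate-in-the-same-call pattern: decryptBlock is the ONLY decrypt entry
// point, and it checks expiry before touching a single byte.
final class GatedDecrypt {
    // Stand-in for the real block cipher: XOR with a license-derived key byte.
    static byte[] decryptBlock(byte[] block, byte keyByte,
                               long licenseExpiryEpoch, long nowEpoch) {
        byte[] out = new byte[block.length];
        if (nowEpoch > licenseExpiryEpoch) return out; // expired: return zeros
        for (int i = 0; i < block.length; i++) out[i] = (byte) (block[i] ^ keyByte);
        return out;
    }
}
```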
What we got wrong
Rolling our own authenticated encryption. OpenSSL had EVP_* APIs, but the NDK version we could use didn’t include them, so we hand-wrote AES-CBC with HMAC-SHA256 instead. It mostly worked, but I should have bundled a recent OpenSSL rather than rolling a construction a CS textbook would flag. The hand-written version had no known exploits (we were paranoid about the construction), but every senior cryptographer I showed it to made a face. Don’t roll your own crypto. Bundle a tested library even if the binary size hurts.
Keeping keys in memory as byte[]. We’d decrypt a chunk, hold the AES key as a Java byte array for milliseconds, then zero it out. Android’s garbage collector doesn’t zero memory — a heap dump could’ve recovered keys. We moved to keeping keys only in JNI-allocated buffers after an audit pointed this out. Took another two weeks to fix correctly.
MQTT for sync across 100K+ devices
Student progress, quiz results, and content-license updates all synced via MQTT. We picked MQTT over HTTP long-polling because:
- Tablets’ radios were often off (kids close covers, tablets sleep)
- Reconnects were cheap with MQTT’s session state
- Our server could push license revocations or content updates without the tablet polling
- One broker could handle the 100K-device scale with modest hardware
We ran Mosquitto on a cluster of 3 boxes with a shared database for session persistence. Throughput in production was ~5K messages/sec sustained, peaks of 20K during exam periods.
The sync design
Every piece of state on the tablet had an updated_at timestamp and a sync_state enum:
CREATE TABLE student_progress (
student_id TEXT,
lesson_id TEXT,
progress_pct INTEGER,
updated_at INTEGER, -- local clock (see caveat below)
sync_state INTEGER, -- 0=dirty, 1=syncing, 2=synced
PRIMARY KEY (student_id, lesson_id)
);
On connect:
- Tablet publishes dirty records to student/<id>/push
- Server ACKs each record; tablet marks sync_state=2
- Server publishes any deltas on student/<id>/pull
- Tablet applies deltas and handles conflicts
This is conceptually last-writer-wins with per-row timestamps. Simple, mostly correct, and easy to debug.
The clock problem (this one hurt)
Six months in, we started getting reports of student progress “going backwards.” Kids would finish a lesson, reopen the app a week later, and find it was at 0%.
Root cause: tablets had no NTP. Some clocks were off by hours or months. A student would finish a lesson at tablet-time 2012-01-01, and the server would accept the record. Later the tablet’s clock would correct itself (via cell tower sync, inconsistently) to 2012-06-01, so anything written locally after the correction — including a fresh, empty progress row — carried the newer timestamp. When the server pushed the student’s real progress back with its old 2012-01-01 timestamp, the tablet’s last-writer-wins comparison saw 2012-01-01 < 2012-06-01 and rejected the server’s record as stale. The kid’s finished lesson showed 0%.
The fix was a long ugly migration: every record got a server-assigned monotonic sequence number. Timestamps became advisory. Conflict resolution used sequence numbers, which always increase regardless of clock skew. Took two engineers three weeks and we had to replay six months of MQTT logs to assign sequence numbers retroactively.
Lesson: if any of your clients might have wrong clocks (mobile devices, IoT, any non-NTP-controlled device), don’t use timestamps for conflict resolution. Use monotonic server-assigned IDs.
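The fixed scheme fits in a few lines. This is a condensed sketch with server and client collapsed into one class for illustration — the server stamps every accepted write with a monotonically increasing sequence number, and the client resolves conflicts by sequence alone, never by wall-clock time:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of seq-based conflict resolution: a tablet whose clock jumps around
// cannot make fresh data look stale, because ordering comes from the server.
final class SeqResolver {
    private final AtomicLong serverSeq = new AtomicLong();      // server side
    private final Map<String, Long> applied = new HashMap<>();  // client side: key -> seq

    // Server: accept a write, stamping it with the next sequence number.
    long accept() { return serverSeq.incrementAndGet(); }

    // Client: apply a delta only if its sequence is newer than what we hold.
    boolean apply(String key, long seq) {
        Long have = applied.get(key);
        if (have != null && have >= seq) return false;
        applied.put(key, seq);
        return true;
    }
}
```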
Cross-app ContentProvider: the sharp edge
Android’s ContentProvider lets one app expose data to other apps. We used it to let e70.samsung (the content player) read student info written by e70.tcs (the admin console), and vice versa.
On paper, elegant. In practice, two footguns we found the hard way.
Footgun 1: signatures
ContentProviders can be restricted to apps signed with the same key. We didn’t do this at first. Consequence: any app on the tablet could read student data if it knew our provider authority. Not a huge deal in a controlled classroom deployment, but in one pilot the tablet was used for general apps too, and a fitness app ended up reading students’ grade levels (we confirmed this in logs). Fixed by defining a custom permission with android:protectionLevel="signature", requiring it on the provider, and ensuring all E70 apps shared a signing key.
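For anyone hitting the same footgun, the manifest shape looks roughly like this — the authority, permission, and class names here are illustrative, not our real ones:

```xml
<!-- Define a signature-level permission and require it on the shared
     provider, so only apps signed with the same key can read or write it. -->
<permission
    android:name="com.edutor.permission.SHARED_DATA"
    android:protectionLevel="signature" />

<provider
    android:name=".StudentDataProvider"
    android:authorities="com.edutor.students"
    android:exported="true"
    android:readPermission="com.edutor.permission.SHARED_DATA"
    android:writePermission="com.edutor.permission.SHARED_DATA" />
```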
Footgun 2: the upgrade deadlock
If e70.tcs v1.3 is installed and shares data with e70.samsung v1.2 via ContentProvider, and we ship e70.tcs v1.4 with a schema change, you can get into a state where:
- The newer app writes data the older app can’t read
- The older app writes data the newer app considers corrupt
- Neither app knows to trigger an update
We eventually required apps to declare the minimum EdutorLibrary version they needed, check peer apps on startup via PackageManager, and refuse to open the shared ContentProvider if any peer was stale. User-hostile but the only way to keep the shared data boundary safe.
SQLCipher and the database recovery system
SQLite is fine until a tablet loses power mid-write. In a production fleet of 100K tablets being used by 10-year-olds, power loss is constant. We saw ~30 tablets/day with corrupted databases.
SQLite’s WAL mode helps but doesn’t eliminate corruption. We eventually built a recovery system that ran at app startup:
- Open DB
- Run PRAGMA integrity_check
- If it fails, restore the last snapshot (we took a SQLite backup every hour to external storage)
- Replay any MQTT messages we had queued since the snapshot
- Write a recovery log for ops to see
This was a lot of code but it took tablet-corruption-related support tickets from “weekly” to “never.” SQLCipher (encrypted SQLite) on top added another failure mode — a wrong passphrase looks identical to corruption — but we handled that by keeping the passphrase in the JNI layer and never exposing it to Java.
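The control flow of that startup pass is worth showing on its own. This sketch hides the database and snapshot behind tiny interfaces so the logic is self-contained; the real system ran SQLite's PRAGMA integrity_check against the live file and restored from the hourly backup:

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Control-flow sketch of the startup recovery pass: check integrity, fall
// back to the last snapshot on failure, replay queued sync messages.
final class StartupRecovery {
    interface Db { boolean integrityOk(); }

    // Returns "ok" if the live DB passed, "recovered" if we fell back to the
    // snapshot and replayed the queued MQTT messages.
    static String run(Db live, Supplier<Db> restoreSnapshot,
                      List<String> queuedMessages, Consumer<String> replay) {
        if (live.integrityOk()) return "ok";
        restoreSnapshot.get();                              // last hourly snapshot
        for (String m : queuedMessages) replay.accept(m);   // re-apply queued messages
        // (production also wrote a recovery log entry for ops here)
        return "recovered";
    }
}
```

Keeping the pass injectable like this is also what makes it testable without a fleet of physical tablets — a gap we name below.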
What 2842 commits of ownership actually looked like
I had 28.4% ownership of the combined codebase — about 2,842 commits across three years, or ~3 commits per working day. This isn’t a flex; it’s a failure mode I want to name.
A platform codebase shared across 30 apps means one engineer’s decisions ripple across the entire fleet. When you’re also the primary architect, you’re also the bottleneck. I was doing code reviews for teammates across three timezones, fixing production fires, and designing the next feature — all at once.
What I’d do differently:
- Document architectural decisions as they happen. We built a recovery system, a DRM system, a sync system — all in my head. When I later stepped back from E70, institutional knowledge walked out the door with me. ADRs (Architecture Decision Records) are cheap insurance.
- Enforce API boundaries earlier. By year 2, EdutorLibrary had grown beyond what one person could reason about. Splitting it into focused libraries (edutor-sync, edutor-drm, edutor-storage) would have let teammates own pieces.
- Invest in testing infrastructure. We had unit tests, but our integration testing relied on physical tablets. Running a fleet of Docker-based Android emulators for CI would have caught 80% of the bugs we shipped.
The outcomes
By 2014, E70 had:
- 30+ Android applications shipping from one codebase
- ~100,000 active student devices
- 84% of the DRM/security code owned by the founding team
- Sub-1% crash rate measured via ACRA (Acralyzer dashboard)
- Content partnerships with 6 major publishers
None of that was because of elegant architecture. It was because we kept the architecture small enough that a small team could hold it in their heads, fought to keep the shared library focused, and recovered from our own mistakes fast enough that they didn’t compound.
If I were building this again today, I’d probably use React Native or Flutter for the app layer (fewer hand-written JNI bridges, fewer codepath forks) but keep JNI for crypto. The sync model would still be MQTT, though I’d consider NATS for new deployments. SQLite + server-assigned sequence numbers is still the right pattern for offline-first mobile data. And this time, I’d write the ADRs from day one.
I’m Vivek Yarra, a Principal Engineer who’s built platforms like E70 for 15 years. Currently open to US remote Principal/Staff roles with visa sponsorship. Let’s talk.