# When QUORUM Isn't Enough: How Distributed Clock Skew Silently Discards Data
TL;DR: I debugged intermittent silent failures in a ScyllaDB/Cassandra cluster where UPDATE queries were ignored despite QUORUM consistency. The root cause was clock skew on a single node caused by a failed NTP daemon. Since distributed databases use “Last-Write-Wins” (LWW) resolution, the drifting node assigned timestamps from the “past”, causing valid updates to be silently discarded.
Background
It started as a typical debugging session. I was investigating an issue where an update query would occasionally fail silently. No exceptions were raised, and the code proceeded as if the update was successful. Downstream features would then fail because the expected data update never happened.
To make things more interesting, the issue had a specific, reproducible pattern:
- It happened only 10-20% of the time.
- It occurred only in the `staging` environment.
- Production and testing environments (running the exact same code) were fine.
Infrastructure
The architecture consists of several FastAPI services and Celery workers, backed by a 3-node ScyllaDB cluster.
(Note: The logic below applies equally to Apache Cassandra).
The Stack
- Database: ScyllaDB (3 Nodes).
- Replication Factor: 3.
- Consistency Level: `QUORUM` (2 nodes must acknowledge).
- Driver: `scylla-driver` with `GeventConnection`.
  - Reason: The Celery workers are I/O heavy and run with `gevent` concurrency. The connection class must match this concurrency model.
Key Configuration
The system uses the CQLEngine ORM for API operations and Prepared Statements for high-performance worker tasks. There was one specific configuration choice that became relevant: client-side timestamps are disabled, so every write timestamp is assigned server-side by the coordinator node's clock.
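A minimal sketch of the driver side of that setup, assuming the plain `scylla-driver` API (hosts and keyspace names are illustrative, and the cqlengine wiring is omitted):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.io.geventreactor import GeventConnection  # matches gevent-based Celery workers

profile = ExecutionProfile(consistency_level=ConsistencyLevel.QUORUM)

cluster = Cluster(
    contact_points=["node-a", "node-b", "node-c"],        # hypothetical hosts
    connection_class=GeventConnection,
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
    # No client-side timestamps are configured here, so every mutation is
    # stamped by the *coordinator node's* local clock (the detail that matters later).
)
session = cluster.connect("my_keyspace")                  # hypothetical keyspace
```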
The Investigation
1. The Usual Suspects
Naturally, my first hunch was an environment issue. Since Production and Testing were fine, Staging must be broken, right?
I ran a full health check on the Staging cluster. I verified the node configurations, checked network latency, and compared driver versions. Everything matched Production perfectly. The infrastructure was healthy.
2. The Code Execution Flow
If the infrastructure is healthy, the bug must be in the code. I traced the specific flow that was failing:
1. Write A: `INSERT` a row into `table1`.
2. Read A: `SELECT` the row to verify it exists. (Success)
3. Batch: Execute a `LOGGED BATCH` of inserts into `table2`.
4. Write B: `UPDATE` the row from Step 1 in `table1`.
The Failure: The UPDATE in Step 4 acts like a ghost. It executes without error, but the data remains unchanged (stuck at the state from Write A).
3. Debugging Silent Failures
Was the driver swallowing exceptions? I knew that session.execute_async() could hide errors if you don’t check the result:
```python
future = session.execute_async("SELECT * FROM table")
try:
    result = future.result()  # Errors surface here
except Exception as e:
    print(f"Query failed: {e}")
```

However, I was using the synchronous `session.execute()`, which raises exceptions immediately. Just to be sure, I attached custom callback and errback handlers to the driver and enabled verbose logging.
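The callback/errback wiring looked roughly like this (a sketch; the handler names are mine, and `session`/`row_id` come from the surrounding code):

```python
def on_success(rows):
    # Invoked with the result set once the query completes.
    print(f"Query OK: {len(list(rows))} row(s)")

def on_error(exc):
    # Invoked with the exception if the query fails.
    print(f"Query failed: {exc}")

future = session.execute_async("SELECT * FROM table1 WHERE id = %s", [row_id])
future.add_callbacks(callback=on_success, errback=on_error)
```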
Result: Nothing. Every query returned SUCCESS.
I even restarted the application to force all Prepared Statements to be re-prepared on every node, just in case of a stale state. The issue persisted.
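For context, the failing Write B goes through a prepared statement in the worker, roughly like this (the `status` column, `row_id`, and `new_status` are hypothetical placeholders):

```python
# Prepared once per process; the driver re-prepares it after reconnects.
update_stmt = session.prepare("UPDATE table1 SET status = ? WHERE id = ?")

# Write B: no exception is raised here even when the mutation later loses the
# "Last Write Wins" comparison on the replicas.
session.execute(update_stmt, [new_status, row_id])
```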
4. Verified Data Integrity
Was the application sending old data? I modified the code to fetch the `writetime` of the updated columns before and after the update (a minimal version of that check appears after the list below).
- Expected: `writetime` increases after Write B.
- Actual: `writetime` remained identical to Write A.
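The check itself is just a `WRITETIME()` query run before and after the update (a sketch with a hypothetical `status` column; `session` and `row_id` come from the surrounding code):

```python
def fetch_writetime(session, row_id):
    # WRITETIME() returns the cell's write timestamp in microseconds since the Unix epoch.
    row = session.execute(
        "SELECT WRITETIME(status) AS wt FROM table1 WHERE id = %s", [row_id]
    ).one()
    return row.wt

before = fetch_writetime(session, row_id)
# ... Write B (the UPDATE) runs here ...
after = fetch_writetime(session, row_id)
print(before, after)  # in the failing case, both values were identical
```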
This confirmed that the database did not apply the update. It wasn’t a silent application error; the database itself was discarding the write.
5. The Smoking Gun: Trace Patterns
I decided to enable tracing for every query to see exactly which nodes were involved.
```python
session.execute(query, trace=True)
```

After running the reproduction script multiple times, a distinct pattern emerged. The silent failure only happened when Node B was the Coordinator for the update (the snippet after the list shows how I read the coordinator off each trace).
- Coordinator A -> Success.
- Coordinator C -> Success.
- Coordinator B -> Silent Failure.
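Reading the coordinator from each trace is straightforward with the driver's query-trace API (a sketch; `query` is whatever statement the reproduction script runs):

```python
result = session.execute(query, trace=True)
trace = result.get_query_trace()            # pulls the trace rows for this request

print("Coordinator:", trace.coordinator)    # the node that owned this write
for event in trace.events:
    print(event.source, event.description)  # per-node steps of the request
```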
Something was wrong with Node B. But what?
The Mechanics: Inside the Write Path
To understand why “Node B” matters, we need to look at the life of a write query in distributed databases like ScyllaDB and Cassandra.
1. The Setup (Coordinator)
When a write query arrives, the receiving node becomes the Coordinator.
- Replica Selection: It determines which nodes are replicas for that data using the Token Ring.
- The Timestamp Step: The Coordinator generates a mutation. Since `Client Side Timestamp` is `False`, the Coordinator assigns the timestamp using its own local clock.
Why Timestamps? In leaderless distributed databases, there’s no single node that decides “this write is correct.” When the same data is updated concurrently on different nodes, the database needs a deterministic way to resolve conflicts. Timestamps provide that mechanism: the write with the latest timestamp wins (“Last Write Wins”).
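In code terms, server-side timestamping is essentially this (a toy illustration of the contract, not the actual server implementation):

```python
import time

def coordinator_timestamp() -> int:
    # Microseconds since the Unix epoch, read from the coordinator's *local* clock.
    # Every millisecond of clock skew shifts this value by 1,000.
    return int(time.time() * 1_000_000)
```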
2. The Persistence (Replica Nodes)
The Coordinator sends the mutation to all replicas (including itself, if it is a replica). Each replica performs these steps:
- Mutation Validation: Checks schema and limits.
- CommitLog: Appends to the Write-Ahead Log (WAL) and `fsync`s to disk for durability.
- Memtable: Writes to the in-memory data structure for fast access.
- Ack: Sends acknowledgment to the Coordinator.
Once QUORUM is met (2 nodes), the client gets a Success response.
What if a node is down?
If a third node fails to acknowledge, the Coordinator stores a Hint to replay the mutation later. The client still receives a success response because QUORUM was met.
Note: Eventually, the in-memory Memtables are flushed to disk as immutable SSTables (Sorted String Tables). Background processes then compact (merge) these SSTables to discard old data (like overwritten updates) and improve read performance.
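For reference, the quorum size falls out of the usual majority formula (a quick sketch, independent of any driver API):

```python
def quorum(replication_factor: int) -> int:
    # Majority of replicas: floor(RF / 2) + 1
    return replication_factor // 2 + 1

assert quorum(3) == 2  # with RF=3, two acknowledgments suffice; a missing third becomes a hint
```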
3. The Retrieval (Read Path)
When we read that data back:
- The Coordinator requests data from `QUORUM` replicas (2 nodes).
  - Note: Whether the read requests go to 2 nodes (digest read) or 3 nodes depends on the `read_repair` setting.
- In each replica node, the database:
  - Reads from the Memtable.
  - Reads from SSTables.
  - Merge: Merges results from these sources using Timestamps. The data with the latest timestamp wins.
  - Returns the merged result to the Coordinator with its timestamp.
- The Coordinator Merge: The Coordinator compares results from the replicas.
  - If identical: Returns the result to the application.
  - If different: Merges them using Timestamps ("Last Write Wins") and returns the latest data.
  - It may optionally trigger an asynchronous Read Repair to update the stale replicas.
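The coordinator-side merge boils down to picking the newest timestamp (a toy model; real cells also carry TTLs and tombstones):

```python
# Model each replica response for a cell as (value, write_timestamp_micros).
def coordinator_merge(responses):
    # "Last Write Wins": the newest-stamped value is returned to the application;
    # stale replicas can then be fixed up asynchronously by read repair.
    return max(responses, key=lambda cell: cell[1])

print(coordinator_merge([("old value", 1_000_000), ("new value", 1_150_000)]))
```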
Connecting the Dots
From the Write/Read path, we can see that the timestamp is the single source of truth for conflict resolution.
- Read A: Returns data from `WRITE A`.
- Read B: ALSO returns data from `WRITE A` (ignoring `WRITE B`).
According to the "Last Write Wins" rule, this has only one mathematical explanation: the database must believe that `timestamp(WRITE B) <= timestamp(WRITE A)`.
Wait, how can the second write have an earlier timestamp?
`WRITE A` used Node A's clock. `WRITE B` used Node B's clock.
If Node B’s clock is slow, it assigns a timestamp from the past.
The Theory: Distributed Time
To understand exactly how this happened, we need to step back and look at how time works in a distributed cluster.
Physical vs. Logical Time
In modern distributed systems, time is not perceived as a single, absolute “now.” Each node has its own independent hardware (crystal oscillator), causing clocks to drift apart.
- Physical Time (Wall-clock): Tracks UTC. Unreliable for exact ordering across machines due to Clock Drift and Clock Skew.
- Logical Time (Causal Ordering): Focuses on the sequence of events (e.g., Lamport Clocks) to ensure that if Event A causes Event B, A has a lower timestamp.
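To make the logical-time idea concrete, here is a minimal Lamport clock (illustrative only; ScyllaDB and Cassandra do not use this for regular writes):

```python
class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self) -> int:
        # Local event: advance the counter.
        self.counter += 1
        return self.counter

    def receive(self, remote_counter: int) -> int:
        # On receiving a message, jump ahead of whatever the sender has seen,
        # so that "A caused B" always implies timestamp(A) < timestamp(B).
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter
```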
Synchronization Layers
To bridge the gap between physical drift and logical consistency, systems rely on synchronization protocols:
- NTP (Network Time Protocol): The default for most Linux systems. It synchronizes clocks by exchanging time-stamped messages with upstream time servers.
- PTP (Precision Time Protocol): A hardware-assisted protocol for sub-microsecond accuracy.
- HLC (Hybrid Logical Clock): A hybrid approach combining physical time with logical sequencing.
The Root Cause: Clock Skew
Microsecond Precision
ScyllaDB and Cassandra use microseconds since the Unix epoch for write timestamps. Even though we humans think in milliseconds or seconds, at this precision any clock skew larger than the gap between two competing writes (even a fraction of a millisecond) is enough to silently discard the later write.
The Simulation
Let’s verify my “Node B is slow” theory with a simplified simulation. We’ll use a baseline of 1,000,000 microseconds.
Scenario 1: Healthy Cluster (Expected Flow)
- All Nodes Clock:
1,000,000(Synced)
| Step | Time (Real) | Event | Coordinator | Assigned Timestamp | Result |
|---|---|---|---|---|---|
| 1 | t=0ms | WRITE A | Node A | 1,000,000 | Written |
| 2 | t=50ms | READ A | Node A | - | Returns A |
| 3 | t=150ms | WRITE B | Node B | 1,150,000 | Written (Success) |
| 4 | t=200ms | READ B | Node A | - | Returns B |
Scenario 2: Node B is 200ms Slow (Actual Flow)
- Node A Clock:
1,000,000(Healthy) - Node B Clock:
800,000(200ms behind)
| Step | Time (Real) | Event | Coordinator | Assigned Timestamp | Result |
|---|---|---|---|---|---|
| 1 | t=0ms | WRITE A | Node A | 1,000,000 | Written |
| 2 | t=50ms | READ A | Node A | - | Returns A |
| 3 | t=150ms | WRITE B | Node B | 950,000* | Written (?) |
| 4 | t=200ms | READ B | Node A | - | Returns A |
*Node B Calculation: 1,000,000 µs (base) + 150,000 µs (150 ms elapsed) - 200,000 µs (200 ms skew) = 950,000 µs.
The Database Decision:
When comparing 1,000,000 (Write A) vs 950,000 (Write B), the database correctly concludes that Write A is “newer”, even though Write B happened later in physical time. Write B is silently discarded.
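Plugging the scenario-2 numbers into that same Last-Write-Wins comparison makes the outcome mechanical:

```python
write_a = ("state from WRITE A", 1_000_000)  # stamped by Node A (healthy clock)
write_b = ("state from WRITE B",   950_000)  # stamped by Node B (200 ms slow)

# 200,000 µs of skew outweighs the 150,000 µs that actually elapsed between the
# two writes, so every merge picks WRITE A and WRITE B never becomes visible.
print(max([write_a, write_b], key=lambda cell: cell[1]))  # -> WRITE A
```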
The Smoking Gun
I decided to check the NTP synchronization status on Node B using `chronyc tracking`.

```
$ chronyc tracking
Reference ID    : 00000000 ()
Stratum         : 0
Ref time (UTC)  : Thu Jan 01 00:00:00 1970
System time     : 0.200453112 seconds slow of NTP time
Last offset     : 0.000000000 seconds
RMS offset      : 0.000000000 seconds
```

Discovery: The log confirmed my theory exactly. The System time was 0.200 seconds (200ms) slow.
Upon further investigation, I found that the chronyd daemon on Node B had been silently killed, leaving the clock to drift unchecked. Once I restarted chronyd, the nodes synchronized, and the silent failures vanished immediately.
Conclusion: When Last Write Wins, Time Matters
This investigation reinforced a critical lesson: In leaderless distributed databases, Time isn’t just metadata—it’s logic.
Key Takeaways
- Monitor Your Clocks: Don't assume NTP is working. Monitor clock drift (e.g., via Prometheus `node_exporter`) and set alerts for skew > 50ms.
- Client-Side Timestamps: If your application logic requires strict causal ordering across multiple writes, consider using Client-Side Timestamps (monotonic). This shifts the source of truth to the client, removing the dependency on individual node clocks (a driver-level sketch follows this list).
- Lightweight Transactions (LWT): For absolute correctness where "Last Write Wins" is insufficient, use LWT (`UPDATE ... IF ...`). This uses Paxos to guarantee order regardless of clocks, though it comes with a performance cost.
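A minimal sketch of the client-side timestamp option, assuming a recent `scylla-driver`/`cassandra-driver` where `MonotonicTimestampGenerator` is available (hosts, keyspace, and column names are illustrative):

```python
from cassandra.cluster import Cluster
from cassandra.timestamps import MonotonicTimestampGenerator

# With a timestamp_generator, write timestamps are produced on the client
# (monotonically increasing) instead of on whichever node happens to coordinate.
cluster = Cluster(
    contact_points=["node-a", "node-b", "node-c"],
    timestamp_generator=MonotonicTimestampGenerator(),
)
session = cluster.connect("my_keyspace")

# Where "Last Write Wins" is not good enough at all, an LWT trades latency for safety:
# session.execute("UPDATE table1 SET status = %s WHERE id = %s IF status = %s",
#                 [new_status, row_id, expected_status])
```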
In Distributed Systems, if you can’t trust your clocks, you can’t trust your data.
References & Further Reading
For those who want to dive deeper into distributed systems theory and the specific technologies mentioned:
- Distributed Systems Course (Martin Kleppmann) – YouTube Playlist. Essential viewing; the entire series is worth watching, but Lectures 3.1-4.1 (Time), 5.1-5.2 (Replication), and 7.1 (Consistency) are particularly relevant here.
- Designing Data-Intensive Applications (DDIA) – Martin Kleppmann. Chapter 8 ("The Trouble with Distributed Systems"), specifically the section on Unreliable Clocks.
- Amazon Dynamo Paper – "Dynamo: Amazon's Highly Available Key-value Store." The foundational architecture that inspired both Cassandra and ScyllaDB.
- Logical Clocks – "Time, Clocks, and the Ordering of Events in a Distributed System." Leslie Lamport's seminal paper on causal ordering.
- Apache Cassandra Architecture – Cassandra Architecture Guide. A deep dive into the write path, memtables, and SSTables.