Causal Models of Concurrency
Every programmer who has debugged a race condition has confronted a version of the same conceptual gap: the source code, which is sequential, does not determine the behaviour of the running programme, which is concurrent. What determines the behaviour is not the code in isolation but the causal structure of the execution: which events produced data that other events consumed, which updates preceded which reads, which messages arrived before which operations. The source code specifies a set of possible causal structures. The bug exists in some of them and not in others. A debugging tool that shows you only a single interleaved trace shows you one path through that set and offers no guarantee that it is the path on which the bug occurs.
This article builds the mathematical apparatus for reasoning about causal structures of concurrent executions from first principles. §1–§3 construct the operational layer: labelled transition systems, the four principal concurrency models they capture, and the bisimulation equivalence that identifies models with the same observable behaviour. §4 and §5 cross from the operational to the causal: the Mazurkiewicz independence quotient removes spurious interleaving order and event structures expose the causal and conflict structure directly. §6–§8 develop happens-before, logical clocks, and a measurable observability coefficient. §9–§11 describe the concrete artifact: a shared log format, a Python analyser for recovering causal structure from vector-clock-annotated traces, and a catalogue of the defect classes the analyser detects. §12–§15 survey the ecosystem, place each concurrency paradigm on a comparison matrix, and close with the implementation roadmap for a causal actor layer in Rust.
0. The Problem
Consider a shared integer counter initialised to zero. Two threads each read the current value, add one, and write back. On a sequentially consistent machine the only correct outcome is two: each read sees the result of the preceding write, and the operations serialise into a coherent sequence. But on any real machine, and on any runtime without explicit synchronisation, the following execution is possible:
Thread A Thread B
read → 0
read → 0
write 1
write 1Both threads read zero, both write one, and the counter ends at one. One increment is lost. The failure is not a programming error in the usual sense: both threads executed their logic correctly. The failure is a causal one. The write by thread A and the read by thread B are concurrent: neither precedes the other in the execution order, so thread B cannot see the result of thread A's write. The standard remedy, a mutex or an atomic compare-and-swap, imposes a causal dependency that forces the operations to serialise. But understanding why the remedy works, and what property it restores, requires a mathematical account of causal order.
The lost-update problem is one instance of a class of bugs that are invisible to sequential reasoning and visible only under a causal model. The others include the ABA problem in lock-free data structures, causally inconsistent snapshots in replicated databases, stale reads under weak memory models, and out-of-order message delivery in distributed actor systems. All of them share the same structure: an event that should have observed the result of a prior event instead observed an older state, because the causal dependency between the two events was absent from the execution. Making that structure precise is what this article is about.
1. Labelled Transition Systems
1.1. States, Actions, and Transitions
The foundational operational model of a reactive system is the labelled transition system, introduced by Plotkin in the structural operational semantics programme[1] and developed extensively in Milner's work on CCS.[2]
A labelled transition system (LTS) is a triple where is a set of states, is a set of actions (the alphabet), and is the transition relation. We write for . A distinguished initial state is usually assumed.
An action may be a read, a write, a lock acquisition, a message send, a message receive, or any other observable step. The special action denotes an internal (silent) step that produces no observable effect on the environment. The trace of an execution is the sequence of non- actions performed along a run . Two LTSs are trace-equivalent if they generate the same set of traces; they are bisimilar if they can simulate each other step by step (see §3).
1.2. Interleaving Parallel Composition
The parallel composition of two systems is the system in which either component may take a step at each moment, with the other remaining idle. This is the interleaving model of concurrency: the scheduler resolves nondeterminism by choosing which component acts next.
Given and with , their interleaving parallel composition is where is the smallest relation satisfying:
The left rule allows component 1 to act alone; the right allows component 2; the synchronisation rule (rightmost) allows both to perform silent steps together. For CCS-style synchronisation on complementary channel names and , a fourth rule produces from a matched send-receive pair.
The interleaving model has a critical deficiency: it orders actions that the underlying semantics leaves unordered. If two processes write to disjoint memory locations in parallel, the interleaving semantics produces two distinct traces, one for each scheduling order, even though the two writes are causally independent. This over-specification is the source of the combinatorial explosion in model checking and the reason why the independence-theoretic refinements of §4 are necessary.
1.3. The Reference Workload
For the examples throughout this article we use a four-event reference workload. Three processes exchange two messages. Process sends a message to both and ; process receives from and sends a reply to ; process receives from both. The six events are:
The happens-before partial order on this workload, derived in §6, will serve as the running example for the clock, observability, and analyser sections.
2. Concurrency Models as LTSs
2.1. Shared Memory with Locks
The classic model of concurrent programming on shared-memory multiprocessors uses a global heap of typed memory locations together with mutual exclusion primitives. A mutex is an abstract resource with two actions: and . At most one thread holds at any point; the lock acquisition blocks until the resource is free.
In LTS terms, a mutex has two states and transitions and . The parallel composition of threads and a shared heap of locations under mutexes produces a state space of size where is the value domain size, because the heap state cross-products with the thread scheduling nondeterminism. This state-space explosion is the fundamental obstacle to model checking shared-memory programmes at scale.
A critical section is a sequence of actions bracketed by and . The causal discipline imposed by a mutex is exactly the serialisation of critical sections: the happens-before order includes the edge from the unlock of any critical section to the lock of the next critical section on the same mutex. This is the minimal causal structure that prevents two threads from simultaneously observing the shared state.
2.2. Lock-Free Atomics and Compare-and-Swap
Lock-free data structures replace mutex-based critical sections with atomic read-modify-write instructions. The core primitive is compare-and-swap (CAS): atomically reads location , compares the value to , and writes only if the comparison succeeds, returning a boolean indicating success.
CAS is susceptible to the ABA problem: between a read of value and the subsequent CAS expecting , another thread may have written and then written again. The CAS succeeds even though the location has been modified, because it observes only the current value, not the causal history. The standard remedy is a stamped reference: pair the value with a monotonically increasing version counter, so and are distinct even when the values agree. The version counter makes the causal history partially observable in the data itself.
In LTS terms, CAS is a single atomic transition: the state in which transitions on action to the state in which , with no intermediate states in which other threads may observe the location. The atomicity means the action is indivisible in the interleaving semantics: no other thread's action can be interleaved with it.
Under weak memory models (x86 TSO, ARM, RISC-V RVWMO), CAS with sequentially consistent ordering inserts memory fences that synchronise the store buffers of all participating cores, establishing happens-before edges between the CAS and all preceding and following operations. The formal memory model specifies which fences are generated by which orderings, and the C++ memory_order annotations select among them.
2.3. Actors and Channels
The actor model, introduced by Hewitt, Bishop, and Steiger in 1973[3] and formalised by Agha,[4] replaces shared mutable state with isolated processes that communicate by passing immutable messages. Each actor has a private mailbox; sending is asynchronous and non-blocking; receiving is the only point at which an actor observes the external world. There is no shared heap.
In LTS terms, an actor system is a parallel composition in which each actor is an LTS over the alphabet of message types it can send and receive. The synchronisation rule matches sends from one actor with receives at another, producing the synchronisation steps in the composed LTS. Because actors share no memory, the only causal dependencies between actors are the message-passing dependencies: a receive event is causally downstream of the matching send event.
Erlang and Elixir build their runtime on this model. BEAM (Bogdan's Erlang Abstract Machine) implements actors as lightweight processes scheduled cooperatively across a thread pool, with process-level garbage collection and preemption based on reduction counts. The OTP supervision tree provides a principled fault-tolerance layer: supervisors restart failed actors under configurable strategies. The causal structure visible at the BEAM runtime level is the mailbox FIFO order within each actor: messages from a given sender arrive at a given receiver in the order sent. Between distinct sender-receiver pairs, no ordering is guaranteed.
2.4. Async and Cooperative Scheduling
Modern async runtimes (Tokio in Rust, asyncio in Python, Node.js) implement cooperative multitasking on a shared thread pool. A task is a Future that yields control at each await point, allowing the scheduler to run other tasks. Unlike threads, tasks do not share a stack: each await point is an explicit suspension, and the scheduler maintains a queue of runnable tasks.
The LTS of an async system has states that include the scheduler queue, the state of all parked futures, and the shared heap. Transitions correspond to: a task making progress until the next await; a task waking up after its awaited future resolves; or an I/O event waking a parked task. The interleaving semantics is coarser-grained than for threads: task switches occur only at await points, so the state space between two consecutive awaits in a task is effectively sequential.
The causal structure of an async system is richer than it appears. An await on a channel receive creates a happens-before edge from the send to the receive; an await on a JoinHandle creates an edge from the spawned task's completion to the awaiting task; a Mutex::lock() in an async context creates the same causal serialisation as its synchronous counterpart. The difficulty is that these edges are often implicit and invisible to standard tracing tools, which report wall-clock timestamps rather than causal order.
3. Bisimulation and Expressiveness
3.1. Strong Bisimulation
Trace equivalence is too coarse an equivalence for concurrent systems: two LTSs may generate the same set of traces while behaving differently under composition with other systems. The correct notion of behavioural equivalence for reactive systems is bisimulation, introduced by Park[5] and developed by Milner.[2]
A relation on the states of a single LTS (or relating states of two LTSs over the same alphabet) is a strong bisimulation if, whenever :
(i) for every , there exists with ; and
(ii) for every , there exists with .
Two states are strongly bisimilar, written , if there exists a strong bisimulation containing .
Strong bisimilarity is a congruence: if then for any context . This is the key property that trace equivalence lacks: trace equivalence does not compose under parallel contexts, so it is not a sound basis for component-level reasoning about concurrent systems.
Weak bisimulation (Milner, 1989)[2] abstracts over silent steps. Write for the weak transition that performs any number of -steps, then an -step, then any number of -steps. A weak bisimulation requires: for every , there exists with the states bisimilar; and symmetrically. Weak bisimilarity, written , equates any two systems that have the same observable behaviour up to internal computation.
3.2. Expressiveness and Inter-encodings
A classical question in concurrency theory is whether the different models are expressively equivalent: can shared-memory concurrency simulate the actor model, and vice versa? The answer is nuanced and depends on the synchronisation primitives available.
Shared memory with CAS can simulate any actor system: represent each actor's mailbox as a lock-free queue, implement send as an enqueue, and implement receive as a dequeue. The simulation introduces a spin on empty mailboxes that does not exist in the actor model, but it is observationally equivalent. Actor systems can simulate shared-memory systems: represent a shared location as a dedicated actor that receives read and write requests and serialises them through its mailbox. Both simulations are faithful in the sense of weak bisimulation.
However, the simulations have different complexity profiles. The shared-memory simulation of actors introduces queue data structures with associated contention; the actor simulation of shared memory introduces a message round-trip for every memory access. These overheads are not artefacts of the simulation: they reflect genuine differences in the cost model of each paradigm. The theoretical inter-encodibility does not imply practical equivalence.
4. Independence and Mazurkiewicz Traces
4.1. The Independence Relation
The interleaving model imposes a total order on concurrent actions. The Mazurkiewicz trace theory,[6] developed in the 1970s and 1980s, recovers from this over-specification by quotienting out the spurious order: two sequences of actions are declared equivalent if one can be obtained from the other by swapping adjacent independent actions.
Given an alphabet , an independence relation is a symmetric, irreflexive relation . The dependence relation is its complement restricted to : . Actions with may be swapped freely; actions with are ordered and may not.
The standard independence relation for shared-memory systems declares two actions independent if and only if they access disjoint sets of memory locations, or both are reads (reads cannot conflict with each other). Two writes to the same location are always dependent; a read and a write to the same location are dependent. For message-passing systems, a send and a receive on different channels are independent; a send and the matching receive on the same channel are dependent.
4.2. Trace Monoids and Trace Equivalence
Given an independence relation , the trace equivalence on is the smallest congruence (with respect to concatenation) satisfying whenever . A trace is an equivalence class . The trace monoid is , a monoid under concatenation of representatives.
Every trace in has a unique representative in Foata normal form: a sequence of steps where each step is a set of mutually independent actions, and every action in depends on at least one action in . This form is the direct algebraic precursor to the Hasse diagram of §6.
The significance of the trace monoid for verification is that a property of a concurrent system stated as a predicate on traces is invariant under if and only if the property does not depend on the order of independent actions. Such properties can be verified on a single representative of each equivalence class, collapsing the interleaving explosion: instead of checking all interleavings of independent actions, one checks a single representative trace. This is the theoretical basis for partial-order reduction in model checkers.
5. Event Structures and the Bridge Theorem
5.1. Prime Event Structures
A trace captures the sequential possibilities of a concurrent computation. An event structure captures the concurrent structure directly, without choosing a representative interleaving.
A prime event structure (Winskel, 1987)[7] is a triple where:
- is a countable set of events;
- is the causality relation: a partial order that is well-founded (every downward-closed set is finite) and left-finite ( is finite for every );
- is the conflict relation: irreflexive and symmetric, and conflict-hereditary: if and then .
Two events with stand in the causal order: can only occur if has already occurred. Two events are in conflict: at most one of them can occur in any single execution. Events that are neither causally ordered nor in conflict are concurrent.
A configuration of is a finite subset that is:
- conflict-free: no two events in are in conflict;
- downward-closed (causally closed): if and then .
The set of all configurations, ordered by inclusion, forms a prime algebraic domain.
The configurations of an event structure represent the partial executions of the system. Adding an event to a configuration yields a larger configuration whenever is enabled: all causal predecessors of are already in , and is not in conflict with any event already in .
5.2. The Bridge Theorem
The trace model and the event structure model are not competing formalisms; they describe the same mathematical object from two perspectives. The bridge between them is the following construction.
Let be an LTS with alphabet and independence relation . Define the event structure as follows: events are pairs where is a reachable prefix of a trace and extends to a reachable trace; causality is the prefix order modulo ; conflict is the branching in the quotient. Then the set of configurations of is isomorphic (as a partially ordered set) to the set of traces of ordered by prefix, with concurrent actions identified.[7]
The practical implication of the bridge theorem is that the happens-before partial order on the events of a single execution is not merely an operational convenience: it is the restriction of the causality relation of the underlying event structure to the events that occurred in that execution. The Hasse diagram of the happens-before order on an execution trace is a configuration of the system's event structure.
6. Happens-Before and Logical Clocks
6.1. Lamport's Happens-Before
In his 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System,"[8] Leslie Lamport defined the happens-before relation for distributed systems in terms of three axioms: process order, message order, and transitivity.
Let be the set of events in a distributed execution, partitioned into process-local sequences . The happens-before relation is the smallest transitive relation satisfying:
- Process order: if and are events in the same process and precedes in the process's execution sequence, then ;
- Message order: if and for the same message , then .
Two events with and are concurrent, written .
The happens-before relation is a strict partial order on : it is irreflexive (no event precedes itself) and transitive by definition. It is not a total order: concurrent events are incomparable. The Hasse diagram of is the transitive reduction: the graph obtained by removing all edges for which a path exists through an intermediate event .
6.2. Lamport Scalar Clocks
A Lamport clock is a function that is consistent with the happens-before order: implies . Lamport gave an algorithm for computing such a clock in a distributed system without shared memory:
- Each process maintains a local counter, initialised to zero.
- On every event, the counter is incremented.
- On sending a message, the current counter value is included in the message.
- On receiving a message with timestamp , the local counter is set to .
The Lamport clock satisfies but not the converse: does not imply . Two concurrent events may receive different clock values, and a larger scalar clock value does not establish causality. This gap makes Lamport clocks insufficient for recovering the full happens-before order from a log: knowing only scalar timestamps, one cannot distinguish a genuine causal dependency from a coincidental ordering of concurrent events.
6.3. Vector Clocks
Vector clocks, independently developed by Fidge[9] and Mattern,[10] close the gap: they satisfy both directions of the implication.
A vector clock for a system of processes is a function . Write for componentwise comparison: for all . Write if and . The vector clock is maintained as follows:
- Each process initialises its vector to .
- On each event at process , increment by one.
- On sending a message, include the current vector in the message.
- On receiving a message with vector , set for all , then increment .
Under the vector clock assignment above, for any two events and :
That is, event happens before event if and only if the vector clock of is strictly dominated componentwise by the vector clock of . Two events are concurrent if and only if their vectors are incomparable.
This characterisation means that a log of events annotated with vector clocks contains complete causal information: the happens-before partial order is recoverable exactly from the clock values, without any additional communication or synchronisation. The cost is proportional to : each message carries a vector of integers, where is the number of processes. For large distributed systems with hundreds of processes, this overhead can be significant; interval tree clocks and plausible clocks offer more compact representations at the cost of tracking weaker approximations.
6.4. The Order Theory of Logical Time
The happens-before partial order on a finite execution is a finite poset. The order theory of finite posets is rich and connects the causal structure of executions to deep combinatorial and lattice-theoretic results.
Every partial order has at least one linear extension: a total order consistent with the partial order. More precisely, if is a partial order and and (that is, and are incomparable), then there exists a total order on such that for all , and either or .[11]
The Szpilrajn theorem is the foundation of the linearisability notion in §7: a concurrent execution is linearisable if there exists a linear extension of its happens-before order that satisfies the sequential specification of the data structure.
In any finite poset, the minimum number of chains needed to partition the poset equals the maximum size of an antichain.[12]
In the context of concurrent executions, Dilworth's theorem says: the minimum number of sequential threads needed to explain the execution equals the maximum number of mutually concurrent events at any moment. An execution with a wide antichain — many pairwise concurrent events — cannot be explained by a single sequential process.
Counting the number of linear extensions of a finite partial order is -complete.[13]
The Brightwell-Winkler result has a direct implication for deterministic replay: the number of distinct interleavings consistent with a given causal structure is exponential in the width of the poset, and counting them exactly is computationally intractable. Deterministic replay does not enumerate all interleavings; it re-executes a single one. But understanding the causal structure is prerequisite to identifying the specific interleaving to replay.
For any finite distributive lattice , there is a unique (up to isomorphism) finite poset such that , the lattice of order ideals of ordered by inclusion.[14]
The configurations of an event structure ordered by inclusion form a distributive lattice. Birkhoff's theorem says that this lattice uniquely encodes the underlying causal poset: the event structure can be recovered from its configuration lattice. This gives the causal structure of an execution a natural algebraic representation.
The happens-before partial order of a distributed system is structurally identical to the causal order of relativistic spacetime events. Malament's theorem (1977)[15] proves that in Minkowski spacetime, the causal partial order uniquely determines the conformal structure: the metric up to a positive scaling factor. This is the mathematical foundation of the causal set programme for discrete quantum gravity, which proposes that spacetime is a locally finite poset and that the causal order is the fundamental physical structure. The parallel with distributed systems is not a metaphor: both are instances of the same mathematical object, a locally finite partial order on a set of events.
6.5. Hybrid Logical Clocks
In production systems, wall-clock time is a useful contextual annotation even though it is causally unreliable (NTP skew, leap seconds, clock drift). The hybrid logical clock (HLC), introduced by Kulkarni, Demirbas, Madeppa, Avva, and Leone (2014),[16] combines the causality guarantee of Lamport clocks with a best-effort approximation of wall-clock time.
An HLC timestamp is a pair where tracks the maximum physical time observed and is a counter that breaks ties when multiple events share the same . The update rules ensure that implies the HLC of is strictly less than the HLC of in the lexicographic order on , and that stays close to physical time. HLC timestamps are suitable for use in systems where debugging requires correlating causal logs with external time series (metrics, alerts), because they provide a causally valid ordering that is also approximately temporally interpretable.
7. Correctness as Predicates over Pomsets
7.1. Partially Ordered Multisets
A pomset (partially ordered multiset) is the natural data type for a concurrent execution: a finite set of event occurrences, each labelled by an action, partially ordered by the happens-before relation.
A pomset is a triple where is a finite set of event occurrences, is a partial order on , and is a labelling function. The labelling assigns to each occurrence the action it performs (a read of value from location , a write of value to location , a send of message , etc.).
A correctness condition for a concurrent data structure or system is a predicate on the pomset of its executions. The major correctness conditions form a hierarchy of decreasing strength.
7.2. Linearisability
Linearisability, introduced by Herlihy and Wing (1990),[17] is the gold standard for shared-memory concurrent objects. It requires that every concurrent execution appear as if the operations were executed atomically in some sequential order consistent with their real-time overlap.
A history of operations on a concurrent object is linearisable if there exists a sequential history such that:
- is a legal sequential execution of the object (it satisfies the object's sequential specification);
- is consistent with the real-time order of : if operation completes before operation begins in , then precedes in .
The sequential history assigns each operation a linearisation point: an instant within its execution interval at which it appears to take effect atomically.
Linearisability is a local property: a system of linearisable objects, each independently satisfying the condition, is itself linearisable. This compositionality makes it the correct correctness condition for library-level concurrent data structures. The std::sync::Mutex in Rust, the java.util.concurrent collections, and the chan primitive in Go are all specified to be linearisable.
7.3. Sequential Consistency
Sequential consistency, defined by Lamport (1979),[18] is the correctness condition for shared-memory multiprocessors. It is strictly weaker than linearisability: it requires a valid sequential order consistent with each process's local order, but not with the real-time order between processes.
An execution is sequentially consistent if there exists a total order on all operations that:
- is consistent with the program order of each individual process (each process's operations appear in execution order); and
- the value returned by each read is the value written by the most recent write to the same location in the total order.
Sequential consistency does not compose: a system of sequentially consistent components is not necessarily sequentially consistent as a whole. This non-compositionality was one of Herlihy and Wing's motivations for defining linearisability.
7.4. Causal Consistency
Causal consistency (Ahamad, Neiger, Burns, Kohli, and Hutto, 1995)[19] is the consistency model appropriate for geo-distributed systems where linearisability and sequential consistency are too expensive to enforce across wide-area networks. It requires only that causally related writes be seen in the same order by all processes; concurrent writes may be seen in different orders at different replicas.
An execution is causally consistent if, for any two writes and to the same location such that (w1 causally precedes w2), every process sees before . Concurrent writes (writes with ) may be seen in any order by different processes.
The strictness hierarchy for shared data is: linearisability implies sequential consistency implies causal consistency implies eventual consistency. Each relaxation trades correctness for availability under partition, as formalised by the CAP theorem. The appropriate choice depends on the application: a financial ledger requires linearisability; a social media feed can tolerate causal consistency; a CDN serving static assets needs only eventual consistency.
8. The Observability Coefficient
8.1. Definition
The observability coefficient measures how much of the true causal structure of an execution is recoverable from the instrumentation applied to it. It is defined as the recall of the true Hasse diagram of the happens-before order under the chosen instrumentation regime.
Let be the true happens-before partial order of an execution, and let be its Hasse diagram (transitive reduction). Under instrumentation regime , let be the set of Hasse edges recoverable from the log produced by regime . The observability coefficient is:
When (a single-event execution), define .
Three instrumentation regimes are of practical interest:
No instrumentation (): the log contains only wall-clock timestamps and actor identifiers, no causal annotations. No Hasse edges are recoverable because wall-clock order is unreliable and does not imply causal order. Thus .
Program order only (): the log includes a per-actor sequence number. This recovers the within-actor Hasse edges (adjacent events in the same process's execution sequence) but not the inter-actor message edges. Thus where is the set of within-actor Hasse edges.
Vector clock (): the log includes a full vector clock for every event. By the characterisation theorem of §6.3, this recovers the complete happens-before order.
Under vector-clock instrumentation, : the complete Hasse diagram of the happens-before order is recoverable from the log.
The proof is immediate from the characterisation theorem: , so the transitive reduction can be computed from the vector clocks by the hasse() algorithm of §10.
8.2. Observability in BEAM Systems
BEAM-based systems (Erlang, Elixir) offer exceptional runtime observability through native primitives: :sys.get_state/1 for inspecting any process's state, Process.info/2 for memory, message queue length, and link topology, and recon_trace for attaching lightweight function-call tracers to running production processes without restart. The distributed Erlang net_kernel allows tracing across nodes with the same tools.
However, these tools expose the operational layer, not the causal layer. :sys.trace reveals the sequence of messages an actor received, in the order they were delivered to its mailbox. It does not reveal the vector clock of each message, the happens-before structure relating messages across actors, or whether a particular read causally depended on a particular earlier write. The observability coefficient of a production BEAM system with default instrumentation is determined by the mailbox FIFO property: within a single sender-receiver pair, messages arrive in order, giving partial recovery of the within-pair causal edges. Across pairs, without vector clocks, .
The contrast with a Rust actor system instrumented with vector clocks is stark: against in all but the trivial case. The observability coefficient is thus a quantitative criterion for comparing instrumentation strategies across runtimes.
9. The Shared Log Format
9.1. Schema
The analyser ingests a newline-delimited JSON log (NDJSON): one JSON object per line, one line per event. The schema is:
{
"id": "string (globally unique event identifier)",
"actor": "string (process or thread name)",
"seq": "integer (monotone within the actor, 1-indexed)",
"vclock": {"actor_name": integer, ...},
"action": "string (human-readable description of the event)",
"ts_wall": "string (ISO-8601 wall clock, informational only)"
}The vclock field is the authoritative causal annotation. The ts_wall field is informational: it is useful for correlating with external time series but must not be used for causal reasoning. The seq field encodes program order within the actor and is sufficient for the program_order regime.
9.2. The POSIX Clock Caveat
Wall-clock time from CLOCK_REALTIME (or SystemTime::now() in Rust, time.time() in Python) is subject to NTP adjustments, leap-second handling, and clock skew between nodes. It is not monotone across events on different machines and may not be monotone within a single machine if the system clock is stepped backward. Using wall-clock order as a proxy for causal order is a source of heisenbugs: the log appears causally ordered in testing (where the clock skew is small and NTP is stable) and appears out of order in production (where nodes diverge under load). The ts_wall field in the log format is intentionally separate from the vclock field to make this distinction structurally explicit.
10. The Analyser
10.1. Happens-Before from Vector Clocks
The Python analyser in concurrency-causality/analyzer/ implements the causal analysis pipeline. The core computations are in two modules: order.py for the causal order and Hasse diagram, and observability.py for the observability coefficient.
The happens_before function implements the vector clock characterisation theorem directly:
def vc_leq(a: dict[str, int], b: dict[str, int]) -> bool:
for k in a.keys() | b.keys():
if a.get(k, 0) > b.get(k, 0):
return False
return True
def happens_before(e: Event, f: Event) -> bool:
if e.vclock == f.vclock:
return False
return vc_leq(e.vclock, f.vclock)The vc_leq function is the componentwise comparison on maps, with missing keys interpreted as zero. Two events are ordered if and only if vc_leq holds and their clocks differ. This is exactly the characterisation .
10.2. The Hasse Diagram
The hasse function computes the transitive reduction of the happens-before order:
def hasse(events: list[Event]) -> set[tuple[str, str]]:
index = {e.id: e for e in events}
ids = list(index)
order: set[tuple[str, str]] = set()
for a in ids:
for b in ids:
if a != b and happens_before(index[a], index[b]):
order.add((a, b))
reduced = set(order)
for a, c in order:
for b in ids:
if b != a and b != c and (a, b) in order and (b, c) in order:
reduced.discard((a, c))
break
return reducedThe algorithm is in the number of events: quadratic to build the full order, cubic to remove transitive edges. For the typical trace sizes encountered in debugging sessions (hundreds to low thousands of events), this is fast enough for interactive use. For production-scale traces with millions of events, a more efficient sparse algorithm using depth-first search would be appropriate.
10.3. The Observability Coefficient in Code
The omega function in observability.py implements the three instrumentation regimes:
def omega(events: list[Event], regime: str) -> float:
true_hasse = hasse(events)
if not true_hasse:
return 1.0
if regime == "none":
recovered: set[tuple[str, str]] = set()
elif regime == "vclock":
recovered = set(true_hasse)
elif regime == "program_order":
by_actor: dict[str, list[Event]] = {}
for e in events:
by_actor.setdefault(e.actor, []).append(e)
recovered = set()
for actor_events in by_actor.values():
ordered = sorted(actor_events, key=lambda e: e.seq)
for a, b in zip(ordered, ordered[1:]):
edge = (a.id, b.id)
if edge in true_hasse:
recovered.add(edge)
return len(recovered) / len(true_hasse)The program_order regime recovers within-actor adjacent edges: pairs of consecutive events in the same actor's sequence that are also in the Hasse diagram. Not every pair of consecutive events at the same actor is a Hasse edge: if a later event at a different actor sends a message that arrives between two consecutive events at this actor, the message reception may create a more direct causal path, making the adjacent within-actor edge redundant in the transitive reduction.
11. Failure Mode Detection
The analyser detects four classes of causal defects from the log.
Stale read. A stale read occurs when an event reads location and there exists a write event to the same location with (f happens before e) but does not read the value written by . This means the read observed an older value than the one causally available to it. Detection: for every read event, find the set of write events that happen before it and check whether the read's observed value matches the most recent such write in the causal order.
Write-write conflict. A write-write conflict occurs when two events and both write to the same location and (they are concurrent). Under any sequential consistency model, last-write-wins is undefined for concurrent writes: the outcome depends on the scheduler. Detection: find pairs of write events to the same location whose vector clocks are incomparable.
Causally inconsistent snapshot. A causally inconsistent snapshot occurs when a process reads from two sources whose writes are mutually concurrent, in a way that violates the causal order visible to that process. Detection: for each actor, check that the set of write events it has observed by each read event forms a downward-closed set in the happens-before order: observing the effect of write requires observing the effects of all writes .
Clock regression. A clock regression occurs when an incoming message carries a vector clock that is not componentwise greater than or equal to the recipient's current clock minus the recipient's own component. This indicates a violation of the vector clock update protocol: either a component is dropping the max step on receive, or messages are being replayed out of order. Detection: check the clock update invariant on every receive event in the log.
12. The Rust Ecosystem and the Three-Layer Design
12.1. Actor Runtimes in Rust
The Rust async ecosystem has several actor frameworks, each with a different design philosophy.
kameo is a Tokio-native async actor library with typed messages: each actor declares its message types via a Message trait implementation, and the compiler statically verifies that only the declared message types are sent to each actor. Messages are processed one at a time from the actor's inbox, ensuring single-threaded access to the actor's state. kameo supports supervision, remote actors over network transports, and graceful shutdown. Its async-native design means actors are zero-cost when idle: a parked actor holds no OS thread.
ractor is an OTP-influenced actor framework that supports both synchronous and asynchronous message handling. It introduces the notion of an Actor trait with pre_start, handle, and post_stop lifecycle hooks, mirroring the Erlang gen_server callback model. ractor actors can form supervision hierarchies, and the supervisor restart strategy mirrors OTP's one_for_one, one_for_all, and rest_for_one strategies.
actix and actix-web were the dominant Rust actor frameworks in the 2019-2021 era. actix-web now uses Tokio directly for its HTTP runtime and has largely moved away from the actor model for request handling; the actor abstraction is maintained for compatibility but is no longer the primary abstraction.
Other crates of interest in this space: thingbuf for lock-free MPSC ring buffers with async support; rtrb for wait-free SPSC; shuttle for randomised, seed-replayable schedule exploration; madsim for seeded deterministic Tokio simulation with mocked I/O; and turmoil for single-thread network simulation. The OpenTelemetry Rust SDK provides the sink layer for causal metadata export described in §12.2.
Neither kameo nor ractor natively instruments their messages with causal clocks. Message delivery preserves the FIFO order within a channel, but no vector clock is attached to messages and no Hasse diagram is recoverable from a default log. The observability coefficient of a production Rust actor system under these frameworks is : within-actor ordering is observable, inter-actor causal structure is not.
12.2. The Three-Layer Causal Instrumentation Design
The gap between the theoretical observability achievable with vector clocks and the practical observability of current Rust actor frameworks motivates a three-layer instrumentation design.
Layer 1: Transport. The actor runtime (kameo, ractor, or a custom implementation) handles message delivery, scheduling, and lifecycle. This layer is not modified. The invariant at this layer is: messages are delivered in FIFO order per channel, and the actor's state is accessible only from within the actor's message handler.
Layer 2: Clock envelope. Every outgoing message is wrapped in a CausalEnvelope<M>:
struct CausalEnvelope<M> {
vclock: HashMap<ActorId, u64>,
payload: M,
}The actor maintains a local vector clock as an Arc<Mutex<HashMap<ActorId, u64>>>. On every send, the local clock is incremented, and the current clock is stamped into the envelope. On every receive, the local clock is updated by max with the envelope's clock, then incremented. This is the vector clock update protocol of §6.3 in Rust.
Layer 3: OpenTelemetry sink. After processing each message, the actor emits an OpenTelemetry span with the causal metadata: actor ID, sequence number, vector clock snapshot, and a description of the action. The OTel span is exported to a collector that stores events sorted by their logical stamp. Because the vector clock is causally valid, the sorted order is a topological sort of the happens-before partial order, and the analyser of §10 can reconstruct the full Hasse diagram from the collector's output.
The design decouples causal instrumentation from the actor runtime: any actor framework that delivers messages can be wrapped with the clock envelope layer without modifying the framework's internals. The overhead is one hash map operation per send and receive (the max update), plus the OTel span emission, which can be asynchronous.
12.3. The BEAM Contrast in Depth
Erlang and Elixir's runtime observability tools deserve careful assessment, because they offer a qualitatively different kind of visibility from what vector clocks provide.
:sys.get_state/1 suspends the target process and returns its internal state. This is a synchronous introspection: the observer sees a consistent snapshot of the actor's state at the cost of a brief pause. There is no Rust equivalent that works without modifying the actor, because Rust's ownership model prevents external code from accessing an actor's state while the actor holds a mutable reference to it.
recon_trace attaches a trace to running production processes based on function name patterns, sampling rate, and a maximum message count, without restarting the process. This is possible because BEAM compiles to a bytecode that includes source metadata, and the VM's trace infrastructure is part of the runtime, not a separate tool. Rust's compiled native binaries do not have an analogous built-in tracing API; the closest analogues are perf probes and DTrace USDT probes, which require symbol table information and root access.
The BEAM tools excel at operational observability: what is the actor doing right now, how full is its mailbox, what messages has it received recently. They are weaker at causal observability: which message caused which side effect, whether a read observed the value written by a causally prior write, whether two concurrent writes produced a conflict. The three-layer design is not a replacement for BEAM's operational tools; it is a complement that adds causal structure to a layer of observability that both ecosystems currently lack.
13. Comparison Table
The five major concurrency paradigms considered in this article differ on six causal-observability axes:
| Paradigm | Causal model | Compositionality | Default | Max | Replay | Formalised |
|---|---|---|---|---|---|---|
| Shared memory / mutex | Serialised critical sections | Yes (linearisability) | 0 | 1 (VC on mutex ops) | Scheduler control | RustBelt, C++ memory model |
| Lock-free atomics (CAS) | Acquire-release fences | Yes (linearisability) | 0 | 1 (VC on sync ops) | Hard (requires hardware replay) | C++ memory model, LLVM IR |
| Actors (BEAM) | Mailbox FIFO per pair | Partial (OTP supervision) | 1 (add VC to messages) | Possible (record mailbox order) | CSP, process algebra | |
| Actors (Rust: kameo, ractor) | Channel FIFO | Partial (typed messages) | 1 (three-layer design) | Possible (record channel order) | None (in progress) | |
| Async cooperative (Tokio) | Await-point boundaries | Yes (task-level) | 0 | 1 (VC on channel ops) | Possible (deterministic scheduler) | None (in progress) |
Here denotes the program-order observability coefficient: the fraction of Hasse edges that are within-actor adjacent edges, which varies by execution and is strictly less than 1 whenever inter-actor causal edges are present in the Hasse diagram.
The formalisation status column reflects the current state of the field. Shared-memory concurrency under C++ memory model semantics is formalised and has machine-checked proofs (RustBelt, the Iris framework). The BEAM actor model is formalised in terms of process algebra and CSP but not with respect to causal observability. Rust actor frameworks and async runtimes have no published formal account of their causal semantics.
14. Interactive Exploration
The following widget renders the happens-before Hasse diagram for four predefined execution scenarios (the §1.3 reference workload, the lost-update race, a mutex-serialised execution, and an actor chain), with the observability coefficient computed live for each instrumentation regime. The computation runs in a Rust/WASM module compiled into this page.
15. Coda
The programme laid out in this article connects three levels: the operational level of labelled transition systems and their bisimulation equivalences, the trace-algebraic level of Mazurkiewicz independence and event structures, and the causal level of happens-before partial orders and logical clocks. These are not three separate theories; they are three presentations of the same structure, related by the bridge theorem of §5.2 and the vector clock characterisation of §6.3.
The practical payoff is the observability coefficient: a single number that quantifies how much causal structure a given instrumentation regime recovers. Under no instrumentation, no causal structure is visible. Under program-order instrumentation, the within-actor structure is visible. Under vector-clock instrumentation, the complete structure is visible. The three-layer actor design in §12.2 is the concrete Rust implementation of vector-clock instrumentation, and the analyser in §10 is the concrete tool for computing the coefficient and detecting causal defects from the resulting logs.
Several directions remain open. First, the formal Lean 4 formalisation of the bridge theorem (§5.2) and the vector clock characterisation theorem (§6.3) would give machine-checked proofs of the results on which the analyser correctness depends. Second, the three-layer design is a library specification; the causal_actor crate is the planned Rust implementation, to be published on crates.io. Third, cross-language trace correlation: a system in which a Rust service calls a Python subprocess which calls an Erlang node would benefit from a shared NDJSON log format that spans all three runtimes, with a common vector clock namespace keyed by actor string identifiers. The format of §9 is designed with this in mind.
References
- G. D. Plotkin, "A Structural Approach to Operational Semantics," Report DAIMI FN-19, Aarhus University, 1981.
- R. Milner, Communication and Concurrency. Prentice Hall, 1989.
- C. Hewitt, P. Bishop, and R. Steiger, "A Universal Modular ACTOR Formalism for Artificial Intelligence," in Proc. 3rd IJCAI, 1973, pp. 235–245.
- G. Agha, Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, 1986.
- D. Park, "Concurrency and Automata on Infinite Sequences," in Theoretical Computer Science, Lecture Notes in Computer Science, vol. 104. Springer, 1981, pp. 167–183. doi:10.1007/BFb0017309
- A. Mazurkiewicz, "Trace Theory," in Advances in Petri Nets 1986, Lecture Notes in Computer Science, vol. 255. Springer, 1987, pp. 279–324. doi:10.1007/3-540-17906-2_30
- G. Winskel, "Event Structures," in Advances in Petri Nets 1986, Lecture Notes in Computer Science, vol. 255. Springer, 1987, pp. 325–392. doi:10.1007/3-540-17906-2_31
- L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Commun. ACM, vol. 21, no. 7, pp. 558–565, July 1978. doi:10.1145/359545.359563
- C. J. Fidge, "Timestamps in Message-Passing Systems That Preserve the Partial Ordering," in Proc. 11th Australian Computer Science Conf., 1988, pp. 56–66.
- F. Mattern, "Virtual Time and Global States of Distributed Systems," in Parallel and Distributed Algorithms. North-Holland, 1989, pp. 215–226.
- E. Szpilrajn, "Sur l'extension de l'ordre partiel," Fundamenta Mathematicae, vol. 16, pp. 386–389, 1930.
- R. P. Dilworth, "A Decomposition Theorem for Partially Ordered Sets," Ann. Math., vol. 51, no. 1, pp. 161–166, 1950. doi:10.2307/1969503
- G. Brightwell and P. Winkler, "Counting Linear Extensions," Order, vol. 8, no. 3, pp. 225–242, 1991. doi:10.1007/BF00383444
- G. Birkhoff, "Rings of Sets," Duke Math. J., vol. 3, no. 3, pp. 443–454, 1937. doi:10.1215/S0012-7094-37-00334-X
- D. Malament, "The Class of Continuous Timelike Curves Determines the Topology of Spacetime," J. Math. Phys., vol. 18, no. 7, pp. 1399–1404, 1977. doi:10.1063/1.523436
- S. Kulkarni, M. Demirbas, D. Madeppa, B. Avva, and M. Leone, "Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases," in Proc. SSS 2014, Lecture Notes in Computer Science, vol. 8756. Springer, 2014, pp. 17–32. doi:10.1007/978-3-319-11764-5_2
- M. Herlihy and J. M. Wing, "Linearizability: A Correctness Condition for Concurrent Objects," ACM Trans. Program. Lang. Syst., vol. 12, no. 3, pp. 463–492, July 1990. doi:10.1145/78969.78972
- L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. Comput., vol. C-28, no. 9, pp. 690–691, Sept. 1979. doi:10.1109/TC.1979.1675439
- M. Ahamad, G. Neiger, J. E. Burns, P. Kohli, and P. W. Hutto, "Causal Memory: Definitions, Implementation, and Programming," Distrib. Comput., vol. 9, no. 1, pp. 37–49, 1995. doi:10.1007/BF01784241