Skip to content

10.24

I wrote a TLA+ spec of the Rabia protocol (without pipeline) and translated it to Coq. I'm not sure if the spec is correct, since TLC model checking is not enough for liveness. I am now thinking about how to prove the liveness property in Coq. Below are some design choices and the problems I encountered.

TLA+ Model

The consensus layer is modeled as a global action commit. It observes the proposal of all nodes and (non-deterministically) decides what to commit, as long as some necessary conditions are met. These conditions are weaker than the actual ones to simplify the model, but they should be strong enough to prove the desired properties.

After a value is (globally and virtually) committed, any node can commit the value locally by appending it to its log.

text
either {
  a majority of nodes have proposed the same value v:
    commit(v)
} or {
  there exists a quorum of nodes that have proposed
  values without any majority value in this quorum:
    commit(⊥)
}

I think the modelling of RPC message broadcast is not necessary; we can replace it with a single client picking a new pair from the set (Servers set X client requests set) and directly adds the request to the server's priority queue. This makes client requests totally unordered, but that will not affect our proof, since we only consider what happens after all requests are delivered.

I believe this simplification is ok, but it may be very difficult to prove the correctness of the simplification itself. Two relevant blog posts are Specification Refinement and Why Specifications Don't Compose. So for now, I plan to admit the simplification and prove on the simplified model.

TLA+ to Coq Translation

Some modifications I made during the translation:

  • Types in TLA+ are heterogeneous (e.g., a set containing both integers and strings), while in Coq they are homogeneous. The pc variable includes the pc of all processes. I want to split it into multiple roles (as defined in PlusCal) and have a separate pc for each role. This may make the Coq proof easier. When translated manually, one should be careful not to lose the information that some pc values are unchanged.
  • Make some type Inductive in Coq
  • Add option type (maybe later we can define an option type wrapper in TLA+?)

Coq Proof

I wrote a tactic to automatically do case analysis and try to prove each case. The number of cases grows exponentially with the number of case matches, so it will be slower as the spec becomes more complex. For the simple model (async_net), that tactic terminates in 3 seconds, but now it takes approx. 25 seconds for the same safety proof. This also makes the debugging process harder, since I cannot get immediate feedback on whether the proof is correct.

I'm now working on the Coq proof. I think it will be quite challenging and time-consuming. Hope I can have some progress soon.