We have seen Byzantine Generals’ Problem in the synchronous environment and how it achieves consensus using Lamport’s algorithm. In Lamport’s algorithm, if the system has f number of faulty lieutenants, out of 2*f + 1 number of lieutenants, or if the commander is faulty, the system achieves consensus in both cases.
The real systems mostly behave asynchronously, such as nodes may not receive the messages within a certain timeout duration. Also, according to the impossibility theorem, it states that, in a pure asynchronous environment, the system will not achieve consensus in the presence of a faulty node. So the system can’t ensure the liveness property. To ensure liveness, we need to relax the asynchronous condition, called a weak asynchronous environment.
We will see how the system achieves consensus or ensures the safety property in the closed and asynchronous environment and liveness property in the closed and weak asynchronous environment.
Practical Byzantine Fault Tolerant (PBFT) Model
The algorithm is practical as it ensures safety over an asynchronous network but not liveness; otherwise, it will violate the impossibility theorem. However, liveness can be ensured under the weaker assumption. The system can also ensure Byzantine failure, and it has low overhead. For these properties, this algorithm is widely used in permission blockchain applications such as Tendermint, INB’s Openchain, ErisDB, Hyperledger, etc.
The broad idea about the system is as follows: A client submits the request to the commander. The commander and lieutenants are the special kinds of nodes designated to run the consensus algorithm. Once the system comes to the consensus, it sends a response back to the client whether the system has accepted the request submitted by the client or not.
The basic assumption about the system and working environment: The system works in an asynchronous distributed system, and it can tolerate delay and out-of-order messages. It can also handle Byzantine failure where arbitrary nodes behave maliciously and privacy, tamper-proof message, and authentication.
- The state machine is replicated across different nodes.
- 3f + 1 replicas are there, where f is the number of faulty replicas.
- The replicas move through successive configurations, known as views.
- One replica is a view is primary, and others are backups.
- Views are changed when a primary is detected as faulty.
- A unique integer number v identifies every view.
- Only the messages from the current views are accepted.
A simplified view of the PBFT algorithm
A client sends a request to invoke a service operation to the primary. For example, in the case of storage, the client executes a write instruction, or in the case of a blockchain environment, the client initiates a transaction. So all these kinds of requests will move to the primary.
The primary sends this request to all the secondary replicas. in other words, this primary multicast the received request to the backups or the secondary replicas.
The backups or the secondary replicas execute the request and send a reply to the client. It means these backups execute the request and try to come to a consensus-based on the PBFT algorithm. After executing the request, either the client request will be accepted or rejected, and the client will be informed by the individual replicas independently.
The client waits for f + 1 replies from different backups with the same result, confirming that the client has received the majority of the correct voting. Further, the client accepts or rejects the reply accordingly.
The PBFT setting has a maximum of 3*f + 1 number of different replicas, and out of these replicas, it has f number of faulty replicas and 2*f +1 numbers non-faulty replicas. But whenever a client receives a message, if it gets f+1 messages with the same result, the majority of the correct nodes actually reply to the client. The client accepts or commits to that particular message.
Three Phase Commit Protocol
It works in three different phases as Pre-Prepare, Prepare and Commit.
Pre-prepare Phase
Once the client has sent a request to the Primary, then this pre-prepare phase starts. It assigns a sequence number to the request and multicast a message to all the backups.
- PRE-PREPARE is a message.
- is the current view number. This ensures that the message received by the backups is from the current view.
- is the message sequence number.
- is the message digest.
- is the private key of primary – works like a digital signature.
- is the message to transmit.
The primary uses the public key-based digital signature scheme, and signed the entire message using the private key, and broadcast it to the backups or secondary. In this phase, the primary assigns a unique sequence number to the request message and broadcast the request message to all the replicas.
Pre-prepare messages are used as proof that the request was assigned sequence number n is the view v., And a backup accepts a pre-prepare message if
- The signature is correct, and d is the digest of m.
- The backup is in view v.
- It has not received a different PRE_PREPARE message with sequence n and view v with a different digest.
- The sequence number is within a threshold.
Prepare Phase
If the backup accepts the PRE_PREPARE message, it enters prepare phase by multicasting a message to all the replicas. A replica (both primary and backups) accepts prepare messages if
- The signature is correct.
- The view number is equal to the current view.
- The sequence number is within a threshold.
The primary assigns every individual request one sequence number. So once the primary assigns a sequence to individual messages or individual transactions from the client, all these messages are ordered by the sequence number assigned by the primary. So all the backups process the messages in the order of that sequence number. If the received message which a sequence number is higher than the current sequence number, that will process first.
The pre-prepare and prepare ensure that non-faulty replicas guarantee on total order for the requests within a view and commit a message if
- 2*f prepares from different backups matches with the corresponding pre-prepare.
- It must have a total of 2*f + 1 vote (one from primary that you already have) from the non-faulty replicas.
Why do we require 3*f + 1 replicas to ensure safety in an asynchronous system when there are f faulty nodes only?
If we have 2*f + 1 replicas, we need all the votes to decide the majority, which boils down to the synchronous system. However, we may not receive all the votes in the asynchronous system due to delays, out-of-order messages, or other network issues. So if the system has only 2*f + 1 replicas and f number of nodes can be Byzantine nodes, then f + 1 votes do not ensure a majority. It may be possible that we have received f votes from Byzantine nodes and just one vote from a non-faulty node.
When we do not receive a sufficient f+ 1 vote, there is a possibility that either the node is faulty and not forwarded a vote at all, or the node is non-faulty, forwarded a vote, but the vote got delayed.
Whenever we receive 2*f + 1 vote, the majority can be decided 2*f + 1 vote have arrived even if f is faulty, we know that f + 1 is from correct nodes, and we do not care about the remaining f votes. In this case, we can guarantee that the majority vote.
Commit Phase
In the end, every node multicast the commit message to all the replicas include primary. So whenever we have received 2*f + 1 number of such commit messages and if f+ 1 is from the majority decision, it can decide that whether the transaction has been accepted or not.
Multcast message to all the replicas including primary. It establish consensus throughout the views.
PBFT algorithm – View Change
When the primary becomes faulty, the non-faulty nodes can collectively detect the fault and start a view change operation. In the view change operation, they remove the replica, designated as the primary in the latest view, and elect another replica from the backup designated as primary now.
- The view-change protocol provides liveness and allows the system to make progress when the primary fails. To achieve this, the system deviating from the pure asynchronous assumption to weak asynchronous assumption.
- If the primary fails and backups do not receive any message (such as PRE_PREPARE or COMMIT) from the primary, then view changes are triggered by timeouts which prevent replicas from waiting indefinitely for requests to execute.
Backup starts a timer when it receives a request, and the timer is not already running; the timer is stopped when the request is executed and restarts when some new request comes. If the timers expire at view v, then backup starts a view change to move the system to view v + 1. On timer expiry, a backup stops accepting messages except for checkpoint, view-change, and new-view.
View Change Procedure
When timeout expires, multicast a message to all replicas.
- is the sequence number of the last stable checkpoint known to .
- is a set of 2*f + 1 valid checkpoint messages providing the correctness of .
- is a set containing a set for each request that prepared at with a sequence number higher than .
- Each set contains a valid pre-prepare message and 2*f matching.
- The new view is initiated after receiving 2*f view change messages.
- The view change operation takes care of the synchronization of checkpoints across the replicas, and all the replicas are ready to start at the new view .
Correctness Properties
Safety: The algorithm provides safety if all non-faulty replicas are on the sequence numbers of requests that commit locally. If the client is receiving a minimum of f + 1 number of correct messages compared to the expected 2*f + 1 messages with the latest sequence number, then that message gets committed. Thus, to ensure the safety property, only the message (request) with the latest sequence number gets committed in the system.
Liveness: To provide liveness, replicas must move to a new view (primary or leader) if they cannot execute a request or if the primary became faulty.
- A replica must wait for 2*f + 1 view change messages and then starts a timer to initiate a new view to avoid starting a view change too soon.
- If a replica receives a set of f + 1 valid view change messages for views greater than its current view, it sends a view change message to prevents starting the next view change too late.
- The faulty replicas are unable to impede progress by forcing frequency view change.
References
- NPTEL lecture series on Blockchains Architecture, Design and Use Cases by Prof. Sandip Chakraborty, IIT Kharagpur.
673 total views, 1 views today