This is the story of John. John is a back-end developer working for a video game company called VaporGames.
VaporGames is working on a web store where players will be able to buy in-game items such as weapons, equipment, accessories and skins. Players can buy items using their VaporTokens, a digital currency invented by the company.
VaporGames uses a microservice architecture for their back-end services and for the web store project they have decided to create three new services:
- Wallet Service – Keeps track of the balance of each player and authorizes payments.
- Inventory Service – Manages the inventory of each player.
- Store Service – Processes item purchases.
John’s in charge of the store service and there are other two teams working on the rest of them.
The wallet service will have a ReST API with two endpoints:
POST /payments
– Make a payment. Fails if the player’s balance is insufficient.DELETE /payments/<payment_id>
– Refund payment.
The inventory service will also have a ReST API but only exposing one endpoint:
POST /items
– Add item to player’s inventory.
John’s solution
The problem looks easy enough to John. After a little bit of thinking he comes up with this solution:
When a player makes a request to purchase an item, the store service first tries to create a new payment using the wallet service. If the request to the wallet service fails or the user does not have enough balance, the purchase process will fail altogether. Otherwise, if the user can pay for the item, the store service will issue a request to the inventory service trying to add the item to the player’s inventory. If an error occurs during that second request, the store service will make a request to the wallet service in order to refund the payment.
John and the other teams of VaporGames start working on the project. After a couple of weeks of hard work the web store is ready for testing. During the tests everything works perfectly, thus the team decides to rollout the deploy to production.
The web store works great for a couple of weeks, but that changes when VaporGames releases a new game that attracts an unprecedented amount of newcomers to the platform. Suddenly the web store is getting ten times the projected traffic and the services and network of VaporGames are overloaded during peak hours.
Server and network crashes are to be expected under the circumstances. What’s not to be expected are players reporting weird behaviors of the web store. Apparently some of the players paid for items that were not added to their inventories. To make things even weirder, there are also reports of players getting items for free.
John doesn’t know what happened. The code of the web store looks right and when he tries to recreate the errors, everything works correctly.
What’s wrong? Can it be fixed?
Let’s do a little thought experiment so we can answer these questions.
The Two Generals’ Problem
The first popular problem about distributed consensus is called the Two General’s Problem. It’s first appearance dates back to 1978 [1], although a similar problem about two group of gangsters appeared in 1975 [2].
I think that the version of the ’78 is easier to mathematically formalize, but the OG version that uses gangsters is cooler. So here’s a combination of the two: The Two Gangsters’ Problem.
A group of gangsters is about to rob a bank. The men organize themselves into two squads. One squad is in a safe house with the boss and the other one is hiding in a warehouse a couple of blocks away.
The bank robbery can only be pulled off if both squads do it together. Each squad must decide whether to rob the bank or to retreat. If only one squad tries to rob the bank, they will fail and end up in jail, so both squads need to reach agreement on what to do.
Both squads can communicate with each other by sending messengers. They can send all the messenger they want, but they need to be careful because there are cops on the streets. If a messenger gets caught by a cop, he will be thrown in jail. Luckily no messenger will turn into a snitch, so the plan is not going to be compromised.
Can the two squads reach agreement?
Let’s suppose that both squads want to rob the bank and the boss sends a messenger to communicate the decision to the warehouse squad. Assuming that the messenger arrives safely, both squads know that the boss wants to rob the bank. The warehouse squad now needs to communicate that they also want to go ahead with the plan. So they send a messenger. The problem is that they need confirmation of the boss receiving the message, cause they can’t proceed if he doesn’t receive the message. So the boss sends a messenger. But he also needs confirmation of the warehouse receiving the message. So the warehouse sends a messenger… and we are stuck in a loop. Great!
Intuition tell us that no matter how many messages are sent, the squads will never reach agreement.
Let’s try to formally prove that the problem has no solution.
Impossibility proof
The following section is based on the impossibility proof of coordinated attacks found in Notes on Theory of Distributed Systems [3].
If there’s a protocol that allows the two squads to coordinate their decision, then it must have the following properties:
- Agreement: Both squads agree on a decision, either to proceed with the plan or to retreat.
- Validity: If the two squads want the same result and no messenger gets caught, they will agree on that same result.
- Termination: Both squads reach agreement in a finite amount of rounds. They don’t have an infinite amount of messengers after all.
Let’s imagine a scenario were both squads want to rob the bank and every message arrives its destination. This scenario is an execution of the process that gives an output using the hypothetical message protocol we defined previously. We’ll call this Execution 0 (E0).
Now imagine a new scenario where the last messenger sent by the boss gets caught. This gives us a new execution of the protocol: E1.
From the boss perspective E0 and E1 are identical because he does not know whether last message was delivered. Since the boss sees the same in both execution, he outputs the same decision in both executions. Assuming that the protocol guarantees both squads agreeing, the warehouse group outputs the same decision as the boss’s.
So far we got E0 and E1 which are indistinguishable by the boss. This time we’re going to create a new execution E2 where the last message sent by the warehouse squad is lost.
Both E1 and E2 are indistinguishable by the warehouse squad, so it’s output stays the same. The boss’s output also needs to be same due to the agreement property of the protocol.
Repeating the same process that we did to get from E0 to E2 we can get to execution Ek where no message is delivered. Each execution is constructed by dropping the last delivered message of the previous execution.
Given 0 ≤ i < k, Ei is indistinguishable from Ei+1 by one of the squads. That squad gives the same output on both executions. Because of the agreement property, it is assumed that the other squad outputs the same decision. This means that in all executions from E0 to Ek, the decision from both squads stay the same.
Using Ek as a starting point we create Ek+1, an execution where the warehouse squads starts the process with the preliminary decision to retreat.
According to the boss Ek and Ek+1 are the same, so he’s output for Ek is the same as the one for Ek+1. Although it looks crazy, the warehouse squad needs to agree on the output. Remember that we are assuming that there is a protocol that has all the properties we listed before. This means that at some point during the execution the warehouse needs to change its mind in order to reach agreement.
Ek+2 is going to be the exact same as Ek+1, but this time the boss also starts the process wanting to retreat.
The warehouse group sees the same in Ek+1 and Ek+2, thus its output stays the same. The boss needs to agree, so he also must choose to rob the bank at the end.
Next we’re going to create Ek+2+1 through Ek+2+k’. In each execution where going to restore the last messages sent.
Each execution looks the same as the previous one to one of the squads, so both outputs are the same.
From E0 to Ek+2+k’ we constructed a chain of executions where each execution is indistinguishable from its adjacent one by one of the quads. For executions Ei and Ei+1 one of the squads is going to give the same output because to that squad both executions look the same. Because of the agreement property, we can assume that the other squad outputs the same decision.
After a finite number of executions we reached an execution where both squads want to retreat, all messages are delivered and yet the output of the execution is to rob the bank. This is absurd and does not respect the validity principle of the protocol. So either agreement or validity was violated on some execution.
What’s wrong with John’s solution?
There are a couple of problems with John’s solution.
The first problem is how it’s treating network and service errors. When the store service makes a request that fails, whether to the wallet service or the inventory service, it is assumed that the request was not processed. This assumption is incorrect, a service may process a request but the response may not reach the client due to a network error. The same thing could happen if a service crashes after processing a request but before sending a response.
When an error occurs during the payment request to the wallet service, there is no way to tell if the payment was processed.
This explains why there are some players complaining about not getting the items they paid for.
There is another request that can fail, the request to the inventory service that adds the item to the player’s inventory. When this requests fails, the store service can’t possibly know whether the item was added or not.
This explains why some players are getting their items for free. The store service assumes that the item could not be added to the inventory and refunds the payment.
There is another big problem with the solution that John implemented. It’s trying to solve a problem that cannot be solved. According to the Two Generals’ Problem it is impossible to coordinate two (or more) services into one outcome. For the particular case of the web store, it is not possible to make the wallet service and the inventory service reach agreement into one of the desirable outcomes. Keep in mind that there are only two acceptable outcomes for a purchase attempt, either the item is paid for and added to the inventory, or the item is not paid for and it is not added to the inventory.
Okay, the problem cannot be solved. What then? Should John just give up?
The solution
Although the problem is not theoretically solvable John can still take more pragmatic approaches to find a workaround.
The system cannot guarantee that a paid item is in the player’s inventory. But it can guarantee that a paid item is eventually going to be in the player’s inventory.
This subtle change in the specification transforms the problem into one which can be solved.
The idea is simple. If the payment is made, the web store is going to keep trying to add the item to the inventory until it succeeds.
This is only one of the possible implementations of the idea.
The store services is first going to make a request to the wallet service. If that requests fails, it is going to keep trying until it gets a response. If the player’s balance is not enough, the purchase will end without success. Otherwise, the store service will keep trying to add the item to the player’s inventory until it succeeds.
Note that in no moment it is necessary to refund the payment. That’s because there are not rules that prevent an item for being added to an inventory. As an exercise you can think how the solution would have changed if the inventory did not allow for duplicated entries in a given inventory.
One of the downsides of this solution is that the inventory service may not be immediately reachable after the payment is made. If that’s the case, the player will not see the item in his inventory although it has been paid for. The player may start to worry if the item is not available after a while. But if the inventory service is unusually down for longer than a couple of minutes, it will be completely fine.
The other downside of the solution is that its technical implementation is somewhat challenging.
- The purchase process in the store needs to be executed asynchronously. One or more services might be down during the process of the request that triggers the purchase, and it is important to resume the process so no purchase is left unfinished.
- The store service needs to keep track of the unfinished purchases so it can resume the process if a service is down. This record must persist even if the store service crashes and restarts.
- The wallet service and the inventory service must provide idempotent methods. This means, for example, that a payment is only going to be processed once, no matter how many times is retried. You can learn more about idempotency in this article.
- The asynchronous nature of the purchase process needs to be taken into account to give a compelling user experience. If a service is down, a purchase will have an undefined state until normal operation is resumed. In cases like this, clients of the service, such as a front-end, will not have the complete status of the transaction after a single request-response trip. The front-end may need to display a message such as “We are processing your request” and try to obtain the final result asynchronously in order to display it to the end user.
Conclusion
The Two Generals’ Problem teaches us about one of the fundamental problems that affects distributed systems. Systems communicating over a network cannot make coordinated decisions.
Ideally, the status supplied to a process should specify completely the final outcome of a transaction (i.e. whether the message reached the destination). Such a status is called complete. If a system provides complete status to both processes, then the two parties not only know the ultimate fate of the transaction, but also know that they are in agreement as to what exactly happened.
[…]
In an arbitrary distributed facility, it is impossible to provide complete status.
Some constraints and tradeoffs in the design of network communications, 3.1, page 70 [1]
This limitation is specially problematic when dealing with transactions in microservice architectures, because it’s not possible to rely only on synchronous communications to execute them correctly. For these cases you need more robust solutions that can handle partial outages, such as network errors or service unavailability, and guarantee that every transaction will eventually be completed.
When working with distributed systems, always keep in mind the Two Generals’ Problem.
References
- E. A. Akkoyunlu, K. Ekanadham, and R. V. Huber. Some constraints and tradeoffs in the design of network communications. SIGOPS Oper. Syst. Rev., 9:67–74, November 1975.
- Jim Gray. Notes on data base operating systems. In Operating Systems, An Advanced Course, pages 393–481. Springer-Verlag, London, UK, 1978.
- James Aspnes. Notes on Theory of Distributed Systems, pages 58-60. April 2022.
Nice article.