Traditionally, transactions guarantee data validity despite errors through SQL ACID (atomicity, consistency, isolation, durability) compliance. Payment handling often involves database triggers to achieve exactly-once semantics. In “modern” setups however, the data of an operation spans many isolated databases as we implement the “database per service” pattern. Service-oriented architectures bring added complexity as our transactions now potentially span multiple databases. As engineers we need to understand the implications of the context we’re operating in and that of potential transient failures without a canonical source of truth.
The reason we went down this path is scalability. Each service can be owned by a separate team or developer who makes their own technology choices and sets their own release schedule. Services and teams thereby interact via well-defined APIs. Business processes need to call multiple services to achieve the desired outcome, and every time a system boundary is crossed, the chance of failure multiplies.
A simple example problem:
What if the composing service dies after funds have been taken? We’ve taken customer funds but not processed the order yet. The customer likely receives an error.
The most naive approach is to implement a retry mechanism for transient (short-lived) failures. However, retrying can take too long and fail as well if the communication failure isn’t actually short-lived or a retry isn’t possible at the time. Another simple approach is a cleanup handler that periodically checks the state and fixes inconsistencies. This, however, may take too long to kick in, and it’s hard to predict when it should run. Both of these patterns are rather hacky and shouldn’t be relied upon when dealing with critical data such as financial transactions.
Ideally the web request is acknowledged with a job-id, confirming that the transaction is underway. The stored task can then be processed by this instance, by another one, or even by an asynchronous worker. The web frontend can poll the job-id or get notified about (partially) completed processing steps.
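As a rough sketch of this hand-off (the `jobStore` interface, handler names and routes are made up for illustration, not from any particular framework), the service persists the task, returns a job-id immediately and exposes a status endpoint for polling:

```go
package orders

import (
	"encoding/json"
	"net/http"

	"github.com/google/uuid"
)

// jobStore is a stand-in for a durable task store (database table, queue, ...).
type jobStore interface {
	Enqueue(jobID string, payload []byte) error
	Status(jobID string) (string, error)
}

type server struct{ store jobStore }

// handleCreateOrder accepts the request, persists it as a task and immediately
// returns a job-id. The actual processing happens elsewhere: this instance,
// another one, or an asynchronous worker.
func (s *server) handleCreateOrder(w http.ResponseWriter, r *http.Request) {
	payload, _ := json.Marshal(map[string]string{"customer": r.FormValue("customer")})
	jobID := uuid.NewString()
	if err := s.store.Enqueue(jobID, payload); err != nil {
		http.Error(w, "could not accept order", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusAccepted)
	json.NewEncoder(w).Encode(map[string]string{"job_id": jobID})
}

// handleJobStatus lets the frontend poll for progress by job-id.
func (s *server) handleJobStatus(w http.ResponseWriter, r *http.Request) {
	status, err := s.store.Status(r.URL.Query().Get("job_id"))
	if err != nil {
		http.Error(w, "unknown job", http.StatusNotFound)
		return
	}
	json.NewEncoder(w).Encode(map[string]string{"status": status})
}
```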
The most common and more resilient way to achieve this is with message brokers, as clients can reject a message for re-queuing and reprocessing. This only requires idempotent handling of work items from the queue, which should be standard practice anyway. Idempotency of an operation means it can be applied multiple times without changing the result beyond the initial application.
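A minimal sketch of such idempotent handling, assuming every message carries a stable ID and a hypothetical `DedupStore` that records IDs already applied (in production the dedup record and the business write would share one transaction):

```go
package consumer

import (
	"context"
	"errors"
	"fmt"
)

// Message is whatever the broker delivers; the important part is a stable ID
// that stays identical across redeliveries.
type Message struct {
	ID      string
	Payload []byte
}

var ErrAlreadyProcessed = errors.New("message already processed")

// DedupStore remembers which message IDs have already been applied.
// In practice this is a database table with a unique constraint on the ID,
// written in the same transaction as the business change.
type DedupStore interface {
	// MarkProcessed returns ErrAlreadyProcessed if the ID was seen before.
	MarkProcessed(ctx context.Context, id string) error
}

// Handle applies a work item so that redeliveries of the same ID become
// no-ops, which makes requeue-and-retry safe.
func Handle(ctx context.Context, store DedupStore, msg Message, apply func([]byte) error) error {
	if err := store.MarkProcessed(ctx, msg.ID); err != nil {
		if errors.Is(err, ErrAlreadyProcessed) {
			return nil // duplicate delivery, the result is unchanged
		}
		return fmt.Errorf("dedup check: %w", err)
	}
	return apply(msg.Payload)
}
```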
With message queues, however, we lose a holistic view of the execution steps, and there is a risk of cyclic dependencies if two services wait for one another. This setup is also difficult to test, as an integration test requires all services to be available to simulate a transaction.
A better way to achieve distributed transaction handling is service orchestration. The service executing the high-level transaction becomes a controller and tells participating services which local transactions to execute. The orchestrator handles all operations based on events, stores the state, and handles failure recovery.
A coordinator instructs services to do a transaction, the services reply with success or failure, and if all services are successful, the coordinator issues a commit. The last step requires services to acknowledge the commit. Note that the coordinator is a single point of failure and 2PC doesn’t scale given the number of required messages. For this reason I haven’t seen this pattern used anywhere. I only mention it here for completeness’ sake, as I’ve seen it mentioned in this context multiple times on the internet.
I would not recommend using this.
Saga allows defining a compensation function for each step that will be automatically applied in case of an error on any step. Compensation functions must be idempotent and must be retryable until they execute successfully.
To accomplish this, a Saga log is written by a Saga execution coordinator (SEC). The execution can be restored from this log. The Saga coordinator provides guarantees about the logical flow: a transaction either succeeds or is aborted cleanly, with the necessary rollbacks, for any set of potential transient failures.
Idempotency-wise, any request executed by the coordinator has to follow at-most-once semantics, while compensation requests usually follow at-least-once semantics. This is because on a rollback we have to retry until we succeed (as the only alternative is to call a human). Ideally all sub-requests are idempotent; there are, however, actions that cannot be rolled back once executed, for example sending an email.
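A minimal sketch of at-least-once compensation, retrying with backoff until it succeeds or we give up and escalate (the backoff values are arbitrary, for illustration only):

```go
package saga

import (
	"context"
	"time"
)

// Compensate retries an idempotent compensation function until it succeeds
// or the context is cancelled (at-least-once semantics).
func Compensate(ctx context.Context, compensate func(context.Context) error) error {
	backoff := time.Second
	for {
		if err := compensate(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // give up only when told to; the alternative is paging a human
		case <-time.After(backoff):
		}
		if backoff < time.Minute {
			backoff *= 2 // exponential backoff, capped at one minute
		}
	}
}
```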
When the SEC crashes, we cannot assume that the saved state is complete. In the event of uncertainty, it has to roll back. If all sub-requests are idempotent, we can attempt a forward recovery and can even replay the entire Saga. The standard pattern employed, however, is usually backward recovery.
As it shouldn’t impair your thinking about business logic, at a high level a Saga pattern may execute as follows:
In the event of a failure at step 1, 2, 3 or 4, each operation preceding the failing one will be rolled back with its compensation function. For example, if step 2 fails, the compensation function will roll back any debit that was made in step 1. Note that this is just an example to convey the concept.
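A minimal sketch of what such an orchestrated Saga run could look like in code; the `Step` type and the backward-recovery loop are illustrative, not a particular framework’s API:

```go
package saga

import (
	"context"
	"fmt"
)

// Step pairs a forward action with its compensation.
type Step struct {
	Name       string
	Execute    func(context.Context) error
	Compensate func(context.Context) error
}

// Run executes the steps in order. If any step fails, the compensations of
// all previously completed steps are applied in reverse order (backward recovery).
func Run(ctx context.Context, steps []Step) error {
	completed := make([]Step, 0, len(steps))
	for _, step := range steps {
		if err := step.Execute(ctx); err != nil {
			for i := len(completed) - 1; i >= 0; i-- {
				// A real SEC would retry these until they succeed
				// (at-least-once), driven by the Saga log.
				if cerr := completed[i].Compensate(ctx); cerr != nil {
					return fmt.Errorf("compensating %s: %v (after %s failed: %v)",
						completed[i].Name, cerr, step.Name, err)
				}
			}
			return fmt.Errorf("step %s failed, saga rolled back: %w", step.Name, err)
		}
		completed = append(completed, step)
	}
	return nil
}
```

Steps here would be things like debiting the customer, reserving inventory and creating the order, each paired with its compensation.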
This is described in more detail at microservices.io/patterns/data/saga. We could also describe this Saga as a mini workflow: a single business process that executes the required steps in order, independent of failures.
The process described is called service orchestration, as a single point of logic does the job. When a problem cannot be covered by centralized logic, a second Saga pattern called service choreography applies, where the decision logic is distributed. We often perform choreography without knowing it in message broker scenarios.
Choreography is harder to test and set up when implementing a deliberately larger workflow. Neither of these patterns is straightforward to implement, as all edge and failure cases have to be accounted for, such as when the datastore goes away and the Saga log can’t be written.
While the Saga patterns work well for a small number of distributed transactions, we could be dealing with a more complex business workflow that is large in scope. Such workflows could span tens of services or 3rd parties and potentially run for longer periods with human-dependent work items.
The Cadence workflow engine from Uber makes deep Saga patterns easy to implement. I should note that Cadence is groundbreaking and was forked by Maxim Fateev into the competing product Temporal, with a dedicated company around it.
Cadence/Temporal do microservice orchestration with retries, rollbacks and human interactions. A major upside is the added observability: end-to-end visibility across the workflow and multiple services is provided through an event history of each workflow state. They can guarantee exactly-once semantics, which is especially important in payments and financial transactions, ensuring consistency and reliability of long-running, distributed transactions.
In contrast to other workflow engines, Cadence and Temporal implement workflows as code. Complex decision trees are often better expressed using a general-purpose programming language. This enables users to handle failures programmatically, making it possible to share and reuse logic.
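As a sketch of what workflow-as-code looks like with the Temporal Go SDK; the payment example and activity names (`DebitAccount`, `CreditAccount`, `RefundAccount`) are hypothetical, while the `workflow.ExecuteActivity` pattern follows the SDK’s documented usage:

```go
package transfer

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// Hypothetical activities; in Temporal these are regular functions registered
// with a worker and referenced from the workflow by name:
//   DebitAccount(ctx, req)  -> payment ID
//   CreditAccount(ctx, req) -> error
//   RefundAccount(ctx, id)  -> error

type TransferRequest struct {
	FromAccount string
	ToAccount   string
	Amount      int64
}

// TransferWorkflow expresses the Saga as plain code: retries come from the
// activity retry policy, and the compensation (refund) is invoked explicitly
// when a later step fails. The workflow's event history gives end-to-end
// visibility into every step.
func TransferWorkflow(ctx workflow.Context, req TransferRequest) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
		RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 5},
	})

	var paymentID string
	if err := workflow.ExecuteActivity(ctx, "DebitAccount", req).Get(ctx, &paymentID); err != nil {
		return err // nothing to compensate yet
	}

	if err := workflow.ExecuteActivity(ctx, "CreditAccount", req).Get(ctx, nil); err != nil {
		// Compensate the debit; the retry policy keeps retrying the refund.
		if rerr := workflow.ExecuteActivity(ctx, "RefundAccount", paymentID).Get(ctx, nil); rerr != nil {
			return rerr
		}
		return err
	}
	return nil
}
```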