Traditionally (in the 80s and 90s), we built three-tiered applications. The business layer sits in the middle: on one side it communicates with the data layer, which could be one or more databases; on the other side it communicates with the presentation layer, which could be an API driving a web frontend. These applications scale with neither team size nor complexity. If organizations stick with them, they quickly evolve into monolithic applications that are hard to test, operate and maintain.
The main reason for these boundaries and for layering applications is separation of concerns. Code that accesses the data store should only care about accessing the data store, not about enforcing business rules on the data.
Eric Evans formalized an n-layer architecture in his 2003 work on “Domain-Driven Design” (DDD). In particular, the application layer and the domain layer are different and should be kept separate. The application layer’s job is merely to translate between the presentation layer and the domain layer through the API. In my view the application layer can be seen as acting as the API, and the domain layer as the implementation of that API. The domain layer is what we traditionally called the Business Logic Layer (BLL) and should be isolated from the rest of the system. This helps keep business logic out of API handlers (what we call controllers) and prevents UI concerns from crossing boundaries.
Evans also establishes that layers have a one-way dependency: each layer depends on the layers beneath it. In other words, code in an upper layer should depend on the interface of the lower layer rather than on its implementation.
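A minimal Go sketch of that one-way dependency, with hypothetical names: the application layer depends only on an interface the domain layer satisfies, never on its concrete implementation.

```go
package app

import "context"

// OrderService is the interface the application layer depends on.
// The domain layer provides a concrete implementation; the names here
// are made up and only illustrate the direction of the dependency.
type OrderService interface {
	PlaceOrder(ctx context.Context, customerID string, items []string) (orderID string, err error)
}

// Handler sits in the application layer: it translates transport
// concerns (HTTP, gRPC, CLI) into calls against the domain interface.
type Handler struct {
	orders OrderService // interface, not a concrete domain type
}

func NewHandler(orders OrderService) *Handler {
	return &Handler{orders: orders}
}

func (h *Handler) PlaceOrder(ctx context.Context, customerID string, items []string) (string, error) {
	// No business rules here; validation of invariants lives in the domain layer.
	return h.orders.PlaceOrder(ctx, customerID, items)
}
```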
Service-Oriented Architecture (SOA) breaks down logical components into many services, each employing a multi-tiered architecture. While the terminology can be overwhelming, in this particular case a simple analogy is the quickest way to understand the concepts.
In a nutshell: the same way a coffee shop owner doesn’t care how his bank account is managed in databases, the bank doesn’t care how the coffee shop owner runs his operation. This is what we call separation of concerns. The bank has service autonomy over how to implement the transaction. Customers and the shop owner don’t need to understand how the payment works, just how to use the interface. A payment only happens when a customer orders a coffee, making this setup event driven. The cashier may display a MasterCard or Visa label, broadcasting information for service discovery. When the shop owner needs more staff, he calls a recruitment agency for dynamic provisioning. The services in this scenario are loosely coupled from the system, meaning they can join and leave as need be, maintaining their independence. This allows for load balancing. The next coffee shop can utilize the already available services for service composability.
My basic recommendations start with avoiding (or minimizing) global state early. Shared mutable state is error prone, and passing context around makes you think harder about your interfaces. Remember Donald Knuth’s warning that “premature optimization is the root of all evil”. Start with concepts, constraints and the basic data models you’re dealing with. I would rarely even think about performance until I have an MVP of something. If the components and structure are right, optimizing certain parts later on is usually trivial. Early optimizations often lead you down a bad path, as you stop seeing the forest for the trees.
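As a rough illustration of avoiding global state, compare a package-level variable with dependencies passed in explicitly; the types and queries are made up for the example.

```go
package billing

import (
	"context"
	"database/sql"
)

// Avoid: a package-level global that every function silently reads or mutates.
// var db *sql.DB

// Prefer: dependencies are passed in at construction time, which makes the
// interface of the component visible and keeps tests free of hidden shared state.
type Invoicer struct {
	db *sql.DB
}

func NewInvoicer(db *sql.DB) *Invoicer {
	return &Invoicer{db: db}
}

// Passing a context also forces you to think about cancellation and deadlines.
func (i *Invoicer) TotalFor(ctx context.Context, customerID string) (int64, error) {
	var cents int64
	err := i.db.QueryRowContext(ctx,
		"SELECT COALESCE(SUM(amount_cents), 0) FROM invoices WHERE customer_id = ?",
		customerID).Scan(&cents)
	return cents, err
}
```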
Use linters and static checkers to catch common errors, find unchecked exceptions and perform basic security checks. This is possible in most languages; in Go, for example, golangci-lint enforces sane practices. Also consider code generators: everything that is generated requires less headspace. Use an ORM to encapsulate database access. Some languages have first-class ORMs, such as Active Record in Ruby on Rails. In Go you can use or generate ORM code with the likes of gorm, xo or entgo, among others.
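A sketch of encapsulating data access behind a small repository, assuming gorm v2 (gorm.io/gorm) and its sqlite driver; the model and method names are illustrative.

```go
package storage

import (
	"gorm.io/driver/sqlite"
	"gorm.io/gorm"
)

// User is the persisted model; gorm derives the table and columns from it.
type User struct {
	ID    uint
	Name  string
	Email string
}

// UserRepo hides the ORM from the rest of the application, so callers
// depend on a few small methods rather than on gorm directly.
type UserRepo struct {
	db *gorm.DB
}

func NewUserRepo(path string) (*UserRepo, error) {
	db, err := gorm.Open(sqlite.Open(path), &gorm.Config{})
	if err != nil {
		return nil, err
	}
	if err := db.AutoMigrate(&User{}); err != nil {
		return nil, err
	}
	return &UserRepo{db: db}, nil
}

func (r *UserRepo) Create(u *User) error {
	return r.db.Create(u).Error
}

func (r *UserRepo) ByEmail(email string) (*User, error) {
	var u User
	if err := r.db.Where("email = ?", email).First(&u).Error; err != nil {
		return nil, err
	}
	return &u, nil
}
```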
Note also that there are a ton of bad examples on the internet. Referencing random posts (like this one) or repositories when arguing for design decisions means we don’t understand why we do certain things. This is where first principles should come in: we should understand why certain abstractions or patterns are picked at all. This starts with nuances like directory structures.
Avoid at all costs letting bad code exist. By permitting bad code to exist you guarantee that bad code may execute. What counts as bad and what counts as proper is difficult to pin down, and I’ll try my best to summarize the higher level concepts in the following.
Do spend time here and model this out properly. Once in production it becomes an order of magnitude harder to change flaws in your data models, as you have to deal with “legacy data” the moment you turn the key. Data models are the core of what drives your data layer.
A good and brief summary taken from Eric Evans:
Domain Layer (or Model Layer): Responsible for representing concepts of the business, information about the business situation, and business rules. State that reflects the business situation is controlled and used here, even though the technical details of storing it are delegated to the infrastructure. This layer is the heart of business software.
Consider basic constraints on the technical side. I ask myself basic questions like these: How much time can this project take to execute? What are the maintenance requirements? What data store do I use, and why that one? How much data do I expect to process? How many hits do I expect this API to get on average? What is the maximum acceptable response time, and why is that the maximum? Can this scale horizontally? When do I need to hit the data layer? What cannot be cached? How does this interface with the rest of the system? Which teams do I need to talk to? What are the MVP requirements?
For the business side, or if you’re the product owner proposing this feature: Why are we doing this project? What does the customer get? Why is this good for the customer? What is our expected ROI? How soon will we know if we are failing the ROI?
When the technical constraints and the business questions are all in order, the best course of action is to prepare a press release for yourself and your team. Even if it’s an internal feature you’re building, you will still want to tell people about it. Start from there and work your way back to the technology. Any kind of kickoff should ensure that everyone who works on a project understands it from that direction to the tech, not the other way around.
At the core of these abstractions can be the n-layered application architecture mentioned in the beginning. When in doubt, or when going beyond it, note that there is a difference between creating a higher level of abstraction and simplifying. Design software by finding generalizations and collapsing complexity. Take care of the interface and of what the surface layer is and represents for a particular service. Don’t add to it feature by feature; with every addition, review whether it belongs there and what the overarching purpose is.
Good interfaces are composable. Remember the UNIX philosophy of common text interfaces that allow composing complex processing via pipes.
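In Go, the closest equivalent to the pipe is the small io.Reader/io.Writer contract; a rough sketch of composing independent stages through it, with a made-up counting stage:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"strings"
)

// countingWriter wraps any io.Writer and counts bytes, without knowing
// what produces the data or where it ultimately goes.
type countingWriter struct {
	w io.Writer
	n int64
}

func (c *countingWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	c.n += int64(n)
	return n, err
}

func main() {
	src := strings.NewReader("some payload that could just as well come from a file or socket")

	// Compose: source -> gzip -> byte counter -> stdout, each stage oblivious to the others.
	counter := &countingWriter{w: os.Stdout}
	gz := gzip.NewWriter(counter)
	io.Copy(gz, src)
	gz.Close()

	fmt.Fprintf(os.Stderr, "\n%d compressed bytes written\n", counter.n)
}
```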
Use saga patterns or workflows for distributed transactions, as described in the Request Idempotence and Distributed Transactions article.
A saga is a long story of heroic achievement; the exact reason for the term is not disclosed, but it can be traced to a 1987 publication (FTP) by Hector Garcia-Molina and Kenneth Salem:
Long lived transactions (LLTs) hold on to database resources for relatively long periods of time, significantly delaying the termination of shorter and more common transactions. To alleviate these problems we propose the notion of a saga. A LLT is a saga if it can be written as a sequence of transactions that can be interleaved with other transactions.
If you’re dealing with transactions across multiple services, avoid naive mechanisms like retries or periodically running cleanup functions. Those just kick the can down the road and require you to do things “properly” later on. This kind of design is easier when thought about early in an application’s evolution.
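A rough in-process sketch of the saga idea, with hypothetical names: each step has a compensating action, and when a step fails the already-completed steps are undone in reverse order.

```go
package saga

import "context"

// Step pairs an action with the compensation that undoes it.
type Step struct {
	Name       string
	Action     func(ctx context.Context) error
	Compensate func(ctx context.Context) error
}

// Run executes the steps in order; if one fails, the compensations of the
// completed steps run in reverse order. In a real distributed setup the
// progress would be persisted so a crashed coordinator can resume.
func Run(ctx context.Context, steps []Step) error {
	done := make([]Step, 0, len(steps))
	for _, s := range steps {
		if err := s.Action(ctx); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				// Compensation errors would need retries or manual intervention.
				_ = done[i].Compensate(ctx)
			}
			return err
		}
		done = append(done, s)
	}
	return nil
}
```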
Try to gather metrics at the edges of the application architecture and avoid adding custom aggregation to your domain layer. Business functions and outcomes are usually measurable by the data they produce, so keep metrics to technicalities. For example, on one side of the spectrum the data layer can be measured by query insights; if the datastore doesn’t expose these, try to gather them at the lowest level possible. On the other side we deal with API handlers, messages from a broker and other client interactions. An appropriate way to measure service health there is with middleware, transparently to the application and domain layers.
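A minimal net/http middleware sketch that measures requests at the edge, transparently to the application and domain layers; where the observation ends up (a log line, StatsD, Prometheus) is deliberately left open.

```go
package middleware

import (
	"log"
	"net/http"
	"time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Metrics wraps any handler and records method, path, status and duration
// without the domain layer knowing anything about it.
func Metrics(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, req)
		// Swap this log line for the metrics client of your choice.
		log.Printf("method=%s path=%s status=%d duration=%s",
			req.Method, req.URL.Path, rec.status, time.Since(start))
	})
}
```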
Tests assert the behavior of a system against a specified set of inputs. Fault injection and chaos engineering are harder to achieve in integration tests, and few test with fuzzed inputs. However far we go, the scenarios pale in comparison to the states and inputs complex systems can reach.
Historically, we relied a lot on blackbox monitoring, where we checked the uptime of HTTP endpoints and were usually only able to identify symptoms. Logging and metrics fall into the category of whitebox monitoring. These insights are far more actionable than anything we could derive from blackbox monitoring, and they aid debugging and anomaly detection.
One of the biggest drawbacks of logging is that its sheer volume often makes operations more difficult and/or expensive. The usual recommendation is to log actionable data, which essentially means when a process starts, its major branches, and errors. Each trace is a journal of a process through an application, and logs explain why a particular path was taken. For an example in Go, see my latest post on observability.
Other basics: log with a framework rather than printf-style output, make sure you hide sensitive data such as emails and passwords, and use log levels. On levels, I find it useful to sprinkle applications with debug-level logs that are usually turned off but come in handy when debugging anything particularly obscure. In Go, wrap errors so you can easily see the call stack when anything doesn’t go as expected. Errors are particularly useful as they can be handed to error reporters.
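A short sketch of wrapping errors with %w and using leveled logging via the standard log/slog package (Go 1.21+); the function and variable names are made up.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
	"os"
)

var errNotFound = errors.New("account not found")

func loadAccount(id string) error {
	// Wrapping with %w preserves the original error for errors.Is/As and
	// builds a readable chain of context as the error bubbles up.
	return fmt.Errorf("loadAccount(%s): %w", id, errNotFound)
}

func main() {
	// Debug logs stay off in production and are switched on when chasing
	// something particularly obscure.
	level := slog.LevelInfo
	if os.Getenv("DEBUG") != "" {
		level = slog.LevelDebug
	}
	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: level}))

	logger.Debug("looking up account", "id", "42")
	if err := loadAccount("42"); err != nil {
		if errors.Is(err, errNotFound) {
			logger.Warn("account missing", "err", err)
		} else {
			logger.Error("lookup failed", "err", err)
		}
	}
}
```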
Error reporters, sometimes called exception trackers, like Google Error Reporting or Sentry (open source), allow for observability and augment logging. Do use them if you can.
Metrics are time series of numbers; i.e. a metric is made up of an integer or floating point value and a timestamp. They also have a name under which we track and aggregate them and, in modern tools, additional key-value pairs called labels or tags. With labels we can achieve a high degree of dimensionality in the data model. Metrics are better suited for alerting than logs, since we can usually aggregate them in a single datastore that is easy to query against.
To scale metrics properly, client libraries ideally aggregate samples locally and are then either pushed to or pulled by an upstream aggregator.
With regards to what metrics to collect, the best recommendation I have heard is the RED methodology: Rate (requests per second), Errors (failed requests) and Duration (how long requests take). I don’t know where it came from.
The first two, R and E, are just counters; D is a distribution that is typically collected and plotted using histograms.
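A sketch of RED instrumentation assuming the Prometheus Go client (github.com/prometheus/client_golang): counters for rate and errors, a histogram for duration, and an endpoint for the scraper to pull from.

```go
package metrics

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// R: request rate is derived from this counter (rate() in PromQL).
	requests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total", Help: "Requests received.",
	}, []string{"handler"})

	// E: failed requests.
	errs = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_request_errors_total", Help: "Requests that failed.",
	}, []string{"handler"})

	// D: duration distribution as a histogram.
	duration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name: "http_request_duration_seconds", Help: "Request latency.",
		Buckets: prometheus.DefBuckets,
	}, []string{"handler"})
)

// Observe records one request outcome; call it from an edge middleware.
func Observe(handler string, start time.Time, err error) {
	requests.WithLabelValues(handler).Inc()
	if err != nil {
		errs.WithLabelValues(handler).Inc()
	}
	duration.WithLabelValues(handler).Observe(time.Since(start).Seconds())
}

// Expose registers the /metrics endpoint for the Prometheus scraper to pull.
func Expose() {
	http.Handle("/metrics", promhttp.Handler())
}
```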
The Four Golden Signals are a series of metrics defined by Google Site Reliability Engineering that are considered the most important when monitoring a user-centric system: latency, traffic, errors and saturation.
Focus on test maintainability. Tests often tell you early when your design is wrong.
In a test pyramid, unit testing is the bulk of the work. Unit tests should be usage examples: we test individual components, and we test them the way they’re supposed to be used rather than creating arbitrary tests. If a component cannot be tested cleanly, that indicates the component’s interface design is wrong. If a test brings in too many components, the component isn’t focused enough and likely violates separation of concerns. There is no reason to test low-level components from a high level; use layering and only test at one level with unit tests.
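A sketch of a unit test that doubles as a usage example, table-driven as is idiomatic in Go; the component under test is hypothetical.

```go
package pricing

import "testing"

// Discount is the (hypothetical) component under test: it applies a
// percentage discount, clamped to the range [0, 100].
func Discount(cents int64, percent int) int64 {
	if percent < 0 {
		percent = 0
	}
	if percent > 100 {
		percent = 100
	}
	return cents - cents*int64(percent)/100
}

// The test reads like documentation for how the component is meant to be used.
func TestDiscount(t *testing.T) {
	cases := []struct {
		name    string
		cents   int64
		percent int
		want    int64
	}{
		{"no discount", 1000, 0, 1000},
		{"ten percent", 1000, 10, 900},
		{"clamped above 100", 1000, 150, 0},
		{"clamped below 0", 1000, -5, 1000},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			if got := Discount(c.cents, c.percent); got != c.want {
				t.Errorf("Discount(%d, %d) = %d, want %d", c.cents, c.percent, got, c.want)
			}
		})
	}
}
```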
System level or integration testing ensures that the service functions in a single and multi node setup. The outside world might still be stubbed out.
End-to-end testing involves all required components running and catches simple failures, like a component that cannot start because of missing or incorrect configuration. It closely resembles the real world and is the last step before a staging environment where manual tests might be performed.
The favorite topic of every engineer, I’m sure. Given that services are small, the code often explains itself naturally. Document the interface, the API, and optionally provide a client package to integrate the service. We can generate a lot of API docs nowadays with the likes of Swagger (the OpenAPI Specification). Serialization formats may be JSON or Google’s protocol buffers (protobuf). Implementing a client package should take a minimal amount of time and also tells you how easy your interface is to use.
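A sketch of what a thin client package might look like for a hypothetical JSON API; if writing something like this feels painful, the interface is probably too complicated.

```go
package client

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// Client wraps the service's HTTP API behind a couple of typed methods.
type Client struct {
	base string
	http *http.Client
}

func New(baseURL string) *Client {
	return &Client{base: baseURL, http: http.DefaultClient}
}

// Order mirrors the documented response of the (hypothetical) /orders/{id} endpoint.
type Order struct {
	ID     string `json:"id"`
	Status string `json:"status"`
}

func (c *Client) GetOrder(ctx context.Context, id string) (*Order, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.base+"/orders/"+id, nil)
	if err != nil {
		return nil, err
	}
	resp, err := c.http.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("get order %s: unexpected status %d", id, resp.StatusCode)
	}
	var o Order
	if err := json.NewDecoder(resp.Body).Decode(&o); err != nil {
		return nil, err
	}
	return &o, nil
}
```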
From Evans again we know to keep the Service Layer thin - as all the key logic lies in the domain layer. Therefore documenting the API shouldn’t be a big deal.
The 12 factors were published by Adam Wiggins around 2011 and are best practices for software-as-a-service applications. They are primarily considerations for portability and resilience. Git repositories and CI/CD pipelines are standard setups nowadays, and the first five factors (codebase, dependencies, config, backing services, and build/release/run) are fairly standard practice in most environments.
My key takeaways are these, some of them obvious: config through the environment saves a lot of hassle with config files and formats (I still use application flags for simple toggles myself), and scalability is a major concern, so allowing multi-instance deployments in particular should be natural these days.
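A small sketch of config through the environment with a flag for a simple toggle, standard library only; the variable names are made up.

```go
package main

import (
	"flag"
	"log"
	"os"
)

// envOr reads a configuration value from the environment with a default,
// which keeps deployments free of config files and format discussions.
func envOr(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

func main() {
	// Simple toggles can still be flags; everything deployment-specific
	// (addresses, credentials, backing services) comes from the environment.
	verbose := flag.Bool("verbose", false, "enable verbose output")
	flag.Parse()

	listenAddr := envOr("LISTEN_ADDR", ":8080")
	dbURL := envOr("DATABASE_URL", "postgres://localhost/dev")

	log.Printf("starting on %s (db=%s, verbose=%v)", listenAddr, dbURL, *verbose)
}
```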
Depending on whether you are running on a “modern” Kubernetes stack or you’re a greybeard deploying servers, you might have a different take on logs. I’d allow both: write to a file as well as to stdout.
In larger distributed systems, a good way to apply loose coupling is to add a messaging system. It is handy to know when certain events happen without knowing the consumers, and it allows multiple systems to divide and conquer jobs. A message bus enforces a common message format and/or message envelope to notify arbitrary consumers, unknown to the publisher, about events that happened. When employing this pattern, the publisher should be strict about enforcing the correctness of published messages. The consumer is responsible for handling the messages of any type it subscribes to. This means the consumer has to ensure it’s not overwhelmed by the publish rate of what it subscribes to; if it crashes as a result of receiving a message, that is also its own fault.
Between producers and consumers we don’t care what operating system or programming language is used. Only the serialization format of the messages has to be architecture independent with regard to endianness (byte ordering, LE or BE); JSON works, or, more efficiently, protocol buffers (protobuf for short) for binary messages.
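A sketch of a common JSON envelope and a consumer that dispatches on the event type; the broker itself is abstracted away and the field and type names are illustrative.

```go
package events

import (
	"encoding/json"
	"fmt"
	"time"
)

// Envelope is the part every message shares; the payload stays opaque
// until a consumer that cares about the type decodes it.
type Envelope struct {
	Type       string          `json:"type"`
	OccurredAt time.Time       `json:"occurred_at"`
	Payload    json.RawMessage `json:"payload"`
}

type OrderPlaced struct {
	OrderID string `json:"order_id"`
	Amount  int64  `json:"amount_cents"`
}

// Handle is what a consumer would run for each message pulled off the bus.
// Unknown types are ignored so publishers can evolve independently.
func Handle(raw []byte) error {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return fmt.Errorf("decode envelope: %w", err)
	}
	switch env.Type {
	case "order.placed":
		var ev OrderPlaced
		if err := json.Unmarshal(env.Payload, &ev); err != nil {
			return fmt.Errorf("decode order.placed: %w", err)
		}
		// Trigger billing, notifications, workflows, and so on.
		fmt.Printf("order %s placed for %d cents\n", ev.OrderID, ev.Amount)
	default:
		// Not subscribed to this type: acknowledge and move on.
	}
	return nil
}
```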
Employing a message bus can help decouple systems, as not everything needs to poll via APIs. New applications can hook themselves into events and perform tasks like billing, email notifications, machine learning, triggering workflows, etc. In short, we get loose coupling and divide-and-conquer patterns that help us with scaling.