From Monolith to Observable Microservices using DDD
Gomez: My name is Maria Gomez, I’ve been working at ThoughtWorks for the last seven years. Currently, I play the role of Head of Technology for ThoughtWorks in Spain in sunny Barcelona. As I said, I’ve been working for ThoughtWorks for seven years, and for the last four and a half I’ve been mainly working with organizations that are on a digital transformation journey. This talk contains a lot of the experience and learnings that I picked up along the way, and I hope that what I say will resonate with most of you, either because you have gone through a similar experience or because you are in the middle of one.
We’re going to be splitting this talk in two parts. We’re going to talk about using domain driven design to help us understand better our system and to make better decisions when we want to break it down into microservices. Then we’re also going to link it to some of the operational concerns and observability concerns that you also need to bake in when you are breaking down your monolith.
Let’s start. This is the typical scenario: you have a monolith that has been working for many years, but it is now slowing you down because it’s not giving you the delivery speed that you or the business needs, and it’s very painful to work with. It’s very brittle and very fragile. You think that moving into the amazing world of microservices will help you solve a lot of those problems. It will help you create the independence that you need, it will help you increase velocity and decrease delivery lead time, so you can go to production much faster.
This is the dream, and it’s a very nice scenario, because in reality things are a bit more complicated. Normally, your system will look a lot like this. Sure, you have a monolith, but you’ll have tons of other stuff that is also interacting with it. There will be a lot of history in the system. You might have prior attempts to break it down into different parts, and those attempts might not have gone according to plan and were abandoned halfway through. You might have integrations at various levels.
You might have different databases that are being accessed by different services, and even a cron job that synchronizes two databases. You will have services that are very chatty, sending and receiving a lot of information among each other. All of these things are a symptom of taking decisions that in some cases were purely technical and didn’t take the business into consideration. The result is that you end up with a set of services that might not have the right boundaries, because those boundaries were not set by the business and didn’t have any business capabilities in mind.
That’s what I experienced in a lot of the organizations that I work with, and what I do with those teams and those organizations is help them take a step back and start looking at what the business capabilities are, what the business domain is, and how we can start modeling our services around those capabilities. Domain driven design allows you to think in those terms and helps you fix that root problem. Then you can end up identifying all of those subdomains and extract them into better-formed services.
I’m going to go very quickly through a definition of what domain driven design is and how it can help you. Domain driven design is just an approach to building software that has complex and ever-changing business requirements. Most of us are in that situation: as an organization, we have a very complex business and we want to evolve it to be even more complex. There are multiple books on the subject, some more theoretical than others. Regardless of what your choice is, the truth is that there is a lot of literature that you can go and read to learn more about domain driven design if you haven’t done so. It’s something that has ultimately gained a lot of traction.
I want to go through a couple of very key concepts. Domain driven design talks about the concept of a domain. A domain is basically the activity that an organization does, where it does its work: finance, health, retail, things like that. Then, subdomains: they are abstractions that describe a select aspect of that domain. If you’re in finance as a domain, a subdomain could be payments, statements, credit card applications, etc.
Ubiquitous language is the language that we use to talk about things that happen in that subdomain, and a bounded context is an explicit boundary within which a subdomain exists, and where we also use that language that we created. If we try to illustrate all those concepts, still in a very theoretical way, we can use this image from the Martin Fowler blog. Basically, you might say I’m working in the e-commerce domain, and within that, I have two subdomains, sales and support.
I have also defined the bounded contexts — where the sales context finishes and the support context starts — and I have defined certain entities that will give me the functionality that I need. In both of them I have the concept of customer and product, but they will mean different things in each one of those domains. Customer in sales will maybe care about financial information and payment information and will have actions related to that, whereas in support, it will be something completely different. The language and the meaning of all of those entities will be different in each one of the domains.
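To make that concrete, here is a minimal sketch (with hypothetical fields) of the same "Customer" entity modeled differently in the two bounded contexts. The point is that each context owns its own model and language rather than sharing one bloated class.

```python
# Illustrative only: the same real-world customer, modeled per context.
from dataclasses import dataclass, field

# --- Sales bounded context: cares about payment and billing ---
@dataclass
class SalesCustomer:
    customer_id: str
    payment_method: str      # e.g. "visa", "paypal"
    billing_address: str

    def charge(self, amount: float) -> str:
        # In sales, a customer is someone you can charge.
        return f"charged {amount} to {self.payment_method}"

# --- Support bounded context: cares about tickets and history ---
@dataclass
class SupportCustomer:
    customer_id: str
    open_tickets: list = field(default_factory=list)

    def raise_ticket(self, issue: str) -> None:
        # In support, a customer is someone with issues to resolve.
        self.open_tickets.append(issue)
```

Only the `customer_id` links the two; everything else, including the operations, belongs to one context and stays behind its boundary.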
The power of domain driven design is that it allows you to draw all of these boundaries, all of these lines and you can create specializations within them. You can start creating product teams that specialize in each one of those subdomains. You can have the sales team that has product people and technical people that are solely working on evolving that domain. Therefore, you can start achieving that autonomy that you need not just in the technical sense, but also in the business or the organizational sense.
If we bring it back to implementation details and go back to our monolith, we can use domain driven design to identify the subdomains that we have within the monolith, draw those boundaries, and focus on isolating them within the monolith. That’s very important: you should not split things out before you have removed all the dependencies and put very clean APIs in the place where they are. Once you have done that, you can then extract them into different microservices, at the same time reducing the complexity of the monolith where they no longer live.
There are many ways of going about this, because it is a very complex process. I’m going to show you one exercise that I’ve done in many places that has helped teams understand a bit better how to go about identifying those domains. Regardless of the tool that you use to identify these things, it’s very important to bring into those conversations people from the different disciplines that you have in your organization, or that are interested in this domain: developers, obviously, but also product owners, subject matter experts, infrastructure people, etc.
The tool that I use a lot is EventStorming. EventStorming is a workshop format that allows you to quickly explore complex business domains. It was created by Alberto Brandolini, so I haven’t created this thing, I’m just using it quite a lot. The basic idea is that it allows you to bring software developers and domain experts together and learn from each other. You’re using the stickies and a wall to make the whole thing much more interactive and engaging.
There are various concepts used in this type of exercise that are also in line with the domain driven design concepts, so this is all nicely put together. A domain event is something that has happened and is of interest to a domain expert. A command is an external instruction to do something; it’s what triggers that domain event. An aggregate is the portion of the system that receives the commands and then decides whether it needs to trigger that event as a reaction to the command.
If we are in the retail or e-commerce space, an event could be that an item has been added to the basket, a command could be that the user presses “Buy”, and the aggregate could be the basket itself. Let’s look at an example and think of how this workshop will go and how it will work. Let’s imagine we have a cinema website where we can go and buy tickets. People go and buy their “Avengers: Endgame” ticket there, and what we do is get people from all of these different disciplines and say, “We need to map the process.” We have the starting event, where people enter the site, and we have the ending event, where the customer finishes the flow. The goal of this workshop is to fill the gaps. The process could start at either end of the flow, or in the middle, but the idea is you just start filling it with all the events or things that will happen in between. The show is selected, the seats are reserved, the ticket is added to the cart, and so on until we get to the email that is sent with the ticket confirmation.
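The command → aggregate → event relationship from the cinema example can be sketched as code. All names here are illustrative, not a prescribed design; the point is that the aggregate receives the command and decides whether the domain event happens.

```python
# EventStorming concepts as code, using the cinema ticket example.
from dataclasses import dataclass

@dataclass
class BuyTickets:            # command: an external instruction to do something
    show: str
    seats: int

@dataclass
class TicketsPurchased:      # domain event: something that happened
    show: str
    seats: int

class Basket:                # aggregate: receives commands, decides on events
    def __init__(self, reserved_seats: int):
        self.reserved_seats = reserved_seats

    def handle(self, cmd: BuyTickets):
        # The aggregate enforces the business rule before emitting the event.
        if cmd.seats <= self.reserved_seats:
            return TicketsPurchased(cmd.show, cmd.seats)
        return None  # command rejected, no event emitted
```

In the workshop these are just stickies on a wall; the mapping to code comes later, once the boundaries have stabilized.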
Then we look into the commands that trigger those events and, as you can see, we are not going into technical stuff. Here, we are purely talking about business. What are the commands that trigger all of these events? Once we have that, we can start asking, “What are the aggregates, what are the entities that we think can handle these commands?” Then we have a discussion and talk about them. We can say this area over here, that looks like a buyer, is the one reacting to those commands and taking the necessary actions, whereas over here we see more the figure of an accountant, and at the end we see a customer service.
Then we start drawing those lines, and those lines are the ones that will tell us where our boundaries, or our bounded contexts, will lie. Obviously, the first time you do it, you might not be super certain about those lines, and that’s fine. It is an iterative process that you will need to repeat a few times over many months to make sure that you’re going in the right direction.
We’ll go to another example – this is something that I’ve done quite recently in one of the projects that I’m involved in. There is a fashion retailer in Spain on a transformation journey; they have a set of legacy systems and they want to update them and evolve them into microservices. The team that I was working on owns the checkout domain, which is massive and is also all within this very big and very complex monolith. We were trying to identify what possible subdomains we could see within that domain and start to split them into different services.
This is a very simple flow of what a checkout might look like. The one that we used is a bit more complicated, but for the sake of this example, I’ll just use this one. When the user presses the basket button, that’s when the checkout starts, with the first event: an order gets created. Up until then it was a basket; now, it’s an order. At the end of the checkout process, the confirmation is sent by email. In a similar way, we found all of the events, identified all of the commands and the aggregates that were managing those commands, and we also started drawing the lines.
Obviously, here we have a bit more stuff going on. Our lines were a bit fuzzy, but as I said, we kept repeating this process over a long period of time, and we are still repeating it every so often to make sure the boundaries are still where we think they should be, or to adjust them as we learn more.
Now that you have identified all of these boundaries, you can start with the nicest stuff. You can start isolating those contexts so they can be extracted into services. I’m going to repeat this a lot: it’s very important that you isolate them within the monolith, so no database integration, no coupling with other subdomains. You should have a very nice package, or something independent, that has a nice and clean API the rest of the system can interact with, and that thing should be very well put together.
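A minimal sketch of what that isolation might look like inside the monolith, using the delivery context as an example. The names are hypothetical; the idea is that the rest of the monolith imports only the facade, never the internals, so extraction later means swapping the facade for an HTTP client.

```python
# Delivery bounded context isolated behind one narrow entry point.

class _DeliveryRepository:
    """Internal detail: storage owned exclusively by the delivery context.
    No other part of the monolith touches these tables/objects directly."""
    def __init__(self):
        self._deliveries = {}

    def save(self, order_id: str, address: str) -> None:
        self._deliveries[order_id] = {"address": address, "status": "scheduled"}

    def get(self, order_id: str) -> dict:
        return self._deliveries[order_id]

class DeliveryApi:
    """The only entry point the rest of the monolith may call.
    When the context is extracted, this API becomes the service boundary."""
    def __init__(self):
        self._repo = _DeliveryRepository()

    def schedule_delivery(self, order_id: str, address: str) -> None:
        self._repo.save(order_id, address)

    def delivery_status(self, order_id: str) -> str:
        return self._repo.get(order_id)["status"]
```

With this in place, extracting the context means re-implementing `DeliveryApi` as a remote call without touching its callers.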
This is not a simple lift and shift. You will need to do a lot of refactoring to get to that point. This is not going to be easy, but once you have done it, it’s going to be much easier to then break that off and create a new artifact out of it. A bounded context has many attributes, but here are three of them: it’s a highly cohesive and loosely coupled context, and it represents a business capability — payments, statements, as we said before. If we add other things, like independence in terms of the tech stack, independence in terms of deployment, isolated failures, and individual and independent scaling, we have a microservice. It’s very important to note that a well-defined bounded context takes us halfway towards a very well-defined microservice.
If we go back to this retail example, we said that we had identified many bounded contexts. We have promotions, orders, payments, deliveries. We can’t tackle them all at once — that’s not going to be very fun and it’s going to take a long time until we see any results — so we need to pick one. How can we pick one? In this case, we knew that the business and the product people wanted to start exploring and experimenting with new ways of offering delivery services to the users, improving the service and the experience they were currently having.
Delivery seemed to us like a nice little place where we could start extracting that context and that functionality. There are many dimensions along which to decide where to start. In our case, we used pace of change as the biggest one, because we wanted to create a place that would allow us to iterate and experiment very quickly, but in other cases performance might be a better indicator or a higher priority. It might be that there is a place in your monolith that is using a lot of resources, and therefore making the whole monolith use a lot of resources, and that might be better off split into something isolated and independent.
Whatever it is that you decide is your driver, make sure that you are able to articulate the business value it is giving you — and you should have a pretty good idea by now, because you have had a lot of conversations with the business. Ideally, we should be able to say we want to extract this microservice as part of this business initiative, and not as a one-off thing completely separate from the business.
Doing things like this for the sake of tech doesn’t make a lot of sense. We should be able to say: we’re doing this because it’s going to provide us this value. It’s going to give us that platform for experimentation that we are looking for. It’s going to let us save some operational cost, because we’re going to be able to optimize the resources that we use. Having that conversation with the business is, I think, key to making one of these transformations successful.
So, we have a framework that helps us identify the domains and extract them into microservices — our job is done, now we just need to go on and extract them. Well, we’re not done; we are actually just scratching the surface of what all of this means. We have identified what to extract, and we now have a lot of domain knowledge that is going to help us identify other things that lie below the surface. There are many other concerns that we need to take care of that are easily overlooked. These are the areas that people get wrong time and time again, and this is how we end up with complex systems that are difficult to maintain and difficult to support in production. Microservices, like any other distributed architecture, might make our code simpler and easier, but they don’t make our system any less complex. They just move that complexity to another place.
Testing, deployment, security, operations — they’re just a few of the other concerns that you need to take into consideration when you’re going through this journey. Actually, I think we have seen a lot of these concerns being discussed over the morning today, and we’re going to see some of them also this afternoon. It’s a very important thing to take into consideration. You have an advantage, though: you have been applying domain driven design, so you now have a better understanding of what the business wants and how to prioritize, and also a better idea of how you can tackle all of these concerns and how you can measure that they are actually giving you the value that you need.
For the rest of the talk, we are not going to focus on all of them; we’re just going to choose one. We’re going to focus on operations, and specifically on observability, which is the ability to interrogate your system and get accurate answers that improve your understanding of it. It encompasses any questions that you might want to ask your system, and those questions are not limited to errors or anomalies. They can be about anything.
Why is that important? Because our systems are increasingly complex. It’s not enough to react gracefully to errors. It’s not enough to look at a snapshot of the state of our system. It is very important that we can understand and ask questions of our services that give us insight into what they’re up to. When we start creating new services, we should treat observability as a first-class citizen and start baking in some of the techniques and attributes that will make our microservices observable.
What does it mean for a microservice to be observable? It means that you have to have a clear strategy for how you’re going to go about logging, monitoring, alerting, and tracing, because none of these things will be the same as what you used when you had a monolith. We’re going to go through each one of these and talk about them in a little more detail.
Let’s start with logging. Logging is your building block for everything else. What you were doing up until now might not be the right thing to do. Typically, you were accessing the production box, copying the application log, and opening it in a text editor. That’s not the right thing to do, and it’s also not something that is going to scale when you have tens or hundreds of services. Also, maybe you were not thinking about the format of what you were logging or where you were logging to. That’s another thing that is not going to scale.
What can you do about that? Well, the first thing is to start looking at how you can implement something called the log aggregation pattern, which should be pretty standard by now. You don’t need to have microservices to implement this pattern; actually, one of the boxes on the left will probably be your monolith. The log aggregation pattern says that you have different services with different tech stacks that are going to be writing logs, but you want something that collects those logs and transforms them into a structure that can be easily consumed by other systems downstream.
Those can be your monitoring tool, or a log visualization tool where you can query all of those logs — that’s the idea here. We should set up our new services in a way that the logs can be collected, and you should also be thinking about building this infrastructure that you don’t currently have in place. That’s something that should happen when you extract your first microservice. It shouldn’t happen as an afterthought a month or a few months later.
I don’t know how many of you are familiar with the ELK Stack, but this is a very standard way of implementing this pattern. There are other ones you can use, for example Splunk. The library that each one of the microservices uses to implement or to write logs is really not the most important thing, but what is important is what they log and what is the structure of that. You’re going to have a number of different microservices, if they don’t share the same structure, then it’s going to be very difficult for anyone to try to effectively query and use the visualization tool.
Make sure that you have some kind of standard. Let’s say we’re going to be logging in JSON: these are the key-value pairs that are mandatory for all logs, and then you can put the ones that are specific to your service after that. You can specify which ones you always want, so you can index them in whatever data store you decide to put them in. As with any of the things we have discussed, you probably won’t get it right the first time, so just keep iterating as you learn more about your systems and your services.
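A minimal sketch of what such a shared JSON log format might look like, using Python's standard logging module. The mandatory field names here (`timestamp`, `service`, `level`, `message`) are just an example of the kind of standard a team could agree on, not a prescribed schema.

```python
# One JSON formatter every service could share: a few mandatory fields,
# plus service-specific extras passed per log call.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # Mandatory keys, identical across every microservice:
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Service-specific keys, supplied via `extra={"context": {...}}`:
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("delivery")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service="delivery"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits one JSON line that Logstash/Fluentd can ship without extra parsing:
logger.info("delivery scheduled", extra={"context": {"order_id": "o-42"}})
```

Because every service emits the same mandatory keys, the visualization tool can index and query across all of them consistently.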
The second thing is monitoring. Monitoring has a goal, and that goal is to have enough information to help you make decisions. You can make decisions related to technology — technical or operational decisions — and you can also make business decisions with this information. Either way, it’s important that you have a centralized place to see all of it. You want to see trends, you want to observe the system over time and test your hypotheses.
When new functionality or a new microservice comes online in production, you want to see and understand the effect it has on the whole system. Probably the tool you’re using for querying your logs has dashboard capabilities, so you don’t need to go to something very sophisticated. You can start with something very simple and keep evolving, and if you feel that the tool is no longer meeting your requirements, you can upgrade to something different.
If I go back to the example of the retailer: as I said, we started extracting the delivery service, so for us, looking at how that service was behaving was really important, and we started monitoring certain metrics — data points, response times, and things like that. Then we also started to understand that this had an impact on the business. After all the conversations we had over the process, we had a better understanding of what that impact might be, and we wanted to test our hypothesis.
Our hypothesis was that replacing and extracting the microservice would not affect sales. The way that we tested that was by doing something called semantic monitoring, in which we automated certain flows and ran them constantly in production. We grabbed that data and plotted it in our monitoring tool. Now we can see not only how every single change affects our service, but also how it affects the whole customer journey.
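A sketch of what one such semantic check might look like: an automated business flow that runs repeatedly in production and emits a metric point for the dashboard. The flow steps are stubbed out here; in a real setup each step would call the live system, and the step names are purely illustrative.

```python
# Semantic monitoring: exercise a real business flow, report the outcome
# as a data point the monitoring tool can plot over time.
import time

def run_checkout_flow() -> dict:
    """Run one synthetic checkout and return a metric point."""
    started = time.monotonic()
    steps_passed = 0
    try:
        # Each step would hit the live system (hypothetical steps):
        for step in (create_order, pay_order, confirm_order):
            step()
            steps_passed += 1
        ok = True
    except Exception:
        ok = False
    return {
        "check": "checkout_flow",
        "success": ok,
        "steps_passed": steps_passed,
        "duration_ms": round((time.monotonic() - started) * 1000, 1),
    }

# Stubs standing in for real HTTP calls against production:
def create_order(): pass
def pay_order(): pass
def confirm_order(): pass
```

Plotting `success` and `duration_ms` over time lets you see whether an extraction actually degraded the customer journey, which is exactly the hypothesis being tested.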
As I mentioned before, it’s very important that all of this is in one centralized place that is accessible to everyone in the teams. These dashboards and the metrics you monitor are not just interesting for the operational team or the business people; they are interesting for everyone, and you can use many tools for this. Kibana is a very simple option — it gives you very nice visualization and dashboard capabilities, and if you’re using Kibana for log aggregation, you might as well also use it for this — but you can go for something more complex, like Prometheus or Datadog, which can also give you very nice functionality.
Regardless of what you use, be careful not to add too much stuff; you might end up with a lot of noise that isn’t helping you. You might have a lot of things in there that could be useful, but they won’t have any value because everything is just lost in the noise. The way to avoid that is by continually reviewing what you’re monitoring, reviewing those metrics, and consistently removing the ones that are no longer of interest to you. If they become interesting again, you can put them back, but remove them if you don’t use them.
Alerting is another topic. Alerts are things that need an immediate reaction, so it’s very important that an alert has an action. The alert will be triggered by a service, but the person reacting to it will probably be someone who didn’t build the service or doesn’t own it. It’s very important that your alerts actually say what the actions are, or what can be done to close them. You can also start looking at standardizing these: what are the important fields an alert should have, and how can all of your microservices adopt those templates?
Another thing is that not all alerts need human interaction. Not every single thing that happens to a service needs someone to press a button at 3 a.m.; you should aim to have most of your alerts resolved through automation. A very good example of this is autoscaling: if you’re using a cloud provider for your infrastructure, you get autoscaling more or less for free, so make use of it. There’s no point in getting someone involved when a process can be implemented that will resolve it. Kibana also has some alerting functionality, so you can define alerts there; Prometheus as well, and OpsGenie is another tool that I have used quite successfully to set up all of those alerts.
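A sketch of what a standard alert template could look like, combining the two points above: every alert names its action and runbook, and alerts that automation can resolve never page a human. All field names are illustrative, not tied to any particular alerting tool.

```python
# A template every service's alerts could follow.
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    summary: str
    severity: str          # e.g. "page" (wake someone) vs "ticket" (next day)
    action: str            # what the on-call person should actually do
    runbook_url: str       # where the detailed steps live
    auto_resolvable: bool = False  # can automation (e.g. autoscaling) handle it?

def should_page(alert: Alert) -> bool:
    # Only wake a human for alerts that automation cannot resolve.
    return alert.severity == "page" and not alert.auto_resolvable

high_cpu = Alert(
    service="delivery",
    summary="CPU above 90% for 10 minutes",
    severity="page",
    action="Check autoscaling group limits before scaling manually",
    runbook_url="https://wiki.example.com/runbooks/delivery-cpu",
    auto_resolvable=True,  # autoscaling should absorb this on its own
)
```

Because `high_cpu` is auto-resolvable, `should_page` returns `False` for it: the cloud provider's autoscaling handles it, and nobody gets woken at 3 a.m.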
Last, but not least, we should also be looking at tracing. Tracing is going to be what you use for debugging. At some point in the near future, you will need to look at production errors, or try to find some bottleneck in your system, or whatever it is. That’s going to be different from what you were doing in a monolith, so you should start thinking about how you’re going to solve the tracing problem.
This is super important because you’re going to have requests that travel through your system, jumping from microservice to microservice. In some cases that communication might be asynchronous, in other cases synchronous; either way, it’s going to be very difficult to trace all of that. When everything was in a monolith, everything was probably in a single transaction, so it was much easier.
The first thing that you should do is add a trace ID or correlation ID to every request that enters your system. Most of the frameworks you will use to build your microservices have a capability to add that trace ID, so that should be easy enough. Then, if you need to go and debug something in production, you can go to your log visualization tool, search by that trace ID, and get a picture of all the places that request went through. You can then start putting the puzzle together; you will have all of those pieces, possibly even out of order, so you will need to work on getting them sorted and extracting the information you need.
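The core of the trick is tiny and worth seeing: reuse the correlation ID if an upstream service already set it, otherwise generate one, then attach it to every log line and every outgoing call. The header name and function shapes below are assumptions for illustration, not any specific framework's API.

```python
# Correlation-ID propagation: one ID per customer request, shared by
# every service the request touches.
import uuid

CORRELATION_HEADER = "X-Correlation-ID"

def correlation_id_for(headers: dict) -> str:
    # Reuse the upstream ID so the whole journey shares one searchable ID;
    # only the edge service (first hop) ever generates a fresh one.
    existing = headers.get(CORRELATION_HEADER)
    return existing if existing else str(uuid.uuid4())

def handle_request(headers: dict) -> dict:
    cid = correlation_id_for(headers)
    # Every log line carries the ID, so the log tool can find all pieces:
    log_line = {"correlation_id": cid, "message": "handling request"}
    # Every downstream call forwards the same ID:
    outgoing_headers = {CORRELATION_HEADER: cid}
    return {"log": log_line, "downstream_headers": outgoing_headers}
```

Searching the log visualization tool for one ID then returns every log line the request produced, across all services.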
This type of work can be a bit tedious, and it can require a lot of knowledge of the system as a whole and of the different microservices. That might not be enough for you; you might want a tool that puts the picture together in a simpler way. That’s when you can use distributed tracing: tools that not only attach the trace ID to the request, but also send the information to a centralized place that paints the picture for you. You get an image of everywhere your request went, with a lot more detail than you could possibly assemble by hand.
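To show what such a tool records, here is a toy sketch of spans: units of work that share a trace ID, point at their parent, and get reported to a collector (a plain list here; Zipkin or similar in real life). This is purely illustrative, not a real tracer.

```python
# A toy model of distributed tracing spans.
import time
import uuid

COLLECTOR = []  # stand-in for a tracing backend like Zipkin

class Span:
    def __init__(self, operation: str, trace_id: str = None, parent_id: str = None):
        self.operation = operation
        self.trace_id = trace_id or str(uuid.uuid4())
        self.span_id = str(uuid.uuid4())
        self.parent_id = parent_id
        self.start = time.monotonic()

    def child(self, operation: str) -> "Span":
        # Children keep the trace ID; the parent link lets the backend
        # reconstruct the request tree and paint the picture for you.
        return Span(operation, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self) -> None:
        COLLECTOR.append({
            "trace_id": self.trace_id,
            "span_id": self.span_id,
            "parent_id": self.parent_id,
            "operation": self.operation,
            "duration_ms": (time.monotonic() - self.start) * 1000,
        })
```

Because every reported span carries both the shared trace ID and its parent span ID, the backend can reassemble the pieces in order, which is exactly the manual puzzle-solving the previous paragraph described, done for you.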
It doesn’t really matter which library the services use to implement these distributed tracing capabilities; there are many you can choose from. What is important is to standardize the format they’re going to use. OpenTracing seems to be the best choice right now. Most of the tools that you will use to visualize and manage this stuff — Zipkin, Datadog, Dynatrace — all accept OpenTracing as a format.
We have reached the end, so here are the takeaways — there are a few of them. I packed a lot of stuff into this presentation, so there are a lot of things you can take away from it. This is really not a free lunch; there are many things you need to consider. The most important one is that we need to pay attention to the business value we can provide and how we can align our technical interests with those business interests to create win-win situations.
Prioritize based on that business value, and also, don’t wait until you have everything clear and sorted. Start with what you know and keep evolving as you learn more. Also, make sure that you include operational concerns, and take advantage of all the knowledge you have acquired by using domain driven design and by running EventStorming workshops to identify what logging, alerting, monitoring, and tracing needs to happen in your new microservices.