Friday, September 22, 2023

What we learned after I deleted the main production database by mistake

Image from Dev Asangbam (Unsplash)

“Well, fk…” I thought to myself while I waited for my boss to answer my call. I’m sure the last thing he wanted to hear on a Friday morning was that one of his senior leaders had just manually deleted a core backoffice database by mistake.

The phone’s ringtone sounded like my career’s dying heartbeat on a hospital heart monitor. In those moments, a shining beacon of inspiring leadership truly makes a difference. Like my boss’s first words: “How the fk did that happen?!”.

Well, let me tell you how:

This happened several years ago. I was working at a fairly young e-commerce company, leading two teams responsible for developing several core backoffice functionalities. The backoffice managed information exposed in the frontends available to users worldwide, maintained by different teams. Despite being relatively young, the company had a global reach and a user base in the hundreds of thousands.

One of my teams developed the main backoffice product catalog, which supported most backoffice flows and tools, ranging from stock and product information management to order fulfillment flows and everything in between. It was a critical component, as most backoffice services, applications, and business processes accessed it one way or another. The following diagram gives an idea:

Simplified architecture of the denormalized read model

The platform consisted of a microservice architecture; the product catalog was a read model with denormalized information, built from event streams from several different domains, each managed by other microservices. The product catalog was backed by an ElasticSearch database containing 17 million products, with information ranging from product metadata, stock, production information, availability, pricing, etc., exposed through a REST API. We used ElasticSearch mostly due to the high number and diversity of filters (over 50 different filters, some with text searches).
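To make the idea of a denormalized read model concrete, here is a minimal sketch of what one catalog document might look like and how an event from a single domain would be merged into it. The field names and sections are illustrative assumptions, not the actual schema from the incident.

```python
# A sketch of one denormalized catalog document. Each top-level section
# originates from a different upstream domain/microservice, merged into a
# single searchable document so dozens of filters can hit one index.
product_document = {
    "product_id": "SKU-123456",
    "metadata": {"name": "Leather Jacket", "brand": "SomeBrand"},
    "stock": {"warehouse_a": 12, "warehouse_b": 3},
    "availability": True,
    "pricing": {"currency": "EUR", "price": 149.90},
}

def merge_domain_update(document, section, payload):
    """Apply an event from one upstream domain to the denormalized view."""
    updated = dict(document)   # keep the previous version untouched
    updated[section] = payload
    return updated

# e.g. the availability domain publishes a change for this product:
updated = merge_domain_update(product_document, "availability", False)
```

The value of this shape is that a consumer query never has to join across services; the trade-off is that the document must be rebuilt whenever any upstream domain changes.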

No one had direct write access to any database in any technology (we used several technologies depending on the use case, from SQL Server to MongoDB and Cassandra). However, ElasticSearch was the exception as it was traditionally managed by engineering teams rather than Infra or DBAs.

Unlike other database technologies, ElasticSearch is accessed through a REST interface. Typically, URLs have the following format (we were using ElasticSearch version 5 at the time):

cluster_endpoint/index/type/id

The type was dropped in newer versions.

Any kind of operation is done through an HTTP call: what you would otherwise do with a SQL script, in ElasticSearch you do with an HTTP request. For example, adhering to REST guidelines, if you have a product catalog index (an index in ElasticSearch is more or less the equivalent of a SQL table) and you want to get a specific product, you do a GET on cluster_endpoint/index/type/id. The same endpoint updates that product with a PUT or PATCH, deletes it with DELETE, or creates it with a POST or PUT. The same applies to the other parts of the URL: a GET on cluster_endpoint/index/type returns the type information, and the same verbs create, delete, or update the type. Likewise, cluster_endpoint/index returns, updates, deletes, or creates the index, depending on the verb.
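The mechanics above are the crux of the whole incident, so here is a small sketch of how an ElasticSearch 5 URL maps to entirely different operations depending only on the HTTP verb. The cluster endpoint, index, and type names are placeholders, not the real ones.

```python
# Build the (method, url) pair an ES 5 client would send. The same URL with
# GET returns index info; with DELETE it destroys the entire index.
def es5_request(method, cluster_endpoint, index, doc_type=None, doc_id=None):
    parts = [cluster_endpoint, index]
    if doc_type is not None:
        parts.append(doc_type)
    if doc_id is not None:
        parts.append(doc_id)
    return method, "/".join(parts)

# Fetching one product vs. inspecting vs. deleting the whole index:
get_doc      = es5_request("GET", "http://cluster:9200", "product-catalog", "product", "42")
get_index    = es5_request("GET", "http://cluster:9200", "product-catalog")
delete_index = es5_request("DELETE", "http://cluster:9200", "product-catalog")
```

Note that get_index and delete_index share the exact same URL; only the verb, one dropdown away in Postman, separates a harmless read from destroying 17 million documents.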

It was a regular Friday, running from one meeting to the next, as most days. In the fleeting space between meetings I usually squeeze in ad-hoc tasks, like helping with difficult issues or tasks the teams didn’t have the autonomy to do. In this case, there was a business request to export some data with filters that weren’t available in the API. It wasn’t a typical operation, but given the subject’s business urgency and impact, we decided to lend a hand.

In the fifteen minutes I had before the next meeting, I joined one of my senior team members to quickly access the live environment and run the query. Since direct access to ElasticSearch is essentially a REST API, we typically used Postman to make the requests.

My colleague assisted through remote screen sharing, a practice I usually followed as a sort of live review of any operation on a live environment. First, I wanted to test connectivity to make sure I had the right URL, so I copied the live endpoint and index name (something like cluster_endpoint/index_name, as discussed above) and submitted a GET request. If you’re familiar with the Postman interface, you might recall that you choose the HTTP action from a dropdown:

Postman interface to choose HTTP action

Unfortunately, to my horror, after submitting the request I noticed the DELETE action was selected instead of GET. Instead of retrieving the index information, I had just deleted the index.

The request took a few seconds to be acknowledged, and I immediately pressed cancel. The cancel action instantly succeeded. A glimmer of hope blossomed in me like the last leaf on a dying tree. Perhaps I had canceled the request just in time, I naively thought.

The option to cancel the request in the Postman interface

That glimmer was quickly blown away by the stark wind of rationality: the request would still continue server side (in ElasticSearch) despite being canceled on the client (Postman). I ran a regular search without filters on the index to confirm the total count. A query that would normally return 17 million hits returned a few hundred (the service consumed around 70 events per second; those few hundred were the products created or edited in the meantime).

And just like that, the main backoffice product catalog comprising a denormalized view of 17 million products with information from dozens of microservices all over the platform was permanently deleted, along with my self-esteem.

I called my boss, and we quickly started a war room as every other area began to report problems. Since it was essentially a read model, it wasn’t the source of truth for any specific information, so we “just” had to rebuild the information from all the other services.

We had a few options:

ElasticSearch doesn’t have a way to evolve a schema when breaking changes occur; the strategy is basically to reindex all the information into a new index. To account for these situations, we had a component that rebuilt every product from scratch by fetching data from every other microservice through synchronous REST APIs. It was also helpful for solving consistency issues caused by bugs or incidents in the upstream services. However, it took 6 days to fetch the data for all 17 million products. Either way, we started it running right away.
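The rebuild loop can be sketched roughly as follows. This is a simplified model, not the real Catalog Updater: the fetchers below are local stubs standing in for the synchronous REST calls to each upstream service.

```python
# Batch rebuild: for every product, call each domain's API (one or more
# synchronous requests per product) and persist the merged document.
def fetch_stock(product_id):
    return {"units": 10}        # stand-in for a REST call to the stock service

def fetch_pricing(product_id):
    return {"price": 99.9}      # stand-in for a REST call to the pricing service

def rebuild_catalog(product_ids, persist):
    for product_id in product_ids:
        document = {
            "product_id": product_id,
            "stock": fetch_stock(product_id),      # sequential sync calls per
            "pricing": fetch_pricing(product_id),  # product are why 17M docs
        }                                          # took on the order of days
        persist(document)

rebuilt = []
rebuild_catalog(["p1", "p2"], rebuilt.append)
```

With dozens of upstream services instead of two, and network latency on every call, the cost per product multiplies, which is exactly what made the 6-day figure plausible.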

The architecture of Catalog Updater — the component that rebuilds the catalog

Another option we had to mitigate the issue was taking advantage of the event streams. Most services could republish events when needed, so we also asked the most critical areas to start replaying data, which would be consumed along with the changes from normal usage.

However, where we got really lucky was that we had made a breaking schema change a few days earlier. As discussed before, doing so requires creating a new index version and reindexing all the information, a lengthy process during which both versions are kept updated with the most recent changes. We had decommissioned the old index only a few days before, and since the new functionality that required the breaking change wasn’t that critical, we simply switched back to the old version. The data was a few days behind, but that was far better than having nothing. The other two processes eventually brought everything up to date.
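The dual-write window during a breaking schema change is what made the old index a viable fallback. Here is a minimal in-memory sketch of that idea; the index names and the structure are illustrative assumptions, not the actual implementation.

```python
# During a reindex migration, every live change is applied to both index
# versions, so flipping back to the old one is just moving a pointer.
class CatalogWriter:
    def __init__(self):
        self.indices = {"catalog_v1": {}, "catalog_v2": {}}
        self.active = "catalog_v2"   # new schema version serves reads

    def apply_change(self, product_id, document):
        # Migration window: dual-write so both versions stay current.
        for index in self.indices.values():
            index[product_id] = document

    def switch_to(self, index_name):
        # Rollback (or cutover) without moving any data.
        self.active = index_name

writer = CatalogWriter()
writer.apply_change("p1", {"name": "jacket"})
writer.switch_to("catalog_v1")   # fall back to the old version
```

In the real incident the old index had already stopped receiving writes for a few days, which is why the fallback data was slightly stale rather than fully current.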

Backups vs Rebuilding Speed

An old discussion arose about the need for backups. We had backups for most databases, but no process was in place for the ElasticSearch databases. That database was also a read model, and by definition it wasn’t the source of truth for anything. In theory, read models shouldn’t need backups; they should be rebuilt fast enough to cause minimal or no impact in case of a major incident. Since read models usually hold information derived from somewhere else, it is debatable whether they justify the monetary cost of maintaining regular backups. In practice, though, rebuilding a model within a timeframe that has no perceived impact is easier said than done. Rebuilding a read model with hundreds or thousands of records shouldn’t be an issue; one with several million records built from dozens of different sources is a completely different matter.

We ended up with a mix of the two. We refactored the rebuild process to the point where it went from 6 days to a few hours. However, due to the criticality of the component, a few hours of unavailability still had a significant impact, especially during specific timeframes (e.g. sales seasons). We had a few options to reduce that time further, but they started to feel like overengineering and incurred substantial additional infrastructure costs. So we decided to also run backups when the risk was higher, such as during sales seasons and other business-critical periods.
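The hybrid policy can be expressed very simply: fast rebuild is the default recovery path, and snapshots are taken only during windows where an incident would hurt most. A sketch, with made-up dates and a hypothetical window list:

```python
from datetime import date

# Business-critical windows during which ElasticSearch snapshots run.
# These dates are purely illustrative.
CRITICAL_WINDOWS = [
    (date(2023, 11, 20), date(2023, 12, 2)),  # e.g. a Black Friday season
]

def snapshots_enabled(today):
    """Back up only when a rebuild-from-scratch delay would be unacceptable."""
    return any(start <= today <= end for start, end in CRITICAL_WINDOWS)
```

The design choice here is explicit: backups are treated as insurance priced against the rebuild time, not as a blanket default for a read model.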

Horizontal Scalability Was a Lie

One of the most advertised advantages of microservices is the ability to scale horizontally. The often inconspicuous detail is that, by relying solely on synchronous APIs (as depicted in figure 4), horizontal scalability quickly becomes a fallacy. The component responsible for rebuilding the read model took 6 days to finish, and in theory we could reduce that time by orders of magnitude by scaling it horizontally. The problem is that it relied on synchronous REST APIs to retrieve the information: it requested data from every other microservice through REST, built the denormalized view, and persisted the state. Scaling it would trigger a large volume of requests to the other services, which weren’t prepared for the considerable additional load and would need scaling themselves. This would set off a chain reaction that would eventually put the whole platform at risk, compounded by the fact that most of those services were heavily database-dependent, so their databases would also need scaling.

We did scale, but by a conservative amount; even so, repercussions appeared as other services started to struggle. Looking back, the whole thing looked more like a monolith than a truly decoupled architecture. But a distributed one, which is worse in every possible way.

When refactoring the component, we took a different approach, relying solely on event streams, which have challenges of their own but give the system a truly decoupled nature. Scaling the component impacts only its own resources, making the design genuinely horizontally scalable. A common design challenge is whether events should be larger or smaller (I detailed the topic before in this article); to feed read models, larger is usually better, depending on the use case. An interesting strategy we applied was using document events with Kafka compacted topics, which helped immensely both in speed and in the ability to scale (I will detail the technical implementation in an upcoming article to keep this one from becoming overly long). This approach turned the rebuilding strategy from batch processing into stream processing. Instead of requesting data through HTTP, the data is readily available on the event streams, which is much faster: it enjoys lower network latency and doesn’t depend on an intermediary service fetching data from a database, only on the event stream itself. The truly decoupled nature of event streams also made the whole process horizontally scalable, without worrying about unexpected impacts on other services.
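The reason a compacted topic speeds up rebuilds can be shown in a few lines. Kafka log compaction keeps only the latest record per key, so replaying the topic yields current state directly, with no REST calls to upstream services. A minimal in-memory simulation (the topic contents are invented for illustration):

```python
# Model log compaction: replaying in offset order, the last value per key wins.
def compact(topic):
    latest = {}
    for key, value in topic:
        latest[key] = value
    return latest

# Document events, keyed by product id, each carrying the full document.
topic = [
    ("p1", {"price": 100}),
    ("p2", {"price": 50}),
    ("p1", {"price": 90}),   # later event supersedes the first "p1" record
]

catalog_state = compact(topic)  # the rebuilt read model, no HTTP involved
```

Because each consumer partition can be replayed independently, adding consumers scales the rebuild without generating any load on the services that originally produced the events.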

Role Based Access

One of the arguably most obvious actions we took was to implement role-based access control. We were using an older version of ElasticSearch that only allowed very basic user authentication; the alternative was X-Pack, which was a paid feature for that version. In more recent versions, X-Pack security is included in the free license.

We migrated to a more recent version of ElasticSearch (version 7) and implemented different roles for reads and writes. In the end, only applications should be able to regularly write directly to the database; users should, at most, be able to read.
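The read/write split can be expressed as ElasticSearch security role bodies (the kind submitted via PUT _security/role/&lt;name&gt; in version 7). The index pattern below is a placeholder, not the real index name:

```python
# Role granted to humans: read-only on the catalog indices.
read_only_role = {
    "indices": [
        {"names": ["product-catalog*"], "privileges": ["read"]},
    ]
}

# Role granted to service accounts only: applications may also write.
application_writer_role = {
    "indices": [
        {"names": ["product-catalog*"], "privileges": ["read", "write"]},
    ]
}
```

With this split in place, the original mistake becomes structurally impossible for a human account: a DELETE on the index from my credentials would simply have been rejected.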

Processes are to blame, not people

A common saying I always tell my teams when something goes wrong, and as a general guideline, is that processes are to blame, not people. We need to understand which part of the process failed and find ways to change it, so that no one, whether a new joiner or a seasoned employee, can make the same mistake again.

This is something I truly believe, and it is chiseled into my way of leading and handling incidents. It didn’t prevent me from feeling like a complete idiot, and even today, years after the incident, it still sounds like I’m making excuses; but deep down, the rationale behind it is sound. We changed the way we accessed live data: no one should have direct write access. Even read access became highly discouraged, as a rogue query can have dire impacts on resources, especially with ElasticSearch, where complex queries (with deep pagination, for example) can crash the cluster (e.g. out of memory in the client nodes). The point isn’t to narrow the teams’ autonomy, but to protect people from doing the wrong thing.

Ad hoc requests were passed to the live engineering teams that managed those kinds of requests, ideally without direct database access. Recurring manual tasks were integrated into the corresponding services’ functionality and properly validated through an application layer, which prevented unwanted deletes and overwhelming queries. Overall, the main takeaway is to guarantee people have the means to do their jobs and answer business requests with proper tools and, most importantly, in a sustainable and safe way.

I had always read about these kinds of incidents but was sure they would never happen to me. “I have a process,” I naively thought; “I don’t take this kind of action lightly.” Sometimes all it takes is a fleeting moment, a split second of distraction, to leave a wake of irreparable damage. I was forever humbled by the experience, a cautionary tale I sometimes tell my teams to show that even their boss makes the worst kind of mistakes, and that, ultimately, processes exist to protect us from ourselves and our undying stupidity.

Feel free to check my other articles.




