From aa9f2ae06623bf191d3432995cdba8aa865678f7 Mon Sep 17 00:00:00 2001 From: Andrei Sandu Date: Fri, 16 Apr 2021 12:26:18 +0100 Subject: [PATCH 1/3] Added latest draft for blog post about backup restore automation in sbg_core_mysql --- _posts/2021-04-16-if-you-can-describe-it.md | 165 ++++++++++++++++++++ 1 file changed, 165 insertions(+) create mode 100644 _posts/2021-04-16-if-you-can-describe-it.md diff --git a/_posts/2021-04-16-if-you-can-describe-it.md b/_posts/2021-04-16-if-you-can-describe-it.md new file mode 100644 index 0000000..320c923 --- /dev/null +++ b/_posts/2021-04-16-if-you-can-describe-it.md @@ -0,0 +1,165 @@ +--- +layout: post +title: If you can describe it, you can automate it +date: 2021-04-16 +summary: TBC +category: Devops +tags: Percona,pxc,mysql,backup,automation +author: andrei_sandu +--- +## The design pattern + +When you hear stories involving big, monolithic code repositories, they tend to end in, or at least centre around, a lament against legacy code and un-maintainability, +with an optional ridiculous story about a 1000 line single SQL statement. The story of core_mysql is somewhat odd in that category - while absolutely a big, seemingly +monolithic structure, this has a different, more content ending than your usual anecdote of this kind. Put simply, core_mysql is Core Customer Tribe's solution to the +problem of persistent storage, and it has been iterated upon whilst in service for a long time. It has been two years since we put it live for the first time, in which +time more than 60 new versions have been released, expanding well beyond its initial functionality. Initially built for a single use case, with a specific if not so +tight deadline, the Chef-based solution has grown to cover multiple use cases, handle encryption at rest and in transit, multiple databases, users, custom configurable +user permissions, countless monitoring items, backup, disaster recovery and other fancy-sounding features. At the time of writing, the cookbook turned design pattern +has helped deploy five highly available database clusters handling diverse data, from customer Safer Gambling data to Payments, Document Verification data, and static +assets(don't ask!). And, for what it's worth, we are actually really proud of it! Without going into too much detail, core_mysql is a design pattern that uses +[Percona XTraDB Cluster](https://www.percona.com/software/mysql-database/percona-xtradb-cluster) to build and deploy a highly available synchronous cluster. It +also deploys [ProxySQL](https://proxysql.com/documentation/) for service discovery, configures Table Encryption, +[XTraBackup](https://www.percona.com/software/mysql-database/percona-xtrabackup) for data backup, monitoring and alerting via Nagios/NRPE and collectd/Graphite, +and can deploy and configure cross-site asynchronous replication for Disaster Recovery purposes. It is a complex solution, designed to accommodate as many use +cases and provide as many useful features as possible. At the centre of it all sits the sbg_core_mysql Chef cookbook, which installs and configures all the components +needed for this to work. Clocking at fourteen interlinked recipes and seven custom resources, the cookbook has been built upon non-stop since it was first deployed. + +Most components of core_mysql are currently some level of automated - this was the ultimate intention behind it, to make persistent storage available, repeatable, +and easy-to-use across the tribe. However, this was not always the case. As it so often happens in this industry, deadlines meant that we had no shortage of elements +left to be automated at a later date "when we had the time". I like to think we did a great job of delivering on those promises with time. Sometimes out of necessity, +when new requirements arose from the roadmaps of the tribe, other times simply on an ad-hoc basis when there was time, out of passion, or out of the frustration of +some engineer having to deal with manual processes. This post will examine one example of such an arduous manual process I recently automated, the challenges it +presented and the adventure it took me on. The process in question is the restore from backup process. + +## The backup restore process + +While the backup process for core_mysql has always been a question of running a backup script - roughly speaking a push-button job - the backup restore process +has been far less thought out. The process has always been manual, and to put it mildly, it is a very complex operation. Recently, we have better documented the +process as part of a new compliance piece (as well as wanting better on-call documentation in general), but that does not solve the main issue with it: +it's MASSIVE! It is a seven-step, six-page behemoth of a runbook - and it needs to be this big - and has been affectionately called the longest runbook +known to man. Since we rarely had to actually restore from a backup, we have been content with the manual process for a long time. That said, it was a cause +of much toil and frustration whenever the process was needed. There is only so much that knowledge sharing sessions can hope to accomplish, the ideal situation +would be a process that is straightforward. + +SBG often runs what we call a DR Dry Run. We pretend that our live environment has been compromised, and act out our plan for Disaster Recovery, +failing over to a secondary site. In the context of core_mysql, that meant we have to transfer data from our cross-site replication servers into +the database clusters built into the DR site. It was this use case that had got the most use out of the backup restore process, it being the main +cleanup step following the DR Dry Run. I took part in a DR Dry Run in October and finally finished at 7pm after running through the painful manual +process that I had created. It was obvious to me then what a tremendous pain point this process could become in the wrong scenario. The fact we don't +usually need to restore from backup, in my mind, would only amplify the pain when we do need to do so. + +I was taught that if you can describe a process, you CAN automate it. There already was a step-by-step runbook of how the process is performed. +So I decided to make a straightforward, automated process for it. I knew it was possible, it had to be - and I hoped I would learn something from it. + +## Challenges + +There is a reason why the manual runbook was so complex. Aside from the obvious considerations for a backup restore scenario (there will be an outage, +the process is destructive, etc.), there is an important consideration for Table Encryption. A requirement from the very beginning, our solution for +Table Encryption has always been complicated in subtle ways. Without getting into too much detail, we used a readily available Table Encryption solution for +encryption at rest, which relies on a keyring file for encryption and decryption (essentially a master encryption key, although with a few twists that +fortunately are not relevant for the problem at hand). This posed a problem regarding the storage and integrity of the encryption key, as the backups we +take are encrypted (so long as Table Encryption is used). Restoring the backup, therefore, does not guarantee we get the data back, unless we also have the +right encryption key to unlock that data. Known as "Step 5 of the longest runbook known to man", this problem constituted the most esoteric and confusing +obstacle to automation, and it involved interacting with Hashicorp Vault and understanding the custom (and convoluted) way we used it to preserve the +integrity and history of the encryption keys we use. The cookbook deals with key rotation by storing and archiving all encryption keys in Hashicorp +Vault - the company-approved choice for all our secret-keeping needs. This way, both the integrity and the history of our Encryption Keys is preserved via Chef, +courtesy of a big custom-designed resource. We had Chef recognise when the key was missing on disk so it can restore it from Vault; similarly, we had Chef +recognise when the key had rotated, so that it could have the archived key still available if needed. Most relevant for the problem at hand, if we were to +restore from a backup, we would have to make sure that the encryption key on disk was the same as the one present at the time we had taken said backup. + +## The solution + +Early on, a naive solution in the shape of a Jenkins pipeline has quickly presented itself. We are dealing with multi-server clusters, after-all, and Jenkins +is designed for just such requirements. It is a well understood tool, synonymous with push-button jobs, and (in theory) should be able to achieve what I seek. +The only obstacle to this approach was the problem of Encryption restore, and the fact that I would need to make Jenkins interact with our Hashicorp Vault API. +I was initially quite confident that I would learn what I needed to make it work quickly, in an epic feat of Learning and Development, however, after a brief +read-through of everything I could find about Hashicorp Vault API, I eventually had to accept I would not be able to find the odd, specific solution I needed. +In truth, I failed to find a way to interact directly with the Vault API through Jenkins altogether. If I were brutally honest with myself, there probably +is a way to do so, I just never found the right resource for it. All the same, I had grown impatient, and was longing for progress. + +I realised I needed a different approach if I were to deliver anything. It had occurred to me that all these issues I had solved before, that I did not need a +brand new, shiny solution. Instead, I decided to stay faithful to the existing design pattern, and harness the techniques my squad had perfected in dealing with +all things core_mysql (and many other situations, as well). I decided to code what I know, and rely on what I already have and know. If there is one technical skill +I excel at, it is Chef and Ruby! I had worked on core_mysql for three years, and there is a reason why Chef is the heart of it all. I know Chef! I have used it +before to interface with Hashicorp Vault! I could make it work in the context of a backup restore! Likewise, I have solved automation problems before: I have +automated Replication Setup using Bash scripts. I have written the manual runbook, I should be able to translate it into Bash. This would strip away the glamour +of a Jenkins pipeline, with a literal push-button, but on the other hand it would allow me to make progress faster (as well as allowing me to establish a +solid baseline to expand upon). + +The decision was made. A couple of coding sessions, and a handful of testing sessions later, the solution was proven and production-ready. A Chef-controlled +Bash script would be installed on the mysql servers, which takes the backup file paths as parameters and performs the backup restore steps. The brunt of the +necessary work was in truth quite uninteresting - the sequence of steps were already there, in the manual runbook, simply awaiting to be translated into Bash. +I made sure the script fails safely as much as possible, and tested the different failure states on one of the test clusters. I endeavoured to make the script +as verbose as possible, and included a confirmation step as an extra safety measure. But I would like to focus attention on two aspects that I found surprisingly +insightful, and interesting to deal with, respectively. + +First, I want to highlight that while production-ready and deployed, the present solution is still very much bare-bones. Aside from the fact that it is +essentially just a Bash script, rather than an intuitive Jenkins Pipeline, it does not deal with taking the cluster down, and it assumes that the backup +originated from the same server. Yet, I see this as a strength, rather than a weakness. The solution is good enough for now., which has always been the +underlying theme with core_mysql . We have always shipped solutions that are "good enough", then continuously built upon them every step of the way. This +is how this huge design pattern has come to grow this big, and this is why it is successful. The present backup restore solution has achieved what was set +out - the runbook is a third the size of the old manual one, it is far less prone to error or failure, and it shall save a lot of time, especially in DR +exercises. Everything else will come later, as soon as we have the time to improve it. And we will, eventually, because we care about it. I know, for +example, that given the present baseline, I can integrate it in a Jenkins pipeline at some point. + +Secondly, the focus of this whole endeavour has been the problem of the Encryption Keys, and interacting with Hashicorp Vault. As mentioned previously, +we have used Chef resources to interact with Vault's API in the past. It made sense to exploit that capability further, so I created a similar custom Chef +resource, aimed solely at restoring the correct Encryption key to disk. But putting it in Chef meant it would be evaluated every time Chef ran, which, +given we have daemonised chef-client on pretty much all our servers, would be quite often. In order to ensure it only executed when a backup restore is +ongoing, I needed to tie this Chef resource to the Bash script somehow. The way to achieve this is simple, but effective. I had the Bash script place a +file on disk as part of the backup restore process, containing the necessary information for Chef to know which Encryption key to restore. Then, I had +the new Chef resource listen for that file on disk, and only execute if it is present. Lastly, I had Chef delete that file whenever it finds it. That way, +the only time Chef tries to restore a specific Encryption key is when it runs right at the end of a backup restore. I find it almost poetic how Chef +controls the Bash script, but in turn, that Bash script controls Chef's execution whenever it is triggered. + +## The Belly of the Beast + +It was a sunny early spring Friday, when it became clear that something was certainly wrong. The evening before I had tried and failed to perform a now +routine test of the backup restore process on one of our test clusters. It was not the first time I had seen this kind of error, but this time there was +no doubt: the process did not work. I do not mean the automated script had a bug, but the process itself was flawed - I had tried every backup in the +last week, both the ones scheduled each night, and backups taken manually. I had tried the manual runbook, switching the order of steps - nothing was +working. I was off the next week, and on Thursday was scheduled the first DR Dry Run that would have used the new shiny automated process for backup restore. + +I was not to know yet, but the "longest runbook known to man" was missing a step. It transpires that xtrabackup (the backup tool we use) has a --prepare +command, designed to ensure the backup is ready to be restored. This option deals with encryption when present, and is important in order to ensure the +data does not get corrupted after it gets restored. However, it appears this step is not always a deal breaker, as we have successfully restored backups +without using it in the past. But not reliably - in fact, at the second demo we held within the tribe, and the one which we recorded on video, the backup +restore failed. At that time, it was only that backup that failed to restore, so we wrote it off as one corrupt backup - an exception to the rule. + +This terrifying Friday got me thinking about the importance of testing our processes. As previously stated, backup restore was seldom used until recently. +It is this infrequency that allowed a potentially devastating flaw to slip through in our runbook - and by extension, in my script. Conversely, +it is the dawn of routine testing for this process that uncovered it, and prevented it surfacing when we need the process to work perfectly/in an +actual incident. It also got me to appreciate my team's culture of covering for one another. Early in the day, two of my colleagues offered to look +into the issue with me, and continue investigations while I was off. They discovered the missing step, and pointed me to the best way to fix the automated +script upon my return! The DR Dry Run completed without too much trouble thanks to them. + +Fixing this bug required me to re-think the way I dealt with Encryption restore, as the --prepare command requires the encryption key to be present +before the data is restored. I ended up refactoring the chef resources, in order to separate the Encryption Restore logic from the rest of the cookbook. +I had the script run chef-client with a custom runlist, before the --prepare command, using only the recipe necessary for Encryption Restore. + +## Conclusions + +First and foremost, this project has been a success. We have already saved a lot of time when we conducted our first backup restore test a couple of +weeks after the project was done, and not too long before the time of writing, and that economy will only grow as time passes. A less tangible boon, +yet I would argue a more important one, is that we have reduced the chances of human error associated with this process. Now, this is a very hard one +to measure, because we rarely ever used the process at all, and when we did, I was usually the one to perform the feat (even so, I remember screwing +up at times, thankfully not in Production). But there is no denying this is the case, and it is a lot more valuable than the time the new script +will save. We would ideally never need to do a restore from backup in Production, but if we ever do, I am now far less terrified of the thought, and +this is thanks to the shortening of the "longest runbook known to man". + +Furthermore, I take pride in the circumstances by which this script has come to be. We identified a point of pain that was extremely rarely felt, yet +we saw its potentially disastrous impact. In other words, we were proactive! This project was not planned out, it was not put on any roadmaps, it was +not prioritised nor refined - at least not to begin with. It was incepted as a Learning and Development project. and yet, it was delivered. What's more, +my team is now looking at formalizing this kind of project born out of L&D. We saw the value in it, and I was empowered to do it. More than that, I was +given the whole team's support when it went wrong. + +For as large a codebase as core_mysql is, it never ceases to amaze me how much more it always seems to need to grow. This is a loved product, one in which +I personally believe, and one which I want to see succeed. It will never be perfect, but it will always be better than it used to be. This little project, +the automation of backup restore, is but the latest example of a set of engineering practices that has guided my team and I, ones that I found myself +contemplating about every step of the way. I see it as nothing but a good example of those principles. I started by exploring shiny solutions, but realized +the value of coding what I know, and building with the tools closest to me. In the beginning, I had imagined a big single-button cluster-level solution, that +solves all the problems, and covers all the possible edge cases, but then I had to accept that if I were to deliver anything, I had to stick to the bare-bones +solution which is good enough for now. I had to remind myself of the difference between "That's enough, it will do" and "This is good enough for now" - that +is, a difference in mentality spawning from caring about my code and vowing to improve it continuously, and incrementally. This design pattern is not just a +monolith the kind of which engineers tell horror stories about, it is a supported, ever evolving thing, which builds upon its past. From b714961f44d59b1d4db8f4825bf1da199a59e7c6 Mon Sep 17 00:00:00 2001 From: Andrei Sandu Date: Fri, 16 Apr 2021 13:44:33 +0100 Subject: [PATCH 2/3] Added summary and tags --- _posts/2021-04-16-if-you-can-describe-it.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2021-04-16-if-you-can-describe-it.md b/_posts/2021-04-16-if-you-can-describe-it.md index 320c923..bef7028 100644 --- a/_posts/2021-04-16-if-you-can-describe-it.md +++ b/_posts/2021-04-16-if-you-can-describe-it.md @@ -2,9 +2,9 @@ layout: post title: If you can describe it, you can automate it date: 2021-04-16 -summary: TBC +summary: The truly Agile team never has enough time to make any product the best it can be. Yet there is a difference between "it will have to do" and "this is good enough _for now_". This is the story of automation, of continuous iterative improvement. A story of a horrendous manual process that was rarely used and never properly understood, and the work to turn it into a better, more automated process. Along on this adventure, you will meet a Core design pattern, a lot of Bash, a cunning Chef hack, a dramatic mysql oversight, and a surprising lesson in agility. category: Devops -tags: Percona,pxc,mysql,backup,automation +tags: bash,chef,hashicorp,percona,mysql,backup,automation author: andrei_sandu --- ## The design pattern From 7f954f7e3339545d972d993c92f6b0fdca5b9420 Mon Sep 17 00:00:00 2001 From: Andrei Sandu Date: Fri, 16 Apr 2021 13:47:48 +0100 Subject: [PATCH 3/3] Capitalized MySQL properly --- _posts/2021-04-16-if-you-can-describe-it.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2021-04-16-if-you-can-describe-it.md b/_posts/2021-04-16-if-you-can-describe-it.md index bef7028..9c5f107 100644 --- a/_posts/2021-04-16-if-you-can-describe-it.md +++ b/_posts/2021-04-16-if-you-can-describe-it.md @@ -2,7 +2,7 @@ layout: post title: If you can describe it, you can automate it date: 2021-04-16 -summary: The truly Agile team never has enough time to make any product the best it can be. Yet there is a difference between "it will have to do" and "this is good enough _for now_". This is the story of automation, of continuous iterative improvement. A story of a horrendous manual process that was rarely used and never properly understood, and the work to turn it into a better, more automated process. Along on this adventure, you will meet a Core design pattern, a lot of Bash, a cunning Chef hack, a dramatic mysql oversight, and a surprising lesson in agility. +summary: The truly Agile team never has enough time to make any product the best it can be. Yet there is a difference between "it will have to do" and "this is good enough _for now_". This is the story of automation, of continuous iterative improvement. A story of a horrendous manual process that was rarely used and never properly understood, and the work to turn it into a better, more automated process. Along on this adventure, you will meet a Core design pattern, a lot of Bash, a cunning Chef hack, a dramatic MySQL oversight, and a surprising lesson in agility. category: Devops tags: bash,chef,hashicorp,percona,mysql,backup,automation author: andrei_sandu @@ -88,7 +88,7 @@ of a Jenkins pipeline, with a literal push-button, but on the other hand it woul solid baseline to expand upon). The decision was made. A couple of coding sessions, and a handful of testing sessions later, the solution was proven and production-ready. A Chef-controlled -Bash script would be installed on the mysql servers, which takes the backup file paths as parameters and performs the backup restore steps. The brunt of the +Bash script would be installed on the MySQL servers, which takes the backup file paths as parameters and performs the backup restore steps. The brunt of the necessary work was in truth quite uninteresting - the sequence of steps were already there, in the manual runbook, simply awaiting to be translated into Bash. I made sure the script fails safely as much as possible, and tested the different failure states on one of the test clusters. I endeavoured to make the script as verbose as possible, and included a confirmation step as an extra safety measure. But I would like to focus attention on two aspects that I found surprisingly