What is DNS?
Simply put, DNS is what allows you to type an address like www.getharvest.com into your computer, phone, tablet, smart fridge, or really any internet-connected device and be routed to our website. DNS is the tool that translates the human-readable domain name into a machine-routable IP address.
The Domain Name System, more commonly known as DNS, is one of the foundational protocols that make up the internet as we know it today. DNS was first introduced in 1983 by researchers at the University of Southern California. In 1984 students at UC Berkeley wrote the first Unix name server implementation, which they called the Berkeley Internet Name Domain, or BIND. BIND remains one of the most widely used DNS server implementations in the world, and until recently we ran it here at Harvest too.
Why is DNS so important?
Because of the decentralized nature of DNS and the incredibly important part that it plays in how internet traffic reaches the correct destination, it isn't uncommon for even the biggest of companies to experience large-scale outages due to DNS issues.
The global outage of Facebook in October of 2021 is a perfect example of DNS issues causing huge outages. Due to a misconfiguration of other essential networking equipment, the servers that Facebook uses to answer all of their DNS queries became unreachable. People all around the world suddenly lost access to Facebook, Messenger, Instagram, WhatsApp, and many other services because their devices were asking "How do I reach facebook.com?" and the servers never responded. The problems at Facebook went deep, including keeping employees from using their badges to access offices and data centers. In some cases they literally had to call in locksmiths to open doors to get the issues resolved. This outage lasted five and a half hours, wiped billions of dollars off Facebook's market value, and cost more than $60 million in advertising revenue.
How do we manage DNS here at Harvest?
A few years back Harvest identified that DNS can be a tricky problem to solve, so one of our engineers went through a process to improve reliability and introduce redundancy to our DNS system. That work was an incredibly good dive into the options available at the time, and at the end of the process we had a fairly robust platform.
The end result of that work was a paradigm that we'll refer to as a "Hidden Primary" setup.
Note: remember how I said that DNS was made in the 1980s? Yeah, there are a lot of problematic terms still officially part of the DNS spec that are being phased out of technology, so we are going to use the terms "Primary" and "Secondary" instead of the official terms.
In our Hidden Primary system we had a couple of DNS servers running BIND that we managed ourselves. They were the Primary source of truth for our domains, but they were not designed to answer DNS queries themselves. Instead, they were set up to forward their domain configuration to two cloud DNS providers, which actually answered the queries of our users across the world. In this case those providers were DNSMadeEasy and Google Cloud DNS.
This system worked reasonably well for Harvest. We managed the DNS servers the same way that we managed the rest of our servers, so they weren't much of a burden. There were some peculiarities of the BIND configuration that would frequently trip people up, but largely things worked well.
Then... well... Harvest moved to the Cloud. More specifically, we moved our applications and services from rented servers running in a datacenter in Chicago to Google Kubernetes Engine (GKE). The details of this are a story for another time. However, the end result was that we had fewer and fewer servers to manage ourselves, until we only had two: the DNS Hidden Primary servers.
What was once a nice and neat system for managing a fleet of servers was now being used to manage only a pair of servers. And here is a secret: SREs don't want to manage servers anymore. We want to focus our attention on the services that our products provide. So we started to brainstorm ideas to change the process.
Okay, but why make a change if it is working now?
We didn't want to just get rid of servers, we wanted to enhance our security, improve the experience, and reduce complexity. So we started to think of ways we would tackle the problem in our world today. We identified some goals and improvements that we could achieve by rethinking how we did DNS:
Enhance Security
- Existing hosts are past their vendor support
- Existing hosts pose a potential attack vector
Reduce Complexity
- Fully deprecate our use of SSH and Ansible
- Standardize on a tool/language that we deeply understand
- Remove the only remaining self-managed hosts left in our environment
Improve Experience
- Introduce the ability to better validate inputs
- Allow operators to catch exceptions/errors more quickly
- Reduce common mistakes
- Provide a mechanism to validate changes prior to implementation
Where did we end up?
After some research and experimentation with a few of the tools available today, we ended up picking one that we were already familiar with: Terraform. Terraform is an open source tool that allows you to define parts of your infrastructure as blocks of code. We use Terraform today to build and manage all of the Google Cloud and Amazon Web Services components that make up our existing platform.
We also decided that the "Hidden Primary" pattern didn't really make sense anymore, so we now use a "Primary/Primary" setup, which keeps our DNS service available simultaneously across two different providers to provide as much reliability as possible.
As part of this process we also decided to change the providers that we were using to deliver our DNS answers to the public. Today we use Google Cloud DNS and Amazon Route 53 to host our most important DNS domains.
Terraform lets us build our own reusable modules, a pattern that we use heavily to ensure consistency in how our cloud resources are built today. So we decided to build a pair of Terraform modules to create and manage our DNS domains and records.
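To give a feel for the idea, here is a minimal sketch of what a dual-provider DNS module could look like. It is an illustration under stated assumptions, not our actual modules: the file path, variable names, allowed record types, and validation rules are all hypothetical, and it presumes the `google` and `aws` providers are already configured by the caller. Only the resource types themselves (`google_dns_managed_zone`, `google_dns_record_set`, `aws_route53_zone`, `aws_route53_record`) come from the official providers.

```hcl
# Hypothetical file: modules/dns-zone/main.tf
# Creates the same zone and records on Google Cloud DNS and Amazon Route 53.

variable "domain" {
  description = "Zone apex without a trailing dot, e.g. \"example.com\"."
  type        = string

  validation {
    # Catch the common mistake of passing "example.com." with a trailing dot.
    condition     = can(regex("^[a-z0-9.-]+[a-z0-9]$", var.domain))
    error_message = "The domain must be lowercase and must not end with a dot."
  }
}

variable "records" {
  description = "Records to create in both providers, keyed by a label."
  type = map(object({
    name   = string       # host part only, e.g. "www"
    type   = string       # limited to A, AAAA, CNAME in this sketch
    ttl    = number
    values = list(string) # CNAME targets should end with a dot
  }))
  default = {}

  validation {
    # Other record types need provider-specific formatting, omitted here.
    condition = alltrue([
      for r in values(var.records) : contains(["A", "AAAA", "CNAME"], r.type)
    ])
    error_message = "Each record type must be one of A, AAAA, or CNAME."
  }
}

resource "google_dns_managed_zone" "this" {
  name     = replace(var.domain, ".", "-") # zone names cannot contain dots
  dns_name = "${var.domain}."              # Cloud DNS expects a trailing dot
}

resource "aws_route53_zone" "this" {
  name = var.domain
}

resource "google_dns_record_set" "this" {
  for_each     = var.records
  managed_zone = google_dns_managed_zone.this.name
  name         = "${each.value.name}.${var.domain}."
  type         = each.value.type
  ttl          = each.value.ttl
  rrdatas      = each.value.values
}

resource "aws_route53_record" "this" {
  for_each = var.records
  zone_id  = aws_route53_zone.this.zone_id
  name     = "${each.value.name}.${var.domain}"
  type     = each.value.type
  ttl      = each.value.ttl
  records  = each.value.values
}

# Name servers from both providers, for delegation at the registrar
# (the delegation itself is not handled in this sketch).
output "name_servers" {
  value = concat(
    google_dns_managed_zone.this.name_servers,
    aws_route53_zone.this.name_servers,
  )
}
```

The `validation` blocks are one way to get the input checking listed in the goals above: a bad domain or an unsupported record type fails at plan time, before anything is changed at either provider.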
Here is an example of what the Terraform configuration looks like for the hrv.st domain. This block of code, when executed with Terraform, creates the domain on both Google Cloud and AWS and sets up the records we use, for example https://try.hrv.st.
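As a rough illustration only (this is not a copy of our real configuration), a call to such a module might look like the sketch below. It assumes a hypothetical local module that accepts a `domain` and a map of `records`; the module path, the input names, and the CNAME target are placeholders.

```hcl
module "hrv_st" {
  source = "./modules/dns-zone" # hypothetical module path

  domain = "hrv.st"

  records = {
    try = {
      name   = "try"
      type   = "CNAME"
      ttl    = 300
      values = ["landing.example.com."] # placeholder target, not the real one
    }
  }
}
```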
If we need to make a change, say adding a URL at something like free-tacos.hrv.st, all we would have to do is add the following and run Terraform again to create the DNS records.
We took our time with the process and decided to be data-driven: we moved some domains, like forecastapp.com, to the new solution and then waited a month to measure the performance and cost impact. In the end we concluded that there was no real measurable performance impact and that we'd likely reduce the cost, though not by a significant amount.
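A hedged sketch of that kind of change, assuming a hypothetical module that takes a `domain` and a map of `records`. The module path, input names, and IP address are placeholders; the address comes from the range reserved for documentation.

```hcl
module "hrv_st" {
  source = "./modules/dns-zone" # hypothetical module path

  domain = "hrv.st"

  records = {
    # ...existing records stay exactly as they were...

    "free-tacos" = {
      name   = "free-tacos"
      type   = "A"
      ttl    = 300
      values = ["192.0.2.10"] # placeholder address
    }
  }
}
```

Running `terraform plan` first shows the proposed additions at both providers without changing anything, which is the validation step we wanted. Running `terraform apply` then creates the records.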
Over the week of March 7th, 2022 we moved the last remaining domains from our old solution to the new solution, taking time to allow for an uninterrupted transition. I'm happy to report that we haven't seen any issues with the migrations.
Sum it up for me
DNS is pretty old, hard to perfect, incredibly important, and easy to get wrong. We've moved away from managing DNS servers and are embracing a fully Infrastructure as Code approach to managing our DNS domains, with reliability coming from serving the same domain from multiple providers simultaneously. The new setup is easier for us to operate, makes changes to our domains simpler, and reduces human error and risk.