Open Source North
Reliability concepts every engineer should know

So you’ve just built your very first application, and it’s running on a no-name cloud provider using a couple of free credits. You’ve got three servers. One is your database server, backed up by a cron job. Another handles the frontend, running an Nginx web server that serves your JavaScript web application. The last one runs a simple API backend that moves data between the database and the frontend.

This scenario is a pretty standard stack. It doesn’t use any fancy tooling like functions or containers, and you’re probably deploying to the servers from GitLab or GitHub. If you’ve been in the industry for a couple more years, maybe it’s running a C# ASP.NET application, maybe Java Spring. Heck, you could be running a Go or Rails application and feeding it to a fancy JavaScript framework like Angular, Vue, or React. If you went down the open source path, you’re using either MySQL or PostgreSQL. The folks who got pulled into more company-backed databases are using Oracle or Microsoft SQL Server.

Either way, it always boils down to frontend, backend, and database, with a little mixing of frontend and backend here and there. That mixing usually happens when you don’t want the overhead of a separate frontend and haven’t taken the time to learn and deal with CORS headers. The issue with all of this is that nothing in this list of goodies prepares you for the inevitable… something going down.

Whether it’s a poor release that takes out your application, a server with runaway resource utilization, or something more critical like the cloud provider’s hardware hitting an issue, an outage like this hits. And for smaller companies working on smaller margins, it hits hard.

You’ll get the unexpected call that you may or may not have known could come. You’ll most likely be ill-prepared and may need to cancel plans very quickly. If you’re older with a partner or family, this shifts burdens and creates stress for everyone. Then comes the communication: long hours in a meeting room or on a video call, with barely enough time for a bathroom break or a meal. Finally you’ll find the main problem, figure out how to get things back up, and take that breath of fresh air.

Well… let’s talk about this and how you can turn those moments of panic and stress into a simple question of “What broke? We’ll deal with it during working hours…”

It’s never a question of “Will it break?” It always breaks; all it takes is time, change, and probability. The real question is whether you are ready and know what to do when it happens.

In this presentation we’ll discuss the fundamental principles of reliability, share the tips and tricks I’ve learned from my time in the industry, and look at ways you can build reliability into both existing and newly built systems.

John Stupka

Principal Software Engineer

Microsoft