It’s easy to talk at a high level about what site reliability engineers do: They ensure that IT systems achieve availability and performance requirements.
But which skills, exactly, do SREs need to perform their jobs? That’s a more complicated question.
To answer it, let’s look at the top nine site reliability engineer skills that modern SREs (or aspiring SREs) should master. Although SRE skills vary from one team to the next depending on the types of systems managed and the reliability challenges faced, virtually all SREs need a core set of standard skills that allow them to understand and manage the complex, distributed systems they will have to support at a typical organization today.
Here’s a breakdown of top SRE skills.
Networking Expertise for SREs
The network plays a pivotal role in connecting modern, distributed environments. As such, it’s often the culprit when something goes wrong—a lesson that Facebook, for example, learned when a networking problem brought down its entire global infrastructure.
Situations like this are why SREs should master networking concepts. Even if their organization also employs networking engineers, site reliability engineers need a deep understanding of networking themselves to know when the network is the root cause of an incident and how to resolve network-caused issues effectively.
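As a rough illustration of telling network causes apart during an incident, the sketch below separates a failed connection into a name-resolution stage and a TCP-handshake stage. The function name `probe_tcp` and the shape of its result dictionary are assumptions made for this example, not part of any standard tool; it relies only on Python's standard `socket` module.

```python
import socket
import time


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> dict:
    """Attempt a TCP connection and report which stage failed, if any.

    Hypothetical helper for illustration: 'stage' is one of
    'dns', 'connect', or 'connected'.
    """
    result = {"host": host, "port": port, "stage": None, "ok": False, "detail": ""}

    # Stage 1: name resolution. A failure here points at DNS, not the service.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result.update(stage="dns", detail=str(exc))
        return result

    # Stage 2: TCP handshake against the first resolved address.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    start = time.monotonic()
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except socket.timeout:
        # No answer at all: packets are probably being dropped or filtered.
        result.update(stage="connect", detail="timed out")
        return result
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        result.update(stage="connect", detail="refused")
        return result
    except OSError as exc:
        # Other network errors, such as an unreachable network.
        result.update(stage="connect", detail=str(exc))
        return result

    elapsed_ms = (time.monotonic() - start) * 1000
    result.update(stage="connected", ok=True, detail=f"{elapsed_ms:.1f} ms")
    return result
```

Calling `probe_tcp("example.com", 443)` returns a dictionary whose `stage` field hints at where to look next: a `dns` failure suggests a resolver or record problem, a timeout suggests routing or firewall issues, and a refusal suggests the service itself is down even though the host is reachable.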
Linux and Unix
If you come from a Windows background but you want to be an SRE, there’s no getting around it: You’ll need to learn how to work with Linux and other Unix-like systems in addition to Windows.
That’s because, even at organizations that don’t rely heavily on Linux servers, you’re likely to find that Linux and Unix concepts are deeply embedded within other systems that you have to work with. Most public cloud management tools follow the conventions of Linux CLI tools, for example. So do systems like Docker and Kubernetes, even if you run them in a Windows environment.
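To make those CLI conventions concrete, here is a minimal sketch of a Unix-style filter written in Python: it takes short and long flags, reads lines from an input stream, writes matches to an output stream, and signals the outcome through its exit status. The tool name `errgrep` and the `run` function are hypothetical, chosen only for this example. The exit codes mirror the convention used by `grep` (0 when a match is found, 1 when none is), and `argparse` itself exits with status 2 on a usage error.

```python
import argparse
from typing import List, TextIO


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="errgrep",  # hypothetical tool name, for illustration only
        description="Print input lines that contain PATTERN.",
    )
    parser.add_argument("pattern", metavar="PATTERN")
    # Short and long spellings of the same option, as most Unix tools offer.
    parser.add_argument("-i", "--ignore-case", action="store_true",
                        help="match case-insensitively")
    parser.add_argument("-c", "--count", action="store_true",
                        help="print only the number of matching lines")
    return parser


def run(argv: List[str], stdin: TextIO, stdout: TextIO) -> int:
    """Filter lines from stdin to stdout and return a process exit status."""
    args = build_parser().parse_args(argv)
    needle = args.pattern.lower() if args.ignore_case else args.pattern

    matches = 0
    for line in stdin:
        haystack = line.lower() if args.ignore_case else line
        if needle in haystack:
            matches += 1
            if not args.count:
                stdout.write(line)

    if args.count:
        stdout.write(f"{matches}\n")

    # Exit status follows the grep convention: 0 = match found, 1 = no match.
    return 0 if matches else 1
```

To use it as a real command, a script would pass `sys.argv[1:]`, `sys.stdin`, and `sys.stdout` to `run` and hand the return value to `sys.exit`. Because it reads standard input and reports through its exit status, it can then sit in a shell pipeline or an `if` test like any other Unix tool.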
Cloud Computing
Like Linux and networking, cloud computing is another category of skill that modern SREs can’t live without.
The reason is almost self-explanatory: Around 90% of businesses use the cloud, and you can’t manage reliability for cloud environments very well if you don’t understand cloud architectures, cloud networking, cloud data storage, cloud observability and so on.
CI/CD Pipelines
SREs don’t typically help to develop software, but they nonetheless need a deep understanding of how software is written and delivered, which at most organizations today is a process that happens via a CI/CD pipeline.