Strategy for two-tier on-call rotation
TL;DR
- I tried to elaborate on the pros and cons of single-, dual-, and triple-site on-call rotations.
- Single-site on-call rotation is straightforward and easy to deliver
- Dual-site on-call rotation brings huge advantages such as the promise of being able to sleep and have private time, as well as talent pool expansions, if we can break barriers to communication
- Triple-site on-call rotation is not feasible because of the significant cost for communication and delivery
Motivation
I have been working as a Site Reliability Engineer (SRE) for a while now, and handling on-call duty is part of my work. However, on-call work alone does little for me personally, even though it is indispensable for my organization and its users.
The toughest parts of the on-call rotation for me are sleep deprivation and unexpected interruptions to my main work, which cause not only physical and mental problems but also career stagnation.
I believe that a sophisticated on-call rotation system may mitigate these problems and make team members happier and more productive.
So, I’d like to organize my thoughts, design some on-call rotation strategies, and keep them as options I can draw on later.
Goal
- Organize two-tier on-call rotation strategies based on prerequisites
Prerequisites
In this article, I assume a team with the following characteristics:
- The team develops infrastructure for production applications developed by other teams.
- The team has 5–8 members.
- The team has to handle alerts from production infrastructure by being on call 24/7.
- The company to which the team belongs has no experience with multisite on-call rotations.
- The members of the team always try to eliminate toil whenever they are available.
Fixed strategy
There are multiple strategies for handling alerts 24/7. However, “bringing a pager” and “building a two-tier on-call rotation system” are fundamental, so I treat them as fixed in this article.
Bringing a pager
Although several communication tools exist, such as telephones, emails, and chat apps, we should rely on synchronous notifications for system emergencies so that we notice issues at any time.
When it comes to synchronous notifications, a pager is one of the most reliable ways to receive alerts. In practice, the pager reaches us through phone calls or push notifications on our smartphones.
Two-tier on-call rotation
A two-tier on-call rotation system means that a team has a primary on-caller and a secondary on-caller to handle alerts from the infrastructure.
A primary on-caller is the first person to acknowledge alerts from their infrastructure. The primary on-caller attempts to handle all pages.
A secondary on-caller is the person who supports the primary on-caller when the primary cannot handle an alert, for example because of a lack of knowledge or a missed page (people sometimes do not notice pages at midnight).
All other members act as backups and step in when the primary and secondary on-callers are too busy to handle everything themselves or fail to notice alerts.
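To make the escalation order concrete, here is a minimal Python sketch of the two-tier policy described above. The class name `EscalationPolicy`, the member names, and the 10-minute acknowledgment timeout are assumptions for illustration only; real pager services have their own configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    """Hypothetical two-tier policy: primary, then secondary, then backups."""

    primary: str
    secondary: str
    backups: tuple[str, ...]
    ack_timeout_min: int = 10  # assumed value, not taken from the article

    def tiers(self) -> list[list[str]]:
        """Return who is paged at each escalation step, in order."""
        return [[self.primary], [self.secondary], list(self.backups)]

    def paged_after(self, minutes_unacknowledged: int) -> list[str]:
        """Return everyone paged so far for an alert nobody has acknowledged."""
        if minutes_unacknowledged < 0:
            raise ValueError("minutes_unacknowledged must not be negative")
        tiers = self.tiers()
        # Step 0 fires immediately; each later step fires after another timeout.
        step = min(minutes_unacknowledged // self.ack_timeout_min, len(tiers) - 1)
        return [person for tier in tiers[: step + 1] for person in tier]


policy = EscalationPolicy("alice", "bob", ("carol", "dave", "erin"))
print(policy.paged_after(0))
print(policy.paged_after(10))
print(policy.paged_after(25))
```

With this policy, an unacknowledged page reaches only the primary at first, the secondary after 10 minutes, and the backups after 20 minutes.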
Strategy for two-tier on-call rotation
If we choose to bring a pager and use a two-tier on-call rotation as the base strategy, we can then design how people are arranged. In many cases, we only think about a single-site team and design the on-call rotation around it.
(Wait! Don’t we lose opportunities for better on-call rotation strategies? What about building teams with multiple locations?)
In this article, I explore the possibility of a multi-location team and elaborate on its pros and cons.
Single-site team
The first pattern is gathering all the team members at a single site. This is the only type of on-call rotation I have experienced, and it is simple and easy for everyone to understand.
Pros of single-site on-call rotation
It is simple and easy to deliver. A simple structure makes related things simple too—for example, human logistics, pager system settings, handing over on-call rotations, and support for other members’ on-calls.
If a team has five members, each member serves as the primary on-caller for one week roughly every month (every five weeks). If a team has eight members, each one becomes the primary on-caller about once every 2 months.
In my experience, being the primary on-caller once every 2 months does not seriously harm the quality of life (QoL) of on-callers and is acceptable. Once a month, however, is too much; at that frequency, QoL inevitably suffers.
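The arithmetic behind these frequencies can be sketched in a few lines of Python. This is a rough model, assuming one-week shifts and that every member cycles through both the primary and secondary tiers once per rotation; the function names are made up for this example.

```python
def weeks_between_primary_shifts(team_size: int, shift_weeks: int = 1) -> int:
    """Weeks from the start of one primary shift to the start of the next."""
    if team_size < 1 or shift_weeks < 1:
        raise ValueError("team_size and shift_weeks must be positive")
    return team_size * shift_weeks


def on_call_share(team_size: int, tiers: int = 2) -> float:
    """Fraction of weeks a member carries a pager in any tier.

    Assumes every member takes each tier once per rotation,
    so it only makes sense when team_size >= tiers.
    """
    if team_size < tiers:
        raise ValueError("need at least one member per tier")
    return tiers / team_size


for size in (5, 8):
    print(size, weeks_between_primary_shifts(size), f"{on_call_share(size):.0%}")
```

Under these assumptions, a five-member team gives each person a primary shift every 5 weeks and a pager (primary or secondary) 40% of the time. An eight-member team gives a primary shift every 8 weeks and a pager 25% of the time.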
Cons of single-site on-call rotation
On-callers have to be ready for alerts outside of office hours, even at midnight. Waking up at midnight is always difficult, and it costs on-callers time for their main tasks, concentration, free time for private matters, and both physical and mental health.
As a result, frequent alert handling causes career stagnation and burnout among on-callers. I think this is a serious issue for the software industry, especially for small- and mid-size companies that do not have enough resources to cover on-call duty.
In addition, if we hire only in a single location and want only talented engineers, the pool of skilled, highly qualified candidates is limited by location.
To solve the problems of sleep deprivation and talent pool, we can design a dual-site team pattern.
How to overcome the cons of single-site on-call rotation
Adding members to the team increases the total time available to the team and reduces how often each person is paged.
This approach may increase the communication cost and requires a large recruiting effort. However, the worst situation is one in which all members must work extremely hard to handle pages and their main tasks simultaneously because the team is understaffed.
In that case, members cannot afford to eliminate the root causes of alerts through automation. If members have enough slack in their daily work, they will try to eliminate the root causes of midnight alerts because they understand how tough sleep deprivation can be.
Once a team has more than eight members, adding more becomes difficult because of the communication cost.
However, splitting teams and roles is part of growth. If we still need more total time at that point, it is time to split the team along appropriate boundaries.
Dual-site team
As I mentioned above, if we have eight members in a team as a single-site team, in theory, people would become primary on-callers once every 2 months.
However, the reality is different. Members may be on duty more often than the schedule suggests, because pages also have to be caught at midnight. When the primary and secondary on-callers are in deep sleep (non-REM sleep), they sometimes miss pages, and backup members have to receive the alerts at midnight. This is inevitable but very stressful for members.
To overcome this situation, we can build a dual-site on-call team. Let us say we can hire people in the United Kingdom (UK) and Japan (JP); in that case, there is a 9-h time difference between the locations (8 h while the UK observes summer time).
The following images illustrate what I expect a dual-site team to look like. Members of the same team live in both the UK and JP and rotate on-call roles every 12 h.
As for when a dual-site team becomes possible, I assume either that the company is large enough to have branches all over the world or that it has succeeded in a single country and is expanding into other regions. The latter situation matches the prerequisites of this article, and it is the one I am interested in.
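The 12-h handoff between the UK and JP described above can be sketched as follows. This is only an illustration: the handoff hours (the UK holds the pager from 08:00 to 20:00 UTC) are an assumption, and the fixed UTC offsets ignore British Summer Time to keep the example dependency-free.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets keep the sketch dependency-free; they ignore British Summer Time.
UK = timezone(timedelta(hours=0), "GMT")
JP = timezone(timedelta(hours=9), "JST")

# Assumed handoff points (UTC): the UK holds the pager 08:00-20:00, JP the rest.
UK_START_UTC = 8
UK_END_UTC = 20


def site_on_call(moment: datetime) -> str:
    """Return the site holding the pager at a timezone-aware moment."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour_utc = moment.astimezone(timezone.utc).hour
    return "UK" if UK_START_UTC <= hour_utc < UK_END_UTC else "JP"


def local_hour(utc_hour: int, tz: timezone) -> int:
    """Convert an hour of the day in UTC to the local hour at a site."""
    reference = datetime(2024, 1, 15, utc_hour, tzinfo=timezone.utc)
    return reference.astimezone(tz).hour


# Local shift windows for each site under the assumed handoff hours.
print("UK shift:", local_hour(UK_START_UTC, UK), "->", local_hour(UK_END_UTC, UK))
print("JP shift:", local_hour(UK_END_UTC, JP), "->", local_hour(UK_START_UTC, JP))
```

With these handoff hours, UK members are on call from 08:00 to 20:00 local time and JP members from 05:00 to 17:00 local time, so neither site has to hold the pager around midnight. Other handoff hours are possible; the point is only that a 12-h split can keep both shifts inside waking hours.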
Pros of dual-site on-call rotation
The biggest advantage of a dual-site on-call rotation is the promise of sleep and private time. As I described at the beginning, the toughest part of on-call rotation is sleep deprivation caused by unavoidable alert handling at midnight.
Sleep deprivation in turn robs members of concentration and time for their main tasks, which leads to career stagnation. If we adopt the dual-site on-call team model, these problems would be solved in theory.
In addition, on national holidays, members can rest without interruptions and without the tension of carrying their pagers and laptops. For instance, if there are national holidays in the UK, it might be a good idea to hand over the roles of primary and secondary on-callers to members in JP so that members in the UK can relax during the holidays.
Moreover, an expanded talent pool is one of the attractive advantages of the dual-site team. A team gains two entry points and can recruit skillful engineers in different regions. If we set a high bar for applicants, this is a significant benefit.
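The holiday handover mentioned above could be expressed as a simple rule. The following is a hedged sketch: the `HOLIDAYS` table contains only two sample dates, the function name is hypothetical, and dates are treated as local calendar days at each site.

```python
from datetime import date

# Sample holidays only; a real calendar would come from an authoritative source.
HOLIDAYS = {
    "UK": {date(2024, 12, 25)},  # Christmas Day
    "JP": {date(2024, 5, 3)},    # Constitution Memorial Day
}


def sites_on_rotation(day: date, holidays: dict[str, set[date]]) -> list[str]:
    """Return the sites that share the pager on a given calendar day."""
    working = [site for site, days in holidays.items() if day not in days]
    # If every site is on holiday, fall back to the normal rotation.
    return working or list(holidays)


print(sites_on_rotation(date(2024, 12, 25), HOLIDAYS))
print(sites_on_rotation(date(2024, 6, 10), HOLIDAYS))
```

Note that when one site covers for the other, it holds the pager for the full day, including its own night. The handover trades one site's rest for the other's, so it is worth balancing these days between the sites over the year.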
Cons of dual-site on-call rotation
Of course, there are many cons of dual-site on-call rotation.
First, the feasibility of a dual-site team is a problem. If we belong to small- or mid-size companies, this formation is hardly feasible because it is difficult to convince business owners to recruit engineers outside of the main offices just for on-call rotations.
Second, the communication cost becomes extraordinarily expensive. If members work in the same place, they can easily communicate with each other, and, if problems arise while handling alerts, they can support each other interactively.
If members live in two regions with a 9-h time difference, synchronous communication is limited. It is still possible, but an increase in communication cost is inevitable. Even with asynchronous communication, members have to make a deliberate effort to communicate effectively and stay productive.
We should not underestimate the risk of high communication costs.
How to overcome the cons of dual-site on-call rotation
The first problem is almost unsolvable. We have to accept that individuals cannot control whether this opportunity exists.
For the second problem, we have to invest time in effective communication between remote members. According to “WORKING AT GOOGLE: Working together when we’re not together” and “GitLab Culture: All Remote”, the following efforts are necessary for a successful remote team.
- Get to know each other as people
  - Visiting head offices, facilitating informal communication via video calls (making social connections with coworkers)
- Asynchronous communication and documenting everything
  - GitHub issues, Google Docs, slide decks
- Supportive technology
  - Zoom, Slack, high-speed Internet, and more…
Triple-site team
If we can hire members in three regions with different time zones, a triple-site team is possible. For example, members could live in the United Kingdom, the United States, and Japan.
Pros of triple-site on-call rotation
Members can cover on-call duty 24/7 within business hours (Wow, this has huge benefits!). Members can be free from pagers and laptops outside of office hours. As with the dual-site team, the talent pool expands, and it becomes even larger than with two sites.
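Whether three sites really cover the whole day within business hours depends on the time zones and on the length of the working day. The sketch below checks this under stated assumptions: fixed standard-time UTC offsets, US Pacific time for the United States (the article does not say which US time zone), and hypothetical names throughout.

```python
# Assumed fixed UTC offsets (standard time); the US time zone is a guess.
SITE_UTC_OFFSETS = {"UK": 0, "JP": 9, "US_PACIFIC": -8}


def covered_utc_hours(utc_offset: int, start_local: int, end_local: int) -> set[int]:
    """UTC hours covered by one site's local business hours [start, end)."""
    return {(local - utc_offset) % 24 for local in range(start_local, end_local)}


def uncovered_utc_hours(
    offsets: dict[str, int], start_local: int = 9, end_local: int = 17
) -> list[int]:
    """UTC hours during which no site is inside its business hours."""
    covered: set[int] = set()
    for utc_offset in offsets.values():
        covered |= covered_utc_hours(utc_offset, start_local, end_local)
    return sorted(set(range(24)) - covered)


print(uncovered_utc_hours(SITE_UTC_OFFSETS))                # 09:00-17:00 days
print(uncovered_utc_hours(SITE_UTC_OFFSETS, end_local=18))  # 09:00-18:00 days
```

With these offsets, 09:00 to 17:00 working days leave one uncovered hour (08:00 UTC), while 09:00 to 18:00 working days cover the full 24 hours. The sites are not spaced exactly 8 hours apart, so a little slack in the shift length is needed.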
Cons of triple-site on-call rotation
Synchronous communication is impossible. One of the reasons for building a team at multiple sites is to avoid working at midnight. If members in three regions try to communicate with video calls, there must be members who have to wake up at midnight, which contradicts the primary purpose.
I guess many companies cope with this disadvantage. However, the delivery cost of a triple-site on-call rotation is high for companies that have never run multisite on-call teams.
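The limits on synchronous communication can be checked with a small calculation. The sketch below assumes fixed standard-time UTC offsets, US Pacific time for the United States, and a waking window of 08:00 to 22:00 local time; all of these, and the function names, are assumptions for illustration.

```python
# Assumed fixed UTC offsets (standard time); the US time zone is a guess.
SITE_UTC_OFFSETS = {"UK": 0, "JP": 9, "US_PACIFIC": -8}


def utc_hours_awake(utc_offset: int, wake: int = 8, sleep: int = 22) -> set[int]:
    """UTC hours that fall inside a site's local waking window [wake, sleep)."""
    return {(local - utc_offset) % 24 for local in range(wake, sleep)}


def shared_utc_hours(sites: list[str]) -> list[int]:
    """UTC hours at which every listed site is inside its waking window."""
    shared = set(range(24))
    for site in sites:
        shared &= utc_hours_awake(SITE_UTC_OFFSETS[site])
    return sorted(shared)


print(shared_utc_hours(["UK", "JP"]))                # dual-site
print(shared_utc_hours(["UK", "JP", "US_PACIFIC"]))  # triple-site
```

Under these assumptions, the UK and JP share five hours (08:00 to 13:00 UTC, which is the UK morning and the JP evening), while the three sites share none. Widening the waking window changes the result, so the numbers are illustrative rather than definitive.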
Conclusion
To improve any processes, we should question common assumptions. In this article, I tried to elaborate on the pros and cons of single-, dual-, and triple-site on-call rotations.
For me, single-site on-call rotation is straightforward and easy to deliver. A dual-site on-call rotation also brings huge advantages such as the promise of being able to sleep and have private time, as well as talent pool expansions, if we can break barriers to communication. Triple-site on-call rotation is not feasible because of the significant cost for communication and delivery.
In the next article, I would like to extend the patterns described in this article to build strategies for eliminating context switches caused by reactive work.