An Elegant Puzzle: Systems of Engineering Management [Chapter 2]
Motivation
This is a reading record of “An Elegant Puzzle: Systems of Engineering Management”, Chapter 2.
My main purpose in reading this book is to learn how to lead teams in rapidly growing organizations.
It seems that the author experienced this kind of significant organizational growth at Uber and Stripe (organization size changed from 200 to 2000), and I think the insights and experience accumulated in that chaotic environment will be valuable for teams facing the same kind of growth and the problems that come with it.
I’m looking forward to acquiring that experience and knowledge through this book, and I will try to leave a reading record for each chapter.
Chapter 2: Organizations
Summary
Organizational design lives between process design (for problems that should be solved quickly and cheaply) and culture evolution (for problems that should be solved permanently, when the organization has enough time to go slow).
This chapter covers the approaches to organizational design and evolution that the author has found effective.
Items
- 2.1: Sizing teams
- 2.2: Staying on the path to high-performing teams
- 2.2.1: Four states of a team
- 2.2.2: System fixes and tactical support
- 2.2.3: Consolidate your efforts
- 2.2.4: Durable excellence
- 2.3: A case against top-down global optimization
- 2.3.1: Team first
- 2.3.2: Fixed costs
- 2.3.3: Slack
- 2.3.4: Shift scope; rotate
- 2.4: Productivity in the age of hypergrowth
- 2.4.1: More engineers, more problems
- 2.4.2: Systems survive one magnitude of growth
- 2.4.3: Ways to manage entropy
- 2.4.4: Closing thoughts
- 2.5: Where to stash your organizational risk?
- 2.6: Succession planning
- 2.6.1: What do you do?
- 2.6.2: Close the gaps
2.1: Sizing teams
- The guiding principles for sizing teams are:
- Managers should support six to eight engineers
- Managers-of-managers should support four to six managers
- On-call rotations want eight engineers
- It is sometimes necessary to pool multiple teams together to reach the eight engineers necessary for a 24/7 on-call rotation.
- Keep innovation and maintenance together
- Small teams (fewer than four members) are not teams
- Teams should be six to eight during steady state
- To create a new team, grow an existing team to eight to ten, and then bud into two teams of four or five
- Never create empty teams
- Never leave managers supporting more than eight individuals
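As a rough illustration, the sizing guidelines above can be written as a small check. This is only a sketch: the `Team` class, the `sizing_warnings` function, and the warning wording are hypothetical names made up for this note, and the thresholds are simply the numbers listed in this section.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Team:
    # Hypothetical model: a team is just a name and an engineer count.
    name: str
    engineers: int


def sizing_warnings(team: Team) -> list[str]:
    """Return the sizing guidelines from this section that the team violates."""
    warnings = []
    if team.engineers == 0:
        warnings.append("empty team: never create empty teams")
    elif team.engineers < 4:
        warnings.append("fewer than four members: not really a team")
    if team.engineers > 8:
        warnings.append("manager supports more than eight individuals")
    if team.engineers < 8:
        warnings.append("below eight engineers: pool with another team for on-call")
    return warnings


for team in (Team("search", 8), Team("billing", 3), Team("infra", 10)):
    print(team.name, sizing_warnings(team))
```

A team of exactly eight passes every check, which matches the upper end of the steady-state size in the list above. A team of ten is flagged, which is expected while it is growing toward a split into two teams.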
2.2 Staying on the path to high-performing teams
2.2.1 Four states of a team & 2.2.2 System fixes and tactical support
- A team is in one of four states, which show how productive and innovative it is
- Teams want to climb from falling behind to innovating, while entropy drags them backward. Each state requires a different tactic
- A team is falling behind
- Each week the backlog is longer than it was the week before. Typically, people are working extremely hard but not making much progress, morale is low, and your users are vocally dissatisfied
- Effective tactic: Hire more people until the team moves into treading water
- A team is treading water
- People are able to get their critical work done, but are not able to start paying down technical debt or begin major new projects. They are still working hard, and your users may seem happier because they’ve learned that asking for help won’t go anywhere
- Effective tactic: Consolidate the team’s efforts to finish more things, and reduce concurrent work until they’re able to begin repaying debt
- A team is repaying debt
- The team is able to start paying down technical debt, and is beginning to benefit from the debt repayment snowball: each piece of debt you repay leads to more time to repay more debt
- Effective tactic: Add time
- A team is innovating
- Technical debt is sustainably low, morale is high, and the majority of work is satisfying new user needs
- Effective tactic: Maintain enough slack in your team’s schedule that the team can build quality into their work, operate continuously in innovation, and avoid backtracking
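The four states and their tactics can be read as a lookup table. The sketch below is one way to write that down, under the assumption that a team’s state can be reduced to three yes/no signals. The signal names and the `classify` function are hypothetical simplifications made for this note, not something the book defines.

```python
from enum import Enum


class TeamState(Enum):
    FALLING_BEHIND = "falling behind"
    TREADING_WATER = "treading water"
    REPAYING_DEBT = "repaying debt"
    INNOVATING = "innovating"


# Tactic for each state, taken from the list above.
TACTIC = {
    TeamState.FALLING_BEHIND: "hire more people",
    TeamState.TREADING_WATER: "consolidate efforts and reduce concurrent work",
    TeamState.REPAYING_DEBT: "add time",
    TeamState.INNOVATING: "maintain enough slack",
}


def classify(backlog_growing: bool, can_repay_debt: bool, debt_is_low: bool) -> TeamState:
    """Map three simplified signals (an assumption of this sketch) to a state."""
    if backlog_growing:
        return TeamState.FALLING_BEHIND
    if not can_repay_debt:
        return TeamState.TREADING_WATER
    if not debt_is_low:
        return TeamState.REPAYING_DEBT
    return TeamState.INNOVATING


state = classify(backlog_growing=False, can_repay_debt=True, debt_is_low=False)
print(state.value, "->", TACTIC[state])
```

The order of the checks mirrors the climb from falling behind to innovating: a team only counts as being in a later state once the earlier problems are gone.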
2.2.3 Consolidate your efforts
- Many folks try to move all teams at the same time, peanut buttering their limited resources. Resist that indecision framed as fairness
- If most teams are falling behind, then hire onto one team until it’s staffed enough to tread water, and only then move to the next. While this is true for all constraints, it’s particularly important for hiring
- Adding new individuals to a team disrupts that team’s gelling process
- It is much easier to have rapid growth periods for any given team, followed by consolidation periods during which the team gels
2.2.4 Durable excellence
- Nurturing great organizations is slow but it consistently leads to enduring, real improvement in the happiness and throughput of an organization
- The improvements stick around long enough to compound, creating a durable excellence
2.3 A case against top-down global optimization
2.3.1 Team first
- Disassembling a high-performing team leads to a significant loss of productivity, even if the members are fully retained
- Shifting individuals across teams can reset the clock on gelling, especially for teams in the early stages of gelling, and when there are significant differences in team culture
- This does not mean teams should never change, but you have to account for re-gelling costs after periods of change
2.3.2 Fixed costs
- Most teams have high fixed costs and relatively small variable costs
- Moving one person can shift an innovating team back into falling behind
- The author’s rule of thumb is that it takes eight engineers on a team to support a two-tier on-call rotation, so the author is reluctant to move anyone off a team whose membership is below that line
2.3.3 Slack
- Teams put spare capacity to great use by improving areas within their aegis, in both incremental and novel ways
- Teams tend to do these improvements with minimal coordination costs
- Slackful teams function as an organizational debugger
2.3.4 Shift scope; rotate
- If a team has significant slack, then incrementally move responsibility to them, at which point they’ll start locally optimizing their expanded workload
- If it’s a choice of moving people rapidly or shifting scope rapidly, the latter is more effective and less disruptive
- Avoid re-gelling costs and preserve system behavior
- Rotating individuals for a fixed period into an area that needs help also works well
- This is also a safe way to measure how much slack the team really has
2.4 Productivity in the age of hypergrowth
2.4.1 More engineers, more problems
- How hard it is to productively integrate large numbers of engineers depends on how quickly you can ramp them up to self-sufficient productivity
- You can quickly find a scenario in which untrained engineers increasingly outnumber the trained engineers, and each trained engineer is devoting much of their time to training a couple of newer engineers
- For every additional order of magnitude of engineers, you need to design and maintain a new layer of management
- For every ~10 engineers, you need an additional team, which requires more coordination (ref: The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition)
- Each engineer means more commits and deployments per day, creating load on your development tools
- Most outages are caused by deployments, so more deployments drive more outages, which in turn require incident management, mitigations, and postmortems
- Having more engineers leads to more specialized teams and systems
- This requires increasingly small on-call rotations so that your on-call engineers have enough system context to debug and resolve production issues
- Consequently, the relative time invested in on-call goes up
2.4.2 Systems survive one magnitude of growth
- If your company is designing systems to last one order of magnitude and is doubling every six months, then you’ll have to re-implement every system twice every three years
- This creates a great deal of risk—almost every platform team is working on a critical scaling project—and can also create a great deal of resource contention to finish these concurrent rewrites
- The real productivity killer is not system rewrites but the migrations that follow those rewrites
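The “twice every three years” figure above can be checked with a little arithmetic. The sketch below assumes pure exponential growth (size doubles at a fixed interval) and that a system is outgrown exactly when the organization becomes ten times larger. The function names are made up for this note.

```python
import math


def months_per_order_of_magnitude(doubling_months: float) -> float:
    """Months needed to grow 10x when size doubles every `doubling_months`."""
    # Growing 10x takes log2(10) doublings, which is about 3.32 of them.
    return doubling_months * math.log2(10)


def rewrites_needed(horizon_months: float, doubling_months: float) -> float:
    """How many times a system built for one order of magnitude is outgrown."""
    return horizon_months / months_per_order_of_magnitude(doubling_months)


print(round(months_per_order_of_magnitude(6), 1))  # about 19.9 months
print(round(rewrites_needed(36, 6), 1))            # about 1.8, i.e. roughly twice
```

With a six-month doubling time, each system lasts a little under 20 months, so over three years it is outgrown about 1.8 times, which rounds to the two re-implementations mentioned above.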
2.4.3 Ways to manage entropy
- Managers only get value from projects when they finish: to make progress, above all else, you must ensure that some of your projects finish
- Let’s tackle hiring first, as hiring and training are often a team’s biggest time investment
- Larger companies make major investments in both new-hire bootcamps and recurring education classes
- The second most effective time thief is ad hoc interruptions
- For example: getting pinged on Slack, taps on the shoulder, alerts from your on-call system, high-volume email lists, and so on
- Effective tactic 1: Ask people to file tickets, create chatbots that automate filing tickets, create a service cookbook, and so on
- Effective tactic 2: Create a rotation for people who are available to answer questions, and train your team not to answer other forms of interruptions
- This is remarkably uncomfortable because we want to be helpful humans, but it becomes necessary as the number of interruptions climbs higher
- Extremely helpful here is an ownership registry, which allows you to look up who owns what, eliminating the frequent “Who owns X?” variety of question
- The best tool against ad hoc meetings is to block out a few large chunks of time each week to focus (for example, add focus time on your Google Calendar)
- Experiment a bit and find something that works well for you
- The one thing that the author has found at companies with very few interruptions, and has observed almost nowhere else, is really great, consistently available documentation
- The best solution to frequent interruptions I’ve seen is a culture of documentation, documentation reading, and a documentation search that actually works
- An antipattern to avoid is the gatekeeper pattern
- Having humans who perform gatekeeping activities creates very odd social dynamics, and is rarely a great use of a human’s time
- Exceptions: legal and financial topics
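The ownership registry mentioned earlier in this section does not need to be sophisticated to be useful. The following is a minimal sketch, assuming the registry is just a mapping from system name to owning team. Every system and team name here is invented for illustration, and the fallback message is only one possible policy.

```python
# Hypothetical registry: system name -> owning team. All entries are made up.
OWNERSHIP = {
    "payments-api": "payments-team",
    "deploy-pipeline": "developer-tools-team",
    "search-index": "search-team",
}


def who_owns(system: str) -> str:
    """Look up the owning team, normalizing case and surrounding whitespace."""
    key = system.strip().lower()
    # Fall back to the question rotation described above when nobody is listed.
    return OWNERSHIP.get(key, "unknown owner: ask the question rotation")


print(who_owns("Deploy-Pipeline"))
print(who_owns("legacy-cron"))
```

Even a flat table like this answers the “Who owns X?” question without interrupting a person, and a chatbot or ticket form could call the same lookup.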
2.4.4 Closing thoughts
- When you’re already underwater with your existing work and maintenance, the most valuable skill in the situation is learning to push back tasks
- Say “No!” in a way that is appropriate to your company’s culture
2.5 Where to stash your organizational risk?
- The most successful approach is to identify a few areas to improve, ensure you’re making progress on those, and give yourself permission to do the rest poorly
- Work with your manager to write this up as an explicit plan and agree on what reasonable progress looks like
- Ensure any given area is well on the path to health before moving your focus
2.6 Succession planning
- Succession planning is thinking through how the organization would function without you
2.6.1 What do you do?
- The first step in succession planning is to figure out what you do
- Take a look at your calendar and write down your role in meetings
- Take a second pass on your calendar for non-meeting stuff, like interviewing and closing candidates
- Look back over the past six months for recurring processes
- For example, roadmap planning, performance calibrations, or headcount decisions; document your role in each
- Audit inbound chats and emails for requests and questions coming your way
- Look at the categories of work on your to-do list that you’ve completed over the past six months
2.6.2 Close the gaps
- Filter the gaps down to two lists
- The easiest gaps to close
- The riskiest gaps
- These are the areas where you’re uniquely valuable to the company, where other folks are missing skills, and where getting the tasks done is truly important
- Write up a plan to close all of the easy gaps and one or two of the riskiest gaps
Comment
This chapter is insightful not only for organization-level managers such as a COO, CTO, or VPoE, but also for managers of small teams who need a team-building strategy.
Last year, I read Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations.
That book analyzes low- and high-performing organizations with a wide variety of numerical data and both quantitative and qualitative analysis, and provides key metrics for improving organizational performance. The approach is very scientific and includes evidence to convince readers.
On the other hand, An Elegant Puzzle: Systems of Engineering Management does not present that kind of evidence, but instead shares human-centric tactics for rapidly growing organizations.
The book contains many insights and tips based on the author’s experiences, and they seem trustworthy and intuitively correct. I like this book and will apply its approaches to my work.
However, they are also very opinionated. So I should apply them only after reading other engineering management books such as Accelerate and INSPIRED, and after understanding existing engineering management strategies and solutions.
Adopting another company’s process wholesale has pros and cons, and an appropriate balance is always essential.
Update
In Chapter 3, I found that the author refers to Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations as an example of systems thinking and analyzes the metrics introduced in that book. Very interesting…
Related materials
- Irrational Exuberance! Running an engineering reorg
- Irrational Exuberance! Writing strategies and visions
- Irrational Exuberance! Guiding broad change with metrics
- Increment: On-Call
- Irrational Exuberance! Staying on the path to high performing teams
- 2006-11-18 Yahoo Peanut Butter Manifesto
- Managing Hypergrowth
- Bill Gurley predicts ‘dead unicorns’ in startup-land this year
- Irrational Exuberance! Tools for operating a growing organization
- Irrational Exuberance! Identify your controls
- Irrational Exuberance! Productivity in the age of hypergrowth