What I Learned From On-Call at Scale

I used to think engineering growth came mostly from shipping. Shipping is still the fun part, but on-call changed me in a different way. When you are responsible for services that sit on the path of roughly 350 million requests per hour, production stops feeling abstract very quickly. A Sev2 or Sev3 incident is not just a ticket or a chat thread. It is a very real moment where customers, partner teams, and your own judgment are all under pressure at the same time.

I’ve spent more than a year on the Mail Services on-call roster, and it has been the fastest education in operational ownership I could have asked for.

The emotional weight is real

People sometimes romanticize on-call as a proving ground for heroes. For me, the hard part was the quiet awareness that the system mattered all the time. At this scale, you do not get to pretend impact is theoretical. Over time, I learned that the goal is not to eliminate pressure, but to build habits strong enough that pressure does not take over the room.

Incident one: the TnefContentWriter regression

One incident that really stayed with me involved a TnefContentWriter regression. The customer-visible symptom was that calendar invites were missing GlobalObjectId, which is the kind of detail that sounds obscure until you realize how many downstream behaviors depend on that metadata staying intact.

The first few minutes were a good reminder of how incidents punish vague thinking. It is easy to ask a sloppy question like “why are invites broken?” That question is too big to help. The better questions were narrower:

Is the issue deterministic or intermittent?
Did the regression affect invite creation, transformation, or downstream rendering?
Was the GlobalObjectId absent at source, stripped during TNEF generation, or lost in some later interpretation step?
What changed recently in the relevant path?
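
One way to make the third question concrete is to check for the field at each stage boundary and report the first place it disappears. The sketch below is only an illustration of that narrowing step, not the actual Mail Services pipeline or tooling: the stage names, the dictionary snapshots, and the helper first_stage_missing are all hypothetical stand-ins for whatever captured payloads or telemetry are available.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Hypothetical snapshot of one message's properties as seen after a stage.
Snapshot = Mapping[str, object]


def first_stage_missing(
    stages: Sequence[Tuple[str, Snapshot]], field: str
) -> Optional[str]:
    """Return the name of the first stage whose snapshot lacks `field`.

    Returns None if the field is present after every stage.
    """
    for name, snapshot in stages:
        if snapshot.get(field) is None:
            return name
    return None


# Illustrative data only: stage names and values are made up.
stages = [
    ("source", {"Subject": "Sync", "GlobalObjectId": "abc123"}),
    ("tnef_generation", {"Subject": "Sync"}),
    ("rendering", {"Subject": "Sync"}),
]

print(first_stage_missing(stages, "GlobalObjectId"))
```

With this made-up data the field is present at the source and gone after the second stage, which is the kind of answer that turns "why are invites broken?" into a question about one specific component.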

The tools mattered here. I leaned on ICM, OSP, Geneva, Jarvis, Torus, plus our TSGs and battle cards. Good incident tooling does not solve the problem for you, but it prevents procedural chaos.

We were able to roll back within hours, which was the correct move. That rollback mattered more than anyone’s ego. Later, I helped drive the RCA because I wanted the lesson to stick in writing, not just in memory. That incident reinforced something I believe strongly now: fast rollback paths are not optional engineering polish. They are part of the design.

Incident two: the attachment severity misclassification

The other incident taught me a different lesson. It started with an attachment issue that was initially misclassified as “not us.” That judgment did not survive contact with the right evidence. An SME check reversed the assumption, and suddenly the incident looked very different.

What stuck with me was not just the correction. It was the danger of premature narrative. In fast-moving incident rooms, people naturally want a crisp explanation early. “Not us” feels efficient because it narrows scope. But if you have not earned that conclusion, it can cost you time you do not actually have.

The biggest lesson I took from that incident was about buying time responsibly. If you are uncertain, say you are uncertain. Protect the SLA. Keep the investigation moving. Do not spend your credibility pretending the ambiguity is smaller than it is.

That sounds obvious in hindsight, but under pressure it is harder than it looks. Confidence is socially contagious. So is bad framing.

Investigation-first became my default style

Across incidents, I developed a style that works better for me than trying to look fast or decisive too early. I think of it as investigation-first with SLA discipline.

Build the causal frame before touching too much

My worst instinct early in my career was to react by doing everything at once: more logs, more tabs, more pings, more hypotheses. Sometimes that helps. Often it just creates noise. Now I try to recover a clean causal chain first.

What changed?
What stayed stable?
What is directly observed?
What is inferred?
What action is reversible right now?

That small structure has saved me from a lot of thrashing.
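
The five questions above can be kept as a small structured note instead of a mental checklist. What follows is a minimal sketch, assuming plain Python and nothing about any real incident tool: the CausalFrame class and its field names are hypothetical, and the example entries are illustrative rather than a record of an actual incident.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CausalFrame:
    """Working notes for an incident, kept in five separate buckets."""

    changed: List[str] = field(default_factory=list)
    stable: List[str] = field(default_factory=list)
    observed: List[str] = field(default_factory=list)
    inferred: List[str] = field(default_factory=list)
    reversible_actions: List[str] = field(default_factory=list)

    def summary(self) -> str:
        """Render the frame as text, labeling inferences as unconfirmed."""
        sections = [
            ("Changed", self.changed),
            ("Stable", self.stable),
            ("Observed", self.observed),
            ("Inferred (unconfirmed)", self.inferred),
            ("Reversible now", self.reversible_actions),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            # An empty bucket is shown explicitly so gaps stay visible.
            shown = items if items else ["(nothing recorded)"]
            lines.extend(f"  - {item}" for item in shown)
        return "\n".join(lines)


# Illustrative entries only.
frame = CausalFrame(
    changed=["recent change in the TNEF writing path"],
    observed=["calendar invites missing GlobalObjectId"],
    inferred=["field is dropped during TNEF generation"],
    reversible_actions=["roll back the recent change"],
)
print(frame.summary())
```

Keeping "observed" and "inferred" as separate fields is the point of the design: a guess can only become a fact by being moved deliberately from one bucket to the other.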

Separate evidence from story

Telemetry is evidence. A theory is a story. Good incident response requires both, but in the right order. I try to narrate clearly without promoting guesses into facts.

Use the tools, but do not hide behind them

ICM, Geneva, Jarvis, Torus, TSGs, and battle cards all matter. But tools do not replace ownership. They help you move; they do not move for you.

Document while the context is warm

Post-incident documentation used to feel like a chore. Now I treat it as part of the incident itself. The RCA, the timeline, and the “why was this severity chosen?” notes are how you turn one painful event into future reliability.

What on-call changed in my engineering habits

The obvious thing on-call improved was my production judgment. The less obvious thing is that it changed how I write code when everything is calm. I think more about observability, rollback design, and whether an error surface gives someone enough signal at 3 a.m. to act intelligently. It also made me more respectful of severity labels and much more serious about ownership: staying with the problem until the system is healthy again.

The part I did not expect

What surprised me most is that on-call made me calmer, not more anxious. Repeated exposure taught me that panic adds no information. A good engineer in an incident room is usually the person who keeps the room honest, the timeline clear, and the next action reversible.

Looking back, the TnefContentWriter regression, the attachment severity reversal, and the many smaller Sev2/Sev3 moments around them taught me more than a string of clean launches ever could. They made me more deliberate, more evidence-driven, and more serious about designing systems that fail safely. I still enjoy building far more than firefighting, but on-call gave me a much deeper kind of confidence: not that things will always go right, but that when they do go wrong, I know how to stay useful.