Build a Humane On-Call System for Production Services Without Slowing Incident Response
Production incidents demand fast response times, but traditional on-call systems often burn out engineers or create knowledge silos that slow down resolution. This article presents practical strategies for building an on-call system that protects team wellbeing while maintaining rapid incident response, drawing on insights from experienced engineering leaders. Learn how structured context handoffs and primary-secondary escalation models can create a sustainable on-call practice that keeps services reliable without sacrificing the humans who maintain them.
Adopt Structured Context Handoffs
The mistake I see most teams make is optimizing for coverage instead of sustainability. You can always get fast response times by putting more people on rotation—but that doesn't hold up over time.
What's worked better for us is designing around load, not schedule. Before adjusting rotations, we spent time reducing unnecessary alerts and tightening what actually wakes someone up. If everything is urgent, nothing is.
From there, we keep rotations predictable and relatively infrequent, but with very clear ownership during that window. No ambiguity about who responds, and no "soft" handoffs.
The single change that made the biggest difference was introducing a structured handoff with context, not just a calendar switch. At the end of each rotation, the outgoing engineer leaves a short, focused summary: active issues, flaky systems, anything that's "almost broken." It takes 10-15 minutes, but it prevents the next person from rediscovering the same problems at 2 a.m.
That alone reduced repeated incidents and lowered stress a lot more than tweaking schedules ever did.
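As a rough illustration, here is what that handoff can look like as a tiny script. This is a sketch with made-up names and ticket IDs, not a tool we ship; the fields simply mirror the summary described above.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class HandoffNote:
    """End-of-rotation summary the outgoing engineer fills in (10-15 minutes)."""
    outgoing: str
    incoming: str
    active_issues: list[str] = field(default_factory=list)
    flaky_systems: list[str] = field(default_factory=list)
    almost_broken: list[str] = field(default_factory=list)  # trending toward failure

def render_handoff(note: HandoffNote) -> str:
    """Render the note as plain text to post in the on-call channel."""
    lines = [
        f"On-call handoff {date.today().isoformat()}: {note.outgoing} -> {note.incoming}",
        "Active issues: " + ("; ".join(note.active_issues) or "none"),
        "Flaky systems: " + ("; ".join(note.flaky_systems) or "none"),
        "Almost broken: " + ("; ".join(note.almost_broken) or "none"),
    ]
    return "\n".join(lines)

# Example with entirely illustrative content:
print(render_handoff(HandoffNote(
    outgoing="alice",
    incoming="bob",
    active_issues=["payments retry queue backing up (OPS-112)"],
    flaky_systems=["staging DB failover"],
    almost_broken=["cert for api.internal expires in 9 days"],
)))
```

The exact fields matter less than the habit: active issues, flaky systems, and "almost broken" items are the three things that otherwise get rediscovered at 2 a.m.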

Use Primary-Secondary Escalation
On-call rotation got healthier for us the moment we stopped treating every responder as equally interruptible. The change that made the biggest difference was moving to a true primary-secondary schedule: only the primary gets paged by default, and the secondary comes in only on a missed acknowledgement or a genuinely high-severity incident. That lines up with current guidance on clear roles, escalation policies, fair load balancing, and alert-volume analytics for catching overload. It protected sleep and reduced alert fatigue without hurting response time, because ownership got clearer, not blurrier. My advice is simple: page fewer people, define the escalation path before the incident, and treat alert volume as a team-health metric as much as an ops metric.
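As a concrete sketch of that paging rule, assuming a five-minute acknowledgement timeout and a "sev1" label for the incidents that warrant paging both people; both values are placeholders to tune, not recommendations:

```python
from datetime import timedelta

ACK_TIMEOUT = timedelta(minutes=5)   # assumed; tune to your response SLOs
HIGH_SEVERITY = {"sev1"}             # assumed label for page-everyone incidents

def targets_to_page(severity: str, primary_acked: bool,
                    elapsed: timedelta) -> list[str]:
    """Page the primary by default; pull in the secondary only on a
    missed acknowledgement or a genuinely high-severity incident."""
    if severity in HIGH_SEVERITY:
        return ["primary", "secondary"]
    if not primary_acked and elapsed >= ACK_TIMEOUT:
        return ["secondary"]  # escalate after the missed ack
    return ["primary"]

# A sev2 page unacknowledged for six minutes escalates to the secondary.
print(targets_to_page("sev2", primary_acked=False, elapsed=timedelta(minutes=6)))
```

The design point is that escalation is a rule decided before the incident, not a judgment call made during one.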

Automate Safe Runbook Fixes
Automate common fixes with safe auto-remediation. Begin by encoding well-known runbooks into small actions that can run on their own. Add strong guardrails, such as clear triggers, time limits, and easy rollbacks, to lower risk.
Keep humans in the loop with alerts, chat approvals, and the option to stop the action at once. Track every auto-fix with logs and metrics so the team can learn and tune. Start by choosing one noisy alert and build a safe auto-fix for it today.
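Here is one shape those guardrails can take, sketched in Python. It is a sketch, not a production remediator: restart, healthy, and notify are hypothetical callables you would wire to your own service manager, health check, and chat tool.

```python
import logging
import time

log = logging.getLogger("auto_remediation")

MAX_RUNS_PER_HOUR = 2   # guardrail: a fix that keeps firing needs a human
RECHECK_DELAY_S = 30    # wait before verifying that the fix took

_recent_runs: list[float] = []

def remediate_stuck_worker(restart, healthy, notify) -> bool:
    """Run one well-known runbook step (restart a stuck worker) with a
    rate limit, a health re-check, and a notification on every outcome."""
    now = time.time()
    _recent_runs[:] = [t for t in _recent_runs if now - t < 3600]
    if len(_recent_runs) >= MAX_RUNS_PER_HOUR:
        notify("auto-fix rate limit hit; paging a human instead")
        return False
    _recent_runs.append(now)
    log.info("restarting stuck worker")
    restart()
    time.sleep(RECHECK_DELAY_S)
    if healthy():
        notify("auto-fix succeeded: worker restarted and healthy")
        return True
    notify("auto-fix failed the health re-check; escalating to on-call")
    return False
```

Every path notifies a human, and the rate limit keeps a flapping fix from masking a deeper failure.
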
Leverage Follow-the-Sun Rotations
Follow-the-sun scheduling avoids nighttime disruptions. Use time-zone-based rotations so alerts reach engineers who are awake and fresh. Plan clear handoffs with short overlap windows and a simple template for status, risks, and next steps.
Keep a shared dashboard and playbooks so any region can act fast without guesswork. Review page volume and fairness often to adjust staffing and prevent burnout. Run a two-region pilot and refine the handoff steps before scaling.
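A small sketch of the routing idea, assuming three regions with fixed UTC windows; a real rotation would come from your scheduling tool, but the wrap-around window is the part worth getting right:

```python
from datetime import datetime, timezone

# Assumed three-region setup; each tuple is (region, start_utc_hour, end_utc_hour).
REGIONS = [
    ("emea", 6, 14),   # 06:00-14:00 UTC
    ("amer", 14, 22),  # 14:00-22:00 UTC
    ("apac", 22, 6),   # 22:00-06:00 UTC, wrapping past midnight
]

def on_call_region(now: datetime | None = None) -> str:
    """Route pages to the region whose working hours cover the current UTC time."""
    hour = (now or datetime.now(timezone.utc)).hour
    for region, start, end in REGIONS:
        wraps = start > end
        if (start <= hour < end) or (wraps and (hour >= start or hour < end)):
            return region
    raise RuntimeError("region windows must cover all 24 hours")

print(on_call_region())  # e.g. "amer" at 15:00 UTC
```
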
Appoint Dedicated Response Leadership and Scribes
Separate the incident commander from technical responders. A trained commander leads the call, sets priorities, and keeps clear communication with stakeholders. This frees engineers to focus on fixing the issue without constant context switching.
Add a scribe role to capture timelines and actions for later review and learning. Rotate these roles so more people build the skill and no one is overloaded. Choose commanders for key services now and try this model in the next incident.
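A toy sketch of that rotation, with made-up names; the point is that assignment is mechanical, so nobody quietly becomes the permanent commander:

```python
from datetime import date

# Illustrative pools; in practice these come from your schedule tool.
COMMANDERS = ["dana", "eli", "fern"]
SCRIBES = ["gus", "hana", "iris"]

def roles_for_week(week: int) -> dict[str, str]:
    """Round-robin the commander and scribe roles; the offset keeps
    the same pair from always landing together."""
    return {
        "commander": COMMANDERS[week % len(COMMANDERS)],
        "scribe": SCRIBES[(week + 1) % len(SCRIBES)],
    }

print(roles_for_week(date.today().isocalendar().week))
```
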
Guarantee Recovery Time and Compensation
Guarantee post-incident recovery time and compensation. Write a policy that grants paid recovery hours after late-night pages and major events. Include extra pay or credits for holidays and high-load weeks to make the duty fair.
Block calendars after tough shifts so people can rest and avoid meetings. Track fatigue and page counts so leaders can step in before quality drops. Publish the policy and start granting recovery time after the very next incident.
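One way to make the policy self-executing rather than discretionary, sketched with an assumed rate of four recovery hours per overnight page; pick the real numbers with the team:

```python
from datetime import datetime

RECOVERY_HOURS_PER_NIGHT_PAGE = 4  # assumed; set this with the team, not for it

def recovery_hours_owed(page_times: list[datetime]) -> int:
    """Translate overnight pages (22:00-06:00 local) into paid recovery
    hours, so rest is granted automatically, not negotiated shift by shift."""
    night_pages = sum(1 for t in page_times if t.hour >= 22 or t.hour < 6)
    return night_pages * RECOVERY_HOURS_PER_NIGHT_PAGE

# Two 2 a.m. pages in one week owe eight recovery hours.
pages = [datetime(2024, 3, 4, 2, 10), datetime(2024, 3, 6, 2, 45)]
print(recovery_hours_owed(pages))  # -> 8
```
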
Run Frequent Realistic Incident Drills
Run frequent incident drills to accelerate response. Short, realistic drills build muscle memory and reduce panic during real events. Vary the scenarios to practice paging, triage, rollback, and cross-team help.
Time each phase and note what slowed people down so the team can improve. Update runbooks and tools right after each drill while lessons are fresh. Schedule a one-hour drill this week and commit to shipping the top improvements.
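A minimal phase timer for a tabletop drill; the phase names are examples, and a real harness would write timings into the retro doc instead of printing them:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def phase(name: str):
    """Time one phase of a drill so the retro can point at what
    actually slowed people down."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[name] = time.monotonic() - start

# During the drill, wrap each phase:
with phase("page-to-ack"):
    input("press enter when the page is acknowledged... ")
with phase("triage"):
    input("press enter when the cause is identified... ")
with phase("rollback"):
    input("press enter when the rollback completes... ")

for name, seconds in timings.items():
    print(f"{name}: {seconds:.0f}s")
```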
