System Design: Incident response platform like PagerDuty.

In this article we will walk through a low-level design for a PagerDuty-like incident response and alerting platform for IT companies. Many companies use such platforms to track incidents in production and act on them based on their severity.

1. Requirements and Goals of the System:

  • Users should be able to configure preferred channels for notifications.
  • Admins should be able to create/edit an on-call schedule with primary and secondary on-calls.
  • System should have the ability to escalate the incident based on priority.

2. Entity Identification and Class definitions:

class User {
    int employeeId;
    String emailId;
    String phoneNumber;
    Team team;
}
enum Team {
    ABC, XYZ;
}

This base User class can now be extended to model employees, managers, admins, etc. For example:

class Employee extends User {
    User manager;
    String notificationSettings; // bitmap string, e.g. "110"
    boolean isAdmin;
}

Here we use a bitmap string to represent the notification settings of a user. For example, for the channels (sms, email, phone-call) the value “110” can be read as sms → ON, email → ON and phone-call → OFF.
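To make the bitmap idea concrete, here is a minimal sketch of decoding and encoding such a string. The names Channel and NotificationSettings are hypothetical helpers introduced only for illustration, and the channel order (sms, email, phone-call) is assumed from the example above.

```java
import java.util.EnumSet;
import java.util.Set;

// Assumed channel order, matching the "110" example: sms, email, phone-call.
enum Channel { SMS, EMAIL, PHONE_CALL }

// Hypothetical helper, not part of the original design.
final class NotificationSettings {
    private NotificationSettings() {}

    // Decodes a bitmap string such as "110" into the set of enabled channels.
    static Set<Channel> decode(String bitmap) {
        Channel[] channels = Channel.values();
        if (bitmap == null || bitmap.length() != channels.length) {
            throw new IllegalArgumentException("expected " + channels.length + " bits: " + bitmap);
        }
        Set<Channel> enabled = EnumSet.noneOf(Channel.class);
        for (int i = 0; i < channels.length; i++) {
            char bit = bitmap.charAt(i);
            if (bit == '1') {
                enabled.add(channels[i]);
            } else if (bit != '0') {
                throw new IllegalArgumentException("invalid bit '" + bit + "' in " + bitmap);
            }
        }
        return enabled;
    }

    // Encodes a set of channels back into the bitmap string form.
    static String encode(Set<Channel> enabled) {
        StringBuilder sb = new StringBuilder();
        for (Channel c : Channel.values()) {
            sb.append(enabled.contains(c) ? '1' : '0');
        }
        return sb.toString();
    }
}
```

With this sketch, decode("110") yields the set {SMS, EMAIL}, and adding a new channel only means appending a value to the enum and one more bit to the stored string.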

Next we need a schedule class which will represent a schedule in our system.

class Schedule {
    LocalDateTime startTime;
    LocalDateTime endTime;
    Team team;
    Employee primary;
    Employee secondary;
}

Another entity in the system is an incident, which acts as the trigger for alerts.

class Incident {
    int id;
    String description;
    Severity severity;
    Team team;
    String externalUrl;
    LocalDateTime startedOn;
    IncidentStatus status;
    Employee currentEscalation;
}
enum Severity {
    SEV1, SEV2, SEV3;
}
enum IncidentStatus {
    OPEN, ACKNOWLEDGED, RESOLVED;
}

3. Data structures:

The on-call rotation can be kept as an ordered list of users, for example [A, B, C]. If we create a schedule starting 1st Jan with a 7-day period, user “A” will be primary on-call in the 1st week of Jan, user “B” in the 2nd and so on, and user “A” will be primary again after 3 weeks. If we need to remove, add or shift a user in the schedule, we can simply change the list and our logic is not affected. For example, after removing user “C” our list looks like [A, B].
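The rotation described above can be sketched as a simple index calculation over an ordered list. This is only an illustration under a few assumptions: the Rotation class and its method names are hypothetical, users are represented by plain strings, and the secondary on-call is assumed to be the next user in the list (the article does not say how the secondary is picked).

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.List;

// Hypothetical helper, not part of the original design.
final class Rotation {
    private Rotation() {}

    // Number of whole shifts that have elapsed between `start` and `day`.
    private static long shiftIndex(LocalDate day, LocalDate start, int periodDays, List<String> users) {
        if (users.isEmpty() || periodDays <= 0 || day.isBefore(start)) {
            throw new IllegalArgumentException("invalid rotation input");
        }
        return ChronoUnit.DAYS.between(start, day) / periodDays;
    }

    // Primary on-call for `day`: walk the list once per shift, wrapping around.
    static String primaryOn(LocalDate day, LocalDate start, int periodDays, List<String> users) {
        long index = shiftIndex(day, start, periodDays, users);
        return users.get((int) (index % users.size()));
    }

    // Assumption: the secondary is the next user in the list.
    static String secondaryOn(LocalDate day, LocalDate start, int periodDays, List<String> users) {
        long index = shiftIndex(day, start, periodDays, users);
        return users.get((int) ((index + 1) % users.size()));
    }
}
```

With a start date of 1st Jan 2024 (the year is picked arbitrarily), a 7-day period and the list [A, B, C], 8th Jan resolves to “B” and 22nd Jan resolves to “A” again. Changing the list to [A, B] needs no change in the lookup code, although it does shift who is on call in later weeks.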

4. Notify():

notify(Incident incident):
1. fetch schedule where incident.team == schedule.team and currentTime >= schedule.startTime and currentTime <= schedule.endTime
2. fetch employee's notificationSettings where schedule.primary.id == employee.id or schedule.secondary.id == employee.id
3. send notification to employees with incident.description and incident.externalUrl
4. store the incident with IncidentStatus as OPEN, incident.currentEscalation = schedule.primary
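The four steps above can be sketched in Java roughly as follows. This is a simplified illustration, not a full implementation: the classes are trimmed to the fields the flow needs, Notifier and IncidentService are hypothetical names, the in-memory list stands in for a database, and per-channel delivery based on notificationSettings is left out. The method is called notifyOnCall to avoid confusion with Object.notify().

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

enum Team { ABC, XYZ }
enum IncidentStatus { OPEN, ACKNOWLEDGED, RESOLVED }

class Employee {
    final int employeeId;
    Employee(int employeeId) { this.employeeId = employeeId; }
}

class Schedule {
    final Team team;
    final LocalDateTime startTime, endTime;
    final Employee primary, secondary;

    Schedule(Team team, LocalDateTime startTime, LocalDateTime endTime,
             Employee primary, Employee secondary) {
        this.team = team;
        this.startTime = startTime;
        this.endTime = endTime;
        this.primary = primary;
        this.secondary = secondary;
    }

    // True when the schedule is for this team and `now` lies in [startTime, endTime].
    boolean covers(Team incidentTeam, LocalDateTime now) {
        return team == incidentTeam && !now.isBefore(startTime) && !now.isAfter(endTime);
    }
}

class Incident {
    final Team team;
    final String description, externalUrl;
    IncidentStatus status;
    Employee currentEscalation;

    Incident(Team team, String description, String externalUrl) {
        this.team = team;
        this.description = description;
        this.externalUrl = externalUrl;
    }
}

// Hypothetical delivery hook; a real system would fan out per channel.
interface Notifier {
    void send(Employee recipient, String message);
}

class IncidentService {
    final List<Incident> store = new ArrayList<>(); // stand-in for a database
    private final List<Schedule> schedules;
    private final Notifier notifier;

    IncidentService(List<Schedule> schedules, Notifier notifier) {
        this.schedules = schedules;
        this.notifier = notifier;
    }

    void notifyOnCall(Incident incident, LocalDateTime now) {
        // Step 1: find the active schedule for the incident's team.
        Schedule active = schedules.stream()
                .filter(s -> s.covers(incident.team, now))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no on-call schedule for " + incident.team));
        // Steps 2-3: notify both the primary and the secondary on-call.
        String message = incident.description + " " + incident.externalUrl;
        notifier.send(active.primary, message);
        notifier.send(active.secondary, message);
        // Step 4: store the incident as OPEN, escalated to the primary.
        incident.status = IncidentStatus.OPEN;
        incident.currentEscalation = active.primary;
        store.add(incident);
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the method, is a small design choice that keeps the flow easy to test.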

How will the escalation work?
We will run a cron job which monitors the stored incidents and escalates them based on status, startedOn and severity. For example, if we configure SEV1 to be escalated every 10 mins:

escalateCron(Incident incident):
1. check if incident.startedOn + 10mins <= currentTime and incident.status == OPEN
2. fetch incident.currentEscalation employee's manager
3. send notification to manager with incident.description and incident.externalUrl
4. Update incident.currentEscalation = manager
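Here is a rough Java sketch of one cron pass over a single incident. Several things in it are assumptions rather than part of the design above: the Escalator class, the lastEscalatedAt field (added so that an incident is escalated once per interval instead of on every cron run after the first 10 minutes), and the SEV2 and SEV3 intervals. Only the 10-minute SEV1 interval comes from the example. Sending the notification itself is left as a comment.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.EnumMap;
import java.util.Map;

enum Severity { SEV1, SEV2, SEV3 }
enum IncidentStatus { OPEN, ACKNOWLEDGED, RESOLVED }

class Employee {
    final String name;
    final Employee manager; // null at the top of the reporting chain
    Employee(String name, Employee manager) {
        this.name = name;
        this.manager = manager;
    }
}

class Incident {
    final Severity severity;
    IncidentStatus status = IncidentStatus.OPEN;
    Employee currentEscalation;
    LocalDateTime lastEscalatedAt; // hypothetical field; starts out equal to startedOn

    Incident(Severity severity, LocalDateTime startedOn, Employee onCall) {
        this.severity = severity;
        this.currentEscalation = onCall;
        this.lastEscalatedAt = startedOn;
    }
}

// Hypothetical helper, not part of the original design.
class Escalator {
    private final Map<Severity, Duration> intervals = new EnumMap<>(Severity.class);

    Escalator() {
        intervals.put(Severity.SEV1, Duration.ofMinutes(10)); // from the example
        intervals.put(Severity.SEV2, Duration.ofMinutes(30)); // assumed value
        intervals.put(Severity.SEV3, Duration.ofMinutes(60)); // assumed value
    }

    // Called by the cron for each stored incident; returns true if it escalated.
    boolean escalateIfDue(Incident incident, LocalDateTime now) {
        // Step 1: only OPEN incidents whose interval has fully elapsed are escalated.
        if (incident.status != IncidentStatus.OPEN) {
            return false;
        }
        LocalDateTime due = incident.lastEscalatedAt.plus(intervals.get(incident.severity));
        if (now.isBefore(due)) {
            return false;
        }
        // Step 2: look up the manager of whoever currently holds the incident.
        Employee manager = incident.currentEscalation.manager;
        if (manager == null) {
            return false; // already at the top of the chain
        }
        // Step 3: a real system would send the notification to `manager` here.
        // Step 4: record the new escalation level.
        incident.currentEscalation = manager;
        incident.lastEscalatedAt = now;
        return true;
    }
}
```

Once an incident is ACKNOWLEDGED or RESOLVED the first check stops any further escalation, which matches the status condition in step 1.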

As an extension, the system can also consume the Google Calendar APIs to fetch the leave schedule of an employee, and the escalation cron can be enhanced to take employee leaves into account.

Thanks for reading the blog. I hope it was somewhat helpful. 😊
