Sep 11, 2023

How to Build Customer Escalation Prediction that Works

Predictive AI is allowing support organizations to cut escalation rates in half, and the savings in resources and operations cannot be overstated. Predicting escalations is an ideal application for AI because it uses the trends and patterns from previous cases and customer interactions to determine whether or not an open case will escalate.

Likely to Escalate (LTE, as we call them) models scan every open case for clues and tell support teams where to focus their energy. Because predictions drive where the team spends its effort, the better the model, the higher the ROI from using it. On top of this is the benefit to customer relationships: customers’ most pressing issues are identified and solved faster, like a water glass being refilled before you have to flag down a server or sit there thirsty.

So what makes an escalation prediction model better than others? This article explains how we measure our LTE model performance and why. 

Our current LTE model is a binary classification model, which means the model returns a Yes or No on whether a case is going to be escalated. This is actually an oversimplification: in reality, the model produces a probability for each case, ranks the cases, and surfaces only as many predictions as each customer has the capacity to address. Think of this like a resource threshold: if your team can handle 5 escalation predictions a day, the model gives your team the 5 most valuable predictions. If your team can handle 10, then the model adjusts to give you the 10 best predictions.
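To make the ranking step concrete, here is a minimal sketch in Python; the case IDs, probabilities, and capacity value are hypothetical and are not taken from the production pipeline:

```python
# A minimal sketch of capacity-based ranking, not the production model.
# Case IDs and probabilities below are hypothetical.

def top_k_predictions(case_probs, capacity):
    """Return the `capacity` open cases with the highest escalation probability."""
    ranked = sorted(case_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [case_id for case_id, _ in ranked[:capacity]]

open_cases = {"case A": 0.91, "case B": 0.84, "case C": 0.12, "case D": 0.67}
print(top_k_predictions(open_cases, capacity=2))   # ['case A', 'case B']
```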

Accuracy, precision, and recall are all important metrics for evaluating a binary classification model. Our LTE model, however, is measured on recall rather than accuracy or precision, and I’ll explain why below.

Accuracy, Precision & Recall

Using Accuracy

Accuracy measures how many of all the predictions made by a model are correct. For example:

          Predicted to be escalated   Actually escalated?
case A    Yes                         Yes
case B    Yes                         No
case C    No                          No

There are 3 cases, case A, case B and case C:

  • An LTE model makes 3 predictions: case A and B are going to be escalated and case C is not. Eventually, case A is escalated and case B and C are not.
  • Among the 3 predictions, the LTE model is correct on 2 (i.e. case A is predicted to escalate and does escalate, and case C is predicted not to escalate and does not) and incorrect on 1 (case B is predicted to escalate but does not). So the accuracy is 2 out of 3, or ~67% (the sketch below repeats this arithmetic in code).
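The same arithmetic in a short Python sketch, using only the outcomes from the table above:

```python
# Repeating the accuracy arithmetic from the example above.
predicted = {"case A": True, "case B": True, "case C": False}   # model output
actual    = {"case A": True, "case B": False, "case C": False}  # what really happened

correct = sum(predicted[c] == actual[c] for c in predicted)
accuracy = correct / len(predicted)
print(f"accuracy = {correct}/{len(predicted)} = {accuracy:.0%}")  # accuracy = 2/3 = 67%
```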

The Problem with Accuracy

Given that typical escalation rates are low (such as 5%), a model that simply says “not going to be escalated” for every case would have an accuracy of 100% – 5% = 95%. On paper this looks really good (“wow, a model with 95% accuracy”), but in reality it DOES NOT help us prevent any escalations because the model predicts no escalations at all. This is why accuracy is not a suitable metric for LTE models.
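A tiny sketch makes the pitfall concrete; the 5% escalation rate and the 100-case queue are assumed purely for illustration:

```python
# Illustration of the class-imbalance pitfall with an assumed 5% escalation rate:
# a model that never predicts an escalation still scores 95% accuracy.
y_true = [1] * 5 + [0] * 95    # 5 cases escalate, 95 do not
y_pred = [0] * 100             # naive model: "not going to be escalated" for every case

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                # 0.95, yet zero escalations are caught
```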

Using Precision

Now that we know the problem with accuracy, we are more interested in how well the model identifies the to-be-escalated cases, as opposed to the majority of cases that will never escalate. We categorize the model’s predictions into two groups: positive and negative.

  • A positive prediction means that the model predicts a case to be escalated 
  • A negative prediction means that the model predicts a case not to be escalated

Precision measures how many of the positive predictions made by a model are correct. Using the same example as above:

          Predicted to be escalated   Prediction type   Actually escalated?
case A    Yes                         Positive          Yes
case B    Yes                         Positive          No
case C    No                          Negative          No

There are 3 cases, case A, case B and case C:

  • An LTE model makes 3 predictions: case A and B are going to be escalated and case C is not. Eventually, case A is escalated and case B and C are not.
  • As calculated above, the accuracy is 2 out of 3 or ~67%. 
  • Since precision is measured on positive predictions only, we need to know how many positive predictions the model makes and how many of them are correct. In this example, the LTE model makes 2 positive predictions (cases A and B are predicted to escalate); 1 prediction is correct (case A is predicted to escalate and actually escalates) and 1 is incorrect (case B is predicted to escalate but does not). Therefore, the precision is 1 out of 2, or 50% (see the sketch below).
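The precision arithmetic as a short Python sketch, again using only the outcomes from the table above:

```python
# Precision is computed over positive predictions only.
predicted = {"case A": True, "case B": True, "case C": False}
actual    = {"case A": True, "case B": False, "case C": False}

positives = [c for c, p in predicted.items() if p]            # cases A and B
true_positives = [c for c in positives if actual[c]]          # only case A
precision = len(true_positives) / len(positives)
print(f"precision = {len(true_positives)}/{len(positives)} = {precision:.0%}")  # 1/2 = 50%
```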

The Problem with Precision

In the example, case B is predicted to escalate but does not actually escalate, which puts a dent in the metric. However, in reality, this is exactly the type of result we want to achieve:

  1. The LTE model flags early about to-be-escalated cases
  2. Agents take actions against those cases
  3. The cases get deflected

The better the model is at predicting to-be-escalated cases, the worse the precision will be, assuming that agents take action based on the LTE flags. In other words, if agents follow the predictions of a perfect LTE model and all cases are deflectable, the model should have a precision of 0% because all predicted LTE cases will be deflected. This is why using precision to measure LTE models is MISLEADING, and to some extent we could even argue that the lower the precision, the greater the business impact the model generates. If someone claims a model is 3x better on a metric like precision, be cautious about what that really means.
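Here is a worked illustration of that effect; the outcomes are hypothetical and simply assume a perfect model whose flagged cases all get deflected:

```python
# Hypothetical outcomes: a "perfect" LTE model whose flagged cases are all
# deflected by agents ends up with 0% measured precision.
flagged = ["case A", "case B"]        # the model's positive predictions
actually_escalated = []               # agents acted on the warnings and deflected both

true_positives = [c for c in flagged if c in actually_escalated]
print(len(true_positives) / len(flagged))   # 0.0, so vanilla precision penalizes success
```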

Using Recall

Besides accuracy and precision, another important metric is recall. The idea of recall is “in hindsight, how many of the eventually escalated cases did the LTE model catch initially?” (that’s why it’s called recall). Using the same example as above:

          Predicted to be escalated   Prediction type   Actually escalated?
case A    Yes                         Positive          Yes
case B    Yes                         Positive          No
case C    No                          Negative          No

There are 3 cases, case A, case B and case C:

  • An LTE model makes 3 predictions: case A and B are going to be escalated and case C is not. Eventually, case A is escalated and case B and C are not.
  • As calculated above, the accuracy is 2 out of 3 or ~67% and the precision is 1 out of 2 or 50%.
  • In hindsight, only 1 case escalated (case A), and the model did predict case A to be escalated. Therefore, the recall is 1 out of 1, or 100% (the sketch below walks through the same count).
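The recall arithmetic as a short sketch, again using only the outcomes from the table:

```python
# Recall is measured against the cases that truly escalated.
predicted = {"case A": True, "case B": True, "case C": False}
actual    = {"case A": True, "case B": False, "case C": False}

escalated = [c for c, e in actual.items() if e]               # only case A
caught = [c for c in escalated if predicted[c]]               # case A was flagged
recall = len(caught) / len(escalated)
print(f"recall = {len(caught)}/{len(escalated)} = {recall:.0%}")   # 1/1 = 100%
```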

The Problem with Recall

Is achieving 100% recall a good thing? It depends. We could build a model that simply predicts every case will escalate, and its recall would be 100%. But that model DOES NOT help customers focus on the important cases, because customers don’t have the resources to treat every case as a likely escalation, which in practice is no different from predicting that no case will escalate.
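A small illustration of that failure mode, with an assumed 100-case queue containing 5 true escalations:

```python
# Flagging every case yields perfect recall but buries the team in 100 flags.
y_true = [1] * 5 + [0] * 95
y_pred = [1] * 100                   # "everything will escalate"

caught = sum(1 for t, p in zip(y_true, y_pred) if t and p)
recall = caught / sum(y_true)
print(recall, sum(y_pred))           # 1.0 recall, but 100 cases flagged
```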

What’s the Best Measurement?

The Ideal World: A Modified Version of Precision

If we could attribute what agents did to deflect a to-be-escalated case back to the LTE model, we could use a modified version of precision to measure the LTE model performance. Using the same example as above:

  • There are 3 cases, case A, case B and case C. An LTE model makes 3 predictions: case A and B are going to be escalated and case C is not.
  • Eventually, case A is escalated and case B and C are not.

Since we know that case B did not escalate because an agent took the cue from our LTE model and proactively followed up with the customer, we count case B as a good positive prediction even though it did not escalate. The LTE model therefore makes 2 positive predictions and both are good (case A escalated, and case B was deflected thanks to the early warning from the LTE model), so the modified version of precision would be 2 out of 2, or 100%.
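A sketch of this modified precision, assuming (hypothetically) that we could record whether each deflection was attributable to the LTE warning; the field names are illustrative, not a production schema:

```python
# A positive prediction counts as "good" if the case escalated OR was deflected
# because of the LTE warning.
cases = [
    {"id": "case A", "predicted": True,  "escalated": True,  "deflected_by_lte": False},
    {"id": "case B", "predicted": True,  "escalated": False, "deflected_by_lte": True},
    {"id": "case C", "predicted": False, "escalated": False, "deflected_by_lte": False},
]

positives = [c for c in cases if c["predicted"]]
good = [c for c in positives if c["escalated"] or c["deflected_by_lte"]]
print(len(good) / len(positives))    # 1.0, i.e. 2 out of 2 positive predictions paid off
```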

The Real World: Recall

Coming back to reality, it’s extremely difficult to attribute what agents did to deflect a to-be-escalated case back to the LTE model, because we would need confirmation from the agents that they took action because of the LTE prediction.

How can we bridge the two? 

As called out above, the vanilla version of precision is bad because the better the model performs, the lower the precision. That’s why recall is used to measure LTE model performance. 

But there’s a problem with recall as well, right? Yes, but in reality we don’t build a model that says “every case is going to be escalated”. Instead, the model flags only the top percentage of cases (typically 1% to 5%, customized to the account) based on their ranked LTE probabilities. Within that constraint, recall is the most practical metric for measuring LTE performance, as the sketch below illustrates.
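A rough sketch of that budgeting behaviour, assuming illustrative probabilities and a 5% flag budget:

```python
import math

# Only the top slice of ranked cases (here 5%) is flagged, so positive
# predictions stay scarce and recall is measured against a small flag budget.
def flag_top_percent(case_probs, percent):
    budget = max(1, math.ceil(len(case_probs) * percent / 100))
    ranked = sorted(case_probs, key=case_probs.get, reverse=True)
    return set(ranked[:budget])

probs = {f"case {i}": i / 200 for i in range(200)}   # 200 hypothetical open cases
print(len(flag_top_percent(probs, percent=5)))       # 10 cases flagged
```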
