Build-Learn-Adjust, but know your metrics first
When people talk about software engineering, whether that means building a website or an ML algorithm, they often talk about the need to “iterate”. If you’ve been around software engineering since the early 2000s, you’ve probably heard the word “agile”. Some of Agile’s core principles are to deliver software more frequently, get feedback faster, and adjust.
One thing people often forget, and which I consider step 0, is “Observability”: which metrics should you track to evaluate how your product is faring, how your software is performing, or how your model is predicting? People often embark on lengthy journeys and only work out along the way which metrics actually measure success. At ReWorked, we spent a lot of time debating the right metric for evaluating Betty, and this blog walks through that journey and how Betty continues to iterate and evolve.
What we’re fundamentally building is a supervised ML model that’s a binary classifier: it predicts Inquiry = True or Inquiry = False. In the future we could extend it to a multi-class model with four classes: DEAL, Inquiry-Success, Inquiry-Failure, No-Inquiry. After all, every marketing campaign that an investor sends out ends up in one of those four classes.
So what metric should we focus on while evaluating an ML model that’s a binary classifier?
We first naively thought the right metric was Percentage of True Inquiries Predicted (PTIP). The thesis was that the more true inquiries the model predicts, the better. Expressed as a percentage, a model that predicts 100% of the true inquiries is obviously the better model, right? Wrong.
Why? Well, our model could predict Inquiry = True for 100% of the file. That would score a perfect PTIP, yet it gives no value to our customers, since the whole point is to predict just enough Inquiry = True records that our customers save $$s on mailing costs.
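To make that failure mode concrete, here is a minimal sketch. It assumes PTIP means the share of actual inquiries that the model also flagged as inquiries; the function name ptip and the ten-record file are made up for illustration and are not taken from Betty.

```python
def ptip(y_true, y_pred):
    """Percentage of actual inquiries that were also predicted as inquiries.

    Assumed reading of PTIP; not necessarily the exact production definition.
    """
    actual_inquiries = sum(y_true)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    return 100.0 * hits / actual_inquiries if actual_inquiries else 0.0


# Hypothetical file of 10 records, 2 of which really turned into inquiries.
y_true = [True, False, False, True, False, False, False, False, False, False]

# A "model" that marks every record as Inquiry = True.
predict_everything = [True] * len(y_true)

score = ptip(y_true, predict_everything)
mailers_saved = predict_everything.count(False)

print(f"PTIP: {score}%  mailers saved: {mailers_saved}")
```

The score is perfect, but not a single mailer is removed from the list, which is exactly why PTIP cannot be the only number we look at.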
Enter the Confusion Matrix. In Machine Learning, the confusion matrix is often used as a performance measurement tool for classification models with 2 or more classes. Well, that’s what we need! [P.S. read more about the Confusion Matrix here.]
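So, the next metric we looked at was model accuracy, which is typically computed as accuracy = (TP+TN)/(TP+TN+FP+FN)*100, where TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives.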
This is a good metric, as it captures both the true positives (i.e. correctly predicted inquiries) as well as true negatives (i.e. correctly predicted non-inquiries).
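As a rough sketch of how the four confusion matrix counts and the accuracy formula fit together (the helper names and the sample labels below are illustrative, not our production code):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for boolean labels, where True means Inquiry."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual and predicted:
            tp += 1  # predicted an inquiry, and it was one
        elif not actual and not predicted:
            tn += 1  # predicted a non-inquiry, and it was one
        elif predicted:
            fp += 1  # predicted an inquiry, in reality it was not
        else:
            fn += 1  # predicted a non-inquiry, in reality it was an inquiry
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """accuracy = (TP+TN)/(TP+TN+FP+FN)*100"""
    total = tp + tn + fp + fn
    return 100.0 * (tp + tn) / total if total else 0.0


# Hypothetical labels for 10 records.
y_true = [True, True, False, False, False, False, False, False, False, False]
y_pred = [True, False, True, False, False, False, False, False, False, False]

counts = confusion_counts(y_true, y_pred)
print(counts)             # (1, 7, 1, 1)
print(accuracy(*counts))  # 80.0
```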
However, there is still a problem. Model accuracy treats every error the same, whereas the cost of a False Positive (we predicted an inquiry, in reality there was none) is very different from the cost of a False Negative (we predicted a non-inquiry, in reality there was one). A False Positive costs a few cents (i.e. the cost of a mailer / SMS), while a False Negative potentially costs tens of thousands of dollars. This imbalance between FP and FN is not represented in the model accuracy metric.
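A small sketch shows how accuracy hides this. The two sets of confusion matrix counts and the dollar figures below are hypothetical placeholders chosen only to match the orders of magnitude described above (cents per mailer, tens of thousands per missed inquiry); they are not real ReWorked numbers.

```python
# Assumed, illustrative costs per error type.
FP_COST = 0.50        # one wasted mailer / SMS
FN_COST = 10_000.00   # one missed inquiry


def accuracy(tp, tn, fp, fn):
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def error_cost(fp, fn):
    """Total dollar cost of the mistakes under the assumed per-error costs."""
    return fp * FP_COST + fn * FN_COST


# Two hypothetical models scored on the same 100 records (10 real inquiries).
model_a = {"tp": 8, "tn": 80, "fp": 10, "fn": 2}  # mails more, misses less
model_b = {"tp": 2, "tn": 86, "fp": 4, "fn": 8}   # mails less, misses more

acc_a, acc_b = accuracy(**model_a), accuracy(**model_b)
cost_a = error_cost(model_a["fp"], model_a["fn"])
cost_b = error_cost(model_b["fp"], model_b["fn"])

print(acc_a, cost_a)  # 88.0 20005.0
print(acc_b, cost_b)  # 88.0 80002.0
```

Both models report the same accuracy, yet under these assumed costs the second one is roughly four times more expensive because its errors are mostly False Negatives.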
So in light of all that, a new metric had to be derived, one that strikes the right balance between identifying enough Inquiry = False records and keeping the TPs high. The new metric we came up with is a weighted average, calculated as follows: (2*PTIP + Model Accuracy)/3
This turned out to be a great metric for us because it weights True Positives (through PTIP) twice as heavily as overall model accuracy while still accounting for both.
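Here is a minimal sketch of the weighted average, again assuming PTIP is TP/(TP+FN)*100. The function names and the confusion matrix counts are hypothetical and only meant to show how the score behaves.

```python
def ptip(tp, fn):
    """Assumed PTIP: share of actual inquiries that were predicted, in percent."""
    return 100.0 * tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return 100.0 * (tp + tn) / total if total else 0.0


def weighted_score(tp, tn, fp, fn):
    """(2*PTIP + Model Accuracy)/3"""
    return (2 * ptip(tp, fn) + accuracy(tp, tn, fp, fn)) / 3


# Hypothetical results on 100 records containing 10 real inquiries.
candidates = {
    "mails more, misses less": dict(tp=8, tn=80, fp=10, fn=2),
    "mails less, misses more": dict(tp=2, tn=86, fp=4, fn=8),
    "predicts everything True": dict(tp=10, tn=0, fp=90, fn=0),
}

scores = {name: weighted_score(**c) for name, c in candidates.items()}
for name, value in scores.items():
    print(f"{name}: {value:.1f}")
```

With these made-up counts, the first two models tie on accuracy (88%) but land far apart on the weighted score (about 82.7 versus 42.7), and the predict-everything model drops from a perfect PTIP of 100 to a weighted score of 70 because its accuracy is only 10%.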
Now that we’ve defined which metric is important to us, in upcoming blogs we’ll write about how at ReWorkedREI we’re building ML models to solve an important problem in the REI industry and help our customers reduce their marketing spend and essentially “Do More with Less”.