The Legal Innovation & Technology Lab's Spot API Beta
@ Suffolk Law School - Spotter Updated: 2019-07-29
Is Spot Useful?
That depends on what you're using it for. Think of Spot like a map. It doesn't have resolution down to the inch, but it can still be helpful. Like a map, Spot is a model, and as the saying goes, "All models are wrong, but some are useful." Because it's "wrong," those using Spot should recognize that its output is not the final word, and if they want to know if it's useful, they'll have to ask "compared to what?" That being said, what you probably want to know is, "how wrong is it?"
Unfortunately, you can't just look at Spot's accuracy. To understand why, let's say you want to evaluate an algorithm that predicts if there will be a snow day tomorrow. I have a algorithm with an accuracy of 98%. Impressive right? What if I told you my algorithm was, "always guess no?" My model is 98% accurate because snow days only happen 2% of the time. To know if my algorithm was any good, you needed to know more. So you might ask what percentage of actual snow days did I "catch?" This is something called recall. The answer is 0%. You could also ask how often am I right when I say something is a snow day. This is something called precision. Since I didn't actually predict any snow days, I can't even calculate this number because I'd have to divide by zero. Either way, these alternative metrics make it clear that my model is no good. And none of that even takes in to account that I may have a preference for false positives over false negatives or vice versa. Consider a screening test for some ailment. The idea is that it's one step in a process, a filter used to identify folks for a diagnostic test (i.e., it's not the final word). In such a case you might care more about minimizing false negatives than you do about false positives. The point is, it's complicated.
That being said, the average weighted† accuracy for our active labels is currently 95%. For recall, it's 83%, and precision is 85%. That "active labels" qualifier limits our model to only those labels for which we can make predictions better than a coin flip or always guessing yes/no, and it only includes the following 12 labels: Accidents, Injuries, and Torts (Problems with Others); Court and Lawyers; Crime & Prisons; Estates & Wills; Family; Health; Housing; Immigration; Money, Debt, and Consumer Issues; Small Business and IP; Traffic and Cars; Work and Employment.
You can see detailed performance metrics for each of our active labels by pinging the API's taxonomy method. It returns a json string with details about each. See https://spot.suffolklitlab.org/v0/taxonomy/
Those performance metrics, however, are the ones you get if you take the model as predicting the presence of a label when it states there is more than a 50% chance of it being there. If you change the cutoff, you can favor recall over precision, or the other way around. So API users should think carefully about their use case and what cutoff is appropriate. See Notes on Uncertainty below.
Of course, we're in very early beta, and we're working to grow our list of active labels and improve the metrics associated with each. If you'd like to help, you can play a few rounds over at Learned Hands and grow our training set.
When Spot returns a list of issues, it provides a prediction accompanied by a lower and upper bound for that prediction. See Documentation. The bounds are intended to provide users with information about the uncertainty for a given prediction, allowing them to tailor their results to their use case and tolerance for error. The larger the distance between the lower bound and the prediction, the more likely it is that the prediction is overly optimistic about finding the issue. Conversely, the larger the difference between the prediction and the upper bound, the more likely it is that the prediction is under-estimating the chance that the issue is present. Users are encouraged to consider this when setting the cutoffs for triggering the return of an issue. For example, a user interested in avoiding false negatives may set Spot to return results based on the upper bound being greater than 50%, while one interested in avoiding false positives might choose to base their cutoff on the lower bound. The distance between the lower and upper bounds communicates something about the overall uncertainty around a prediction, with the most certain predictions being those with the smallest distances between the two. We're still experimenting to find the best method for defining our bounds. So watch this space for more info. in the future.
† Values are weighted based on the number of affirmative examples for each in issue in our dataset.