The Legal Innovation & Technology Lab's Spot API
@ Suffolk Law School  -  Spot Version: 2022-05-21 (Build 10)

Spot's Training Data

[Image: the character Data, giant-sized, standing next to the Supreme Court. "Big Data & the Law," h/t Tim Sackton & Josh Lee]

Spot has been trained on more than one million labels. This data comes from multiple sources, including the crowdsourcing effort Learned Hands and individual API users (e.g., a legal aid organization running an online triage tool). The data pair people's natural articulations of issues (e.g., "my landlord kicked me out") with LIST/NSMIv2 labels indicating what issues might be relevant (e.g., HO-02-00-00-00: Eviction from a home). Spot works by finding patterns in these associations and attempting to match novel texts to what it has seen before.

Consequently, high-quality labeled data is the bedrock on which Spot is built, and the accumulation of new data over time improves Spot's performance in both accuracy and coverage. Though Spot was originally trained primarily on Learned Hands data, more and more of its training data now comes from other users. As we grow this community, we grow Spot's ability to recognize diverse statements of issues. For this reason, we ask that users of Spot share their data when possible. You can learn more about the two main sources of training data, Learned Hands and User-Derived Data, below.

You can also find information on the sourcing of a build's training data by consulting the API's taxonomy method. See the training key in the response from https://spot.suffolklitlab.org/v0/taxonomy/. You can get info for a specific build by passing the build argument, e.g., https://spot.suffolklitlab.org/v0/taxonomy/?build=6.
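
For instance, here is a minimal Python sketch using the requests library. The variable names are ours for illustration; whether this endpoint requires your API token may depend on your setup, so add an Authorization header if needed.

import requests

BASE = "https://spot.suffolklitlab.org/v0/taxonomy/"

# Sourcing info for the current build.
current = requests.get(BASE).json()

# Sourcing info for a specific build (here, build 6).
build_6 = requests.get(BASE, params={"build": 6}).json()

# Summarize where the training labels came from.
for source in build_6["training"]["sources"]:
    print(source["source"], "-", source["all labels"], "labels")

A trimmed response for build 6 looks like this: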

{
  "labels": [...],
  "training": {
    "build": 6,
    "sources": [
      {"source": "r/legaladvice via Learned Hands",
       "all labels": 559366,
       "affirmative labels": 13279,
       "negative labels": 546087},
      {"source": "privately labeled datasets",
       "all labels": 502311,
       "affirmative labels": 3911,
       "negative labels": 498400},
      {"source": "simulated queries",
       "all labels": 1261,
       "affirmative labels": 1238,
       "negative labels": 23}
    ]
  }
}

Note: the privately labeled datasets source includes all User-Derived Data.

Learned Hands Data

Spot builds upon data from the Learned Hands online game, a partnership between the LIT Lab and Stanford's Legal Design Lab. Learned Hands aims to crowdsource the labeling of laypeople's legal questions for the training of machine learning (ML) classifiers/issue spotters. Labels are drawn from the Legal Issues Taxonomy—LIST (formerly National Subject Matter Index, Version 2). Currently, this labeling is limited to publicly available historic questions from the r/legaladvice forum on Reddit. See Stanford and Suffolk Create Game to Help Drive Access to Justice.

The most recent Learned Hands data can be found here: 2022-05-21_95p-confidence_binary.csv. This labeled data is licensed under a CC BY-NC-SA 4.0 International License. Players of Learned Hands are asked to say whether a label is or is not present in a text. These answers are used to calculate a Wilson score confidence interval. In the files linked above and below, an issue's column contains a 1 if the lower bound of this interval exceeds 50% and a 0 if the upper bound falls below 50%. If the interval for a text straddles 50%, no value is included for that label. That is, the values included are those where we're 95% confident that more than half of the folks playing Learned Hands would agree.
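
To make the arithmetic concrete, here is a minimal Python sketch of that thresholding. The function names are ours for illustration; this is not Spot's internal code.

import math

def wilson_interval(yes_votes, total_votes, z=1.96):
    # Wilson score interval; z = 1.96 corresponds to 95% confidence.
    if total_votes == 0:
        return (0.0, 1.0)
    p_hat = yes_votes / total_votes
    denom = 1 + z**2 / total_votes
    center = (p_hat + z**2 / (2 * total_votes)) / denom
    margin = (z * math.sqrt(p_hat * (1 - p_hat) / total_votes
                            + z**2 / (4 * total_votes**2))) / denom
    return (center - margin, center + margin)

def label_cell(yes_votes, total_votes):
    # 1 if we're 95% confident a majority would say the label applies,
    # 0 if we're 95% confident a majority would say it doesn't,
    # None (an empty cell) if the interval straddles 50%.
    lower, upper = wilson_interval(yes_votes, total_votes)
    if lower > 0.5:
        return 1
    if upper < 0.5:
        return 0
    return None

For example, 9 "yes" votes out of 10 yield an interval of roughly (0.60, 0.98); because the lower bound clears 50%, the cell would contain a 1.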

Historic Datasets

Releases are made intermittently in connection with new Spot builds and take place a few times a year. A list of prior releases can be found below along with links for the most recent datasets.

User-Derived Data

In addition to the data labeled through Learned Hands, users of the API (those building tools with it) have the option to let Spot forget or remember the content of text shared with it. If Spot is given permission to remember a text, we may use it to improve the issue spotter by having humans perform their own issue spotting and using their insights to retrain the model.

Occasionally, institutional users of the API, like legal aid organizations, may choose to share bundled historic data with Spot (e.g., previously collected data from a website chatbot or webform). By labeling this data before sharing, they can provide valuable training material to make Spot more responsive to their client base. If you're such a user and need help with labeling, we can assist you.

What are the costs and benefits of letting Spot remember user data?

If Spot is given permission to remember a text, or if a text is shared with our team as part of a bundle, it may be read by people on our team. We do not sell this data to third parties and only share it with a closed group working on quality control and labeling. If the text is labeled, it may be used to help improve Spot's performance by acting as an example for training our algorithms. In this way, sharing a text can help others with similar issues by making it easier for Spot to identify issues.

This sharing is really important for populations not represented in the Learned Hands data. Different communities talk about issues in different ways, and in order for Spot to recognize issues in a text, it needs to have seen them talked about in that way before.

Sharing data with Spot is the best way to improve its performance, especially via the actions endpoint.

How to talk about your data sharing decision

Developers are encouraged to consider their use case carefully when deciding how to incorporate end-user input regarding the remembering of texts. For most cases, it will be prudent to have the end user either opt in or opt out (option 2 below). Given the benefit that accrues to all users when data is shared, an opt-out approach, as opposed to an opt-in one, is encouraged for most use cases.

Below you will find several options for sharing information with Spot along with links to model explanations for your users. These behaviors are controlled by the value you pass through the save-text parameter. See documentation. The more sensitive the information involved in your use case, the further down the list you should go. Please read the descriptions below to determine what is best for your organization:

  1. Remember all query data by hard-coding the save-text parameter as 1. If your organization deploys this version of Spot, you are helping the legal community by allowing Spot to keep the text of queries so that it can learn to be faster and more accurate. This choice allows you to send data on subsequent user actions that indicate whether Spot's recommendations were correct. See the actions endpoint. It also allows our team to review and label queries even if no subsequent data is sent. For an example of such an implementation, visit the MA Legal Resource Finder.

    If you choose this option and are actively collecting data from users, as opposed to using Spot on historic data, you should make this dataflow clear to your users as part of that process. Feel free to link to, or adapt, this model page to provide them with an overview of the dataflow: Always remember.

  2. Set the save-text parameter based on end-user interaction (i.e., let your users decide). This use case is most appropriate if you are actively collecting data and have end users, as opposed to using Spot on historic data. It allows end users to decide individually whether to let Spot keep the text of their questions. When end users enter text, present them with the option to share it with Spot (see the sketch after this list). For an example of such an implementation, visit Court Forms Online.

    You can still send data on subsequent user actions to indicate whether Spot's recommendations were correct, but this will only be associated with the text of a user's query if they have opted to have us save that information.

    If you choose this option and are actively collecting data from users, as opposed to using Spot on historic data, you should make this dataflow clear to your users as part of that process. Feel free to link to, or adapt, this model page to provide them with an overview of the dataflow: User-Driven.

  3. Forget all query data by hard-coding the save-text parameter as 0. You may use Spot even if you do not contribute data to make it better. This use case envisions using Spot on sensitive data that you do not wish to persist on our servers.

    If you choose this option and are actively collecting data from users, as opposed to using Spot on historic data, you should make this dataflow clear to your users as part of that process. Feel free to link to, or adapt, this model page to provide them with an overview of the dataflow: Always forget.
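
Here is a minimal Python sketch of option 2, where the end user's choice sets save-text. The query endpoint path, payload shape, and bearer-token authentication shown here are illustrative assumptions; consult the main API documentation for the exact request format.

import requests

SPOT_URL = "https://spot.suffolklitlab.org/v0/entities-nested/"  # assumed endpoint; see the API docs
API_TOKEN = "YOUR-SPOT-TOKEN"  # placeholder

def spot_query(text, user_opted_in):
    # save-text: 1 asks Spot to remember the text; 0 asks it to forget.
    payload = {"text": text, "save-text": 1 if user_opted_in else 0}
    resp = requests.post(
        SPOT_URL,
        json=payload,
        headers={"Authorization": "Bearer " + API_TOKEN},
    )
    resp.raise_for_status()
    return resp.json()

Options 1 and 3 amount to the same call with save-text hard-coded to 1 or 0, respectively.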