Note: Rudy Berry completed a Research Experiences for Undergraduates (REU) program at the University of Minnesota with Professor Stevie Chancellor in the summer of 2022. This blog post summarizes his project outcomes. Way to go, Rudy!
Summary: Identifying when relapse has occurred is a key factor to consider when determining how to reach out to individuals with Opioid Use Disorder. Information like the time elapsed since a previous relapse influences the type of resources and language that should be presented. In this project, I wrote a script that identifies the date of relapse from a relapse disclosure post in an opioid addiction recovery community on Reddit. With this information, we were able to determine how much time passed between an individual’s self-disclosed relapse and the time they reported it to the community. The ability to extract this kind of information from recovery posts may be a valuable tool for the future development of context-sensitive outreach systems.
Overview: Opioid Use Disorder (OUD), colloquially known as opioid addiction, is a highly stigmatized health issue that has fueled the growing opioid crisis in the United States for over two decades. The CDC reports that opioids were involved in about 75% of all U.S. drug overdose deaths in 2020, and they have been linked to over 500,000 deaths since 1999 (CDC, 2021). In response to this crisis and the difficulty of finding support, there has been growing engagement in online recovery forums for substance abuse. These communities give members an anonymous space to seek advice, share success stories, and vent frustrations. Members of online addiction recovery communities frequently share feelings of shame and guilt (Mudry et al., 2012), so the ability to detach oneself from a real-world identity is a major draw of these forums. The popular discussion website Reddit is home to a large online recovery community, r/opiatesrecovery.
In this project, our research goal was to identify the date that someone had relapsed in their OUD recovery journey. Identifying when relapse has occurred is key to aiding in the recovery process because advice is dependent on when someone has relapsed. If an individual has relapsed very recently, it is important to direct them to resources that can provide more urgent forms of harm reduction in the moment. If a relapse occurred in the distant past, it may be more appropriate to provide them with resources focused on long-term sobriety tips or maintenance care. The existence of online recovery communities presents a unique opportunity in HCI for researchers to develop technology that could provide additional support and resources to individuals with OUD beyond what community members already provide.
Therefore, the primary goal of this project was to write a script that could identify the date of relapse from the context of a relapse disclosure post. The project focused on two specific datasets: a set of posts and a set of comments, both gathered from r/opiatesrecovery on Reddit. The ability to extract this kind of contextual information from recovery posts would allow outreach systems to provide more context-sensitive resources and messaging to individuals in OUD recovery based on an estimated date of relapse. We also wanted to determine the average window size, meaning the time between the relapse date and the post date, across all relapse posts and comments on the subreddit.
What We Did: The first step we took was identifying posts where an individual had disclosed the occurrence of a relapse. Working with a team member from another ongoing project in the lab, which identifies people who disclose that they have relapsed, we created a regular expression that matches phrases indicating relapse, like “I relapsed” or “I just relapsed”. This allowed us to create reduced datasets of relapse posts and comments from a larger general dataset from across the subreddit.
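To make the filtering step concrete, here is a minimal sketch in Python. The pattern below is an assumption that covers only the two phrases quoted above; the expression we actually used is not reproduced here, and the names `RELAPSE_RE`, `is_relapse_disclosure`, and `filter_relapse_posts` are hypothetical.

```python
import re

# Hypothetical pattern covering only the two quoted phrases,
# "I relapsed" and "I just relapsed". The real expression is broader.
RELAPSE_RE = re.compile(r"\bI\s+(?:just\s+)?relapsed\b", re.IGNORECASE)


def is_relapse_disclosure(text: str) -> bool:
    """Return True if the text contains a relapse disclosure phrase."""
    return RELAPSE_RE.search(text) is not None


def filter_relapse_posts(posts):
    """Reduce a general list of post texts to those disclosing a relapse."""
    return [post for post in posts if is_relapse_disclosure(post)]


sample_posts = [
    "I just relapsed last night and I feel awful.",
    "Thirty days clean today, thank you all.",
    "After six months I relapsed.",
]
relapse_posts = filter_relapse_posts(sample_posts)
```

A first-person pattern like this deliberately skips posts that only discuss relapse in general terms, which keeps the reduced dataset focused on self-disclosure.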
Once we collected the relapse posts, the next step was to identify temporal expressions near the relapse phrase, such as “yesterday” or “a week ago”. To do so, we employed the SUTime library, a tool from the Stanford CoreNLP pipeline. SUTime is a powerful temporal tagging library that identifies temporal expressions by tokenizing text. It provides tags for four categories of temporal expressions: “Time”, “Duration”, “Set”, and “Interval”. When SUTime identifies a temporal expression, it returns the expression text, its type, the date it resolves to relative to a passed-in reference date (or the system date if none is given), and the start and end positions of the expression in the string of text.
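The fields described above can be pictured as a small record per expression. The sketch below does not call SUTime; it uses a hypothetical `TemporalTag` class and hand-written example values (assuming a reference date of 2022-07-15) purely to show the shape of the data and how tags can be filtered by type.

```python
from dataclasses import dataclass


@dataclass
class TemporalTag:
    """Hypothetical record mirroring the fields a temporal tagger returns."""
    text: str   # the matched expression, e.g. "yesterday"
    type: str   # "Time", "Duration", "Set", or "Interval"
    value: str  # resolved value relative to the reference date
    start: int  # start offset of the expression in the post
    end: int    # end offset (exclusive)


def make_tag(post: str, text: str, type_: str, value: str) -> TemporalTag:
    """Build a tag by locating the expression text inside the post."""
    start = post.index(text)
    return TemporalTag(text, type_, value, start, start + len(text))


def tags_of_type(tags, wanted="Time"):
    """Keep only the tags whose type matches the wanted category."""
    return [tag for tag in tags if tag.type == wanted]


post = "I relapsed yesterday after staying clean for 5 days."
# Hand-written illustrative values, not real tagger output.
tags = [
    make_tag(post, "yesterday", "Time", "2022-07-14"),
    make_tag(post, "5 days", "Duration", "P5D"),
]
time_tags = tags_of_type(tags)
```

Keeping the start and end offsets on each tag is what makes it possible to ask later how far an expression sits from the relapse phrase.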
For this project, we were particularly interested in expressions of the type “Time”, since these allowed for the extraction of the most specific dates. However, we realized that a handful of posts in our dataset were matching the type “Duration”. This included posts with phrases like “I relapsed for a week” or “I relapsed for 5 days”. These phrases were typically found in longer posts with many details and much more context to consider. We took this into account in our validation process and included durations to establish the limitations of our system. We wanted to know whether a human reader could identify a relapse date from the context surrounding a duration. To analyze this, we took a sample of twenty posts where a relapse date was identified and a sample of twenty posts where none was identified, and we drew these samples twice: once with durations included and once without. We then hand-annotated the text to identify false positives and false negatives.
The second part of our validation process involved experimenting with the size of the character window around the relapse phrase, to find the size that most effectively captures relevant time words. We picked three different window sizes and measured accuracy on the entire post dataset for each. We wanted to know, for each character count, in how many posts our script accurately identified the day, week, or month of relapse.
The first part of the validation process revealed that the time tagging system was much more accurate when excluding “Duration” temporal types. In the negative sample (twenty posts with no relapse date identified) with durations included, there was only one post where a human reader would be able to establish a relapse date; the system correctly found that no relapse date was discernible in the other nineteen. When durations were excluded, the system was correct for all twenty posts in the negative sample. In the positive sample (posts with relapse dates identified), the inclusion of durations had a more dramatic effect on the results. With durations included, there were eleven posts where the system correctly determined that a relapse date could be established from the text. However, there were nine posts where the system incorrectly treated the beginning of a duration as a possible relapse date. So, for almost half the sample, the script would identify a relapse date where a human reader would not be able to. This can be attributed to the fact that durations were typical of posts with more complexity to consider. For instance, in an example like “I got out of rehab then relapsed for five months”, the system would incorrectly identify the relapse date as five months prior to the post date. In this case, a human reader would have to analyze the entire post to make a more accurate approximation of the relapse date. The results for the positive sample without durations were better, with only five posts incorrectly labeled as posts where a relapse date could be determined. Based on this outcome, we decided to work only with “Time” temporal types and exclude durations.
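The counts above can be turned into agreement rates with a few lines of Python. This is only a restatement of the numbers reported in the text, assuming every sample contained twenty posts (so the positive sample without durations had fifteen correct); the names `samples` and `agreement_rate` are hypothetical.

```python
# (correct, total) per hand-annotated sample, taken from the counts above.
# Assumes each sample held twenty posts.
samples = {
    ("negative", "with durations"): (19, 20),
    ("negative", "without durations"): (20, 20),
    ("positive", "with durations"): (11, 20),
    ("positive", "without durations"): (15, 20),
}


def agreement_rate(correct: int, total: int) -> float:
    """Share of posts where the script agreed with the human annotation."""
    return correct / total


rates = {key: agreement_rate(*counts) for key, counts in samples.items()}

for (sample, setting), rate in rates.items():
    print(f"{sample:8s} {setting:18s} {rate:.0%}")
```

Seen side by side, excluding durations raises agreement on the positive sample from 55% to 75% and on the negative sample from 95% to 100%.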
During the second part of our validation process, we selected character counts of 100, 150, and 200 around our regular expression match. The best performance was at 100 characters, with an accuracy of 73.4% for the entire dataset of posts. This was verified by reading each post and identifying the correct relapse date. The issue with wider character windows was that they included many temporal expressions, and our script is written to return the first expression it finds. In text like “I started my recovery journey a year ago and today I relapsed”, the relapse date would be incorrectly identified as a year ago. Alternatively, in a phrase like “Starting all over again today after I started relapsing again last month”, the relapse date would be incorrectly identified as the post date, or “today”. A window size of 100 fails for both cases, and instances like these are more frequent with windows wider than 100 characters. Further testing is necessary to determine the best way for the script to choose between multiple time expressions.
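The "first expression in the window" behavior, and the way it goes wrong, can be reproduced with a short sketch. This does not use SUTime: `TIME_RE` is a tiny stand-in vocabulary, `RELAPSE_RE` is a hypothetical pattern for the two phrases quoted earlier, and the window is assumed to extend the given number of characters on each side of the match.

```python
import re

# Hypothetical relapse pattern (only "I relapsed" / "I just relapsed").
RELAPSE_RE = re.compile(r"\bI\s+(?:just\s+)?relapsed\b", re.IGNORECASE)

# Stand-in for a temporal tagger: a very small hand-picked vocabulary.
TIME_RE = re.compile(
    r"\b(?:today|yesterday|last (?:week|month|year)"
    r"|an? (?:day|week|month|year) ago)\b",
    re.IGNORECASE,
)


def first_time_expression(text: str, window: int = 100):
    """Return the first time expression within `window` characters on
    either side of the relapse phrase, or None if nothing is found."""
    match = RELAPSE_RE.search(text)
    if match is None:
        return None
    lo = max(0, match.start() - window)
    hi = min(len(text), match.end() + window)
    found = TIME_RE.search(text[lo:hi])
    return found.group(0) if found else None


post = "I started my recovery journey a year ago and today I relapsed"
wide = first_time_expression(post, window=100)
narrow = first_time_expression(post, window=10)
```

With a window of 100 the sketch picks "a year ago", the error described above, while an artificially narrow window of 10 reaches only "today". A narrower window is not a general fix, though, since it will miss relapse dates mentioned further from the phrase.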
The histogram data we collected reveals spikes in relapse disclosure within the first ten days of relapse as well as at the one-month, two-month, one-year, and two-year marks. The post dataset had a mean window size of 64.6 days, with a median of 7.0 days. The comment dataset had a mean window size of 177.8 days, with a median of 30.0 days.
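The large gap between the mean and the median comes from a few very long windows pulling the mean upward. The sketch below computes window sizes from made-up (relapse date, post date) pairs to illustrate the effect; the dates are hypothetical and are not drawn from our datasets.

```python
from datetime import date
from statistics import mean, median


def window_days(relapse_date: date, post_date: date) -> int:
    """Days between the estimated relapse date and the post date."""
    return (post_date - relapse_date).days


post_date = date(2022, 7, 1)
# Hypothetical relapse dates: four recent relapses and one a year back.
relapse_dates = [
    date(2022, 6, 30),
    date(2022, 6, 28),
    date(2022, 6, 24),
    date(2022, 6, 1),
    date(2021, 7, 1),
]

windows = [window_days(d, post_date) for d in relapse_dates]
mean_window = mean(windows)
median_window = median(windows)
print(windows, mean_window, median_window)
```

One year-old relapse among five is enough to push the mean far above the median here, which is why the median is the better summary of how soon a typical person posts after relapsing.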
Overall, the script we created can extract information about relapse incidence dates and could be easily replicated and improved for an outreach system. This system could use the window size in conjunction with other information such as sentiment and prior relapse disclosures to send an individual a message with context-sensitive resources and word choice.
One finding I found particularly interesting was how many people reached out to online communities to disclose a relapse so soon after it had occurred. This highlights a need for these systems to focus on how to support individuals during the immediate aftermath of a relapse. In the future, further modifications could be made to address the contextual limitations of durations and multiple time expressions. Through this project I learned a lot about the benefits of anonymity in online spaces. It was interesting to see people being open about their setbacks and experiences in real time. This work has made me more curious about the role that anonymous online communities play in de-stigmatizing OUD as well as mental health risks like anxiety and depression, and the types of systems that can safely facilitate them.