View on GitHub

Predicting Drug Treatment Program Adherence

Analysis by Taylor Names

Download this project as a .zip file Download this project as a tar.gz file

PROJECT OVERVIEW

BACKGROUND AND MOTIVATION

How to treat drug abusers effectively and efficiently is a controversial and timely issue in the United States. In the last 15 or so years, an increasing amount of convicted drug users have been given the option (or have been required) to complete a drug treatment program instead of serving time in prison. Completing treatment instead of spending time in jail have many psychological, financial, health and social benefits.

Many questions surround the current state of drug treatment:

Having data driven insights to help answer these questions can have many practical benefits:

Besides the possible practical benefits of the analysis of the dataset, the size and scope of dataset present many interesting computational challenges.

THE DATA

The data includes demographic, geographic, education and employment, drug use history, and treatment program outcome information for approximately 9.8 million treatment episodes between the years 2006 and 2011. The data does not contain information relating to drug use relapse after the termination of the treatment episode.

The data is made public by the Substance Abuse & Mental Health Services Administration (SAMHSA) and the data is hosted by the Inter-university Consortium for Political and Social Research (ICPSR) at this website. The data consists of 65 features, most of which are entirely categorical. The specific data utilized for this analysis in .tsv format can be downloaded from this link.

PROJECT OBJECTIVES

Hypothetically, with this tool, a drug treatment program administrator could identify the different probabilities of completion of a variety of treatment programs for a given individual and, in turn, make an informed decision that could lead to more efficient use of resources.

METHODOLOGY

The data was provided in a very organised and well documented fashion but, as with most data, there were many missing values. The missing data were either dropped, made equal to the most frequent value or imputed using a random forest algorithm. Then, for the remaining features, a web scraping routine was developed to gather descriptions for the numerically encoded categories. Using these descriptions we could then explore the data by visualising the effect of single and dual variables on treatment adherence rate. Also, further visual exploration of the data was done using Tableau. From the insights gained from the exploratory analysis the features were selected and random forest, 'Elastic Net' multiple logistic regression and gradient boosted tree classifiers were created. Finally, the three models were combined using a unregularized logistic regression model with the final output being predictive probabilities of treatment program adherence.

EXPLORATORY DATA ANALYSIS

RELATIONSHIP BETWEEN SINGLE FEATURES AND ADHERENCE

To gain some insight about the relationship between demographic, geographic, personal characteristics, treatment type and treatment program adherence we plotted the relationship between the two. Some of the most critical features are plotted below.

age_tot

EDUC_TOT

REGION_TOT

treatment_type_tot

These relationships are also explored on a proportional basis. The plots below show the mean adherence rates for individual characteristics as compared to the mean adherence rate from all participants.

DIVISION_PROP

TREAT_TYPE_PROP

INCOME_PROP

Most categories have an adherence rate similar to the population mean, however, certain features stood out. Below are some best and worst groups in terms of adherence rate:

UNDERPERFORMING GROUPS

HIGH-ACHIEVING GROUPS

GEOGRAPHIC FACTORS

The relationships between adherence rates and demographic factors are explored in detail in the following Tableau interactive visualisation.

FEATURE INTERACTIONS

The analysis of single features is useful, but neglects to consider interactions between features. For example, it may be interesting to know what is the adherence rate of heroin users from Wyoming.

The interactions with the greatest difference in adherence rate from the mean are extremely rare occurrences. For example, one of the interaction terms with the lowest adherence rate (of 0%) belongs to an Asian or Pacific islander who is a user of 'other stimulants'. Since there is only one sample that applies, it is unlikely that the true adherence rate for Asian stimulant users is 0%. To form more realistic estimates for the adherence rate for interaction terms, Bayesian inference was performed to shrink (or regularize) small sample size interactions toward the general mean.

INTERACTION_REG

The most relevant interaction terms were included as features in the 'Elastic Net' logistic regression.

PREDICTION

Predicted probabilities of treatment adherence were formulated through the creation of three models: random forest classifier, 'Elastic Net' logistic regression with stochastic gradient descent optimization, and gradient boosted trees classifier.

FEATURE IMPORTANCES FROM MODELS

For each of the models, it is possible to explore 'feature importances' which identify the features that aided the model the most in making accurate predictions.

RF_FI

LR_FI

GBT_FI

These calculated feature importances reinforce some of the insights that we gained from exploratory data analysis because geographic factors, treatment program type, and age also contribute significantly to the predictive modeling.

CALIBRATION

To evaluate the performance of the models, we check to see if they are properly calibrated. An example of calibration is: do the participants who are predicted to have a 90% probability of adherence adhere 90% of the time? For a perfectly calibrated model, this will be the case for all probability values.

UNCALIBRATED GRADIENT BOOSTED TREE MODELGBT_PRE_CAL

As we can see from the above plot, the gradient boosted tree classifier underestimates the probability of adherence as the entire curve falls above the 'perfectly calibrated' dotted line.

Often times, improved calibration leads to improved predictive power. Each of the models were calibrated using 'isotonic' calibration in an attempt to improve their performance.

CALIBRATED GRADIENT BOOSTED TREE MODELGBT_POS_CAL

After performing 'isotonic' calibration, we see a significant improvement in the calibration as the curve is much closer to the 'perfectly calibrated' line.

ENSEMBLE OF THREE MODELS

Finally, the three models were combined in an 'ensemble' model using unregularized logistic regression.

CALIBRATION OF ENSEMBLE MODEL

ENS_CAL

Model performance was measured using 'Log Loss' which is a measure of error for estimated probabilities. A log loss value of zero signifies that the model has no error.

LOG_LOSS_SCORE

Another performance metric utilised in analysis was simple accuracy. Probability estimates less than 50% were classified as 'didn't adhere' and above as 'adhered'.

ACCURACY_SCORE

The final ensemble model is well calibrated and with a final log-loss and accuracy scores that are very similar to those from the GBT model. The ensemble model is a significant improvement upon the baseline 63% model as it yields a higher accuracy and gives a personalized prediction for each individual.

CONCLUSIONS AND FURTHER RESEARCH

REEVALUATING ASSUMPTIONS

A crucial assumption made that could have affected model performance was to group transferred participants together with participants who completed treatment. Although they are both 'positive' conclusions to treatment, they are quite different results. Model performance could be improved through further investigation of the causes for transferring participants and possibly implementing a ternary response variable (completed, transferred, not completed).

MODEL IMPROVEMENT

As expected, the tree based models performed very well on this dataset. Since the models were trained on such a large quantity of data and had very similar performance, it is likely that the main way to improve performance is not to implement more complex models, but to focus more energy on feature engineering.

MORE FEATURE ENGINEERING

Although there was a significant improvement upon the baseline models, the overall model still struggles with the outcome prediction for some individuals. There are many other factors that can contribute to treatment completion. Including data relating to family ties (contact with parents, having brothers and sisters) could help improve the model. As to protect the anonymity of the treatment providers, the dataset provided limited information about details from the treatment program. Data relating to treatment program affiliation (private, public, faith based) as well as program details such as amount of time in contact with health professionals and psychiatric care could also increase the accuracy of the model.