How Child Welfare Workers Reduce Racial Disparities in Algorithmic Decisions

by Logan Stapleton, Assistant Professor of Computer Science at Vassar College

Predictive algorithms are used by about a dozen U.S. child welfare agencies and have been considered by more (Samant et al., 2021). Agencies turn to predictive analytics in hopes of making decisions more standardized, more accurate, and less biased (Chouldechova et al., 2018; Levy et al., 2021).

This overview summarizes key findings from a study evaluating one particular algorithm, the Allegheny Family Screening Tool (AFST) (Vaithianathan et al., 2017). In the study, we used government data to audit the AFST and observed workers at Allegheny County DHS as they used the algorithm (Cheng et al., 2022). We focus on the AFST because it has been in use since 2016 and serves as a model for many other algorithms.

[Infographic: screen-in rate disparities between Black and white children. The disparity is 17.9% under the AFST's recommendations alone, versus 7.0% with a worker using the AFST.]

How the AFST Works

The AFST uses data on individual families, including demographics (sans race), child welfare records, and other government data like criminal, juvenile justice, or public medical records.

Based on the data in a referral, the AFST predicts whether children named in the referral will be removed from the home within 2 years of the referral — designers use home removal as a proxy for risk of maltreatment. When call screening workers get a new referral, the AFST gives a score from 1 to 20, where higher scores are interpreted as higher risk. Workers read the referral, the family's records, and the algorithm's score, then decide whether to screen in the family for investigation.
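To make the workflow concrete, here is a minimal, hypothetical sketch in Python. The feature names, the model's predicted probability, and the score binning below are illustrative assumptions made for this overview; they are not the AFST's actual features, model, or binning, which are not public as code.

```python
# Hypothetical sketch of the scoring step described above. Feature names,
# the model output, and the 1-20 binning are assumptions, not the real AFST.

def to_screening_score(predicted_removal_probability: float) -> int:
    """Map a predicted probability of home removal within 2 years onto a
    1-20 score (equal-width bins here; the real tool's binning differs)."""
    score = int(predicted_removal_probability * 20) + 1
    return max(1, min(20, score))

# Administrative features for a referral (race is not among them).
referral_features = {
    "age_of_youngest_child": 3,          # assumed feature name
    "prior_child_welfare_referrals": 2,  # assumed feature name
    "public_benefits_record": True,      # assumed feature name
}
predicted_probability = 0.62  # stand-in for a model's prediction on these features
print(to_screening_score(predicted_probability))  # -> 13
```

The key point is the division of labor: the tool produces a 1-to-20 score, and the call screening worker, after reading the referral and the family's records, makes the screening decision.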

Data and Racial Disparities

One major reason child welfare agencies are interested in algorithms is to reduce racial bias, ostensibly because algorithms are seen as more objective and consistent than workers (Hurley, 2018). A previous study suggested that the disparity between Black and white children screened in by Allegheny County DHS fell from 9% before the AFST was adopted to 7% after (Goldhaber-Fiebert & Prince, 2019). DHS then claimed that the AFST caused workers to make less racially disparate decisions (Allegheny County DHS, 2019). Following these claims, more agencies have started using algorithms like the AFST (Samant et al., 2021). However, we found that the AFST gave more racially disparate recommendations than workers did.

For referrals from 2016 to 2018, we estimate that the AFST would have recommended screening in 68% of Black children and 50% of white children referred to DHS. Workers actually screened in 51% of Black children and 43% of white children. So, workers reduced the screen-in rate disparity from the algorithm's recommended roughly 18 percentage points down to about 7.
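For readers who want the arithmetic spelled out, each disparity is simply the difference between the two groups' screen-in rates. The snippet below uses the rounded rates quoted above; the study's unrounded rates are what yield the 17.9% and 7.0% figures in the infographic.

```python
# Screen-in rate disparity = rate for Black children - rate for white children.
# Rates below are the rounded figures quoted above; the unrounded rates give
# disparities of roughly 17.9 and 7.0 percentage points.

afst_black, afst_white = 0.68, 0.50      # AFST-recommended screen-in rates
worker_black, worker_white = 0.51, 0.43  # observed worker screen-in rates

afst_disparity = afst_black - afst_white        # 0.18
worker_disparity = worker_black - worker_white  # 0.08 here; ~0.07 unrounded

print(f"AFST disparity:   {afst_disparity:.1%}")    # 18.0%
print(f"Worker disparity: {worker_disparity:.1%}")  # 8.0%
```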

From worker observations and interviews, we hypothesize that workers reduced racial disparities by disagreeing with the algorithm to correct for racially biased patterns of overscoring. Although the AFST does not use race as a variable, most workers thought it was racially biased because it uses public systems data to measure risk, and Black families are more likely to be involved in public systems, like welfare, the criminal system, and child welfare itself.

One worker said, “if you’re poor and you’re on welfare, [the AFST is] gonna score [you] higher than a comparable family who has private insurance.”

By contrast, workers looked at administrative data about families in the context of the report and made decisions holistically. Workers looked at families’ records for relevant information, but didn’t treat public systems involvement as an automatic strike against the family like the AFST might. For example, when a referral alleged drug abuse, one worker would often look through recent criminal records. But, they said, “somebody who was in prison 10 years ago has nothing to do with what's going on today.”

Accuracy and Proxy Outcomes

Prior evaluations of the AFST have claimed that it is more accurate than workers — so much more accurate that Allegheny County DHS said, “not using [the AFST] might be unethical because of its accuracy” (Allegheny County DHS, 2018). Our audit found that 51% of the AFST’s recommendations were accurate, versus 46.5% for workers’ decisions.

Yet, these results may be misleading. Accuracy is measured based on the outcomes that the algorithm predicts — namely, home removal and re-referral within two years. If a referral is screened-out and the child is then removed from the home within two years of the referral, then the screening decision is deemed inaccurate.
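As a rough sketch of how such an accuracy metric works, the Python below counts a screening decision as correct only when it matches the proxy outcome (home removal or re-referral within two years). The function and toy data are assumptions for illustration, not the county's or the study's actual evaluation code.

```python
# Sketch of accuracy measured against a proxy outcome. Illustrative only.

def proxy_accuracy(screened_in, proxy_outcome):
    """screened_in: True if the referral was screened in (or the AFST recommended it).
    proxy_outcome: True if the child was removed or re-referred within 2 years."""
    correct = sum(d == o for d, o in zip(screened_in, proxy_outcome))
    return correct / len(screened_in)

# The second referral is screened out but followed by a removal, so under
# this metric that decision counts as inaccurate.
print(proxy_accuracy([True, False, True], [True, True, False]))  # 0.33...
```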

This puts workers at a disadvantage: they were not trying to predict these outcomes when they made screening decisions. Many workers also said these outcomes were not good proxies for risk.

For example, workers said home removals often occur because a teen chooses to live with relatives, not because their parents are maltreating them.

Furthermore, workers made decisions based on immediate safety concerns, whereas the AFST made predictions over a much longer two-year timespan. This is intentional: agency leadership designed the AFST to push workers to weigh these longer-term risk metrics (Vaithianathan et al., 2017). In practice, however, this led workers to question whether the algorithm was misaligned with the agency's responsibility to screen referrals based on immediate safety concerns and shorter-term risk.

[Image: an example of the Allegheny Family Screening Tool, showing how a number can be assigned to the amount of risk.]

Conclusions

Our work complicates current narratives advocating for predictive algorithms in child welfare (Cheng et al., 2022). Algorithms risk inheriting racial biases from the public systems data they use. Current measures of accuracy may be misleading, since they favor algorithms over workers. These findings pose fundamental problems for the design and use of algorithms in child welfare.

Logan Stapleton is a PhD candidate in Computer Science at University of Minnesota and an Assistant Professor of Computer Science at Vassar College.