Machine learning can ascertain a lot about you — including some of your most sensitive information. For instance, it can predict your sexual orientation, whether you’re pregnant, whether you’ll quit your job, and whether you’re likely to die soon. Researchers can predict race based on Facebook likes, and officials in China use facial recognition to identify and track the Uighurs, a minority ethnic group.
Now, do the machines actually “know” these things about you, or are they only making informed guesses? And, if they’re making an inference about you, just the same as any human you know might do, is there really anything wrong with them being so astute?
Let’s look at a few cases:
In the U.S., the story of Target predicting who's pregnant is probably the most famous example of an algorithm making sensitive inferences about people. In 2012, a New York Times story about how companies can leverage their data included an anecdote about a father learning that his teenage daughter was pregnant because Target had sent her coupons for baby items in an apparent act of premonition. The story about the teenager may be apocryphal; even if it did happen, coincidence rather than predictive analytics was most likely responsible for the coupons, according to the Target process detailed in the Times story. Still, the predictive project poses a real risk to privacy. After all, if a company's marketing department predicts who's pregnant, it has ascertained medically sensitive, unvolunteered data that normally only healthcare staff are trained to appropriately handle and safeguard.
Mismanaged access to this kind of information can have huge implications on someone’s life. As one concerned citizen posted online, imagine that a pregnant woman’s “job is shaky, and [her] state disability isn’t set up right yet…to have disclosure could risk the retail cost of a birth (approximately $20,000), disability payments during time off (approximately $10,000 to $50,000), and even her job.”
This isn’t a case of mishandling, leaking, or stealing data. Rather, it is the generation of new data — the indirect discovery of unvolunteered truths about people. Organizations can predict these powerful insights from existing innocuous data, as if creating them out of thin air.
So are we ironically facing a downside when predictive models perform too well? We know there’s a cost when models predict incorrectly, but is there also a cost when they predict correctly?
Even if the model isn't highly accurate overall, it may still be confident in its predictions for a certain group of individuals. Say that 2% of the female customers between the ages of 18 and 40 are pregnant. If the model flags customers who are three times more likely than average to be pregnant, only 6% of those flagged will actually be pregnant. That's a lift of three. But if you look at a much smaller, focused group, say the top 0.1% ranked most likely to be pregnant, you may see a much higher lift of, say, 46, which would make the women in that group 92% likely to be pregnant. In that case, the system would be capable of revealing those women as very likely to be pregnant.
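The arithmetic above is simple to make concrete. As a minimal sketch (the function name and numbers are illustrative, taken from the hypothetical example in the text), lift multiplies the base rate to give the fraction of flagged individuals who actually have the sensitive trait:

```python
def lift_to_precision(base_rate: float, lift: float) -> float:
    """Fraction of flagged individuals who actually have the trait,
    given the population base rate and the model's lift for that group."""
    return base_rate * lift

# Base rate from the example: 2% of female customers aged 18-40 are pregnant.
base_rate = 0.02

# A lift of 3 across a broad flagged group: about 6% are actually pregnant.
broad_group = lift_to_precision(base_rate, 3)

# A lift of 46 within the top 0.1% most likely: about 92% are pregnant.
focused_group = lift_to_precision(base_rate, 46)

# Even a tiny top slice of a large population names many real people:
population = 1_000_000
top_slice = int(population * 0.001)  # the top 0.1% is 1,000 individuals

print(f"broad: {broad_group:.0%}, focused: {focused_group:.0%}, "
      f"identified: {top_slice}")
```

The point of the sketch is that confidence concentrates: a model with modest accuracy overall can still be near-certain about its highest-scoring slice.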
The same concept applies when predicting sexual orientation, race, health status, location, and intentions to leave a job. Even if a model isn't highly accurate in general, it can still reveal with high confidence, for a limited group, things like sexual orientation, race, or ethnicity. This is because there is typically a small portion of the population for whom prediction is easier. The model may only predict confidently for a relatively small group, but even just the top 0.1% of a population of a million would mean 1,000 individuals have been confidently identified.
It's easy to think of reasons why people wouldn't want someone to know these things. As of 2013, Hewlett-Packard was predictively scoring its more than 300,000 workers on the probability that they'd quit. HP called this the Flight Risk score, and it was delivered to managers. If you're planning to leave, your boss would probably be the last person you'd want to find out before it's official.
As another example, facial recognition technologies can serve as a way to track location, diminishing the fundamental freedom to move about without disclosure, since, for example, publicly positioned security cameras can identify people at specific times and places. I certainly don't sweepingly condemn facial recognition, but know that the CEOs of both Microsoft and Google have come down on it for this reason.
In yet another example, a consulting firm was modeling employee loss for an HR department and noticed that it could actually model employee deaths, since that's one way you lose an employee. The HR folks responded with, "Don't show us!" They didn't want the liability of potentially knowing which employees were predicted to die.