March 2019

“Dramatically more complicated”

Privacy concerns raised over 2020 Census

By Sherry Mazzocchi

FUN FACT: In 2006, Netflix ran a contest to improve its recommendation system. To protect customers’ privacy, it released a random sample of ratings and replaced names with numbers. But data scientists identified some of those individuals by comparing their ratings and timestamps with reviews on IMDb.com (a movie review site).

It’s hard to make personal data anonymous.

When the Census Bureau publishes data at the tract and block level, it can be easy to identify individuals if too many statistics are published. For example, if only one person lives on a block, that person’s identity is obvious. If there is only one person under three or over 100 on another block, that’s also identifying information. Publishing it could violate Title 13, which prohibits the Census Bureau from publishing data that identifies individuals.

But the Census Bureau is all about statistics.

The 2010 Census collected six main stats per person: address, sex, age, race, ethnicity and relationship to person one (the first person listed on the form). It counted a total population of 308,745,538 living in 116,716,292 households. The Bureau published more than 7.7 billion statistics in 2010, or about 25 per person.
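The per-person figure follows directly from the numbers above. A quick check in Python, using only the figures quoted in this article (the variable names are illustrative):

```python
# Figures quoted in the article for the 2010 Census.
population = 308_745_538
published_statistics = 7.7e9   # "more than 7.7 billion"
variables_per_person = 6       # address, sex, age, race, ethnicity, relationship

statistics_per_person = published_statistics / population
confidential_values = population * variables_per_person

print(f"{statistics_per_person:.1f} published statistics per person")
print(f"{confidential_values:,} confidential values collected")
```

The six values collected per person add up to roughly 1.85 billion confidential values, far fewer than the number of statistics published about them.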

“If you know any math at all, then you realize you can create a system with 25 billion simultaneous equations and roughly 1.8 billion unknowns and get a solution that matches our published statistics. We call that database reconstruction,” said Simson L. Garfinkel.
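Database reconstruction can be illustrated on a toy scale. The sketch below is not the Bureau's method: the three-person block, the tables chosen for publication, and the function names `publish` and `reconstruct` are all hypothetical. It enumerates combinations of ages and sexes and keeps only those that reproduce the published tables exactly.

```python
from itertools import combinations_with_replacement
from statistics import median

# Hypothetical confidential records for one tiny block: (age, sex).
TRUE_RECORDS = [(10, "F"), (30, "M"), (50, "F")]


def avg(values):
    """Mean of a sequence, or None when the sequence is empty."""
    return sum(values) / len(values) if values else None


def publish(records):
    """Tables a toy statistical agency might release for the block."""
    ages = [age for age, _ in records]
    female = [age for age, sex in records if sex == "F"]
    male = [age for age, sex in records if sex == "M"]
    minors = [age for age in ages if age < 18]
    return {
        "count": len(records),
        "median_age": median(ages),
        "females": len(female),
        "mean_age_female": avg(female),
        "mean_age_male": avg(male),
        "minors": len(minors),
        "mean_age_minors": avg(minors),
    }


def reconstruct(published, max_age=99):
    """Return every set of records that reproduces the published tables."""
    ages = range(max_age + 1)
    n_female = published["females"]
    n_male = published["count"] - n_female
    # Narrow each sex separately using its published mean age.
    female_options = [c for c in combinations_with_replacement(ages, n_female)
                      if avg(c) == published["mean_age_female"]]
    male_options = [c for c in combinations_with_replacement(ages, n_male)
                    if avg(c) == published["mean_age_male"]]
    solutions = []
    for f_ages in female_options:
        for m_ages in male_options:
            candidate = sorted([(a, "F") for a in f_ages] + [(a, "M") for a in m_ages])
            # Keep the candidate only if every published table matches.
            if publish(candidate) == published:
                solutions.append(candidate)
    return solutions


published = publish(TRUE_RECORDS)
solutions = reconstruct(published)
print(len(solutions), "candidate(s) match the published tables")
print(solutions[0])
```

In this contrived block only one candidate survives, so the published tables reveal every resident's age and sex. Real tables involve far more people and variables than this example.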

Garfinkel is the Census Bureau’s Senior Computer Scientist for Confidentiality and Data Access, and author of Database Nation: The Death of Privacy in the 21st Century. During a February 5th symposium on data privacy at Rice University in Texas, Garfinkel laid out 21st-century methods for protecting individual privacy.

In the past, the Census Bureau took the information of highly identifiable households and swapped that information with other households. The benefits of swapping are that it’s an easy method and doesn’t affect a state count if it is done within the state. The operation is invisible to the rest of the Census process, said Garfinkel. “In fact, less than a handful of people in the Census Bureau understood how the 2010 swapping actually worked.”

But swapping isn’t foolproof. If someone really wanted to reconstruct the database, swapped data wouldn’t necessarily stand in their way. Matching the reconstructed records against easily available data, such as voter registration records, could be one way to identify individuals at the block level.

In fact, the Census Bureau employed ten PhDs using a few Amazon Clusters (high speed data processing services) working for about three months to reconstruct the micro-data from the published statistics. About 50 percent of the reconstructed records matched the originals exactly. “More than 70 percent of it matched exactly if you allow age within one year,” said Garfinkel. “We collect age by two ways; by date of birth and how old you say you are.”

While it took immense computing power and superior technical expertise to reconstruct the data, once the process is known, it becomes easier. “In ten years, it’s a high school science fair project. And that’s the problem,” he said.

Garfinkel called the results frightening, but said swapping was the best available technique at the time.

This time, the Census Bureau is using a technique called differential privacy.

Differential privacy works by adding noise. Noise, essentially random values, is injected into the published statistics to obscure individuals’ confidential answers.

A formula determines how much noise to add for any desired privacy outcome. Hackers can still reconstruct a database, but they won’t know how much of the information is accurate.
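The article does not say which formula the Bureau uses. As an illustration only, the textbook Laplace mechanism adds noise whose scale is the query's sensitivity divided by a privacy parameter usually called epsilon. The function names and the example count below are assumptions made for this sketch.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Sensitivity is 1 for a simple count: adding or removing one person
    changes the true answer by at most 1.
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(2020)  # fixed seed so the sketch is repeatable
true_count = 12            # hypothetical number of residents on a block
for epsilon in (0.1, 1.0, 10.0):
    released = [round(noisy_count(true_count, epsilon, rng)) for _ in range(5)]
    print(f"epsilon={epsilon}: {released}")
```

Smaller values of epsilon produce noisier releases, so someone reconstructing the data cannot tell how far any one figure is from the truth.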

The Census Bureau created its own differential privacy algorithms, which generate statistics with varying levels of accuracy at the national, state, county, tract and block levels. “If there are a lot of people, the statistics are pretty accurate. If there are not a lot of people, the statistics become less accurate. That’s the secret,” Garfinkel said.
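The link between population size and accuracy can be made concrete with a little arithmetic. The sketch assumes, purely for illustration, that the same Laplace noise (sensitivity 1, epsilon 0.5) is added to a count at every geographic level, which is not necessarily how the Bureau's algorithms allocate noise. All populations except the 2010 national total are hypothetical.

```python
EPSILON = 0.5  # hypothetical privacy parameter
# Mean absolute Laplace noise for a count with sensitivity 1 is 1 / epsilon.
EXPECTED_ABS_ERROR = 1.0 / EPSILON

# Populations are made up, except the 2010 national count quoted in the article.
POPULATIONS = {
    "block": 30,
    "tract": 4_000,
    "county": 100_000,
    "state": 5_000_000,
    "nation": 308_745_538,
}

relative_errors = {
    level: EXPECTED_ABS_ERROR / count for level, count in POPULATIONS.items()
}

for level, error in relative_errors.items():
    print(f"{level:>7}: expected relative error {error:.2e}")
```

The expected absolute error is the same at every level under this assumption, so it is a sizable fraction of a 30-person block and negligible for the nation.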

Garfinkel said the data will be as accurate as it needs to be for legal purposes. “But it’s not going to be more accurate. So there is this public policy trade-off between accuracy and privacy loss, and the data will be accurate enough.”

Each data set will have its own epsilon, the parameter that sets how much accuracy is traded for privacy: a smaller epsilon means more noise and more privacy, while a larger one means more accurate data and more privacy loss. “You look at the marginal social benefit and you rate it against the marginal social cost,” Garfinkel said. The Census Bureau’s Chief Data Scientist John Abowd calculated the allocation of federal funds using data at the school district level. “The marginal social benefit is that the money gets allocated properly, and the marginal social cost is that people are subject to identity theft. Doing that, they actually found a correct value of epsilon for that data set, which is pretty cool.” Ultimately, the Census Bureau’s Data Stewardship Executive Policy Committee decides how private the data should be. The committee is headed by a presidential appointee.
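The marginal reasoning Garfinkel describes can be sketched with made-up curves. Nothing below comes from the Bureau's analysis: the benefit and cost functions, their constants, and the names are hypothetical, chosen only so that the point where marginal benefit equals marginal cost is easy to verify.

```python
import math

# Hypothetical curves, chosen only to make the trade-off concrete.
BENEFIT_SCALE = 8.0   # social benefit approaches this as accuracy improves
COST_PER_UNIT = 2.0   # social cost of each extra unit of privacy loss


def social_benefit(epsilon):
    """Benefit of accuracy: rises with epsilon, with diminishing returns."""
    return BENEFIT_SCALE * (1.0 - math.exp(-epsilon))


def social_cost(epsilon):
    """Cost of privacy loss: assumed to grow linearly with epsilon."""
    return COST_PER_UNIT * epsilon


# Search a grid of epsilon values for the largest net benefit.
grid = [i / 1000 for i in range(5001)]
best_epsilon = max(grid, key=lambda e: social_benefit(e) - social_cost(e))

# For these curves, marginal benefit equals marginal cost at ln(benefit / cost).
closed_form = math.log(BENEFIT_SCALE / COST_PER_UNIT)
print(f"grid search: {best_epsilon:.3f}, closed form: {closed_form:.3f}")
```

With different assumed curves the chosen epsilon would move, which is why the final choice is a policy decision as much as a technical one.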

The Census Bureau will also tell users how accurate its data is. “We didn’t ever do that in the past. We gave people a measurement error,” Garfinkel said. “But we didn’t ever tell people how much error was introduced by the swapping. And some of our data users have assumed there was no error introduced by the swapping. Like that Committee.”

One of the biggest challenges in creating privacy is the lack of computational infrastructure and the lack of trained data science PhDs. “It’s really, really hard to get this stuff right,” he said. Much of the information obtained from the American Community Survey’s 80 questions is used to weight the Census information.

He added that the decennial Census is the Census Bureau’s simplest data product. “We ask each person 10 questions. Everything else is dramatically more complicated.”