Random numbers to protect privacy in a pandemic app

A computer system, like a car or a coffee machine, is something designed to meet some requirements.  These requirements usually force the designer to compromise, giving some requirements more attention than others.  (Which is the best car?  It depends on whether speed, size, sustainability etc. matter most to you.)

One tool that can be used as part of a way out of the pandemic lockdown is an app that helps track who’s infected, and who might have infected other people.  There are several different ways of doing this, based on which requirements are important to you.  It might be a bit surprising, but one way of securely passing information about virus infection around the country while protecting people’s privacy involves random numbers.

[Image: coronavirus and two red dice, composed from two Wikipedia images]

Tracking app requirements

Part of the problem with tracking apps is that there isn’t agreement on the requirements, let alone on how important they are.  One requirement that I think is reasonably well agreed on is:

1. Allow members of the public to know when someone they’ve come into contact with recently has tested positive for the virus, so that they can do their part in stopping the spread of the disease (e.g. by quarantining themselves).

This would enable people to circulate more freely than during the lockdown, which is good for their mental and physical health, and would allow the economy to come back to life.  Beyond this there are two competing sets of requirements, for two different sets of users of this system – administrators wanting to collect data, and the data subjects themselves (members of the public with the app).

The data collectors would like as much information as possible.  They might want this for good reasons or less good reasons.  Good reasons might be to help academic research into the spread of this disease in particular and other diseases in general.  A less good reason would be to take the data collected during a pandemic and then use it for, e.g., later political campaigning.

So, one requirement that fits this is:

2. The system must collect as much information relating to the virus as possible.

This could be refined a bit, to something like:

2a. The app must frequently phone home to a central place with a unique id for the device, a date / time stamp, and the device’s location.

As I’ll show below, this isn’t necessary to achieve the first requirement, although it could help with one way of doing that.

The data subjects have a reasonable expectation of privacy, and so might balk at being tracked in this way.  Not only is privacy a requirement (for at least some users), a perceived lack of privacy would discourage people from installing the app, which would in turn mean less data is collected (at least about the users who don’t install it).  So, there’s another requirement, which is in tension with requirement 2:

3. The users’ privacy must be maintained.

I’ll now go into two possible implementations that favour information and privacy respectively.

Tracking app implementation – maximising information

The app phones home every so often, e.g. every 5 minutes, with a unique id, location information and a date/time stamp.  If the device has no connection back to base, this information is stored on the device to be sent in bulk later when a connection to base is possible.

When someone thinks they have contracted the virus, they get tested.  If the test is positive, the system is told that the user of this particular device tested positive on a particular date and time.  The central server looks through all the information it has received, and pulls out all the entries for other devices that have been close to the infected user’s device over the last few days.  It then sends out a notification to those devices that their users might be infected, allowing them to quarantine, get tested etc.

Note that there are some details I’m glossing over here because they’re not relevant to the main topic even though they’re important – what if the user can’t get tested?  Who can report if a user is infected?  If it’s anyone (and not just e.g. doctors) how do we guard against abuse via false reports?

The main point is that work is divided up as follows:

  1. Each device is a dumb reporter of location information, tagged with a device id and date/time stamp;
  2. The correlation of an infected user with other users they’ve come into contact with is done centrally, by matching device ids and location information in the recent past;
  3. The central server notifies specific users via their specific devices that they might be infected.
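The three steps above can be sketched in code. This is only a toy, in-memory illustration of the information-maximising design: the class and method names, the contact distance, the five-minute time window, and the use of flat planar coordinates instead of latitude/longitude are all my own assumptions, not part of any real system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import hypot

@dataclass
class Report:
    """What each 'dumb reporter' device phones home with."""
    device_id: str
    timestamp: datetime
    x: float  # simplified planar coordinates, standing in for lat/lon
    y: float

class CentralServer:
    def __init__(self, contact_distance: float = 5.0,
                 window: timedelta = timedelta(days=5)):
        self.reports: list[Report] = []
        self.contact_distance = contact_distance
        self.window = window

    def phone_home(self, report: Report) -> None:
        """Step 1: collect id + location + timestamp from every device."""
        self.reports.append(report)

    def contacts_of(self, infected_id: str, now: datetime) -> set[str]:
        """Step 2: centrally correlate the infected device's recent
        locations against everyone else's."""
        infected = [r for r in self.reports
                    if r.device_id == infected_id
                    and now - r.timestamp <= self.window]
        contacts: set[str] = set()
        for r in self.reports:
            if r.device_id == infected_id or now - r.timestamp > self.window:
                continue
            for i in infected:
                close_in_time = abs((r.timestamp - i.timestamp).total_seconds()) < 300
                close_in_space = hypot(r.x - i.x, r.y - i.y) <= self.contact_distance
                if close_in_time and close_in_space:
                    contacts.add(r.device_id)  # step 3 would notify these
        return contacts
```

Note how much the server learns here: every device's id, everywhere it has been, at all times – which is exactly what the privacy-maximising design below avoids.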

Tracking app implementation – maximising privacy

In this implementation, there is still a central server acting as a go-between for the devices, but the division of work is different from before.

The app no longer has a unique id that will represent it for all time.  Instead, it generates a series of effectively random numbers that are unique to it (or at least, there won’t be any duplicates from it or any other app over a long enough period e.g. a month).  It keeps a list of the random numbers it generated over the last e.g. 14 days.

It also broadcasts the random numbers over something like Bluetooth, and listens over Bluetooth for random numbers generated by the app installed on other devices nearby.  It keeps a list of other users’ random numbers it has heard over the last e.g. 14 days.

When someone tests positive for the virus, the app will upload to the central server all the random numbers that it generated over the last 14 days.  The central server has no idea which device generated those random numbers, or where the device was when it generated them.  (This is a big difference from before.)

The central server now acts as a TV station broadcasting winning (or, rather, losing) lottery ticket numbers.  That is, it sends out to all devices the random numbers uploaded to it by devices owned by infected users.  If these match any of the numbers that an app has heard from devices nearby and stored, this tells the user that they might be infected.

These two approaches are often labelled “centralised” vs. “decentralised”.  I can understand that it’s the simplest way of describing them, but it’s inaccurate enough to be unhelpful for me.  Something like “more centralised” and “less centralised” is more accurate – the second approach still has a central point, but it’s more decentralised (or less centralised) than the first.

The division of labour is now:

  1. Each device is doing quite a lot more than before – generating a series of carefully crafted numbers, storing them, broadcasting them, listening for numbers from other devices and storing those too, and checking the stored numbers against incoming messages from the central server.
  2. The central server is now merely plumbing between devices – it receives numbers and broadcasts them.
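This division of labour can also be sketched in code. Again this is a toy illustration of the idea, not a real protocol: the token size, the class names, and the use of plain random tokens (rather than the carefully designed cryptography a real app would use) are my own assumptions.

```python
import secrets

class Device:
    def __init__(self):
        self.my_tokens: list[str] = []       # tokens I generated (kept ~14 days)
        self.heard_tokens: set[str] = set()  # tokens heard over Bluetooth

    def new_token(self) -> str:
        """Generate a fresh, effectively unique random token and remember it."""
        token = secrets.token_hex(16)  # 128 random bits: clashes are negligible
        self.my_tokens.append(token)
        return token

    def hear(self, token: str) -> None:
        """Record a token broadcast by a nearby device."""
        self.heard_tokens.add(token)

    def check_exposure(self, infected_tokens: set[str]) -> bool:
        """Match the server's broadcast list against tokens heard locally."""
        return bool(self.heard_tokens & infected_tokens)

class CentralServer:
    """Merely plumbing between devices: it collects the infected users'
    tokens and broadcasts them to everyone, like the losing lottery
    numbers read out on TV."""
    def __init__(self):
        self.infected_tokens: set[str] = set()

    def report_infection(self, device: Device) -> None:
        # The server learns only the tokens – not who generated them,
        # where, or which other devices heard them.
        self.infected_tokens.update(device.my_tokens)

    def broadcast(self) -> set[str]:
        return set(self.infected_tokens)
```

The matching in `check_exposure` happens on the device, which is the crux of the privacy gain: the server never learns who has been near whom.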

If you want to see this in video form, there’s a good video by 3Blue1Brown about this privacy-protecting approach.

The magic random numbers

You might have a couple of reactions to this, to do with these magic random numbers that seem to make everything work:

  1. That’s a lot of unique numbers that must slosh around the country at once – won’t we run out?
  2. All those numbers must be generated by loads of different devices but somehow not clash with any other numbers – how will that work?

Both are problems that I think have already been solved for other applications.

The next time you look at a video on YouTube, look at the URL.  It will contain a unique id for that video, different from all other videos that have ever been uploaded to YouTube, and hopefully different from all future videos uploaded in the next few years.  The id is long, and uses upper- and lower-case letters and a couple of symbols as well as digits, which effectively means it’s in base 64 rather than just base 10 (26 letters A-Z + 26 letters a-z + 10 digits 0-9 + the symbols - and _ = 64 possibilities per character).  Tom Scott has a video describing quite how long it will take before the ids run out.  An approach like this should mean the virus tracking system won’t run out of numbers.
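The back-of-envelope arithmetic is worth doing once: the size of an id space grows exponentially with its length. The specific lengths below are just examples of that growth.

```python
def id_space(alphabet_size: int, length: int) -> int:
    """Number of distinct ids of a given length over a given alphabet."""
    return alphabet_size ** length

# An 11-character base-64 id (the shape YouTube uses) gives 64**11,
# i.e. about 7.4 * 10**19 possible ids.
youtube_like = id_space(64, 11)

# Even a 10-character base-36 id is vast: more than 400,000 distinct
# ids for each of roughly 8 billion people on Earth.
assert id_space(36, 10) > 8_000_000_000 * 400_000
```

So running out of numbers isn’t the hard part; as long as the ids are long enough, the space is effectively inexhaustible.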

That leaves the problem of independently generating numbers that don’t clash with any others.  The next time you visit any website, if the address starts with https (and not http) or there’s a padlock icon next to the address, then your interaction with the website is being protected by a particular kind of security.  The security depends on computers trusting other computers, or rather trusting the documents (security certificates) issued by those computers.

One way this trust is built is by generating a number, or hash, that is a mathematical summary of something.  The thing being summarised in this case is a security certificate, which is a series of numbers.  For the security to work, as well as being practically impossible to forge, the hash needs two properties:

  1. The hash needs to change if even only a small change is made to the thing being summarised;
  2. It should be practically impossible to find two different things being summarised that produce the same hash.

Instead of taking a security certificate as an input, the app could work on a combination of things like the current date and time, the user’s phone number or other unique id (that doesn’t need to leave the phone) etc.  There’s more explanation of hashes elsewhere on the internet, but generating unique numbers while other computers are independently also generating unique numbers is something that has already been solved in the world of hashes.  I assume that people who are much better at maths than I am could use similar techniques for the app.
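The idea of hashing a combination of a per-device secret and the current time can be sketched like this. To be clear, this is only an illustration of the concept described above – real exposure-notification protocols use carefully reviewed key-derivation schemes, and the function name, id length, and inputs here are my own assumptions.

```python
import hashlib
import secrets

def rotating_id(device_secret: bytes, interval_number: int) -> str:
    """Summarise (secret, time interval) into a short id via SHA-256.
    A small change to either input completely changes the output."""
    digest = hashlib.sha256(device_secret + interval_number.to_bytes(8, "big"))
    return digest.hexdigest()[:32]  # 128 bits is plenty to avoid clashes

# The secret never needs to leave the phone.
secret = secrets.token_bytes(32)

# Property 1: changing the input (here, moving to the next time
# interval) produces a completely different hash.
assert rotating_id(secret, 1) != rotating_id(secret, 2)

# Property 2, in practice: two devices with independent secrets will
# not produce clashing ids for the same interval.
other_secret = secrets.token_bytes(32)
assert rotating_id(secret, 1) != rotating_id(other_secret, 1)
```

Because each device mixes in its own secret, devices can generate ids independently, with no coordination, and still be effectively certain of no clashes – which is exactly the property the app needs.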

Consequences

This isn’t some dry exercise in system design, cryptography etc.  Who has what information about you and what they do with it is important.  On top of the normal privacy issue, there’s a specific issue for this which is the Coronavirus pandemic.  If people don’t trust an app they won’t download it and use it; if they don’t use it then there’s less warning about infection and so more risk that the virus will spread uncontrollably.

UPDATE 19th May 2020

I realise that there’s probably a detail in the privacy-protecting version that’s wrong, but the fundamental approach I think is OK.  Instead of the central server broadcasting numbers to pass on information about infection, each device will periodically ask the central server “Have you been told that any of these random numbers that I’ve heard represent infection?”

That is, instead of the central server pushing the information to the devices, the devices pull from the central server.  The same information is moving between the same two places, but who starts that flow of information is different.

The reason why this matters (slightly) is that it will probably reduce the amount of data that needs to move between the central server and the devices.  Given that most people will stay in only a relatively small part of the country, most devices aren’t interested in numbers relating to most parts of the country.

With the push approach, the central server can’t do any better than send out numbers from the whole country.  With the pull approach, each device asks for only the numbers it has, which will probably be a small fraction of the total numbers.  If the population became more mobile, and mixed with people from more of the country, the difference between these two approaches would shrink.  However, given the likely amount of mixing of the population, the pull approach is likely to need less data to move about.
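The pull variant described above can be sketched as follows. Again the names are illustrative assumptions; note also that querying with the numbers a device has heard reveals those numbers to the server, which is the detail the later update returns to.

```python
class CentralServer:
    def __init__(self):
        self.infected_tokens: set[str] = set()

    def report_infection(self, tokens: list[str]) -> None:
        """An infected user's app uploads its own recent tokens."""
        self.infected_tokens.update(tokens)

    def which_are_infected(self, heard: set[str]) -> set[str]:
        """Answer a device's query: which of these tokens mean infection?"""
        return heard & self.infected_tokens

class Device:
    def __init__(self):
        self.heard_tokens: set[str] = set()

    def poll(self, server: CentralServer) -> bool:
        """Pull: the device starts the flow, asking only about the
        tokens it has actually heard nearby."""
        return bool(server.which_are_infected(self.heard_tokens))
```

The same information moves between the same two places as in the push version; only the direction of the request changes, and with it the amount of data each device has to download.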

UPDATE 19th June 2020

The BBC has a good article on centralised vs. decentralised approaches. I realise I’m kind of going around in circles on this, but I want to be fair and accurate.

What it calls the centralised approach uses the random numbers mentioned above. The central server doesn’t know the locations of phones. It does, however, know which other phones’ random numbers your phone has received recently, because each phone uploads this information to the central server. So, it is able to work out who has possibly been infected by whom, and it will then send a message to possibly infected people telling them to get tested.

What it calls the decentralised approach is what I’ve called the privacy-maximising approach, with the broadcast or push variant. All the central server knows (because individual apps tell it this) is which random numbers belong to people who have reported themselves as infected.

Each phone pulls down all the infected random numbers, so that it doesn’t reveal to the central database which numbers it has been near. Each phone then has to do the matching with its own local list of numbers it has heard in the recent past.
