June 23, 2011
A Problem For Web Science

Jaime Teevan from Microsoft opened Web Science 2011 with an intriguing keynote about what we can learn from the ways people search. Her talk highlights a pressing problem with the field.

Ideally, we would like to be able to share anonymized logs, so everyone could study real users doing real work without invading people's privacy, much as medical and educational studies share information about patients and students without disclosing individual names. In reality, though, it turns out that the whole world is a small town, and it's sometimes quite possible to identify an individual just by looking at their searches, or knowing their friends, or knowing roughly where their friends live. (Jon Kleinberg gave a related keynote on the subject at Hypertext 2008.) For example, if "user 50271" is searching for

- muffler repair Henniker

- 1,3,5,7 tetramethyl cyclooctatetraene

- A. B. Clump

- Linda Clump

you might guess that 50271 is a chemist, perhaps named Clump, who has a car and lives, works, or is visiting in Henniker, NH. You can see how this could quickly become far from anonymous. We’ve always known this: you can identify lots of famous people from a few facts (great, hard-partying pinstripe CF; German physicists married to each other). You can do the same for family friends. It turns out that, with Google, you can do this for everybody.
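The narrowing-down described above can be sketched in a few lines of Python. Everything here is invented for illustration: the log rows, the `DIRECTORY` of public facts, the hand-written `CLUES` table, and the function names are all assumptions, not anything from a real search log or a real re-identification tool. The point is only that each query contributes a small fact, and the set of people who match every fact shrinks quickly.

```python
# Hypothetical "anonymized" log: (pseudonymous user id, query text).
LOG = [
    ("50271", "muffler repair Henniker"),
    ("50271", "1,3,5,7 tetramethyl cyclooctatetraene"),
    ("50271", "A. B. Clump"),
    ("50271", "Linda Clump"),
    ("61802", "weather Boston"),
]

# Hypothetical public knowledge: person -> facts anyone could look up.
DIRECTORY = {
    "A. B. Clump": {"clump", "henniker", "chemist"},
    "C. D. Clump": {"clump", "boston", "lawyer"},
    "E. F. Smith": {"smith", "henniker", "chemist"},
}

# Hand-written clue table (an assumption): query substring -> inferred fact.
CLUES = {
    "henniker": "henniker",
    "clump": "clump",
    "tetramethyl": "chemist",
}


def inferred_facts(log, user_id):
    """Collect every fact suggested by one pseudonymous user's queries."""
    facts = set()
    for uid, query in log:
        if uid != user_id:
            continue
        lowered = query.lower()
        for needle, fact in CLUES.items():
            if needle in lowered:
                facts.add(fact)
    return facts


def candidates(facts, directory):
    """Return the people whose public facts include all inferred facts."""
    return sorted(name for name, known in directory.items() if facts <= known)


if __name__ == "__main__":
    for uid in ("50271", "61802"):
        facts = inferred_facts(LOG, uid)
        print(uid, sorted(facts), candidates(facts, DIRECTORY))
```

In this toy data, user 50271's four queries leave a single candidate, while user 61802's one uninformative query rules nobody out. A real log would need far messier inference, but the shape of the problem is the same.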

This is a grave problem for Web Science because Web Science wants to study the Web as a natural phenomenon. Lots of companies – Google, Microsoft, Apple, Facebook, Twitter, and plenty more – do important research on the data they acquire. But, right now, they cannot share the data, because that would invade people's privacy. Without shared data, no one can reproduce their experiments. If this can’t be fixed, this entire subfield of Web Science will need to be jettisoned because it won’t be, and cannot be, a science.

One solution might be to accept conventional norms for privacy, allowing us to feel private even though our privacy could be violated. We know that people could intercept our mail, for example, but most of us don’t worry terribly about it because gentlemen don’t read other gentlemen’s mail. We may lock our front doors every day, but still have a large glass door on the back porch that someone could break in seconds; we feel fairly secure that no one will break into our house because the risks seem so greatly to outweigh the rewards.

It’s not a good answer, but we’re going to need some answer, or simply agree to discontinue this line of research as infeasible.