I foresee a future where the Adblock crowd runs a plugin that randomizes the data returned by browsers. It wouldn't take many variations in user agent strings, reported browser plugins, and system fonts to give you a quasi-anonymous footprint each time you visit a website.
A couple of years ago, I set up privoxy to rotate my user agent string periodically, usually once an hour (at one point I even changed it on every request). As fun as this was, even when running adblock and noscript and flashblock, it was obviously not very effective.
This kind of fingerprinting does not bode well for Tor users. What is required is an intermediate proxy that modifies the outbound browser traffic to delete or randomly modify these highly-unique sets of info (font sets, plugin sets, etc.) presented by the browser.
I'm still amazed at how much info flash, java, and javascript can glean about a system. I'm curious... is it possible to get the username or file/directory names readable by that user from the browser?
Obligatory "Life Of Brian": We are all individuals!
You're correct. Whether this ends up being more distinctive or demarcative depends on how widely deployed it is. A few hundred and it's not very helpful, but if you have tens of millions of browsers reporting only "LOLCAT BROWSER" (or for old sk00l credit, "AOL") it may give data miners some headaches.
It's kind of like Tor (or if I remember right, anonymizer.com) in those respects - until you have scale, it probably gets associated with shenanigans; once you hit the tipping point, it operates as desired/intended.
But hell, I don't mind marking myself as a type of person who doesn't want intrusive ads. (Why yes, I do run adblock!)
Based on this article you can "fingerprint" a computer using the following combination of attributes:
- Precise timestamp
- Monitor size
- User agent
- Browser plugins
- and fonts (!)
Really?? I use several browsers, and switch out a few external displays for my laptop. Plugins come and go with the browser version. My system time is synced to a timeserver (as I imagine are many other computers) but sometimes not. Based on these data, it's hard to believe anyone could truly trace my hardware over time.
The EFF tested this themselves. They run a website, http://panopticlick.eff.org, that compares your browser with those that have visited in the past. They have over a million fingerprints at this point, and my Chrome install is "unique."
It displays all data it lifted from your machine, along with how rare each datum appears to be. On my Chrome 7.0.517.44 install, I appear to be identifiable by my user agent (1 in 182,518), browser plugins (1 in 1,277,632) and system fonts (1 in 638,816).
Even the default IE8 on my stock Windows 7 Enterprise appears to be unique. In that case it's the browser plugins that identify it (1 in 1277688). My installed plugins are nothing out of the ordinary: Java, Flash and Windows Media Player.
Sample size seems to be a significant issue here. I'm one of two people on the site that had my user agent (1 in 638851), and while it's rare, sure, there are obviously a lot more than two people in the world running chromium x64 opensuse. Looking at the data it seems like most of the responses are bog standard for my particular software install, so it's basically saying I'm unique because:
* user-agent
* time zone
* screen size
Surely my TZ=EST and screen=1366x768x24 can't be too helpful in a large sample size.
And once chromium updates yet again, I think I'll be lost to the EFF test. It'll still see me as unique, but I'll be a different "unique" than the last time.
It does seem like browsers could easily cut back on user-agent details to the benefit of their customers' privacy and security. Is it really necessary to tell every website I visit that I'm x64 instead of i386 just in case I'm not smart enough to know which download now button to click? It's probably most useful to malware domains for determining which version of the latest flash 0-day to push to me. And are we sure we need the exact build number of every browser? Most revisions of chrome aren't changing anything in the rendering behavior.
Panopticlick claims that my user agent (Firefox 4.0b7 on Windows XP) is about 1/15,000. That seems unlikely.
It claims that my http_accept (text/html, STAR/STAR ISO-8859-1,utf-8;q=0.7,STAR;q=0.7 identity en-gb,en;q=0.5 -- except that I've replaced asterisks with "STAR" to avoid HN formatting confusion) is about 1/19,000. That seems even more unlikely.
I suspect these have the same cause: they have this big database of browser information, gathered over time, and recent browsers are underrepresented. So anyone running a Firefox 4 beta, or a recent Chrome build, will show up as being very unusual in the database, but that's misleading because some of the older entries in the database will represent browsers that are no longer in the pool, whose users are now using something more recent.
On the other hand, I don't find it so hard to believe that my browser plugins give 20 bits or so of information. (Though some of the same bias will affect this figure.)
In case it isn't obvious, by the way, their figures are simply the fraction of records in their database that have the same user agent / browser plugins / system fonts as yours. The browser plugins figure is 1/1,277,632 (or, for me a little later, 1/1,277,946) because that's how many times the Panopticlick site has been used.
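To make the "bits" arithmetic concrete, here is a minimal sketch that converts the "1 in N" figures quoted in this thread into bits of information. The function name is made up for illustration, and this is just the standard log2 conversion, not necessarily how Panopticlick computes its own numbers.

```python
import math


def bits_of_information(one_in_n):
    """Bits revealed by an attribute value shared by 1 in `one_in_n` records."""
    return math.log2(one_in_n)


# "1 in N" figures quoted earlier in the thread for one Chrome install.
quoted = {
    "user agent": 182_518,
    "browser plugins": 1_277_632,
    "system fonts": 638_816,
}

for name, n in quoted.items():
    print(f"{name}: 1 in {n:,} is about {bits_of_information(n):.1f} bits")
```

A value seen only once in the whole database can never score higher than log2 of the database size, which is why the plugin figure tops out at roughly 20 bits.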
Hmm, I wonder how they defend against having the same user run their test twice in a row. (Answer, having tried it: they don't. Probably a good thing since the obvious non-privacy-compromising way could skew their figures a bit.)
Panopticlick shows if you can be uniquely identified, but it doesn't show if you have been uniquely identified. I'd find it useful for the site to show me a record of my past visits ("You last visited the site on...") as a proof of concept, but the EFF may be intentionally avoiding any privacy issues. This is important, as uniqueness isn't in itself a bad thing, if your unique identifiers change on every page load. For example, it would be enough for a user agent switcher plugin to append a random number for every page load to make you look like a first-time visitor every time.
It doesn't matter whether it actually works anywhere near as well as the hyperbolic sales pitch as long as advertisers buy it. It's nothing more than cookies and metadata collected by javascript, probably hashed together to produce a unique ID.
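For what "hashed together" could look like, here is a rough sketch. It is a guess at the general shape, not any vendor's actual method; the attribute names and values are hypothetical.

```python
import hashlib


def fingerprint_id(attributes):
    """Hash a dict of collected attributes into one hex ID (illustrative only)."""
    # Sort keys so the same attributes always serialize the same way.
    canonical = "\n".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute values, chosen only for the example.
visit = {
    "user_agent": "ExampleBrowser/1.0",
    "timezone": "EST",
    "screen": "1366x768x24",
    "plugins": "Java;Flash",
}

# Same visitor, but with a random number appended to the user agent.
randomized = dict(visit, user_agent="ExampleBrowser/1.0 4821")

print(fingerprint_id(visit))
print(fingerprint_id(randomized))
```

The two IDs differ, which is the point made upthread about a user agent switcher appending a random number: a plain hash treats any changed attribute as a brand new visitor.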
They don't really need to trace your hardware over time. They're selling the info against the device, not against the person. Someone might want to target ads to your activity profile at work, someone else might want to target ads to your activity profile on your laptop at night. It might actually be better for the advertisers to have separate profiles for those machines than one for you the person.
You may switch browsers and displays and plugins all the time, but someone like you is very rare. More people have never strayed from their Windows XP and Internet Explorer for the past 9 years than not.
I believe most mobile devices already have unique device IDs built in. Perhaps they're not accessible through JavaScript, but it sounds like this technology already goes beyond the web browser anyways.
Full disclosure: I wrote one of these systems (AMA)
1. You'd probably be surprised what we can figure out from this. You'd also be surprised that there isn't much you can do to stop it from happening b/c we've been able to find ways to get information that is actually outside the browser.
2. You'd probably be surprised to see how embedded this is already. I'm guessing most people have had their browser fingerprinted at least once...for some reason. Knowing who used the services I wrote, I can tell you that they are everywhere...and given that I know our competitors also have this technology and who they work with...well...
3. There are legitimate uses to this tech besides spam. It's just that the money is in spam. I'm near 100% certain that most people will use this for spam in the next year or two.
> You'd also be surprised that there isn't much you can
> do to stop it from happening b/c we've been able to
> find ways to get information that is actually outside the
> browser.
I find it hard to believe that you're going outside of the browser if Java/Flash/JavaScript are uninstalled or disabled. Are you claiming to be using 0-day browser exploits to get information from outside of the browser?
The question we should all be asking is: why on Earth do browsers transmit all of this information to web servers in the first place?
A few details, window size for example, might have a legitimate purpose and could at least be requested by the web server and provided optionally. However, this sort of fingerprinting technique is pretty obvious to anyone who's ever stuck an analyser on their system and looked at what a typical HTTP transaction to fetch a web page looks like. As far as I can see, there is no need for most of it.
In my experience, there's enough basic information in the HTTP transaction to identify a unique visitor for most forensic purposes. However, the level of detail available for fingerprinting goes far beyond this. I just performed a little experiment, disabling as much as I could in Firefox to affect my uniqueness at http://panopticlick.eff.org/. I was surprised when disabling plugins (font info comes from java & flash) and even cookies had virtually no effect. It wasn't until I disabled JavaScript that I dramatically lowered it from 7 figures to 5. I'm going to try surfing this way and adding exceptions to see if I can sustain the experience for a while.
The thing about Panopticlick is that the two most "accurately" identifying methods aren't stable fingerprints. As soon as you install an update to one of your plugins (the constant Adobe updates come to mind), your fingerprint is altered. Similar with the font list; programs seem to be installing new fonts on my system all the time, which makes that fingerprint unstable as well.
Yes. If you wanted to get serious, you should deal with unstable characteristics. Perhaps with Bayesian filtering or something? And maybe adding some outside information, like when the new flash plugin is released; and tracking if a user is likely to upgrade or not.
If you have a working solution for unstable characteristics, you can also add more characteristics than they do at the moment.
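One simple way to tolerate unstable characteristics is to match on similarity instead of exact equality. The sketch below uses Jaccard similarity over sets of fonts or plugins, which is a much cruder stand-in for the Bayesian approach suggested above. The function names, the 0.8 threshold and the sample data are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match(observed, known, threshold=0.8):
    """Return the key of the most similar stored set, or None if below threshold."""
    best_key, best_score = None, 0.0
    for key, stored in known.items():
        score = jaccard(observed, stored)
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= threshold else None


# Hypothetical stored font list for one earlier visitor.
known = {"visitor-1": {"Arial", "Verdana", "Times", "Courier", "Georgia"}}

# Same machine after some program installed one extra font.
later = {"Arial", "Verdana", "Times", "Courier", "Georgia", "NewFont"}

print(best_match(later, known))
```

With one font added, the similarity is 5/6, so the visitor is still matched even though an exact hash of the font list would have changed.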
The more unusual your configuration, the more likely it is to be unique in the context of things you can't change. Most people don't run with all these things disabled...
Really? I can see a good use case for the ones that are described at Panopticlick.
Things like http_accept, plugin details and fonts are necessary if you want to use the capabilities of the client - you need to know what they have.
The time zone, screen size and (to a certain extent) user-agent let you customise the content you serve in an appropriate way.
Providing it "optionally" doesn't seem like a solution. If you were prompted every time then most people would turn it on anyway, and any ad network site would obviously ask for everything that it could possibly get.
> Things like http_accept, plugin details and fonts are necessary if you want to use the capabilities of the client - you need to know what they have.
I disagree rather strongly with sending things like user agent strings, plugins and fonts by default. There should be no need for a web site to know anything about either. Indeed, customising content to that degree based on specifics of the client's software identity rather than providing adaptive content seems somewhat anti-WWW to me.
Moreover, given the number of plugins with security issues widely available for some browsers, sending a list of installed plugins and versions with every HTTP request is practically begging to get malware back if you're not 100% patched up (and even if you are, if you're unlucky with a 0-day exploit).
Let me put it this way: most web sites that I've encountered actively modifying their content based on specific user agents have been places like banks, and usually the reason I know that they were doing this is because they got it wrong and consequently refused to serve perfectly good content to a browser perfectly capable of rendering it as intended because of concerns about some security flaws in browsers from another era. I'm not sure I have ever come across a web site that actively modifies the content it provides based on the browser plugins or fonts on the client system.
> Providing it "optionally" doesn't seem like a solution. If you were prompted every time then most people would turn it on anyway, and any ad network site would obviously ask for everything that it could possibly get.
On the contrary. I think people would very quickly tire of visiting sites whose ad networks caused such problems, and the ad networks would have to give up doing it or quality sites would drop them. Indeed, the only common uses for this kind of information today seem to be supporting user tracking or deployment of targeted malware from compromised sites. Providing the detailed information only to sites that explicitly request it and possibly only if the user's security settings permit would make life more difficult for both groups, and I have no problem with this outcome.
Because not transmitting that information doesn't get you much. You only need 30 bits of identifying information to track a billion users. With basic stuff like IP address, time zone, screen size, user agent, and supported plugins, you're pretty much there.
The web server asking the browser "do you support X?" increases latency and doesn't help anonymity. Bad web servers could just ask, "Do you support X, Y, Z, and what's your screen size..." If each of these questions prompts the user, it's a usability nightmare. If the browser's answers are configurable, why not make current browser information configurable instead of inventing a new standard?
> You only need 30 bits of identifying information to track a billion users.
You need 30 unique bits of information.
Given that there are only a handful of common screen resolutions, for example, they are unlikely to represent more than a few bits in most cases, and if you provided current size of the viewable area in the browser instead (a more useful measurement anyway) then in many cases this would change over time. If you further provide this information only on explicit request, and potentially lock it down to only the host site rather than third party content, then it is of almost no use to anyone with unwelcome intentions.
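The arithmetic behind the 30-bit figure is easy to check. This is a minimal sketch; the function names are made up for illustration.

```python
import math


def bits_needed(population):
    """Smallest whole number of bits that can give everyone a distinct ID."""
    return math.ceil(math.log2(population))


def distinguishable(bits):
    """How many distinct values a given number of bits can represent."""
    return 2 ** bits


print(bits_needed(1_000_000_000))
print(distinguishable(30))
```

The caveat raised in the reply still applies: per-attribute bits only add up like this when the attributes are independent. A handful of common screen resolutions, or attributes that move together (a given OS tends to imply a given font list), contribute far fewer usable bits than their individual figures suggest.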
I worked on this a long time ago and knew of others working on the same idea. I didn't pursue it for financial gain because it seemed too sleazy.
If you're curious what kinds of information your browser gives to a remote server, creating a "test.php" file with the contents "<?php phpinfo(); ?>", and opening that page in a browser is a good start. There are some other ingenious (read: I didn't think them up) methods, but I don't want to help the spammers n' spies.
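If you don't have PHP handy, a rough Python equivalent for the HTTP-header part is sketched below. It only echoes the request headers, so it shows less than phpinfo() and nothing that JavaScript, Flash or Java could collect. The class and function names are made up for the example, and it binds to localhost on a free port.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeaders(BaseHTTPRequestHandler):
    """Reply to any GET with the request headers the client sent."""

    def do_GET(self):
        lines = [f"{name}: {value}\n" for name, value in self.headers.items()]
        data = "".join(lines).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def fetch_echo(user_agent):
    """Start the echo server on a free local port, request it once, return the body."""
    server = HTTPServer(("127.0.0.1", 0), EchoHeaders)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/", headers={"User-Agent": user_agent})
        body = conn.getresponse().read().decode("utf-8")
        conn.close()
        return body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_echo("LOLCAT BROWSER"))
```

To see what a real browser sends, run the server with serve_forever() on a fixed port instead and open it in the browser.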
We have dozens of desktops running identically imaged copies of Windows/browser/plugins and all behind a single NATed IP. I imagine that they'd all look the same to one of these systems.
If they start serving up adverts then perhaps I'll start seeing ads targeted at something one of my co-workers has been searching for. That could get interesting!
I remember some while back (8 or 10 years? I can't remember exactly when) - Intel was trying to give each processor a GUID so that certain transactions could be traced back to the chip. Didn't go over well at all and Intel eventually pulled back.
I wonder if the same uproar will happen this time around?
Nmap only gives you information on the version of OS and of network applications.
e.g.
nmap -sV 127.0.0.1
Starting Nmap 5.00 ( http://nmap.org ) at 2010-12-01 18:28 GMT
Interesting ports on localhost (127.0.0.1):
Not shown: 997 closed ports
PORT      STATE SERVICE   VERSION
631/tcp   open  ipp       CUPS 1.4
2000/tcp  open  callbook?
24800/tcp open  kvm       Synergy KVM