Technical analysis of client identification mechanisms
http://www.chromium.org/Home/chromium-security/client-identification-mechanisms
Technical analysis of client identification mechanisms
Written by Artur Janc <aaj@google.com> and Michal Zalewski <lcamtuf@google.com>
In common use, the term “web tracking” refers to the process of calculating or assigning unique and reasonably stable identifiers to each browser that visits a website. In most cases, this is done for the purpose of correlating future visits from the same person or machine with historical data. Some uses of such tracking techniques are well established and commonplace. For example, they are frequently employed to tell real users from malicious bots, to make it harder for attackers to gain access to compromised accounts, or to store user preferences on a website. In the same vein, the online advertising industry has used cookies as the primary client identification technology since the mid-1990s. Other practices may be less known, may not necessarily map to existing browser controls, and may be impossible or difficult to detect. Many of them - in particular, various methods of client fingerprinting - have garnered concerns from software vendors, standards bodies, and the media. To guide us in improving the range of existing browser controls and to highlight the potential pitfalls when designing new web APIs, we decided to prepare a technical overview of known tracking and fingerprinting vectors available in the browser. Note that we describe these vectors, but do not wish this document to be interpreted as a broad invitation to their use. Website owners should keep in mind that any single tracking technique may be conceivably seen as inappropriate, depending on user expectations and other complex factors beyond the scope of this doc. We divided the methods discussed on this page into several categories: explicitly assigned client-side identifiers, such as HTTP cookies; inherent client device characteristics that identify a particular machine; and measurable user behaviors and preferences that may reveal the identity of the person behind the keyboard (or touchscreen). After reviewing the known tracking and fingerprinting techniques, we also discuss potential directions for future work and summarize some of the challenges that browser and other software vendors would face trying to detect or prevent such behaviors on the Web. Explicitly assigned client-side identifiersThe canonical approach to identifying clients across HTTP requests is to store a unique, long-lived token on the client and to programmatically retrieve it on subsequent visits. Modern browsers offer a multitude of ways to achieve this goal, including but not limited to:
We believe that the availability of any one of these mechanisms is sufficient to reliably tag clients and identify them later on; in addition to this, many such identifiers can be deployed in a manner that conceals the uniqueness of the ID assigned to a particular client. On the flip side, browsers provide users with some degree of control over the behavior of at least some of these APIs, and with several exceptions discussed later on, the identifiers assigned in this fashion do not propagate to other browser profiles or to private browsing sessions. The remainder of this section provides a more in-depth overview of several notable examples of client tagging schemes that are within the reach of web apps.HTTP cookiesHTTP cookies are the most familiar and best-understood method for persisting data on the client. In essence, any web server may issue unique identifiers to first-time visitors as a part of a HTTP response, and have the browser play back the stored values on all future requests to a particular site. All major browsers have for years been equipped with UIs for managing cookies; a large number of third-party cookie management and blocking software is available, too. In practice, however, external research has implied that only a minority of users regularly review or purge browser cookies. The reasons for this are probably complex, but one of them may be that the removal of cookies tends to be disruptive: contemporary browsers do not provide any heuristics to distinguish between the session cookies that are needed to access the sites the user is logged in, and the rest. Some browsers offer user-configurable restrictions on the ability for websites to set “third-party” cookies (that is, cookies coming from a domain other than the one currently displayed in the address bar - a behavior most commonly employed to serve online ads or other embedded content). It should be noted that the existing implementations of this setting will assign the “first-party” label to any cookies set by documents intentionally navigated to by the user, as well as to ones issued by content loaded by the browser as a part of full-page interstitials, HTTP redirects, or click-triggered pop-ups. Compared to most other mechanisms discussed below, overt use of HTTP cookies is fairly transparent to the user. That said, the mechanism may be used to tag clients without the use of cookie values that obviously resemble unique IDs. For example, client identifiers could be encoded as a combination of several seemingly innocuous and reasonable cookie names, or could be stored in metadata such as paths, domains, or cookie expiration times. Because of this, we are not aware of any means for a browser to reliably flag HTTP cookies employed to identify a specific client in this manner. Just as interestingly, the abundance of cookies means that an actor could even conceivably rely on the values set by others, rather than on any newly-issued identifiers that could be tracked directly to the party in question. We have seen this employed for some rich content ads, which are usually hosted in a single origin shared by all advertisers - or, less safely, are executed directly in the context of the page that embeds the ad. Flash LSOsLocal Shared Objects are the canonical way to store client-side data within Adobe Flash. The mechanism is designed to be a direct counterpart to HTTP cookies, offering a convenient way to maintain session identifiers and other application state on a per-origin basis. In contrast to cookies, LSOs can be also used for structured storage of data other than short snippets of text, making such objects more difficult to inspect and analyze in a streamlined way. In the past, the behavior of LSOs within the Flash plugin had to be configured separately from any browser privacy settings, by visiting a lesser-known Flash Settings Manager UI hosted on macromedia.com (standalone installs of Flash 10.3 and above supplanted this with a Control Panel / System Preferences dialog available locally on the machine). Today, most browsers offer a degree of integration: for example, clearing cookies and other site data will generally also remove LSOs. On the flip side, more nuanced controls may not be synchronized: say, the specific setting for third-party cookies in the browser is not always reflected by the behavior of LSOs. From a purely technical standpoint, the use of Local Shared Objects in a manner similar to HTTP cookies is within the apparent design parameters for this API - but the reliance on LSOs to recreate deleted cookies or bypass browser cookie preferences has been subject to public scrutiny.Silverlight Isolated StorageMicrosoft Silverlight is a widely-deployed applet framework bearing many similarities to Adobe Flash. The Silverlight equivalent of Flash LSOs is known as Isolated Storage. The privacy settings in Silverlight are typically not coupled to the underlying browser. In our testing, values stored in Isolated Storage survive clearing cache and site data in Chrome, Internet Explorer and Firefox. Perhaps more surprisingly, Isolated Storage also appears to be shared between all non-incognito browser windows and browser profiles installed on the same machine; this may have consequences for users who rely on separate browser instances to maintain distinct online identities. As with LSOs, reliance on Isolated Storage to store session identifiers and similar state information does not present issues from a purely technical standpoint. That said, given that the mechanism is not currently managed via browser controls, its use of for client identification is not commonplace and thus may be viewed as less transparent than standard cookies.HTML5 client-side storage mechanismsHTML5 introduces a range of structured data storage mechanisms on the client; this includes localStorage, the File API, and IndexedDB. Although semantically different from each other, all of them are designed to allow persistent storage of arbitrary blobs of binary data tied to a particular web origin. In contrast to cookies and LSOs, there are no significant size restrictions on the data stored with these APIs. In modern browsers, HTML5 storage is usually purged alongside other site data, but the mapping to browser settings isn’t necessarily obvious. For example, Firefox will retain localStorage data unless the user selects “offline website data” or “site preferences” in the deletion dialog and specifies the time range as “everything” (this is not the default). Another idiosyncrasy is the behavior of Internet Explorer, where the data is retained for the lifetime of a tab for any sites that are open at the time the operation takes place. Beyond that, the mechanisms do not always appear to follow the restrictions on persistence that apply to HTTP cookies. For example, in our testing, in Firefox, localStorage can be written and read in cross-domain frames even if third-party cookies are disabled. Due to the similarity of the design goals of these APIs, the authors expect that the perception and the caveats of using HTML5 storage for storing session identifiers would be similar to the situation with Flash and Silverlight. Cached objectsFor performance reasons, all mainstream web browsers maintain a global cache of previously retrieved HTTP resources. Although this mechanism is not explicitly designed as a random-access storage mechanism, it can be easily leveraged as such. To accomplish this, a cooperating server may return, say, a JavaScript document with a unique identifier embedded in its body, and set Expires / max-age= headers to a date set in the distant future. Once this unique identifier is stored within a script subresource in the browser cache, the ID can be read back on any page on the Internet simply by loading the script from a known URL and monitoring the agreed-upon local variable or setting up a predefined callback function in JavaScript. The browser will periodically check for newer copies of the script by issuing a conditional request to the originating server with a suitable If-Modified-Since header; but if the server consistently responds to such check with HTTP code 304 (“Not modified”), the old copy will continue to be reused indefinitely. There is no concept of blocking “third-party” cache objects in any browser known to the authors of this document, and no simple way to prevent cache objects from being stored without dramatically degrading performance of everyday browsing. Automated detection of such behaviors is extremely difficult owing to the sheer volume and complexity of cached JavaScript documents encountered on the modern Web. All browsers expose the option to manually clear the document cache. That said, because clearing the cache requires specific action on the part of the user, it is unlikely to be done regularly, if at all. Leveraging the browser cache to store session identifiers is very distinct from using HTTP cookies; the authors are unsure if and how the cookie settings - the convenient abstraction layer used for most of the other mechanisms discussed to date - could map to the semantics of browser caches. Cache metadata: ETag and Last-ModifiedTo make implicit browser-level document caching work properly, servers must have a way to notify browsers that a newer version of a particular document is available for retrieval. The HTTP/1.1 standard specifies two methods of document versioning: one based on the date of the most recent modification, and another based on an abstract, opaque identifier known as ETag. In the ETag scheme, the server initially returns an opaque “version tag” string in a response header alongside with the actual document. On subsequent conditional requests to the same URL, the client echoes back the value associated with the copy it already has, through an If-None-Match header; if the version specified in this header is still current, the server will respond with HTTP code 304 (“Not Modified”) and the client is free to reuse the cached document. Otherwise, a new document with a new ETag will follow. Interestingly, the behavior of the ETag header closely mimics that of HTTP cookies: the server can store an arbitrary, persistent value on the client, only to read it back later on. This observation, and its potential applications for browser tracking date back at least to 2000. The other versioning scheme, Last-Modified, suffers from the same issue: servers can store at least 32 bits of data within a well-formed date string, which will then be echoed back by the client through a request header known as If-Modified-Since. (In practice, most browsers don't even require the string to be a well-formed date to begin with.) Similarly to tagging users through cache objects, both of these “metadata” mechanisms are unaffected by the deletion of cookies and related site data; the tags can be destroyed only by purging the browser cache. As with Flash LSOs, use of ETag to allegedly skirt browser cookie settings has been subject to scrutiny.HTML5 AppCacheApplication Caches allow website authors to specify that portions of their websites should be stored on the disk and made available even if the user is offline. The mechanism is controlled by cache manifests that outline the rules for storing and retrieving cache items within the app. Similarly to implicit browser caching, AppCaches make it possible to store unique, user-dependent data - be it inside the cache manifest itself, or inside the resources it requests. The resources are retained indefinitely and not subject to the browser’s usual cache eviction policies. AppCache appears to occupy a netherworld between HTML5 storage mechanisms and the implicit browser cache. In some browsers, it is purged along with cookies and stored website data; in others, it is discarded only if the user opts to delete the browsing history and all cached documents. Note: AppCache is likely to be succeeded with Service Workers; the privacy properties of both mechanisms are likely to be comparable. Flash resource cacheFlash maintains its own internal store of resource files, which can be probed using a variety of techniques. In particular, the internal repository includes an asset cache, relied upon to store Runtime Shared Libraries signed by Adobe to improve applet load times. There is also Adobe Flash Access, a mechanism to store automatically acquired licenses for DRM-protected content. As of this writing, these document caches do not appear to be coupled to any browser privacy settings and can only be deleted by making several independent configuration changes in the Flash Settings Manager UI on macromedia.com. We believe there is no global option to delete all cached resources or prevent them from being stored in the future. Browsers other than Chrome appear to share Flash asset data across all installations and in private browsing modes, which may have consequences for users who rely on separate browser instances to maintain distinct online identities. SDCH dictionariesSDCH is a Google-developed compression algorithm that relies on the use of server-supplied, cacheable dictionaries to achieve compression rates considerably higher than what’s possible with methods such as gzip or deflate for several common classes of documents. The site-specific dictionary caching behavior at the core of SDCH inevitably offers an opportunity for storing unique identifiers on the client: both the dictionary IDs (echoed back by the client using the Avail-Dictionary header), and the contents of the dictionaries themselves, can be used for this purpose, in a manner very similar to the regular browser cache. In Chrome, the data does not persist across browser restarts; it was, however, shared between profiles and incognito modes and was not deleted with other site data when such an operation is requested by the user. Google addressed this in bug 327783. Other script-accessible storage mechanismsSeveral other more limited techniques make it possible for JavaScript or other active content running in the browser to maintain and query client state, sometimes in a fashion that can survive attempts to delete all browsing and site data. For example, it is possible to use window.name or sessionStorage to store persistent identifiers for a given window: if a user deletes all client state but does not close a tab that at some point in the past displayed a site determined to track the browser, re-navigation to any participating domain will allow the window-bound token to be retrieved and the new session to be associated with the previously collected data. More obviously, the same is true for active JavaScript: any currently open JavaScript context is allowed to retain state even if the user attempts to delete local site data; this can be done not only by the top-level sites open in the currently-viewed tabs, but also by “hidden” contexts such as HTML frames, web workers, and pop-unders. This can happen by accident: for example, a running ad loaded in an <iframe> may remain completely oblivious to the fact that the user attempted to clear all browsing history, and keep using a session ID stored in a local variable in JavaScript. (In fact, in addition to JavaScript, Internet Explorer will also retain session cookies for the currently-displayed origins.) Another interesting and often-overlooked persistence mechanism is the caching of RFC 2617 HTTP authentication credentials: once explicitly passed in an URL, the cached values may be sent on subsequent requests even after all the site data is deleted in the browser UI. In addition to the cross-browser approaches discussed earlier in this document, there are also several proprietary APIs that can be leveraged to store unique identifiers on the client system. An interesting example of this are the proprietary persistence behaviors in some versions of Internet Explorer, including the userData API. Last but not least, a variety of other, less common plugins and plugin-mediated interfaces likely expose analogous methods for storing data on the client, but have not been studied in detail as a part of this write-up; an example of this may be the PersistenceService API in Java, or the DRM license management mechanisms within Silverlight. Lower-level protocol identifiersOn top of the fingerprinting mechanisms associated with HTTP caching and with the purpose-built APIs available to JavaScript programs and plugin-executed code, modern browsers provide several network-level features that offer an opportunity to store or retrieve unique identifiers:
Machine-specific characteristicsWith the notable exception of Origin-Bound Certificates, the techniques described in section 1 of the document rely on a third-party website explicitly placing a new unique identifier on the client system. Another, less obvious approach to web tracking relies on querying or indirectly measuring the inherent characteristics of the client system. Individually, each such signal will reveal just several bits of information - but when combined together, it seems probable that they may uniquely identify almost any computer on the Internet. In addition to being harder to detect or stop, such techniques could be used to cross-correlate user activity across various browser profiles or private browsing sessions. Furthermore, because the techniques are conceptually very distant from HTTP cookies, the authors find it difficult to decide how, if at all, the existing cookie-centric privacy controls in the browser should be used to govern such practices. EFF Panopticlick is one of the most prominent experiments demonstrating the principle of combining low-value signals into a high-accuracy fingerprint; there is also some evidence of sophisticated passive fingerprints being used by commercial tracking services. Browser-level fingerprintsThe most straightforward approach to fingerprinting is to construct identifiers by actively and explicitly combining a range of individually non-identifying signals available within the browser environment:
According to the EFF, their Panopticlick experiment - which combines only a relatively small subset of the actively-probed signals discussed above - is able to uniquely identify 95% of desktop users based on system-level metrics alone. Current commercial fingerprinters are reported to be considerably more sophisticated and their developers might be able to claim significantly higher success rates. Of course, the value of some of the signals discussed here will be diminished on mobile devices, where both the hardware and the software configuration tends to be more homogenous; for example, measuring window dimensions or the list of installed plugins offers very little data on most Android devices. Nevertheless, we feel that the remaining signals - such as clock skew and drift and the network-level and user-specific signals described later on - are together likely more than sufficient to uniquely identify virtually all users. When discussing potential mitigations, it is worth noting that restrictions such as disallowing the enumeration of navigator.plugins[] generally do not prevent fingerprinting; the set of all notable plugins and fonts ever created and distributed to users is relatively small and a malicious script can conceivably test for every possible value in very little time.Network configuration fingerprintsAn interesting set of additional device characteristics is associated with the architecture of the local network and the configuration of lower-level network protocols; such signals are disclosed independently of the design of the web browser itself. These traits covered here are generally shared between all browsers on a given client and cannot be easily altered by common privacy-enhancing tools or practices; they include:
User-dependent behaviors and preferencesIn addition to trying to uniquely identify the device used to browse the web, some parties may opt to examine characteristics that aren’t necessarily tied to the machine, but that are closely associated with specific users, their local preferences, and the online behaviors they exhibit. Similarly to the methods described in section 2, such patterns would persist across different browser sessions, profiles, and across the boundaries of private browsing modes. The following data is typically open to examination:
On top of this, user fingerprinting can be accomplished by interacting with third-party services through the user’s browser, using the ambient credentials (HTTP cookies) maintained by the browser:
Fingerprinting prevention and detection challengesIn a world with no possibility of fingerprinting, web browsers would be indistinguishable from each other, with the exception of a small number of robustly compartmentalized and easily managed identifiers used to maintain login state and implement other essential features in response to user’s intent. In practice, the Web is very different: browser tracking and fingerprinting are attainable in a large number of ways. A number of the unintentional tracking vectors are a product of implementation mistakes or oversights that could be conceivably corrected today; many others are virtually impossible to fully rectify without completely changing the way that browsers, web applications, and computer networks are designed and operated. In fact, some of these design decisions might have played an unlikely role in the success of the Web. In lieu of eliminating the possibility of web tracking, some have raised hope of detecting use of fingerprinting in the online ecosystem and bringing it to public attention via technical means through browser- or server-side instrumentation. Nevertheless, even this simple concept runs into a number of obstacles:
Potential directions for future workThere may be no simple, universal, technical solutions to the problem of tracking on the Web by parties who are intent on doing so with no regard for user controls. That said, the authors of this page see some theoretical room for improvement when it comes to building simpler and more intuitive privacy controls to provide a better framework for the bulk of interactions with responsible sites and parties on the Internet:
|