How HTTP cache headers betray your privacy

I recently implemented ETag caching support, only to learn it’s a privacy breach. Internet scum have co-opted yet another technology to track us as we browse. In this article I’ll look briefly at this problem and how it also applies to Last-Modified headers. I then offer a fix for time-based caching that prevents tracking.

Though I offer a fix here, I still believe a proper sandbox browser is the better, more universal, solution to privacy.

Using Etag to track a user

Tracking is primarily a problem with third-party websites. These are typically those showing irrelevant ads (like at the bottom of this article), or convenient CDNs serving common scripts. To track us, all they need is some kind of reasonably unique identifier sent with each request.

Etag was intended for cache validation. A large image, or other resource, on a page will be served with an Etag. When the browser comes back to this page later, or sees the same image URL, it sends the tag in an If-None-Match request header to the server. If the resource has changed then a new copy will be sent back. If the resource hasn’t changed then a 304 Not Modified is returned. It’s a simple way to avoid reloading resources already in the cache.

The problem is that the Etag is chosen by the destination server. There are no rules for how it is produced. This allows that server to create a unique string for every single visitor. Every time the browser checks the validity of this resource it will send back this tag. If the same URL is used on multiple websites, that same tag will be sent from each of them. It’s become a unique tracking token!
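To make the mechanism concrete, here is a toy simulation of it in Python. It is only a sketch: `TrackingServer`, `Browser`, and the example URLs are names I made up for illustration, and no real HTTP is involved. The server mints a tag per visitor rather than per resource, and the browser cache dutifully echoes it back from every site that embeds the same URL.

```python
import uuid


class TrackingServer:
    """Toy third-party server that abuses ETag as a visitor identifier."""

    def __init__(self):
        # Maps each minted tag to the list of sites that visitor was seen on.
        self.visits = {}

    def handle(self, referer, if_none_match=None):
        # A returning browser echoes the tag back, which identifies the visitor.
        if if_none_match in self.visits:
            self.visits[if_none_match].append(referer)
            return 304, {"ETag": if_none_match}
        # First contact: mint a tag unique to this visitor, not to the resource.
        etag = '"' + uuid.uuid4().hex + '"'
        self.visits[etag] = [referer]
        return 200, {"ETag": etag}


class Browser:
    """Toy cache that keeps one validator per URL."""

    def __init__(self):
        self.etags = {}

    def fetch(self, server, url, referer):
        # Send the stored tag (if any) as If-None-Match, then remember the reply.
        status, headers = server.handle(referer, self.etags.get(url))
        self.etags[url] = headers["ETag"]
        return status
```

After one browser loads the same URL from two different sites, `server.visits` holds both sites under a single tag, which is exactly the cross-site profile the tracker wants.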

I’ve heard that some browsers do disable caching on third party websites, but can’t find reliable information on this.

Can ETag be saved?

Can the Etag validation mechanism somehow be saved to remain useful without becoming a tracking token?

At first it seems like the problem is the server generating the token string. What if we generate a hash of the binary on the client side instead? This way the server has no control over the resulting identifier. Two people who get the same resource, the same image, would end up with the same identifier.

It doesn’t work. The disreputable third-party website will simply start serving unique resources to each user. Image formats, both JPEG and PNG, allow all sorts of headers, even custom ones, to be added to the image. So it’s very easy to serve the same visual image with a distinct binary. The hash of the binary will be effectively unique for every user.

What if we hash only the visible content of images? In this manner the server can’t add random crap to the header and get a unique id. Unfortunately it’s still rather easy to store information in an image without the user detecting any change. Just look up “steganography”.

One more possibility. What if the client hash is trimmed to a very small size? Instead of a full 128-bit cryptographic hash we reduce it to 16 bits or even 12 bits. This shrinks the identification space by a large degree. The hashes are still tracking tokens, but it’s difficult to make them unique. Of course, it’s still broken if the third-party website just uses multiple images to construct a compound identifier.
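A quick sketch shows both the idea and why it fails. I’m assuming SHA-256 as the hash here (the choice doesn’t matter for the argument), and `truncated_tag` and `compound_bits` are hypothetical helper names, not anything a browser actually implements.

```python
import hashlib


def truncated_tag(body: bytes, bits: int = 16) -> int:
    """Client-side validator: keep only the top `bits` bits of the content hash."""
    digest = hashlib.sha256(body).digest()
    value = int.from_bytes(digest, "big")
    # Shift away everything except the leading `bits` bits.
    return value >> (len(digest) * 8 - bits)


def compound_bits(bits_per_image: int, images: int) -> int:
    """Identifier bits a tracker recovers by combining several images."""
    return bits_per_image * images
```

A single 16-bit tag distinguishes at most 65,536 visitors, but three such images on one page already give 48 bits, far more than enough to number every person on the planet.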

The conclusion here is basically that Etag should be completely disabled, at least for third-party websites (since the primary domain can always track you anyway).

Last-Modified and If-Modified-Since also broken

The Etag problem got me thinking about the other caching headers. It looks like If-Modified-Since has the same problem, albeit a bit more subtle.

If-Modified-Since is used to check whether the server’s copy of a resource is newer than the one in the cache. If it is newer then a new version will be returned. If not then 304 Not Modified will be returned. The browser’s clock and server clock may not be synchronized, so RFC 2616 recommends that the browser send back the exact Last-Modified value it received from the server rather than a time from its own clock. Allowing the server to decide these things seems like trouble.

When the server gives us a Last-Modified timestamp and we send it back in the If-Modified-Since header, we’ve basically created a new unique token. The server will just create unique timestamps for every user loading the image. If done this way these timestamps are essentially just a random tag, not a legitimate timestamp.
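To see how little effort this takes, here is a sketch of a server hiding a visitor number in the seconds of a Last-Modified value. The base date and the function names are my own assumptions for illustration; a real tracker could use any encoding it likes.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Arbitrary reference point chosen for this example.
BASE = datetime(2000, 1, 1, tzinfo=timezone.utc)


def id_to_last_modified(visitor_id: int) -> str:
    """Encode a visitor number as a plausible-looking HTTP date."""
    return format_datetime(BASE + timedelta(seconds=visitor_id), usegmt=True)


def last_modified_to_id(header: str) -> int:
    """Recover the visitor number from the echoed If-Modified-Since value."""
    return int((parsedate_to_datetime(header) - BASE).total_seconds())
```

With one-second precision, a single decade of dates gives over 300 million distinct values, and nothing about the header looks out of place.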

Quantization fix

The first idea to fix this is to simply ignore the RFC and disregard the Last-Modified header. It creates a potential time synchronization problem, but that seems less of a problem than creating a unique tracking token.

Another option is to quantize all timestamps. The current datetime format allows for second-based precision, which simply isn’t necessary for cache validation. Browsers tend to delay checks for the length of the session, or day, anyway. If we limit the timestamp to increments of 10 minutes we’ve significantly reduced the identifier space.

This can actually solve the time synchronization issue as well. We could always round to 10 minutes and then go back one increment. This ensures that if the resource is modified in the current timeslot our recorded timestamp will be less than that. When we make a new request with If-Modified-Since we will get a new resource.
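Here is a minimal sketch of that rounding, assuming the browser records its own clock time when it stores the resource. `SLOT` and `quantized_validator` are names I picked for the example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

SLOT = timedelta(minutes=10)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def quantized_validator(now: datetime) -> datetime:
    """Round the client's time down to a 10 minute slot, then step back one slot."""
    slots = (now - EPOCH) // SLOT  # whole slots elapsed since the epoch
    return EPOCH + (slots - 1) * SLOT


def as_http_date(moment: datetime) -> str:
    """Format the value the way If-Modified-Since expects it."""
    return format_datetime(moment, usegmt=True)
```

A resource fetched at 12:37:42 UTC is recorded as 12:20:00, so every client fetching within the same 10 minutes sends an identical value, and a modification made at any point in the 12:30 slot is still later than what we send.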

By using a quantized timestamp derived from the current time, not the server’s time, I think we can thwart this privacy invasion and still retain an effective caching mechanism.

Random fix

A second idea is to always subtract a random time from the client computed time. That is, we still use the client time for Last-Modified, but each time we send it to the server we subtract a random amount.

This breaks the ability of the server to track an individual user. With each request the server gets a different time, so it is unable to associate this request with a previous one. If we keep the random offset within the same 10 minute window as quantization it also doesn’t really harm the caching mechanism.
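A sketch of the random offset, again with names of my own choosing. I use the `secrets` module so the offsets can’t be predicted by the server; whether that matters in practice is a judgment call.

```python
import secrets
from datetime import datetime, timedelta

WINDOW_SECONDS = 600  # same 10 minute window as the quantization idea


def jittered_validator(stored: datetime) -> datetime:
    """Return the stored client time minus a fresh random offset for this request."""
    offset = secrets.randbelow(WINDOW_SECONDS)  # 0 to 599 seconds
    return stored - timedelta(seconds=offset)
```

The browser keeps `stored` unchanged in its cache and calls `jittered_validator` anew for every request, so the value the server sees is always at or slightly before the stored time, never after it.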

As a bonus we could also continually update the Last-Modified time of the resource. Anytime a request returns a 304 Not Modified response we can safely update the time (using the quantized period again). If every browser did this it means the server would be getting a small range of times for the same resources across all of its users.

Browser fixes

This problem exists for third-party and cross-site tracking; obviously the main site you are on can track you while on their site. We’re concerned with the ability of third-party sites tracking you across several different domains.

Abusing caching headers is another way to track us. The safest solution appears to be having the browser just not cache third-party content and send only restricted headers to those sites. If we really wish to retain caching, since we sometimes cache useful things, not just ads, then the quantized and randomly modified time seems promising. If not that, then we need to investigate a new caching mechanism which cannot be abused.

This issue, and several related privacy leaks, need to be fixed in the browsers. Disabling third-party cookies does nothing if it’s just as easy to use caching headers to achieve the same thing.

Categories: Security


6 replies »

  1. The sad (and annoying) thing is: that privacy, and anonymity – at high levels – end up being mutually exclusive to each other in the browser. Start blocking cross-site requests? Then you only give yourself a unique (or rare) ‘meta’ fingerprint, making you stand out very uniquely to state trackers such as the NSA (and their allies) or whoever else has similar means and desire too.

    Start ‘hiding in the crowd’ of the Tor Browser ‘omg fingerprint avoid avoid avoid’ ‘user agent’? Well then you must give up all privacy of what you do on one site to the horde of third parties swarming on it and so have whatever you do, way less private – for the sake of anonymity.

    So for now, my own solution is to create two profiles: one for privacy, and one for anonymity. Pimp Firefox out for the privacy one – lock it down – even use Lynx / curl / wget if you have to. And for anonymity, use some fingerprint-phobia Tor browser crowd hiding setup – and just realize that MANY parties are viewing it – albeit they have no idea WHO it is doing it, if you think through how else you should be doing things under that type of web user profile.

    People need to realize, that ‘tracking’ can either mean from an anonymity perspective, or a privacy one. They are not always the same thing, and unfortunately, one OFTEN comes at a compromise to the other.

    In the future, since I see now that data is the new squabbling resource of this coming century (replacing oil, which was the twentieth’s – I realized this when Elon Musk stood on a stage ten days ago to show us how the world will change via Tesla Energy) – things might get bad enough with data totalitarianism that thankfully, ideas deemed a pipe dream like high latency / data scrambling and just, TBH WHATEVER will be needed to fundamentally thwart the NSA’s now-obvious big data collection, analysis and storage, will end up being coded and made accessible. Code is code and code ultimately has freed people in the end (if they want to be), not enslaved.

    Just like the hippie generation broke free of the cultural dependence and worship of oil, and offered another way (accessible and free for all) – a ‘data hippie’ movement will emerge that can TOTALLY unshackle the individual from the system (and yet allow them to benefit from its public data, all the same) – once enough people realize that the novel benefits of this system (data ubiquity and all the ’empowering’ gadgets and things we can do with it), are not ‘worth it’.

    But until things get ‘that bad’, that they then get ‘that good’ – we may be one of the most surveilled populations in known human history, right now.

    Enjoy this data while you can, system. I think a whole lot of encryption and fundamental (and constitutionally-protected) obfuscation is on its way. Quantum computing era, included.

    • I don’t think I’m following your argument about how anonymity and privacy cannot be maintained at the same time? I don’t see how blocking cross-site requests gives one a fingerprint; it should be doing the opposite. The fewer sites I touch the more difficult it is to create a fingerprint of who I am.

      I can imagine if there’s like only one person blocking such requests it would stand out, but slowly as thousands, and tens of thousands, of people do the same thing it becomes more generic. But we need these features inherent to the browsers anyway. So if there is something unique about this, then it’d be completely erased as every user starts doing the same thing.

      Note that HTTPS now is also broken, so simply switching on encryption doesn’t actually solve the problem, at the NSA level at least. There’s a real problem of untrustworthy certificate authorities, invalid certificates, and stolen certificates. HTTPS as well needs improvements to be a trustworthy system.

    • The problem about privacy measures (that hardly anyone else uses) is that it creates a unique ‘signature’ or fingerprint in your browsing activity which can be trivially seen in basic backbone traffic analysis. (To any party recording MULTI-site traffic, aka NSA or any other entity who decides to wiretap). HTTPS (ok, I agree it’s problematic, note at end of comment) helps in a lot of ways by masking *what* resources you’re downloading from the webserver in question (it encrypts what URL/s you request in the session as well as all other resources like images/CSS within the SSL domain) but even then, a basic problem remains, that by you NOT downloading x tracker beacon from (or third party images or js library objects, using e.g. the RequestPolicy add-on), NSA sees that ‘aha this user is NOT making any cross-site requests’, when they know that when visiting that domain or page, requests SHOULD be made to those 3rd party objects at the same time. So sadly, it only creates ANOTHER way to be tracked, by backbone wiretapping, in and of itself.

      The NSA thrives on traffic analysis and I would not underestimate the increasing ability for them to chew on the data and spit out their OWN fingerprint ‘IDs’ based on their analysis of it. (I mean look at ‘panopticlick’ cooked up by just EFF, that’s only just the beginning.)

      So if you block cross-site requests by default over a long period of time (or ANY two activities you don’t want linked), cookies, referer headers, cache, all that good stuff for privacy, you have an INCREDIBLY unique browsing fingerprint all across the web, no matter what changes in IP addresses you have.

      So it only gives privacy from first-party sites, third-party trackers (who’s the ‘second party’ here? anyway), but NOT a ‘third’ third party, who’s recording ALL the traffic, from and to everyone.

      But even to that first party site: imagine if you wanted to ‘reset’ your identity to THAT site (aka have ANONYMITY to them as well, not just privacy of you visiting it from the point of view of OTHER sites): if you block 3rd party objects, you’re probably one of a small handful who do it, depending on the size and nature of the site. If you enable say, imgur requests ONLY, but no others, you’re bound to be 100% unique to that site now. There are ways for them to check if 3rd party objects are loaded by you (like messages for ‘please enable js for a better experience on our site’ – unless that’s only browser handled and in the source no matter what but anyway you get the idea), and again there’s other things like just blocking cookies / other first-party domain tracking mechanisms (like indeed, this cache objects issue! I block cache outright for my ‘privacy’ browser profile and this is the biggest reason yet to do so!) which ironically, serve as a SUPER unique ‘meta cookie’ if the webadmin wants to take a look, indeed!

      When whole businesses are made from coding algorithms to track users on the web, I wouldn’t underestimate the ability for this to already be included in some commercial tracking mechanisms as we speak.

      But the NSA are still the ones who can track the most powerfully, based on this ‘browser privacy tweak’ idea in general. It vastly changes your browsing signature, and that’s super bad for anonymity.

      Major, fundamental, latency-increasing data scrambling and obfuscation technologies (I’m talking fake and authentic-looking requests to sites/pages that you don’t actually visit yourself, just to anonymize everything, think Freenet for network requests), are going to have to be introduced to neutralize this IMO, major threat.

      And to me anyway NSA are the most important threat model (at the end of the day) so this is a pretty big issue. It’s far easier to block commercial trackers like Google, but NSA sees ALL.

      So yes, HTTPS is very compromised (conspiracy theories have now become conspiracy fact or conspiracy most liklies). Its centralized nature gives ripe abuse for ‘ownage’ by the NSA and I trust self-signed certificates (especially from sites that openly do this for this reason) more than a posh ‘trusted’ shiny validation cert from Verisign and co. ANY day.

      Still good to only use HTTPS by default though, but I guess one needs to realize that you should treat HTTPS as if it’s HTTP and anything truly important or identity-linked, either use client-controlled encryption (PGP, etc) or don’t do it online at all. Don’t use user accounts when possible, dispose of them after one use, just expect everything to be public unless the encryption can ONLY be decrypted by you or friends given keys offline or via secure mechanisms to set it up in the first place.

  2. Great article, this is something I had never considered, but it definitely fits the bill for ways to track. It made me wonder if plugins like Ghostery would stop this kind of thing from happening. Could you share what setup you use to avoid tracking? I always like to compare and contrast :)
