I recently implemented ETag caching support only to learn it's a privacy hazard. Internet scum have co-opted yet another technology to track us as we browse. In this article I'll look briefly at this problem and how it also applies to Last-Modified headers. I then offer a fix for time-based caching that prevents tracking.
Though I offer a fix here, I still believe a properly sandboxed browser is the better, more universal solution to privacy.
Using ETag to track a user
Tracking is primarily a problem with third-party websites. These are typically the ones showing irrelevant ads (like at the bottom of this article), or convenient CDNs serving common scripts. To track us, all they need is some kind of reasonably unique identifier sent with each request.
ETag was intended for cache validation. A large image, or other resource, on a page will be served with an ETag. When the browser comes back to this page later, or sees the same image URL, it sends the tag in an If-None-Match request to the server. If the resource has changed, a new binary will be sent back. If the resource hasn't changed, a 304 Not Modified is returned. It's a simple way to avoid reloading resources already in the cache.
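The server side of this flow can be sketched in a few lines. This is a minimal illustration, not any particular server's implementation; the resource bytes and the content-derived tag are hypothetical:

```python
import hashlib

def make_etag(body):
    # A well-behaved server derives the tag from the content itself.
    return '"%s"' % hashlib.sha256(body).hexdigest()[:16]

def handle_request(if_none_match, body):
    etag = make_etag(body)
    if if_none_match == etag:
        # Client's cached copy is still valid: no body needed.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body

body = b"...image bytes..."
status, headers, _ = handle_request(None, body)        # first visit: 200
status2, _, _ = handle_request(headers["ETag"], body)  # revalidation: 304
```

Note that nothing in the protocol requires the tag to be derived from the content; that gap is exactly what the rest of this article is about.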
The problem is that the ETag is chosen by the destination server. There are no rules about how it is produced. This allows that server to create a unique string for every single visitor. Every time the browser checks the validity of this resource it will send back this tag. If the same URL is used on multiple websites, that same tag will be sent. It's become a unique tracking token!
I've heard that some browsers disable caching on third-party websites, but I can't find reliable information on this.
Can ETag be saved?
Can the ETag validation mechanism somehow be saved, remaining useful without becoming a tracking token?
At first it seems like the problem is the server generating the token string. What if we generate a hash of the binary on the client side instead? This way the server has no control over the resulting identifier. Two people who get the same resource, the same image, would end up with the same identifier.
It doesn't work. The disreputable third-party website will simply start serving unique resources to each user. Image formats, both JPEG and PNG, allow all sorts of headers, even custom ones, to be embedded in the file. So it's very easy to serve the same visual image with a distinct binary. The hash of the binary will be effectively unique for every user.
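The defeat is easy to demonstrate. In this sketch the pixel data and the metadata chunk are stand-ins (not real PNG structures), but the point holds for real files: identical visual content plus per-user metadata yields a per-user hash.

```python
import hashlib

# Identical "pixel data" served to two different users...
pixels = b"\x89PNG...identical pixel data..."

# ...with a per-user metadata field appended (hypothetical chunk layout).
image_user_a = pixels + b"tEXt comment=user-a"
image_user_b = pixels + b"tEXt comment=user-b"

hash_a = hashlib.sha256(image_user_a).hexdigest()
hash_b = hashlib.sha256(image_user_b).hexdigest()

# The client-side hash is now a per-user token, exactly what we
# were trying to avoid.
assert hash_a != hash_b
```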
What if we hash only the visible content of images? That way the server can't add random crap to the header to get a unique id. Unfortunately, it's still rather easy to store information in an image without the user detecting any change; just look up "steganography".
One more possibility: what if the client hash is trimmed to a very small size? Instead of a full 128-bit cryptographic hash we reduce it to 16 bits, or even 12 bits. This shrinks the identifier space by a large degree; the hashes are still tracking tokens, but it's difficult to make them unique. Of course, it's still broken if the third-party website just uses multiple images to construct a compound identifier.
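Both the truncation and the compound-identifier attack are easy to sketch. This assumes a 12-bit truncation; the resource names are illustrative:

```python
import hashlib

def short_hash(data, bits=12):
    # Truncate a cryptographic hash to a tiny identifier space
    # (2**bits possible values), forcing collisions between users.
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full & ((1 << bits) - 1)

# With 12 bits, any single resource identifies a user only coarsely...
assert 0 <= short_hash(b"cat.png served to user 42") < 4096

# ...but a tracker serving several per-user resources can concatenate
# the short hashes into a much larger compound identifier.
resources = [b"cat.png user-42", b"ad.js user-42", b"pixel.gif user-42"]
compound = 0
for r in resources:
    compound = (compound << 12) | short_hash(r)
# Three resources already give a 36-bit space: billions of distinct ids.
```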
The conclusion here is basically that ETag should be completely disabled, at least for third-party websites (since the primary domain can always track you anyway).
Last-Modified and If-Modified-Since are also broken
The ETag problem got me thinking about the other caching headers. It looks like If-Modified-Since has the same problem, albeit a bit more subtle.
If-Modified-Since is used to check whether a resource is newer than the one in the cache. If it is newer, a new version will be returned; if not, 304 Not Modified will be returned. The browser's clock and the server's clock may not be synchronized, so RFC 2616 recommends echoing back the server's Last-Modified header rather than a client-generated date. Allowing the server to decide these values is exactly where the trouble starts.
When the server gives us a Last-Modified timestamp and we send it back in the If-Modified-Since header, we've basically created a new unique token. The server can just create unique timestamps for every user loading the image. Used this way, these timestamps are essentially just a random tag, not a legitimate timestamp.
The first idea for a fix is simply to ignore the RFC and disregard the Last-Modified header. It creates a potential time-synchronization problem, but that seems less harmful than creating a unique tracking token.
Another option is to quantize all timestamps. The current datetime format allows second-level precision, which simply isn't necessary for image caching. Browsers tend to delay revalidation checks for the length of the session, or a day, anyway. If we limit the timestamp to increments of 10 minutes, we've significantly reduced the identifier space.
This can actually solve the time-synchronization issue as well. We always round down to the nearest 10 minutes and then step back one more increment. This ensures that if the resource is modified in the current time slot, our recorded timestamp will be earlier than the modification. When we make a new request with If-Modified-Since, we will get the new resource.
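The rounding scheme is small enough to write out. This is a sketch of the idea, assuming the browser records the timestamp from its own clock as a Unix time:

```python
QUANTUM = 10 * 60  # 10-minute increments, in seconds

def quantized_timestamp(now):
    # Round down to the nearest 10-minute boundary, then step back one
    # more increment so that a modification happening anywhere in the
    # current slot is still strictly newer than our recorded time.
    return int(now // QUANTUM - 1) * QUANTUM

ts = quantized_timestamp(1_000_000.0)
assert ts % QUANTUM == 0          # lands on a quantized boundary
assert ts <= 1_000_000.0 - QUANTUM  # at least one full increment behind
```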
By using a quantized timestamp derived from the client's clock, not the server's, I think we can thwart this privacy invasion and still retain an effective caching mechanism.
A second idea is to always subtract a random time from the client-computed timestamp. That is, we still use the client's clock for Last-Modified, but each time we send it to the server we subtract a random amount.
This breaks the server's ability to track an individual user. On each request the server sees a new time, so it cannot associate the request with a previous one. If we keep the random offset within the same 10-minute window as the quantization, it also doesn't really harm the caching mechanism.
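Combining the two ideas, a sketch of the header value the browser might send (again assuming Unix-time arithmetic on the client's own clock):

```python
import random

QUANTUM = 10 * 60  # 10-minute increments, in seconds

def if_modified_since(now):
    # Quantize to a 10-minute boundary and step back one increment,
    # then subtract a fresh random offset (less than one quantum) on
    # every request, so the server never sees a repeatable timestamp.
    base = int(now // QUANTUM - 1) * QUANTUM
    return base - random.randrange(QUANTUM)

a = if_modified_since(1_000_000.0)
b = if_modified_since(1_000_000.0)
# Both values stay within one quantum below the quantized base, so the
# server's freshness comparison still behaves sensibly.
```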
As a bonus, we could also continually refresh the Last-Modified time of the resource. Any time a request returns a 304 Not Modified response, we can safely advance the recorded time (using the quantized period again). If every browser did this, the server would see only a small range of times for the same resource across all of its users.
This problem exists for third-party and cross-site tracking; obviously the main site you are on can track you while you are on it. We're concerned with third-party sites tracking you across several different domains.
Abusing caching headers is another way to track us. The safest solution appears to be for the browser simply not to cache third-party content, and to send restricted headers to those sites. If we really wish to retain caching, since we sometimes cache useful things, not just ads, then the quantized and randomly offset time seems promising. Failing that, we need to investigate a new caching mechanism that cannot be abused.
This issue, and several related privacy leaks, needs to be fixed in the browsers. Disabling third-party cookies accomplishes nothing if it's just as easy to use caching headers to achieve the same thing.