The future of caching

This is not your father's Internet. When the Web was first emerging onto the scene, it was simple. Individual web pages were self-contained static blobs of text, with, if you were lucky, maybe an image or two. The HTTP protocol was designed to be "dumb". It knew nothing of the relationship between an HTML page and the images it contained. There was no need to. Every request for a URI (web page, image, download, etc.) was a completely separate request. That kept everything simple, and made it very fault tolerant. A server never sat around waiting for a browser to tell it "OK, I'm done!"

Much e-ink has been spilled (can you even do that?) already discussing the myriad ways in which the web is different today, mostly in the context of either HTML5 or web applications (or both). Most of it is completely true, although there's plenty of hyperbole to go around. One area that has not gotten much attention at all, though, is HTTP.

Well, that's not entirely true. HTTP is actually a fairly large spec, with a lot of exciting moving parts that few people think about, because browsers either offer no way to use them from HTML or just implement them very, very badly. (Did you know that there is a PATCH command defined in HTTP? Really.) A good web services implementation (like we're trying to bake into Drupal 8 as part of the Web Services and Context Core Initiative </shamelessplug>) should leverage those lesser-known parts, certainly, but the modern web has more challenges than just using all of a decades-old spec.

Most significantly, HTTP still treats all URIs as separate, only coincidentally-related resources.

Which brings us to an extremely important challenge of the modern web that is deceptively simple: Caching.

Caching is broken

The web naturally does a lot of caching. When you request a page from a server, rarely is it pulled directly off of the hard drive at the other end. The file, assuming it is actually a file (this is important), may get cached by the operating system's file system cache, by a reverse proxy cache such as Varnish, by a Content Delivery Network, by an intermediary server somewhere in the middle, and finally by your browser. On a subsequent request, the layer closest to you with an unexpired cache will respond with its cached version.

In concept that's great, as it means the least amount of work is done to get what you want. In practice, it doesn't work so well for a variety of reasons.

For one, that model was built on the assumption of a mostly-static web. All URIs are just physical files sitting on disk that change every once in a while. Of course, we don't live in that web anymore. Most web "pages" are dynamically generated by a content management system of some sort.

For another, that totally sucks during development. Who remembers the days of telling your client "no, really, I did upload a new version of the file. You need to clear your browser cache. Hold down shift and reload. Er, wait, that's the other browser. Hit F5 twice. No, really fast. Faster." Yes, it sucked. There are ways to configure the HTTP headers to not cache files, but that is a pain (how many web developers know how to mess with Apache .htaccess files?), and you have to remember to turn that off for production or you totally hose performance. Even now, Drupal appends junk characters to the end of CSS URLs just to bypass this sort of caching.
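(For reference, "configuring the HTTP headers to not cache files" boils down to a handful of Apache directives along these lines. This is only a sketch, assuming mod_expires and mod_headers are enabled, and it absolutely must not ship to production:)

<IfModule mod_expires.c>
  ExpiresActive On
  ExpiresDefault "access plus 0 seconds"
</IfModule>
<IfModule mod_headers.c>
  Header set Cache-Control "no-cache, no-store, must-revalidate"
</IfModule>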

Finally, there's the browsers. Their handling of HTTP cache headers (which are surprisingly complex) has historically not been all that good. What's more, in many cases the browser will simply bypass its own cache and still check the network for a new version.

Now, normally, that's OK. The HTTP spec says, and most browsers obey, that when requesting a resource that a browser already has an older cached copy of, it should include the last updated date of its version of the file in the request, saying in essence "I want file foo.png; my copy is from October 1st." The server can then respond with either a 304 Not Modified ("Yep, that's still the right one") or a 200 OK ("Dude, that's so old; here's the new one"). The 304 response saves resending the file, but doesn't help with the overhead of the HTTP request itself. That request is not cheap, especially on high-latency mobile networks, and especially when browsers refuse to have more than 4-6 requests outstanding.
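On the wire, that conversation looks roughly like this (the host name and dates are placeholders):

GET /foo.png HTTP/1.1
Host: www.example.com
If-Modified-Since: Sat, 01 Oct 2011 00:00:00 GMT

HTTP/1.1 304 Not Modified
Date: Thu, 06 Oct 2011 13:00:00 GMT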

As a semi-random example, have a look at Drupal.org in Firebug. By my count there are 17 different HTTP requests involved in that page, for the page itself, CSS files, image files, Javascript files, and so forth. 11 of those return a 304 Not Modified, but still have to get sent, and still block further requests while they're active.

Now look at WhiteHouse.gov. 95 HTTP requests for the front page... nearly all of them 304 Not Modified (assuming you've hit the page at least once). ESPN.com: 95 requests, again mostly 304 Not Modified. Forbes.com: over 200.

These are not sites built by fly-by-night hackers. These are high-end professional sites whose teams do know how to do things "right". And the page is not actually "done" until all of those requests go out and complete, just in case something changed. The amount of sheer waste involved is utterly mindboggling. It's the same old polling problem on a distributed scale.

The underlying problem, of course, is that a web page is no longer a single resource that makes use of one or two other resources. A web page -- not a web application or anything so fancy, but just an ordinary, traditional web page -- is the product of dozens of different resources at different URIs. And our caching strategies simply don't know how to handle that.

Half-hearted solutions

A couple of possible workarounds for this issue exist, and are used to a greater or lesser extent.

Multi-domain image servers
Many high-end sites that are able to afford it will put their rarely-changing resource files on a separate domain, or multiple separate domains. The idea here is to bypass the browser throttling feature that refuses to send more than a handful of HTTP requests to a given domain at the same time, in an effort to not overload it. Even if the domains all point to the same server, that can help parallelize the requests far better. That helps, to be sure, but there is still a potentially huge number of "Is it new yet?" HTTP requests that don't need to happen. Especially on a high-latency mobile network, that can be a serious problem.
Data-URIs
The HTML spec supports a mechanism called Data-URIs. (It's actually been in the spec since HTML 4.01, but no one paid attention until the recent surge of interest in HTML5.) In short, a dependent resource, such as an image, is base64-encoded and sent inline as part of the HTML page. It's then decoded by the browser and read as an image. That eliminates the separate HTTP request overhead, but it also completely kills caching. The inlined image has to be resent every single time with the HTML page. It can also be a pain to encode on the server side. That makes it useful in practice only for very small files.
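For the curious, an inlined image looks something like this; the base64 payload shown here is just the classic 1x1 transparent GIF, included purely for illustration:

<img alt="" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" />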
SPDY
Google, with their usual flair for "open source is great but we'll do it ourselves", has proposed (and implemented in Chrome) an HTTP alternative called SPDY (which stands for "speedy"). Without going into too much detail, the big feature is that a single connection can be used for many resources. That eliminates the overhead of opening and closing dozens of connections, but there's still the (now more efficient) "are we there yet?" queries. SPDY is still not widely used. Unfortunately I don't know much else about it at the moment.
HTML5 Manifest
I thought this was the most promising. HTML5 supports a concept called the appcache, which is a local, offline storage area for a web page to stick resources in. It is controlled by a Manifest file, referenced from the HTML page, that tells the browser "I am part of a web application that includes these other files. Save us all offline and keep working if you have no connection." That's actually really, really cool, and if you're building a web application, using it is a no-brainer.
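The mechanics are simple enough: the html element points at a manifest file (served as text/cache-manifest), and the manifest lists what to keep offline. A minimal sketch, with invented file names:

<html manifest="site.manifest">

CACHE MANIFEST
# site.manifest, v1 - 2011-10-06
themes/mytheme/logo.png
themes/mytheme/style.css
misc/jquery.js

NETWORK:
*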

There are a number of issues with the Manifest file, however, something that most people acknowledge. They mostly boil down to it being too aggressive. For instance, you cannot avoid the HTML page itself also being cached. An appcache-using resource will never be redownloaded from the web unless the Manifest file itself changes (and the browser redownloads a new version of it), in which case everything will be downloaded again.

I ran into this problem while trying to write a Manifest module for Drupal. The idea was to build a Manifest file on the fly that contained all of the 99%-static resources (theme-level image files, UI widgets, etc.) so that those could be skipped on subsequent page loads, since they practically never change, and avoid all of that HTTP overhead. Unfortunately, as soon as you add a Manifest file to an HTML page, that page is permanently cached offline and not rechecked. Given that Drupal is by design a dynamic CMS where page content can change regularly for user messages and such, that's a rather fatal flaw that I have been unable to work around.

A better solution

So what do we do? Remember, up at the top of this article we noted that most web "pages" these days (which are still the majority of the web and will remain so for a long time) are dynamically built by a CMS. CMSes these days are pretty darned smart about what it is they are serving up. If a file has changed, they either know or can easily find out by checking the file modification date locally, on the server, without any round-trip connection at all. We can and should leverage that.

If a web page is an amalgam of many resource URIs, we should trust the page to know more about its resources than the browser does. That doesn't mean "cache everything as one". It means that if we assume part of the page will be dynamic, the HTML itself, then we can trust it to tell the browser about its dependent resources. We already do, in fact. We trust it to specify images (via img tags or CSS references), CSS (via link and style tags), Javascript (via script tags), and so on. But we don't trust the page to tell us anything about those files beyond their address.

Perhaps we should.

I would propose instead that we allow and empower the application level on the server to take a more active and controlling role in cache management. Rather than an all-or-nothing Manifest file, which is in practice only useful for single-page full-on applications, we should allow the page to have more fine-grained control over how the browser treats resource files.

There are many forms such support could take. As a simple starting point, I will offer a reuse of the link tag:

<!-- Indicates that this image will be used by this page somewhere, and its last modified date is 1pm UTC on 6 October. If the browser has a cached version already, it knows whether or not it needs to request a new version without having to send out another HTTP request. -->
<link href="background.png" cache="last-modified:2011-10-06T13:00:00" />

<!-- It works for stylesheets, too. What's more, we can tell the browser to cache that file for a day.  The value here would override the normal HTTP expires header of that file, just as a meta http-equiv tag would were it an HTML page. -->
<link href="styles.css" rel="stylesheet" cache="last-modified:2011-10-06T13:00:00; expire:2011-10-07T13:00:00" />

<!-- By specifying related pages, we can tell the browser that the user will probably go there next so go ahead and start loading that page.  Paged news stories could be vastly sped up with this approach. This is not the old "web accelerator" approach, as that tried to just blanket-download everything and played havoc with web apps. -->
<link href="page2.html" rel="next" cache="last-modified:2011-10-06T13:00:00; fetch:prefetch" />

<!-- Not only do we tell the browser whether or not the file needs to be re-cached, but we also tell the browser that the file will not be used immediately when the page loads. Perhaps it's a rollover image, so it needs to be loaded before the user rolls over something, but that can happen after all of the immediately-visible images are downloaded. Alternatively this could be a numeric priority for even more fine-grained control -->
<link href="hoverimage.png" cache="last-modified:2011-10-06T13:00:00; fetch:defer" />

<!-- If there's too many resources in use to list individually, link to a central master list. Any file listed here is treated as if it were listed individually, and should include the contents of the cache attribute.  Normal caching rules apply for this file, including setting an explicit cache date for it. Naturally multiple of these files could be referenced in a single page, whereas there can be only a single Manifest file. The syntax of this file I leave for a later discussion.  -->
<link href="resources.list" rel="resources" />

In practice, a CMS knows what those values should be. It can simply tell the browser, on demand, what other resources it is going to need, when they were last updated, the smartest order in which to download them, even what to prefetch based on where the user is likely to go next.
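To make that concrete, here is a rough sketch of how a CMS might emit the proposed markup. Note that render_resource_links() is a hypothetical helper, not an existing Drupal API, and the cache attribute is the speculative one described above:

<?php
// Hypothetical helper: given a list of local resource paths, emit the
// proposed <link> tags using modification times the server already has.
function render_resource_links(array $files) {
  $output = '';
  foreach ($files as $path) {
    // Reading the mtime is a local operation; no network round-trip needed.
    $mtime = gmdate('Y-m-d\TH:i:s', filemtime($path));
    $output .= '<link href="' . htmlspecialchars($path) . '" cache="last-modified:' . $mtime . '" />' . "\n";
  }
  return $output;
}
?>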

Imagine if, for instance, a Drupal site could dynamically build a resource file listing all image files used in a theme, or provided by a module. Those are usually a large number of very small images. So just build that list once and store it, then include that reference in the page header. The browser can see it, know the full list of what it will need, when those files were last updated, even how soon it will need them. If one is not used on a particular page, that's OK. The browser will still load it, just like with a Manifest file. On subsequent page loads, it knows it will still need those files, but it also knows that its versions are already up to date, and leaves it at that. When it needs those images, it just loads them out of its local cache.

And when a resource does change, the page tells the browser about it immediately, so that it doesn't have to guess whether there is a new version. It already knows, and can act accordingly to download just the new files it needs.

Any CMS could do the exact same thing. A really good one could even dynamically track a user session (anonymously) to see what the most likely next pages are for a given user, and adjust its list of probable next pages over time so that the browser knows what's coming.

Naturally, all of this assumes that a page is coming from a CMS or web app framework of some sort (Drupal, Symfony2, Sharepoint, Joomla, whatever). In practice, that's a pretty good assumption these days. And if not, a statically coded page just omits the cache attribute and the browser behaves normally as it does today, asking "are we there yet?" over and over again and getting told by the server "304 No, Not Yet".

Feedback

There are likely many details I am missing here, but I believe the concept is sound. Modern web pages are dynamic on the server side, not just on the client side. Let the server give the browser the information it needs to be smart about caching. Don't go all-or-nothing; that is fine for a pure app, but most sites are not pure apps. Server-side developers are smart cookies. Let them help the browser be faster, smarter.

I now don the obligatory flame-retardant suit. (And if you think this is actually a good idea, someone point me to where to propose it besides my blog!)

Comments

I think there is some

I think there is some misunderstanding of how caching headers work in this post. The browser will only HEAD a resource (resulting in a 304 or 200) if the cached resource has passed its expires header timestamp. By setting future expires headers (which is in pretty much every performance best practice guide, and can be safely and reliably done with almost all CSS, JS and uploaded files in Drupal), the browser will make no HTTP request at all for these resources until they expire.
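Concretely, "future expires headers" just means the resource is served with something like the following (dates illustrative):

Expires: Thu, 20 Oct 2011 13:00:00 GMT
Cache-Control: public, max-age=1209600
Last-Modified: Thu, 06 Oct 2011 13:00:00 GMT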

For drupal.org (for example), visiting my dashboard for a second time, I only get 4 HTTP requests: one for the dashboard (since I am logged in), 2 for Google Analytics, and one for tipsy.gif (I haven't looked into why this is not cached - it is fetched via javascript, which could be the problem). If I set up anonymous caching plus expiry on a vanilla Drupal 7 site, the second anonymous visit to a page loads with a single HTTP HEAD request for the whole page, resulting in a 304.

I suspect the reason your tests produced different results is that you were hitting refresh/F5 on the page - when you do this, most browsers will do a HEAD for all resources, even if they are not expired - this is a common source of confusion when testing caching. If you do a "full refresh" (ctrl-F5 or whatever) it will re-GET them all instead. To see the normal caching behavior you need to click on a link to the same page, or click to a different page and back again.

The kinds of things you mention are useful approaches too - especially for dynamic/large/complex sites, mobile browsers and other challenging environments where you need more fine-grained control - and I don't mean to write those off at all (in fact there are plenty of useful conversations to be had... but I don't have time for a proper response now), but I think there is plenty of value in regular caching too :)

Functional cache

We prepared just such a mechanism as you describe in our COBAEX CMS (http://www.cobasolutions.com/business_software/en_cms). We call it a "functional cache". The main difference between the "normal" cache and ours is that all the pages built with this technology are actually "compiled" by the CMS into HTML elements. What I mean is that once you modify a page in the CMS, the server prepares an HTML file for that page, and once the client needs it, the server simply serves the pre-prepared HTML file. This basically means the CMS is not building the page on every request (and then checking whether the page has been modified, so a 200 must be sent, or has not been modified, so a 304 must be sent); instead, the page rebuild is triggered by the CMS page edit functionality (which is why we call it "functional caching").

We use two servers: an administration server, which is responsible for all the administration/management tasks - it has the full database, the compiling mechanisms, etc., and all page editing actually happens there; and a presentation server, which holds only HTML files and sometimes (depending on the implementation) a very simple database (in most cases MySQL, hence quite fast for simple queries) for search purposes. Everyone who accesses the website uses the presentation server, which serves the HTML files; in some more sophisticated implementations it needs to "build" the page out of sets of prepared HTML elements (e.g. if you have a portal that needs to implement search: the result list elements are separate HTML files that are put into one list based on the results from that presentation database - which is itself a kind of "pre-compiled" database, as it does not use any foreign keys or the like, just simple tables with the fields needed for searching plus the names of the HTML files that should be included in the search results).

This approach gives us several additional advantages, such as increased security (imagine someone breaking into the presentation server and destroying the pages stored there - to restore, you just need to publish everything from the administration server again and you have the current site alive and kicking once more; in some implementations we even have a special mechanism that, at randomly chosen intervals, checks whether the pages on the presentation server are OK, and if not, the administration server automatically publishes everything, so the site is fixed as soon as the problem is detected) or the possibility to publish selected sets of pages (not making changes directly on a live page - imagine your company releases a new product and needs to prepare some pages about it; you don't work on the production copy, but prepare everything on the administration server and, once everything is ready, publish it all with a single click).

This technology can also be used very nicely for automated webpage creation - e.g. in SEO campaigns people create many back-office pages with some texts inside them. Once you have a tool that lets you just enter the texts and publish many pages on the same (or several) templates, you can create such pages very fast. Not to mention that if some of your pages/domains used for SEO get a filter, you can move the content from that page to another (even using a different template) with a single click. We prepared such a tool for one project - and it really works great :)

This approach can be used for simple webpages (see www.w4e.pl - a very simple page, but still using this technology), though such pages do not show much of a difference. On larger sites (with many subpages - see www.chodkowska.edu.pl - over 500 subpages in different subdomains, administered by different people, etc.) the speed compared to standard products is already noticeable. But the real advantage shows on a large portal - the newest implementation, www.domoklik.pl, we are really proud of. This portal is the fastest real-estate portal in Poland, while having the largest number of offers. And the loading speed is really nice - especially since we still have some room to improve it (no caching servers implemented yet - only the described technology).

I hope the above is understandable :) If not - forgive me, English is not my first language :) But generally I think such an approach is the future. Indeed, the standard caching mechanisms are nice, but not nice enough anymore :)

Page caching

Caching compiled HTML pages is something any modern CMS should be doing. Drupal does so, although by default it doesn't use a separate database, and the cache is usually pushed off onto memcache. However, avoiding recompiling and regenerating the HTML page on every request is not what I'm talking about here.

Whitehouse.gov's front page, I assure you, is not being built from scratch every time. But there are still dozens and dozens of HTTP requests on every page load just to verify that caches are still valid. That's the issue I'm looking at in this post.

I like the cache-attribute!

I like the cache-attribute! Currently, our CMS appends some kind of "last-changed-id" to a lot of resource URLs (stylesheets, javascripts, images, videos, ...) to prevent the browser from using a stale cached version. This has always felt like a bad workaround, for example because the browser will then unnecessarily cache every version of a resource that it has ever seen (until it runs out of cache space).
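For example, something like this (the parameter name and value are made up for illustration):

<link rel="stylesheet" href="/css/style.css?last-changed-id=48213" />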

Filesystem Latency

First of all, I'm sorry about my English... not bilingual yet.

You should maybe consider that in big high-availability systems we often work with distributed filesystem backends, where obtaining the modification time of a file is not a "cheap" (in time terms) operation.

You can take as an example http://drupal.org/project/imageinfo_cache, where a local DB cache is used to avoid such checks.

Thanks for your work! Best regards

File system cost

True, an fstat() is not free, especially in a highly-virtualized environment. However, I would argue that it is likely still cheaper than letting the browser make that check, which would trigger an fstat() anyway on subsequent requests (one per file). Plus you can, as you say, do some sort of application-level caching of that information as well if appropriate.
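As a sketch of what that application-level caching could look like, assuming Drupal 7's cache API (cache_get()/cache_set()) as the backend:

<?php
// Sketch only: look up a file's mtime once, then reuse a cached copy so
// subsequent page builds skip the fstat() entirely.
function cached_filemtime($path) {
  $cid = 'resource_mtime:' . $path;
  if ($cached = cache_get($cid)) {
    return $cached->data;
  }
  $mtime = filemtime($path);
  // Keep it for five minutes; slightly stale data is fine for this purpose.
  cache_set($cid, $mtime, 'cache', REQUEST_TIME + 300);
  return $mtime;
}
?>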

Varnish

I'm sorry, I have "my own kind of blindness"...

In most systems we operate there is some kind of reverse proxy involved, like Varnish (in-memory storage), so we rarely perform any fstat() for static files.

Nevertheless, I think your general idea is valuable; I just wanted to point out a "black spot". Nowadays systems are quite complicated, and it is easy to lose sight of such a thing; that's where the "hive mind" excels ;)

Have you considered sharing the discussion with the High performance group on GDO?

Best regards

Still a request

Varnish can be a huge help on performance, yes. Even if you're not caching HTML pages in it (authenticated users), it can help with static resources. But the browser still has to send an HTTP request that gets to the Varnish server.

That said, it's a valid point that Varnish could cause issues. Imagine: the HTML page says "this image file has been updated, so you need to go get it." The browser dutifully sends a request, but hits Varnish and gets the Varnish-cached version of the file. But the version on the file system is newer than what's in Varnish. Hm, not sure what to do about that. I'm open to suggestions.

I haven't posted it over there yet, as I wanted to suss it out where more than just Drupalers would see it. (This is not a Drupal-specific problem.) If you want to post a link over there, feel free. :-)

Of course

I stated that your idea is valuable ;) and of course I know Varnish suffers from the described HTTP overhead in current scenarios.

It's true that Varnish can cause some issues (you normally deal with this with purge/ban mechanisms), and cache expiration policies are a growing pain in our customers' systems (when to expire what, given this constellation of views, blocks, comments, etc.)... but your CMS can talk to your own proxy ;) and there is always the option to rewrite resource names using a filter (Google's mod_pagespeed uses this approach) if you can accept rising (back-end) server loads. In my opinion, in an ideal world, different static resources would have different names and we would avoid all this mess.

OK, it's up to you... I will not post a link there... let's let Drupalists find this on their own ;) (IRC mention in my case)

Best regards

Wowowo :|

As stated, the HTTP protocol has a widely known cache specification, so we should try to leverage the power of our applications by embracing it, not changing it.

The cache attribute, in a link, is a mere rethinking of the application caching layers, which are a bad thing in big projects, because you need to maintain your own application layer and couple your application with it.

Embracing HTTP means re-using existing software (browsers, proxies, reverse proxies) to be web-scale.

Take a look at this presentation on why we should avoid application caching layers: http://www.slideshare.net/odino/be-lazy-be-esi-http-caching-and-symfony2...

The solution you propose here is, BTW, a concept similar to ESI (Edge Side Includes), except for the fact that ESI does not apply to static assets but to webpage fragments: take a look at this specification (http://www.slideshare.net/fabpot/caching-on-the-edge); I'm pretty sure you will be surprised and happy reading about it.

Apart from these points, nice post.

typo

line 2, "Changin' it"

:)

Different caching

I just looked through both slideshows, and there's some really good information in them. Yes, Drupal could do a much better job of leveraging HTTP than it does now.

However, that's all about a single request. My point here is that modern web pages are not a single request; they're dozens of requests, and there is currently no way for the HTML page to provide information about the caching status of an image it happens to use. That means all cache invalidation is based on polling, which we all know is slow. It also means you cannot control the caching logic for resource files unless you either 1) route them all through PHP (which would be stupid) or 2) trick out your Apache config (which most web devs don't know how to do, nor should they).

What I am proposing is that we allow the most dynamic request, the HTML page, to provide more useful information about the resources it uses. You can still apply whatever HTTP caching logic you want to the HTML file, but provide more information along with it so that the browser can smartly avoid even bothering to send a request to the caching server.

Hi Larry, what's the

Hi Larry,

what's the difference between specifying the cache attribute explicitly in a link, rather than specifying the caching directives in the HTTP headers?

I'm not seeing your point :)

Page vs file

Consider a page at /about.html, which uses 8 theme images, img1.png through img8.png.

Currently, the browser knows nothing about those images until it loads about.html, and then it requests those images in totally separate HTTP requests. Those requests may be cached using normal HTTP semantics.

Now, hit /about.html again. The browser doesn't know whether or not it needs to check img1.png, img2.png, etc. All it knows is when its cached version is from, and it makes an educated guess as to whether it needs to re-contact the server. All data about img1.png comes from the img1.png file's own HTTP headers.

What I'm proposing is that about.html should be able to tell the browser "by the way, you definitely do (or don't) want to get a new version of img1.png". Or say "I will use this file, but not immediately, so you can load it last." Etc. That information cannot be derived from the img1.png headers without re-requesting the file, which is exactly what we want to avoid.
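In terms of the proposed syntax, about.html would carry something like this (dates invented), and the browser would compare those values against its cached copies before deciding whether any request is needed at all:

<link href="img1.png" cache="last-modified:2011-10-06T13:00:00" />
<link href="img8.png" cache="last-modified:2011-09-01T09:00:00; fetch:defer" />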

Does that make more sense?

I think this is based on an

I think this is based on an inaccurate understanding of how expires headers work (based on doing a refresh rather than a normal page visit), as I described in my comment above. With caching headers that follow best practices, the browser can indeed determine that it doesn't even need to check for a new version of a resource.

I do get that there is a bigger point you are making here, and I think finding ways to put resource caching and prefetch rules more directly into the hands of the CMS is an excellent goal.

I am not 100% sure the tag-based rules are the way to go (although I could be persuaded) - this seems like it would be hard to model in browsers, since potentially you could have multiple caching rules (potentially from multiple domains) applying to the same resource URI. I would think an extension of the manifest style of approach could be preferable - whilst the current "page-centricness" of Drupal (lack of information on page resources for other pages or the site as a whole) could make this harder to implement, I think this is really an issue with Drupal, and likely not something a standard needs to adapt to specifically.

Expire times can be reset.

Expire times set in HTTP headers are only seen when you request that particular resource.

Expire times set in a link are seen when you request the related resource.

Consider this example:

Your Drupal site has set the minimum cache time to 5 minutes, because the content is highly dynamic and people don't want to wait a long time to see their updated profile picture appear.

So you click on the front page, and all of those front page image links tell your browser, "You can hold on to your cached copy of this file for another five minutes," except the one that was updated, which says "I just got updated; better invalidate your cache and request another copy."

Very good proposal

I think this idea really is spot-on.

- Not sure I understand why you went with the approach of separate LINK tags; these 'cache' attributes could simply be universal HTML attributes on any element, i.e., lining up with the existing 'id', 'class', etc. attributes. Of course, usage of LINK tags might make sense for resources not referenced in the actual HTML but, e.g., in the CSS instead (as you already mentioned).

- Speaking of which, it would make much more sense to have a "cache-*" attribute namespace, comparable to the universal "data-*" attribute namespace in HTML5. Hence, instead of cramming all kinds of values into a single string with wonky delimiters, we could have:

<!-- A regular image -->
<img src="/misc/duplicon.png" alt="Druplicon" cache-last-modified="2011-10-06T13:00:00" cache-expires="2012-10-06T13:00:00" />

<!-- A regular stylesheet -->
<link rel="stylesheet" href="style.css" cache-last-modified="2011-10-06T13:00:00" cache-expires="2011-10-07T13:00:00" />

<!-- An image referenced in a CSS :hover rule -->
<link href="hoverimage.png" cache-last-modified="2011-10-06T13:00:00" cache-fetch="defer" />

That said, the last sample is debatable, as it kind of crosses the line between clear separation of markup and presentation.

I like this.

Much cleaner syntax.

Why not leverage xmlns?

If we're going to extend HTML, and we're talking about "cache-" attributes, why not use XMLNS?

<html xmlns:cache="http://drupal.org/cache/0.1#">
...
<link cache:last-modified="..." />
...
</html>

The mechanism is already available, provided you are happy to serve up an XHTML flavour of HTML5.

Considered that

I considered putting the cache flags inline as you mention, but decided it was best to centralize them rather than have them scattered throughout the page. Plus, we would need the link approach anyway for CSS-based images.

Multiple cache-* properties would likely work, too. I'm easy there. :-)

It's true that this may undesirably blur the line between markup and presentation. I'm not sure there. There may be some other mechanism that more cleanly provides the sort of "push invalidation" that is the actual goal.

LINK elements

My concerns with the LINK element approach vs. universal inline element attributes:

- HTML page weight: When separately referencing all external resources on a page via LINK elements, you're adding a lot of duplication/overhead to the page. Those additional elements also need to be parsed and evaluated by the browser.

- Maintenance: The system would have to track exactly which external resources are on the page and produce correct LINK elements. Conditionally add and remove one or more during page processing, and you quickly need a badass state tracker for external page resources. ;)

- Continuous processing/AJAX: Systems like Drupal send one big blob to the browser, but that's not always the case. In fact, if Drupal weren't modular, it wouldn't have to wait for the entire page to be built and processed before sending it to the browser. Other, less or non-dynamic, systems are able to start sending their output as soon as they generate it (a commonly known performance optimization tactic). Also, in the case of AJAX, only page fragments are sent to the browser.

- User Interface Interaction: Entire sections of a page might be preloaded in a hidden/disabled state but still be contained in the page. Unless activated through user interaction, those resources may not have to be loaded at all. In particular, the cache-fetch="defer" part could be taken to the next level, so as to allow for delayed fetching of resources in general; e.g., consider image slideshows. But then again, perhaps not.

Sites like ESPN and the White

Sites like ESPN and the White House do pull down a lot of files initially. But on later page loads there are a lot fewer requests, because most of the page assets are pulled from the browser cache. For example, take espn.com minus the ads or Ooyala (the media player they use), and most of the page is pulled from cache without ever making a call to the server to see if a newer version is available. They are leveraging current caching techniques.

Digging into all the caching techniques and dealing with a primed cache is a different case altogether.

But there are a number of problems with what you propose. I'm all for something better, but this has some hurdles.

  1. While having this in the HTML has some benefits, there are certain drawbacks. The caching relationship is no longer between the resource/resource server and the receiver (the browser). You now have an intermediate entity. This intermediate is a case where an uneducated dev can really screw things up. Or the intermediate becomes out of sync (I could come up with cases for this).
  2. We live in an age where there is a focus on minifying what we send to the browser. This includes HTML compression (like http://code.google.com/p/htmlcompressor/). There is quite a lot that can be removed from a page if we try, and it can have an impact. This adds more to the page.
  3. A CMS or application serving a page may be able to find the last-modified information when you're dealing with something on a small scale. Large scale projects are in a different place. Assets may be in a different location from the application altogether, making access to this information unavailable.

There may be something here with merit, but I think more digging into caching, current caching techniques, and how this works as you scale up is needed.

One more thing to consider,

One more thing to consider: in this age of cloud computing, getting file stat info for the last-modified time may not be so fast. What if you have your files directory (to be very Drupal specific) in a cloud object store (like Rackspace Cloud Files)? Making a call to get the last modified value could get a lot more expensive.

Strongly disagree

While I wholeheartedly agree that Drupal (and CMSes/frameworks in general!) needs to leverage browser caching to improve its front-end performance, I think your proposal is ill-fated. And the information you provide is either utterly wrong or lacking.

Caching is broken:
- sure, there are problematic cases, but overall it works well
- the largest problem is that regular file caching (using Expires & Cache-Control headers) is limited by a browser's disk cache. These disk caches are prone to weird browser-dependent behavior, in the sense that you can't rely on them to actually cache the files as long as you ask them to. We need to work with browser vendors to improve their disk caches and make them work in a more reliable manner. On mobile, the disk caches' size needs to increase (they're all ±4 MB, in total).
- but it is definitely not the case that if you load e.g. whitehouse.gov multiple times, If-Modified-Since requests are sent out for all resources on every page load
- you can leverage localStorage and appCache if you want more control (neither of these is cleared out automatically by the browser, and it is actually very hard for the user to clear them manually)

Multi-domain image servers: first of all, this doesn't apply to images only, but also to CSS, JS, fonts. Everything.
Secondly, your insinuation that only high-end sites are able to afford this is completely wrong. I'm using the Amazon CloudFront CDN on http://driverpacks.net, which is a Drupal 6 site with >600K page views per month. I'll gladly publicize my annual CDN costs: between $35 and $60 per year, or $3 to $5 per month. For well over a million requests per month to their PoPs in the U.S., Europe, Tokyo and Singapore.
Plus - remember those caching headers? Well, they'll ensure that a large portion of your visitors will not actually request the data in the first place. The remainder sends 304s, which results in virtually no traffic, resulting in virtually no costs.

Data URIs: these are mostly beneficial for inlining very small resources, such as list icons and other small icons - at the cost of ±30% more bytes to send. However, you have fewer RTTs and no HTTP header overhead.
Your statement that this "completely kills caching" is blatantly wrong, because you can simply include data URIs in CSS files. When the CSS file is cached, the data URI is cached.

SPDY: while I don't like how much grip Google is gaining on the web, the fact remains that something needs to be done to improve the web. HTTP is simple. HTTP is stateless. In part thanks to the simplicity of HTTP, the web has thrived. But it is old and inefficient for the current state of the web. We have different demands: it should not just work, it needs to work fast. Before SPDY can become widespread, it needs to be shipped with every Apache installation. So, not much to say here.

HTML5 manifest files: painful indeed! However, one use case for which I think the appcache (in its current state) might be useful: font caching.

A better solution:
- first of all: you forgot about timezones. Big omission.
- it breaks caching proxies and reverse proxies such as Varnish - or you'd have to send headers and set these attributes
- I agree with the remarks by Owen Barton & Matt Farina
- you can achieve all of this today with far-future Expires/Cache-Control headers and unique filenames (see the sketch after this list). The problem then becomes determining when a file has changed. In big-scale set-ups you can simply revision all theme- and module-related resources, but that doesn't apply to the average Drupal site. What still does apply, however, is for example the Drupal version number: Drupal.js won't change unless Drupal itself is updated. For other files - to retain maximum flexibility for the user - we want to store some unique identifier (last modification time, hash ...) in the database to prevent hits on every page load. Or simply enable page caching (core's statistics module is the only thing preventing this) and only get the last modification times once every X minutes. That works just fine for smaller sites - even under load.
For the CDN module, I've put together a patch that does all of this (minus the caching of uniqueness indicators in the DB - which mikeytown's advagg module already does!): http://drupal.org/node/974350#comment-4264828. In the future, we could make it so that every dynamic file change in Drupal (a file upload or image style generation) stores the last modification time in the DB, so that we can take advantage of this in a centralized manner.
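Concretely, the unique-filename approach means serving a revisioned name with far-future headers, along these lines (the name and dates are illustrative only):

/sites/default/files/js/js_5d41402a.js

Expires: Fri, 05 Oct 2012 13:00:00 GMT
Cache-Control: public, max-age=31536000

When the file changes, the name changes, so the stale copy is simply never requested again.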

SPDY: - Firefox has SPDY

SPDY:
- Firefox has SPDY support in mozilla-central (nightly builds, off by default) and they are working on making it stable
- nginx is thinking of adding support for it; they have some test code, I believe.

If both enable it, it would mean that more than 50% of all web browsers support it within a couple of months, and that there is a fast/efficient server implementation which can be added as a proxy server.

If that happens it will start to be deployed, and this could happen within a time frame of a couple of months.

Lots of good replies

Lots of feedback here, I see. :-) I'll try to respond to a couple of things at once:

1) A couple of people have noted that my explanation (and therefore understanding) of HTTP caching and existing cache control is incomplete. That may well be the case. I've done a bit more testing with the pages above in different browsers and reloading the page in different ways, and to call the results I'm seeing "inconsistent" would be an understatement. It does not conform to what I would expect given what HTTP headers are being sent, to the best of my knowledge. I don't know if the browsers are acting weird, if my setup is weird, or if my testing/inspection techniques are invalid, but whatever HTTP is supposed to be doing with regards to caching and invalidation, it doesn't seem to be working consistently.

That's likely something we need to work on improving in our software, as Wim notes.

2) localStorage could be used to implement this sort of forced-update logic entirely in browserspace/userspace. However, doing so would require essentially reimplementing a browser cache in Javascript. You would need to include in a non-loading part of the page a list of resource files, then implement Javascript to fetch those files, store them in localStorage, and then pull them back out and inject them into the page. That's an awful lot of work for something that IMO belongs at a lower level. I'm also not sure how that would help for CSS-based images.

3) Wim is correct that CSS-based data-uris would not break caching. I'd not thought of that.

4) I didn't leave timezones out of my code samples, actually; I specified that they were all in UTC. A real implementation (this was not intended as such) would do much better date/time handling and likely use a different format than I did. (Date/time formats across different wire formats are pathetically, almost criminally, inconsistent. That's a separate matter, however.)

5) It is certainly possible that if SPDY gains traction it will resolve a lot of these issues, or at least make them less relevant. Here's hoping.

6) A 304 resulting in "virtually no traffic" is not true. It may be effectively true on a broadband connection, but most WAN networks (3G, etc.) have a much higher latency than a wireline connection. Plus, there's a separate TCP connection for each one as well, with all of its overhead. Sending a 20-byte HEAD and getting a 20-byte 304 back simply takes longer on a mobile network. That may change with future technology, but right now it's a serious issue.

7) The Manifest file would not be useful for font caching. Right now, the death knell of the Manifest file for more robust caching is that it auto-includes the HTML file that references it. The HTML page is perhaps the only resource involved in building the page that is not going to be static for days or weeks on end, and yet it is the one resource you cannot tell the manifest not to cache forever.

8) As Matt and others noted, an fstat() on the server may not be all that cheap depending on your server environment, so you don't necessarily save much that way. That's certainly true. However, a PHP script doing fstat() on a file is not going to be appreciably slower than Apache doing fstat() on the same file to decide if it should send back a 304 or a 200, and the PHP script has the potential to do its own internal caching, tracking, or whatever else application developers come up with to reduce that time even further.

9) Far-future expires and ever-changing file names can work, but that's frankly an ugly hack. It's also something that in theory Drupal is doing already; the default .htaccess file that ships with Drupal sets ExpiresDefault A1209600 (2 weeks), and we do tack garbage onto the end of a compressed CSS or JS file to give it uniqueness. But if that actually worked properly, why am I still seeing dozens of 304 requests in my browser, even when trying to reload the page "correctly"? See point 1 above.

I think my underlying point may have gotten lost in my verbosity, however. It wouldn't be the first time. :-) So let me try to state it more briefly:

A web "page" is, in practice, not one resource but dozens o' resources linked together. HTTP has no concept o' that relationship betwixt resources. That makes browsers do very wasteful thin's, with a chest full of booty. We want some way t' tie those together so th' browser can be smarter about when it does stuff with th' network.

Putting essentially pre-computed 304 responses into the HTML page may not be the right solution, certainly. However, I do believe we need some improved way of providing more intelligent contextual information to a browser. Perhaps if SPDY catches on it will solve this issue for us, since it uses only a single TCP connection for all resources. I don't know SPDY well enough to say. I do believe, however, that we need a contextual way to improve resource caching.

Thought-provoking post.

I'm not enough of an expert to judge whether you have correctly diagnosed the problem, but I like your reasoning. To help the browser manage caching, it makes more sense to tell the browser about a change after it happens than to try to predict when in the future it is going to change by setting an expire time (sounds obvious, doesn't it?). Here's a brain dump:

  • The attributes don't need to be timestamps: they could be anything that the browser can compare with its version in cache to know whether something has changed - like, say, an md5 hash.
  • I don't see anything that inherently requires a CMS. The web server could look for dependencies between objects and take care of adding in the metadata, even if the page is static.
  • Ideally, this information might properly belong in HTTP instead of the page markup. Throw a list of URLs and their stamps into the HTTP header for dependencies of the current URL (see the sketch after this list). I don't know much about HTTP - there might already be a provision in the spec somewhere for this sort of thing that nobody's using. Is there a place in the header where you can stick application-specific data without breaking existing clients?
  • If the browser provided a javascript API to the browser cache, you could do a whole bunch of things without having to change well-known standards. Maybe try to get Google interested?
  • I'm thinking about the design of HTTP; I think it was designed based on a model where caching can take place anywhere between the server and the client (such as in a proxy), with the server, client, and any intervening caches all being pretty dumb and not having to know about each other. Hence the dependency on future expiration times for things, so a caching proxy can know when to discard things without having to ask. A new caching protocol might need to consider how it affects the overall model. I'm not sure - just thinking out loud.
  • Doing fstat()s in distributed storage environments shouldn't be a design concern, because this system would, out of necessity, be optional. The administrator could just disable it if it doesn't make sense in a particular environment.
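As a purely hypothetical sketch of that header idea: the Link header and the rel values exist today, but the last-modified parameter here is invented for illustration:

Link: </theme/background.png>; rel=prefetch; last-modified="2011-10-06T13:00:00",
      </theme/sprites.png>; rel=prefetch; last-modified="2011-09-20T08:30:00"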

I hope something in that brain dump was useful.
I'm glad you're giving thought to this kind of stuff.

Responding to #8

If you use some of the caching techniques Wim talked about, you can avoid an fstat() call altogether (for PHP or Apache). When I say an fstat() call is slow, I'm comparing it to nothing, because a caching method that works now is already being used.

I think you are right

It occurred to me that static files don't actually change all that much. Which is likely why the current, brain-dead system already works. I think this is a problem that doesn't need to be solved.

To be fair to the cache-manifest

To be fair to the cache-manifest: the "CMS of today" should be able to handle assets intelligently for the cache-manifest. Arguably, isn't a similar problem visible on the server side with, say, Edge Side Includes - where the reverse proxy cache actually knows a lot about the page but only needs little pieces updated? The all-or-nothing take on the cache-manifest seems like a great tool to focus design and architecture. You touched on it with img, script, and style tags - each of these provides the tools to manage cache invalidation granularly until we need to refresh the "app" itself (much like a release... which will need to be delivered updates). I believe thinking of web page delivery in this a/b manner is healthy: either we are delivering hypertext pages over HTTP (the "days of yore" ;) or we are delivering a web app that takes advantage of CSS for style and uses JS for app logic (including loading dynamic content onto the "page"). Additionally, if this is done "right", the same logic that exists client side could live on edge servers to allow for delivery to "dumb" devices... back to static pages with dynamic content. I don't think constraint is the right word, but if it is a constraint, let's embrace it!

We run a fairly popular site

We run a fairly popular site, and what we do is just add an encoded 'mtime' in the URL of each static file:

/cache234723/path/static-image.jpg

And we add the headers for 'cache public for one year'.

Which is very similar to what you are proposing.

It means more filesystem stats, but it is faster, especially because our server has enough memory to cache all static files.

What is annoying is that caching the filesystem stats is interesting, but it conflicts with caching the HTML. Because if one static file changed (maybe even via FTP/SCP or other means), you don't know what HTML you should remove from the output cache.

So it is a good idea, and it kind of works, but your cache has to be really well organised and maintained if you want to profit 100% from it. If you have a logo in a template, for example, you might need to remove all of your cache or cache the template as a separate item.

A suggestion: maybe you should add 'size' to the attributes as well; this would allow a browser to determine which TCP connection to use for each request to get the most benefit.