This is not your father's Internet. When the Web first emerged onto the scene, it was simple. Individual web pages were self-contained static blobs of text with, if you were lucky, maybe an image or two. The HTTP protocol was designed to be "dumb". It knew nothing of the relationship between an HTML page and the images it contained. There was no need to. Every request for a URI (web page, image, download, etc.) was a completely separate request. That kept everything simple, and made it very fault tolerant. A server never sat around waiting for a browser to tell it "OK, I'm done!"
Much e-ink has been spilled (can you even do that?) already discussing the myriad ways in which the web is different today, mostly in the context of either HTML5 or web applications (or both). Most of it is completely true, although there's plenty of hyperbole to go around. One area that has not gotten much attention at all, though, is HTTP.
Well, that's not entirely true. HTTP is actually a fairly large spec, with a lot of exciting moving parts that few people think about, because browsers offer no way to use them from HTML or simply implement them very, very badly. (Did you know that there is a PATCH command defined in HTTP? Really.) A good web services implementation (like we're trying to bake into Drupal 8 as part of the Web Services and Context Core Initiative </shamelessplug>) should leverage those lesser-known parts, certainly, but the modern web has more challenges than just using all of a decades-old spec.
Most significantly, HTTP still treats all URIs as separate, only coincidentally-related resources.
Which brings us to an extremely important challenge of the modern web that is deceptively simple: caching.
The web naturally does a lot of caching. When you request a page from a server, rarely is it pulled directly off of the hard drive at the other end. The file, assuming it is actually a file (this is important), may get cached by the operating system's file system cache, by a reverse proxy cache such as Varnish, by a Content Delivery Network, by an intermediary server somewhere in the middle, and finally by your browser. On a subsequent request, the layer closest to you with an unexpired cache will respond with its cached version.
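What all of those layers key off of are the standard HTTP freshness headers. A response might carry something like this (the values here are illustrative):
HTTP/1.1 200 OK
Date: Thu, 06 Oct 2011 13:00:00 GMT
Last-Modified: Sat, 01 Oct 2011 00:00:00 GMT
Cache-Control: public, max-age=86400
Expires: Fri, 07 Oct 2011 13:00:00 GMT
Any cache along the chain may serve its stored copy until that freshness lifetime (one day, in this sketch) runs out.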
In concept that's great, as it means the least amount of work is done to get what you want. In practice, it doesn't work so well, for a variety of reasons.
For one, that model was built on the assumption of a mostly-static web. All URIs are just physical files sitting on disk that change every once in a while. Of course, we don't live in that web anymore. Most web "pages" are dynamically generated from a content management system of some sort.
For another, that totally sucks during development. Who remembers the days of telling your client "No, really, I did upload a new version of the file. You need to clear your browser cache. Hold down shift and reload. Er, wait, that's the other browser. Hit F5 twice. No, really fast. Faster." Yes, it sucked. There are ways to configure the HTTP headers to not cache files, but that is a pain (how many web developers know how to mess with Apache .htaccess files?), and you have to remember to turn that off for production or you totally hose performance. Even now, Drupal appends junk characters to the end of CSS URLs just to bypass this sort of caching.
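That trick, for the record, is nothing fancier than making the URI look new. A hypothetical example (the token itself is meaningless):
<link rel="stylesheet" href="styles.css?m8x2kq" />
<!-- Change the junk token whenever the file changes, and every cache in the chain treats it as a brand new resource. -->
It works, but it is a workaround for caching rather than a use of it.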
Finally, there are the browsers. Their handling of HTTP cache headers (which are surprisingly complex) has historically not been all that good. What's more, in many cases the browser will simply bypass its own cache and still check the network for a new version.
Now, normally, that's OK. The HTTP spec says, and most browsers obey, that when requesting a resource it already has an older cached copy of, a browser should include the last-updated date of its version of the file in the request, saying in essence "I want file foo.png; my copy is from October 1st." The server can then respond with either a 304 Not Modified ("Yep, that's still the right one") or a 200 OK ("Dude, that's so old, here's the new one"). The 304 response saves resending the file, but doesn't help with the overhead of the HTTP request itself. That request is not cheap, especially on high-latency mobile networks, and especially when browsers refuse to have more than 4-6 requests outstanding.
Now look at WhiteHouse.gov: 95 HTTP requests for the front page... nearly all of them 304 Not Modified (assuming you've hit the page at least once). ESPN.com: 95 requests, again mostly 304 Not Modified. Forbes.com: over 200.
These are not sites built by fly-by-night hackers. These are high-end professional sites whose teams do know how to do things "right". And the page is not actually "done" until all of those requests go out and complete, just in case something changed. The amount of sheer waste involved is utterly mind-boggling. It's the same old polling problem on a distributed scale.
The underlying problem, of course, is that a web page is no longer a single resource that makes use of one or two other resources. A web page -- not a web application or anything so fancy, but just an ordinary, traditional web page -- is the product of dozens of different resources at different URIs. And our caching strategies simply don't know how to handle that.
A couple of possible workarounds for this issue exist, and are used to a greater or lesser extent.
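The best known is the HTML5 Application Cache, or "appcache": the page declares a Manifest file, and the browser caches every resource listed in it, wholesale, for offline use. A minimal sketch of the pattern, with illustrative file names:
<html manifest="site.manifest">
where site.manifest is a plain text file along these lines:
CACHE MANIFEST
# version 2011-10-06
images/background.png
styles.css
ui/widgets.js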
There are a number of issues with the Manifest file, however, something that most people acknowledge. They mostly boil down to it being too aggressive. For instance, you cannot avoid the HTML page itself also being cached. An appcache-using resource will never be redownloaded from the web unless the Manifest file itself changes (and the browser redownloads a new version of it), in which case everything will be downloaded again.
I ran into this problem while trying to write a Manifest module for Drupal. The notion was to build a Manifest file on the fly that contained all of the 99%-static resources (theme-level image files, UI widgets, etc.) so that those could be skipped on subsequent page loads, since they practically never change, and avoid all of that HTTP overhead. Unfortunately, as soon as you add a Manifest file to an HTML page, that page is permanently cached offline and not rechecked. Given that Drupal is by nature a dynamic CMS where page content can change regularly for user messages and such, that's a rather fatal flaw that I have been unable to work around.
So what do we do? Remember, up at the top of this article we noted that most web "pages" these days (which are still the majority of the web and will remain so for a long time) are dynamically built by a CMS. CMSes these days are pretty darned smart about what it is they are serving up. If a file has changed, they either know or can easily find out by checking the file modification date locally, on the server, without any round-trip connection at all. We can, and should, leverage that.
I would propose instead that we allow and empower the application level on the server to take a more active and controlling role in cache management. Rather than an all-or-nothing Manifest file, which is in practice only useful for single-page, full-on applications, we should allow the page to have more fine-grained control over how the browser treats resource files.
There are many forms such support could take. As a simple starting point, I will offer a reuse of the link tag:
<!-- Indicates that this image will be used by this page somewhere, and its last modified date is 1pm UTC on 6 October. If the browser has a cached version already, it knows whether or not it needs to request a new version without having to send out another HTTP request. -->
<link href="background.png" cache="last-modified:2011-10-06T13:00:00" />
<!-- It works for stylesheets, too. What's more, we can tell the browser to cache that file for a day. The value here would override the normal HTTP expires header of that file, just as a meta http-equiv tag would were it an HTML page. -->
<link href="styles.css" rel="stylesheet" cache="last-modified:2011-10-06T13:00:00; expire:2011-10-07T13:00:00" />
<!-- By specifying related pages, we can tell the browser that the user will probably go there next so go ahead and start loading that page. Paged news stories could be vastly sped up with this approach. This is not the old "web accelerator" approach, as that tried to just blanket-download everything and played havoc with web apps. -->
<link href="page2.html" rel="next" cache="last-modified:2011-10-06T13:00:00; fetch:prefetch" />
<!-- Not only do we tell the browser whether or not it needs to be cached, but we tell the browser that the file will not be used immediately when the page loads. Perhaps it's a rollover image, so it needs to be loaded before the user rolls over something, but that can happen after all of the immediately-visible images are downloaded. Alternatively, this could be a numeric priority for even more fine-grained control over download order. -->
<link href="hoverimage.png" cache="last-modified:2011-10-06T13:00:00; fetch:defer" />
<!-- If there's too many resources in use to list individually, link to a central master list. Any file listed here is treated as if it were listed individually, and should include the contents of the cache attribute. Normal caching rules apply for this file, including setting an explicit cache date for it. Naturally multiple of these files could be referenced in a single page, whereas there can be only a single Manifest file. The syntax of this file I leave for a later discussion. -->
<link href="resources.list" rel="resources" />
In practice, a CMS knows what those values should be. It can simply tell the browser, on demand, what other resources it is going to need, when they were last updated, the smartest order in which to download them, even what to prefetch based on where the user is likely to go next.
Imagine if, for instance, a Drupal site could dynamically build a resource file listing all image files used in a theme, or provided by a module. Those are usually a large number of very small images. So just build that list once and store it, then include that reference in the page header. The browser can see that, and know the full list of what it will need, when each file was last updated, even how soon it will need them. If one is not used on a particular page, that's OK. The browser will still load it, just like with a Manifest file. On subsequent page loads, it knows it will still need those files, but it also knows that its versions are already up to date, and leaves it at that. When it needs those images, it just loads them out of its local cache.
And when a resource does change, the page tells the browser about it immediately, so that it doesn't have to guess whether there is a new version. It already knows, and can act accordingly to download just the new files it needs.
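Under the proposed syntax, "telling the browser" is nothing more than the server emitting a newer timestamp on the next page load:
<!-- The timestamp has changed, so the browser knows its cached copy is stale -- no probing request needed. -->
<link href="background.png" cache="last-modified:2011-11-01T09:00:00" />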
Any CMS could do the exact same thing. A really good one could even dynamically track a user session (anonymously) to see what the most likely next pages are for a given user, and adjust its list of probable next pages over time so that the browser knows what's coming.
Naturally, all of this assumes that a page is coming from a CMS or web app framework of some sort (Drupal, Symfony2, SharePoint, Joomla, whatever). In practice, that's a pretty good assumption these days. And if not, a statically coded page just omits the cache attribute and the browser behaves normally, as it does today, asking "are we there yet?" over and over again and getting told by the server "304 No, Not Yet".
There are likely many details I am missing here, but I believe the concept is sound. Modern web pages are dynamic on the server side, not just on the client side. Let the server give the browser the information it needs to be smart about caching. Don't go all-or-nothing; that is fine for a pure app, but most sites are not pure apps. Server-side developers are smart cookies. Let them help the browser be faster and smarter.
I now don the obligatory flame-retardant suit. (And if you think this is actually a good idea, someone point me to where to propose it besides my blog!)