Ted

  博客园 :: 首页 :: 博问 :: 闪存 :: 新随笔 :: 联系 :: 订阅 订阅 :: 管理 ::

英文版:How To Optimize Your Site With HTTP Caching

I’ve been on a web tweaking kick lately: how to speed up your javascriptgzip files with your server, and now how to set up caching. But the reason is simple: site performance is a feature.

For web sites, speed may be feature #1. Users hate waiting, we get frustrated by buffering videos and pages that pop together as images slowly load. It’s a jarring (aka bad) user experience. Time invested in site optimization is well worth it, so let’s dive in.

What is Caching?

Caching is a great example of the ubiquitous time-space tradeoff in programming. You can save time by using space to store results.

In the case of websites, the browser can save a copy of images, stylesheets, javascript or the entire page. The next time the user needs that resource (such as a script or logo that appears on every page), the browser doesn’t have to download it again. Fewer downloads means a faster, happier site.

Here’s a quick refresher on how a web browser gets a page from the server:

HTTP_request.png

1. Browser: Yo! You got index.html?
2. Server: (Looking it up…)
3. Sever: Totally, dude! It’s right here!
4. Browser: That’s rad, I’m downloading it now and showing the user.

(The actual HTTP protocol may have minor differences; see Live HTTP Headers for more details.)

Caching’s Ugly Secret: It Gets Stale

Caching seems fun and easy. The browser saves a copy of a file (like a logo image) and uses this cached (saved) copy on each page that needs the logo. This avoids having to download the image ever again and is perfect, right?

Wrongo. What happens when the company logo changes? Amazon.com becomes Nile.com? Google becomes Quadrillion?

We’ve got a problem. The shiny new logo needs to go with the shiny new site, caches be damned.

So even though the browser has the logo, it doesn’t know whether the image can be used. After all, the file may have changed on the server and there could be an updated version.

So why bother caching if we can’t be sure if the file is good? Luckily, there’s a few ways to fix this problem.

Caching Method 1: Last-Modified

One fix is for the server to tell the browser what version of the file it is sending. A server can return a Last-modified date along with the file (let’s call it logo.png), like this:

Last-modified: Fri, 16 Mar 2007 04:00:25 GMT
File Contents (could be an image, HTML, CSS, Javascript...)

Now the browser knows that the file it got (logo.png) was created on Mar 16 2007. The next time the browser needs logo.png, it can do a special check with the server:

HTTP-caching-last-modified_1.png

1. Browser: Hey, give me logo.png, but only if it’s been modified since Mar 16, 2007.
2. Server: (Checking the modification date)
3. Server: Hey, you’re in luck! It was not modified since that date. You have the latest version.
4. Browser: Great! I’ll show the user the cached version.

Sending the short “Not Modified” message is a lot faster than needing to download the file again, especially for giant javascript or image files. Caching saves the day (err… the bandwidth).

Caching Method 2: ETag

Comparing versions with the modification time generally works, but could lead to problems. What if the server’s clock was originally wrong and then got fixed? What if daylight savings time comes early and the server isn’t updated? The caches could be inaccurate.

ETags to the rescue. An ETag is a unique identifier given to every file. It’s like a hash or fingerprint: every file gets a unique fingerprint, and if you change the file (even by one byte), the fingerprint changes as well.

Instead of sending back the modification time, the server can send back the ETag (fingerprint):

ETag: ead145f
File Contents (could be an image, HTML, CSS, Javascript...)

The ETag can be any string which uniquely identifies the file. The next time the browser needs logo.png, it can have a conversation like this:

HTTP_caching_if_none_match.png

1. Browser: Can I get logo.png, if nothing matches tag “ead145f”?
2. Server: (Checking fingerprint on logo.png)
3. Server: You’re in luck! The version here is “ead145f”. It was not modified.
4. Browser: Score! I’ll show the user my cached version.

Just like last-modifed, ETags solve the problem of comparing file versions, except that “if-none-match” is a bit harder to work into a sentence than “if-modified-since”. But that’s my problem, not yours. ETags work great.

Caching Method 3: Expires

Caching a file and checking with the server is nice, except for one thing: we are still checking with the server. It’s like analyzing your milk every time you make cereal to see whether it’s safe to drink. Sure, it’s better than buying a new gallon each time, but it’s not exactly wonderful.

And how do we handle this milk situation? With an expiration date!

If we know when the milk (logo.png) expires, we keep using it until that date (and maybe a few days longer, if you’re a college student). As soon as it goes expires, we contact the server for a fresh copy, with a new expiration date. The header looks like this:

Expires: Tue, 20 Mar 2007 04:00:25 GMT
File Contents (could be an image, HTML, CSS, Javascript...)

In the meantime, we avoid even talking to the server if we’re in the expiration period:

HTTP_caching_expires.png

There isn’t a conversation here; the browser has a monologue.

1. Browser: Self, is it before the expiration date of Mar 20, 2007? (Assume it is).
2. Browser: Verily, I will show the user the cached version.

And that’s that. The web server didn’t have to do anything. The user sees the file instantly.

Caching Method 4: Max-Age

Oh, we’re not done yet. Expires is great, but it has to be computed for every date. Themax-age header lets us say “This file expires 1 week from today”, which is simpler than setting an explicit date.

Max-Age is measured in seconds. Here’s a few quick second conversions:

  • 1 day in seconds = 86400
  • 1 week in seconds = 604800
  • 1 month in seconds = 2629000
  • 1 year in seconds = 31536000 (effectively infinite on internet time)

Bonus Header: Public and Private

The cache headers never cease. Sometimes a server needs to control when certain resources are cached.

  • Cache-control: public means the cached version can be saved by proxies and other intermediate servers, where everyone can see it.
  • Cache-control: private means the file is different for different users (such as their personal homepage). The user’s private browser can cache it, but not public proxies.
  • Cache-control: no-cache means the file should not be cached. This is useful for things like search results where the URL appears the same but the content may change.

However, be wary that some cache directives only work on newer HTTP 1.1 browsers. If you are doing special caching of authenticated pages then read more about caching.

Ok, I’m Sold: Enable Caching

First, make sure Apache has mod_headers and mod_expires enabled:

 ... list your current modules... apachectl -t -D DUMP_MODULES ... enable headers and expires if not in the list above... a2enmod headers a2enmod expires 

The general format for setting headers is

  • File types to match
  • Header / Expiration to set

A general tip: the less a resource changes (images, pdfs, etc.) the longer you should cache it. If it never changes (every version has a different URL) then cache it for as long as you can (i.e. a year)!

One technique: Have a loader file (index.html) which is not cached, but that knows the locations of the items which are cached permanently. The user will always get the loader file, but may have already cached the resources it points to.

The following config settings are based on the ones at AskApache.

Seconds Calculator

All the times are given in seconds (A0 = Access + 0 seconds).

Using Expires Headers

 ExpiresActive On ExpiresDefault A0 # 1 YEAR - doesn't change often <FilesMatch "\.(flv|ico|pdf|avi|mov|ppt|doc|mp3|wmv|wav)$"> ExpiresDefault A29030400 </FilesMatch> # 1 WEEK - possible to be changed, unlikely <FilesMatch "\.(jpg|jpeg|png|gif|swf)$"> ExpiresDefault A604800 </FilesMatch> # 3 HOUR - core content, changes quickly <FilesMatch "\.(txt|xml|js|css)$"> ExpiresDefault A10800 </FilesMatch> 

Again, if you know certain content (like javascript) won’t be changing often, have “js” files expire after a week.

Using max-age headers:

 # 1 YEAR <FilesMatch "\.(flv|ico|pdf|avi|mov|ppt|doc|mp3|wmv|wav)$"> Header set Cache-Control "max-age=29030400, public" </FilesMatch> # 1 WEEK <FilesMatch "\.(jpg|jpeg|png|gif|swf)$"> Header set Cache-Control "max-age=604800, public" </FilesMatch> # 3 HOUR <FilesMatch "\.(txt|xml|js|css)$"> Header set Cache-Control "max-age=10800" </FilesMatch> # NEVER CACHE - notice the extra directives <FilesMatch "\.(html|htm|php|cgi|pl)$"> Header set Cache-Control "max-age=0, private, no-store, no-cache, must-revalidate" </FilesMatch> 

Final Step: Check Your Caching

To see whether your files are cached, do the following:

  • Online: Examine your site in the cacheability query (green means cacheable)
  • In Browser: Use FireBug or Live HTTP Headers to see the HTTP response (304 Not Modified, Cache-Control, etc.). In particular, I’ll load a page and use Live HTTPHeaders to make sure no packets are being sent to load images, logos, and other cached files. If you press ctrl+refresh the browser will force a reload of all files.

Read more about caching, or the HTTP header fields. Caching doesn’t help with the initial download (that’s what gzip is for), but it makes the overall site experience much better.

Remember: Creating unique URLs is the simplest way to caching heaven. Have fun streamlining your site!

 

中文版:利用HTTP Cache来优化网站

 

对于网站来说,速度是第一位的。用户总是讨厌等待,面对加载的Video和页面,是非常糟糕的用户体验。所以如何利用Cache来优化网站,值得深入研究。

 

什么是缓存?

 

缓存是一个到处都存在的用空间换时间的例子。通过使用多余的空间,我们能够获取更快的速度。用户在浏览网站的时候,浏览器能够在本地保存网站中的图片或者其他文件的副本,这样用户再次访问该网站的时候,浏览器就不用再下载全部的文件,减少了下载量意味着提高了页面加载的速度。

 

下面这个图例说明了浏览器和服务器之间如何进行交互。

 

缓存的缺点

 

缓存非常有用,但是也带来了一定的缺陷。当我们的网站发生了更新的时候,比如说Logo换了,浏览器本地仍保存着旧版本的Logo,那么浏览器如何来确定使用本地文件还是使用服务器上的新文件?下面来介绍几种判断的方法。

 

Caching Method 1:Last-Modified

 

服务器为了通知浏览器当前文件的版本,会发送一个上次修改时间的标签,例如:

 

Last-modified: Fri, 16 Mar 2007 04:00:25 GMT

File Contents (could be an image, HTML, CSS, Javascript...)

 

 

这样浏览器就知道他收到的这个文件创建时间,在后续的请求中,浏览器会按照下面的规则进行验证:

1、浏览器:Hey,我需要logo.png这个文件,如果是在  Fri, 16 Mar 2007 04:00:25 GMT 之后修改过的,请发给我。

2、服务器:(检查文件的修改时间)

3、服务器:Hey,这个文件在那个时间之后没有被修改过,你已经有最新的版本了。

4、浏览器:太好了,那我就显示给用户了。

 

在这种情况下,服务器仅仅返回了一个304的响应头,减少了响应的数据量,提高了响应的速度。

 

Caching Method 2: ETag

 

通常情况下,通过修改时间来比较文件是可行的。但是在一些特殊情况,例如服务器的时钟发生了错误,服务器时钟进行修改,夏时制DST到来后服务器时间没有及时更新,这些都会引起通过修改时间比较文件版本的问题。

 

ETag可以用来解决这种问题。ETag是一个文件的唯一标志符。就像一个哈希或者指纹,每个文件都有一个单独的标志,只要这个文件发生了改变,这个标志就会发生变化。

 

服务器返回ETag标签:

 

ETag: ead145f

File Contents (could be an image, HTML, CSS, Javascript...)

 

接下来的访问顺序如下图所示:

 

 

1、浏览器:Hey,我需要Logo.png这个文件,有没有不匹配“ead145f”这个串的

2、服务器:(检查ETag...)

3、服务器:Hey,我这里的版本也是"ead145f",你已经是最新的版本了

4、浏览器:好,那就可以使用本地缓存了

 

如同 Last-modified 一样,ETag 解决了文件版本比较的问题。只不过 ETag 的级别比 Last-Modified 高一些。

 

Caching Method 3:Expires

 

缓存一个文件,并且与服务器确认版本的方式非常好,但是仍有一个缺点,我们必须连接服务器。每次使用前都进行一次比较,这种方法很安全,但还不是最好的。我们可以使用 Expiration Date 来减少这种请求。

 

就像我们用牛奶来煮麦片一样,每次喝之前都要检查一下牛奶是否安全。但是如果我们知道牛奶的过期时间,我们就可以在过期之前,直接使用而不用再送去检查。一旦超过了过期时间,我们再去买一份新的回来。服务器返回的时候,会带上这份数据的过期时间:

 

Expires: Tue, 20 Mar 2007 04:00:25 GMT

File Contents (could be an image, HTML, CSS, Javascript...)

 

 

这样,在过期之前,我们就避免了和服务器之间的连接。浏览器只需要自己判断手中的材料是否过期就可以了,完全不需要增加服务器的负担。

 

Caching Method 4:Max-age

 

Expires的方法很好,但是我们每次都得算一个精确的时间。max-age 标签可以让我们更加容易的处理过期时间。我们只需要说,这份资料你只能用一个星期就可以了。

 

Max-age 使用秒来计量,下面是一些常用的单位:

1 days in seconds = 86400

1 week in seconds = 604800

1 month in seconds = 2629000

1 year in seconds = 31536000

 

额外的标签

 

缓存标签永远不会停止工作,但是有时候我们需要对已经缓存的内容进行一些控制。

 

     Cache-control: public 表示缓存的版本可以被代理服务器或者其他中间服务器识别。

     Cache-control: private 意味着这个文件对不同的用户是不同的。只有用户自己的浏览器能够进行缓存,公共的代理服务器不允许缓存。

     Cache-control: no-cache 意味着文件的内容不应当被缓存。这在搜索或者翻页结果中非常有用,因为同样的URL,对应的内容会发生变化。

 

注意:有些标签只是在支持HTTP/1.1的浏览器上可用,如果想要了解更多,那么推荐RFC2616以及Cache docs

posted on 2011-12-24 23:41  wufawei  阅读(379)  评论(0编辑  收藏  举报