Scaling memcached at Facebook

by Paul Saab on Friday, December 12, 2008 at 1:20pm
If you've read anything about scaling large websites, you've probably heard about memcached. memcached is a high-performance, distributed memory object caching system. Here at Facebook, we're likely the world's largest user of memcached. We use memcached to alleviate database load. memcached is already fast, but we need it to be faster and more efficient than most installations. We use more than 800 servers supplying over 28 terabytes of memory to our users. Over the past year, as Facebook's popularity has skyrocketed, we've run into a number of scaling issues. This ever-increasing demand has required us to make modifications to both our operating system and memcached to achieve the performance that provides the best possible experience for our users.

Because we have thousands and thousands of computers, each running a hundred or more Apache processes, we end up with hundreds of thousands of TCP connections open to our memcached processes. The connections themselves are not a big problem, but the way memcached allocates memory for each TCP connection is. memcached uses a per-connection buffer to read and write data out over the network. When you get into hundreds of thousands of connections, this adds up to gigabytes of memory, memory that could be better used to store user data. To reclaim this memory for user data, we implemented a per-thread shared connection buffer pool for TCP and UDP sockets. This change enabled us to reclaim multiple gigabytes of memory per server.
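
To make the idea concrete, here is a minimal sketch in C of a per-thread buffer pool. The names (conn_buf, buf_pool, pool_get, pool_put) and the 16 KB buffer size are illustrative assumptions, not memcached's actual data structures. Each worker thread owns its own free list, so no locking is needed, and a connection holds a buffer only while it is actively being serviced:

```c
/* Illustrative sketch of a per-thread connection buffer pool (not the
 * actual memcached code).  Each worker thread owns one pool, so access
 * needs no locking; idle connections hold no buffer at all. */
#include <stdlib.h>

#define BUF_SIZE (16 * 1024)          /* illustrative buffer size */

struct conn_buf {
    struct conn_buf *next;
    char data[BUF_SIZE];
};

struct buf_pool {
    struct conn_buf *free_list;       /* owned by exactly one thread */
};

/* Borrow a buffer when a connection becomes readable or writable. */
static struct conn_buf *pool_get(struct buf_pool *p)
{
    struct conn_buf *b = p->free_list;
    if (b)
        p->free_list = b->next;
    else
        b = malloc(sizeof(*b));       /* grow the pool on demand */
    return b;
}

/* Return the buffer as soon as the request has been handled, so the
 * memory is immediately available to the next active connection. */
static void pool_put(struct buf_pool *p, struct conn_buf *b)
{
    b->next = p->free_list;
    p->free_list = b;
}
```

With hundreds of thousands of mostly idle connections, only the handful being serviced at any instant consume buffer memory, which is where the multiple gigabytes per server come back.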

Although we improved memory efficiency for TCP, we also moved to UDP for get operations to reduce network traffic and to implement application-level flow control for multi-gets (gets of hundreds of keys in parallel). We discovered that under load on Linux, UDP performance was downright horrible. The cause was considerable contention on the UDP socket lock when transmitting through a single socket from multiple threads. Fixing the kernel by breaking up the lock is not easy. Instead, we used separate UDP sockets for transmitting replies (one of these reply sockets per thread). With this change, we were able to deploy UDP without compromising performance on the backend.
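
A rough sketch of the per-thread reply-socket idea in C follows. The thread-local variable and function names are illustrative, and the details are simplified; in particular, replies do not need to come from the well-known port because the client matches them to requests by an ID carried in the UDP payload.

```c
/* Illustrative sketch: each worker thread transmits UDP replies through
 * its own socket, so the kernel's per-socket transmit lock is never
 * contended between threads.  Requests can still arrive on one shared,
 * well-known port; the client matches replies by a request ID in the
 * payload rather than by source port. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

static __thread int reply_fd = -1;    /* one transmit socket per thread */

/* Called once when a worker thread starts. */
static int reply_socket_init(void)
{
    reply_fd = socket(AF_INET, SOCK_DGRAM, 0);
    return reply_fd < 0 ? -1 : 0;
}

/* Send a reply back to the client that issued the request. */
static ssize_t send_reply(const struct sockaddr_in *client,
                          const void *msg, size_t len)
{
    return sendto(reply_fd, msg, len, 0,
                  (const struct sockaddr *)client, sizeof(*client));
}
```

Because each thread owns its transmit socket, the kernel's per-socket lock is only ever taken by one thread, which sidesteps the contention without any kernel changes.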

Another issue we saw in Linux is that under load, one core would get saturated doing network soft interrupt handling, throttling network IO. In Linux, a network interrupt is delivered to one of the cores, and consequently all receive-side soft interrupt processing happens on that one core. Additionally, we saw an excessively high rate of interrupts for certain network cards. We solved both of these problems by introducing “opportunistic” polling of the network interfaces. In this model, we do a combination of interrupt-driven and polling-driven network IO. We poll the network interface any time we enter the network driver (typically to transmit a packet) and from the process scheduler’s idle loop. In addition, we still take interrupts (to keep latencies bounded), but far fewer of them (typically by setting interrupt coalescing thresholds aggressively). Since we do network transmission on every core and poll for network IO from the scheduler’s idle loop, we distribute network processing evenly across all cores.
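
The polling changes themselves live inside the kernel and driver, so a short snippet can't capture them, but the interrupt-coalescing half is just device configuration. As a rough illustration, the following C program does the equivalent of `ethtool -C eth0 rx-usecs 200 rx-frames 64`; the interface name and threshold values here are made-up examples, not our production settings.

```c
/* Sketch: raise NIC interrupt-coalescing thresholds from user space via
 * the ethtool ioctl so the card delivers fewer, batched interrupts.
 * Interface name and values are illustrative only. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct ethtool_coalesce ec;
    memset(&ec, 0, sizeof(ec));
    ec.cmd = ETHTOOL_GCOALESCE;                 /* read current settings first */

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ec;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("ETHTOOL_GCOALESCE"); return 1; }

    ec.cmd = ETHTOOL_SCOALESCE;
    ec.rx_coalesce_usecs = 200;                 /* wait up to 200us before interrupting */
    ec.rx_max_coalesced_frames = 64;            /* or until 64 frames have arrived */

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("ETHTOOL_SCOALESCE"); return 1; }

    close(fd);
    return 0;
}
```

With interrupts made rare, latency is kept bounded by the occasional interrupt while the bulk of receive processing happens in the polling paths spread across all cores.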

Finally, as we started deploying 8-core machines, our testing uncovered new bottlenecks. First, memcached's stat collection relied on a global lock. A nuisance with 4 cores, with 8 cores the lock accounted for 20-30% of CPU usage. We eliminated this bottleneck by collecting stats per-thread and aggregating the results on demand. Second, we noticed that as we increased the number of threads transmitting UDP packets, performance decreased. We found significant contention on the lock that protects each network device’s transmit queue. Packets are enqueued for transmission and dequeued by the device driver. This queue is managed by Linux’s “netdevice” layer, which sits between IP and the device drivers. Packets were added and removed from the queue one at a time, causing significant contention. One of our engineers changed the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets. This change amortizes the cost of the lock acquisition over many packets, reducing lock contention significantly and allowing us to scale memcached to 8 threads on an 8-core system.
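
As an illustration of the per-thread stats approach, here is a minimal sketch in C; the counter names and the structure are made up for the example and are not memcached's real stats code. Each worker updates only its own counters on the hot path, and the relatively rare "stats" request walks all threads and sums them:

```c
/* Illustrative sketch of per-thread statistics with on-demand aggregation.
 * Hot path touches only the calling thread's slot (no global lock);
 * aggregation happens only when stats are requested. */
#include <stdint.h>

#define MAX_THREADS 8

struct thread_stats {
    uint64_t get_cmds;
    uint64_t get_hits;
    uint64_t get_misses;
    char pad[64 - 3 * sizeof(uint64_t)];   /* reduce false sharing between threads */
};

static struct thread_stats stats[MAX_THREADS];

/* Hot path: called by worker thread `tid` with no locking at all. */
static inline void stat_record_get(int tid, int hit)
{
    stats[tid].get_cmds++;
    if (hit)
        stats[tid].get_hits++;
    else
        stats[tid].get_misses++;
}

/* Cold path: aggregate across threads only when a client asks for stats. */
static void stats_aggregate(struct thread_stats *out, int nthreads)
{
    out->get_cmds = out->get_hits = out->get_misses = 0;
    for (int i = 0; i < nthreads; i++) {
        out->get_cmds   += stats[i].get_cmds;
        out->get_hits   += stats[i].get_hits;
        out->get_misses += stats[i].get_misses;
    }
}
```

The same batching principle applies to the transmit-queue fix: paying a lock acquisition once per batch of packets instead of once per packet is what lets throughput keep scaling with thread count.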

Since we’ve made all these changes, we have been able to scale memcached to handle 200,000 UDP requests per second with an average latency of 173 microseconds. The total throughput achieved is 300,000 UDP requests/s, but the latency at that request rate is too high to be useful in our system. This is an amazing increase from 50,000 UDP requests/s using the stock version of Linux and memcached.

We’re hoping to get our changes integrated into the official memcached repository soon, but until that happens, we’ve decided to release all our changes to memcached on github.
    • Alan Formy-Duval Linux Forever!!! :) (Though I'm also partial to Solaris)
      December 17, 2008 at 10:59am
    • Alan Formy-Duval I agree it's great to be able to follow these happenings. One thing I'm curious about is whether CPU types are having any impact. I read above they're using AMD processors. Are they still using AMD, or is it a mix of AMD and Intel, or a total switchover? I'd like to get some real insight into which works better in this type of environment and why (HyperTransport, true multi-core, etc.).
      December 17, 2008 at 11:09am
    • Alex Kompel Great effort, but you are still trying to scale memcached vertically, which is a battle you are ultimately going to lose. Perhaps you should also look into partitioning caches. Connection pooling on the host level will also help (perhaps even serializing memcached requests from Apache at the host level into a single TCP connection).
      December 17, 2008 at 11:40am
    • Neal Mueller way to get /.'d Paul.
      December 17, 2008 at 2:11pm
    • Logan Lindquist good job FB
      December 17, 2008 at 3:17pm
    • Changhao Jiang This is awesome!
      December 17, 2008 at 4:34pm
    • Timmothy Posey What about cloud/grid computing?
      December 17, 2008 at 8:11pm
    • Federico Ceratto Great job!
      December 18, 2008 at 3:15am
    • Mohit Aggarwal You guys are awesome!! Many thanks.
      December 18, 2008 at 3:45am
    • Gerhard Mack Are you guys working with the Linux kernel folks to deal with Linux issues you have discovered?
      December 18, 2008 at 8:52am
    • Luke Shepard Great post Paul. You guys rock.
      December 18, 2008 at 9:34am
    • John Yee Great job with the optimizing and the explanation. Most importantly, thank you for releasing your hard work for others to use!
      December 18, 2008 at 10:31am
    • Oz Chan
      It sounds like the main cause of all this is the high number of connections -- each machine is running hundreds of Apache processes, which in turn connect to each memcached instance. I don't know PHP well enough, but I am curious to know whether it's a PHP/Apache limitation that you can't run just a few Apache processes to utilize all machine resources? I assume you guys are using the (prefork) MPM in Apache.

      With that many connections, it will take more memory to manage TCP vs UDP. Furthermore, as connections are not shared between processes, it is harder to keep them alive or detect stale connections.

      If the prefork MPM is required, would it be better to introduce a middle layer (app layer) to handle connection-related challenges, so each process only connects to one proxy that talks to the other logic (e.g. memcached), instead of changing everything to UDP and changing memcached's buffering?

      Regardless, I have to say these are some amazing hacks; sometimes I dream of doing this myself :)
      December 18, 2008 at 10:39am
    • Nathan Trujillo
      it's always very cool to see how the big guys do it, and what's even more cool is that it is open sauce.

      kudos.

      And yes, I am with Dagas well, but hopefully they will put these tweaks into a configure option, so we have the option of not using them.

      Now I guess I have to start/join the group of people who are asking why you chose Linux over FreeBSD :)
      December 18, 2008 at 2:06pm
    • Manish Shrestha wow...awesomeness!!!
      December 22, 2008 at 11:57am
    • Nu'man Harun wow~~! you are so awesome that I don't really understand most of it but I'm sure this is impressive
      December 23, 2008 at 7:06am
    • Moshe Kaplan
      Hi,

      Do you use, or are you considering using, InfiniBand as a backbone solution to reduce the CPU utilization caused by network processing?

      Best,
      Moshe
      December 26, 2008 at 4:46pm
    • Chris Thompson This amount of caching/traffic is just insane. I would kill to get to work with this stuff.
      January 2, 2009 at 12:51pm
    • Christopher Dale Fogleman I'm an audio engineer looking for work, I have done this for 20 years and would like to make a living with my degree. I've done many projects from bands including my own music, foley work, radio bits, anything that has sound. If anyone has an opening please contact chrisfogleman@yahoo.com. thank you
      January 11, 2009 at 1:15pm
    • Tuhin Barua how do you distribute keys or partition your data across 800 servers?
      do you use any consistent hashing algo like ketama?

      can't wait for your php client :)
      January 21, 2009 at 10:56pm
    • Malcolm Derksen wow geek
      January 28, 2009 at 9:26pm
    • Trivuz Alam just great :)
      February 10, 2009 at 8:28am
    • Rayhan Chowdhury great :)

      Facebook is promoting open source. Interested to know what tweaks you have applied to MySQL.
      February 10, 2009 at 9:04am
    • Jay Lazy Yukes sickkk
      February 18, 2009 at 12:53pm
    • Yingkuan Liu @Cao Li
      >>"under load" appears several times in the article.
      >>is it "heavy load" or "light load"?

      No, "under load" here means the situation where, on a multi-core system, one core gets saturated serving requests while the others stay idle.
      February 22, 2009 at 6:06pm
    • Eliot Shepard Thanks. Where has the repository gone?
      February 23, 2009 at 10:59am
    • Benjamin Herrenschmidt
      Hi! Reading this a bit late...

      Regarding rx network interrupts, recent kernels now support multiple rx queues. With appropriate network adapters, you can then get a queue + a separate interrupt per core, with the adapter hashing incoming packets to try to keep a given flow on one core.
      March 9, 2009 at 9:10pm
    • Matthew Sinclair also having PHP running hogs a lot of memory so why use it!???
      April 11, 2009 at 2:38pm
    • Francesco Vollero Where can we find the kernel patches?

      @Matthew: What language should Facebook use instead, in your opinion?
      April 29, 2009 at 4:07pm
    • Jeremy Lemaire Nice work. Have these changes been integrated into the official memcached repository yet, or do I still need to pull them from github?
      May 6, 2009 at 9:49am
    • Mathieu Richardoz Thank you for sharing, this information should come in handy.
      August 5, 2009 at 7:23am
    • Daniel Yuan batch, reduce lock contention
      April 20 at 4:19pm
    • Moronkreacionz In @Paul: the github link is broken
      - http://github.com/fbmarc/facebook-memcached
      can you provide the most recent github repo for memcached+udp, and also where can I find the FB php memcache client?
      ~ Vj
      April 29 at 5:07am
    • Moronkreacionz In Sorry Paul, found the FB memcached+udp link http://github.com/facebook/memcached
      April 29 at 5:09am
    • Sathiya N Sundararajan this is very valuable information/experience for any team building large-scale distributed applications.
      November 8 at 9:38am