Scalable Web Architecture and Distributed Systems
Reposted from: http://aosabook.org/en/distsys.html
Open source software has become a fundamental building block for some of the biggest websites. And as those websites have grown, best practices and guiding principles around their architectures have emerged. This chapter seeks to cover some of the key issues to consider when designing large websites, as well as some of the building blocks used to achieve these goals.
This chapter is largely focused on web systems, although some of the material is applicable to other distributed systems as well.
1.1. Principles of Web Distributed Systems Design
What exactly does it mean to build and operate a scalable web site or application? At a primitive level it's just connecting users with remote resources via the Internet—the part that makes it scalable is that the resources, or access to those resources, are distributed across multiple servers.
Like most things in life, taking the time to plan ahead when building a web service can help in the long run; understanding some of the considerations and tradeoffs behind big websites can result in smarter decisions at the creation of smaller web sites. Below are some of the key principles that influence the design of large-scale web systems:
- Availability: The uptime of a website is absolutely critical to the reputation and functionality of many companies. For some of the larger online retail sites, being unavailable for even minutes can result in thousands or millions of dollars in lost revenue, so designing their systems to be constantly available and resilient to failure is both a fundamental business and a technology requirement. High availability in distributed systems requires the careful consideration of redundancy for key components, rapid recovery in the event of partial system failures, and graceful degradation when problems occur.
- Performance: Website performance has become an important consideration for most sites. The speed of a website affects usage and user satisfaction, as well as search engine rankings, a factor that directly correlates to revenue and retention. As a result, creating a system that is optimized for fast responses and low latency is key.
- Reliability: A system needs to be reliable, such that a request for data will consistently return the same data. In the event the data changes or is updated, then that same request should return the new data. Users need to know that if something is written to the system, or stored, it will persist and can be relied on to be in place for future retrieval.
- Scalability: When it comes to any large distributed system, size is just one aspect of scale that needs to be considered. Just as important is the effort required to increase capacity to handle greater amounts of load, commonly referred to as the scalability of the system. Scalability can refer to many different parameters of the system: how much additional traffic can it handle, how easy is it to add more storage capacity, or even how many more transactions can be processed.
- Manageability: Designing a system that is easy to operate is another important consideration. The manageability of the system equates to the scalability of operations: maintenance and updates. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, ease of making updates or modifications, and how simple the system is to operate. (I.e., does it routinely operate without failure or exceptions?)
- Cost: Cost is an important factor. This obviously can include hardware and software costs, but it is also important to consider other facets needed to deploy and maintain the system. The amount of developer time the system takes to build, the amount of operational effort required to run the system, and even the amount of training required should all be considered. Cost is the total cost of ownership.
Each of these principles provides the basis for decisions in designing a distributed web architecture. However, they also can be at odds with one another, such that achieving one objective comes at the cost of another. A basic example: choosing to address capacity by simply adding more servers (scalability) can come at the price of manageability (you have to operate an additional server) and cost (the price of the servers).
When designing any sort of web application it is important to consider these key principles, even if it is to acknowledge that a design may sacrifice one or more of them.
1.2. The Basics
When it comes to system architecture there are a few things to consider: what are the right pieces, how these pieces fit together, and what are the right tradeoffs. Investing in scaling before it is needed is generally not a smart business proposition; however, some forethought into the design can save substantial time and resources in the future.
This section is focused on some of the core factors that are central to almost all large web applications: services, redundancy, partitions, and handling failure. Each of these factors involves choices and compromises, particularly in the context of the principles described in the previous section. In order to explain these in detail it is best to start with an example.
Example: Image Hosting Application
At some point you have probably posted an image online. For big sites that host and deliver lots of images, there are challenges in building an architecture that is cost-effective, highly available, and has low latency (fast retrieval).
Imagine a system where users are able to upload their images to a central server, and the images can be requested via a web link or API, just like Flickr or Picasa. For the sake of simplicity, let's assume that this application has two key parts: the ability to upload (write) an image to the server, and the ability to query for an image. While we certainly want the upload to be efficient, we care most about having very fast delivery when someone requests an image (for example, images could be requested for a web page or other application). This is very similar functionality to what a web server or Content Delivery Network (CDN) edge server (a server a CDN uses to store content in many locations so content is geographically/physically closer to users, resulting in faster performance) might provide.
Other important aspects of the system are:
- There is no limit to the number of images that will be stored, so storage scalability, in terms of image count, needs to be considered.
- There needs to be low latency for image downloads/requests.
- If a user uploads an image, the image should always be there (data reliability for images).
- The system should be easy to maintain (manageability).
- Since image hosting doesn't have high profit margins, the system needs to be cost-effective.
Figure 1.1 is a simplified diagram of the functionality.
Figure 1.1: Simplified architecture diagram for image hosting application
In this image hosting example, the system must be perceivably fast, its data stored reliably, and all of these attributes highly scalable. Building a small version of this application would be trivial and easily hosted on a single server; however, that would not be interesting for this chapter. Let's assume that we want to build something that could grow as big as Flickr.
Services
When considering scalable system design, it helps to decouple functionality and think about each part of the system as its own service with a clearly defined interface. In practice, systems designed in this way are said to have a Service-Oriented Architecture (SOA). For these types of systems, each service has its own distinct functional context, and interaction with anything outside of that context takes place through an abstract interface, typically the public-facing API of another service.
Deconstructing a system into a set of complementary services decouples the operation of those pieces from one another. This abstraction helps establish clear relationships between the service, its underlying environment, and the consumers of that service. Creating these clear delineations can help isolate problems, but also allows each piece to scale independently of one another. This sort of service-oriented design for systems is very similar to object-oriented design for programming.
In our example, all requests to upload and retrieve images are processed by the same server; however, as the system needs to scale it makes sense to break out these two functions into their own services.
Fast-forward and assume that the service is in heavy use; such a scenario makes it easy to see how longer writes will impact the time it takes to read the images (since the two functions will be competing for shared resources). Depending on the architecture this effect can be substantial. Even if the upload and download speeds are the same (which is not true of most IP networks, since most are designed for at least a 3:1 download-speed:upload-speed ratio), files will typically be read from cache, and writes will have to go to disk eventually (and perhaps be written several times in eventually consistent situations). Even if everything is in memory or read from disks (like SSDs), database writes will almost always be slower than reads. (PolePosition, an open source tool for DB benchmarking: http://polepos.org/; results: http://polepos.sourceforge.net/results/PolePositionClientServer.pdf.)
Another potential problem with this design is that a web server like Apache or lighttpd typically has an upper limit on the number of simultaneous connections it can maintain (defaults are around 500, but can go much higher), and in high traffic, writes can quickly consume all of those. Since reads can be asynchronous, or take advantage of other performance optimizations like gzip compression or chunked transfer encoding, the web server can serve reads faster and switch between clients quickly, serving many more requests per second than the maximum number of connections (with Apache and max connections set to 500, it is not uncommon to serve several thousand read requests per second). Writes, on the other hand, tend to maintain an open connection for the duration of the upload, so uploading a 1 MB file could take more than 1 second on most home networks, meaning that web server could only handle 500 such simultaneous writes.
Figure 1.2: Splitting out reads and writes
Planning for this sort of bottleneck makes a good case to split out reads and writes of images into their own services, shown in Figure 1.2. This allows us to scale each of them independently (since it is likely we will always do more reading than writing), but also helps clarify what is going on at each point. Finally, this separates future concerns, which would make it easier to troubleshoot and scale a problem like slow reads.
The advantage of this approach is that we are able to solve problems independently of one another—we don't have to worry about writing and retrieving new images in the same context. Both of these services still leverage the global corpus of images, but they are free to optimize their own performance with service-appropriate methods (for example, queuing up requests, or caching popular images—more on this below). And from a maintenance and cost perspective each service can scale independently as needed, which is great because if they were combined and intermingled, one could inadvertently impact the performance of the other as in the scenario discussed above.
Of course, the above example can work well when you have two different endpoints (in fact this is very similar to several cloud storage providers' implementations and Content Delivery Networks). There are lots of ways to address these types of bottlenecks though, and each has different tradeoffs.
For example, Flickr solves this read/write issue by distributing users across different shards such that each shard can only handle a set number of users, and as users increase more shards are added to the cluster (see the presentation on Flickr's scaling, http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html). In the first example it is easier to scale hardware based on actual usage (the number of reads and writes across the whole system), whereas Flickr scales with their user base (but forces the assumption of equal usage across users so there can be extra capacity). In the former an outage or issue with one of the services brings down functionality across the whole system (no one can write files, for example), whereas an outage with one of Flickr's shards will only affect those users. In the first example it is easier to perform operations across the whole dataset—for example, updating the write service to include new metadata or searching across all image metadata—whereas with the Flickr architecture each shard would need to be updated or searched (or a search service would need to be created to collate that metadata—which is in fact what they do).
When it comes to these systems there is no right answer, but it helps to go back to the principles at the start of this chapter, determine the system needs (heavy reads or writes or both, level of concurrency, queries across the data set, ranges, sorts, etc.), benchmark different alternatives, understand how the system will fail, and have a solid plan for when failure happens.
Redundancy
In order to handle failure gracefully a web architecture must have redundancy of its services and data. For example, if there is only one copy of a file stored on a single server, then losing that server means losing that file. Losing data is seldom a good thing, and a common way of handling it is to create multiple, or redundant, copies.
This same principle also applies to services. If there is a core piece of functionality for an application, ensuring that multiple copies or versions are running simultaneously can secure against the failure of a single node.
Creating redundancy in a system can remove single points of failure and provide a backup or spare functionality if needed in a crisis. For example, if there are two instances of the same service running in production, and one fails or degrades, the system can failover to the healthy copy. Failover can happen automatically or require manual intervention.
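To make the failover idea concrete, here is a minimal sketch (in Python, with hypothetical replica URLs) of a client that tries one copy of a service and falls back to the redundant copy when the first is unreachable; a production system would more likely rely on a load balancer or health checks, but the principle is the same.

```python
import urllib.request

# Hypothetical endpoints for two redundant copies of the same image service.
REPLICAS = ["http://images-a.example.com", "http://images-b.example.com"]

def fetch_with_failover(path, timeout=2):
    """Try each redundant replica in turn; fail over to the next on error."""
    last_error = None
    for base in REPLICAS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except OSError as err:       # connection refused, timeout, HTTP error, etc.
            last_error = err         # remember the failure and try the next copy
    raise RuntimeError("all replicas failed") from last_error
```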
Another key part of service redundancy is creating a shared-nothing architecture. With this architecture, each node can operate independently of the others, and there is no central "brain" managing state or coordinating activities for the other nodes. This helps a lot with scalability since new nodes can be added without special conditions or knowledge. However, and most importantly, there is no single point of failure in these systems, so they are much more resilient to failure.
For example, in our image server application, all images would have redundant copies on another piece of hardware somewhere (ideally in a different geographic location in the event of a catastrophe like an earthquake or fire in the data center), and the services to access the images would be redundant, all potentially servicing requests. (See Figure 1.3.) (Load balancers are a great way to make this possible, but there is more on that below.)
Figure 1.3: Image hosting application with redundancy
Partitions
There may be very large data sets that are unable to fit on a single server. It may also be the case that an operation requires too many computing resources, diminishing performance and making it necessary to add capacity. In either case you have two choices: scale vertically or horizontally.
Scaling vertically means adding more resources to an individual server. So for a very large data set, this might mean adding more (or bigger) hard drives so a single server can contain the entire data set. In the case of the compute operation, this could mean moving the computation to a bigger server with a faster CPU or more memory. In each case, vertical scaling is accomplished by making the individual resource capable of handling more on its own.
To scale horizontally, on the other hand, is to add more nodes. In the case of the large data set, this might be a second server to store parts of the data set, and for the computing resource it would mean splitting the operation or load across some additional nodes. To take full advantage of horizontal scaling, it should be included as an intrinsic design principle of the system architecture, otherwise it can be quite cumbersome to modify and separate out the context to make this possible.
When it comes to horizontal scaling, one of the more common techniques is to break up your services into partitions, or shards. The partitions can be distributed such that each logical set of functionality is separate; this could be done by geographic boundaries, or by other criteria like non-paying versus paying users. The advantage of these schemes is that they provide a service or data store with added capacity.
In our image server example, it is possible that the single file server used to store images could be replaced by multiple file servers, each containing its own unique set of images. (See Figure 1.4.) Such an architecture would allow the system to fill each file server with images, adding additional servers as the disks become full. The design would require a naming scheme that tied an image's filename to the server containing it. An image's name could be formed from a consistent hashing scheme mapped across the servers. Or alternatively, each image could be assigned an incremental ID, so that when a client makes a request for an image, the image retrieval service only needs to maintain the range of IDs that are mapped to each of the servers (like an index).
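As a rough illustration of the second approach, the sketch below (with made-up server names and ID ranges) shows how an image retrieval service might keep a small range-to-server index and use it to route a request for a given image ID; it is only a sketch of the idea, not a complete partitioning scheme.

```python
import bisect

# Hypothetical partition map: (highest image ID stored on that server, server name).
ID_RANGES = [
    (1_000_000, "file-server-1"),
    (2_000_000, "file-server-2"),
    (3_000_000, "file-server-3"),   # new servers get appended as disks fill up
]
UPPER_BOUNDS = [upper for upper, _ in ID_RANGES]

def server_for_image(image_id):
    """Return the file server whose ID range covers the given image ID."""
    index = bisect.bisect_left(UPPER_BOUNDS, image_id)
    if index == len(ID_RANGES):
        raise KeyError(f"no partition holds image {image_id}")
    return ID_RANGES[index][1]

print(server_for_image(1_500_000))   # -> file-server-2
```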
Figure 1.4: Image hosting application with redundancy and partitioning
Of course there are challenges distributing data or functionality across multiple servers. One of the key issues is data locality; in distributed systems the closer the data to the operation or point of computation, the better the performance of the system. Therefore it is potentially problematic to have data spread across multiple servers, as any time it is needed it may not be local, forcing the servers to perform a costly fetch of the required information across the network.
Another potential issue comes in the form of inconsistency. When there are different services reading and writing from a shared resource, potentially another service or data store, there is the chance for race conditions—where some data is supposed to be updated, but the read happens prior to the update—and in those cases the data is inconsistent. For example, in the image hosting scenario, a race condition could occur if one client sent a request to update the dog image with a new title, changing it from "Dog" to "Gizmo", but at the same time another client was reading the image. In that circumstance it is unclear which title, "Dog" or "Gizmo", would be the one received by the second client.
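The sketch below simulates this race with two threads sharing an in-memory record (the data and timing are contrived); depending on which thread happens to run first, the reader may see either title.

```python
import threading
import time

image_metadata = {"dog.jpg": {"title": "Dog"}}

def update_title():
    # Simulate the write service changing the title.
    time.sleep(0.01)
    image_metadata["dog.jpg"]["title"] = "Gizmo"

def read_title(results):
    # Simulate another client reading at roughly the same time.
    time.sleep(0.01)
    results.append(image_metadata["dog.jpg"]["title"])

results = []
writer = threading.Thread(target=update_title)
reader = threading.Thread(target=read_title, args=(results,))
writer.start(); reader.start()
writer.join(); reader.join()
print(results)   # may print ['Dog'] or ['Gizmo'] depending on timing
```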
There are certainly some obstacles associated with partitioning data, but partitioning allows each problem to be split—by data, load, usage patterns, etc.—into manageable chunks. This can help with scalability and manageability, but is not without risk. There are lots of ways to mitigate risk and handle failures; however, in the interest of brevity they are not covered in this chapter. If you are interested in reading more, you can check out my blog post on fault tolerance and monitoring.
1.3. The Building Blocks of Fast and Scalable Data Access
Having covered some of the core considerations in designing distributed systems, let's now talk about the hard part: scaling access to the data.
Most simple web applications, for example, LAMP stack applications, look something like Figure 1.5.
Figure 1.5: Simple web applications
As they grow, there are two main challenges: scaling access to the app server and to the database. In a highly scalable application design, the app (or web) server is typically minimized and often embodies a shared-nothing architecture. This makes the app server layer of the system horizontally scalable. As a result of this design, the heavy lifting is pushed down the stack to the database server and supporting services; it's at this layer where the real scaling and performance challenges come into play.
The rest of this chapter is devoted to some of the more common strategies and methods for making these types of services fast and scalable by providing fast access to data.
Figure 1.6: Oversimplified web application
Most systems can be oversimplified to Figure 1.6. This is a great place to start. If you have a lot of data, you want fast and easy access, like keeping a stash of candy in the top drawer of your desk. Though overly simplified, the previous statement hints at two hard problems: scalability of storage and fast access of data.
For the sake of this section, let's assume you have many terabytes (TB) of data and you want to allow users to access small portions of that data at random. (See Figure 1.7.) This is similar to locating an image file somewhere on the file server in the image application example.
Figure 1.7: Accessing specific data
This is particularly challenging because it can be very costly to load TBs of data into memory; this directly translates to disk IO. Reading from disk is many times slower than from memory—memory access is as fast as Chuck Norris, whereas disk access is slower than the line at the DMV. This speed difference really adds up for large data sets; in real numbers memory access is as little as 6 times faster for sequential reads, or 100,000 times faster for random reads, than reading from disk (see "The Pathologies of Big Data", http://queue.acm.org/detail.cfm?id=1563874). Moreover, even with unique IDs, solving the problem of knowing where to find that little bit of data can be an arduous task. It's like trying to get that last Jolly Rancher from your candy stash without looking.
Thankfully there are many options that you can employ to make this easier; four of the more important ones are caches, proxies, indexes and load balancers. The rest of this section discusses how each of these concepts can be used to make data access a lot faster.
Caches
Caches take advantage of the locality of reference principle: recently requested data is likely to be requested again. They are used in almost every layer of computing: hardware, operating systems, web browsers, web applications and more. A cache is like short-term memory: it has a limited amount of space, but is typically faster than the original data source and contains the most recently accessed items. Caches can exist at all levels in architecture, but are often found at the level nearest to the front end, where they are implemented to return data quickly without taxing downstream levels.
How can a cache be used to make your data access faster in our API example? In this case, there are a couple of places you can insert a cache. One option is to insert a cache on your request layer node, as in Figure 1.8.
Figure 1.8: Inserting a cache on your request layer node
Placing a cache directly on a request layer node enables the local storage of response data. Each time a request is made to the service, the node will quickly return local, cached data if it exists. If it is not in the cache, the request node will query the data from disk. The cache on one request layer node could also be located both in memory (which is very fast) and on the node's local disk (faster than going to network storage).
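As a small illustration, the following sketch uses Python's functools.lru_cache as the in-memory cache on a request node; the origin read is a stand-in for disk or network storage, and the path is hypothetical.

```python
import functools

def read_image_from_origin(image_name):
    """Stand-in for the slow path: disk or network storage."""
    with open(f"/data/images/{image_name}", "rb") as f:   # hypothetical path
        return f.read()

# A minimal in-memory read-through cache on the request layer node.
@functools.lru_cache(maxsize=1024)
def get_image(image_name):
    # On a miss, lru_cache calls through to the origin and stores the result;
    # on a hit, the cached bytes are returned without touching the origin.
    return read_image_from_origin(image_name)
```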
Figure 1.9: Multiple caches
What happens when you expand this to many nodes? As you can see in Figure 1.9, if the request layer is expanded to multiple nodes, it's still quite possible to have each node host its own cache. However, if your load balancer randomly distributes requests across the nodes, the same request will go to different nodes, thus increasing cache misses. Two choices for overcoming this hurdle are global caches and distributed caches.
Global Cache
A global cache is just as it sounds: all the nodes use the same single cache space. This involves adding a server, or file store of some sort, faster than your original store and accessible by all the request layer nodes. Each of the request nodes queries the cache in the same way it would a local one. This kind of caching scheme can get a bit complicated because it is very easy to overwhelm a single cache as the number of clients and requests increase, but is very effective in some architectures (particularly ones with specialized hardware that make this global cache very fast, or that have a fixed dataset that needs to be cached).
There are two common forms of global caches depicted in the diagrams. In Figure 1.10, when a cached response is not found in the cache, the cache itself becomes responsible for retrieving the missing piece of data from the underlying store. In Figure 1.11 it is the responsibility of request nodes to retrieve any data that is not found in the cache.
Figure 1.10: Global cache where cache is responsible for retrieval
Figure 1.11: Global cache where request nodes are responsible for retrieval
The majority of applications leveraging global caches tend to use the first type, where the cache itself manages eviction and fetching data to prevent a flood of requests for the same data from the clients. However, there are some cases where the second implementation makes more sense. For example, if the cache is being used for very large files, a low cache hit percentage would cause the cache buffer to become overwhelmed with cache misses; in this situation it helps to have a large percentage of the total data set (or hot data set) in the cache. Another example is an architecture where the files stored in the cache are static and shouldn't be evicted. (This could be because of application requirements around that data latency—certain pieces of data might need to be very fast for large data sets—where the application logic understands the eviction strategy or hot spots better than the cache.)
Distributed Cache
In a distributed cache (Figure 1.12), each of its nodes owns part of the cached data, so if a refrigerator acts as a cache to the grocery store, a distributed cache is like putting your food in several locations—your fridge, cupboards, and lunch box—convenient locations for retrieving snacks from, without a trip to the store. Typically the cache is divided up using a consistent hashing function, such that if a request node is looking for a certain piece of data it can quickly know where to look within the distributed cache to determine if that data is available. In this case, each node has a small piece of the cache, and will then send a request to another node for the data before going to the origin. Therefore, one of the advantages of a distributed cache is the increased cache space that can be had just by adding nodes to the request pool.
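A toy version of that placement logic might look like the following consistent-hash ring (the node names are made up): each key maps to the first node clockwise from its hash, so adding or removing a cache node only remaps a small fraction of keys.

```python
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring: a key maps to the nearest node clockwise."""
    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring many times ("virtual nodes") to spread load.
        self._ring = sorted((_hash(f"{node}:{i}"), node)
                            for node in nodes for i in range(replicas))
        self._keys = [h for h, _ in self._ring]

    def node_for(self, key):
        index = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:42:profile"))   # the cache node that owns this key
```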
A disadvantage of distributed caching is remedying a missing node. Some distributed caches get around this by storing multiple copies of the data on different nodes; however, you can imagine how this logic can get complicated quickly, especially when you add or remove nodes from the request layer. That said, even if a node disappears and part of the cache is lost, the requests will just pull from the origin—so it isn't necessarily catastrophic!
Figure 1.12: Distributed cache
The great thing about caches is that they usually make things much faster (implemented correctly, of course!). The methodology you choose just allows you to make it faster for even more requests. However, all this caching comes at the cost of having to maintain additional storage space, typically in the form of expensive memory; nothing is free. Caches are wonderful for making things generally faster, and moreover provide system functionality under high load conditions when otherwise there would be complete service degradation.
One example of a popular open source cache is Memcached (http://memcached.org/) (which can work both as a local cache and distributed cache); however, there are many other options (including many language- or framework-specific options).
Memcached is used in many large web sites, and even though it can be very powerful, it is simply an in-memory key value store, optimized for arbitrary data storage and fast lookups (O(1)).
Facebook uses several different types of caching to obtain their site performance (see "Facebook caching and performance"). They use $GLOBALS and APC caching at the language level (provided in PHP at the cost of a function call), which helps make intermediate function calls and results much faster. (Most languages have these types of libraries to improve web page performance and they should almost always be used.) Facebook then uses a global cache that is distributed across many servers (see "Scaling memcached at Facebook"), such that one function call accessing the cache could make many requests in parallel for data stored on different Memcached servers. This allows them to get much higher performance and throughput for their user profile data, and have one central place to update data (which is important, since cache invalidation and maintaining consistency can be challenging when you are running thousands of servers).
Now let's talk about what to do when the data isn't in the cache…
Proxies
At a basic level, a proxy server is an intermediate piece of hardware/software that receives requests from clients and relays them to the backend origin servers. Typically, proxies are used to filter requests, log requests, or sometimes transform requests (by adding/removing headers, encrypting/decrypting, or compression).
Figure 1.13: Proxy server
Proxies are also immensely helpful when coordinating requests from multiple servers, providing opportunities to optimize request traffic from a system-wide perspective. One way to use a proxy to speed up data access is to collapse the same (or similar) requests together into one request, and then return the single result to the requesting clients. This is known as collapsed forwarding.
Imagine there is a request for the same data (let's call it littleB) across several nodes, and that piece of data is not in the cache. If that request is routed through the proxy, then all of those requests can be collapsed into one, which means we only have to read littleB off disk once. (See Figure 1.14.) There is some cost associated with this design, since each request can have slightly higher latency, and some requests may be slightly delayed to be grouped with similar ones. But it will improve performance in high load situations, particularly when that same data is requested over and over. This is similar to a cache, but instead of storing the data/document like a cache, it is optimizing the requests or calls for those documents and acting as a proxy for those clients.
In a LAN proxy, for example, the clients do not need their own IPs to connect to the Internet, and the LAN will collapse calls from the clients for the same content. It is easy to get confused here though, since many proxies are also caches (as it is a very logical place to put a cache), but not all caches act as proxies.
Figure 1.14: Using a proxy server to collapse requests
Another great way to use the proxy is to not just collapse requests for the same data, but also to collapse requests for data that is spatially close together in the origin store (consecutively on disk). Employing such a strategy maximizes data locality for the requests, which can result in decreased request latency. For example, let's say a bunch of nodes request parts of B: partB1, partB2, etc. We can set up our proxy to recognize the spatial locality of the individual requests, collapsing them into a single request and returning only bigB, greatly minimizing the reads from the data origin. (See Figure 1.15.) This can make a really big difference in request time when you are randomly accessing across TBs of data! Proxies are especially helpful under high load situations, or when you have limited caching, since they can essentially batch several requests into one.
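The following sketch shows the simpler form of collapsed forwarding, merging concurrent requests for the same key so the origin is read only once (the fetch function is supplied by the caller); collapsing spatially adjacent requests into one larger read would build on the same idea.

```python
import threading

class CollapsingProxy:
    """Merge concurrent requests for the same key into a single origin read."""
    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._in_flight = {}          # key -> (completion event, result holder)

    def get(self, key):
        with self._lock:
            entry = self._in_flight.get(key)
            if entry is None:
                entry = (threading.Event(), {})
                self._in_flight[key] = entry
                leader = True          # this caller will perform the origin read
            else:
                leader = False         # another caller is already fetching this key
        event, holder = entry
        if leader:
            holder["value"] = self._fetch(key)   # only the leader hits the origin
            with self._lock:
                del self._in_flight[key]
            event.set()
        else:
            event.wait()                          # followers reuse the leader's result
        return holder["value"]
```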
Figure 1.15: Using a proxy to collapse requests for data that is spatially close together
It is worth noting that you can use proxies and caches together, but generally it is best to put the cache in front of the proxy, for the same reason that it is best to let the faster runners start first in a crowded marathon race. This is because the cache is serving data from memory, it is very fast, and it doesn't mind multiple requests for the same result. But if the cache was located on the other side of the proxy server, then there would be additional latency with every request before the cache, and this could hinder performance.
If you are looking at adding a proxy to your systems, there are many options to consider; Squid and Varnish have both been road tested and are widely used in many production web sites. These proxy solutions offer many optimizations to make the most of client-server communication. Installing one of these as a reverse proxy (explained in the load balancer section below) at the web server layer can improve web server performance considerably, reducing the amount of work required to handle incoming client requests.
Indexes
Using an index to access your data quickly is a well-known strategy for optimizing data access performance; probably the most well known when it comes to databases. An index makes the trade-offs of increased storage overhead and slower writes (since you must both write the data and update the index) for the benefit of faster reads.
Just as with a traditional relational data store, you can also apply this concept to larger data sets. The trick with indexes is you must carefully consider how users will access your data. In the case of data sets that are many TBs in size, but with very small payloads (e.g., 1 KB), indexes are a necessity for optimizing data access. Finding a small payload in such a large data set can be a real challenge since you can't possibly iterate over that much data in any reasonable time. Furthermore, it is very likely that such a large data set is spread over several (or many!) physical devices—this means you need some way to find the correct physical location of the desired data. Indexes are the best way to do this.
Figure 1.16: Indexes
An index can be used like a table of contents that directs you to the location where your data lives. For example, let's say you are looking for a piece of data, part 2 of B—how will you know where to find it? If you have an index that is sorted by data type—say data A, B, C—it would tell you the location of data B at the origin. Then you just have to seek to that location and read the part of B you want. (See Figure 1.16.)
These indexes are often stored in memory, or somewhere very local to the incoming client request. Berkeley DBs (BDBs) and tree-like data structures are commonly used to store data in ordered lists, ideal for access with an index.
Often there are many layers of indexes that serve as a map, moving you from one location to the next, and so forth, until you get the specific piece of data you want. (See Figure 1.17.)
Figure 1.17: Many layers of indexes
Indexes can also be used to create several different views of the same data. For large data sets, this is a great way to define different filters and sorts without resorting to creating many additional copies of the data.
For example, imagine that the image hosting system from earlier is actually hosting images of book pages, and the service allows client queries across the text in those images, searching all the book content about a topic, in the same way search engines allow you to search HTML content. In this case, all those book images take many, many servers to store the files, and finding one page to render to the user can be a bit involved. First, inverse indexes to query for arbitrary words and word tuples need to be easily accessible; then there is the challenge of navigating to the exact page and location within that book, and retrieving the right image for the results. So in this case the inverted index would map to a location (such as book B), and then B may contain an index with all the words, locations and number of occurrences in each part.
An inverted index, which could represent Index1 in the diagram above, might look something like the following—each word or tuple of words provides an index of what books contain them.
| Word(s) | Book(s) |
| --- | --- |
| being awesome | Book B, Book C, Book D |
| always | Book C, Book F |
| believe | Book B |
The intermediate index would look similar but would contain just the words, location, and information for book B. This nested index architecture allows each of these indexes to take up less space than if all of that info had to be stored into one big inverted index.
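A toy version of this two-level structure might look like the following (the books and their text are made up): a top-level inverted index maps words to the books that contain them, and a per-book index maps words to positions within that book.

```python
from collections import defaultdict

books = {
    "Book B": "believe in being awesome",
    "Book C": "always being awesome",
    "Book D": "being awesome",
}

# Top-level inverted index: word -> set of books containing it.
inverted = defaultdict(set)
# Intermediate indexes: per book, word -> list of positions within that book.
per_book = defaultdict(lambda: defaultdict(list))

for book, text in books.items():
    for position, word in enumerate(text.split()):
        inverted[word].add(book)
        per_book[book][word].append(position)

print(sorted(inverted["awesome"]))     # ['Book B', 'Book C', 'Book D']
print(per_book["Book B"]["believe"])   # positions of "believe" within Book B
```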
And this is key in large-scale systems because even compressed, these indexes can get quite big and expensive to store. In this system if we assume we have a lot of the books in the world—100,000,000 (see the Inside Google Books blog post)—and that each book is only 10 pages long (to make the math easier), with 250 words per page, that means there are 250 billion words. If we assume an average of 5 characters per word, and each character takes 8 bits (or 1 byte, even though some characters are 2 bytes), so 5 bytes per word, then an index containing only each word once is over a terabyte of storage. So you can see creating indexes that have a lot of other information like tuples of words, locations for the data, and counts of occurrences, can add up very quickly.
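The back-of-the-envelope arithmetic behind that estimate is simply the product of those assumptions:

```python
books = 100_000_000          # books assumed in the text above
pages_per_book = 10
words_per_page = 250
bytes_per_word = 5           # ~5 characters at 1 byte each

total_words = books * pages_per_book * words_per_page   # 250 billion words
index_bytes = total_words * bytes_per_word               # 1.25e12 bytes
print(index_bytes / 1e12)    # ~1.25 TB, i.e. "over a terabyte"
```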
Creating these intermediate indexes and representing the data in smaller sections makes big data problems tractable. Data can be spread across many servers and still accessed quickly. Indexes are a cornerstone of information retrieval, and the basis for today's modern search engines. Of course, this section only scratched the surface, and there is a lot of research being done on how to make indexes smaller, faster, contain more information (like relevancy), and update seamlessly. (There are some manageability challenges with race conditions, and with the sheer number of updates required to add new data or change existing data, particularly in the event where relevancy or scoring is involved.)
Being able to find your data quickly and easily is important; indexes are an effective and simple tool to achieve this.
Load Balancers
Finally, another critical piece of any distributed system is a load balancer. Load balancers are a principal part of any architecture, as their role is to distribute load across a set of nodes responsible for servicing requests. This allows multiple nodes to transparently service the same function in a system. (See Figure 1.18.) Their main purpose is to handle a lot of simultaneous connections and route those connections to one of the request nodes, allowing the system to scale to service more requests by just adding nodes.
Figure 1.18: Load balancer
There are many different algorithms that can be used to service requests, including picking a random node, round robin, or even selecting the node based on certain criteria, such as memory or CPU utilization. Load balancers can be implemented as software or hardware appliances. One open source software load balancer that has received wide adoption is HAProxy.
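As a rough sketch of the selection strategies just mentioned (random, round robin, and utilization-based), the toy balancer below tracks active connections per node; real load balancers such as HAProxy implement these policies far more robustly.

```python
import itertools
import random

class LoadBalancer:
    """Toy node selection: random, round robin, or least active connections."""
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._rr = itertools.cycle(self.nodes)
        self.active = {node: 0 for node in self.nodes}   # current connection counts

    def pick_random(self):
        return random.choice(self.nodes)

    def pick_round_robin(self):
        return next(self._rr)

    def pick_least_connections(self):
        return min(self.nodes, key=lambda node: self.active[node])

lb = LoadBalancer(["app-1", "app-2", "app-3"])
node = lb.pick_round_robin()
lb.active[node] += 1    # a real balancer would decrement when the request finishes
```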
In a distributed system, load balancers are often found at the very front of the system, such that all incoming requests are routed accordingly. In a complex distributed system, it is not uncommon for a request to be routed to multiple load balancers as shown in Figure 1.19.
Figure 1.19: Multiple load balancers
Like proxies, some load balancers can also route a request differently depending on the type of request it is. (Technically these are also known as reverse proxies.)
One of the challenges with load balancers is managing user-session-specific data. In an e-commerce site, when you only have one client it is very easy to allow users to put things in their shopping cart and persist those contents between visits (which is important, because it is much more likely you will sell the product if it is still in the user's cart when they return). However, if a user is routed to one node for a session, and then a different node on their next visit, there can be inconsistencies since the new node may be missing that user's cart contents. (Wouldn't you be upset if you put a 6 pack of Mountain Dew in your cart and then came back and it was empty?) One way around this can be to make sessions sticky so that the user is always routed to the same node, but then it is very hard to take advantage of some reliability features like automatic failover. In this case, the user's shopping cart would always have the contents, but if their sticky node became unavailable there would need to be a special case and the assumption of the contents being there would no longer be valid (although hopefully this assumption wouldn't be built into the application). Of course, this problem can be solved using other strategies and tools in this chapter, like services, and many not covered (like browser caches, cookies, and URL rewriting).
If a system only has a couple of nodes, systems like round robin DNS may make more sense since load balancers can be expensive and add an unneeded layer of complexity. Of course in larger systems there are all sorts of different scheduling and load-balancing algorithms, including simple ones like random choice or round robin, and more sophisticated mechanisms that take things like utilization and capacity into consideration. All of these algorithms allow traffic and requests to be distributed, and can provide helpful reliability tools like automatic failover, or automatic removal of a bad node (such as when it becomes unresponsive). However, these advanced features can make problem diagnosis cumbersome. For example, when it comes to high load situations, load balancers will remove nodes that may be slow or timing out (because of too many requests), but that only exacerbates the situation for the other nodes. In these cases extensive monitoring is important, because overall system traffic and throughput may look like it is decreasing (since the nodes are serving fewer requests) but the individual nodes are becoming maxed out.
Load balancers are an easy way to allow you to expand system capacity, and like the other techniques in this chapter, play an essential role in distributed system architecture. Load balancers also provide the critical function of being able to test the health of a node, such that if a node is unresponsive or overloaded, it can be removed from the pool handling requests, taking advantage of the redundancy of different nodes in your system.
Queues
So far we have covered a lot of ways to read data quickly, but another important part of scaling the data layer is effective management of writes. When systems are simple, with minimal processing loads and small databases, writes can be predictably fast; however, in more complex systems writes can take an almost non-deterministically long time. For example, data may have to be written several places on different servers or indexes, or the system could just be under high load. In the cases where writes, or any task for that matter, may take a long time, achieving performance and availability requires building asynchrony into the system; a common way to do that is with queues.
Figure 1.20: Synchronous request
Imagine a system where each client is requesting a task to be remotely serviced. Each of these clients sends their request to the server, where the server completes the tasks as quickly as possible and returns the results to their respective clients. In small systems where one server (or logical service) can service incoming clients just as fast as they come, this sort of situation should work just fine. However, when the server receives more requests than it can handle, then each client is forced to wait for the other clients' requests to complete before a response can be generated. This is an example of a synchronous request, depicted in Figure 1.20.
This kind of synchronous behavior can severely degrade client performance; the client is forced to wait, effectively performing zero work, until its request can be answered. Adding additional servers to address system load does not solve the problem either; even with effective load balancing in place it is extremely difficult to ensure the even and fair distribution of work required to maximize client performance. Further, if the server handling requests is unavailable, or fails, then the clients upstream will also fail. Solving this problem effectively requires abstraction between the client's request and the actual work performed to service it.
Figure 1.21: Using queues to manage requests
Enter queues. A queue is as simple as it sounds: a task comes in, is added to the queue, and then workers pick up the next task as they have the capacity to process it. (See Figure 1.21.) These tasks could represent simple writes to a database, or something as complex as generating a thumbnail preview image for a document. When a client submits task requests to a queue they are no longer forced to wait for the results; instead they need only acknowledgement that the request was properly received. This acknowledgement can later serve as a reference for the results of the work when the client requires it.
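A minimal in-process version of this pattern, using Python's standard queue module and a few worker threads, might look like the following (the task names are illustrative); a real system would use a durable queue service rather than an in-memory one.

```python
import queue
import threading

tasks = queue.Queue()

def worker():
    while True:
        task = tasks.get()                 # blocks until a task is available
        try:
            print(f"processing {task}")    # e.g. write to a database, render a thumbnail
        finally:
            tasks.task_done()

# A small pool of workers drains the queue as capacity allows.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

def submit(task):
    # Clients only enqueue work and get an immediate acknowledgement back.
    tasks.put(task)
    return f"accepted:{task}"

print(submit("thumbnail:dog.jpg"))
tasks.join()                               # wait until all submitted tasks are processed
```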
Queues enable clients to work in an asynchronous manner, providing a strategic abstraction of a client's request and its response. On the other hand, in a synchronous system, there is no differentiation between request and reply, and they therefore cannot be managed separately. In an asynchronous system the client requests a task, the service responds with a message acknowledging the task was received, and then the client can periodically check the status of the task, only requesting the result once it has completed. While the client is waiting for an asynchronous request to be completed it is free to perform other work, even making asynchronous requests of other services. The latter is an example of how queues and messages are leveraged in distributed systems.
Queues also provide some protection from service outages and failures. For instance, it is quite easy to create a highly robust queue that can retry service requests that have failed due to transient server failures. It is preferable to use a queue to enforce quality-of-service guarantees rather than to expose clients directly to intermittent service outages, requiring complicated and often-inconsistent client-side error handling.
Queues are fundamental in managing distributed communication between different parts of any large-scale distributed system, and there are lots of ways to implement them. There are quite a few open source queues like RabbitMQ, ActiveMQ, BeanstalkD, but some also use services like Zookeeper, or even data stores like Redis.
1.4. Conclusion
Designing efficient systems with fast access to lots of data is exciting, and there are lots of great tools that enable all kinds of new applications. This chapter covered just a few examples, barely scratching the surface, but there are many more—and there will only continue to be more innovation in the space.
This work is made available under the Creative Commons Attribution 3.0 Unported license. Please see the full description of the license for details.