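Data Center Handbook (1): Architecture
The figure shows a basic data center architecture.
The top layer is the Internet Edge, also called the Edge Router or Border Router. It connects the data center to the Internet.
- Connects to multiple network providers for redundant, reliable connectivity
- Facing outward, advertises routes via BGP so that external hosts can reach internal IPs
- Facing inward, distributes routes via iBGP so that internal hosts can reach external IPs
- Enforces border security controls so that external hosts cannot freely access the internal network
- Controls access from the internal network to the outside
For HA, there are usually two Border Routers.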
Typical enterprise Internet connectivity design:
- Two edge routers, IR1 and IR2, provide direct connectivity to the Internet.
- A pair of firewalls provides stateful inspection capabilities and access to both the internal network and the demilitarized zone (DMZ).
- The DMZ contains public-facing services such as web; this is the only network accessible directly from the public Internet.
- The internal network should never be accessed directly from the Internet, but traffic sourced from the internal network must be able to reach Internet sites.
ACLs are usually configured on the Border Router to control access.
Border Router ACLs:
- Special-use address and anti-spoofing entries. These deny illegitimate sources, and they deny packets that arrive from outside with source addresses belonging to your own network.
  - RFC 1918 defines reserved (private) address space that is not a valid source address on the Internet.
  - RFC 3330 (since obsoleted by RFC 5735 and then RFC 6890) defines special-use addresses that might require filtering.
  - RFC 2827 provides anti-spoofing guidelines.
- Explicitly permitted return traffic for internal connections to the Internet
- Explicitly permitted externally sourced traffic destined to protected internal addresses
- Explicit deny statement
An example ACL:
------------------------------------------------------------------------------------------------
!--- Add anti-spoofing entries.
!--- Deny special-use address sources.
!--- Refer to RFC 3330 for additional special use addresses.
access-list 110 deny ip 127.0.0.0 0.255.255.255 any
access-list 110 deny ip 192.0.2.0 0.0.0.255 any
access-list 110 deny ip 224.0.0.0 31.255.255.255 any
access-list 110 deny ip host 255.255.255.255 any
!--- The deny statement should not be configured
!--- on Dynamic Host Configuration Protocol (DHCP) relays.
access-list 110 deny ip host 0.0.0.0 any
!--- Filter RFC 1918 space.
access-list 110 deny ip 10.0.0.0 0.255.255.255 any
access-list 110 deny ip 172.16.0.0 0.15.255.255 any
access-list 110 deny ip 192.168.0.0 0.0.255.255 any
!--- Permit Border Gateway Protocol (BGP) to the edge router.
access-list 110 permit tcp host bgp_peer gt 1023 host router_ip eq bgp
access-list 110 permit tcp host bgp_peer eq bgp host router_ip gt 1023
!--- Deny your space as source (as noted in RFC 2827).
access-list 110 deny ip your Internet-routable subnet any
!--- Explicitly permit return traffic.
!--- Allow specific ICMP types.
access-list 110 permit icmp any any echo-reply
access-list 110 permit icmp any any unreachable
access-list 110 permit icmp any any time-exceeded
access-list 110 deny icmp any any
!--- These are outgoing DNS queries.
access-list 110 permit udp any eq 53 host primary DNS server gt 1023
!--- Permit older DNS queries and replies to primary DNS server.
access-list 110 permit udp any eq 53 host primary DNS server eq 53
!--- Permit legitimate business traffic.
access-list 110 permit tcp any Internet-routable subnet established
access-list 110 permit udp any range 1 1023 Internet-routable subnet gt 1023
!--- Allow ftp data connections.
access-list 110 permit tcp any eq 20 Internet-routable subnet gt 1023
!--- Allow tftp data and multimedia connections.
access-list 110 permit udp any gt 1023 Internet-routable subnet gt 1023
!--- Explicitly permit externally sourced traffic.
!--- These are incoming DNS queries.
access-list 110 permit udp any gt 1023 host primary DNS server eq 53
!--- These are zone transfer DNS queries to primary DNS server.
access-list 110 permit tcp host secondary DNS server gt 1023 host primary DNS server eq 53
!--- Permit older DNS zone transfers.
access-list 110 permit tcp host secondary DNS server eq 53 host primary DNS server eq 53
!--- Deny all other DNS traffic.
access-list 110 deny udp any any eq 53
access-list 110 deny tcp any any eq 53
!--- Allow IPSec VPN traffic.
access-list 110 permit udp any host IPSec headend device eq 500
access-list 110 permit udp any host IPSec headend device eq 4500
access-list 110 permit 50 any host IPSec headend device
access-list 110 permit 51 any host IPSec headend device
access-list 110 deny ip any host IPSec headend device
!--- These are Internet-sourced connections to
!--- publicly accessible servers.
access-list 110 permit tcp any host public web server eq 80
access-list 110 permit tcp any host public web server eq 443
access-list 110 permit tcp any host public FTP server eq 21
!--- Data connections to the FTP server are allowed
!--- by the permit established ACE.
!--- Allow PASV data connections to the FTP server.
access-list 110 permit tcp any gt 1023 host public FTP server gt 1023
access-list 110 permit tcp any host public SMTP server eq 25
!--- Explicitly deny all other traffic.
access-list 110 deny ip any any
------------------------------------------------------------------------------------------------
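The two mechanics the ACL above relies on, wildcard masks and first-match evaluation, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it looks at source addresses alone, copies just the anti-spoofing entries, and ends with a hypothetical catch-all permit that stands in for the protocol-specific entries further down the list. The names `wildcard_match` and `evaluate` are made up for this sketch.

```python
import ipaddress


def wildcard_match(addr: str, base: str, wildcard: str) -> bool:
    """Cisco-style wildcard match: bits set to 1 in the wildcard are ignored."""
    a = int(ipaddress.IPv4Address(addr))
    b = int(ipaddress.IPv4Address(base))
    w = int(ipaddress.IPv4Address(wildcard))
    care = ~w & 0xFFFFFFFF  # bits that must match
    return (a & care) == (b & care)


# (action, source base, source wildcard): source-only subset of the ACL above.
ANTI_SPOOF_RULES = [
    ("deny", "127.0.0.0", "0.255.255.255"),
    ("deny", "192.0.2.0", "0.0.0.255"),
    ("deny", "224.0.0.0", "31.255.255.255"),
    ("deny", "10.0.0.0", "0.255.255.255"),
    ("deny", "172.16.0.0", "0.15.255.255"),
    ("deny", "192.168.0.0", "0.0.255.255"),
    # Hypothetical catch-all standing in for the later permit entries.
    ("permit", "0.0.0.0", "255.255.255.255"),
]


def evaluate(src: str, rules=ANTI_SPOOF_RULES) -> str:
    """Return the action of the first matching entry (top-down evaluation)."""
    for action, base, wildcard in rules:
        if wildcard_match(src, base, wildcard):
            return action
    return "deny"  # nothing matched: the implicit deny at the end of an ACL


if __name__ == "__main__":
    for ip in ("10.1.2.3", "172.31.255.1", "172.32.0.1", "8.8.8.8"):
        print(ip, evaluate(ip))
```

Because evaluation stops at the first match, the order of entries matters: the anti-spoofing denies have to come before any broad permit.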
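The second layer is the core network, which contains many core switches.
- Carries traffic between the Availability Zones and the Edge Routers
- Carries traffic between Availability Zones
- Provides highly available (HA) connectivity
- Provides Intrusion Prevention Services
- Provides Distributed Denial of Service Attack Analysis and Mitigation
- Provides the Tier 1 Load Balancer
For HA, two core networks are usually built. They are isolated from each other by VLANs so that they do not interfere, and each core network connects to both border routers and to all Availability Zones.
The switches in the core network are high-end switches. To provide high availability without relying on STP, multiple switches are connected using Link Aggregation.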
- Link aggregation allows you to bond multiple parallel links into a single virtual link (from the STP perspective).
- With parallel links being replaced by a single link, STP detects no loops and all the physical links can be fully utilized.
- Traditional link aggregation goes by several names: port channel, EtherChannel, link bonding, or multi-link trunking.
Traditional Link Aggregation
- A port channel bundles up to eight individual interfaces into a group to provide increased bandwidth and redundancy.
- Port channeling also load balances traffic across these physical interfaces.
- You create a port channel by bundling compatible interfaces.
- You can configure and run either static port channels or port channels running the Link Aggregation Control Protocol (LACP).
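How a port channel spreads traffic over its member interfaces can be sketched as a hash over flow fields. This is a minimal illustration, not any vendor's algorithm: real switches use their own configurable hash inputs, and the function name, the CRC32 hash, and the interface names below are all assumptions made for the example.

```python
import zlib


def pick_member(src_ip, dst_ip, src_port, dst_port, members):
    """Map a flow to one member link of the bundle.

    All packets of one flow hash to the same link, which keeps them in order;
    different flows spread across the links.
    """
    if not members:
        raise ValueError("port channel has no active member links")
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return members[zlib.crc32(key) % len(members)]


if __name__ == "__main__":
    bundle = ["Eth1/1", "Eth1/2", "Eth1/3", "Eth1/4"]  # hypothetical interfaces
    flow = ("10.0.0.5", "10.0.1.9", 40000, 443)
    print(pick_member(*flow, bundle))
    # If a link fails, the same flow is rehashed onto the surviving members.
    print(pick_member(*flow, bundle[:-1]))
```

Hashing per flow rather than per packet is what lets the bundle provide both load balancing and redundancy without reordering packets within a connection.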
LACP (Link Aggregation Control Protocol)
- Individual links can be combined into LACP port channels and channel groups.
- Static LACP: channel groups are created and ports are added by manual configuration; LACP only determines which ports are selected and which are standby.
- Dynamic LACP: all of the above is negotiated between the two sides via LACPDUs.
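Some advanced switches now provide clustering themselves: using a vendor-defined protocol and hardware support, several switches form one cluster, which both provides HA and increases bandwidth.
HA is achieved as follows:
To increase bandwidth further, two switch clusters can be aggregated with LACP. This is called Multi-Chassis Link Aggregation.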
- In Multichassis EtherChannel (MCEC), the dual-homed device (DHD) is dual-homed to two upstream PoAs (points of attachment).
- The DHD is incapable of running any loop prevention control protocol such as Multiple Spanning Tree (MST).
- One method is to place the DHD's uplinks in a LAG, commonly referred to as EtherChannel, with LACP enabled.
- LACP is a link-level control protocol that allows the dynamic negotiation and establishment of LAGs.
- Multichassis LACP: an extension of the LACP implementation on the PoAs is required to convey to a DHD that it is connected to a single virtual LACP peer and not to two disjointed devices.
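The figure below shows the architecture of the L2 core network; red and green denote different VLANs.
Next come the individual Availability Zones, also called Data Center LANs.
The third layer, which is the top layer of each AZ, is called the Aggregation layer.
- This layer holds the aggregation routers or Layer 3 aggregation switches
- It also has IDS/IPS
- It has the Tier 2 Load Balancer
The Aggregation Layer is the external entry point of an AZ and connects upward to the L2 Core Network.
The Border Routers and Aggregation Routers are connected through the L2 Core Network, forming one large Layer 2 domain.
These two layers of routers need a routing protocol between them. The Aggregation routers must learn the border routers' routes so that machines in the AZ can reach the Internet, and the border routers must learn the Aggregation routers' routes so that the Internet can reach public IPs inside the AZ.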
Routing Algorithm
- Nonadaptive algorithms/static routing
  - do not base their routing decisions on any measurements or estimates of the current topology and traffic
  - computed in advance, offline, and downloaded to the routers when the network is booted
- Adaptive algorithms/dynamic routing
  - change their routing decisions to reflect changes in the topology and the traffic
  - distance vector routing, e.g. RIP (Routing Information Protocol)
  - link state routing, e.g. OSPF (Open Shortest Path First)
Distance Vector Routing
- Each router maintains a routing table
  - one entry for each router in the network
  - each entry has two parts:
    - preferred outgoing line for that destination
    - estimate of the distance to that destination
- The router is assumed to know the "distance" to each of its neighbors.
- Once every T msec, each router sends to each neighbor a list of its estimated distances to each destination.
- It also receives a similar list from each neighbor.
  - If neighbor X says that its estimated distance to router I is Xi, and my distance to X is M, then I can reach router I via X at distance Xi + M. The best neighbor is the one with the smallest such sum.
  - The old routing table is not used in the calculation.
- Convergence: after a number of rounds, the routes settle on the best paths across the network.
- It reacts rapidly to good news, but leisurely to bad news.
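One round of the distance vector update described above can be sketched as follows. This is a minimal illustration, not a RIP implementation; the function name `dv_update`, the data layout, and the sample costs are assumptions chosen for the example.

```python
INF = float("inf")


def dv_update(me, link_cost, neighbor_vectors, destinations):
    """Rebuild the routing table from the vectors just received.

    link_cost:        {neighbor: my measured distance M to that neighbor}
    neighbor_vectors: {neighbor: {destination: neighbor's estimate Xi}}
    Returns {destination: (estimated distance, preferred outgoing neighbor)}.
    The previous table is deliberately not consulted.
    """
    table = {}
    for dest in destinations:
        if dest == me:
            table[dest] = (0, None)
            continue
        best_cost, best_hop = INF, None
        for nbr, m in link_cost.items():
            xi = neighbor_vectors[nbr].get(dest, INF)
            if xi + m < best_cost:
                best_cost, best_hop = xi + m, nbr
        table[dest] = (best_cost, best_hop)
    return table


if __name__ == "__main__":
    # Router J with two neighbors, A and I (costs are illustrative).
    link_cost = {"A": 8, "I": 10}
    vectors = {
        "A": {"A": 0, "G": 18, "I": 21},
        "I": {"I": 0, "G": 31, "A": 24},
    }
    print(dv_update("J", link_cost, vectors, ["A", "G", "I", "J"]))
```

Here J reaches G through A at cost 8 + 18 = 26, because the alternative through I would cost 10 + 31 = 41.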
Link State Routing
1. Discover its neighbors and learn their network addresses.
•When a router is booted, its first task is to learn who its neighbors are.
•It sends a special HELLO packet on each point-to-point line.
•The router on the other end will send back a reply giving its name (globally unique)
2. Set the distance or cost metric to each of its neighbors.
•To determine the delay:
–send over the line a special ECHO packet that the other side is required to send back immediately.
–measure the round-trip time and divide it by two to get an estimate of the delay.
3. Construct a packet telling all it has just learned.
•The packet contains:
–identity of the sender
–a sequence number and age
–a list of neighbors and the cost to each neighbor.
4. Send this packet to and receive packets from all other routers.
•Basic Algorithm
–Each packet contains a sequence number incremented for each new packet sent.
–Routers keep track of all the (source router, sequence) pairs they see.
–When a new link state packet comes in, it is checked against the list of packets already seen.
–If it is new, it is forwarded on all lines except the one it arrived on.
–If it is a duplicate, it is discarded.
–If a packet with a sequence number lower than the highest one seen so far ever arrives, it is rejected as being obsolete
•Problems:
–if a router ever crashes, it will lose track of its sequence number. If it starts again at 0, the next packet it sends will be rejected as a duplicate.
–if a sequence number is ever corrupted and 65,540 is received instead of 4 (a 1-bit error), packets 5 through 65,540 will be rejected as obsolete
•Refinement 1: aging (handles both problems)
–include the age of each packet after the sequence number
–decrement it once per second.
–When the age hits zero, the information from that router is discarded.
–if a router crashes and its sequence numbers restart from 0, its new packets are discarded until the old entry in the list times out; after that, the new packets are accepted. An entry with a corrupted sequence number ages out in the same way.
•Refinement 2: more robust flooding
–When a link state packet comes in to a router for flooding, it is not queued for transmission immediately.
–Instead, it is put in a holding area to wait a short while in case more links are coming up or going down.
–To guard against errors on the links, all link state packets are acknowledged.
The packet buffer for router B
•Each row here corresponds to a recently arrived, but as yet not fully processed, link state packet.
•The send flags mean that the packet must be sent on the indicated link.
•The acknowledgement flags mean that it must be acknowledged there.
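The flooding rules of step 4 (sequence numbers plus aging) can be sketched as a small bookkeeping class. This is a simplified illustration: it leaves out the holding area and the acknowledgements, and the class name `LinkStateDB`, its methods, and the `max_age` default are assumptions made for the example.

```python
class LinkStateDB:
    """Tracks the newest link state packet seen from each source router."""

    def __init__(self, max_age=60):
        self.max_age = max_age
        self.seen = {}  # source -> [highest sequence number, remaining age]

    def receive(self, source, seq, arrived_on, links):
        """Return the links to forward the packet on (empty list = discard)."""
        entry = self.seen.get(source)
        if entry is not None and seq <= entry[0]:
            return []  # duplicate (equal) or obsolete (lower): discard
        self.seen[source] = [seq, self.max_age]
        # New packet: flood on every line except the one it arrived on.
        return [link for link in links if link != arrived_on]

    def tick(self, seconds=1):
        """Age all entries; drop the information once the age hits zero."""
        for source in list(self.seen):
            self.seen[source][1] -= seconds
            if self.seen[source][1] <= 0:
                del self.seen[source]


if __name__ == "__main__":
    db = LinkStateDB(max_age=3)
    lines = ["l1", "l2", "l3"]
    print(db.receive("A", 5, "l1", lines))  # new: forwarded on l2 and l3
    print(db.receive("A", 0, "l1", lines))  # A rebooted: rejected for now
    db.tick(3)                              # old entry times out
    print(db.receive("A", 0, "l1", lines))  # accepted again
```

The last three calls replay the crash scenario: a router that restarts at sequence number 0 is ignored only until its stale entry has aged out.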
5. Compute the shortest path to every other router.
•Once a router has accumulated a full set of link state packets, it can construct the entire network graph
•Dijkstra’s algorithm can be run locally to construct the shortest paths to all possible destinations.
•link state routing requires more memory and computation.
–For a network with n routers, each of which has k neighbors
–the memory required to store the input data is proportional to kn
–the computation time grows faster than kn
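Step 5 can be sketched with a standard heap-based Dijkstra over the graph assembled from the link state packets. The graph layout (a dict of neighbor-to-cost dicts) and the sample topology are assumptions for illustration only.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra's algorithm.

    graph: {router: {neighbor: link cost}} built from the link state packets.
    Returns (dist, prev): cost to each reachable router, and the predecessor
    of each router on its shortest path from the source.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


if __name__ == "__main__":
    graph = {
        "A": {"B": 2, "C": 5},
        "B": {"A": 2, "C": 1, "D": 4},
        "C": {"A": 5, "B": 1, "D": 1},
        "D": {"B": 4, "C": 1},
    }
    dist, prev = shortest_paths(graph, "A")
    print(dist)  # {'A': 0, 'B': 2, 'C': 3, 'D': 4}
    print(prev)  # {'B': 'A', 'C': 'B', 'D': 'C'}
```

Every router runs this locally on the same graph, so all routers arrive at consistent shortest paths without exchanging distance estimates.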
Hierarchical Routing
•The routers are divided into regions.
•Each router knows all the details about how to route packets to destinations within its own region but knows nothing about the internal structure of other regions.
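The saving from hierarchical routing is easy to quantify. The sketch below counts routing table entries per router, assuming equally sized regions and one entry per remote region; the function names and the 720-router layout are illustrative assumptions, not figures from this article.

```python
def flat_entries(routers: int) -> int:
    """Without hierarchy: one entry per router in the network."""
    return routers


def two_level_entries(regions: int, routers_per_region: int) -> int:
    """One entry per local router plus one per remote region."""
    return routers_per_region + (regions - 1)


def three_level_entries(clusters: int, regions_per_cluster: int,
                        routers_per_region: int) -> int:
    """Local routers, other regions in my cluster, and other clusters."""
    return routers_per_region + (regions_per_cluster - 1) + (clusters - 1)


if __name__ == "__main__":
    print(flat_entries(720))              # 720
    print(two_level_entries(24, 30))      # 53
    print(three_level_entries(8, 9, 10))  # 25
```

The price of the smaller table is that paths to other regions may be longer than the true shortest path, because each router sees a remote region only as a single destination.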
OSPF—An Interior Gateway Routing Protocol
•requirements
–Open
–support a variety of distance metrics (physical distance, delay)
–dynamic algorithm
–support routing based on type of service (not used in IP header)
–load balancing, splitting the load over multiple lines (do not send all packets over a single best route)
–support for hierarchical systems
–Security : prevent routers from sending false routing information
–Can deal with routers that were connected to the Internet via a tunnel.
–OSPF supports both point-to-point links (e.g., SONET) and broadcast networks (e.g., most LANs).
–support networks with multiple routers, each of which can communicate directly with the others (called multi-access networks)
1. Start with the network of one autonomous system (AS).
2. abstract actual networks, routers, and links into a directed graph
3. Use the link state method to have every router compute the shortest path from itself to all other nodes.
•Multiple paths may be found that are equally short.
•OSPF remembers the set of shortest paths and during packet forwarding, traffic is split across them.
•This helps to balance load. It is called ECMP (Equal Cost MultiPath).
4. OSPF allows an AS to be divided into numbered areas
•Internal Routers : Routers that lie wholly within an area.
•Backbone Routers:
–Every AS has a backbone area, called area 0.
–The routers in this area are called backbone routers.
–All areas are connected to the backbone
•Area border router:
–A router that is connected to two or more areas. It must also be part of the backbone.
–Its job is to summarize the destinations in one area and to inject this summary into the other areas
–Passing cost information allows hosts in other areas to find the best area border router to use to enter an area.
–Not passing topology information reduces traffic and simplifies the shortest-path computations
•AS boundary router:
–It injects routes to external destinations on other ASes into the area.
–The external routes then appear as destinations that can be reached via the AS boundary router with some cost.
–An external route can be injected at one or more AS boundary routers.
OSPF All in One
•Using flooding, each router informs all the other routers in its area of its links to other routers and networks and the cost of these links.
•This information allows each router to construct the graph for its area(s) and compute the shortest paths.
•The backbone area does this work, too.
•The backbone routers accept information from the area border routers in order to compute the best route from each backbone router to every other router.
•This information is propagated back to the area border routers, which advertise it within their areas.
•Using this information, internal routers can select the best route to a destination outside their area, including the best exit router to the backbone.
BGP – The Exterior Gateway Routing Protocol
•The goals of an intradomain protocol and an interdomain protocol are not the same.
–All an intradomain protocol has to do is move packets as efficiently as possible from the source to the destination.
–In contrast, interdomain routing protocols have to worry about politics
•A routing policy is implemented by deciding what traffic can flow over which of the links between ASes.
BGP – Transit Service
•The customer ISP can buy transit service from the provider ISP.
•Provider ISP
–It should advertise routes to all destinations on the Internet to the customer over the link that connects them
–so that the customer will have a route to use to send packets anywhere.
•Customer ISP
–the customer should advertise routes only to the destinations on its network to the provider.
–This will let the provider send traffic to the customer only for those addresses
–the customer does not want to handle traffic intended for other destinations.
BGP – Peering
•Suppose that AS2 and AS3 exchange a lot of traffic. If their networks are already connected, they can send traffic directly to each other for free.
•This policy is called peering.
•To implement peering, two ASes send routing advertisements to each other for the addresses that reside in their networks.
•AS2 can send AS3 packets from A destined to B and vice versa.
•Peering is not transitive.
–AS3 and AS4 also peer with each other.
–This peering allows traffic from C destined for B to be sent directly to AS4
–if C sends a packet to A, traffic will not pass from AS4 to AS3 to AS2, even though a physical path exists.
–Because AS3 is not paid to carry this traffic, it does not do so; it is AS1 that carries the packet from C to A.
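The transit and peering policies above boil down to a rule about which routes an AS advertises to which kind of neighbor. The sketch below is one common way to write that rule; the relationship labels and the function name are assumptions for illustration, and treating routes learned from customers like the AS's own routes goes slightly beyond what the text states.

```python
def should_export(learned_from: str, send_to: str) -> bool:
    """Decide whether to advertise a route to a neighbor.

    learned_from: "self" (own prefixes), "customer", "peer" or "provider"
    send_to:      "customer", "peer" or "provider"
    """
    if send_to == "customer":
        # A paying customer gets routes to every destination.
        return True
    # Peers and providers only hear about our own and our customers'
    # prefixes, so we never carry their traffic to third parties for free.
    return learned_from in ("self", "customer")


if __name__ == "__main__":
    print(should_export("provider", "customer"))  # True: transit service
    print(should_export("self", "peer"))          # True: peering
    print(should_export("peer", "peer"))          # False: not transitive
```

The last case is why peering is not transitive: an AS that learned a route from one peer does not pass it on to another peer, so it never becomes a free transit path between them.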
BGP – Multi-homing
•A is a stub network that is connected to the rest of the Internet by only one link, so it does not need to run BGP.
•Some company networks are connected to multiple ISPs, which is called multi-homing. They should run an interdomain routing protocol (e.g., BGP) to tell other ASes which addresses should be reached via which ISP links.
How BGP advertises routes
•Path vector protocol
•The path consists of the next hop router and the sequence of ASes, or AS path
•Pairs of BGP routers communicate with each other by establishing TCP connections.
•Carrying the complete path with the route makes it easy for the receiving router to detect and break routing loops.
–When a router receives a route, it checks to see if its own AS number is already in the AS path.
–If it is, a loop has been detected and the advertisement is discarded.
•BGP does not tell the difference between different ASes
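The loop check on the AS path can be sketched in a few lines. The AS numbers are taken from the range reserved for documentation, and the function names are made up for this example.

```python
def accept_route(my_asn: int, as_path: list) -> bool:
    """Discard an advertisement whose AS path already contains our own AS."""
    return my_asn not in as_path


def readvertise(my_asn: int, as_path: list) -> list:
    """Prepend our AS number before passing the route to the next AS."""
    return [my_asn] + list(as_path)


if __name__ == "__main__":
    path = [64497, 64498]              # path as received by AS 64496
    print(accept_route(64496, path))   # True: no loop
    sent = readvertise(64496, path)    # [64496, 64497, 64498]
    print(accept_route(64498, sent))   # False: 64498 sees itself in the path
```

Because every AS prepends its own number, a route that comes back around to an AS it has already crossed is recognized immediately and dropped.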
iBGP
•So far we have seen how a route advertisement is sent across the link between two ISPs.
•We still need some way to propagate BGP routes from one side of the ISP to the other, so they can be sent on to the next ISP.
•This task could be handled by the intradomain protocol (IGP), but because BGP is very good at scaling to large networks, a variant of BGP is often used.
•It is called iBGP (internal BGP) to distinguish it from the regular use of BGP as eBGP (external BGP).
iBGP rules
•Each BGP router may learn a route for a given destination from the router it is connected to in the next ISP and from all of the other boundary routers
•Each router must decide which route in this set of routes is the best one to use.
iBGP strategies
•routes via peered networks are chosen in preference to routes via transit providers.
•the default rule that shorter AS paths are better (debatable: a path through many small ASes can be shorter than a path through one large AS)
•prefer the route that has the lowest cost within the ISP. (early exit or hot-potato routing)
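The three strategies above can be read as a tie-breaking order. The sketch below applies them in the order listed; it is not the full BGP decision process, which has more steps, and the `Route` fields and function names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Route:
    via: str                  # "peer" or "transit": how the route was learned
    as_path: Tuple[int, ...]  # sequence of AS numbers to the destination
    igp_cost: int             # cost inside our own ISP to reach the exit router


def preference_key(route: Route):
    """Lower tuples win: peer before transit, then AS path length, then IGP cost."""
    return (0 if route.via == "peer" else 1, len(route.as_path), route.igp_cost)


def best_route(routes):
    """Pick the preferred route among all candidates for one destination."""
    return min(routes, key=preference_key)


if __name__ == "__main__":
    candidates = [
        Route("transit", (64496,), 1),
        Route("peer", (64497, 64498), 5),
        Route("peer", (64499,), 9),
        Route("peer", (64500,), 3),
    ]
    print(best_route(candidates))  # the peer route via AS 64500 with IGP cost 3
```

The final tie-break on the lowest internal cost is the early-exit (hot-potato) behavior: traffic leaves the ISP at the closest exit.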
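The Aggregation Layer switches are usually clustered as well, and they connect to the access layer switches via multi-chassis LACP.
The fourth layer is the access layer.
It consists of racks of servers, connected through TOR switches.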
Top of Rack (TOR) vs. End of Row (EOR)
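The fifth layer is called the storage layer.
Many data centers deploy a separate network for the storage system.
Machines connect to the SAN over iSCSI or Fibre Channel, and block storage is attached to them.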