System Design Interview

1 Proximity Service

1) requirements

Functional Requirements

  • Return all businesses based on the user's location (latitude and longitude pair) and radius (e.g. 5 km)
  • Business owners can use a RESTful API to add, update, or delete a business; changes do not need to be reflected in real time
  • Customers can view the details of a business

Non-Functional Requirements

  • Low latency: users should see nearby businesses quickly
  • Data privacy: location info is private; comply with GDPR & CCPA
  • High availability and scalability: the Proximity Service can handle the spike in traffic during peak hours

 

2) Basic Calculation

QPS

Seconds in a day = 24 * 60 * 60 = 86,400; round it up to 10**5 (the fifth power of ten).

Users: 100 million

Searches: 5 per user per day

QPS: 100 million * 5 / 10**5 = 5,000

Summary: QPS 5,000, users 100 million, businesses 200 million.

 

3) High Level Design

User API design

Users search for business.

  • Restful API
  • Pagination

Parameters.

  • latitude
  • longitude
  • radius

{

"radius": 10,

"business": [business Object]

}

GET /search/nearby
Business API design

Restful API.

  • GET
  • POST
  • PUT
  • DELETE
   
Data model

Read / Write Ratio

Read:

  • Read-heavy system

Write:

  • Writes are infrequent.

Schema

  • Geo index table (geohash and business_id)
  • business table (detail about business)

MySQL (or PostgreSQL)

 
Algorithms to find nearby businesses

1) Geohash

problems:

  • not enough businesses within the radius: 1) return the results as-is, or 2) remove the last character of the geohash to widen the search area (a sketch follows)
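A minimal sketch of option 2 above (widening the search by dropping the last geohash character). The geo index is modeled as an in-memory dict purely for illustration; a real system would query the geo index table or Redis by prefix.

# Sketch: widen a geohash search when too few businesses are found.
GEO_INDEX = {                       # geohash (precision 6) -> business ids (illustrative data)
    "9q8yyk": [1, 2],
    "9q8yym": [3],
    "9q8yyj": [4, 5, 6],
}

def query_by_prefix(prefix):
    """Return all business ids whose geohash starts with the prefix."""
    ids = []
    for gh, businesses in GEO_INDEX.items():
        if gh.startswith(prefix):
            ids.extend(businesses)
    return ids

def nearby_business_ids(user_geohash, min_results=5):
    prefix = user_geohash
    while prefix:
        ids = query_by_prefix(prefix)
        if len(ids) >= min_results:
            return ids
        prefix = prefix[:-1]        # shorter prefix = larger search area
    return query_by_prefix("")

if __name__ == "__main__":
    print(nearby_business_ids("9q8yyk"))   # keeps widening until enough ids are found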

 

2) Quadtree

 

   

 

4) design diagram

Load Balancer

receives requests (latitude, longitude, radius) from users

Location-based service

Responsibility

  • calculate the user's geohash and the neighboring geohashes
  • call Redis for nearby business_ids and business objects
  • calculate the distance to each business and filter by radius (a sketch follows)
  • rank the results to return
  • pagination
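For the distance and ranking step, a plain haversine calculation is enough. A small illustrative sketch (coordinates in degrees, radius in km):

import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def rank_businesses(user_lat, user_lng, businesses, radius_km):
    """businesses: list of (business_id, lat, lng). Filter by radius, sort by distance."""
    in_range = []
    for business_id, lat, lng in businesses:
        d = haversine_km(user_lat, user_lng, lat, lng)
        if d <= radius_km:
            in_range.append((d, business_id))
    return [b for _, b in sorted(in_range)]

print(rank_businesses(37.77, -122.42, [(1, 37.78, -122.41), (2, 38.0, -122.0)], radius_km=5))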

 

characteristics:

  • read-heavy service
  • QPS is high during the peak
  • the service is stateless so it is easy to scale horizontally

 

It is deployed as a multi-region service.

  • users are physically close to their local service
  • we can set up region-based DNS to comply with local laws and requirements
Business Service
  • write infrequently
Database Cluster
  • primary-secondary (replica) setup
  • data is saved in the primary database and then replicated to the replicas
  • some discrepancy between the replicas and the primary database is not an issue

Scale:

1) Business Table

  • Good for sharding

2) Geo index Table

 

  • no sharding (sharding is not a good choice because many rows share the same geohash key)
  • use read replicas

Redis Cluster

 

Caching is not strictly necessary, because reading the geo index from the database is fast enough.

We can still use a cache to handle the spike during peak hours and to enhance performance.

1) key: geohash, value: [business_list]

storage for values: 200 million * 32 bytes * 3 precisions ≈ 17 GB

storage for key: negligible.

 

we deploy this cache globally to ensure high availability

2) key: businessid, value: business_object

 

2 Nearby Friends

People's locations change frequently, whereas business locations are static.

1) requirements

Functional requirements

  • friends can see nearby friends; each entry in the list has a location and a timestamp
  • nearby friend lists should be updated every 30 seconds

Non-functional requirements

  • low latency: receive friends' location updates without much delay
  • reliability: occasional data loss is acceptable
  • eventual consistency: a few seconds of delay is acceptable

 

2) back-of-the-envelope estimation

distance 5 miles
location refresh interval 30 secs
users per day 100m
concurrent users 10m
a user's friends 400
number of people per page 20
QPS 10 m / 30s = 334k

 

3) high level design

 

Load Balancer: sits in front of the RESTful API servers and the bidirectional WebSocket servers
WebSocket Servers

  • servers update each user's nearby friend list
  • each client maintains one persistent WebSocket connection

Initialization for the 'nearby friends'

 

stateful.

  • Before a node can be removed, all existing connections should be allowed to drain.
  • No new WebSocket connections are routed to the draining server; once the old connections are closed, the old server is removed.
  • Releasing a new version of the application software on a WebSocket server requires the same level of care.
Restful API servers

Friendship management

A cluster of stateless HTTP servers that handles common request / response like, adding, removing, updating profile.

 

stateless. so we can auto-scale hardware.

Redis Location Cache

Location and TTL

key: userid

value: latitude, longitude, timestamp

 

QPS is 334k, but we can shard the location cache by user_id and spread the load among several Redis servers.

Assume only about 10% of the friends receiving each update are online.

calls: 334k * 400 * 10% = 14 m

Location history database: stores historical location data, used for data analysis.
User database

user & friends database 

{user_id, user_name, profile, url, etc}

 

relational database

easy to scale by sharding based on user_id.

Redis Pub/Sub Server
  • a GB of memory can hold millions of channels (topics)

 

Location updates received via the WebSocket server are published to the user's own channel.
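A minimal redis-py sketch of this publish/subscribe fan-out; the channel naming scheme, message format, and connection settings are assumptions for illustration.

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def publish_location(user_id, lat, lng, ts):
    """WebSocket server publishes a location update to the user's own channel."""
    message = json.dumps({"user_id": user_id, "lat": lat, "lng": lng, "ts": ts})
    r.publish(f"location:{user_id}", message)

def subscribe_to_friends(friend_ids, on_update):
    """A user's WebSocket connection subscribes to each friend's channel."""
    pubsub = r.pubsub()
    pubsub.subscribe(*[f"location:{fid}" for fid in friend_ids])
    for raw in pubsub.listen():               # blocking loop; one per connection
        if raw["type"] == "message":
            on_update(json.loads(raw["data"]))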

 

Memory Usage (100 million * 100 * 20 bytes = 200 GB, so 2 servers if one server has 100 GB of memory)

  • 100 million channels
  • a user has 100 friends using this feature
  • a pointer is 20 bytes

 

CPU usage (14m / 10k = 140 servers)

  • 334k updates/s * 400 friends * 10% online = 14 million calls per second
  • one server can handle 10k calls a sec.

Bottleneck of Redis Pub / Sub server is the CPU usage. not the memory usage.

We can use a hash ring to assign channels to specific servers (a sketch follows).
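A small consistent-hash-ring sketch for mapping channels to Pub/Sub servers; server names and the number of virtual nodes are illustrative.

import bisect
import hashlib

class HashRing:
    """Consistent hashing: map channel keys to servers with few moves on resize."""

    def __init__(self, servers, vnodes=100):
        self.ring = []                                  # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_server(self, channel):
        idx = bisect.bisect(self.keys, self._hash(channel)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["pubsub-1", "pubsub-2", "pubsub-3"])
print(ring.get_server("location:12345"))    # the same channel always maps to the same server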

scale:

If a node is replaced, users have to re-subscribe on the new node, so this cluster is stateful.

if we have to scale the cluster, we should

  • determine the new ring size, and if scaling up, provision enough new servers
  • update the keys of the hash ring with the new content
  • monitor the dashboard; expect some spike in CPU usage during the transition

 

users with many friends

A user can have at most 5,000 friends, so it is not a problem.

 

3 Google Maps

1) requirements

Functional requirements

  • User location update
  • Navigation service, including ETA service
  • Map rendering

Non-functional requirements

  • Accuracy: users should not be given the wrong directions
  • Smooth navigation: on the client side, users should experience very smooth map rendering
  • Data and battery usage: the client should use as little data and battery as possible; this is very important for mobile devices
  • General availability and scalability requirements

 

2) back of the envelope estimation

Map storage

50PB.

  • one map tile PNG is 100 KB
  • the entire tile set at the highest zoom level is 4.4 trillion * 100 KB = 440 PB
  • 80~90% of the world is ocean
  • so the storage size drops to a range of 44 to 88 PB; round it to 50 PB
server throughput

1 billion DAU

navigation: 35 minutes per user per week, which is about 5 billion navigation minutes per day

 

GPS coordinate updates (users update their coordinates every second): QPS 3 million

  • 5 billion minutes * 60 seconds / 10**5 = 3 million

batched GPS coordinates (sent every 15 seconds): 3 million / 15 = 200k

 

peak QPS = 1 million

 

 

3) Design

 

Location Service

 

Saved locations can be used to find new roads.

It is a write-heavy service, so Cassandra could be a good candidate.

 

Prioritize availability over consistency.

According to the CAP theorem we can only choose two of the three attributes, so we choose AP.

 

Updater services.

We can detect new roads and remove unused roads, improving the accuracy of our map. Use Kafka to stream location updates to these services.

 

 

navigation service

find a reasonably fast route from point A to point B

Calculation speed is not the top priority, but accuracy is critical.

The user sends an HTTP request to the navigation service through a load balancer.

 

Routing tiles (images made up of roads; map tiles are the background, e.g. buildings).

This dataset contains a large number of roads and associated metadata such as names, county, longitude and latitude. We run a periodic offline processing pipeline to capture new changes to the road data. The output is routing tiles.

  • 1) Output: each tile contains a list of graph nodes and edges.
  • 2) storage: in database, not in memory

 

Shortest-path service

Returns the top-k shortest paths without considering traffic or current conditions. This computation only depends on the structure of the roads. Caching the routes could be beneficial because the graph rarely changes. Algorithm: A*.

  • begins in the starting tile, then searches neighboring tiles until a set of best routes is found (a sketch follows)
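A compact A* sketch over a node/edge graph such as the one stored in routing tiles; the tiny graph and heuristic values below are made up purely for illustration.

import heapq

def a_star(graph, heuristic, start, goal):
    """graph: {node: [(neighbor, edge_cost), ...]}, heuristic: {node: estimated cost to goal}."""
    frontier = [(heuristic[start], 0, start, [start])]   # (f = g + h, g, node, path)
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for neighbor, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(frontier, (new_g + heuristic[neighbor], new_g, neighbor, path + [neighbor]))
    return None, float("inf")

graph = {"A": [("B", 2), ("C", 5)], "B": [("C", 1)], "C": []}
heuristic = {"A": 3, "B": 1, "C": 0}
print(a_star(graph, heuristic, "A", "C"))   # (['A', 'B', 'C'], 3)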

 

ETA service 

Gets the time estimate for each of the shortest paths.

 

If there is an incident in a routing tile, we need to find the affected users.

Original representation: the route as a list of tiles r_1, r_2, ..., r_n.

Alternative: represent the route as r_1, super(r_1), super(super(r_1)), ..., so we only need to check whether the last (largest) routing tile is affected.

 

 Map rendering service

Precomputed Map images

CDN (content delivery network)

Map tiles are precomputed and stored in the database. We calculate the geohash at the appropriate zoom level to fetch the map tiles.

 

scenarios of updating the map tiles:

  • The user is zooming and panning the map viewport on the client to explore their surroundings
  • During navigation, the user moves out of the current map tile into a nearby tile

 

 

  1. A user makes an HTTP request to fetch a tile from CDN
  2. If the CDN doesn't have a copy of the tile, it fetches it from the origin map database.
  3. The CDN serves data according to the user's location: map tiles are served from the point of presence (POP) nearest to the client

Each image covers a 200 m * 200 m area. Assume the user's speed is 30 km/h. An area of 1 km * 1 km needs 25 images (1 / 0.04), or 2.5 MB (100 KB * 25). In an hour we need 30 * 2.5 MB = 75 MB of data, i.e. 1.25 MB per minute.

 

Traffic through CDN.

5 billion navigation minutes per day * 1.25 MB per minute = 6.25 billion MB per day.

6.25 billion MB per day / 10**5 seconds = 62,500 MB per second. Assuming there are 200 POPs, each POP serves about 312.5 MB of data per second.

 

URL: Geohash. https://cdn.map-provider.com/tiles/9q9hvu.png

 

 

 

Instead of using a hardcoded client-side algorithm to convert a latitude/longitude (lat/lng) pair and zoom level to a tile URL, we could introduce a service as an intermediary whose job is to construct the tile URLs.

It returns 9 URLs (the current tile plus its 8 surrounding tiles).
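A sketch of such an intermediary tile-URL service. It assumes the python-geohash package for encode() and neighbors(); the zoom-to-precision mapping and the CDN URL pattern are illustrative assumptions, not a real provider's API.

# pip install python-geohash  (assumed dependency)
import geohash

ZOOM_TO_PRECISION = {1: 2, 5: 4, 10: 6, 15: 8}   # coarser zoom -> shorter geohash (illustrative)

def tile_urls(lat, lng, zoom):
    precision = ZOOM_TO_PRECISION.get(zoom, 6)
    center = geohash.encode(lat, lng, precision)
    tiles = [center] + geohash.neighbors(center)       # center tile + 8 surrounding tiles
    return [f"https://cdn.map-provider.com/tiles/{t}.png" for t in tiles]

print(tile_urls(37.7749, -122.4194, 10))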

 

 

 

4 Distributed Message Queue

1) requirements

Functional requirements

  • Producers send messages to a message queue
  • Consumers consume messages from a message queue
  • Messages can be consumed repeatedly or only once
  • Historical data can be truncated
  • Message size is in the kilobyte range
  • Ability to deliver messages to consumers in the order they were added to the queue
  • Data delivery semantics (at-least once, at-most once, or exactly once) can be configured by users

Non-functional requirements

  • High throughput or low latency, configurable based on use cases
  • Scalable: the system should be distributed in nature and able to support a sudden surge in message volume
  • Persistent and durable: data should be persisted on disk and replicated across multiple nodes

 

2) Design

 

 

Brokers

 

 

When the data volume is too large, we divide a topic into partitions (sharding). The servers that hold the partitions are called brokers (a broker holds partitions, not just a single partition).

  • Each broker maintains a FIFO mechanism for its partitions.
  • the position of a message in a partition is called the offset
Consumer Group  

We can group consumers by use cases, one group for billing and the other for accounting.

A single partition can only be consumed by one consumer in the same group.

coordination service

service discovery: which brokers are alive

leader election: one of the brokers is selected as  the active controller. Only one controller is responsible for assigning partitions 

data storage

WAL.  

  • write-heavy, read-heavy
  • no update or delete options
  • predominantly sequential read / write access

Disk performance of sequential access is very good

 

 

Divide a file into segments. With segments, new messages are appended only to the active segment file.

  • segment files of the same partition are organized in a folder named Partition-{partition_id} (a sketch follows)
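A sketch of the append-only segment idea: records always go to the active segment file, which rolls over once it reaches a size threshold. The folder layout and threshold are illustrative.

import os

SEGMENT_MAX_BYTES = 1024 * 1024          # roll over to a new segment at 1 MB (illustrative)

class PartitionLog:
    """Append-only log for one partition, split into numbered segment files."""

    def __init__(self, folder):
        self.folder = folder
        os.makedirs(folder, exist_ok=True)
        self.segment_id = 0

    def _active_path(self):
        return os.path.join(self.folder, f"segment-{self.segment_id:08d}.log")

    def append(self, record: bytes):
        path = self._active_path()
        if os.path.exists(path) and os.path.getsize(path) >= SEGMENT_MAX_BYTES:
            self.segment_id += 1             # the full segment becomes read-only
            path = self._active_path()
        with open(path, "ab") as f:          # new messages are appended only
            f.write(record + b"\n")

log = PartitionLog("Partition-0")
log.append(b"message-1")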

 

Use RAID to achieve hundreds of MB/sec of read and write speed.

 

Message data structure

  • key: used to determine the partition of the message via hash(key) % numPartitions; it is not the partition id itself
  • value: plain text or a binary block
  • topic
  • partition: the partition id
  • offset
  • size
  • ts: timestamp
  • CRC: cyclic redundancy check, used to ensure the integrity of the raw data (a sketch follows)
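A sketch of this message structure and of how the key selects a partition; the CRC32-based hash is an illustrative choice, not a prescribed one.

from dataclasses import dataclass
import time
import zlib

@dataclass
class Message:
    key: bytes            # used only for partition selection, not the partition id itself
    value: bytes
    topic: str
    partition: int
    offset: int
    size: int
    ts: float
    crc: int              # CRC32 over the value to detect corruption

def select_partition(key: bytes, num_partitions: int) -> int:
    # hash(key) % numPartitions; crc32 used as a stable, illustrative hash
    return zlib.crc32(key) % num_partitions

def build_message(key, value, topic, num_partitions, offset):
    partition = select_partition(key, num_partitions)
    return Message(key, value, topic, partition, offset,
                   size=len(value), ts=time.time(), crc=zlib.crc32(value))

msg = build_message(b"user-42", b"hello", "events", num_partitions=8, offset=0)
print(msg.partition, msg.crc)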

 

Batching.

  • A trade off between throughput and latency

 

Each partition has multiple replicas. 

In-sync replicas (ISR). ISR reflects the trade-off between performance and durability.

  • a slow replica causes the whole service to become slow
  • you can set ACK = ALL or 1 ....

 

The number of replicas of a partition is also a trade-off.

 

Add one broker.

 Producer  
  • Routing is wrapped into the producer.
  • A buffer layer can improve the throughput by sending batches in a single request.
  • Producer + Buffer + Routing = Producer Client

 

Message delivery semantics.

1) At-most-once

ACK = 0

 

2) At-least-once

ACK = 1 or ACK = all

3) Exactly once

Has a high cost for the system. (financial-related use cases.)

 

Delayed message

 Consumer

Push vs Pull (should the broker push data to consumers, or should consumers pull data from the broker?)

Push:

  • Low latency
  • Consumers will be overwhelmed if they can not process messages in time

Pull:

  • Consumers control the consumption rate
  • Suitable for batching
  • When there is no message in the broker, a consumer might still keep pulling data.

We prefer the pull model (a sketch of the pull loop follows).
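A sketch of the pull loop, with an in-memory stand-in for the broker so the example runs on its own; the idle sleep handles the "no message" case noted above.

import time

class InMemoryBroker:
    """Stand-in for one partition on a broker, just to make the pull loop runnable."""
    def __init__(self):
        self.log = ["m1", "m2", "m3"]

    def poll(self, offset, max_messages=10):
        return self.log[offset:offset + max_messages]

def consume(broker, start_offset=0, idle_sleep=1.0, max_polls=5):
    offset = start_offset
    for _ in range(max_polls):                 # bounded here; a real consumer loops forever
        batch = broker.poll(offset, max_messages=10)
        if not batch:
            time.sleep(idle_sleep)             # avoid busy-polling an empty partition
            continue
        for message in batch:
            print("processed", message)
        offset += len(batch)                   # commit the new offset after processing

consume(InMemoryBroker())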

 

New consumer joins.

 

Existing consumer leaves.

 

Existing consumer crashes.

 State storage

Stores:

  • Mapping between partitions and consumers
  • The last consumed offsets of each consumer group for each partition
Metadata storage: stores configuration and properties of topics

 

5 Metrics Monitoring and Alerting System

 

1) requirements

Functional

  • CPU usage
  • request count
  • memory usage
  • message count in message queues

Non-functional

  • scalability: accommodate growing metrics and alert volume
  • low latency: low query latency for dashboards and alerts
  • reliability: avoid missing critical alerts
  • flexibility: easily integrate new technologies
  • data retention policy: raw form for 7 days, 1-minute resolution for 30 days, 1-hour resolution for 1 year

 

2) back-of-the-envelope estimation

DAU 100 million
Metrics

10 million metrics

  • 1000 server pools
  • 100 machines per pool
  • 100 metrics per machine
data retention 1 year
   
   

 

3) design

 

data storage system

(time series DB)

InfluxDB

  • write-heavy DB
  • calculate time-series metrics so the db should support time-series calculation or window queries
  • efficient aggregation and analysis
  • builds indexes on labels to facilitate the fast lookup

 

Data encoding and compression can significantly reduce the size of the data

 

Modern ts DB has its own cache layer and query service. 

Metrics sources: application servers, databases, message queues, etc.
Metrics collector

Gathers metrics data and writes it to the time-series DB. Collectors run as a cluster of servers.

Two ways to collect metrics: pull vs push (there is no definitive answer as to which is better).

1) Pull

 

pros 

  • easy debugging: pull metrics at any time from any machine
  • health checks are easy

2) push

 

pros

  • works for short-lived batch jobs that might finish before being pulled
  • can receive data from anywhere
  • can use UDP, which gives low latency (a sketch follows)
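A sketch of a push-style client emitting one metric over UDP; the collector address and the JSON payload format are assumptions for illustration.

import json
import socket
import time

COLLECTOR_ADDR = ("127.0.0.1", 8125)        # illustrative collector host/port

def push_metric(name, value, tags=None):
    """Fire-and-forget UDP push: low latency, but delivery is not guaranteed."""
    payload = json.dumps({
        "metric": name,
        "value": value,
        "tags": tags or {},
        "ts": int(time.time()),
    }).encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, COLLECTOR_ADDR)
    sock.close()

push_metric("cpu.usage", 0.73, {"host": "web-01"})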
Query service

makes it easy to query data from ts db.

This is where aggregation happens

 


Cache layer

To reduce the load of the time-series database and make query service more performant.

Cache layer is used to store query result. 

 


Alerting system  
  1. Config, yaml. Load configs to cache
  2. Manager reads configs from cache
  3. Manager call queries and check if the result violates the threshold.
  4. the alert store (e.g. Cassandra) keeps the state of alerts (inactive, pending, firing, resolved)
  5. alerts are pushed to Kafka
  6. alert consumers pull alerts from Kafka
  7. consumers process alerts and send notifications
Visualization system: Grafana
Consumers 

Spark, Flink, Storm, etc. consume and process the metrics data.

Kafka

Kafka decouples the data collection and data processing services.

 

6 Ad Click Event Aggregation

1) requirements

Functional Non-functional
aggregate the number of clicks of ad_id in the last M minutes correctness of the aggregation result is important
return the 100 most clicked ad_ids every minute properly handle delayed or duplicate events
support aggregation filtering by different attributes robustness (fault tolerance)
  latency requirement: a few minutes at most

 

2) back-of-the-envelope estimation

DAU 1 billion
Daily clicks

1 billion

  • 1 user 1 click.
QPS

10k.

  • 1 billion / 10**5
Peak QPS 50k
Capacity of storage

100GB

  • 0.1 KB per click
  • 0.1 KB * 1 billion = 100 GB per day
  • ~3 TB per month
Number of ads

2 million

Business grows

30% / year

 

3) API Design

API-1: Aggregate the number of clicks in last M minutes 

API:

Request:

Response:

API-2: Top-N most clicked ad-ids in the last M minutes

 API:

Request:

Response:

 

 

3) Aggregation design

 

 

Message 

Q1

If we don't use a message queue, the aggregation service could be overwhelmed when traffic is heavy.

The message queue decouples the aggregation service from the raw-data write path.

{ad_id, click_timestamp, user_id, ip, country}

 

Q2

{ad_id, click_minute, count}

 

Duplicate events can cost millions of dollars, so the delivery semantics should be exactly-once.

 

We can save the consumer offset in external storage such as S3, so that progress is not lost if an ack fails.

 

Scalability.

1) producer: easy

2) consumer: hundreds of consumers; rebalance consumers during off-peak hours.

3) brokers: it is better to pre-allocate enough partitions.

Raw data db

write heavy. Cassandra or InfluxDB

{ad_id, click_timestamp, user_id, ip, country}

Aggregation db

write heavy. 

{ad_id, click_minute, filter_id, count}

Aggregation service

We need to calculate metrics and normalize the raw data; the aggregation database does not store raw data.

Map node

 

We need map nodes because Kafka may send clicks for the same ad_id to different partitions.

 

Aggregation Node

Partitions different results to different nodes; it is part of the reduce stage.

 

Reduce Node

 

 

 

 

We use event time (event_ts) because it is more meaningful for business analysis.

We use a watermark (an extended window) to handle late data. It is a trade-off: a longer watermark is more accurate but adds more latency.

 

Aggregation window

Window types: tumbling (fixed), hopping, sliding, and session windows. We use tumbling windows (a sketch follows).
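A sketch of tumbling-window counting keyed by event_ts, with a watermark for late events; the window size and allowed lateness are illustrative.

from collections import defaultdict

WINDOW_SECONDS = 60          # tumbling (fixed) 1-minute windows
ALLOWED_LATENESS = 15        # accept events up to 15 s late before closing a window

class TumblingAggregator:
    def __init__(self):
        self.counts = defaultdict(int)   # (ad_id, window_start) -> click count
        self.max_event_ts = 0

    def on_event(self, ad_id, event_ts):
        window_start = event_ts // WINDOW_SECONDS * WINDOW_SECONDS
        self.counts[(ad_id, window_start)] += 1
        self.max_event_ts = max(self.max_event_ts, event_ts)

    def flush_closed_windows(self):
        """Emit windows whose end is before the watermark (max event time minus allowed lateness)."""
        watermark = self.max_event_ts - ALLOWED_LATENESS
        closed = [k for k in self.counts if k[1] + WINDOW_SECONDS <= watermark]
        return {k: self.counts.pop(k) for k in closed}

agg = TumblingAggregator()
for ad_id, ts in [("ad1", 3), ("ad1", 59), ("ad2", 61), ("ad1", 130)]:
    agg.on_event(ad_id, ts)
print(agg.flush_closed_windows())   # {('ad1', 0): 2} -- only window [0, 60) is closed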

 

Scale:

Deploy aggregation service on Apache Hadoop Yarn. It is easy to add computing resources.

 

Hotspot:

Allocate more resources to the aggregation calculation.

 

Fault tolerance

We can recover from a snapshot if a node goes down. If there is no snapshot, we replay the events from Kafka and re-aggregate.

 

Monitoring:

  • Latency
  • Message queue size
 Batching vs Streaming

We use both.

Streaming for the aggregation calculation.

Batching for storing historical data.

 

Two architecture type.

Lambda vs Kappa.

Lambda

Two layers.

Kappa

Only one layer. We choose Kappa

Recalculation service

It reuses the aggregation service but reads from the raw data database.

Reconciliation service

It checks that the results derived from the raw data match the aggregated data.

 

7 Hotel Reservation System

also the same topics of  ticket booking system

1) requirements

functional non-functional
show the hotel-related pages support high concurrency during some peak season. some popular hotels may have a lot of customers trying to book the same room
show the room-related detail page moderate latency
reserve a room  
admin panel to add/remove/update hotel or room info
support the overbooking feature  

 

2) back-of-the-envelope estimation

hotels and rooms 5000 hotels and 1 million rooms
rooms occupied and average stay duration 70% of the rooms are occupied and duration is 3 days
daily reservation 1 m * 70% / 3 = 240k
reservations per second 240k / 10**5 ≈ 3
QPS

Assume 10% of users reach the next step at each stage; we can work backward from the reservation QPS to estimate the QPS of the earlier pages.

 

 

3)design

API

They are all RESTful APIs

Hotel-related APIs

 data model

We choose a relational DB because it is a read-heavy service and a relational database provides ACID guarantees, which are important for a reservation system.

1) hotel db

hotel_id, name, address, location

2) room db

room_id, room_type_id, floor, number, hotel_id, name, is_available

3) rate_db

hotel_id, dt, rate

4) reservation

reservation_id, hotel_id, room_type_id, start_date, end_date, status, guest_id

5) room_type_inventory

hotel_id, room_type_id, date, total_inventory, total_reserved

 

status: canceled,  paid -> refunded, rejected.

 

If the reservation data is too large for a single database, what would you do? 

  • store only current and future reservation data. Reservation history is not frequently accessed. So they can be archived and some can even be moved to cold storage.
  • DB sharding. The most frequent queries include making a reservation or looking up a reservation by name. So we can shard db by hotel_id.

When QPS is high, assume 30k, and the database is sharded into, say, 16 shards, each shard handles 30,000 / 16 ≈ 1,875 QPS.

hotel management

It is only for hotel staff. These are internal microservices, accessed via RPC, etc.

 

hotel reservation

concurrency issues

1)The same user clicks on the book button multiple times.

 

solution: idempotent APIs. Add an idempotency key in the reservation API request.

In this design we can use reservation_id as the idempotency key: since reservation_id is the primary key, the same reservation cannot be inserted into the DB twice (a sketch follows).
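A minimal SQLite sketch of this idempotency idea: with reservation_id as the primary key, a repeated submit fails the insert instead of creating a second reservation. The schema is trimmed for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reservation (reservation_id TEXT PRIMARY KEY, guest_id INTEGER, status TEXT)")

def place_reservation(conn, reservation_id, guest_id):
    """reservation_id doubles as the idempotency key: a double click inserts the same key."""
    try:
        conn.execute("INSERT INTO reservation VALUES (?, ?, 'pending')", (reservation_id, guest_id))
        return "created"
    except sqlite3.IntegrityError:
        return "duplicate request ignored"     # the primary key rejects the second insert

print(place_reservation(conn, "res-123", 42))   # created
print(place_reservation(conn, "res-123", 42))   # duplicate request ignored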

 

2) Multiple users try to book the same room at the same time 

SQL has Two parts:

  1. check room inventory (select)
  2. reserve rooms. (update)

If the isolation level is not serializable, user 1 and user 2 can both book the same room at the same time.

So we need a locking mechanism.

Solutions:

  1. pessimistic locking: useful when contention is heavy, but deadlocks may happen and other requests cannot access the room while it is locked; we don't use this
  2. optimistic locking: add a version column; read the version number first, and increment it on every update, so a conflicting update whose version number is stale cannot commit. When concurrency is high, performance is poor because only one user can reserve a room successfully at a time; it is a good option when QPS is not high (see the sketch below)
  3. database constraints: if the condition we set is violated, the transaction rolls back; we use this when data contention is not high
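A sketch of option 2 (optimistic locking with a version column; the inventory check also acts like a database constraint), using SQLite so it runs standalone. Table and column names follow the schema above; the seeded row is illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE room_type_inventory (
        hotel_id INTEGER, room_type_id INTEGER, date TEXT,
        total_inventory INTEGER, total_reserved INTEGER, version INTEGER
    )
""")
conn.execute("INSERT INTO room_type_inventory VALUES (1, 2, '2024-05-01', 10, 9, 0)")

def reserve_room(conn, hotel_id, room_type_id, date, expected_version):
    """Optimistic locking: the UPDATE succeeds only if the version is unchanged
    and inventory remains; otherwise the caller re-reads and retries."""
    cur = conn.execute(
        """
        UPDATE room_type_inventory
           SET total_reserved = total_reserved + 1,
               version = version + 1
         WHERE hotel_id = ? AND room_type_id = ? AND date = ?
           AND version = ?
           AND total_reserved < total_inventory
        """,
        (hotel_id, room_type_id, date, expected_version),
    )
    return cur.rowcount == 1

print(reserve_room(conn, 1, 2, "2024-05-01", expected_version=0))  # True: last room taken
print(reserve_room(conn, 1, 2, "2024-05-01", expected_version=0))  # False: stale version / sold out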

    

 If scalability is an issue, we can add a cache layer. It can improve the performance. But maintaining data consistency between the database and cache is hard.

 

 

If the reservation service and the inventory service are separate microservices, each with its own database, inconsistency can arise: if a reservation fails, the inventory change has to be rolled back.

Solution

  • 2PC: a database protocol used to guarantee atomic transaction commit across multiple nodes.
  • Saga

 

8 Distributed Email Service

1) requirements

Functional Non-functional
Authentication

Reliability

not lose email

Send and receive emails

Availability

Email and user data should be automatically replicated across multiple nodes

Fetch all emails

Scalability

As the number of users grows, the system should be able to handle the increasing number of users and emails

Filter emails by read and unread status

Flexibility and extensibility

Easy to add new components

Search emails by subject, sender, and body
Anti-spam and anti-virus  

 

2) Back-of-the-envelope estimation

users 1 billion
QPS

100k.

  • 1 person 10 emails
  • 10 ** 9 * 10 / 10**5
number of emails a person receives

40.

size of an email is 50 KB

storage of emails for 1 year 730PB
storage for attachments in 1 year 

1460pb

  • 1 billion
  • 40 emails / day
  • 365 days
  • 20% have attachments
  • 500kb per attachment

 

3) design

traditional mail servers

In a traditional mail server, emails were stored in local file directories and each email was stored in a separate file with a unique name. Each user maintained a user directory. This worked well when the user base was small, but disk I/O becomes a bottleneck at scale.

 

In this design we focus on the HTTP protocol, with which we build a web-based distributed email service.

 

Email protocols

1) SMTP

send emails

2) POP

receive and download emails; once emails are downloaded to your device, they are deleted from the server.

3) IMAP 

receive emails. not deleted. So you can access emails from different devices.

4) HTTPS.

not a mail protocol itself, but what web-based email clients use.

API

GET /v1/folders/{:folder_id}/messages.

  • return all messages under a folder

GET /v1/messages/{:message_id}

  • Get all information about a specific message

Response:

{user_id, from, to, subject, body, is_read}

web servers

all email API requests go through the web servers.

 

Email sender reputation. 

  • Warm up new email server IP addresses slowly to build a good reputation with big providers such as Office 365, Gmail, Yahoo Mail, etc.

 

Email authentication  

  • To combat phishing, SPF, DKIM, DMARC are some common techniques
real-time servers

pushing new email updates to clients in real time.

  • stateful
  • use websocket
metadata database

mail subject, body, from, to.

A relational DB is not a good choice:

  • it cannot easily store this amount of data
  • it is not easy to index HTML email bodies efficiently

 

NoSql

  • Bigtable. not open source.
  • Cassandra not a good choice.

 

Custom Design (good choice)

  • A single column can be a single-digit number of MB
  • strong data consistency
  • high I/O
  • It should be highly available and fault-tolerant
  • easy to create incremental backups

 

support queries

1) get all folders of a user

{user_id, folder_id, folder_name}

all rows for one user_id live in one partition

 

2) display all emails for a specific folder

{user_id, folder_id, email_id, from, subject, preview, is_read}

 

3) create / delete / get a specific email

{user_id, email_id, from, to, subject, body, attachments}

{email_id, filename, url}

 

4) fetch all read or unread emails

divide emails to read_emails and unread_emails

 

Distributed databases that rely on replication for high availability must make a fundamental trade-off between consistency and availability. In the event of a failover, the mailbox isn't accessible by clients, so their sync / update operations are paused until the failover ends.

 

Users communicate with a mail server that is physically closer to them in the network topology.

 

Attachment store: S3. We cannot use NoSQL because it is hard to put large attachments in a DB like Cassandra.
Distributed cache: caching recent emails in memory significantly improves load time.
search store

supports full-text searches. document store.

 

Email search has quite different characteristics compared to Google search:

  • Google search: the scope is the whole internet, results are sorted by relevance, and indexing generally takes time
  • Email search: the scope is the user's own emails, results are sorted by attributes such as time, indexing should be near real-time, and results have to be accurate

 

Options: Elasticsearch, or native search embedded in the database.

1) Elasticsearch

One challenge of adding Elasticsearch is keeping our primary email store in sync with it.

 

2) custom search solution

The main bottleneck of the index server is usually disk I/O. To support an email system at Gmail or Outlook scale, it might be a good idea to have native search embedded in the database.

Email sending flow
  1. RESTFUL API request
  2. load balancer check the traffic limit
  3. validate the email. If the sender is also the recipient (the same person) and they can get emails via the RESTful API, we don't need to go to step 4: the email is inserted into the sent folder and the recipient's folder.
  4. send the email to the outgoing q or to the error q
  5. SMTP outgoing workers pull messages from outgoing q
  6. The email is stored in sent folder
  7. SMTP sends the email to the recipient's mail server
 Email receiving flow  
  1. Load balancer
  2. the load balancer distributes traffic among SMTP servers
  3. If email is too large we put it into s3
  4. incoming q. q decouple the processing server from SMTP server
  5. Mail processing server check spam, etc
  6. emails are stored in db
  7. if recipient is online, move it to real-time server
  8. the client gets the email via WebSocket
  9. or recipients get emails via web servers

 

9 S3-like Object Storage

Storage systems fall into three broad categories.

  • Block storage. ( common storage devices are all considered as block storage)
  • File storage. It is built on top of block storage. It provides a higher-level abstraction to make it easier to handle files and directories
  • Object storage. It is mainly used for archival and backup. It sacrifices performance for high durability, vast scale, and low cost.

  • Bucket. A logical container for objects. The bucket name is globally unique. To upload data to S3,  we must first create a bucket
  • Object. An individual piece of data we store in a bucket. Objects can be any sequence of bytes we want to store.
  • Versioning. A feature that keeps multiple variants of an object in the same bucket. This feature enables users to recover objects that are deleted or overwritten by accident.
  • Uniform Resource Identifier(URI). The object storage provides RESTful APIs to access its resources. It is unique
  • Service-level agreement (SLA). A service-level agreement is a contract between a service provider and a client. 

One of the main differences between object storage and the other two types of storage systems is that the objects stored inside of object storage are immutable. We can delete and replace them. But we cannot make incremental changes.

1) requirements

Functional Non-functional
Bucket creation 100 PB of data / one year
Object uploading and downloading Data durability is 6 nines. (99.9999%)
Object versioning Service availability is 4 nines.
Listing objects in a bucket (similar to the aws s3 ls command)  Reduce costs while maintaining a high degree of reliability and performance

 

2) back-of-the-envelope estimation

Objects
  • 20% are smaller than 1 MB
  • 60% are between 1 MB and 64 MB
  • 20% are larger than 64 MB

number of objects = 10**11 MB * 0.4 / (0.2 * 0.5 + 0.6 * 32 + 0.2 * 200) MB ≈ 0.68 billion objects

  • assuming 40% storage usage

metadata: 1 KB per object, so we need about 0.68 TB for metadata.

 

3) design

 

API Service

We could use object URI to retrieve object data. The object URI is the key and object data is value.

Request:

GET /bucket1/object1.txt HTTP/1.1

Response:

HTTP/1.1 200 OK

Content-length: 4567

 

It is stateless so it can be horizontally scaled.

 

For example when we upload a file to s3, steps are

  • The client sends an HTTP PUT request to create a bucket named bucket-to-share
  • API service calls the IAM to ensure the user is authorized and has write permissions
  • API service calls the metadata store to create an entry with the bucket info.
  • Once validation succeeds, the API service sends the object data in the HTTP PUT payload to the data store.  The data store persists the payload as an object and returns the UUID of the object. 
  • The API service calls the metadata store to create a new entry in the metadata database. It contains important metadata such as the object_id (UUID), bucket_id, object_name, etc.

 

data store 

Similar to unix file system

 

All data-related operations are based on object ID. 

 

1 Data routing service

  • reads and writes data from/to the data nodes
  • queries the placement service to get the best data node

2 Placement service

  • chooses which node to store an object; it maintains a virtual cluster map. This separation is key to high durability.

3 Data node

  • stores the actual object data. It ensures reliability and durability by replicating data to multiple data nodes, also called a replication group

 

How is data persisted in the data node?

  • the API service forwards the object data to the data store
  • the routing service generates a UUID for the object and finds a data node for it
  • the data is sent to that node directly
  • the node saves the data locally and replicates it to two secondary data nodes
  • the UUID is returned to the API service

Small files

1) How data is organized?

If too many small size files:

  • too many small files on a file system waste many data blocks. A file system stores files in discrete disk blocks. Disk blocks have the same size. 
  • Performance is bad
  • It could exceed the system's inode capacity; the number of inodes is fixed.

Storing small objects as individual files does not work well in practice. To address this issue, we can merge many small objects into a larger file. It is like a WAL: usually an object is appended to an existing read-write file, and when the read-write file reaches its capacity threshold, it is marked as read-only.

2) lookup object

We deploy a relational database to support lookups. The workload is read-heavy, so a relational database is a good choice. Mapping data is isolated within each data node, so we can simply deploy a small relational database on each data node.

object mapping table:

{object_id, file_name, start_offset, object_size}
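A sketch of the small-object merge and lookup: each object is appended to the large read-write file, and (file_name, start_offset, object_size) is recorded in the mapping table, modeled here as an in-memory dict; the file name is illustrative.

import os

DATA_FILE = "rw-segment-0.dat"                # the current read-write file
object_mapping = {}                           # object_id -> (file_name, start_offset, object_size)

def put_object(object_id, data: bytes):
    """Append the object to the large file and record where it lives."""
    start_offset = os.path.getsize(DATA_FILE) if os.path.exists(DATA_FILE) else 0
    with open(DATA_FILE, "ab") as f:
        f.write(data)
    object_mapping[object_id] = (DATA_FILE, start_offset, len(data))

def get_object(object_id) -> bytes:
    file_name, start_offset, size = object_mapping[object_id]
    with open(file_name, "rb") as f:
        f.seek(start_offset)
        return f.read(size)

put_object("uuid-1", b"hello")
put_object("uuid-2", b"world")
print(get_object("uuid-2"))                   # b'world'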

Identity and access management (IAM): the central place to handle authentication, authorization, and access control.
Metadata store

Objects and metadata stores are just logical components, and there are different ways to implement them. 

Durability

Hardware failure

  • Let's assume the spinning hard drive has an annual failure rate of 0.81%. Making 3 copies of the data gives us 1 - 0.0081**3 ≈ 0.999999 reliability.

Domain failure

  • we place the data nodes in different failure domains (e.g. different data centers)

 

With replication, write performance is good, read performance is good, and little extra compute is needed.

 

Another option: erasure coding.

  • It chunks data into smaller pieces and creates parity blocks for redundancy. We can use the remaining data chunks and parities to reconstruct lost data.

Example

The coding matrix is fixed:

[1,3,5]

[4,6,7]

We have 3 data chunks d1, d2, d3 on 3 nodes, which produce 2 parity chunks p1, p2, so we can recover the data even if we lose up to 2 nodes. Compared with replication, erasure coding offers more durability per byte and better storage efficiency, at the cost of extra computation.

Scalability

1) Bucket table

  • 1 user has 10 buckets
  • 1 kb / per record
  • 1 million users

That means we need 10 GB of storage space. 10 * 1m * 1kb = 10GB. Storage is not a problem. But a single db server doesn't have enough CPU or network bandwidth to handle requests.  So we can spread the read load among multiple database replicas.

 

2) object table

shard the table by <bucket_name, object_name>.

 

3) distributed databases 

a listing query that spans different partitions is complicated:

select * from metadata where bucket_id = "123" and object_name like 'a/b/%'

order by object_name offset 10 limit 10;

we have to track a lot of offsets in all shards.

 

we can support object listing with sub-optimal performance. we can denormalize the listing data into a separate table sharded by bucket. 

Object versioning: the metadata table has a column named version.
Uploading large objects

Slice a large object into smaller parts and upload them independently (multipart upload).

 

 garbage collection

 targets

  • lazy object deletion. An object is marked as deleted.
  • Orphan data. half uploaded data or abandoned multipart uploads
  • corrupted data

The garbage collector does not remove objects from the data store right away; deleted objects are periodically cleaned up by a compaction mechanism.

 

10 Real-time Gaming Leaderboard

1) requirements

Functional Non-functional
display top 10 players on the leaderboard real-time update on scores
show a user's specific rank score update is reflected on the leaderboard in real-time
display players who are four places above and below the desired user general scalability, availability, and reliability requirements

 

2) back-of-the-envelope estimation

DAU

5 million

5 million / 10**5 = 50 users per second

peak load: 500 per second

 

3) Design

API Design  We can build this service by ourselves or on cloud (serverless).
Message Q

If there are other consumers of the score data apart from the leaderboard service, we can use a message queue.

 Data models

1) relational database solution

A rank operation over millions of users is not acceptable because it is not performant.

select row_number() over(order by points desc) from db

 

2) redis

Redis provides a potential solution to our problem. Redis has a specific data type called sorted sets that is ideal for leaderboard design problems.

 

Key operations: update, query, and range query.

  • ZADD
  • ZINCRBY
  • ZRANGE
  • ZRANK

 

Storage requirement:

  • 25 million users
  • store user_id, score
  • user_id 24 bytes. 
  • score is 2 bytes
  • 26 bytes * 25 million = 650 MB

One Redis server is more than enough to hold the data.

Peak QPS is 2,500/s; one server is also enough to serve the queries.

 

When the Redis server fails, we can rebuild the data from the relational database.

 

Scaling the server

1) assumptions

  • 500 million DAU (100x of 5 million)
  • storage: 650 MB * 100 = 65 GB
  • peak QPS: 2,500 * 100 = 250,000

Data sharding.

  • by score range, e.g. 1~100, ..., 900~1000
  • by hash.

With range-based sharding, retrieval is simple, but we need to relocate data when scores change.

 

by hash.

  • we can compute the hash slot of a given key. This allows us to add and remove easily without redistributing all the keys

Compute the top 10 users.

This approach has a limitation: to compute the global top 10 we must gather partial results from every shard, so we have to wait for the slowest node.

 

Sizing the node

allocate twice the amount of memory for write-heavy applications. 

 

3) Nosql

we can use cassandra, etc.

 

11. Payment Service 

 1) requirements

Functional  Non-functional

Pay-in flow:

the payment system receives money from customers on behalf of sellers

Reliability and fault tolerance. Failed payments need to be carefully handled. 

Pay-out flow:

payment system sends money to sellers around the world

A reconciliation process between internal services and external services.

 

2) back-of-the-envelope estimation

DAU 1 million
QPS 1 million / 10**5 = 10
   
   
   

 

3) design

pay-in flow

 

  1. a user clicks "place order". a payment event is generated
  2. payment service stores the payment event in db
  3. sometimes an event may have several payment orders, it calls payment executor for each payment order
  4. payment executor stores a payment order in db
  5. payment executor calls an external PSP to process the payment order
  6. if the PSP call succeeds, the payment service updates the wallet db
  7. after the wallet service successfully updates the db, the payment service calls the ledger to record the transaction
  8. the ledger appends the new ledger information to the db
Payment service

accepts payment events from users and coordinates the payment process.

First thing it usually does is a risk check

Payment executor Executes a single payment order via a Payment Service Provider (PSP). A payment event may contain several payment orders
PSP

Moves money from A to B. If a company cannot store sensitive personal information such as credit card numbers, it can use a PSP. The PSP provides a hosted payment page to collect card payment details.

Hosted payment flow

  1. the user clicks the checkout button; the client calls the payment service
  2. the payment service creates a nonce (a UUID that ensures exactly-once registration); it is also the ID of the payment order
  3. PSP returns a token back to payment service. A token is a UUID on the PSP side
  4. payment service stores the token in db before calling the PSP-hosted payment page
  5. once the token is persisted, the client displays a PSP-hosted payment page. PSP provides a js library that displays the payment UI, collects sensitive payment information, and calls the PSP directly to complete the payment. It never reaches our payment system. The HPP needs two pieces of information. (1) token. (2) redirect URL. 
  6. user fills in the payment details on PSP's web page.
  7. PSP returns the payment status
  8. The web page is now redirected to the redirect URL.
  9. Asynchronously, the PSP calls the payment service with the payment status via a webhook; we update the payment_order_status field in the payment order db table.

 

When a payment is delayed:

Mark the status as PENDING and report it to the payment service; when the PSP completes the payment, it notifies us and we update the status.

Card schemes: organizations that process credit card operations.
Ledger

Keeps a financial record of the payment transaction. Record the debit from user and credit to the seller.

Double-entry ledger system (double-entry bookkeeping)

Wallet keeps the account balance of the merchant
API

POST /v1/payments

{buyer_info, checkout_id, credit_card_info, payments_orders}

  • checkout_id is unique
  • payment_orders {seller_account, amount, currency, payment_order_id}
  • payment_order_id is unique
  • amount is string

GET /v1/payments/{:id}

  • return the execution status
 Data Model

We need mature db. We prefer a traditional relational database with ACID transaction support over NoSQL.

1) payment event table

 

 

2) payment order table

 

Reconciliation

There is an asynchronous step, and sometimes information can be lost, so reconciliation is a practice that compares the state in the settlement file (from the PSP) against the ledger system.

To fix mismatches found during reconciliation, we usually rely on the finance team to perform manual adjustments.

 
Communication among internal services

1) synchronous communication

  • HTTP
  • works well for small-scale systems
  • low performance
  • poor failure isolation
  • tight coupling
  • hard to scale

2) asynchronous communication

single receiver. once a message is processed, it is removed

multiple receivers

 

handle failed payments

Dead letter queue: it keeps failed payment orders for later analysis.

exactly-once delivery

One of the most serious problems a payment system can have is double-charging a customer.

exactly-once:

  • it is executed at least once: sometimes the ack is delayed or the payment fails, so we retry

 

  • at the same time, it is executed at most once: we use an idempotency key to ensure the operation takes effect only once

 

consistency

To ensure data consistency, idempotency and reconciliation are the techniques we can use. Even if an external service supports idempotent APIs, reconciliation is still needed.

 

If data is replicated, replication lag could cause inconsistent data between the primary database and the replicas. There are generally two options to solve this:

  1. serve both read and writes from the primary only.
  2. ensure all replicas are always in-sync.

 

12 Digital wallet

1) requirements

  • support balance transfer operation between two digital wallets
  • support 1 million TPS
  • reliability is at least 99.99%
  • support transactions
  • support reproducibility

 

2) estimation

TPS 1 million per second
nodes

each transaction has two operations.

  • deducting
  • depositing

A relational DB node handles about 1,000 TPS.

Total: 2 million operations per second (each transaction has two operations).

100 TPS per node -> 20k nodes

1k TPS per node -> 2k nodes

10k TPS per node -> 200 nodes

 

 

 3) design

API: 

POST /v1/wallets/balance_transfer

{from_account, to_account, amount, currency, transfer_id}

 

There are 3 solutions.

  • in-memory,
  • db-based distributed transaction solution
  • event sourcing solution with reproducibility

 

1. In-memory sharding solution

 

Redis: if one node fails, the transfer may be left incomplete. The 2 updates (deduct and add) must form one atomic transaction.
Zookeeper 

storage of configuration

maintain the sharding information

 

2.1 Distributed transactions: 2PC

To guarantee that the 2 updates form one atomic transaction, we can use two-phase commit.

  1. the coordinator is the wallet service; it writes to both databases and the affected rows are locked
  2. when the application is about to commit, the coordinator asks each database to prepare the transaction
  3. if all databases reply with a yes, the coordinator asks all of them to commit the transaction; otherwise it aborts

cons:

  • not performant: locks can be held for a long time
  • single point of failure (one coordinator)

 

2.2.1 Distributed transaction: Try-Confirm/Cancel (TC/C)

These are separate local operations.

First phase (try):

  • reduce A's balance by $1
  • C: NOP (no operation); the reply is always a yes
2.2.2 Second phase: confirm

  • if both A and C replied yes in phase 1
  • add $1 to database C
  • A: NOP; the reply is always a yes
2.2.3 Second phase: cancel

If the try phase fails, add the $1 back to database A.

 

2.2.4 phase status table

If the coordinator restarts in the middle of the process, we need a history of previous operations, so we keep a phase status table.

Phase status tables are stored in the same database as the local balance tables.

 

2.2.5 Unbalanced state

Before TC/C, A + C = 1; at the end of the first phase, A + C is 0. This temporarily violates the fundamental rule of accounting that the sum should stay the same.

 

2.2.6 Out-of-order execution

C receives the cancel before the try, so there is nothing to cancel.

Solution: add an out-of-order flag to the phase status table; the try operation checks the flag first.

 

2.2.7 2PC VS TC/C

 

2.3 Distributed transaction: Saga

Execute the operations one by one; if a step fails, run compensating transactions to undo the previous steps. We can use a single coordinator to handle this process.

 

2.3.1 saga or TC/C
  • microservice architecture: Saga
  • latency sensitive: TC/C

 

3.1 event sourcing

  • Do we know account balance at any time?
  • How do we know the historical and current account balances are correct
  • How do we prove the system's logic is correct

The design philosophy is event sourcing. 4 important terms in event sourcing

  1. command (e.g. "A transfers $1 to B")
  2. event (past tense; the result is fixed, e.g. "A transferred $1 to B")
  3. state (what is changed when an event is applied)
  4. state machine (validates commands and generates events; applies events to update states)

Example:
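A minimal Python sketch of these four concepts, assuming a simple in-memory balance map as the state; account names and amounts are illustrative.

from dataclasses import dataclass

@dataclass
class TransferCommand:            # intent: "A wants to transfer $1 to B"
    from_account: str
    to_account: str
    amount: int

@dataclass
class TransferEvent:              # past tense: the validated, immutable fact
    from_account: str
    to_account: str
    amount: int

def validate(command, balances):
    """State machine step 1: turn a command into an event, or reject it."""
    if balances.get(command.from_account, 0) < command.amount:
        raise ValueError("insufficient balance")
    return TransferEvent(command.from_account, command.to_account, command.amount)

def apply_event(event, balances):
    """State machine step 2: apply the event to update the state (balances)."""
    balances[event.from_account] -= event.amount
    balances[event.to_account] = balances.get(event.to_account, 0) + event.amount
    return balances

# Reproducibility: replaying the same event log always rebuilds the same state.
balances = {"A": 1, "B": 0}
events = [validate(TransferCommand("A", "B", 1), balances)]
for e in events:
    balances = apply_event(e, balances)
print(balances)                   # {'A': 0, 'B': 1}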

 

3) reproducibility

This is the most important advantage: we can always reconstruct historical balance states by replaying the events from the very beginning.

Q: Do we know the account balance at any given time?

A: We could answer it by replaying the events from the start, up to the point where we want.

 

Q: How do we know the historical and current account balances are correct?

A: We could verify the correctness of the account balance by recalculating it from the event history.

 

Q: How do we prove the system logic is correct after a code change?

A: We can run different versions of code against the events and verify that their results are identical.

 

 Command-query responsibility segregation (CQRS)

 For clients to query the balance. In CQRS, there is one state machine responsible for the write part of the state, but there can be many read-only state machines.  The read-only state machines lag behind to some extent, but will always catch up. The architecture design is eventually consistent. 

 

Design deep dive: 

Two optimizations:

1. We can store commands and events on local disk rather than in Kafka. Appending is a sequential write operation, which is generally very fast.

2. We can cache recent commands and events. A useful technique is mmap, which writes to a local disk and caches recent content in memory.

We can use SQLite (a file-based relational database) or RocksDB (a file-based key-value store) to improve read performance.

Snapshots: we can use snapshots to accelerate reproducibility, so we don't have to replay commands from the very start. We periodically stop the state machine and save the current state into a file; this is called a snapshot. A snapshot is a giant binary file; a new state machine can load the latest snapshot and then replay only the events after it.

 

There are four types of data:

  • File-based command
  • File-based event
  • File-based state
  • State snapshot

 

Consensus.

  • No data loss
  • The relative order of data within a log file remains the same order across nodes.

We can use Raft consensus algorithm.

Push or Pull?

Pull: not real-time, and clients may overload the wallet service; we can add a reverse proxy and pull periodically.

Push:

 

Distributed transactions

We can use TC/C or Saga as the distributed transaction solution. The phase status table tracks the transaction status and is updated as the transaction progresses.

 

13 Stock exchange

1) requirements

Functional Non-functional
placing a new order availability: At least 99.99%. 
canceling an order fault tolerance and fast recovery
support limit orders latency: millisecond level
  security: prevent DDoS attacks

 

2) back-of-the envelope estimation

symbols 100
orders 1 billion / day
QPS

9:30 am ~ 4:00 pm, 6.5 hours in total.

1 billion / (6.5 * 3600) = 43000 / s

Peak QPS 5 * QPS = 215000

 

 

 

some terms.

broker most retail clients trade with an exchange via a broker
institutional client trade in large volumes using specialized trading software
limit order a limit order is a buy or sell order with a fixed price.
market data levels L1, L2, L3; L1 contains the best bid and ask prices.
candlestick chart

 

FIX: Financial Information eXchange protocol

3) design

 

trading flow
  1.  a client places an order
  2. the broker sends the order to the exchange
  3. the order enters the exchange through the client gateway
  4. order manager performs risk checks
  5. ''
  6. after passing risk checks, the order is verified against the wallet to ensure there are sufficient funds
  7. buy and sell 
  8. ''
  9. ''
  10. ''
  11. ''
  12. ''
  13. ''
  14. The executions are returned to the client. ''
  15. market data flow: traces the order executions
  16. M1. the matching engine generates a stream of executions as matches are made
  17. M2. the market data publisher constructs the candlestick charts
  18. M3. the market data is saved to specialized storage for real-time analytics
  19. ''
  20. reporting flow
  21. R1. collect all the necessary reporting fields. 
sequencer

The sequencer is the key component that makes the matching engine deterministic.

  • timeliness and fairness
  • fast recovery / replay
  • exactly-once guarantee
  • the sequence numbers must be consecutive, so that any missing numbers can be easily detected

 

 order manager

Receives orders on one end and executions on the other.

  • it sends the order for risk checks
  • it checks the order against the user's account
  • it sends orders to the sequencer
  • it receives executions for filled orders and returns them to the broker
 client gateway

 it receives orders placed by clients and routes them to the order manager.

 

 Market data flow

 market data publisher receives executions from the matching engine and builds the order books and candlestick charts from the stream of executions.

 

 reporting flow

provides trading history, tax reporting, compliance reporting, etc. The reporter is less sensitive to latency. Accuracy and compliance are key factors for the reporter.

 

API Design

1) order

  • POST /v1/order

request {symbol, side, price, orderType, quantity}

response {id, creationTime, filledQuantity, remainingQuantity}

  • Execution: GET /v1/execution?symbol={:symbol}&orderId={:orderId}&startTime={:startTime}&endTime={:endTime}

request {symbol, orderId, startTime, endTime}

response {id, orderId, symbol, side, price, orderType, quantity}

  • Order book GET/v1/marketdata/orderBook/L2?symbol={:symbol}&depth={:depth}

request {symbol, depth, startTime, endTime}

response {bids, asks}

  • Historical prices (candlestick charts): GET /v1/marketdata/candles?symbol={:symbol}&resolution={:resolution}&...

request {symbol, resolution, startTime, endTime}

response {candles, open, close, high, low}

 data models

Product, describes the attributes of a traded symbol. This data doesn't change frequently. Used for UI display. highly cacheable.

 

An order represents the inbound instruction for a buy or sell order.

 

An execution represents the outbound matched result. It is also called a fill.

Orders and executions are stored in memory, leveraging hard disk or shared memory for persistence and sharing.

 

order book: it is a list of buy and sell orders for a specific security or financial instrument.

Adding / canceling a limit order should have O(1) time complexity, so we use a doubly linked list per price level plus a map from order id to its entry (a sketch follows).
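A simplified sketch of that structure: an insertion-ordered dict per price level stands in for the doubly linked list, and a map from order id to its price level gives O(1) cancel.

from collections import defaultdict

class OrderBook:
    """Simplified one-side order book: each price level keeps FIFO insertion order
    (standing in for a doubly linked list); order_index gives O(1) cancel."""

    def __init__(self):
        self.price_levels = defaultdict(dict)   # price -> {order_id: quantity}, FIFO
        self.order_index = {}                   # order_id -> price

    def add_limit_order(self, order_id, price, quantity):
        self.price_levels[price][order_id] = quantity
        self.order_index[order_id] = price      # O(1) add

    def cancel_order(self, order_id):
        price = self.order_index.pop(order_id)  # O(1) lookup of the price level
        self.price_levels[price].pop(order_id)
        if not self.price_levels[price]:
            del self.price_levels[price]

book = OrderBook()
book.add_limit_order("o1", price=100, quantity=5)
book.add_limit_order("o2", price=100, quantity=3)
book.cancel_order("o1")
print(book.price_levels)          # one resting order ('o2') left at price 100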

performance

Latency = sum(executionTimeAlongCriticalPath)

Two ways to reduce the latency:

  1. Decrease the number of tasks on the critical path. (gateway-> order manager-> sequencer -> matching engine)
  2. shorten the time spent on each task: eliminate network hops by putting everything on the same server, and reduce disk access latency

Application loop

Use a while loop to pull orders; the thread is pinned to a fixed CPU core. The trade-off of CPU pinning is that it makes coding more complicated. mmap provides a mechanism for high-performance sharing of memory between processes.

   
Event sourcing

  • the gateway transforms FIX messages to SBE and sends each order as a NewOrderEvent
  • the order manager receives the NewOrderEvent from the event store, validates it, adds it to its internal order state, and forwards it to the matching core
  • if an order is filled, an OrderFilledEvent is sent to the event store

 

Only one sequencer

The sequencer pulls events from the ring buffer that is local to each component. 

high availability

4 nines. 99.99%. This means the exchange can only have 8.64 seconds of downtime per day. It requires almost immediate recovery if a service goes down.

  • identify single points of failure in the exchange architecture and set up redundant instances alongside the primary instance
  • detection of failure and the decision to fail over to the backup instance should be fast

hot matching engine works as the primary instance, and the warm engine receives and processes the exact same events but does not send any event out onto the event store. The problem with this hot-warm design is that it only works within the boundary of a single server. To achieve high availability, we have to extend this concept across multiple machines or even across data centers. In this setting, an entire server is either hot or warm.

fault tolerance

The system might send out false alarms, which cause unnecessary failovers.

Bugs in the code might cause the primary instance to go down.

 

We use raft leader-election algorithms.

 

Recovery Time Objective (RTO) refers to the amount of time an application can be down without causing significant damage to the business. Raft guarantees that state consensus is achieved among cluster nodes.

Matching algorithms

FIFO: first in, first out

Market data publisher optimizations

The MDP receives matched results from the matching engine and rebuilds the order book and candlestick charts based on them.

Clients need to pay extra to get deeper levels of market data, so the MDP rebuilds these data products from the matching results.

 

Distribution fairness of market data

Multicast using reliable UDP is a good solution for broadcasting updates to many participants at once.

Network security

DDoS:

  • Isolate public services and data from private services
  • use a caching layer to store data that is infrequently updated
  • Harden URLS. https://my.website.com/data/recent
  • safelist/blocklist mechanism
  • rate limiting