reading

System Design Interview

Chapter 1: Scale from zero to millions of users

Singleton web server data flow:

request domain -> domain server -> get IP -> Http request -> fetch data from database -> Http response

Database

Vertical scaling vs horizontal scaling 垂直扩展与横向扩展

Load balance

DNS 解析到 负载均衡器 load balancer 的公有 IP,由负载均衡器平均地分发请求量到真实的服务器的私有 IP 上。

Data replication 数据库备份

Cache 缓存

A cache is a temporary storage area that stores the result of expensive responses or frequently accessed data in memory so that subsequent requests are served more quickly

什么时候使用缓存?

Content Delivery Network (CDN)

A CDN is a network of geographically dispersed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, JavaScript files, etc.

The URL’s domain is provided by the CDN provider. 配备了 CDN 的静态资源的 URL,通常指向 CDN server。

Time-to-live (TTL) 设置也需要平衡,不可过短与过长

CDN fallback CDN撤退,客户端需要检测到 CND 健康状态,当 CDN 服务器挂掉之后,需要访问自己的真实服务器获取资源

主动使 CDN 上文件失效

无状态 web 层 stateless web tier

stateful 有状态的弊端:

如果是有状态的 web 服务,并且是多服务器的,那么 A用户的 session 存储在 A服务器,他的每一个接下来的请求都必须指向 A服务器, 否则他就是一个未登录状态。添加和删除服务器会很麻烦,服务器挂掉的容灾处理也麻烦。

无状态web层的优点

把 session 从 web 层移动到存储层(例如关系型数据库,Redis,NoSQL)

Data Center 数据中心

geoDNS-routed = geo-routed

geoDNS is a DNS service that allows domain names to be resolved to IP addresses based on the location of a user.

challenges of multi-data center setup:

Message Queue

decouple different components to scale independently. 解耦不同组件,以使各个组件模块独立的扩展规模

A message queue is a durable component, stored in memory, that supports asynchronous communication. It serves as a buffer and distributes asynchronous requests.

logging, metrics, automation, monitoring

metrics:

Database scaling

分片 sharding

hash function 哈希函数来决定数据被存储到哪一个片,和应该去哪一个片读取数据。例如 user_id%4 4个分片。

sharding key = partition key 决定数据怎么被分发,例如 user_id 就是 sharding-key。此key必须可以平均的分配数据

challenges:

Conclusion of chapter 1

Chapter 2: Back of the envelope estimation 信封背面估算

Latency numbers 延迟数字

Availability numbers

service level agreement (SLA). 9 越多越好

Example

Assumptions:

Estimations:

Query per second (QPS) estimate:

We will only estimate media storage here.

Chapter 3: A framework for system design interview

面试者需要展示的是:

red flags:

4 setps

Step 1 - Understand the problem and establish design scope. 需要全面理解问题,并了解设计的范围, 3-10 minutes

Answering without a thorough understanding of the requirements is a huge red flag.
Slow down. Think deeply and ask questions to clarify requirements and assumptions. This is extremely important. 澄清需求和猜想。

Step 2 - Propose high-level design and get buy-in, 提出一个高层的笼统的设计并买进, 10-15 minutes

Step 3 - Design deep dive, 深入的设计, 10-25 minutes

You shall work with the interviewer to identify and prioritize components in the architecture. 该着重剖析高优先级的组件模块了

发布新帖子

订阅朋友圈更新

Step 4 - Wrap up, 打包结束, 3-5 minutes

Conclusion

Chapter 4: Design a rate limiter 限速器

In a network system, a rate limiter is used to control the rate of traffic sent by a client or a service. In the HTTP world, a rate limiter limits the number of client requests allowed to be sent over a specified period. If the API request count exceeds the threshold(门槛) defined by the rate limiter, all the excess(额外的) calls are blocked.

benefit:

Step 1 - Understand the problem and establish design scope

需要澄清的问题:

Step 2 - Propose high-level design and get buy-in

HTTP 429 response status code indicates a user has sent too many requests.

API gateway is a fully managed service that supports:

Where to place the rate limiter:

example:

Algorithms for rate limiting

Token bucket algorithm 令牌桶算法

refer to token_bucket.py

Pros:

Cons:

Leaking bucket algorithm 漏桶算法

The leaking bucket algorithm is similar to the token bucket except that requests are processed at a fixed rate. It is usually implemented with a first-in-first-out (FIFO) queue. 请求是固定的速度,把请求放入队列中,先入先出。

example refer to leakying_bucket.py

Pros:

Cons:

Fixed window counter algorithm 固定窗口计数算法

refer to fixed_window_counter.py

Pros:

Cons:

Sliding window log algorithm 滑动窗口日志算法

As discussed previously, the fixed window counter algorithm has a major issue: it allows more requests to go through at the edges of a window. The sliding window log algorithm fixes the issue. 为了解决’固定时间窗口算法’而设计的’滑动时间窗口算法’。

refer to sliding_window_log.py

Pros:

Cons:

Sliding window counter algorithm 滑动窗口计数算法

The sliding window counter algorithm is a hybrid approach that combines the fixed window counter and sliding window log. ‘滑动窗口计数算法’是 ‘固定窗口计数’和 ‘滑动窗口日志’ 算法的混合

refer to sliding_window_counter.py

Pros

Cons

Step 3 - Design deep dive

Detailed design

Rate limiter in a distributed environment 分布式系统中如何部署限速器

两个挑战点:

Performance optimization

Monitoring

Step 4 - Wrap up

Chapter 5: Design consistent hashing 一致性hash

分布式服务器,负载均衡器如果使用 ‘取模%’ 运算,可以满足,但是有缺点:

一致性 hash 可以在增加和减少服务器节点时候,只影响部分 key 的迁移

Consistent hashing is a special kind of hashing such that when a hash table is re-sized and consistent hashing is used, only k/n keys need to be remapped on average, where k is the number of keys, and n is the number of slots.

The consistent hashing algorithm was introduced by Karger et al. at MIT

问题:

As the number of virtual nodes increases, the distribution of keys becomes more balanced. This is because the standard deviation(偏离,误差) gets smaller with more virtual nodes, leading to balanced data distribution.

Chapter 6: Design a key-value store

Understand the problem and establish design scope

example:

Single server key-value store

Distributed key-value store

CAP (Consistency 一致性, Availability 可用性, Partition Tolerance 分割容忍性) theorem 定理. CAP theorem states it is impossible for a distributed system to simultaneously provide more than two of these three guarantees

Since network failure is unavoidable, a distributed system must tolerate network partition. Thus, a CA system cannot exist in real-world applications. 网络错误不可避免,分布式系统一定要容忍网络阻隔错误,所以丢弃P 而保证 CA 的应用不存在。

CP 强调一致性,牺牲可用性,例如 银行系统

AP 强调可用性,提供读,即使读到的数据是过期的,提供写,等到掉线节点上线之后再进行同步。

System components

Data partition

challenges:

一致性hash 解决以上问题,优点:

Data replication 数据复制

Consistency

Quorum consensus 法定多数投票 can guarantee consistency for both read and write operations.

A coordinator 协调者 acts as a proxy between the client and the nodes. 协调者按照以上方法用来协调每一次读和写的有效性。

The configuration of W, R and N is a typical tradeoff between latency and consistency. R 与 W 大小的设置应该考虑到系统对延迟与一致性的折中。

Strong consistency is usually achieved by forcing a replica not to accept new reads/writes until every replica has agreed on current write. This approach is not ideal for highly available systems because it could block new operations.

Inconsistency resolution: versioning 标注版本

versioning system -> reconcile 和解 vector clock 向量时钟 [server, version]

D1[S1,1] -> D2[S1,2] -> | -> D3([S1,2],[S2,1]) server2 改了       -> |
			| -> D4([S1,2],[S3,1]) 同时 server3 也改了 -> | -> D5([S1,3],[S2,1],[S3,1]) reconciled by server1

you can tell that a version X is a sibling (i.e., a conflict exists) of Y if there is any participant in Y’s vector clock who has a counter that is less than its corresponding counter in X. For example, the following two vector clocks indicate there is a conflict: D([s0, 1], [s1, 2]) and D([s0, 2], [s1, 1]). 如果有某个数据某个版本号大于相应的另一个节点上,同时有其他版本号小于那个节点上,那么就是有冲突了。

缺点:

Handling failures

failure detection 侦测错误

gossip protocol. 检测到某个服务器下线,需要多个服务器共同确认

handling temporary failure

“sloppy 邋遢的 懒散的 quorum 法定人数” is used to improve availability.

Instead of enforcing the quorum requirement, the system chooses the first W healthy servers for writes and first R healthy servers for reads on the hash ring. Offline servers are ignored. 在环上,选择前 W 个健康的节点来处理写操作,前 R 个健康的节点来处理读操作

If a server is unavailable due to network or server failures, another server will process requests temporarily. When the down server is up, changes will be pushed back to achieve data consistency. This process is called hinted handoff. 如果一个节点挂了,其他节点临时接手处理请求,当节点恢复之后,修改的数据会推送到该节点,以维持一致性。 提示性的接力

Handling permanent failures 永久错误,那就需要同步节点数据。

implement an anti-entropy 反熵 protocol to keep replicas in sync.

system architecture

write path

  1. The write request is persisted on a commit log file.
  2. Data is saved in the memory cache.
  3. When the memory cache is full or reaches a predefined threshold, data is flushed to SSTable on disk. Note: A sorted-string table (SSTable) is a sorted list of <key, value> pairs.

Read path

After a read request is directed to a specific node, it first checks if data is in the memory cache. If the data is not in memory, it will be retrieved from the disk instead. We need an efficient way to find out which SSTable contains the key. Bloom filter is commonly used to solve this problem.

redis 4 中引入了 布隆过滤器,用于判断某个 key 是否存在。当布隆过滤器说某个值存在时,这个值可能不存在;当它说不存在时,那就肯定不存在。

bloomfilter 把某个值的存在与否映射到一个二进制的列表中,有一点像 100 瓶药水,用 7 只小白鼠验证哪一瓶药水有毒。 参考 bloom_filter.py

summary

Chapter 7: Design a unique ID generator in distributed systems

Step 1 - Understand the problem and establish design scope

Candidate: What are the characteristics of unique IDs? 
Interviewer: IDs must be unique and sortable. 唯一性和可排序性
Candidate: For each new record, does ID increment by 1? 
Interviewer: The ID increments by time but not necessarily only increments by 1. 随着时间自增,但不一定自增1
IDs created in the evening are larger than those created in the morning on the same day. 晚上的增幅可以比早上的大
Candidate: Do IDs only contain numerical values? 
Interviewer: Yes, that is correct. 只包含数字
Candidate: What is the ID length requirement? 
Interviewer: IDs should fit into 64-bit. 最长不超过64 位
Candidate: What is the scale of the system?
Interviewer: The system should be able to generate 10,000 IDs per second. 一秒可以生成 1 万个 ID

Step 2 - Propose high-level design and get buy-in

Multi-master replication

MySQL Server1: 1,3,5…… MySQL Server2: 2,4,6…… auto_increment=True, increase by k, k 是服务器数目,避免冲突

drawback 缺点:

UUID

A UUID is another easy way to obtain unique IDs. UUID is a 128-bit number used to identify information in computer systems.

几个不同版本的 UUID 之间的区别 UUID -> UUID5
  1. Version 1 (Time-based UUID):
  1. Version 2 (DCE or POSIX UUID, with Namespace Variant):
  1. Version 3 (MD5-based UUID):
  1. Version 4 (Random UUID):
  1. Version 5 (SHA-1-based UUID):
采用 UUID 作为唯一标识生成器的优缺点:

Pros:

Cons:

Ticket Server

The idea is to use a centralized auto_increment feature in a single database server (Ticket Server).

Ticket server 优缺点:

Pros:

Cons:

Twitter snowflake approach

分而治之 Divide and conquer

1bit 41bits 5bits 5bits 12bits
0 timestamp datacenter ID machine ID sequence number
reserved Milliseconds since the epoch or custom epoch 32 个 32 个 the sequence number is incremented by 1. The number is reset to 0 every millisecond.
>> > time.time()
1716538915.8404758
# timestamp 是 17 位 字符串
# 实际环境中:在ASCII码编码方案中,一个英文字符占用一个字节,一个汉字字符占用两个字节的空间;
# 在Unicode编码方案中,一个英文字符或一个汉字字符都占用两个字节的空间;
# 在UTF-8编码方案中,一个英文字符占用一个字节,一个汉字字符占用三个字节的空间

Twitter snowflake default epoch 起始时间点 1288834974657, equivalent to Nov 04, 2010, 01:42:54 UTC.

Step 3 - Design deep dive

timestamp

2 ^ 41 - 1 = 2199023255551 13位字符串 最小单位是毫秒。 41 位标识的数值

2199023255551/1000/60/60/24/365 = 69 年

我们可以设置今天为时间戳的起点,69年后再次更换时间戳起点。 having a custom epoch time close to today’s date delays the overflow time. After 69 years, we will need a new epoch time or adopt other techniques to migrate IDs.

sequence number

2 ^ 12 = 4096 combinations. This field is 0 unless more than one ID is generated in a millisecond on the same server. In theory, a machine can support a maximum of 4096 new IDs per millisecond.

Step 4 wrap up

Detail of NTP

Process:

NTP 只能同步不同服务器之间的时间,不能保证对方的时间就是对的。

Chapter 8: Design a URL shortener 短链接生成器

Step 1 Understand the problem and establish design scope 理解问题并建立设计范围

Candidate: Can you give an example of how a URL shortener work? 
Interviewer: Assume URL https://www.systeminterview.com/q=chatsystem&c=loggedin&v=v3&l =long is the original URL. 
Your service creates an alias with shorter length: https://tinyurl.com/ y7keocwj. If you click the alias, 
it redirects you to the original URL.
Candidate: What is the traffic volume? 并发量 吞吐量
Interviewer: 100 million URLs are generated per day.  写:生成1千万一天 1157/s
Candidate: How long is the shortened URL? 
Interviewer: As short as possible. 尽可能的短
Candidate: What characters are allowed in the shortened URL? 
Interviewer: Shortened URL can be a combination of numbers (0-9) and characters (a-z, A-Z). 数字与字母的组合,分大小写
Candidate: Can shortened URLs be deleted or updated? 
Interviewer: For simplicity, let us assume shortened URLs cannot be deleted or updated. URL无法删除和更新

Back of the envelope estimation

这些预估数据很有参考意义:1是速度与频率,2是存储空间

Step 2 - Propose high-level design and get buy-in

API endpoints

1.URL shortening. To create a new short URL, a client sends a POST request, which contains one parameter: the original long URL. The API looks like this: POST api/v1/data/shorten

2.URL redirecting. To redirect a short URL to the corresponding long URL, a client sends a GET request. The API looks like this: GET api/v1/shortUrl

URL redirecting

Once the server receives a tinyurl request, it changes the short URL to the long URL with 301 redirect.

一个是永久重定向,一个是临时重定向

hash tables {shortURL, longURL}, hash function

Step 3 - Design deep dive

Data model

hash table? -> memory is limited -> relational database

hash table 存在内存中会很快消耗掉内存容量,把冷的 key 存储在磁盘上,用关系数据库。

短链接可以找到长链接,长链接也可以找到短链接

Hash function

hash value length

[0-9,a-z,A-Z] 10+26+26, find the smallest n that 62^n >= 365 billion

n=7 62^7=3,521,614,606,208 -> 7位字符串满足

Hash + collision 碰撞 冲突 resolution
Hash function Hash value(Hexadecimal)  
CRC32 5cbds0ds 8位
MD5   32位
SHA-1   40位

To resolve hash collisions, we can recursively append a new predefined string until no more collision is discovered.

如果 hash 后的值在数据库中存在,那么就递归对当前 hash value 结果加上固定的后缀,然后计算组合的值的 hash value,直到没有冲突为止。

虽然解决了冲突,但是查询数据库中是否存在短链接变得更加耗资源。

使用布隆过滤器 bloom filter 提高效率,第六章 chapter-6-design-a-key-value-store 介绍过。

Base 62 conversion 换算 62进制

The mappings are:

Hash collision vs base 62

URL shortening deep dive

  1. longURL is the input.
  2. The system checks if the longURL is in the database.
  3. If it is, it means the longURL was converted to shortURL before. In this case, fetch the shortURL from the database and return it to the client.
  4. If not, the longURL is new. A new unique ID (primary key) Is generated by the unique ID generator.
  5. Convert the ID to shortURL with base 62 conversion.
  6. Create a new database row with the ID, shortURL, and longURL.

分布式的唯一ID生成器是必须的,涉及到第七章内容。

URL redirecting deep dive

读应该比写更加频繁,也就是通过短链接跳转到长链接的使用场景更频繁。所以经常访问的短链接 {shortURL: longURL} 存在内存中比较有效率。

  1. A user clicks a short URL link: https://tinyurl.com/zn9edcu
  2. The load balancer forwards the request to web servers.
  3. If a shortURL is already in the cache, return the longURL directly.
  4. If a shortURL is not in the cache, fetch the longURL from the database. If it is not in the database, it is likely a user entered an invalid shortURL.
  5. The longURL is returned to the user.

Step 4 - Wrap up

Chapter 9: Design a web crawler 爬虫

A web crawler is known as a robot or spider. It is widely used by search engines to discover new or updated content on the web. Content can be a web page, an image, a video, a PDF file, etc.

use case:

Step 1 - Understand the problem and establish design scope

  1. Given a set of URLs, download all the web pages addressed by the URLs. 给定一组 URL,下载页面内容
  2. Extract URLs from these web pages. 从给定的 URL 网页中提取新的 URL
  3. Add new URLs to the list of URLs to be downloaded. Repeat these 3 steps.
Candidate: What is the main purpose of the crawler? Is it used for search engine indexing, data mining, or something else? 
Interviewer: Search engine indexing. 目的
Candidate: How many web pages does the web crawler collect per month?
Interviewer: 1 billion pages. 一个月10亿页 1,000,000,000/M  385 pages/s
Candidate: What content types are included? HTML only or other content types such as PDFs and images as well?
Interviewer: HTML only. 页面内容
Candidate: Shall we consider newly added or edited web pages? 
Interviewer: Yes, we should consider the newly added or edited web pages. 追踪页面更新内容
Candidate: Do we need to store HTML pages crawled from the web?
Interviewer: Yes, up to 5 years 存储 5 年, 1页 100KB, 5*24*1000000000*100/1000m/1000g/1000t/1000p  = 12PB 存储空间
Candidate: How do we handle web pages with duplicate content? 
Interviewer: Pages with duplicate content should be ignored. 去重

Back of the envelope estimation

Step 2 - Propose high-level design and get buy-in

seed URLs

domain name

The general strategy is to divide the entire URL space into smaller ones.

URL frontier 边界 前沿

Most modern web crawlers split the crawl state into two: to be downloaded and already downloaded.

The component that stores URLs to be downloaded is called the URL Frontier.

You can refer to this as a First-in-First-out (FIFO) queue.

HTML Downloader

HTML 页面下载器

DNS Resolver DNS 解析器

DNS -> IP

Content Parser 内容分析器

After a web page is downloaded, it must be parsed and validated because malformed web pages could provoke problems and waste storage space. Implementing a content parser in a crawl server will slow down the crawling process. Thus, the content parser is a separate component.

Content Seen?

Online research [6] reveals that 29% of the web pages are duplicated contents, which may cause the same content to be stored multiple times.

Content Storage

depends on factors such as data type, data size, access frequency, life span, etc. Both disk and memory are used.

URL Extractor 新 URL 提取器

Relative paths are converted to absolute URLs

URL Filter 过滤器

排除:

URL Seen? 是否已经下载过,或者还在 frontier 列表中

“URL Seen?” is a data structure that keeps track of URLs that are visited before or already in the Frontier.

avoid adding the same URL multiple times as this can increase server load and cause potential infinite loops.

Bloom filter and hash table are common techniques to implement the “URL Seen?” component.

URL Storage

URL Storage stores already visited URLs.

Step 3 - Design deep dive

Depth-first search (DFS) vs Breadth-first search (BFS) 深度优先还是广度优先

DFS is usually not a good choice because the depth of DFS can be very deep.

BFS is commonly used by web crawlers and is implemented by a first-in-first-out (FIFO) queue.

In a FIFO queue, URLs are dequeued in the order they are enqueued. However, this implementation has two problems:

URL frontier

Politeness.

sending too many requests is considered as impolite or denial-of-service (DOS) attack

Priority 优先级

base on:

先排优先级,队列选择器优先选择高权重的队列里面的URL,并传给URL域名路由器,再排域名归属。

Freshness 刷新

Recrawl all the URLs is time-consuming 耗时 and resource intensive 资源密集型的. Few strategies to optimize freshness are listed as follows:

Storage for URL Frontier

几千万条 URL,存内存里面不现实,一是断电可能丢,二是不好规模扩展。全部存磁盘又回很慢。 两者结合获取各取所长,各避所短。

HTML Downloader

Robots.txt

called Robots Exclusion Protocol, is a standard used by websites to communicate with crawlers. It specifies what pages crawlers are allowed to download.

To avoid repeat downloads of robots.txt file, we cache the results of the file. 存储网页的hash值,避免逐个字节比对。

Performance optimization
1. Distributed crawl

To achieve high performance, crawl jobs are distributed into multiple servers, and each server runs multiple threads. The URL space is partitioned into smaller pieces; so, each downloader is responsible for a subset of the URLs URLs 也要分区,多个服务器上的爬虫负责抓取各自的分区的 URLs。避免重复。

2. Cache DNS Resolver

缓存DNS解析的结果,减少网络请求耗时。但是有些域名会更新IP地址,所以需要权衡缓存的过期时间,或者定期更新。

DNS requests might take time due to the synchronous nature of many DNS interfaces. DNS解析是同步顺序进行的,不可以并发。多个线程同时请求,后面的线程需要等待前一个线程取到DNS结果之后才可以。

3. Locality

爬虫所在服务器需要在地理上靠近被爬的网站服务器所在的位置。降低网络耗时。

4. Short timeout

有些网站响应很慢,或者没有响应。需要设置等待的过期时间。

Robustness 健壮性
Extensibility 可扩展性

在架构的 “content seen?” 后面可以添加以下两个模块:

Detect and avoid problematic content

This section discusses the detection and prevention of redundant 冗余的, meaningless 无意义的, or harmful 有害的 content.

  1. Redundant content As discussed previously, nearly 30% of the web pages are duplicates. Hashes or checksums help to detect duplication

  2. Spider traps A spider trap is a web page that causes a crawler in an infinite loop. www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/… Such spider traps can be avoided by setting a maximal length for URLs. 限定URL长度 However, no one-size-fits-all solution exists to detect spider traps. Websites containing spider traps are easy to identify due to an unusually large number of web pages discovered on such websites. It is hard to develop automatic algorithms to avoid spider traps; however, a user can manually verify and identify a spider trap, and either exclude those websites from the crawler or apply some customized URL filters. 或许可以让 AI 来进行特征学习,以此来减少人工干预。
  3. Data noise Some of the contents have little or no value, such as advertisements 广告, code snippets 代码片段, spam URLs 钓鱼链接. Those contents are not useful for crawlers and should be excluded if possible. 谷歌广告,百度广告链接,代码片段或许有特征,钓鱼链接该怎么识别?

Step 4 - Wrap up

Chapter 10: Design a notification system

include:

Step 1 - Understand the problem and establish design scope

Candidate: What types of notifications does the system support? 
Interviewer: Push notification, SMS message, and email. 手机通知,短信,邮件
Candidate: Is it a real-time system?
Interviewer: Let us say it is a soft real-time system. We want a user to receive notifications as soon as possible. 
However, if the system is under a high workload, a slight delay is acceptable. 尽可能实时推送
Candidate: What are the supported devices?
Interviewer: iOS devices, android devices, and laptop/desktop. 支持的设备包括移动端和PC端
Candidate: What triggers notifications?
Interviewer: Notifications can be triggered by client applications. They can also be scheduled on the server-side. 
客户端触发,或者服务端定时推送。
Candidate: Will users be able to opt-out?
Interviewer: Yes, users who choose to opt-out will no longer receive notifications. 用户可以取消接收推送
Candidate: How many notifications are sent out each day? 
Interviewer: 10 million mobile push notifications, 1 million SMS messages, and 5 million emails. 
一天 1千万条手机推送,1百万条短信推送,5百万条邮件推送。 16000000/24/60/60 = 185 QPS
是否需要存储消息?存储几天? 

Step 2 - Propose high-level design and get buy-in

Different types of notifications

iOS push notification

Provider -> APNs -> iOS

Android push notification

Provider -> FCM -> Android

SMS message

Provider -> SMS service -> SMS

Email

Provider -> Email service -> Email

Contact info gathering flow

Notification sending/receiving flow

challenges
High-level design (improved)

Notification servers: They provide the following functionalities:

Cache: User info, device info, notification templates are cached. 缓存用户信息,设备信息,推送消息模版

DB: It stores data about user, notification, settings, etc.

Message queues: They remove dependencies between components. Message queues serve as buffers when high volumes of notifications are to be sent out. Each notification type is assigned with a distinct message queue so an outage in one third-party service will not affect other notification types. 消息的缓冲区,一种第三方服务挂掉,不影响其他类型的推送。

Workers: Workers are a list of servers that pull notification events from message queues and send them to the corresponding third-party services.

Step 3 - Design deep dive

Reliability 可靠性

How to prevent data loss?

Notifications can usually be delayed or re-ordered, but never lost. 可以重新排序,延迟发送,但是不能丢,保证最终送达。

To satisfy this requirement, the notification system persists notification data in a database and implements a retry mechanism. 把消息存储在磁盘上,避免在内存中丢失,或者第三方服务挂掉重启之后的重试发送。

–> notification log. workers 在发送时候同步存储消息内容到 notification log 里面

Will recipients receive a notification exactly once?

事实上需要去重复。

dedupe logic:

When a notification event first arrives, we check if it is seen before by checking the event ID. If it is seen before, it is discarded. Otherwise, we will send out the notification.

Additional components and considerations

template reusing, notification settings, event tracking, system monitoring, rate limiting,

Notification template
Notification setting

give users fine-grained control over notification settings. 容许用户取消订阅消息。

Rate limiting 限速

给每个用户设定推送消息的最多条数。

This is important because receivers could turn off notifications completely if we send too often.

Retry mechanism

When a third-party service fails to send a notification, the notification will be added to the message queue for retrying. If the problem persists, an alert will be sent out to developers. 错误重试,尤其是第三方服务不可靠时候。

Monitor queued notifications

A key metric to monitor is the total number of queued notifications. If the number is large, the notification events are not processed fast enough by workers. To avoid delay in the notification delivery, more workers are needed. 监控消息队列里面的消息的条数,如果太多会导致延迟,那就考虑多加几个workers

Events tracking

Notification metrics, such as open rate, click rate, and engagement are important in understanding customer behaviors. 用于分析用户行为的数据,点击率,打开率,活动参与率。

为 Analytics service 分析服务收集数据。

updated design

Chapter 11: Design a news feed system 新鲜事订阅系统

According to the Facebook help page, “News feed is the constantly updating list of stories in the middle of your home page. News Feed includes status updates, photos, videos, links, app activity, and likes from people, pages, and groups that you follow on Facebook” 主页不断地更新的新鲜事

Step 1 - Understand the problem and establish design scope

Candidate: Is this a mobile app? Or a web app? Or both? 
Interviewer: Both 即是手机应用又是web应用
Candidate: What are the important features?
Interview: A user can publish a post and see her friends’ posts on the news feed page. 发布帖子,订阅帖子
Candidate: Is the news feed sorted by reverse chronological order or any particular order such as topic scores? 
For instance, posts from your close friends have higher scores.
Interviewer: To keep things simple, let us assume the feed is sorted by reverse chronological order. 时间倒序排列
Candidate: How many friends can a user have? 
Interviewer: 5000 一个人可以订阅5000个人的状态更新
Candidate: What is the traffic volume? 
Interviewer: 10 million DAU daily active user 日活 1千万用户 10,000,000/24/60/60 = 115 QPS
Candidate: Can feed contain images, videos, or just text? 
Interviewer: It can contain media files, including both images and videos. 
假如平均每个用户产生文字视频和图片总和大小为10M,每天产生 10,000,000 * 10MB = 10TB 每天。 5年就是 5*365*10 = 18PB

Step 2 - Propose high-level design and get buy-in

two flows:

Newsfeed APIs

posting a status, retrieving news feed, adding friends, 添加好友

POST /v1/me/feed
Params:
- content: content is the text of the post.
- auth_token: it is used to authenticate API requests.

GET /v1/me/feed
Params:
- auth_token: it is used to authenticate API requests.

Feed publishing

Newsfeed building

Step 3 - Design deep dive

Feed publishing deep dive

Web servers

Besides communicating with clients, web servers enforce authentication and rate-limiting. Only users signed in with valid auth_token are allowed to make posts. The system limits the number of posts a user can make within a certain period, vital to prevent spam and abusive content. 添加认证和限速功能。阻止垃圾和滥用内容。

Fanout service

Fanout is the process of delivering a post to all friends. Two types of fanout models are:

Fanout on write.

With this approach, news feed is pre-computed during write time. A new post is delivered to friends’ cache immediately after it is published.

Pros:

Cons:

Fanout on read.

The news feed is generated during read time. This is an on-demand model. Recent posts are pulled when a user loads her home page.

Pros:

Cons:

choice

We adopt a hybrid approach to get benefits of both approaches and avoid pitfalls in them.

The fanout service works as follows:
  1. Fetch friend IDs from the graph database. Graph databases are suited for managing friend relationship and friend recommendations. Interested readers wishing to learn more about this concept should refer to the reference material 图数据库(Graph Database)是一种使用图结构进行语义查询的数据库。它使用节点(Node)、边(Edge)和属性(Property)来表示和存储数据。 图数据库的关键概念是图,它能够高效地表示和查询实体及其之间的关系。

  2. Get friends info from the user cache. The system then filters out friends based on user settings. 获取可以推送的朋友列表。
    • For example, if you mute someone, her posts will not show up on your news feed even though you are still friends. 被订阅者屏蔽的用户发布的内容。
    • Another reason why posts may not show is that a user could selectively share information with specific friends or hide it from other people. 发布者屏蔽的订阅者
  3. Send friends list and new post ID to the message queue.
  4. Fanout workers fetch data from the message queue and store news feed data in the news feed cache. 获取ID和消息数据,生成推送消息内容,并且存储到缓存中。
    • You can think of the news feed cache as a <post_id, user_id> mapping table. Whenever a new post is made, it will be appended to the news feed table
    • The memory consumption can become very large if we store the entire user and post objects in the cache. Thus, only IDs are stored. 节省内存消耗,存储消息ID即可
    • The chance of a user scrolling through thousands of posts in news feed is slim. Most users are only interested in the latest content 用户查看所有消息的概率低,只对最新的推送感兴趣,有必要设置过期时间,对过时的并且未被用户查看的内容进行清理。
  5. Store <post_id, user_id > in news feed cache.

Newsfeed retrieval deep dive

media content (images, videos, etc.) are stored in CDN for fast retrieval.

  1. News feed service gets a list post IDs from the news feed cache. 从订阅系统缓存中获取帖子ID

  2. A user’s news feed is more than just a list of feed IDs. It contains username, profile picture, post content, post image, etc. Thus, the news feed service fetches the complete user and post objects from caches (user cache and post cache) to construct the fully hydrated news feed. 除了帖子ID,还要获取用户名,图片,内容。

  3. The fully hydrated news feed is returned in JSON format back to the client for rendering.

Cache architecture

       
Nees Feed news feed    
Content hot cache normal  
Social Graph 关系图 follower following  
Action liked replied others
Counters like counter reply counter other counters

Step 4 - Wrap up

其他高视角可以讨论的内容:

Scaling the database:

Other talking points:

Chapter 12: Design a chat system

Step 1 - Understand the problem and establish design scope

Candidate: What kind of chat app shall we design? 1 on 1 or group based?
Interviewer: It should support both 1 on 1 and group chat. 两种都要支持,多客户端支持。
Candidate: Is this a mobile app? Or a web app? Or both? 
Interviewer: Both. app与网页版都要
Candidate: What is the scale of this app? A startup app or massive scale?
Interviewer: It should support 50 million daily active users (DAU). 50,000,000 日活用户,50000000/24/60/60=578 QPS
Candidate: For group chat, what is the group member limit? 
Interviewer: A maximum of 100 people 群聊最大规模100人
Candidate: What features are important for the chat app? Can it support attachment?
Interviewer: 1 on 1 chat, group chat, online indicator. The system only supports text messages. 在线指示,只支持文字消息。
Candidate: Is there a message size limit?
Interviewer: Yes, text length should be less than 100,000 characters long. 一个字符1Byte 100KB
Candidate: Is end-to-end encryption required?
Interviewer: Not required for now but we will discuss that if time allows. 端到端加密
Candidate: How long shall we store the chat history? 
Interviewer: Forever. 算一下50年多少存储空间, 一个用户一天发10条消息,1MB*50,000,000*365*50 = 18.25 PB*50 = 912 PB

Step 2 - Propose high-level design and get buy-in

In a chat system, clients can be either mobile applications or web applications. Clients do not communicate directly with each other. Instead, each client connects to a chat service, which supports all the features mentioned above.

functions:

implement:

Polling 间歇性客户端请求拉取消息。

polling is a technique that the client periodically asks the server if there are messages available.

Depending on polling frequency, polling could be costly. It could consume precious server resources to answer a question that offers no as an answer most of the time. 比较耗费资源,因为一直问,而可能对方无答复。

Long polling

In long polling, a client holds the connection open until there are actually new messages available or a timeout threshold has been reached. 添加了等待,和等待超时。避免频繁发起HTTP请求,也会消耗HTTP连接资源。

Once the client receives new messages, it immediately sends another request to the server, restarting the process. Long polling has a few drawbacks:

WebSocket

WebSocket connection is initiated by the client. It is bi-directional and persistent. 双向的,持久的。

It starts its life as a HTTP connection and could be “upgraded” via some well-defined handshake to a WebSocket connection.

Through this persistent connection, a server could send updates to a client.

WebSocket connections generally work even if a firewall is in place. This is because they use port 80 or 443 which are also used by HTTP/HTTPS connections.

High-level design

Stateless Services
Stateful Service

In this service, a client normally does not switch to another chat server as long as the server is still available. 因为是长连接,连接断掉之前,客户端不切换服务器。

The service discovery coordinates closely with the chat service to avoid server overloading.

Third-party integration

push notification, It is a way to inform users when new messages have arrived, even when the app is not running.

Scalability

In our scenario, at 1M concurrent users, assuming each user connection needs 10K of memory on the server (this is a very rough figure and very dependent on the language choice), it only needs about 10GB of memory to hold all the connections on one box.

从单台服务器扩大规模到多台服务器。

Storage

An important decision we must make is to decide on the right type of database to use: relational databases or NoSQL databases? 数据库类型的选择,关系型数据库还是非关系型数据库?

The first is generic data, such as user profile, setting, user friends list. These data are stored in robust and reliable relational databases. Replication and sharding are common techniques to satisfy availability and scalability requirements. 普通用户数据使用关系型数据库,数据库备份,分片。

The second is unique to chat systems: chat history data. It is important to understand the read/write pattern. 聊天历史数据是海量的,需要单独考虑。

选择 key-value stores, reasons:

Data models

Message table for one-on-one chat
message    
message_id bigint primary key
message_from bigint  
message_to bigint  
content text  
created_at timestamp  
Message table for group chat
group_message    
channel_id bigint combined primary key
message_id bigint combined primary key
user_id bigint  
content text  
created_at timestamp  
Message ID

Message_id carries the responsibility of ensuring the order of messages. To ascertain the order of messages, message_id must satisfy the following two requirements: 消息ID需要保证消息的顺序,ID的生成需要考虑:

uuid4 虽然可以保证唯一性,但是不可以排序。之前章节的生成唯一ID(Twitter snowflake approach)可以参考,服务器ID+时间戳+自增ID。

Local means IDs are only unique within a group. The reason why local IDs work is that maintaining message sequence within one-on-one channel or a group channel is sufficient. This approach is easier to implement in comparison to the global ID implementation. ID本地唯一性,用于一对一和群聊够了。

Step 3 - Design deep dive

Service discovery

The primary role of service discovery is to recommend the best chat server for a client based on the criteria like geographical location, server capacity, etc

Apache Zookeeper

  1. User A tries to log in to the app.
  2. The load balancer sends the login request to API servers.
  3. After the backend authenticates the user, service discovery finds the best chat server for User A. In this example, server 2 is chosen and the server info is returned back to User A.
  4. User A connects to chat server 2 through WebSocket.

Message flows

1 on 1 chat flow

  1. User A sends a chat message to Chat server 1.
  2. Chat server 1 obtains a message ID from the ID generator.
  3. Chat server 1 sends the message to the message sync queue.
  4. The message is stored in a key-value store.
  5. online or offline:
    • If User B is online, the message is forwarded to Chat server 2 where User B is connected.
    • If User B is offline, a push notification is sent from push notification (PN) servers.
  6. Chat server 2 forwards the message to User B. There is a persistent WebSocket connection between User B and Chat server 2.
Message synchronization across multiple devices

Each device maintains a variable called cur_max_message_id, which keeps track of the latest message ID on the device. Messages that satisfy the following two conditions are considered as news messages: 每个设备本地单独维护一个当前最大消息ID

Small group chat flow

First, the message from User A is copied to each group member’s message sync queue: one for User B and the second for User C. You can think of the message sync queue as an inbox for a recipient. This design choice is good for small group chat because: 群聊中的消息同步队列,相当于每个用户的收件箱。

WeChat uses a similar approach, and it limits a group to 500 members. However, for groups with a lot of users, storing a message copy for each member is not acceptable.

On the recipient side, a recipient can receive messages from multiple users. Each recipient has an inbox (message sync queue) which contains messages from different senders.

Online presence

In the high-level design, presence servers are responsible for managing online status and communicating with clients through WebSocket.

a few flows that will trigger online status change:

user login

After a WebSocket connection is built between the client and the real-time service, user A’s online status and last_active_at timestamp are saved in the KV store. Presence indicator shows the user is online after she logs in.

user logout

The online status is changed to offline in the KV store. The presence indicator shows a user is offline.

User disconnection

A naive way to handle user disconnection is to mark the user as offline and change the status to online when the connection re-establishes.

However, this approach has a major flaw. It is common for users to disconnect and reconnect to the internet frequently in a short time.

Updating online status on every disconnect/reconnect would make the presence indicator change too often, resulting in poor user experience.

We introduce a heartbeat mechanism to solve this problem. Periodically, an online client sends a heartbeat event to presence servers. If presence servers receive a heartbeat event within a certain time, say x seconds from the client, a user is considered as online. Otherwise, it is offline. 引入心跳机制,客户端定期向’在线状态服务’发送心跳,如果’在线状态服务’发现某个用户超出 n 秒没有心跳上来,那么就认定这个用户离线了。

Online status fanout

How do user A’s friends know about the status changes? 用户如何直到其他用户的是否在线的状态?

Presence servers use a publish-subscribe model, in which each friend pair maintains a channel. When User A’s online status changes, it publishes the event to three channels, channel A-B, A-C, and A-D. Those three channels are subscribed by User B, C, and D, respectively. 发布订阅模式来通知用户状态给到其他朋友。 The communication between clients and servers is through real-time WebSocket. 因为用户聊天是websocket长连接,所以用户状态变更消息也与普通聊天消息一样,走websocket。

WeChat uses a similar approach because its user group is capped to 500. For larger groups, informing all members about online status is expensive and time-consuming. 微信没有查看其他用户是否在线的状态的功能吧???群聊也看不出。

Assume a group has 100,000 members. Each status change will generate 100,000 events. To solve the performance bottleneck, a possible solution is to fetch online status only when a user enters a group or manually refreshes the friend list. 群聊朋友多的情况下,采用 pull 的模式取代 push 模式。

Step 4 - Wrap up

additional talking points:

Chapter 13 Design a search autocomplete system 搜索自动提示填充

typeahead, search-as-you-type, incremental search

design top k, design top k most searched

Step 1 understand the problem and establish design scope

Candidate: Is the matching only supported at the beginning of a search query or in the middle as well?
Interviewer: Only at the beginning of a search query. 搜索开始的时候提示
Candidate: How many autocomplete suggestions should the system return?
Interviewer: 5 5条候选项
Candidate: How does the system know which 5 suggestions to return?
Interviewer: This is determined by popularity, decided by the historical query frequency. 根据热度,历史查询频率
Candidate: Does the system support spell check? 
Interviewer: No, spell check or autocorrect is not supported. 不用检查语法错误
Candidate: Are search queries in English? 
Interviewer: Yes. If time allows at the end, we can discuss multi- language support. 英语,多语言支持
Candidate: Do we allow capitalization and special characters? 
Interviewer: No, we assume all search queries have lowercase alphabetic characters. 全部小写字母。
Candidate: How many users use the product? Interviewer: 10 million DAU. 一千万日活用户 10,000,000/24/60/60 = 115 每秒用户数

requirements:

Back of the envelope estimation

Step 2 - Propose high-level design and get buy-in

At the high-level, the system is broken down into two:

Data gathering service

{“query string”:”frequency”}

空格间隔单词,所以会避免掉无意义的字母组合。

Query service

加入查询频率表如下所示:

Query Frequency
twitter 35
twitch 29
twilight 25
twin peak 21
twitch prime 18
twitter search 14
twillo 10
twin peak sf 8

select * from frequency_table where query like tw% order by frequency desc limit 5;

This is an acceptable solution when the data set is small. When it is large, accessing the database becomes a bottleneck.

Step 3 - Design deep dive

The high-level design is not optimal 不是理想的。

Trie (pronounced “try”) data structure 介绍 Trie 数据结构

Relational databases are used for storage in the high-level design. However, fetching the top 5 search queries from a relational database is inefficient. The data structure trie (prefix tree) is used to overcome the problem. 前缀树

The name comes from the word retrieval, which indicates it is designed for string retrieval operations.

consists of:

To support sorting by frequency, frequency info needs to be included in nodes. 支持频率,需要把频率添加到节点中

define some terms:

Steps to get top k most searched queries are listed below:

  1. Find the prefix. Time complexity: O(p). 找到前缀
  2. Traverse the subtree from the prefix node to get all valid children. A child is valid if it can form a valid query string. Time complexity: O(c) 从前缀节点往下查找所有子节点,如果子节点是合法单词就作为备选项。
  3. Sort the children and get top k. Time complexity: O(c*logc) 排序所有合法子节点,获取前五个高频的。

时间复杂度 = O(p) + O(c) + O(clogc)

二叉树的高度

1个节点, 高度为 1 log2(1)+1
2-3个节点,高度为 2 log2(3)+1
4-7个节点,高度为 3 log2(7)+1

对数换底公式

The above algorithm is straightforward. However, it is too slow because we need to traverse the entire trie to get top k results in the worst-case scenario. Below are two optimizations: 还是太慢,在最坏的情况下,需要遍历所有树的节点

  1. Limit the max length of a prefix 限定最大前缀
  2. Cache top search queries at each node 缓存每个节点查询结果
Limit the max length of a prefix

Users rarely type a long search query into the search box. Thus, it is safe to say p is a small integer number, say 50. If we limit the length of a prefix, the time complexity for “Find the prefix” can be reduced from O(p) to O(small constant), aka O(1). 用户搜索时候不会使用太长的句子,限制前缀长度可行。

Cache top search queries at each node

To avoid traversing the whole trie, we store top k most frequently used queries at each node.

Since 5 to 10 autocomplete suggestions are enough for users, k is a relatively small number.

we significantly reduce the time complexity to retrieve the top 5 queries. However, this design requires a lot of space to store top queries at every node. Trading space for time is well worth it as fast response time is very important. 空间换时间,权衡取舍。

  1. Find the prefix node. Time complexity: O(1)
  2. Return top k. Since top k queries are cached, the time complexity for this step is O(1).

Data gathering service 数据收集服务

In our previous design, whenever a user types a search query, data is updated in real-time. This approach is not practical for the following two reasons: 实时更新搜索历史不可行。

To design a scalable data gathering service, we examine where data comes from and how data is used. 需要衡量数据来源,和数据使用场景

data used to build the trie is usually from analytics or logging services. 需要从分析和日志服务中构建热词树

Analytics Logs. 分析日志

It stores raw data about search queries. Logs are append-only and are not indexed.

query time
tree 2024-06-04 14:01:01
try 2024-06-04 14:01:02
Aggregators. 聚合器

The size of analytics logs is usually very large, and data is not in the right format. We need to aggregate data so it can be easily processed by our system. 日志数据需要聚合处理,来便于分析

Aggregated Data. 聚合后的数据

aggregated weekly data.

query time frequency
tree 2019-10-1 12000
tree 2019-10-8 15000
Workers.

Workers are a set of servers that perform asynchronous jobs at regular intervals. They build the trie data structure and store it in Trie DB.

Trie Cache.

Trie Cache is a distributed cache system that keeps trie in memory for fast read. It takes a weekly snapshot of the DB. 分布式缓存存储,每周对Trie树数据库一个快照进行备份,存储在缓存中。

Trie DB.

Trie DB is the persistent storage. Two options are available to store the data:

  1. Document store: Since a new trie is built weekly, we can periodically take a snapshot of it, serialize it, and store the serialized data in the database. Document stores like MongoDB are good fits for serialized data. 文档存储,方便于快照,序列化
  2. Key-value store: A trie can be represented in a hash table form by applying the following logic: 键值对存储
    • Every prefix in the trie is mapped to a key in a hash table. 前缀映射到 key
    • Data on each trie node is mapped to a value in a hash table. Trie树上的 data 存储在 value 中。指的是可能的关键词以及他们的频率。

Query service 查询服务

  1. A search query is sent to the load balancer.
  2. The load balancer routes the request to API servers.
  3. API servers get trie data from Trie Cache and construct autocomplete suggestions for the client.
  4. In case the data is not in Trie Cache, we replenish 再填满 data back to the cache. This way, all subsequent requests for the same prefix are returned from the cache. A cache miss can happen when a cache server is out of memory or offline. 内存耗光和缓存挂掉会导致缓存未命中,那就需要去数据库中取数据

加快查询速度的方法:

Trie operations

Create

Trie is created by workers using aggregated data. The source of data is from Analytics Log/DB.

Update

There are two ways to update the trie.

Delete

We have to remove hateful, violent, sexually explicit, or dangerous autocomplete suggestions. We add a filter layer 增加过滤器删除不符合道德的词,

Scale the storage

Since English is the only supported language, a naive way to shard is based on the first character. Here are some examples. 数据库切片分区,基于首字母,但是单词中首字母分布不是均匀的。 但是也许 t开头的单词比 o 开头的单词多很多,关键词的也会相应的多很多。会导致服务器负载不均衡。

最多可以分 26 个服务器来存储Trie数据,因为有 26个字母,继续扩大规模的话就要根据前两个字母的组合来切片分区,以此类推。

为了解决以上字母分布不均匀的问题:

The shard map manager maintains a lookup database for identifying where rows should be stored. For example, if there are a similar number of historical queries for ‘s’ and for ‘u’, ‘v’, ‘w’, ‘x’, ‘y’ and ‘z’ combined, we can maintain two shards: one for ‘s’ and one for ‘u’ to ‘z’.

Step 4 Wrap up

Chapter 14: Design Youtube

video sharing platform, Netflix, Hulu

Step 1 - Understand the problem and establish design scope

Candidate: What features are important?
Interviewer: Ability to upload a video and watch a video. 上传和观看视频
Candidate: What clients do we need to support? 
Interviewer: Mobile apps, web browsers, and smart TV. 客户端包括手机app,浏览器,智能电视
Candidate: How many daily active users do we have? 
Interviewer: 5 million 日活 500万
Candidate: What is the average daily time spent on the product? 
Interviewer: 30 minutes. 每个用户每天平均花费 30 分钟在油管上
Candidate: Do we need to support international users? 
Interviewer: Yes, a large percentage of users are international users. 多语言支持
Candidate: What are the supported video resolutions? 
Interviewer: The system accepts most of the video resolutions and formats. 支持多种格式的视频上传,要可以转码
Candidate: Is encryption required? 
Interviewer: Yes 加密?对什么加密?
Candidate: Any file size requirement for videos?
Interviewer: Our platform focuses on small and medium-sized videos. The maximum allowed video size is 1GB. 最大视频大小 1GB
Candidate: Can we leverage some of the existing cloud infrastructures provided by Amazon, Google, or Microsoft? 
Interviewer: That is a great question. Building everything from scratch is unrealistic for most companies, 
it is recommended to leverage some of the existing cloud services. 可以利用现成的云计算基础设施服务

Back of the envelope estimation

Step 2 - Propose high-level design and get buy-in

CDN and blob 一滴 一团 storage are the cloud services we will leverage.

client ->| streaming video -> CDN
       ->| everything else -> API servers

CDN: Videos are stored in CDN. When you press play, a video is streamed from the CDN.

API servers: Everything else except video streaming goes through API servers. This includes

Video uploading flow

components:

The flow is broken down into two processes running in parallel. 上传有两个并行任务

Flow A: Upload the actual video

  1. Videos are uploaded to the original storage. 视频上传到原始存储的位置
  2. Transcoding servers fetch videos from the original storage and start transcoding. 转码器拉取视频并进行转码。
  3. Once transcoding is complete, the following two steps are executed in parallel: 转码完成执行以下并行任务
    • 3a. Transcoded videos are sent to transcoded storage. 任务一是把转码后的视频存储到单独的存储单元中。
      • 3a.1. Transcoded videos are distributed to CDN. 转码的视频分发到各个 CDN 服务器。
    • 3b. Transcoding completion events are queued in the completion queue. 转码完成的消息通知放到消息队列中,用于通知上传者或者订阅者。
      • 3b.1. Completion handler contains a bunch of workers that continuously pull event data from the queue. 完成转码处理器定期拉取队列里面的消息,并处理,包括更新缓存和数据库中的视频元数据。
      • 3b.1.a. and 3b.1.b. Completion handler updates the metadata database and cache when video transcoding is complete.
  4. API servers inform the client that the video is successfully uploaded and is ready for streaming. API通知客户端上传完成。
Flow B: Update the metadata

While a file is being uploaded to the original storage, the client in parallel sends a request to update the video metadata. The request contains video metadata, including file name, size, format, etc. API servers update the metadata cache and database.

Video streaming flow

Downloading means the whole video is copied to your device, while streaming means your device continuously receives video streams from remote source videos.

streaming video:

streaming protocol:

Step 3 - Design deep dive

Video transcoding

If you want the video to be played smoothly on other devices, the video must be encoded into compatible bitrates 比特速率 and formats.

Bitrate is the rate at which bits are processed over time. A higher bitrate generally means higher video quality. High bitrate streams need more processing power and fast internet speed.

转码的原因:

编码格式包括两部分:

Directed acyclic 非周期的 graph (DAG) model

Transcoding a video is computationally expensive and time-consuming.视频转码是计算密集型和耗时的任务。

Facebook’s streaming video engine uses a directed acyclic graph (DAG) programming model, which defines tasks in stages so they can be executed sequentially or parallelly 平行地

some tasks:

Video transcoding architecture

Preprocessor
  1. Video splitting. 视频分割。视频流是分割视频为一段小的图片帧 Group of Pictures(GOP) alignment
    • GOP is a group/chunk of frames arranged in a specific order. GOP 是特殊顺序的一组帧
    • Each chunk is an independently playable unit, usually a few seconds in length. 每一组都是独立的可播放单元,几秒长。
  2. Some old mobile devices or browsers might not support video splitting. Preprocessor split videos by GOP alignment for old clients. 兼容老设备。
  3. DAG generation. The processor generates DAG based on configuration files client programmers write.
  4. Cache data. The preprocessor is a cache for segmented videos. For better reliability, the preprocessor stores GOPs and metadata in temporary storage. If video encoding fails, the system could use persisted data for retry operations. 利用缓存进行临时文件存储,也方便转码失败重试。
DAG scheduler

Resource manager

work flow:

Task workers

Different task workers may run different tasks:

Temporary storage

Multiple storage systems are used here:

System optimizations

Speed optimization: parallelize video uploading

Uploading a video as a whole unit is inefficient. We can split a video into smaller chunks by GOP alignment 上传视频文件时候把文件分割成不同的块,并行的上传,上传完成之后再拼接成完整视频。

[original video] ----> split by GOP alignment ---> [|GOP 1|GOP 2|...|GPP N|]

This allows fast resumable uploads when the previous upload failed. 断点续传

Speed optimization: place upload centers close to users

选择地理上靠近用户的上传中心,让用户上传。提升上传速度。

use CDN as upload centers

Speed optimization: parallelism everywhere

Achieving low latency

[original storage] ---> [message queue] ---> [download module] --->[message queue]--->[encoding module]
---> [message queue] ---> [upload module] ---> [message queue] ---> [encoded storage] --->upload encoded videos 
---> CDN
Safety optimization: pre-signed upload URL 安全性优化,不直接暴露上传接口地址,为每一个用户生成一个签名的上传地址
  1. The client makes a HTTP request to API servers to fetch the pre-signed URL, which gives the access permission to the object identified in the URL. The term pre-signed URL is used by uploading files to Amazon S3 (or other third-party service provider).
  2. API servers respond with a pre-signed URL.
  3. Once the client receives the response, it uploads the video using the pre-signed URL.
Safety optimization: protect your videos
Cost-saving optimization 成本优化

YouTube video streams follow long-tail distribution. 2/8原则

  1. For less popular content, we may not need to store many encoded video versions. Short videos can be encoded on-demand. 对于不频繁访问的内容,转码的清晰度也避免过多,可以存一两种清晰度即可。
  2. Some videos are popular only in certain regions. There is no need to distribute these videos to other regions. 局部地区或者国家频繁访问的内容只部署在那个地区或者国家的本地 CDN 上。
  3. Build your own CDN。 和 ISP(Internet service providers 网络运营商) 合作建立 CDN。 ISPs are located all around the world and are close to users.
    • 中国移动
    • 中国联通
    • 中国电信
    • Comcast
    • AT&T
    • Verizon

Error handling 错误处理

Step 4 - Wrap up

Chapter 15: Design Google Drive

Google Drive is a file storage and synchronization service that helps you store documents, photos, videos, and other files in the cloud.

Step 1 - Understand the problem and establish design scope

Candidate: What are the most important features?
Interviewer: Upload and download files, file sync, and notifications. 上传,下载,同步,通知
Candidate: Is this a mobile app, a web app, or both? 
Interviewer: Both. 网页端与手机app端都做
Candidate: What are the supported file formats? 
Interviewer: Any file type. 所有格式的文件都要支持
Candidate: Do files need to be encrypted?
Interview: Yes, files in the storage must be encrypted. 文件需要加密
Candidate: Is there a file size limit?
Interview: Yes, files must be 10 GB or smaller. 文件不大于 10 GB
Candidate: How many users does the product have? 
Interviewer: 10M DAU. 一千万日活用户

估算 back of the envelope estimation

Step 2 - Propose high-level design and get buy-in

simple -> scale up ->

简单方案:

APIS

Upload a file
POST https://api.example.com/files/upload?uploadType=resumable
- uploadType=resumable
- data: Local file to be uploaded.
Download a file
GET https://api.example.com/files/download
Params:
- path: download file path.
Get file revisions
GET https://api.example.com/files/list_revisions
Params:
- path: The path to the file you want to get the revision history.
- limit: The maximum number of revisions to return.

Move away from fingle server 从单服务器扩展到多服务器

空间不足警告

shard the data

Amazon S3 for storage. “Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance”

Amazon S3 supports same-region and cross-region replication. 冗余备份数据,防止数据丢失。 A region is a geographic area where Amazon web services (AWS) have data centers.

Metadata database: Move the database out of the server to avoid single point of failure.

Sync conflicts

strategy: the first version that gets processed wins, and the version that gets processed later receives a conflict 先改的版本成本存盘,后改的用户会遇到系统报冲突。

如何解决冲突? How can we resolve the conflict for user 2? Our system presents both copies of the same file: user 2’s local copy and the latest version from the server. User 2 has the option to merge both files or override one version with the other. 保存两份修改,本地修改和服务器端修改。用户2可以选择合并修改,或者覆盖第一个修改。

High-level design

Block servers:

Block servers upload blocks to cloud storage. Block storage, referred to as block-level storage, is a technology to store data files on cloud-based environments. A file can be split into several blocks, each with a unique hash value, stored in our metadata database. Each block is treated as an independent object and stored in our storage system (S3). To reconstruct a file, blocks are joined in a particular order. As for the block size, we use Dropbox as a reference: it sets the maximal size of a block to 4MB 块状存储服务器,大文件被分割为一个个小块,每个小块有自己的哈希码,每个小块存储在亚马逊对象存储服务器上,每个小块是独立的。 把一个个小块按照原始顺序可以重新拼接成一个整的文件。块大小可以自定义,有的设置为 4MB

Cloud storage: 云存储

A file is split into smaller blocks and stored in cloud storage.

Cold storage: 冷存储

Cold storage is a computer system designed for storing inactive data, meaning files are not accessed for a long time. 存储不活跃的数据。

Load balancer:

A load balancer evenly distributes requests among API servers.

API servers:

These are responsible for almost everything other than the uploading flow. 除了上传文件之外所有其他操作的 API

Metadata database:

It stores metadata of users, files, blocks, versions, etc.

Metadata cache:

Some of the metadata are cached for fast retrieval.

Notification service:

It is a publisher/subscriber system that allows data to be transferred from notification service to clients as certain events happen. 通知文件上传完成,文件被修改

Offline backup queue:

If a client is offline and cannot pull the latest file changes, the offline backup queue stores the info so changes will be synced when the client is online. 客户端下线之后,相应的修改会存在队列中,等待客户端上线之后进行同步。

Step 3 - Design deep dive

Block servers

Block server work flow:

-> split -> [block N] -> compress -> [zip files] -> encrypt -> [encrypted files] -> cloud storage

For large files that are updated regularly, sending the whole file on each update consumes a lot of bandwidth. Two optimizations are proposed to minimize the amount of network traffic being transmitted: 大文件频繁更新是一个大挑战,如果每次更新整个文件需要消耗大量带宽。

解决办法:

High consistency requirement

provide strong consistency for metadata cache and database layers.

Memory caches adopt an eventual consistency model by default 最终一致性

Achieving strong consistency in a relational database is easy because it maintains the ACID (Atomicity, Consistency, Isolation, Durability) properties

Metadata database

Namespace: A namespace is the root directory of a user.

File: File table stores everything related to the latest file.

File_version: It stores version history of a file. Existing rows are read-only to keep the integrity of the file revision history. 只读保证文件历史版本的完整性

Block: It stores everything related to a file block. A file of any version can be reconstructed by joining all the blocks in the correct order. 文件的历史版本可以通过组装块的顺序进行重新构建。

Upload flow

包括两个流:一个是上传文件元数据流(快速),另一个是上传文件二进制数据流(耗时)

Add file metadata.
  1. Client 1 sends a request to add the metadata of the new file. 上传元数据
  2. Store the new file metadata in metadata DB and change the file upload status to “pending.” 存储到元数据库中,并且修改上传状态为进行中。
  3. Notify the notification service that a new file is being added. 通知其他服务 “有文件添加中”
  4. The notification service notifies relevant clients (client 2) that a file is being uploaded.
Upload files to cloud storage.
  1. Client 1 uploads the content of the file to block servers.
  2. Block servers chunk the files into blocks, compress, encrypt the blocks, and upload them to cloud storage. 块服务切割文件为块,并且压缩和加密块,最后上传块到云存储中。
  3. Once the file is uploaded, cloud storage triggers upload completion callback. The request is sent to API servers. 一旦上传完成,云存储服务触发回调服务,通知 API 服务 “文件上传已经完成”。
  4. File status changed to “uploaded” in Metadata DB.
  5. Notify the notification service that a file status is changed to “uploaded.” 通知文件已经上传完成。
  6. The notification service notifies relevant clients (client 2) that a file is fully uploaded.

Download flow

客户端如何知道其他客户端上传了或者修改了某个文件呢?

Once a client knows a file is changed, it first requests metadata via API servers, then downloads blocks to construct the file. 客户端知道文件被修改之后,首先他会请求文件元数据,获取更改了哪些数据块的信息,然后下载这些变化的数据块,以此来重新构建本地文件。

  1. Notification service informs client 2 that a file is changed somewhere else.
  2. Once client 2 knows that new updates are available, it sends a request to fetch metadata.
  3. API servers call metadata DB to fetch metadata of the changes.
  4. Metadata is returned to the API servers.
  5. Client 2 gets the metadata.
  6. Once the client receives the metadata, it sends requests to block servers to download blocks.
  7. Block servers first download blocks from cloud storage.
  8. Cloud storage returns blocks to the block servers.
  9. Client 2 downloads all the new blocks to reconstruct the file.

Notification service

To maintain file consistency, any mutation of a file performed locally needs to be informed to other clients to reduce conflicts. 为了维持文件的一致性,任何文件的修改都需要通知其他客户端。

opt for long polling for the following two reasons: 选择定期拉取的策略的两个原因:

Save storage space

Failure Handling

Step 4 - Wrap up

Another interesting evolution of the system is moving online/offline logic to a separate service. Let us call it presence service. By moving presence service out of notification servers, online/offline functionality can easily be integrated by other services.

Continue learning

Real-world systems

Facebook Timeline: Brought To You By The Power Of Denormalization: https://goo.gl/FCNrbm
Scale at Facebook: https://goo.gl/NGTdCs
Building Timeline: Scaling up to hold your life story:
https://goo.gl/8p5wDV
Erlang at Facebook (Facebook chat): https://goo.gl/zSLHrj
Facebook Chat: https://goo.gl/qzSiWC
Finding a needle in Haystack: Facebook’s photo storage:
https://goo.gl/edj4FL
Serving Facebook Multifeed: Efficiency, performance gains through redesign: https://goo.gl/adFVMQ
Scaling Memcache at Facebook: https://goo.gl/rZiAhX
TAO: Facebook’s Distributed Data Store for the Social Graph:
https://goo.gl/Tk1DyH
Amazon Architecture: https://goo.gl/k4feoW
Dynamo: Amazon’s Highly Available Key-value Store:
https://goo.gl/C7zxDL
A 360 Degree View Of The Entire Netflix Stack:
https://goo.gl/rYSDTz
It’s All A/Bout Testing: The Netflix Experimentation Platform:
https://goo.gl/agbA4K
Netflix Recommendations: Beyond the 5 stars (Part 1):
https://goo.gl/A4FkYi
Netflix Recommendations: Beyond the 5 stars (Part 2):
https://goo.gl/XNPMXm
Google Architecture: https://goo.gl/dvkDiY
The Google File System (Google Docs): https://goo.gl/xj5n9R
Differential Synchronization (Google Docs): https://goo.gl/9zqG7x YouTube Architecture: https://goo.gl/mCPRUF
Seattle Conference on Scalability: YouTube Scalability:
https://goo.gl/dH3zYq
Bigtable: A Distributed Storage System for Structured Data:
https://goo.gl/6NaZca
Instagram Architecture: 14 Million Users, Terabytes Of Photos, 100s Of Instances, Dozens Of Technologies: https://goo.gl/s1VcW5
The Architecture Twitter Uses To Deal With 150M Active Users:
https://goo.gl/EwvfRd
Scaling Twitter: Making Twitter 10000 Percent Faster:
https://goo.gl/nYGC1k
Announcing Snowflake (Snowflake is a network service for generating unique ID numbers at high scale with some simple guarantees): https://goo.gl/GzVWYm
Timelines at Scale: https://goo.gl/8KbqTy
How Uber Scales Their Real-Time Market Platform:
https://goo.gl/kGZuVy
Scaling Pinterest: https://goo.gl/KtmjW3
Pinterest Architecture Update: https://goo.gl/w6rRsf
A Brief History of Scaling LinkedIn: https://goo.gl/8A1Pi8 Flickr Architecture: https://goo.gl/dWtgYa
How We've Scaled Dropbox: https://goo.gl/NjBDtC
The WhatsApp Architecture Facebook Bought For $19 Billion:
https://bit.ly/2AHJnFn

Company engineering blogs

Here is a list of engineering blogs of well-known large companies and startups.

Airbnb: https://medium.com/airbnb-engineering Amazon: https://developer.amazon.com/blogs
Asana: https://blog.asana.com/category/eng Atlassian: https://developer.atlassian.com/blog
Bittorrent: http://engineering.bittorrent.com Cloudera: https://blog.cloudera.com
Docker: https://blog.docker.com
Dropbox: https://blogs.dropbox.com/tech eBay: http://www.ebaytechblog.com Facebook: https://code.facebook.com/posts
GitHub: https://githubengineering.com Google: https://developers.googleblog.com Groupon: https://engineering.groupon.com
Highscalability: http://highscalability.com Instacart: https://tech.instacart.com
Instagram: https://engineering.instagram.com Linkedin: https://engineering.linkedin.com/blog
Mixpanel: https://mixpanel.com/blog
Netflix: https://medium.com/netflix-techblog Nextdoor: https://engblog.nextdoor.com
PayPal: https://www.paypal-engineering.com Pinterest: https://engineering.pinterest.com
Quora: https://engineering.quora.com

Reddit: https://redditblog.com
Salesforce: https://developer.salesforce.com/blogs/engineering
Shopify: https://engineering.shopify.com
Slack: https://slack.engineering
Soundcloud: https://developers.soundcloud.com/blog
Spotify: https://labs.spotify.com
Stripe: https://stripe.com/blog/engineering
System design primer: https://github.com/donnemartin/system- design-primer
Twitter: https://blog.twitter.com/engineering/en_us.html Thumbtack: https://www.thumbtack.com/engineering
Uber: http://eng.uber.com
Yahoo: https://yahooeng.tumblr.com
Yelp: https://engineeringblog.yelp.com
Zoom: https://medium.com/zoom-developer-blog

References