Designing Data-Intensive Application 读书笔记 1. —— 简介 & 资料

书籍下载地址：

https://download.csdn.net/download/hapjin/10653988?utm_medium=distribute.pc_relevant_download.none-task-download-2~default~BlogCommendFromBaidu~Rate-2.dl_default&depth_1-utm_source=distribute.pc_relevant_download.none-task-download-2~default~BlogCommendFromBaidu~Rate-2.dl_default&dest=https%3A%2F%2Fdownload.csdn.net%2Fdownload%2Fhapjin%2F10653988&spm=1003.2020.3001.6616.2

Part I. Foundations of Data Systems
1. Reliable, Scalable, and Maintainable Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Thinking About Data Systems 4
Reliability 6
Hardware Faults 7
Software Errors 8
Human Errors 9
How Important Is Reliability? 10
Scalability 10
Describing Load 11
Describing Performance 13
Approaches for Coping with Load 17
Maintainability 18
Operability: Making Life Easy for Operations 19
Simplicity: Managing Complexity 20
Evolvability: Making Change Easy 21
Summary

2. Data Models and Query Languages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Relational Model Versus Document Model 28
The Birth of NoSQL 29
The Object-Relational Mismatch 29
Many-to-One and Many-to-Many Relationships 33
Are Document Databases Repeating History? 36

Relational Versus Document Databases Today 38
Query Languages for Data 42
Declarative Queries on the Web 44
MapReduce Querying 46
Graph-Like Data Models 49
Property Graphs 50
The Cypher Query Language 52
Graph Queries in SQL 53
Triple-Stores and SPARQL 55
The Foundation: Datalog 60
Summary 63

3. Storage and Retrieval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Data Structures That Power Your Database 70
Hash Indexes 72
SSTables and LSM-Trees 76
B-Trees 79
Comparing B-Trees and LSM-Trees 83
Other Indexing Structures 85
Transaction Processing or Analytics? 90
Data Warehousing 91
Stars and Snowflakes: Schemas for Analytics 93
Column-Oriented Storage 95
Column Compression 97
Sort Order in Column Storage 99
Writing to Column-Oriented Storage 101
Aggregation: Data Cubes and Materialized Views 101
Summary 103

4. Encoding and Evolution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Formats for Encoding Data 112
Language-Specific Formats 113
JSON, XML, and Binary Variants 114
Thrift and Protocol Buffers 117
Avro 122
The Merits of Schemas 127
Modes of Dataflow 128
Dataflow Through Databases 129
Dataflow Through Services: REST and RPC 131
Message-Passing Dataflow 136
Summary 139

In Part I, we discuss the fundamental ideas that underpin the design of data-

intensive applications. We start in Chapter 1 by discussing what we’re actually

trying to achieve: reliability, scalability, and maintainability; how we need to

think about them; and how we can achieve them. In Chapter 2 we compare sev‐

eral different data models and query languages, and see how they are appropriate

to different situations. In Chapter 3 we talk about storage engines: how databases

arrange data on disk so that we can find it again efficiently. Chapter 4 turns to

formats for data encoding (serialization) and evolution of schemas over time.

Part II. Distributed Data
5. Replication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Leaders and Followers 152
Synchronous Versus Asynchronous Replication 153
Setting Up New Followers 155
Handling Node Outages 156
Implementation of Replication Logs 158
Problems with Replication Lag 161
Reading Your Own Writes 162
Monotonic Reads 164
Consistent Prefix Reads 165
Solutions for Replication Lag 167
Multi-Leader Replication 168
Use Cases for Multi-Leader Replication 168
Handling Write Conflicts 171
Multi-Leader Replication Topologies 175
Leaderless Replication 177
Writing to the Database When a Node Is Down 177
Limitations of Quorum Consistency 181
Sloppy Quorums and Hinted Handoff 183
Detecting Concurrent Writes 184
Summary 192

6. Partitioning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Partitioning and Replication 200
Partitioning of Key-Value Data 201
Partitioning by Key Range 202
Partitioning by Hash of Key 203
Skewed Workloads and Relieving Hot Spots 205
Partitioning and Secondary Indexes 206
Partitioning Secondary Indexes by Document 206
Partitioning Secondary Indexes by Term 208
Rebalancing Partitions 209
Strategies for Rebalancing 210
Operations: Automatic or Manual Rebalancing 213
Request Routing 214
Parallel Query Execution 216
Summary 216

7. Transactions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
The Slippery Concept of a Transaction 222

The Meaning of ACID 223
Single-Object and Multi-Object Operations 228
Weak Isolation Levels 233
Read Committed 234
Snapshot Isolation and Repeatable Read 237
Preventing Lost Updates 242
Write Skew and Phantoms 246
Serializability 251
Actual Serial Execution 252
Two-Phase Locking (2PL) 257
Serializable Snapshot Isolation (SSI) 261
Summary 266

8. The Trouble with Distributed Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Faults and Partial Failures 274
Cloud Computing and Supercomputing 275
Unreliable Networks 277
Network Faults in Practice 279
Detecting Faults 280
Timeouts and Unbounded Delays 281
Synchronous Versus Asynchronous Networks 284
Unreliable Clocks 287
Monotonic Versus Time-of-Day Clocks 288
Clock Synchronization and Accuracy 289
Relying on Synchronized Clocks 291
Process Pauses 295
Knowledge, Truth, and Lies 300
The Truth Is Defined by the Majority 300
Byzantine Faults 304
System Model and Reality 306
Summary 310

9. Consistency and Consensus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Consistency Guarantees 322
Linearizability 324
What Makes a System Linearizable? 325
Relying on Linearizability 330
Implementing Linearizable Systems 332
The Cost of Linearizability 335
Ordering Guarantees 339
Ordering and Causality 339
Sequence Number Ordering 343

Total Order Broadcast 348
Distributed Transactions and Consensus 352
Atomic Commit and Two-Phase Commit (2PC) 354
Distributed Transactions in Practice 360
Fault-Tolerant Consensus 364
Membership and Coordination Services 370
Summary 373

In Part II, we move from data stored on one machine to data that is distributed

across multiple machines. This is often necessary for scalability, but brings with

it a variety of unique challenges. We first discuss replication (Chapter 5), parti‐

tioning/sharding (Chapter 6), and transactions (Chapter 7). We then go into

more detail on the problems with distributed systems (Chapter 8) and what it

means to achieve consistency and consensus in a distributed system (Chapter 9).

Part III. Derived Data
10. Batch Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
Batch Processing with Unix Tools 391
Simple Log Analysis 391
The Unix Philosophy 394
MapReduce and Distributed Filesystems 397
MapReduce Job Execution 399
Reduce-Side Joins and Grouping 403
Map-Side Joins 408
The Output of Batch Workflows 411
Comparing Hadoop to Distributed Databases 414
Beyond MapReduce 419
Materialization of Intermediate State 419
Graphs and Iterative Processing 424
High-Level APIs and Languages 426
Summary 429

11. Stream Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
Transmitting Event Streams 440
Messaging Systems 441
Partitioned Logs 446
Databases and Streams 451
Keeping Systems in Sync 452
Change Data Capture 454
Event Sourcing 457
State, Streams, and Immutability 459
Processing Streams 464
Uses of Stream Processing 465
Reasoning About Time 468
Stream Joins 472
Fault Tolerance 476
Summary 479

12. The Future of Data Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
Data Integration 490
Combining Specialized Tools by Deriving Data 490
Batch and Stream Processing 494
Unbundling Databases 499
Composing Data Storage Technologies 499
Designing Applications Around Dataflow 504
Observing Derived State 509
Aiming for Correctness 515
The End-to-End Argument for Databases 516
Enforcing Constraints 521
Timeliness and Integrity 524
Trust, but Verify 528
Doing the Right Thing 533
Predictive Analytics 533
Privacy and Tracking 536
Summary 543
Glossary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559

3. In Part III, we discuss systems that derive some datasets from other datasets.

Derived data often occurs in heterogeneous systems: when there is no one data‐

base that can do everything well, applications need to integrate several different

databases, caches, indexes, and so on. In Chapter 10 we start with a batch pro‐

cessing approach to derived data, and we build upon it with stream processing in

Chapter 11. Finally, in Chapter 12 we put everything together and discuss

approaches for building reliable, scalable, and maintainable applications in the

future.

何时需要可扩展的数据系统？（vs 关系型数据库） =》先分析利、弊

Sometimes, when discussing scalable data systems, people make comments along the
lines of, “You’re not Google or Amazon. Stop worrying about scale and just use a
relational database.” There is truth in that statement: building for scale that you don’t
need is wasted effort and may lock you into an inflexible design. In effect, it is a form
of premature optimization. However, it’s also important to choose the right tool for
the job, and different technologies each have their own strengths and weaknesses. As
we shall see, relational databases are important but not the final word on dealing with
data.

参考资料如何获取

For academic papers, you can search for the title in Google Scholar to find
open-access PDF files. Alternatively, you can find all of the references at https://
github.com/ept/ddia-references, where we maintain up-to-date links.

The references at the end of each chapter are a great resource if

you want to explore an area in more depth, and most of them are freely available

online.

This book has over 800 references to articles, blog posts, talks, documenta‐

tion, and more, and they have been an invaluable learning resource for me. I am very

grateful to the authors of this material for sharing their knowledge.

资料

We have a web page for this book, where we list errata, examples, and any additional

information. You can access this page at http://bit.ly/designing-data-intensive-apps.

For more information about our books, courses, conferences, and news, see our web‐

site at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Watch us on YouTube: http://www.youtube.com/oreillymedia

http://oreilly.com/safari.

Members have access to thousands of books, training videos, Learning Paths, interac‐

tive tutorials, and curated playlists from over 250 publishers, including O’Reilly

Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Pro‐

fessional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco

Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt,

Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett,

and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

开源 vs 闭源

This book has a bias toward free and open source software (FOSS), because reading,
modifying, and executing source code is a great way to understand how something
works in detail. Open platforms also reduce the risk of vendor lock-in. However,
where appropriate, we also discuss proprietary software (closed-source software, soft‐
ware as a service, or companies’ in-house software that is only described in literature
but not released publicly).