
Designing Data-Intensive Applications - by Martin Kleppmann, 2017


Main ideas and overview of the book:

Many applications today are data-intensive rather than compute-intensive.

Big questions/hard problems:

  1. How to handle load?
  2. How to handle failures?

Main data distribution strategies: Replication & Partitioning

Table of contents:

Chapter1 - Reliability/Scalability

Describing Load:

Describing Performance:

Chapter2 - Data Models & Query Languages

Declarative Query Language (SQL) vs Imperative Query API

MapReduce is a hybrid - a declarative selection query, with the map and reduce steps specified as user code (see the sketch below)
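
A minimal sketch of the declarative/imperative contrast; the data and field names (`animals`, `family`) are made up for illustration:

```python
# Imperative: we spell out *how* to scan the data and collect results.
animals = [
    {"name": "lamprey", "family": "Petromyzontidae"},
    {"name": "great white", "family": "Sharks"},
]

sharks = []
for animal in animals:
    if animal["family"] == "Sharks":
        sharks.append(animal)

# Declarative (SQL): we state *what* we want; the query optimizer is
# free to choose indexes, execution order, and parallelism.
query = "SELECT * FROM animals WHERE family = 'Sharks';"
```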

Relational Model - classic approach

Document Model

Graph-Like Model

Chapter3 - Storage and Retrieval

Storage engine types:

- in-memory only (limited in size)
- log-structured storage engine
	- in-memory hash index or sorted/sparse index
	- in-memory memtable - balanced tree to sort
	- disk based sorted log database (SSTable, LSM tree) - append only
- page-structured storage engines
	- disk based B-tree index
	- disk based database - in-place updates

In-memory hash-index

SSTable/LSM-Tree

- in-memory writes go to a sorted/balanced tree - periodically written to disk (sorted + compacted) - see the toy sketch after this list
- keep an in-memory sparse index for each segment written to disk - sparse is enough because values are sorted
- values can be repeated, so lookups must check segments in a specific order, from most recent to least recent
- Bloom filter for quickly discarding non-members
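
A toy single-process sketch of the idea; `TinyLSM` and its limits are mine, and a real engine keeps segments on disk, adds a sparse index and a Bloom filter per segment, and compacts segments in the background:

```python
import bisect

class TinyLSM:
    """Toy LSM sketch: an in-memory memtable flushed to sorted,
    immutable segments; reads check the newest segments first."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}       # stand-in for a balanced tree
        self.segments = []       # sorted [(key, value), ...] lists, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # flush: write out the keys in sorted order (an SSTable segment)
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # check segments from most recent to least recent
        for segment in reversed(self.segments):
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None

db = TinyLSM()
for i in range(6):
    db.put(f"key{i}", i)     # triggers one flush along the way
assert db.get("key2") == 2   # found in a flushed segment
```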

B-tree

- works with fixed-size blocks/pages
- each page stores key-range boundaries and references to child pages - lookup starts from the root page (see the sketch below)
- high branching factor (~100-500), so only a few levels
- easy to add a new key/split a page, trickier to delete
- pros: each key lives in exactly one place, easier to lock/make transactional
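
A toy lookup sketch; the dict-based pages and the tiny tree below are assumptions for illustration, whereas real pages are fixed-size disk blocks with a branching factor in the hundreds:

```python
import bisect

def btree_lookup(page, key):
    # descend from the root: boundary keys tell us which child
    # covers the range containing `key`
    while not page["leaf"]:
        i = bisect.bisect_right(page["keys"], key)
        page = page["children"][i]
    # leaf page: binary-search the sorted keys in place
    i = bisect.bisect_left(page["keys"], key)
    if i < len(page["keys"]) and page["keys"][i] == key:
        return page["values"][i]
    return None

# hypothetical two-level tree with a single boundary key, 100
root = {
    "leaf": False,
    "keys": [100],
    "children": [
        {"leaf": True, "keys": [7, 42], "values": ["a", "b"]},
        {"leaf": True, "keys": [100, 250], "values": ["c", "d"]},
    ],
}
assert btree_lookup(root, 250) == "d"
```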

Write Performance - LSM-tree wins

Read Performance - B-tree wins

Access Pattern differences

Column-Oriented Storage

Chapter4 - Encoding and Evolution

Data Encoding/Serialization is the transformation between the in-memory representation and a byte sequence (and back)

Evolvability

Data Encoding formats:


Chapter5 - Replication

Shared nothing architecture

Shared memory or shared disk architecture

Main approaches to replication:

Quorum
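
My notes are bare here; the core quorum fact from the book is w + r > n: with n replicas, writes acknowledged by w nodes, and reads consulting r nodes, the read set and write set overlap in at least one node, so a read sees the latest acknowledged write. A minimal check:

```python
def is_quorum(n, w, r):
    # the read set (r nodes) and write set (w nodes) must overlap
    return w + r > n

assert is_quorum(3, 2, 2)        # the common n=3, w=2, r=2 setup
assert not is_quorum(3, 1, 2)    # a read may miss the latest write
```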

Node failure

Replication implementation

Hard problems: Consistency guarantees & Conflict resolution

Consistency guarantees - eventual consistency, replication lag

- Read-Your-Own-Writes
	- client remembers the logical timestamp of its last write and compares it with the replica's logical timestamp (see the sketch after this list)
- Monotonic Reads - (different answers from replicas - disappearing writes)
	- always read from the same replica for a given user
- Consistent Prefix Reads - (see writes in the same order)
	- happens with partitions
	- no easy solution - keep track of causal dependencies
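
A sketch of the Read-Your-Own-Writes timestamp comparison; the `Replica` class and its `applied_ts` watermark are hypothetical names of mine, not the book's:

```python
class Replica:
    """Hypothetical follower: applied_ts is the logical timestamp up
    to which it has applied the leader's writes."""
    def __init__(self, applied_ts, data):
        self.applied_ts = applied_ts
        self.data = data

def read_own_writes(replicas, key, last_write_ts):
    # serve the read only from a replica that has caught up to the
    # client's last write; otherwise fall back (e.g. to the leader)
    for r in replicas:
        if r.applied_ts >= last_write_ts:
            return r.data.get(key)
    return None  # no replica caught up yet: retry or read the leader

fresh = Replica(applied_ts=12, data={"profile": "new bio"})
stale = Replica(applied_ts=9, data={"profile": "old bio"})
assert read_own_writes([stale, fresh], "profile", last_write_ts=10) == "new bio"
```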

Conflict resolution - where multiple parallel writes might happen (Multi-Leader, Leaderless)

- Conflict avoidance - each user writes to the same leader
- Last-Write-Wins - durability violated
- Merging values - set, list
- Record conflict - delegate to app layer, user resolution
- Read Repair (for quorum reads)
- Versioning for each key in DB
	- client must read before writing (DB returns version, client sends version)
	- concurrent values are kept as siblings - need for conflict resolution
	- version vector - one version number per replica (see the sketch after this list)
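
A sketch of comparing version vectors to decide whether one value supersedes the other or the two are concurrent siblings; representing a vector as a plain dict of replica -> counter is my assumption:

```python
def compare(vv_a, vv_b):
    """Compare two version vectors. Returns 'equal', 'a<=b', 'b<=a',
    or 'concurrent' (keep both values as siblings and merge later)."""
    replicas = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(r, 0) <= vv_b.get(r, 0) for r in replicas)
    b_le_a = all(vv_b.get(r, 0) <= vv_a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<=b"        # b supersedes a; a can be discarded
    if b_le_a:
        return "b<=a"
    return "concurrent"      # neither dominates: siblings

assert compare({"r1": 2}, {"r1": 3}) == "a<=b"
assert compare({"r1": 2}, {"r2": 1}) == "concurrent"
```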

Chapter6 - Partitioning

When you have so much data that it is no longer feasible to store or process it on a single machine:

Partitioning spreads the data and query load across many machines

Record -> Partition -> Node

Two main approaches:

- partition by - Key Range
	- sorted, supports range scans
	- certain access patterns can lead to hot spots - e.g. a date key, where all recent data lands in one partition
- partition by - Hash of Key
	- randomizes data, breaks hot-spot patterns (see the sketch after this list)
	- but cannot be range-scanned
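
A minimal hash-partitioning sketch; md5 and mod-N are illustrative choices only, since real systems avoid plain mod N (it reshuffles almost every key on rebalancing) in favor of many fixed partitions moved between nodes:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # a stable hash of the key picks the partition: hot keys spread
    # out evenly, but adjacent keys scatter, so range scans are lost
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# record -> partition -> node
assert partition_for("user:42", 16) == partition_for("user:42", 16)
```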

Problems

Skewed partitioning, hot spot

Secondary indexes - don't map neatly to partitions

Rebalancing - changing data size

Request routing

Chapter7 - Transactions

Transaction

ACID

(Transaction) Isolation Levels

Read Committed
 - prevents dirty reads - reads of uncommitted data
   - (two versions kept - committed vs transaction-in-progress)
 - prevents dirty writes - overwriting uncommitted data (row-level locks)
Snapshot Isolation ~ Repeatable Read
 - writers block writers; readers never block (neither writers nor readers)
 - good enough for small read/write transactions and long-running read-only ones
 - prevents read skew - a timing anomaly: seeing different parts of the database as of different points in time
   - while transaction A is ongoing, it sees changes committed by another transaction B
   - instead of a consistent snapshot as of the start of transaction A (multi-version concurrency control)
Serializable (guarantees some serial order)
 - prevents lost updates - a timing anomaly
   - two concurrent read-modify-write transactions; the later one overwrites the earlier write without incorporating its changes
   - (remedies: atomic operations / explicit locks / auto-detect and abort / compare-and-set - see the sketch after this block)
 - prevents write skew - a timing anomaly: two transactions read-and-update two different objects based on a premise that has become outdated
   - phantoms - a query on some condition, then another write changes that query's result: outdated premise (can't lock data that isn't committed yet)
   - examples: at least one doctor on call, room double-booking, double-spending, duplicate username chosen
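
A sketch of the compare-and-set remedy for lost updates; a plain dict stands in for the database here, while a real database performs the check-and-write as one atomic operation:

```python
def compare_and_set(db, key, expected, new_value):
    # the write only succeeds if the value hasn't changed since we
    # read it; otherwise the caller must re-read and retry
    if db.get(key) != expected:
        return False
    db[key] = new_value
    return True

db = {"counter": 10}
old = db["counter"]
assert compare_and_set(db, "counter", old, old + 1)       # succeeds
assert not compare_and_set(db, "counter", old, old + 1)   # stale read: retry
```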

Implementations

Chapter8 - Trouble with Distributed Systems

No shared memory (shared nothing architecture) - communicate via network

All this can lead to partial failures and nondeterminism.

How to handle network faults?: timeout

What is the correct timeout value?

Trade-off: Bounded delay vs Resource Utilization

Circuit-switching

Packet-switching

Unreliable Clocks - pitfalls

Physical clocks

Trade-off: Cost vs Reliability

Hard Real-Time Systems

Byzantine Faults - (hard to deal with, usually assumed not to happen)

Philosophical:

Truth is defined by the majority

Chapter9 - Consistency and Consensus

Distributed Consistency Models

Linearizability (~Consistency in reads/writes)
- total order - as if there were a single copy of the data and all operations were atomic
- recency guarantee
- single up-to-date value - that all nodes agree on
- coordinating the state of replicas in the face of delays/faults
- slow, proportional to network delays; coordination overhead comes with a performance hit

Causality (Causally consistent)
- partial order, causal ordering
- causally related events are ordered (happens-before)
- unrelated events are concurrent, no order defined
- can be implemented without the performance hit (see the Lamport-clock sketch below)
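
A sketch of a Lamport clock, one way to generate an ordering consistent with causality; the class name and API are mine:

```python
class LamportClock:
    """Logical clock: happens-before implies a smaller timestamp
    (the converse does not hold - equal-looking orderings may be
    concurrent)."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # local event or message send
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # on receipt, jump past the sender's timestamp
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()            # a sends a message carrying its timestamp
t_recv = b.receive(t_send)   # b's clock moves past the sender's
assert t_recv > t_send
```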

Ordering - helps preserve causality

Total Order Broadcast - Atomic Broadcast

CAP Theorem - when the network is Partitioned, choose one:

Problems with CAP

Consensus

Coordination/Consensus services - ZooKeeper and etcd

"Outsourcing" some of the work of coordinating nodes

Fault-Tolerant Consensus Algorithms:

2-Phase-Commit - 2PC -> (comes down to a single-node atomic commit on the coordinator; see the sketch below)
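
A sketch of the 2PC control flow; the participants with prepare/commit/abort methods are hypothetical objects of mine, and the single append to the coordinator's log is the actual commit point:

```python
def two_phase_commit(coordinator_log, participants):
    # Phase 1 (prepare): every participant must promise it can commit
    if all(p.prepare() for p in participants):
        coordinator_log.append("commit")   # the single-node atomic decision
        # Phase 2: push the decision until every participant applies it,
        # retrying forever if needed - prepared participants cannot
        # change their mind
        for p in participants:
            p.commit()
        return True
    # any "no" vote (or timeout) aborts everyone
    coordinator_log.append("abort")
    for p in participants:
        p.abort()
    return False
```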

Consensus is easy - if we have the leader node

Fault-Tolerant Consensus (uniform consensus) is hard

Chicken-and-egg problem

Given a weaker guarantee - a unique leader within each epoch


Chapter10 - Batch processing

On a single machine only - Unix tools efficiency

On multiple machines - Distributed Batch Processing has to solve:

MapReduce-Hadoop ecosystem

Higher level programming models

HDFS - distributed filesystem

Map_Group/SortMerge_Reduce (similar to the UNIX philosophy)

Mapper
- takes a file block as input and iterates over records
- extracts [key, value] pairs from each record
- sorts by key
- writes to disk

Reducer - assigned some keys/partitions
- connects to each mapper
- downloads the files of sorted [key, value] pairs (corresponding to its partition)
- merges the sorted files together
- iterates over keys and produces output - (which can be the input for the next MapReduce job; see the sketch after this list)
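
A single-process word-count sketch of this dataflow; real jobs distribute mappers across HDFS blocks and reducers across partitions, and here `sorted` plus `groupby` stand in for the shuffle/sort-merge step:

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # extract [key, value] pairs from each record
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # iterate over one key's values and produce output
    yield (key, sum(values))

def map_reduce(records):
    # map phase: emit pairs, then sort them by key (the "shuffle")
    pairs = sorted(kv for rec in records for kv in mapper(rec))
    # reduce phase: merge the sorted run and call the reducer per key
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield from reducer(key, (v for _, v in group))

print(dict(map_reduce(["to be or not to be"])))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```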

Reduce-Side Join / Sort-Merge join

Map-Side Join

MapReduce vs MPP (Massively Parallel Processing Databases)

Beyond MapReduce - Dataflow Engines

Graph data

Chapter11 - Stream processing

Messaging systems

Queue-based message brokers (AMQP/JMS)

Log-based message brokers (Kafka/Kinesis)

Change Data Capture

Event sourcing

Mutable state & Immutable changelog

Stream Processing output

Unbounded input

Stream Processing

Event time vs Processing time

Joins

Fault tolerance - exactly once execution

Chapter12 - Future of Data Systems

No one-size-fits-all software - need to integrate different data systems

How to keep data in sync/consistent? - Ordering

  1. log-based derived data - log-based order
  2. distributed transactions - order writes using 2PL - locks & mutex

How to handle failures?

  1. log-based derived data - (deterministic) retry and idempotency (operations may run more than once)
  2. distributed transactions - retry with atomic commit to ensure exactly-once execution

Log-based derived data

Distributed transactions

Stable ordering and fault-tolerant message processing are quite strict demands, but less expensive and more operationally robust than distributed transactions (1. log-based derived data >> 2. distributed transactions)

Lambda architecture

Reads are events too

Making operations idempotent - Duplicate suppression (see the sketch below)
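
A sketch of duplicate suppression via client-supplied request IDs; the in-memory set stands in for a durable store, and the check-then-act below is not atomic - it only shows the idea of turning at-least-once delivery into effectively-once processing:

```python
processed = set()   # in practice a durable store keyed by request ID

def apply_once(request_id, operation):
    # retries of the same request carry the same ID and are skipped
    if request_id in processed:
        return "duplicate-ignored"
    result = operation()
    processed.add(request_id)
    return result

apply_once("req-7", lambda: "charged $10")   # executes
apply_once("req-7", lambda: "charged $10")   # suppressed on retry
```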

Enforcing uniqueness constraints

Single partition solution for uniqueness

Multi partition solution for uniqueness

Consistency conflates two requirements:

Hard uniqueness constraint (consensus)

Weaker uniqueness constraint (timeliness enforcement is relaxed):

Dataflow systems:

Trade-off: Performance/Availability at high load vs Number of inconsistencies/Compensating transactions


Ethics