Designing Data-Intensive Applications - by Martin Kleppmann, 2017

Main ideas and overview of the book:
Many applications today are data-intensive rather than compute-intensive:
- raw CPU power is not the limiting factor
- the primary challenge is data - its quantity, processing speed, and complexity
Big questions/hard problems:
- How to handle load?
- indexes to speed up reads (range queries? hot indexes?)
- replication - read scaling / latency improvement
- partitioning - data size scaling & read/write scaling
- skewed/hot partitions? rebalancing?
- How to handle failures?
- replication - failover / catch-up recovery
- lag? consistency guarantees? conflict resolution?
Main data distribution strategies: Replication & Partitioning
Table of contents:
- Part 1 - focused on data storage & retrieval on a single machine
- Part 2 - data storage & retrieval on multiple machines
- Part 3 - building derived data systems - dataflow architecture
Chapter1 - Reliability/Scalability
Describing Load:
- specific to your system
- e.g: volume of reads/writes, volume of data to store, distribution of followers per user
Describing Performance:
- Throughput - in offline/batch systems (nr of records per second, total processing time of dataset)
- Response time - in online/interactive systems - percentiles: median (p50), p95, p999 (worst 1 in 1000 requests) - tail latencies
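A minimal sketch of how such percentiles can be computed from raw latencies, using a nearest-rank approach; the numbers and helper name are made up for illustration:

```python
# Hypothetical sketch: computing response-time percentiles from a list of
# request latencies (values and names are illustrative only).
def percentile(latencies_ms, p):
    """Return the p-th percentile (0 < p < 100) using nearest-rank."""
    ordered = sorted(latencies_ms)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies_ms = [12, 15, 17, 20, 22, 30, 45, 90, 250, 1200]
print(percentile(latencies_ms, 50))    # median (p50)
print(percentile(latencies_ms, 95))    # p95
print(percentile(latencies_ms, 99.9))  # p999 - dominated by the slowest request
```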
Chapter2 - Data Models & Query Languages
Declarative Query Language (SQL) vs Imperative Query API
- more concise
- hides implementation details, allowing the database system to introduce improvements
- decouples client code from database engine details
- more limited in functionality, which gives the database more room for optimizations
MapReduce is a hybrid - a declarative selection query, with map and reduce specified by user code
- map(record) -> emit(key, value)
- group and sort by key
- reduce(key, values) -> called once per key, to produce the final result
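As a rough illustration of that map -> group/sort -> reduce flow, a single-machine word count (not how any particular engine implements it):

```python
# Minimal single-machine sketch of map -> group/sort by key -> reduce.
from collections import defaultdict

def map_fn(record):                      # map(record) -> emit(key, value)
    for word in record.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):              # reduce(key, values) -> final result
    return (key, sum(values))

records = ["the quick brown fox", "the lazy dog"]
groups = defaultdict(list)
for record in records:
    for key, value in map_fn(record):    # map phase
        groups[key].append(value)        # group by key (the shuffle/sort step)

results = [reduce_fn(k, vs) for k, vs in sorted(groups.items())]
print(results)  # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```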
Relational Model - classic approach
- if self-contained data with mostly one-to-many relationships - Document Model better
- less data locality, needs to join information
- if lots of many-to-many relationships, anything can be connected - Graph Model better
- complicated queries, several levels of joins needed, number of joins not fixed
- schema-on-write - table update/backfill needed for schema change
Document Model
- suitable for tree-like structure (mostly one-to-many relationships)
- easy to represent: one-to-many relationships
- hard to represent: many-to-one, many-to-many - data duplication, manual/client-side joins - reference resolution
- data locality
- schema-on-read - easy to incorporate schema changes
Graph-Like Model
- suitable for lots of many-to-many relationships
- flexibility to add edges, vertices
Property Graph Model
- vertices (properties) + edges (properties - head - tail - label)
Triple Store (RDF - Resource Description Framework)
- (subject, predicate, object) - does not distinguish between properties and edges - predicates used for both
- subject (vertex) + predicate (property name | edge label) + object (property value | vertex)
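A toy illustration of the triple representation, with made-up data, showing how the same predicate mechanism serves both plain properties and edges to other vertices:

```python
# Illustrative only: graph data as (subject, predicate, object) triples.
triples = [
    ("lucy",  "bornIn", "idaho"),
    ("idaho", "within", "usa"),
    ("lucy",  "name",   "Lucy"),            # predicate used as a property name
    ("usa",   "name",   "United States"),
]

def objects(subject, predicate):
    """All objects for a given (subject, predicate) pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("lucy", "bornIn"))   # ['idaho'] - edge to another vertex
print(objects("lucy", "name"))     # ['Lucy']  - plain property value
```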
Chapter3 - Storage and Retrieval
- For efficient lookups we need an index
- For efficient range lookups, we need a sorted index
- But each index comes with write overhead cost
- To recover from crashes - write-ahead log or redo log
Storage engine types:
- in-memory only (limited in size)
- log-structured storage engine
- in-memory hash index or sorted/sparse index
- in-memory memtable - balanced tree to sort
- disk based sorted log database (SSTable, LSM tree) - append only
- page-oriented storage engines
- disk based B-tree index
- disk based database - in-place updates
In-memory hash-index
- simple index, append only writes
- cons: if many keys, it does not fit in RAM
- cons: no support for range queries, only exact queries
SSTable/LSM-Tree
- sorted, sparse index; writes go first to memory - updates are appended
- in-memory writes go to a sorted, balanced tree - periodically written to disk (sorted + compacted)
- keep an in-memory sparse index for each segment written to disk - sparse is enough because values are sorted
- keys can be repeated across segments; reads check segments in order, from most recent to least recent
- Bloom filter for quickly discarding non-members
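A minimal, illustrative sketch of the LSM idea (no compaction, sparse index, or Bloom filter; the class and method names are invented):

```python
# Writes go to an in-memory memtable, which is flushed to immutable sorted
# segments; reads check the memtable first, then segments newest to oldest.
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}           # stand-in for a balanced tree
        self.segments = []           # newest segment last; each is a sorted list
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.segments.append(sorted(self.memtable.items()))  # flush as an SSTable
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # most recent segment first
            for k, v in segment:                  # a real SSTable would binary-search
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2); db.put("a", 3)    # "a" now lives in two places
print(db.get("a"))   # 3 - the most recent value wins
```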
B-tree
- sorted index, write to disk - update-in-place, overwrite
- works with fixed-sized blocks/pages
- each page stores key range boundaries and references to child pages - starts from root page
- high branching factor (~100-500), few levels (see the calculation after this list)
- easy to add new key/split a page, tricky to delete
- pros: keys are stored in one place, easier to lock/make transactional
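A back-of-the-envelope check on the branching-factor point above, assuming 4 KB pages and a branching factor of 500, which lines up with the ~4-page-read figure under Read Performance below:

```python
# Why a B-tree stays shallow: four levels of pages with branching factor ~500
# can address hundreds of terabytes of 4 KB pages (numbers are illustrative).
branching_factor = 500
levels = 4
leaf_pages = branching_factor ** levels           # 62.5 billion leaf pages
print(leaf_pages * 4 / 1e9, "TB of 4 KB pages")   # ~250 TB
```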
Write Performance - LSM-tree wins
- lower write amplification and sequential operations/compression
- B-tree higher write amplification (redo log, page, split/parent page)
Read Performance - B-tree wins
- fewer operations (~4 page reads for a 4-level-deep B-tree)
- LSM-tree might need to check several data structures/SSTables at different stages of compaction
Access Pattern differences
- OLTP - transactions - low latency, small nr of records
- OLAP - analytics - bulk, aggregates over large nr of records
Column-Oriented Storage
- if rows are large (100+ columns) and we only need to aggregate a couple of columns (~3), row-based storage is highly inefficient
- compression
- different sort orders possible (writes have to go to memory first, and are merged to disk later)
Chapter4 - Encoding and Evolution
Data Encoding/Serialization is a transformation
- from memory efficient data structures (with pointers)
- to sequence of bytes (on disk, or over network)
Evolvability
- Data outlives code
- Rolling updates - nodes with different versions
- backward compatibility (new code reading old data)
- forward compatibility (old code reading new data)
Data Encoding formats:
- programming language specific
- single language
- weak support for versioning
- not performant
- textual (XML, JSON, CSV)
- simple
- datatypes are ambiguous (number encoding) and limited (no binary string support - base64 workaround)
- schemas are optional, too complex, rarely used
- binary (Thrift, protobuf, Avro)
- more compact (50% less)
- mandatory schemas - enforced documentation, type checking, evolvability
- add new fields - optional or must have defaults
- delete existing optional fields only
- never reuse tag numbers
Chapter5 - Replication
Shared nothing architecture
- independent machines, no access to each other's memory or disk - communicate over the network
- cheap, commodity hardware
- more complex software required for coordination and high reliability
Shared memory or shared disk architecture
- machines can access each other's memory/disk
- more expensive hardware, contention and locking overhead
- less software complexity
Main approaches to replication:
- Single Leader
- Multiple Leaders
- Leaderless - (quorum)
Quorum
- n nodes, w - write quorum, r - read quorum
- if w + r > n - overlap, there is at least one node with the up-to-date value (see the sketch after this list)
- for network interruptions - if the designated "home" nodes are unreachable - temporarily write to other nodes
- sloppy quorum and hinted handoff
- durability OK, but consistency violated, could read non-latest value
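A sketch of the quorum overlap rule referenced above; the helper names and numbers are illustrative only:

```python
# With n replicas, w write acks, and r read responses, w + r > n guarantees
# the read set and the latest write set share at least one node.
def quorum_overlaps(n, w, r):
    return w + r > n

print(quorum_overlaps(n=3, w=2, r=2))   # True  - overlap guaranteed
print(quorum_overlaps(n=5, w=1, r=1))   # False - a read can miss the latest write

def read(replica_responses):
    """Pick the newest value among r responses, e.g. [(version, value), ...]."""
    return max(replica_responses)[1]

print(read([(7, "old"), (8, "new")]))   # 'new'
```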
Node failure
- follower - catch-up recovery
- leader - failover
Replication implementation
- statement-based (SQL forwarded) - cannot handle non-determinism (now(), random())
- write ahead log - low level (bytes on disk) - tied to storage engine
- logical row-based - high level (before, after) - decoupled from storage engine
Hard problems: Consistency guarantees & Conflict resolution
Consistency guarantees - eventual consistency, replication lag
- Read-Your-Own-Writes
- client remembers the logical timestamp of its last write, and compares it with the replica's logical timestamp (see the sketch after this list)
- Monotonic Reads - (different answers from replicas - disappearing writes)
- always read from same replica for a given user
- Consistent Prefix Reads - (see writes in the same order)
- happens for partitions
- no easy solution - keep track of causal dependencies
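A sketch of the read-your-own-writes technique mentioned above, with entirely hypothetical class and attribute names:

```python
# The client remembers the logical timestamp of its last write and only
# accepts a read from a replica that has applied at least that timestamp.
class Replica:
    def __init__(self):
        self.applied_ts = 0
        self.value = None

class Leader(Replica):
    def write(self, value):
        self.applied_ts += 1
        self.value = value
        return self.applied_ts            # logical timestamp of this write

def read_your_writes(replica, last_write_ts):
    if replica.applied_ts >= last_write_ts:
        return replica.value
    raise RuntimeError("replica lagging behind - try another replica or wait")

leader, follower = Leader(), Replica()
ts = leader.write("new profile photo")    # client remembers ts
# read_your_writes(follower, ts)          # would raise: follower not caught up yet
print(read_your_writes(leader, ts))       # 'new profile photo'
```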
Conflict resolution - where multiple parallel writes might happen (Multi-Leader, Leaderless)
- Conflict avoidance - each user's writes always go to the same leader
- Last-Write-Wins - durability violated
- Merging values - set, list
- Record conflict - delegate to app layer, user resolution
- Read Repair (for quorum reads)
- Versioning for each key in DB
- client must read before writing (DB returns version, client sends version)
- concurrent values are kept as siblings - need for conflict resolution
- version vector - one version number per replica
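An illustrative comparison function for version vectors (one counter per replica); if neither vector dominates the other, the writes are concurrent and must be kept as siblings:

```python
# Compare two version vectors: dicts mapping replica id -> counter.
def compare(a, b):
    keys = set(a) | set(b)
    a_ge_b = all(a.get(k, 0) >= b.get(k, 0) for k in keys)
    b_ge_a = all(b.get(k, 0) >= a.get(k, 0) for k in keys)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "a happened after b"
    if b_ge_a:
        return "b happened after a"
    return "concurrent - keep both as siblings"

print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))  # a happened after b
print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 3}))  # concurrent - keep both as siblings
```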
Chapter6 - Partitioning
When you have so much data that it is not feasible to:
- store on a single machine
- process on a single machine
Partitioning spreads the data and query load across many machines
- Large dataset - distributed across many disks
- Large query load - distributed across many CPUs
Record -> Partition -> Node
Two main approaches:
- partition by - Key Ranges
- sorted, supports range scans
- certain access patterns could lead to hotspot - date key, recent data
- partition by - Hash of Key
- randomizes data, breaks hot spot patterns
- but cannot be range scanned
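A small sketch contrasting the two approaches; the range boundaries and hash choice (md5) are arbitrary illustrations:

```python
# Key-range partitioning keeps sort order (range scans work, but e.g. a
# timestamp key sends all new writes to one partition); hash partitioning
# spreads keys pseudo-randomly but loses the order.
import hashlib

RANGES = [("a", "h"), ("h", "p"), ("p", "{")]      # illustrative key-range boundaries

def range_partition(key):
    return next(i for i, (lo, hi) in enumerate(RANGES) if lo <= key[0] < hi)

def hash_partition(key, num_partitions=3):
    digest = hashlib.md5(key.encode()).hexdigest()  # stable hash across processes
    return int(digest, 16) % num_partitions

print([range_partition(k) for k in ["alice", "bob", "zoe"]])  # neighbours stay together
print([hash_partition(k) for k in ["alice", "bob", "zoe"]])   # scattered pseudo-randomly
```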
Problems
Skewed partitioning, hot spot
- nodes with more data or more queries than other nodes
- append a random digit to the key - spreads the write load across 10 partitions, but also multiplies the read load by 10
Secondary indexes - don't map neatly to partitions
- local secondary index - each partition stores the indexes of the data written to it (partitioned by document)
- easy write, complex read (scatter/gather)
- global secondary index - partitioned differently from the primary key, each partition holds some terms (partitioned by term)
- easy reads, complex writes (one write may affect many partitions/nodes)
- on writes, asynchronous update of global secondary indexes
Rebalancing - changing data size
- hash mod N -> bad idea
- when adding/removing nodes - most keys move to a different partition (see the sketch after this list)
- random partition boundaries -> good idea
- needs hashed keys, so boundaries can be picked from a continuous range of hash values
- "consistent hashing" - as only a few partitions change when adding nodes
- fixed number of partitions
- as data grows, partitions grow
- node added - fewer partitions per node - it steals partitions from other nodes
- dynamic number of partitions
- partitions kept at roughly fixed size - split/merge as data grows/shrinks
- fixed number of partitions per node - partitions grow with the dataset; adding nodes makes them smaller again
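A quick demo of the hash-mod-N problem flagged above: changing the node count from 10 to 11 forces most keys onto a different node (key names are made up):

```python
# Why "hash(key) mod N" is a poor rebalancing strategy: most keys would
# have to be moved when the number of nodes changes.
import hashlib

def node_for(key, n):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(node_for(k, 10) != node_for(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys change node")   # roughly 9 out of 10
```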
Request routing
- where is my data?
- who has the information? - client (complex app) / a routing tier (from a coordination service) / all nodes (gossip protocol)
- service discovery problem
- how to learn about changes? - coordination service - ZooKeeper - authoritative mapping of partitions -> nodes
Chapter7 - Transactions
Transaction
- purpose - to simplify the programming model - for applications accessing a database
- give safety guarantees - in case of error scenarios and concurrency issues
- concurrent writes, constraint violations, crashes, network failures, disk failures, hardware/software faults => simplified to commit or abort
ACID
- Atomicity - Abortability of transactions - takes care of partial error
- (Consistency - Consistent state of the data between transactions - not really related to the database, it is the application's responsibility)
- Isolation - Concurrent transactions don't interfere with each other - different levels
- (Durability - Committed transaction data will not be lost - written to disk & replicated)
(Transaction) Isolation Levels
Read committed
- prevents dirty reads - reads of uncommitted data
- (two versions - committed vs transaction-in-progress)
- prevents dirty writes - overwrites of uncommitted data (row-level locks)
Snapshot Isolation ~ Repeatable Read
- writers block writers; readers never block writers, and writers never block readers
- good enough for small read/write transactions and long-running read-only ones
- prevents read skew - timing anomaly, seeing different parts of the database at different points in time
- while transaction A is in progress, it sees changes committed by another transaction B
- instead of a consistent snapshot of the database as of the start of transaction A (multi-version concurrency control)
Serializable (guarantees some serial order)
- prevents lost updates - timing anomaly
- two concurrent read-modify-write transactions, the first overwrites the write of the second without incorporating the changes
- (atomic operations / explicit lock / autodetect and abort / compare-and-set - see the sketch after this list)
- prevents write skew - timing anomaly, two transactions each read and update different objects based on a premise that has become outdated
- phantoms - a query matches some condition, another write changes the result of that query, outdating the premise (hard to lock data that is not there yet)
- examples: at least one doctor on call, meeting-room double-booking, double-spending, duplicate username chosen
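A sketch of the compare-and-set option listed above for preventing lost updates, using an in-memory stand-in for a database row; a real database would do the comparison atomically on the server:

```python
# The write only succeeds if the value is still what the transaction
# originally read; otherwise the caller must retry its read-modify-write.
import threading

class Row:
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        with self._lock:                  # the database does this atomically
            if self.value != expected:
                return False              # somebody else wrote in between - retry
            self.value = new
            return True

counter = Row(41)
snapshot = counter.value                                 # read ...
ok = counter.compare_and_set(snapshot, snapshot + 1)     # ... modify ... write
print(ok, counter.value)                                 # True 42 (False would mean: retry)
```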
Implementations
- materializing conflicts - complex implementation
- single threaded execution - does not scale ~ it's like locking the entire database
- 2PL (two phase locking) - does not perform well
- readers block writers (but not other readers)
- writers block readers (and writers)
- lock modes: shared mode (read-read okay) / exclusive mode
- SSI (Serializable Snapshot Isolation)
- detect outdated premise in transactions
- detect reads of a potentially stale MVCC value (an uncommitted write occurred before the read)
- only an issue if we want to commit that read value - optimistic approach
- detect writes that affect prior reads (write occurs after read)
- allowed to proceed, aborted at commit if not serializable - optimistic approach
Chapter8 - Trouble with Distributed Systems
No shared memory (shared nothing architecture) - communicate via network
- unreliable network - variable delays (switch / VM / OS queues), lost/delayed packets
- unreliable clocks
- process pauses (GC, CPU context switch, VM, I/O)
All this can lead to partial failures and nondeterminism.
How to handle network faults?: timeout
What is the correct timeout value?
- networks can have unbounded delays, no guarantees
- timeout too long - blocking app/user waiting/unused capacity
- timeout too short - declaring a node dead prematurely - double execution of an action, failover -> which may cause cascading failures
- no "correct" value, determined experimentally
Trade-off: Bounded delay vs Resource Utilization
Circuit-switching
- telephone networks, ISDN
- fixed bandwidth allocation -> known bandwidth requirement for a call
- bounded delay/latency -> but worse resource utilization
Packet-switching
- Ethernet and IP
- dynamic bandwidth allocation -> unknown bandwidth requirements, optimized for bursty traffic
- variable delay/latency -> but better resource utilization
Unreliable Clocks - pitfalls
- leap seconds, time-of-day clock might jump backwards
- quartz clock drifting, nodes with different time
- NTP synchronized - not reliable
- local time server: <10 ms drift (best case)
- remote time server: <100 ms drift
Physical clocks
- time-of-day (might jump back-forward)
- monotonic (might jump forward)
Logical clocks
- strictly monotonic
Trade-off: Cost vs Reliability
Hard Real-Time Systems
- expensive; bounded delay but lower throughput
- special OS (RTOS), languages, libraries, tools
Byzantine Faults - (hard to deal with, usually assumed not to happen)
- compromised nodes / untrusted parties
- arbitrary faults, corrupted responses
Philosophical:
- in a distributed system - how can we know anything, if the mechanism of perception and measurement is unreliable?
- make assumptions about the system behavior (a system model)
- design to provide guarantees under these assumptions
Truth is defined by the majority
- cannot rely on a single node, must rely on quorum
- individual nodes must abide by quorum decisions
- leader and the lock
- fencing tokens - a sequence number - to avoid inconsistencies caused by long pauses, where the lock has expired but the node holding the lease does not know it
- return a sequence number when the lock is requested/leased
- send the sequence number with every write made under the lock
- the storage compares it to check whether it is the latest sequence number, or a newer one has already been issued
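A sketch of the fencing-token check described above; the token values mirror the book's 33/34 example, everything else is invented:

```python
# The lock service hands out a monotonically increasing token with each
# lease; the storage service rejects writes carrying an older token.
class Storage:
    def __init__(self):
        self.highest_token_seen = 0
        self.data = None

    def write(self, token, value):
        if token < self.highest_token_seen:
            raise RuntimeError("stale token - lease expired, write rejected")
        self.highest_token_seen = token
        self.data = value

storage = Storage()
storage.write(token=33, value="from client A")        # A holds the lease
storage.write(token=34, value="from client B")        # A paused, B got a new lease
try:
    storage.write(token=33, value="late write by A")  # A wakes up after a long GC pause
except RuntimeError as e:
    print(e)                                          # stale token - write rejected
```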
Chapter9 - Consistency and Consensus
Distributed Consistency Models
Linearizability (~Consistency in reads/writes)
- total order - as if there were a single copy of the data and all operations were atomic
- recency guarantee
- single up-to-date value - that all nodes agree on
- coordinating state of replicas in face of delays/faults
- slow, proportional to network delays; coordination overhead comes with a performance hit
Causality (causally consistent)
- partial order, causal ordering
- causally related events are ordered (happens-before)
- unrelated events are concurrent, no order defined
- can be implemented without a performance hit
Ordering - helps preserve causality
- Sequence number ordering - OK in a single-leader system
- Lamport timestamp - in a multi-leader/leaderless system
- each node has a counter and an ID
- the maximum counter value seen is tracked and sent with every message - nodes update their counter to it
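A minimal sketch of Lamport timestamps as described above; (counter, node_id) pairs give a total order consistent with causality:

```python
# Each node keeps (counter, node_id); every message carries the sender's
# counter, and receivers bump their own counter to at least that value,
# so causally later events get larger timestamps.
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def event(self):
        self.counter += 1
        return (self.counter, self.node_id)    # totally ordered, ties broken by id

    def receive(self, timestamp):
        remote_counter, _ = timestamp
        self.counter = max(self.counter, remote_counter)  # track the maximum seen

a, b = Node("A"), Node("B")
t1 = a.event()          # (1, 'A')
b.receive(t1)
t2 = b.event()          # (2, 'B') - guaranteed greater than t1
print(t1 < t2)          # True
```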
Total Order Broadcast - Atomic Broadcast
- messages are guaranteed to be delivered reliably in a fixed order
- reliable message delivery (if delivered to one node, delivered to all nodes)
- total order delivery (delivered in the same order for every node)
- there is no guarantee WHEN a message will be delivered - nodes could lag behind
- every node processes the same writes in the same order (replication log, transaction log, write-ahead log, lock log)
CAP Theorem - when Partitioned network, choose one:
- Consistency (linearizability) - consistent reads/writes - disconnected nodes wait/return error (unavailable)
- Availability - disconnected nodes process requests independently - (non-linearizable, ~not consistent)
Problems with CAP
- considers only one consistency model (linearizability)
- considers one kind of fault (network partition - not delays/lost packets)
Consensus
- most important and fundamental problem in distributed computing - get all nodes to agree
- but not every system requires consensus - maybe it's okay to have multiple leaders / conflicting values - branching/merging version histories
- equivalent to other problems
- linearizable compare-and-set/increment-and-get register
- Total Order Broadcast
- locks and leases
- membership/coordination
Coordination/Consensus services - ZooKeeper and etcd
- small database, holding data in memory, slowly changing data
- replicated across nodes consistently - using fault-tolerant total order broadcast
- small number of nodes (3 - 5) - easier to perform majority votes, than on thousands of nodes
"Outsourcing" some of the work of coordinating nodes
- leader election -> implemented using a distributed lock; requires linearizability - all nodes must agree on who holds the lock
- allocating work to nodes - job schedulers
- rebalancing partitions
- service discovery - who is the leader
- read-only cache replicas, non-voting, just to serve reads that need not be linearizable (like DNS - availability > staleness)
- membership service - active live members of a cluster with consensus
Fault-Tolerant Consensus Algorithms:
- Viewstamped Replication
- Paxos
- Zab (ZooKeeper - Hadoop, Kafka)
- Raft (etcd - Kubernetes)
2-Phase-Commit - 2PC -> (comes down to a single node atomic commit on the coordinator)
- coordinator - hands out unique transaction ID
- prepare request - participants that acknowledge promise to wait indefinitely for the coordinator's decision
- commit/abort request - coordinator decision - written to disk first - send to participants
- in case of error/timeout - retry forever
- problems - coordinator failure recovery
- it might never return - indefinite wait for the decision, or loss of atomicity guarantees
- a corrupted log requires manual conflict resolution, or loses atomicity guarantees
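A compact, illustrative sketch of the two-phase commit decision flow; it omits timeouts, retries, and durable logging, and the class/method names are made up:

```python
# Phase 1: collect prepare votes; Phase 2: broadcast the coordinator's decision.
class Participant:
    def prepare(self, txid):
        # returning True is a promise: commit later no matter what, even after a crash
        return True
    def commit(self, txid): print(f"participant {id(self)}: commit {txid}")
    def abort(self, txid):  print(f"participant {id(self)}: abort {txid}")

def two_phase_commit(txid, participants):
    votes = [p.prepare(txid) for p in participants]   # phase 1: prepare
    decision = all(votes)                             # coordinator decides ...
    # ... and must durably log the decision here, *before* telling anyone
    for p in participants:                            # phase 2: commit/abort
        (p.commit if decision else p.abort)(txid)
    return decision

print(two_phase_commit("tx-42", [Participant(), Participant()]))
```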
Consensus is easy - if we have the leader node
- but what happens if leader node fails
- we wait indefinitely for the leader node to recover
- or a human selects the leader node - an "act of God"
Fault-Tolerant Consensus (uniform consensus) is hard
- because it has to satisfy the liveness property (Termination) - make progress, eventually a decision has to be made
- of course if all nodes crash there cannot be a consensus
- there is a limit, consensus algorithms require the majority of nodes to be working correctly for termination
Chicken and the egg problem
- problem of split brain - two or more nodes believe they are the leader and get the database into inconsistent state
- so to elect a leader - we need consensus
- but to reach consensus - we need a leader to decide from proposals
Given a weaker guarantee - unique leader within each epoch
- two rounds of voting
- leader election - increases the epoch, quorum majority (the highest number wins if multiple leaders)
- leader proposes every decision - has to wait for quorum majority approval
- overlap between the two
- at least 1 node that approved the proposal also participated in the latest leader election
- so there is no higher epoch that would conflict, the leader still holds the leadership
- voting is like synchronous replication - comes with a cost
- on a flaky network, the system could get stuck in leader re-election instead of doing actual work
- Systems of Record - source of truth, authoritative version of the data
- Derived Data Systems - redundant, used to improve performance
Chapter10 - Batch processing
- Bounded input - files containing records
- Uses filesystem
- Periodic processing
On a single machine only - Unix tools efficiency
- do one thing well
- immutable input, stateless, no side effects, repeatable (pure functions)
- composability, pipe
- universal input/output interface - the file descriptor, an ordered sequence of bytes separated by \n
- separation of logic from wiring - input/output can be redirected
- efficient implementation - sort - can spill to disk, uses sequential access patterns, uses multiple CPUs
On multiple machines - Distributed Batch Processing has to solve:
- partitioning
- fault tolerance - save to disk, retry
MapReduce-Hadoop ecosystem
- used to build search indexes (full-text/fuzzy index)
- used to build key-value data files (precomputing ML recommendations/classifications)
Higher level programming models
- Pig, Hive, Cascading, Crunch
HDFS - distributed filesystem
- NameNode keeps track of which file block on which node
- daemon process on each node - exposing a network service to allow other nodes to access files
Map -> Group/Sort-Merge -> Reduce (similar to the UNIX philosophy)
- immutable inputs - input/output are HDFS directories
- no side effects (preferably)
- repeatable
- composable - MapReduce jobs can be composed as well; separation of logic from wiring via I/O directories
Mapper
- takes a file block as input and iterates over records
- extracts [key, value] pairs from each record
- sorts by key
- writes to disk
Reducer - assigned some keys/partitions
- connect to each mapper
- download files of sorted [key, value] pairs (corresponding to its partition)
- merge sorted files together
- iterate over keys and produce output - (could be the input for next map reduce job)
Reduce-Side Join / Sort-Merge join
- mapper output is sorted by keys
- reducers merge together sorted keys
Map-Side Join
- small side of join in memory (broadcast hash)
- both sides of join use same partitions, small side in memory
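A sketch of the broadcast hash variant of a map-side join: the small side is held in memory and the large side is enriched as it streams past (field names are illustrative):

```python
# The small side is loaded into an in-memory hash table; each record of the
# large side is joined against it without any sorting or shuffling.
users = {1: {"name": "Ada"}, 2: {"name": "Grace"}}          # small side, fits in memory

def map_side_join(events):                                  # large side, streamed
    for event in events:
        user = users.get(event["user_id"], {})
        yield {**event, "user_name": user.get("name")}

events = [{"user_id": 1, "url": "/home"}, {"user_id": 2, "url": "/about"}]
print(list(map_side_join(events)))
```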
MapReduce vs MPP (Massively Parallel Processing Databases)
- general-purpose storage
- HDFS - read/write data in arbitrary format (sequence of bytes) any data model, any encoding (not just relational model)
- dump data and figure out how to process later
- general-purpose processing
- SQL query execution supported
- arbitrary code execution in Map/Reduce callbacks (ml systems precompute)
- fault tolerance - restarting of failed tasks
- sparing use of memory - eager to write to disk
- for fault tolerance + assumption that data does not fit in memory
Beyond Mapreduce - Dataflow Engines
- offer a more flexible programming model and processing speed up
- not just Map-SortMerge-Reduce steps, arbitrary Operators (skip sort if not needed)
- Operators, just like in MapReduce, are user functions that process one record at a time on a single thread
- don't save intermediate results to HDFS/replicated if not needed (keep on local disk only)
- Tez - light wrapper over YARN
- Spark - RDD for tracking data ancestry for fault tolerance
- Flink - checkpointing state for fault tolerance
Graph data
- iterative processing - Pregel processing model
- processes one vertex at a time; vertices send messages to other vertices for the next iteration
- could be slow due to lots of cross-machine communication
- small data could be processed on a single-machine (GraphChi)
Chapter11 - Stream processing
- Unbounded input - topics containing events
- Uses message broker / event logs
- Continuous processing
Messaging systems
- direct messaging - UDP/HTTP/RPC - application code handles retries
Queue-based message brokers (AMQP/JMS)
- transient storage - (deleted after ACK)
- multiple parallel consumers - fan out
- ACK and redelivery - deleted messages cannot be reprocessed, no history
Log-based message brokers (Kafka/Kinesis)
- durable storage - append-only log - ORDERED
- partitions with assigned consumers (from consumer groups)
- consumer offset - easier to reprocess, can jump back in history
Change Data Capture
- system of record -> log of data changes -> derived systems
- apply changes in the same ORDER to derived systems (log-based message broker)
- async replication, so it has lag
Event sourcing
- similar to CDC, but higher level, application level (not DB level)
- immutable event log - records actions, not their side effects / DB-level entity changes
Mutable state & Immutable changelog
- they represent two sides of the same coin - no contradiction
- changelog === evolution of state over time
- separate how data is written from how it is read
- no worries about schema design or index complexity for supporting certain query/access patterns
- use a write-optimized event log, then later add several different read-optimized views as needed (derived data systems)
Stream Processing output
- a new data store (derived data system, db, cache, search index)
- push/stream events to a client app
- produce another stream (derived stream)
Unbounded input
- cannot be sorted (no sort-merge join)
- cannot be restarted from the beginning if failed
Stream Processing
- pattern matching engines - maintain the state of the query - events flow through continuously
- aggregation engines - compute avg/count over a fixed window
Event time vs Processing time
- occurrence time by device clock, send time by device clock, receive time by server clock
- lagging/straggler events - dropped, or a corrected metric is republished
Joins
- time window limited, time-dependent (slowly changing dimensions)
- stream processor needs to maintain state
- build up state from one join input
- and query that state for events arriving on the other join input (enrichment)
Fault tolerance - exactly once execution
- microbatching (1 second window) - repeatable retry
- rolling checkpoint generation / state saved to disk - or state rebuilt from log
- atomic commit - transactions with rollback
- idempotent writes - send the offset with each change, reject older offsets / fencing token
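A sketch of the idempotent-writes idea from the last bullet: each update carries the offset of the event that produced it, and the sink ignores anything it has already applied, so redelivery after a crash is harmless:

```python
# The sink remembers the highest offset applied and rejects duplicates.
class Sink:
    def __init__(self):
        self.applied_offset = -1
        self.state = {}

    def apply(self, offset, key, value):
        if offset <= self.applied_offset:      # duplicate from a retried batch
            return False
        self.state[key] = value
        self.applied_offset = offset
        return True

sink = Sink()
print(sink.apply(0, "a", 1))   # True  - first delivery
print(sink.apply(1, "b", 2))   # True
print(sink.apply(1, "b", 2))   # False - redelivered after a crash, ignored
```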
Chapter12 - Future of Data Systems
No one-size-fits-all software - need to integrate different data systems
How to keep data in sync/consistent? - Ordering
- log-based derived data - log-based order
- distributed transactions - order writes using 2PL - locks & mutex
How to handle failures?
- log-based derived data - (deterministic) retry and idempotency (more than once execution)
- distributed transactions - retry with atomic commit to ensure exactly-once execution
Log-based derived data
- async replication, no timing guarantees
- localize failure, replication failure does not affect main write
- a single leader guarantees total order broadcast, but its throughput forces a bottleneck
- higher load requires partitions - ordering between partitions becomes ambiguous - causal dependencies cannot be captured effectively
Distributed transactions
- linearizability (read own writes)
- amplifies failures - any write failure aborts the whole operation
- poor performance
Stable ordering and fault-tolerant message processing are quite strict demands, but less expensive and more operationally robust than distributed transactions (1. log-based derived data >> 2. distributed transactions)
Lambda architecture
- event log
- stream processing system for quick / approximate updates to derived views
- batch processing system for slow / correct updates to derived views
- overhead of maintaining two separate systems, merging results can become complex
- rise of unified system to do both
Reads are events too
- serving a read request can be interpreted as a join
- join between stream of read request and database - stream-table join
- one-off read ~ pass read request through join operator and forget it
- subscribe request ~ persistent join with past and future events - state stored
Making operations idempotent - Duplicate suppression
- add metadata - operation IDs - passed from the client all the way to DB records, with uniqueness enforced
- fencing token when failing over from one node to another
Enforcing uniqueness constraints
- single leader works - for limited throughput and as long as it does not fail
- for failover, consensus/synchronization is needed
Single partition solution for uniqueness
- sending writes that can conflict to the same partition
- each partition is processed sequentially - local database to keep track of state
- for every write we emit a success or reject message
Multi partition solution for uniqueness
- example: money transfer from A to B, request ID - three different partitions - how to sync?
- cross-partitions transactions - atomic commit - throughput suffers
- multiple stages of stream processors (dataflow)
- stage 1: durably log a single message with all the data - request ID, account A, account B (it either appears in the log or not; single-object writes are atomic)
- stage 2: read each request and emit two output messages/streams (could be duplicated)
- A stream: debit + request ID
- B stream: credit + request ID
- immutable and deterministic to support retries
- stage 3: A & B stream processor process and deduplicate messages based on request ID
Consistency conflates two requirements:
- Timeliness - read most up-to-date value
- linearizability - achieves it in a strong way
- read-after-write - weaker guarantee
- Integrity - no data contradiction and no data loss
- atomic commit
- exactly-once execution
- handles lost events (fault-tolerant delivery)
- handles duplicate execution suppression (through idempotency)
- Integrity (violations cause permanent inconsistency) >> Timeliness (violations are only temporary / eventual consistency)
Hard uniqueness constraint (consensus)
- can be achieved by funneling all events in a particular partition through a single node
- limits performance, clients wait for checks
Weaker uniqueness constraint (timeliness enforcement is relaxed):
- optimistic write, and clients check for constraints after (async)
- compensating transaction might be acceptable in many cases (apologies)
Dataflow systems:
- multi stage derived data stream processors
- can maintain integrity without atomic commit, linearizability, synchronous cross-partition coordination - less expensive, more robust
- if we loosen the timeliness constraints (allow temporary violation and compensate later) we can avoid sync coordination and improve throughput even more
Trade-off: Performance/Availability at high load vs Number of inconsistencies/Compensating transactions
Ethics
- Predictive analytics
- Algorithmic prison - being excluded from key aspects of society (loans, jobs, housing, air travel) because ML decision-making software labels someone as risky - (without any proof of guilt)
- amplifies bias in the data, past discrimination, negative feedback loops
- Privacy and surveillance
- harm now or in the future ~ similar to pollution / hazardous waste