Book Notes: Designing Data-Intensive Applications

I have known about this book for a few years, and it has been highly recommended by engineers I respect, but until now I deprioritised it in favour of other technical books. However, I will be involved in building a data processing system in my new role, so I decided it was about time to read it. Following are my notes.

Part 1: Foundations of Data Systems

In the first part of the book, the author goes through the foundational ideas that apply to all data systems.

Chapter 1: Reliable, Scalable, and Maintainable Applications

Most applications today are data-intensive rather than compute-intensive. The limiting factor for these systems is the amount of data, the complexity of the data, and the speed at which that data changes. Data systems are systems designed to accomplish a specific goal and are composed of other, more general components:

* Databases - to store data for later retrieval;
* Caches - to remember the results of expensive operations to speed up reads;
* Search indexes - to allow users to search by keywords or filter data;
* Stream processing - to handle requests asynchronously;
* Batch processing - to crunch a large amount of data at once.
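To make the cache component concrete, here is a minimal sketch of the common cache-aside pattern in Python. The `fetch_user_from_db` function and the plain dict cache are my own illustrative stand-ins, not something from the book.

```python
# Minimal cache-aside sketch: check the cache first, fall back to the
# database on a miss, and remember the result for later reads.

cache: dict[int, dict] = {}

def fetch_user_from_db(user_id: int) -> dict:
    # Hypothetical stand-in for a slow database query.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id: int) -> dict:
    if user_id in cache:                    # cache hit: fast path
        return cache[user_id]
    user = fetch_user_from_db(user_id)      # cache miss: expensive read
    cache[user_id] = user                   # remember the result
    return user
```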

The book focuses on three important concerns for data systems:

1. Reliability

The system continues to work correctly even when things go wrong (hardware faults, software errors, human errors).

One overlooked aspect of reliability is that it is defined for a specific load/data volume. If this load increases, the system needs another property - scalability - to return to a reliable mode of operation.

2. Scalability

A system's ability to cope with increased load. Whether the system copes is measured by an adequate performance metric. Load can mean different things and needs to be considered in several dimensions - the number of requests, the number of reads/writes, the number of simultaneously active users, the hit ratio on a cache, etc. Performance, like load, is described differently for different systems - throughput for a batch processing system, response time for an online system.
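As a rough illustration of how such load parameters might be derived, here is a small sketch that computes a few of them from a toy request log; the log format and field names are invented for this example.

```python
from collections import Counter

# Toy request log: (timestamp_seconds, user_id, operation, cache_hit).
log = [
    (0.1, "alice", "read", True),
    (0.4, "bob", "write", False),
    (0.9, "alice", "read", False),
    (1.2, "carol", "read", True),
]

duration = log[-1][0] - log[0][0]
ops = Counter(op for _, _, op, _ in log)
read_hits = sum(1 for _, _, op, hit in log if op == "read" and hit)

print(f"requests/sec:     {len(log) / duration:.1f}")
print(f"read/write ratio: {ops['read'] / ops['write']:.1f}")
print(f"active users:     {len({user for _, user, _, _ in log})}")
print(f"cache hit rate:   {read_hits / ops['read']:.0%}")
```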

Measuring performance
---------------------
Average response time is a common metric; however, it says little about the typical experience of your users. A better metric is the median (50th percentile): half of the requests return faster than the median and half take longer.
High percentiles (95th, 99th, 99.9th) of response times, also known as `tail latencies`, are important because they directly affect users' experience of the service. The customers with the slowest requests are often those with the most data in their accounts, because they have made the most purchases - in other words, the most valuable customers.
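To make this concrete, here is a small sketch that computes the median and tail latencies from a list of measured response times using the nearest-rank method; the response times are made up for illustration.

```python
def percentile(samples: list[float], p: float) -> float:
    """Return the p-th percentile (0-100) using the nearest-rank method."""
    ranked = sorted(samples)
    # Index of the smallest value that at least p% of samples do not exceed.
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

# Made-up response times in milliseconds; note the two slow outliers.
response_times_ms = [22, 35, 31, 28, 450, 30, 27, 33, 29, 1200]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
```

Here the median (30 ms) looks healthy, while p95 and p99 expose the slow outliers that an average would smear out.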

We can cope with load in two main ways - scaling up (moving to a more powerful machine) and scaling out (distributing the load across many smaller machines). A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can't avoid scaling out. In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler than a large number of small virtual machines.

3. Maintainability

Over time, all the developers who work on the system should be able to do so productively. It is a well-known fact that the majority of the cost of software lies not in its initial development but in its ongoing maintenance. Maintainability can be broken down further into:

* Operability - the ease with which the operations team can keep the system running;
* Simplicity - the ease with which new engineers can understand the system;
* Evolvability - the ease with which engineers can adapt the system to new use cases.

Chapter 2: Data Models and Query Languages

...