Foundations

Step 1 in the Databases path · 1 concepts · 0 problems

0 / 1 complete

📘 Learn Foundations from zero

First principles. A program manipulates data in memory (RAM), but RAM is volatile — cut the power and the data is gone. To make data persist, you write it to durable storage (disk). The earliest, simplest way to do this was to dump records into ordinary files and write application code that opens, reads, and parses them. This is the era "before the advent of the database" — what the textbooks call a file-processing system, typically files manipulated by conventional programs (COBOL, C) each owning its own format.

Analogy — the shoebox of index cards. A small business stores every customer on index cards in shoeboxes. Sales has a shoebox, Billing has its own, Shipping has another. A customer moves house, so someone must update the address in three boxes. Miss one and the boxes disagree — Billing ships to the old address. If two clerks grab the Sales box at the same time, one clerk's edit overwrites the other's. And nobody can quickly answer "which customers bought product X and live in Texas," because the cards were never organized for that question — each box was built for one program's needs.

Worked example. A 1970s payroll program keeps employees.dat; the benefits team keeps a separate benefits.dat that also stores each employee's name and department. When someone changes department, payroll updates its file but benefits is never told — the two files now disagree (inconsistency arising from redundancy). Worse, each program hard-codes the byte offset and field width where "salary" lives, so widening that field breaks every program that reads the file — there is no data independence (program–data dependence).

Key insight: file-based storage works until data is shared. The instant multiple programs and multiple users touch the same data, files force each application to re-invent redundancy control, integrity, querying, concurrency, and recovery — independently, partially, and inconsistently. The DBMS was invented to provide all of these once, behind a single shared interface and data model.

✨ Added by the guide to build intuition — not from the source course.

🎯 Guided practice

Easy — Spot the redundancy. An online store keeps orders.csv with columns (order_id, customer_name, customer_email, customer_address, item, price). A customer places 50 orders, then changes their email. What goes wrong, and why?
Reasoning: (1) The email is stored once per order — data redundancy, 50 copies of one fact. (2) Changing it means rewriting all 50 rows; miss any and you get data inconsistency (the same customer shows two emails, and queries can't tell which is true). (3) The fix the relational model teaches: store each fact once — split into a customers table and an orders table that references it by a foreign key customer_id. This is precisely the motivation for normalization (Codd): a non-key fact should depend on the key, the whole key, and nothing but the key. Internalize the pattern: "a fact stored more than once is an inconsistency waiting to happen."
Medium — The lost update. Two cashier terminals both read inventory.txt, which says widgets = 100. Terminal A sells 10, Terminal B sells 5, at nearly the same instant. Each does: read into memory, subtract, write back. What final value can the file hold, and what should it be?
Reasoning, step by step: The correct result is 100 - 10 - 5 = 85. But trace one interleaving: A reads 100; B reads 100 (A hasn't written yet); A computes 90 and writes 90; B computes 95 from its stale 100 and writes 95. Final value: 95 — the sale of 10 widgets vanished. This is the textbook lost-update anomaly: a read-modify-write race with no concurrency control. (A different interleaving where B finishes first yields 90; either way one update is lost.)
How a DBMS fixes it: it wraps each read-modify-write in a transaction and enforces isolation via locking (e.g. two-phase locking) or an atomic UPDATE ... SET stock = stock - 10, so the two updates are serializable — the result is equivalent to running one entirely before the other. This is the bridge to ACID: Atomicity (all-or-nothing), Consistency, Isolation (concurrent txns don't interfere), Durability (committed data survives crashes). The general lesson: shared mutable state demands coordinated, atomic, isolated updates — and flat files give you none of them.

✨ Added by the guide — work these before the full problem set.

Lessons in this topic

○Before the Advent of Database