Authorization



📘 Learn Authorization from zero

Authorization decides what an already-identified caller is allowed to do — "can user 42 DELETE /orders/99?" It runs on EVERY request after login, so a slow or wrong decision either tanks your p99 or leaks data. In an interview, the trap is conflating it with authentication (who you are) and hand-waving "we check permissions." The signal they want: name the model (RBAC vs ABAC vs ReBAC), say where the check runs (gateway vs service), and reason about the cache/staleness trade-off when a permission is revoked. Work through the questions below before revealing each answer.

✨ Added by the guide to build intuition — not from the source course.

Lessons in this topic

🏗️ Apply it — design walkthrough

Work through this after you've learned the concepts in the lessons above.

AuthN vs AuthZ

🤔 A request arrives with a valid login token, but the system still returns 403 Forbidden. Authentication passed — so what exactly failed, and why are these two separate steps?

Reveal the reasoning

Authentication (AuthN) answers "who are you?" → it verifies identity (password, OIDC sign-in, token). Authorization (AuthZ) answers "what may you do?" → it runs on every request after identity has been established.

Cause → effect: the token proved identity (AuthN OK), but when the system mapped user 42 → action DELETE → resource order 99, it found no matching permission → returns 403 (not 401; 401 means "not authenticated / log in", 403 means "authenticated but not allowed").

Run frequency: AuthN happens at the session boundary (login / token issuance); AuthZ runs ~every request (a single page can fire 20+ checks).
Trade-off / cost: because AuthZ runs constantly, its latency is multiplied across every call. A 5 ms permission lookup on a request that makes 8 sequential checks adds 8 × 5 ms = 40 ms before any real work. That is why people cache it (later steps), and why mixing the two concerns into one slow "login check" is a design smell.
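
The two gates and their status codes can be sketched as a tiny handler. This is a minimal illustration, not a real framework: the PERMISSIONS set and the handle function are hypothetical, and user_id=None stands in for a missing or invalid token.

```python
from typing import Optional

# Hypothetical permission table of (user_id, action, resource) triples.
PERMISSIONS = {
    (42, "GET", "/orders/99"),
}


def handle(user_id: Optional[int], action: str, resource: str) -> int:
    """Return the HTTP status after running both gates in order."""
    # Gate 1 (AuthN): no verified identity, so the caller must log in.
    if user_id is None:
        return 401
    # Gate 2 (AuthZ): identity is known, now check this specific action.
    if (user_id, action, resource) not in PERMISSIONS:
        return 403
    return 200
```

With this table, user 42 can GET order 99 but gets 403 on DELETE, while an anonymous caller gets 401 for either.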

RBAC mechanism

🤔 You don't want to attach 500 individual permissions to each of 10,000 users. What indirection layer lets you grant access to thousands of users by editing one thing, and how does a single check actually resolve?

Reveal the reasoning

RBAC (Role-Based Access Control) inserts a role between users and permissions: user → role → permission. You assign permissions to roles, and users to roles.

Mechanism of one check for can(user42, delete, order99):

1. Look up user 42's roles → {editor, billing}.
2. Expand roles to permissions → {order:read, order:write, ...}.
3. Test if order:delete is in that set → not present → deny.
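
The three steps above can be sketched in a few lines. This is a minimal in-memory illustration; the table names and the permissions attached to the billing role are assumptions made for the example, and a real system would back these tables with a database or cache.

```python
# Hypothetical assignment tables (assumed for illustration).
USER_ROLES = {42: {"editor", "billing"}}
ROLE_PERMISSIONS = {
    "editor": {"order:read", "order:write"},
    "billing": {"invoice:read"},
}


def can(user_id: int, permission: str) -> bool:
    """Resolve one RBAC check: user -> roles -> permissions -> membership test."""
    # Step 1: look up the user's roles (unknown users have none).
    roles = USER_ROLES.get(user_id, set())
    # Step 2: expand every role into its permissions.
    granted = set()
    for role in roles:
        granted |= ROLE_PERMISSIONS.get(role, set())
    # Step 3: allow only if the requested permission is in the expanded set.
    return permission in granted
```

Here can(42, "order:delete") is False. Adding "order:delete" to ROLE_PERMISSIONS["editor"] flips it to True for every editor at once, which is the single point of change RBAC buys you.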

Cause → effect: grant a new permission to 5,000 support agents by editing 1 role, not 5,000 user rows — the role is the single point of change.

Trade-off / cost: RBAC is coarse — a role is the same for everyone who holds it, so it can't express "only orders YOU created" or "only during business hours" without baking the condition into the role name. Teams work around this by minting hyper-specific roles (editor_region_eu_readonly) and end up with role explosion — hundreds of near-duplicate roles that nobody can audit. When you hear "it depends on the resource's attributes," RBAC alone is the wrong tool.

ABAC for fine-grained

🤔 Requirement: "a user may edit a document only if they're in the same department AND it's not locked AND the request is during 9–5." RBAC would need a role per (department × lock-state × hour). What model evaluates this as a policy instead, and what do you pay for that power?

Reveal the reasoning

ABAC (Attribute-Based Access Control) decides via a policy/rule evaluated at request time over attributes of the subject, resource, action, and environment — not pre-assigned roles.

Mechanism — the policy engine evaluates a boolean expression:

subject.dept == resource.dept AND
resource.locked == false AND
9 <= env.hour < 17 → all true → allow.

Cause → effect: one rule replaces the combinatorial explosion of roles — when a new department appears, the rule subject.dept == resource.dept still holds with zero new roles.
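
The rule above can be written as an ordinary function over attribute objects. This is a sketch under stated assumptions: the Subject, Resource and Env classes and the may_edit name are hypothetical, and real policy engines usually express the rule in a dedicated policy language rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    dept: str


@dataclass(frozen=True)
class Resource:
    dept: str
    locked: bool


@dataclass(frozen=True)
class Env:
    hour: int  # local hour of the request, 0-23


def may_edit(subject: Subject, resource: Resource, env: Env) -> bool:
    """Allow only if every condition of the policy holds at request time."""
    return (
        subject.dept == resource.dept   # same department
        and not resource.locked         # document is not locked
        and 9 <= env.hour < 17          # request arrives during 9-5
    )
```

A new department needs no code or role change: may_edit(Subject("legal"), Resource("legal", False), Env(10)) is allowed by the same rule.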

Trade-off / cost: (1) the engine must gather every attribute at decision time — resource.dept and resource.locked may require a DB fetch, adding latency and coupling AuthZ to your data; (2) it's hard to answer "who can access X?" — there is no membership list to read, so you'd have to run every user through the policy, whereas RBAC just lists the role's members. RBAC = easy to audit, coarse; ABAC = expressive, hard to audit. Many systems combine them: RBAC for the broad gate, ABAC for the fine condition.

ReBAC / relationships

🤔 Google Docs: "anyone the doc is shared with can view, AND anyone in a group the doc is shared with, AND anyone with access to the parent folder." These are relationships between objects, not attributes or flat roles. How does Google's Zanzibar model answer one check, and what's the operational cost?

Reveal the reasoning

ReBAC (Relationship-Based Access Control) stores access as a graph of relationship tuples shaped object#relation@subject, e.g. doc:99#viewer@user:42 or doc:99#parent@folder:7.

Mechanism — a check is a graph reachability search: starting from doc:99#viewer, can I reach user:42 by walking edges (direct share → group membership → inherited folder permission)? If a path exists → allow.

Cause → effect: inheritance ("folder → all docs inside") is expressed once as a single parent edge, instead of duplicating a viewer tuple onto every child doc.
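
The reachability search can be sketched as a breadth-first walk over tuples. This is a toy model, not Zanzibar's actual API: the TUPLES set, the check function, and the hard-coded inheritance rule ("viewers of the parent can view the child") are all assumptions made for illustration.

```python
from collections import deque

# (object, relation, subject). A subject is a user, an object, or a
# userset written "object#relation" (e.g. all members of a group).
TUPLES = {
    ("doc:99", "parent", "folder:7"),
    ("folder:7", "viewer", "group:eng#member"),
    ("group:eng", "member", "user:42"),
    ("doc:100", "viewer", "user:7"),
}


def check(obj: str, relation: str, user: str) -> bool:
    """Return True if `user` is reachable from obj#relation."""
    start = (obj, relation)
    queue = deque([start])
    seen = {start}
    while queue:
        cur_obj, cur_rel = queue.popleft()
        next_nodes = []
        for o, r, s in TUPLES:
            if o != cur_obj:
                continue
            if r == cur_rel:
                if s == user:
                    return True  # direct edge to the user
                if "#" in s:
                    # Userset edge: follow it (e.g. group:eng#member).
                    so, sr = s.split("#", 1)
                    next_nodes.append((so, sr))
            elif r == "parent" and cur_rel == "viewer":
                # Assumed rule: viewers of the parent can view the child.
                next_nodes.append((s, "viewer"))
        for node in next_nodes:
            if node not in seen:
                seen.add(node)
                queue.append(node)
    return False
```

In this data, check("doc:99", "viewer", "user:42") succeeds through three hops (parent folder → group → member), even though no tuple mentions doc:99 and user:42 together.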

Scale signal: Google's Zanzibar is reported to handle trillions of relationship tuples and millions of checks per second at p95 < 10 ms, with caching and request deduplication absorbing most repeated lookups (the source of that low latency).
Trade-off / cost: a check can require multiple sequential graph hops, each potentially its own lookup. Deep nesting blows up tail latency, so you need aggressive caching plus consistency tokens ("zookies") that pin the check to a snapshot at least as fresh as the content. Without that you hit the "new enemy" problem: the check is evaluated against a stale ACL snapshot, so a user who was just removed can still see content added after the removal. ReBAC is the most powerful model and the most operationally heavy; it is overkill unless your domain is genuinely graph-shaped (sharing, hierarchies, social).

Where to enforce

🤔 You could check permissions at the API gateway (one central place) or inside each microservice. If you only check at the gateway, what attack still gets through — and what's the downside of checking everywhere?

Reveal the reasoning

Two layers, two jobs:

Gateway / edge: good for coarse checks (is the token valid? does this role even reach this route?). Cheap, central, blocks obvious junk early.
Service-level: required for fine-grained, data-dependent checks ("is this your order?") because only the service knows the resource's attributes.

Cause → effect of gateway-only: a caller that reaches a service directly (an internal/compromised neighbor service, a misrouted internal call, SSRF) skips the edge entirely → no check runs → data leak. This is why zero-trust / defense-in-depth says the service must not assume the gateway already checked.

Trade-off / cost: enforcing in every service means (1) every team re-implements AuthZ → drift and inconsistent bugs, and (2) duplicated lookup latency. The standard fix is the PDP/PEP split: a Policy Decision Point owns the policy logic and is managed centrally (e.g. an OPA agent run as a sidecar with centrally distributed policies, or a Zanzibar-style check service), while a thin Policy Enforcement Point inside each service, usually middleware or a library, asks the PDP for a decision and does the blocking. Policy stays central and consistent; enforcement stays local, so bypassing the gateway does not bypass the check.
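
The PDP/PEP split can be sketched with a decorator. This is a minimal in-process illustration: decide, enforce, Forbidden and get_order are hypothetical names, and in a real deployment decide would be a call to the policy engine rather than a local function.

```python
from functools import wraps


class Forbidden(Exception):
    """Raised by the enforcement point when the decision is deny."""


def decide(principal: dict, action: str, resource: dict) -> bool:
    """PDP: the one place that owns the policy logic (assumed rules)."""
    if principal.get("role") == "admin":
        return True
    return action == "read" and resource.get("owner") == principal.get("id")


def enforce(action: str):
    """PEP: wraps a handler, asks the PDP, and blocks on deny."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(principal: dict, resource: dict):
            if not decide(principal, action, resource):
                raise Forbidden(f"{principal.get('id')} may not {action}")
            return handler(principal, resource)
        return wrapper
    return decorator


@enforce("read")
def get_order(principal: dict, resource: dict) -> str:
    # Business logic runs only after the PEP has allowed the call.
    return resource["body"]
```

The handler contains no policy at all, so changing who may read an order means editing decide once rather than touching every service.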

Carrying claims in JWT

🤔 To avoid a DB lookup on every request, you stuff the user's roles into a signed JWT the service can verify locally. An admin is fired at 10:00 and you revoke their admin role in the database at 10:01 — why can they still delete records at 10:30, and how do you bound the damage?

Reveal the reasoning

Mechanism: a JWT is a signed, self-contained token carrying claims (e.g. roles:[admin], exp:...). The service verifies the signature with the issuer's key and reads the claims — no network call, no DB hit → fast, stateless AuthZ.

Cause → effect of the bug: the token issued at login still physically contains roles:[admin]. Revoking the role updated the database, but the service trusts the self-contained token, not the DB → the stale claim is honored until the token expires. With a 1-hour TTL, a token minted at 09:45 stays valid until ~10:45 — the fired admin keeps power for the remainder of that window.
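
The stale-claim window can be reproduced with a hand-rolled HS256 token built from the standard library. This is a teaching sketch only: the secret, the mint/verify/at helpers, and the use of seconds-since-midnight for exp are assumptions for readability (real JWTs carry seconds since the Unix epoch), and production code should use a vetted JWT library.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # hypothetical signing key


def at(hour: int, minute: int) -> int:
    """Seconds since midnight, used as a stand-in clock."""
    return hour * 3600 + minute * 60


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint(claims: dict) -> str:
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps(claims).encode())
    sig = hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64(sig)}"


def verify(token: str, now: int) -> dict:
    """Check signature and expiry locally; never consults the database."""
    header, payload, sig = token.split(".")
    expected = hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _unb64(sig)):
        raise ValueError("bad signature")
    claims = json.loads(_unb64(payload))
    if now >= claims["exp"]:
        raise ValueError("token expired")
    return claims


# Token minted at 09:45 with a 1-hour TTL.
token = mint({"sub": "u_1", "roles": ["admin"], "exp": at(10, 45)})
db_roles = {"u_1": []}  # role revoked in the database at 10:01
claims = verify(token, now=at(10, 30))  # still carries roles == ["admin"]
```

At 10:30 the service sees roles ["admin"] even though db_roles is already empty; only after 10:45 does verify reject the token.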

This is the core trade-off: statelessness vs. revocation latency.

Long TTL (hours): fewer re-auth round-trips, but a large revocation window.
Short TTL (5–15 min) + refresh token: small window (revocation takes effect at the next refresh), but more traffic to the auth server.
Token denylist / introspection: near-instant revocation, but you've reintroduced the central lookup the JWT existed to avoid — back to stateful.
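
The three options can be compared on the same 10:30 request. This is a simplified sketch: is_accepted and at are hypothetical helpers, signature verification is omitted, and the denylist is a plain in-memory set keyed by the token's jti (token ID) claim, where a real deployment would use a shared store.

```python
def at(hour: int, minute: int) -> int:
    """Seconds since midnight, used as a stand-in clock."""
    return hour * 3600 + minute * 60


def is_accepted(claims: dict, now: int, denylist: set) -> bool:
    """Accept a token only if it is unexpired and not explicitly revoked."""
    if now >= claims["exp"]:
        return False
    # Central lookup: this is the stateful step a pure JWT design avoids.
    return claims["jti"] not in denylist


# Long TTL: minted 09:45, valid for one hour.
long_lived = {"jti": "t1", "roles": ["admin"], "exp": at(10, 45)}
# Short TTL: minted 09:55, valid for 15 minutes. The refresh after 10:10
# would re-read the database and mint a token without the admin role.
short_lived = {"jti": "t2", "roles": ["admin"], "exp": at(10, 10)}
```

At 10:30 the long-lived token is still accepted, the short-lived one has already expired, and adding "t1" to the denylist rejects the long-lived token immediately at the price of a lookup on every request.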

Interview line: "JWTs trade revocation immediacy for statelessness; I'd use short-lived access tokens with refresh, and a denylist only for emergency revocation."

OAuth vs JWT confusion

🤔 A teammate says "let's use OAuth instead of JWT for auth." Why is that comparison a category error — what does each one actually do, and where does OIDC fit?

Reveal the reasoning

They operate at different layers, so "OAuth vs JWT" is like "HTTP vs JSON" — not alternatives.

OAuth 2.0 = an authorization framework (a protocol). It defines the flow by which an app gets delegated access to a resource on a user's behalf without seeing their password — who issues tokens, how they're scoped, refreshed, and revoked. It does not mandate a token format.
JWT = a token format. A signed, self-contained way to package claims (RFC 7519). It says nothing about how the token was obtained.
OIDC (OpenID Connect) = an identity layer on top of OAuth 2.0. OAuth alone only proves "this app may call this API," not who the user is; OIDC adds that by issuing an ID token (a JWT) with verified identity claims.

Cause → effect: they compose, they don't compete. A typical setup runs the OAuth flow to obtain a token, has the authorization server format that token as a JWT, and uses OIDC when it also needs authenticated identity. Saying "OAuth instead of JWT" reveals the confusion: OAuth is the how-you-get-it, JWT is the what-it-looks-like.

Trade-off / cost: the choice that does trade off is JWT (self-contained) vs. an opaque token (a random string the resource server must introspect against the auth server). JWT = fast local verification, but the revocation-lag problem from the previous step; opaque = instant revocation by central lookup, at the cost of a network round-trip per check. OAuth supports either format.

🎯 Guided practice

Easy: Tell AuthN from AuthZ. A user sends DELETE /api/posts/42. The request carries a valid session token. The server returns 403 Forbidden (not 401 Unauthorized). What happened?

Step 1: A valid token means AuthN passed, so identity is known.
Step 2: Despite its name, 401 Unauthorized actually signals an authentication failure ("I don't know who you are"). 403 Forbidden signals an authorization failure ("I know who you are, but you may not do this").
Step 3: So the user is authenticated but lacks permission to delete post 42; perhaps they are a viewer, not the post's author or an admin.
Core pattern: AuthN and AuthZ are separate gates, and the two status codes map directly onto which gate failed (401 = identity unknown, 403 = identity known but not permitted).

Medium: Design the check for "users edit only their own profile, admins edit anyone."

Step 1 (identify the principal): after AuthN, you hold the caller's identity, e.g. { userId: "u_7", role: "user" }, extracted from a verified token's claims.
Step 2 (identify the resource): the request is PUT /users/u_99; the target resource id is u_99, and a profile's owner is the user it belongs to.
Step 3 (write the rule, RBAC for the admin path and ownership/ReBAC for the self path): allow if role == "admin" (role-based) OR principal.userId == resource.ownerId (relationship between principal and this specific object).
Step 4 (apply): here u_7 is a plain user editing u_99; not admin, and u_7 != u_99, so deny with 403. If it were PUT /users/u_7, ownership matches → allow.
Step 5 (avoid the trap): a naive design checks only role == "user" and lets any user edit any profile. This is the classic IDOR bug (insecure direct object reference, also called broken object-level authorization). The fix is binding the principal to this specific resource, not just checking their role in the abstract.
Core pattern: every authorization decision is a function of (principal, action, resource, context) → allow/deny, and the resource binding is the part juniors forget.
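
The Medium rule and its naive counterpart can be put side by side. This is a minimal sketch; both function names are hypothetical, and the principal dict mirrors the claims shape used in the walkthrough.

```python
def naive_can_edit(principal: dict, target_user_id: str) -> bool:
    """Buggy version: checks the role but never looks at the target."""
    return principal["role"] in {"user", "admin"}


def can_edit_profile(principal: dict, target_user_id: str) -> bool:
    """Allow admins (role-based) or the profile's own user (ownership)."""
    if principal["role"] == "admin":
        return True
    # Resource binding: the caller must be the user being edited.
    return principal["userId"] == target_user_id
```

For principal { userId: "u_7", role: "user" } and target u_99, the naive check allows the edit while can_edit_profile denies it; for target u_7 both allow.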

✨ Added by the guide — work these before the full problem set.