
Design a Recommendation System (Medium)


Step 1: System Definition

The recommendation system is a core service for a large-scale streaming platform (e.g., Netflix, Amazon Prime, Hulu) that suggests personalized video content to users. Its purpose is to enhance user engagement by helping users discover movies and shows they are likely to enjoy, thereby increasing watch time and satisfaction. The system delivers two types of recommendations: real-time recommendations (generated dynamically as users browse, to reflect their immediate context or latest actions) and batch-processed recommendations (computed periodically in bulk, e.g., daily or weekly, to capture long-term preferences and trends). It must serve hundreds of millions of users reliably and with low latency.

Key Entities:


Step 2: Requirements

Functional Requirements

Non-Functional Requirements

Step 3: Capacity Estimations

Before finalizing the design, let's estimate the scale to ensure components can handle the load:

Summary:

Details:

These estimates guide the choice of technologies (NoSQL databases, distributed caches, etc.) and ensure the architecture can scale to these volumes.

We will use a REST API for client interactions.

1. Get Home Feed

{
  "rows": [
    {
      "type": "continue_watching",
      "title": "Continue Watching for Jane",
      "items": [
        { "video_id": "vid_123", "progress_sec": 450, "thumbnail": "..." }
      ]
    },
    {
      "type": "top_picks",
      "title": "Top Picks for You",
      "items": [
        { "video_id": "vid_555", "match_score": 0.98 }
      ]
    }
  ]
}

2. Record User Event

{
  "user_id": "u_999",
  "event_type": "heartbeat",  // or play, pause, rate
  "video_id": "vid_123",
  "timestamp": 1679000000,
  "position_sec": 455
}
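As a rough illustration of how the backend might validate this payload before logging it, here is a minimal Python sketch. The `UserEvent` class, the `parse_event` function, the set of allowed event types, and the choice to treat `position_sec` as optional are all assumptions made for the example; only the field names come from the payload above.

```python
import json
from dataclasses import dataclass

# Assumed set of event types; the example payload names heartbeat, play, pause, rate.
ALLOWED_EVENT_TYPES = {"heartbeat", "play", "pause", "rate"}
REQUIRED_FIELDS = {"user_id", "event_type", "video_id", "timestamp"}


@dataclass(frozen=True)
class UserEvent:
    user_id: str
    event_type: str
    video_id: str
    timestamp: int      # Unix time in seconds
    position_sec: int   # playback position when the event fired


def parse_event(raw: str) -> UserEvent:
    """Parse and validate a JSON event body; raise ValueError on bad input."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("event body must be a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["event_type"] not in ALLOWED_EVENT_TYPES:
        raise ValueError(f"unknown event_type: {data['event_type']!r}")
    # Assumption: position_sec is optional and defaults to 0.
    position = data.get("position_sec", 0)
    if not isinstance(position, int) or position < 0:
        raise ValueError("position_sec must be a non-negative integer")
    return UserEvent(
        user_id=str(data["user_id"]),
        event_type=data["event_type"],
        video_id=str(data["video_id"]),
        timestamp=int(data["timestamp"]),
        position_sec=position,
    )
```

Rejecting malformed events at the edge keeps bad records out of the durable event log, where they would otherwise have to be filtered by every downstream batch job.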

Step 4: High-Level System Design

Architecture Overview: The system is composed of online serving components (to handle live user requests) and offline processing components (to crunch data and update recommendations periodically), along with data storage and integration layers. Below is an overview of key components and how they interact:

High-level design of Recommendation Service

Data Flow (from data ingestion to serving recommendations):

  1. User Interaction & Logging: When a user interacts with the platform (watches a video, rates something, adds to watchlist, etc.), the client sends this event to the backend. For example, after finishing a movie, an event “User U watched Movie M at time T” is sent. The backend logs this event to the Event Ingestion System (e.g. a Kafka topic or analytics service). This event is appended to a durable log for offline processing. It may also update some real-time counters: for instance, increment a view count for Movie M in a cache or trigger a quick recalculation of U’s recommendations (in a streaming/nearline processor). The user’s immediate session state might also be updated (so the “Continue Watching” list on their home page is updated instantly).
  2. Real-Time Processing: The event can have immediate effects. A streaming computation layer (or even the Recommendation Service itself when it receives the event) might handle quick updates. For example, upon “User U watched M”, the system could: update U’s profile (mark M as watched so it’s not recommended again), and fetch a few similar movies to M to recommend next. This could be done by a lightweight Real-Time Recommender component that consumes events. Alternatively, the Recommendation Service when next called will notice that U watched M and use that context. Also, the Trending Service might consume all “watched” events to update the current top-N popular list continuously. These real-time operations ensure freshness between batch runs.
  3. Batch Processing (Offline): The bulk of heavy recommendation computation happens asynchronously. At scheduled intervals (e.g. every night), the Batch Processing Pipeline kicks off jobs. These jobs read large datasets: user watch histories (from the event logs or a data warehouse), content information, and possibly prior model data. For instance, a collaborative filtering algorithm might build a big user-item matrix from all watch history and factorize it to learn latent features, or compute item-to-item similarity by analyzing co-watches. This can produce either a model (like embeddings for each user and item) or directly a list of top recommendations for each user. The batch jobs might produce intermediate data too (such as an item similarity index, which the online system can use). After computation, the batch job writes the results to the appropriate store, e.g., update the Recommendations Data Store with new top-N recommendations per user, update model parameters, and store any popularity stats or derived data. This batch process is not real-time, so it can use complex algorithms and crunch through the entire dataset with the luxury of a few hours of processing.
  4. Publishing Results: Once the batch job finishes generating recommendations (or model), there is a step to publish or load these results into the serving layer. For example, if the results are written to a file or Hadoop storage, a process will load them into the live NoSQL DB that the Recommendation Service uses. This could be done via a bulk upload or through a publish-subscribe mechanism that notifies the serving store of new data (Netflix’s internal tool “Hermes” or simply a script, for instance). Alternatively, the batch job might directly write to the serving database if it has access. The important part is that by the time users start their day, the Recommendation Service has the latest precomputed data available.
  5. Serving Recommendations (Online Query): When the user opens the app or navigates to a section that requires recommendations (e.g. the homepage or a “because you watched X” list), the client calls the Recommendation Service API. For example, a request goes to /recommendations?user=U&context=home. The API Gateway passes this to an instance of the Recommendation Service. The service will then gather necessary data: it will typically fetch the precomputed recommendation list for user U from the Recommendations store (a fast key-value lookup by userID). If the context is specific (say user is on a details page for Movie M and asks for similar titles), the service might instead fetch the similar-items list for M (which could be precomputed as well, like an item-to-item index). The service then may combine results from multiple sources, e.g., it might start with the stored personalized list, then adjust it. Adjustments could include filtering out any items the user has seen or that are not currently available, merging in a few currently trending items or new releases (especially if the user’s list is short or stale), and possibly re-ranking slightly based on very recent behavior (if not already accounted for).
  6. Returning the Response: The Recommendation Service prepares the final list of recommended content (say top 20 items). It might attach scores or just the ordered list of content IDs. It then either returns those IDs to the client, or it may call the Content Metadata Service to fetch details (title names, etc.) so that the client gets a fully populated response. This choice is a design trade-off: returning just IDs keeps the Recommendation Service simple and the payload small, but then the client has to call another service to get details (or the gateway does it). Often, an aggregation layer or the Recommendation Service itself will join in the metadata for convenience, especially if it’s cached, so the client gets one response with everything needed to display the recommendation carousels.
  7. Client Display and Feedback: The client receives the recommendations and displays them to the user (e.g., as a row of movie thumbnails titled “Top Picks for You”). If the user interacts (scrolls, clicks a title), those actions generate further events (e.g., clicking a title might trigger an impression log or a “more like this” request). The cycle continues with new events flowing in and eventually influencing the next round of recommendations.
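Step 3 above mentions computing item-to-item similarity by analyzing co-watches. The sketch below shows one simple way a batch job could do that, using cosine similarity over binary "watched" data. The function name, the in-memory dictionary input, and the choice of cosine similarity are assumptions for illustration; a production job would run the same idea on a distributed framework over the full event log.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarity(
    watch_history: dict[str, set[str]], top_k: int = 10
) -> dict[str, list[tuple[str, float]]]:
    """Return, per item, its top_k most similar items by co-watch cosine similarity.

    similarity(a, b) = co_watch_count(a, b) / sqrt(watchers(a) * watchers(b))
    """
    item_counts: dict[str, int] = defaultdict(int)
    co_counts: dict[tuple[str, str], int] = defaultdict(int)

    for items in watch_history.values():
        for item in items:
            item_counts[item] += 1
        # Count each unordered pair once per user who watched both items.
        for a, b in combinations(sorted(items), 2):
            co_counts[(a, b)] += 1

    neighbors: dict[str, list[tuple[str, float]]] = defaultdict(list)
    for (a, b), co in co_counts.items():
        score = co / sqrt(item_counts[a] * item_counts[b])
        neighbors[a].append((b, score))
        neighbors[b].append((a, score))

    # Highest score first; ties are broken by item ID for deterministic output.
    return {
        item: sorted(sims, key=lambda pair: (-pair[1], pair[0]))[:top_k]
        for item, sims in neighbors.items()
    }
```

The result has the same shape as an item-to-item index keyed by content ID, so it can be written to the serving store as is. Items that were never co-watched with anything do not appear in the output.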

Throughout this flow, multiple layers of caching and fallback ensure performance and reliability. For instance, if the Recommendation Service cannot reach the precomputed data store for some reason, it might fall back to a cached response or a default trending list so that the user still sees something. Similarly, if the batch job hasn’t yet updated today, the system can use yesterday’s results. In essence, data flows from users into logs and databases, gets transformed into recommendation knowledge (models, lists), and then flows back to users as personalized suggestions, with continuous feedback.
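The fallback behavior described above can be sketched as an ordered chain of sources, where the first source that returns a non-empty list wins. The function name and the idea of passing sources as `(name, fetcher)` pairs are assumptions for this example, not part of the original design.

```python
from typing import Callable, Optional

# A fetcher takes a user ID and returns a list of content IDs, or None/[] if it has nothing.
Fetcher = Callable[[str], Optional[list[str]]]


def recommend_with_fallback(
    user_id: str,
    sources: list[tuple[str, Fetcher]],
    default_items: list[str],
) -> tuple[str, list[str]]:
    """Try each source in order; return (source_name, items) from the first that succeeds."""
    for name, fetch in sources:
        try:
            items = fetch(user_id)
        except Exception:
            # Sketch only: a real service would catch specific errors and log them.
            continue
        if items:
            return name, items
    # Every source failed or was empty, so serve a static default list.
    return "default", default_items
```

A typical ordering would be the precomputed store first, then a cached earlier response, then a trending list. Returning the source name alongside the items makes it easy to log how often each fallback level is actually hit.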

Step 6: Database Schema

This schema uses a hybrid approach combining relational tables for structured, consistent data and NoSQL stores for high-scale, real-time recommendation data. The relational (SQL) database ensures data integrity and easy joining of user, content, and interactions. The NoSQL stores (Cassandra/DynamoDB for key-value and wide-column data, plus Elasticsearch for search) handle large volumes and fast lookups/updates for recommendations, similarity, trends, and search.

Relational Database (SQL) Schema

Users Table

Purpose: Stores user profile information.

| Field Name | Data Type | Description |
| --- | --- | --- |
| user_id | INT (PK) | Unique user identifier (primary key). |
| name | VARCHAR(100) | User’s full name or display name. |
| email | VARCHAR(100) | Email address (unique, used for login). |
| signup_date | DATETIME | Timestamp when the user registered. |
| country | VARCHAR(50) | Country or region of the user (for locale/trending). |

Indexes / Keys:

Content Metadata Table

Purpose: Stores details of movies/shows (content catalog metadata).

| Field Name | Data Type | Description |
| --- | --- | --- |
| content_id | INT (PK) | Unique content identifier (primary key). |
| title | VARCHAR(200) | Title of the movie or show. |
| description | TEXT | Synopsis or description of the content. |
| genre | VARCHAR(50) | Primary genre of the content (e.g., "Action"). |
| content_type | VARCHAR(20) | Type of content ("Movie", "Series", etc.). |
| release_year | INT | Year the content was released. |
| language | VARCHAR(50) | Language of the content (optional, for filtering). |

Indexes / Keys:

User Watch History Table

Purpose: Logs each content viewing event per user. This is a normalized log of what users watched, used for collaborative filtering signals and history-based recommendations.

| Field Name | Data Type | Description |
| --- | --- | --- |
| history_id | BIGINT (PK) | Unique identifier for this watch event (primary key). |
| user_id | INT (FK) | ID of the user who watched (foreign key to Users table). |
| content_id | INT (FK) | ID of the content watched (foreign key to Content table). |
| watch_date | DATETIME | Date and time when the content was watched. |
| duration | INT | Watch duration in minutes (optional). |

Indexes / Keys:

User Preferences Table

Purpose: Stores explicit user preferences like preferred genres or other categories the user likes, as stated by the user (e.g., during onboarding or in profile settings).

| Field Name | Data Type | Description |
| --- | --- | --- |
| user_id | INT (FK) | ID of the user (foreign key to Users table). |
| preference_type | VARCHAR(50) | Type of preference (e.g., "Genre"). |
| preference_value | VARCHAR(100) | Value of the preference (e.g., "Comedy", "Drama"). |

Indexes / Keys:

Ratings Table

Purpose: Stores user-provided ratings for content they have watched. These explicit ratings feed into collaborative filtering algorithms.

| Field Name | Data Type | Description |
| --- | --- | --- |
| user_id | INT (FK) | ID of the user who rated the content (FK to Users). |
| content_id | INT (FK) | ID of the content that was rated (FK to Content). |
| rating | TINYINT | Rating given by the user (e.g., 1–5 stars or 1–10 scale). |
| rating_date | DATETIME | Date and time when the rating was made. |

Indexes / Keys:

NoSQL Collections / Tables (High-Scale & Real-Time)

The NoSQL components handle large-scale data and real-time updates that are less suited for the relational model. These systems use keys and partitioning to ensure fast reads/writes across distributed nodes. Below, each NoSQL store is outlined with key fields and data types (using Cassandra/DynamoDB terminology or Elasticsearch for the search index).

Precomputed Recommendations (User → Top-N Items)

Purpose: Stores precomputed top-N recommendations for each user for fast retrieval (e.g., “Recommended for You” list). This is typically updated by offline batch jobs or real-time pipelines and read frequently when the user opens the app.

| Field Name | Data Type | Description |
| --- | --- | --- |
| user_id | TEXT (Partition Key) | Key – Unique user identifier (partition key for fast lookup). |
| recommended_items | LIST (or JSON) | List of top N recommended content IDs for this user, in ranked order. |

Design Notes:

Item-to-Item Similarity Index

Purpose: Captures content-to-content similarity for item-based recommendations (e.g., “Because you watched X, you might like Y”). For each content item, it stores a list of similar items.

| Field Name | Data Type | Description |
| --- | --- | --- |
| item_id | TEXT (Partition Key) | Key – Content ID for which we store similar items (partition key). |
| similar_items | LIST (of IDs, or of ID/score pairs) | List of similar content IDs related to item_id. Could be a simple list of IDs or structured with similarity scores. |

Design Notes:

Trending Content

Purpose: Tracks globally and regionally trending content, updated frequently (e.g., hourly or daily) based on view counts, likes, and other metrics. This provides quick access to "What's popular" for all users or by region.

| Field Name | Data Type | Description |
| --- | --- | --- |
| region | TEXT (Partition Key) | Key – Region identifier. Could be country code (e.g., "US") or "GLOBAL" for worldwide trending. |
| trending_items | LIST | List of top trending content IDs in that region, ordered by popularity. |
| last_updated | TIMESTAMP | (Optional) Timestamp of last update for this region’s trending list. |
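One way to produce the per-region lists in this table is to count views in hourly buckets and rank over a recent window. The following is a minimal in-memory sketch; the `TrendingTracker` class, the hourly bucket size, and the use of view counts as the only signal are assumptions for illustration. A real deployment would keep these counters in a stream processor or a distributed cache.

```python
from collections import Counter, defaultdict


class TrendingTracker:
    """Counts views per region in hourly buckets and ranks over a recent window."""

    def __init__(self, window_hours: int = 24):
        self.window_hours = window_hours
        # region -> hour bucket -> Counter mapping content_id to view count
        self._buckets = defaultdict(lambda: defaultdict(Counter))

    def record_view(self, region: str, content_id: str, timestamp: int) -> None:
        """Record one view at a Unix timestamp, for the region and for GLOBAL."""
        hour = timestamp // 3600
        # A set avoids double counting when region is already "GLOBAL".
        for key in {region, "GLOBAL"}:
            self._buckets[key][hour][content_id] += 1

    def top_n(self, region: str, now: int, n: int = 10) -> list[str]:
        """Return the n most viewed content IDs within the last window_hours."""
        current_hour = now // 3600
        oldest_hour = current_hour - self.window_hours + 1
        totals = Counter()
        for hour, counts in self._buckets.get(region, {}).items():
            if oldest_hour <= hour <= current_hour:
                totals.update(counts)
        # Most views first; ties are broken by content ID for stable output.
        ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
        return [content_id for content_id, _ in ranked[:n]]
```

The output of `top_n` corresponds to the `trending_items` column, and the time at which it was computed would be stored as `last_updated`.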

Design Notes:

User Real-Time Events

Purpose: Captures live user interactions (play, pause, likes, etc.) in real-time. This firehose of events can be used to immediately adjust recommendations (e.g., boosting an item right after a user likes it, or recording an in-progress watch). Storing these in a NoSQL table allows fast, high-volume writes and flexible querying for recent events.

| Field Name | Data Type | Description |
| --- | --- | --- |
| user_id | TEXT (Partition Key) | Key – User identifier for the event (partition key to group events by user). |
| event_time | TIMESTAMP (Clustering Key) | Timestamp of the event (also part of the key to sort by time). |
| content_id | INT | ID of the content the event pertains to (if applicable). |
| event_type | TEXT | Type of event (e.g., "PLAY", "PAUSE", "LIKE", "DISLIKE", "SEARCH"). |
| details | TEXT/JSON | (Optional) Additional details (e.g., duration watched at pause, device info, etc.). |

Design Notes:

Step 7: Detailed Component Design

Recommendation Computation Workflow

1. Offline Batch Recommendation Generation

The offline pipeline is where the heavy lifting is done to compute personalized recommendations using the full history of data. Here’s how it typically works:

Detailed Design of Recommendation Service

Real-Time Recommendation Serving and Updates

The online part of the system uses the precomputed data but also reacts to new information in real time to fine-tune recommendations for immediacy:

Nearline Processing

In between, we have a nearline or streaming layer, as mentioned. This operates in quasi real-time but off the request path (asynchronous). For example, when a user finishes watching something, a nearline processor might within a minute update some of that user’s recommendations or trigger partial model updates (like updating that user’s factor vector with a simple formula rather than waiting for tonight’s full retraining). This gives a compromise: it can handle event-by-event updates with a bit more computation than the strict online path, but since it doesn’t need to respond to the user immediately, it can take a few seconds and aggregate work. The nearline system often writes to the same storage that the online system reads from (like updating the user’s profile store or a cache of their top picks). For instance, Netflix’s architecture uses a nearline pipeline to update “recently watched” and similar signals via a distributed computation framework.
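To make the "simple formula" idea concrete, here is one possible nearline update: nudge the user's factor vector toward the embedding of the item they just watched, using an exponential moving average. This particular formula, the `alpha` step size, and the function names are assumptions chosen for illustration; the text above does not specify which update rule is used.

```python
def update_user_vector(
    user_vec: list[float], item_vec: list[float], alpha: float = 0.1
) -> list[float]:
    """Move the user vector a fraction alpha of the way toward the watched item's vector."""
    if len(user_vec) != len(item_vec):
        raise ValueError("user and item vectors must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return [(1 - alpha) * u + alpha * v for u, v in zip(user_vec, item_vec)]


def affinity(user_vec: list[float], item_vec: list[float]) -> float:
    """Dot-product score, as commonly used with factorization-style embeddings."""
    return sum(u * v for u, v in zip(user_vec, item_vec))
```

After such an update, items close to the one just watched score slightly higher for this user, without waiting for the nightly retraining. The full batch job later replaces the approximate vector with a properly trained one.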

Interaction of Real-Time and Batch Components

The batch and real-time parts work together to provide a balanced solution. The batch system ensures deep analysis of large data (for accuracy and breadth), while the real-time system ensures freshness and context-awareness. Key interaction points and strategies include:

In summary, the detailed design ensures that all user and content data is organized in the right storage, the batch algorithms can efficiently crunch through the big picture and store results, and the online service can quickly assemble a recommendation list by combining precomputed insights with real-time context. Simpler algorithms are used, but smart use of data (like item similarities and trending info) ensures recommendations remain relevant and timely.

Caching Layers

Caching is everywhere to ensure speed. We specifically use:

Recommendation Request Workflow (End-to-End)

To illustrate how everything comes together, consider what happens when a user opens their Netflix home page (i.e., a recommendation request):

  1. Request Arrival: The user’s device (client) sends an authenticated request such as “GET /recommendations/home” to the API Gateway, carrying the user’s ID or token. The API Gateway authenticates the request and forwards it to the Recommendation Service (which is part of our Recommendation Engine). If the user is in region US-EAST, for example, the DNS/load balancer directs the request to a nearby cluster of servers for low latency.
  2. Fetching User Context: The recommendation service instance handling this request will first gather the necessary data about the user. It calls the User Profile Service (or reads from a cache) to get the user’s profile info, e.g., their viewing history, explicit ratings, maybe demographic info if available, and any precomputed user features (like their embedding or genre affinity scores). This is typically a quick lookup (thanks to caching or a fast NoSQL query).
  3. Fetching Candidates: Based on the user’s profile, the service generates recommendation candidates:
    • It might query the precomputed recommendations store: e.g., check if we have a ready-to-use list of top items for this user from the last batch job. If so, those items form an initial candidate set.
    • It will likely augment this with fresh candidates from algorithms. For collaborative filtering: it could take the user’s top N similar users (which might be stored from an offline job) and gather items those users rated highly. Or use the user’s embedding to do a nearest-neighbor search in the item embedding space (via the ANN index). For content-based: it could take the top 2–3 genres the user watches and fetch a few popular new items in those genres, or find items with high similarity to the last thing the user watched. In practice, we might have separate modules, e.g., UserCF module yields 50 items, ItemCF module yields 50 items, Content module yields 50, Popularity module yields 10, etc. We take a union of them. If there are duplicates (the same item came from two methods), we handle that by merging or increasing its score.
    • This candidate retrieval phase might involve calls to an Item Similarity Service (to get similar items by content or collaborative similarity) and to a Trending Service (to get currently popular items). Those services either compute quickly or look up cached lists.
    • Also, some candidates are rule-based: e.g., “Continue Watching” items (if the user has incomplete videos) are directly pulled from the user’s profile; these will be placed in a special row on the UI rather than the algorithmic list, but it’s part of the overall logic to decide what rows to show.
  4. Feature Preparation: For each candidate item, the service gathers features needed for ranking. This could include:
    • Item features: from the Content Metadata Service (like item genre, year, etc., if not already present in an earlier lookup).
    • User-item interaction features: e.g., has the user watched other seasons of this show (for a series recommendation)? Did the user search for this title recently? Some of these features may come from the user’s profile (which might store that info). Others might require a quick query (though we try to avoid any heavy per-item queries at request time).
    • Context features: current time, device type, etc., which are straightforward.
    These features might be assembled into a vector per (user, item) pair for input to the ranking model. The system likely does this in-memory for speed, since the data needed is mostly local or already fetched.
  5. Ranking & Filtering: Now the Ranking module applies the machine learning model to score each candidate. It runs the model for each candidate and sorts the candidates by score.
    • We also incorporate any business rules at this stage: e.g., diversity constraints. If the top 10 all happen to be very similar (say all are from the same franchise), the system might enforce some spreading out (demote some and promote the next best different item). This can be done via reranking algorithms or by adding diversity as part of the scoring function.
    • Filtering: Remove any content that should not be shown. For example, items that the user has “thumbs-downed” in the past might be filtered out regardless of score, and R-rated content is filtered out for a kids profile. These rules are applied after scoring, or equivalently by setting the item’s score to -inf.
  6. Return Results: The top K items (say top 20 for the first row of the homepage) are selected. The service formats the response, which typically includes item IDs and possibly some metadata like the title or a synopsis snippet. However, often the client will retrieve the detailed info (like images) separately or from a CDN. The response goes back through the API Gateway to the user’s device. Now the user sees the recommendations.
  7. Caching and Logging: The service might cache the result of this computation for a short time (a couple of minutes), keyed by (user, context), so that a quick refresh doesn’t redo all the steps. If a new event comes in (e.g., the user watched something), that cache entry is invalidated. The service also logs what was recommended (which items, in what order) to the event pipeline. This is important for training data: we need to know “we recommended item X, and the user did/didn’t click it” to learn from implicit feedback which recommendations worked. So an “impression” (recommendation-served) event is generated.
  8. Continuous Feedback: As the user interacts (maybe clicks one of the recommended items), that goes into the feedback loop as described. Suppose they watch one of the recommended movies – an event for “play start” and later “play complete” will fire. The nearline system might pick that up and update their profile immediately (so if they go back to the home screen after watching, the recommendations account for it). The next day’s batch retraining will also include that data point in refining the models.
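Steps 3 and 5 of the workflow above (union of candidates, duplicate merging, filtering, and diversity-aware ordering) can be sketched as follows. The function names, the summed module score standing in for the ranking model's output, and the per-group cap used as the diversity rule are all assumptions for this example.

```python
from collections import defaultdict


def merge_candidates(module_outputs: dict[str, dict[str, float]]) -> dict[str, float]:
    """Union candidates from all modules; an item found by several modules gets a summed score."""
    merged: dict[str, float] = defaultdict(float)
    for scores in module_outputs.values():
        for item_id, score in scores.items():
            merged[item_id] += score
    return dict(merged)


def rank_with_rules(
    scores: dict[str, float],
    blocked: set[str],
    group_of: dict[str, str],
    max_per_group: int = 2,
    top_k: int = 20,
) -> list[str]:
    """Drop blocked items, sort by score, and demote items beyond max_per_group per group."""
    ranked = sorted(
        ((item, s) for item, s in scores.items() if item not in blocked),
        key=lambda pair: (-pair[1], pair[0]),
    )
    kept: list[str] = []
    demoted: list[str] = []
    per_group: dict[str, int] = defaultdict(int)
    for item, _ in ranked:
        group = group_of.get(item, item)  # an ungrouped item counts as its own group
        if per_group[group] >= max_per_group:
            demoted.append(item)  # pushed behind the more diverse items, not removed
        else:
            per_group[group] += 1
            kept.append(item)
    return (kept + demoted)[:top_k]
```

Here `blocked` would hold thumbs-downed or age-inappropriate titles, and `group_of` might map each title to its franchise. In the full design, the scores passed to `rank_with_rules` would come from the ranking model rather than from the raw module scores.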

The above workflow ensures real-time personalization while leveraging precomputed data. Multi-device consistency is naturally handled because no matter which device makes the request, it hits the same backend and the profile is centralized. If a user starts a show on their TV, a “continue watching” candidate will be added to their profile; later if they open the app on mobile, the workflow will include that candidate because the profile knows the show is in progress. Consistency is maintained by updating the central profile and caches on each event, rather than storing state on the device.
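The short-lived result cache from step 7, together with the "update caches on each event" behavior described above, can be sketched as a small TTL cache keyed by (user, context) that drops all of a user's entries when a new event arrives. The `RecommendationCache` class and its injectable clock are assumptions for illustration; in practice this would more likely live in a shared cache such as a distributed key-value store than in process memory.

```python
import time


class RecommendationCache:
    """Caches recommendation lists per (user_id, context) for a short time."""

    def __init__(self, ttl_seconds: float = 120.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # (user_id, context) -> (expires_at, items)

    def get(self, user_id: str, context: str):
        """Return the cached list, or None if it is absent or expired."""
        key = (user_id, context)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, items = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return items

    def put(self, user_id: str, context: str, items: list) -> None:
        self._entries[(user_id, context)] = (self._clock() + self._ttl, list(items))

    def invalidate_user(self, user_id: str) -> None:
        """Call when a new event arrives for this user, from any device."""
        for key in [k for k in self._entries if k[0] == user_id]:
            del self._entries[key]
```

Because invalidation is keyed by user rather than by device, an event sent from the TV also clears the entry that the mobile app would otherwise have been served.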

Also, multi-profile support (like Netflix’s separate profiles on one account) is easily handled by treating each profile as a separate user in the recommendation system (with its own profile ID and data).

Throughout the workflow, performance optimizations like caching, parallelism (fetching user profile and item features in parallel), and not blocking on non-essential data (e.g., if an external metadata service is slow, we might proceed with whatever data we have and fill that item later or exclude it) are applied to meet the latency target.
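The parallel-fetch and "don't block on slow dependencies" ideas above could look roughly like this with Python's `asyncio`. The function names, the per-item metadata call, and the 50 ms timeout are assumptions for the sketch; `fetch_profile` and `fetch_item_metadata` stand in for whatever async clients the service actually uses.

```python
import asyncio


async def fetch_with_timeout(coro, timeout_sec: float, default=None):
    """Await coro, but give up and return default if it takes longer than timeout_sec."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_sec)
    except asyncio.TimeoutError:
        return default


async def load_request_data(
    user_id: str,
    item_ids: list[str],
    fetch_profile,
    fetch_item_metadata,
    timeout_sec: float = 0.05,
):
    """Fetch the user profile and all item metadata concurrently.

    Items whose metadata does not arrive in time are excluded from the result
    instead of delaying the whole response.
    """
    metadata_calls = [
        fetch_with_timeout(fetch_item_metadata(item_id), timeout_sec)
        for item_id in item_ids
    ]
    # gather runs the profile call and every metadata call at the same time.
    profile, *metadata = await asyncio.gather(fetch_profile(user_id), *metadata_calls)
    items = {
        item_id: meta
        for item_id, meta in zip(item_ids, metadata)
        if meta is not None
    }
    return profile, items
```

With this shape, total latency is bounded by the slowest call that is allowed to finish, not by the sum of all calls. Whether a timed-out item is dropped or filled in later is a product decision; this sketch takes the simpler option of dropping it.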
