Designing YouTube Likes Counter (Difficulty: Medium)

Problem Statement: Design a backend system to manage likes and dislikes for YouTube videos and comments at scale. The system must handle recording user reactions, updating aggregate counts, and retrieving these counts with low latency. It must be architected to support millions of concurrent users and billions of historical data points.

Key Entities:

Real-World Example: Consider a scenario where user Alice watches the video “Funny Cats” (videoID: abc123).

  1. Video Interaction: Alice clicks the "thumbs up" button. The system records a Like event linking Alice to the video and triggers an increment to the video’s total like count.
  2. Comment Interaction: Alice reads a comment (commentID: cmt99) and likes it. The system records this reaction and updates the comment's count.
  3. Concurrent Interaction: Simultaneously, user Bob clicks "thumbs down" on the same video. The system records a Dislike event and increments the video’s dislike count.

The system must ingest these events and update the visible counts reliably and efficiently, ensuring data consistency even when millions of users interact with popular content simultaneously.

Key Entities

2. Requirements Analysis

Functional Requirements

Non-Functional Requirements

3. Capacity Estimation & Constraints

Before defining the architecture, we must estimate the system scale to determine the necessary storage and throughput capacity.

Traffic Estimates

Storage Estimates

4. API Design

The system exposes a REST API with two operations.

1. Cast Vote

    {
      "target_id": "video_abc123",
      "target_type": "video",   // or "comment"
      "action": "like"          // or "dislike", "none" (remove)
    }
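A minimal sketch of how a server might validate this request body, assuming only the three fields and the allowed values shown above. The `VoteRequest` type and `parse_vote` helper are hypothetical names introduced for illustration, not part of the original API.

```python
from dataclasses import dataclass

# Allowed values are taken from the request example; everything else is assumed.
VALID_TARGET_TYPES = {"video", "comment"}
VALID_ACTIONS = {"like", "dislike", "none"}  # "none" removes an existing reaction


@dataclass(frozen=True)
class VoteRequest:
    target_id: str
    target_type: str
    action: str


def parse_vote(payload: dict) -> VoteRequest:
    """Validate a Cast Vote body and return a typed request (hypothetical helper)."""
    missing = {"target_id", "target_type", "action"} - set(payload)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if payload["target_type"] not in VALID_TARGET_TYPES:
        raise ValueError(f"unknown target_type: {payload['target_type']!r}")
    if payload["action"] not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {payload['action']!r}")
    return VoteRequest(payload["target_id"], payload["target_type"], payload["action"])
```

Rejecting unknown actions at the edge keeps malformed events out of the downstream counting pipeline.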

2. Get Counts & State (Batch)

    {
      "items": {
        "video_abc": { "likes": 1500200, "dislikes": 4050, "user_state": "like" },
        "comment_xyz": { "likes": 45, "dislikes": 0, "user_state": "none" }
      }
    }

5. High-Level Architecture

The system comprises multiple decoupled components designed to handle high concurrency, data persistence, and background processing. Below is an overview of the architecture and data flow:

High-level design of YouTube Likes Counter

6. Data Model Design

Here is the NoSQL schema for tracking user like/dislike actions on posts and comments. It is optimized for high scalability and quick retrieval of the latest user actions, while minimizing query latency and ensuring efficient updates.

1. UserLikes Table (Likes/Dislikes per User per Content item)

Stores the latest like/dislike action by each user on a specific post or comment. This table is keyed by user_id and content_id (composite key) so that each user-item pair has at most one record (the most recent action).

| Field | Data Type | Description |
| --- | --- | --- |
| user_id | String | Unique identifier of the user who performed the action. (Part of the composite primary key) |
| content_id | String | Unique identifier of the content (post or comment) that was liked/disliked. (Part of the composite primary key) |
| action_type | String (enum) | Type of action: either "like" or "dislike". Indicates the user's latest reaction on the content item. |
| timestamp | DateTime | Timestamp of the latest action by the user on this item. Used to record when the action occurred (or last changed). |

2. ContentStats Table (Aggregated Likes/Dislikes per content)

Maintains the total count of likes and dislikes for each post or comment. This table is keyed by content_id so the like/dislike counts for any item can be fetched in a single, fast lookup. It is updated whenever a user’s like/dislike on that item changes (often via an event or stream).

| Field | Data Type | Description |
| --- | --- | --- |
| content_id | String | Unique identifier of the content (post or comment). Acts as the primary key for this table (one record per item). |
| total_likes | Integer | Cumulative count of all "like" actions for this item. Updated whenever a new like is added or a dislike is changed to a like. |
| total_dislikes | Integer | Cumulative count of all "dislike" actions for this item. Updated whenever a new dislike is added or a like is changed to a dislike. |
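Because UserLikes keeps only the latest action per user and item, every change to that record translates into a pair of deltas for ContentStats. The sketch below shows one way to derive those deltas; `count_deltas` is a hypothetical helper, and `None` is assumed to mean "no reaction".

```python
from typing import Optional, Tuple


def count_deltas(previous: Optional[str], new: Optional[str]) -> Tuple[int, int]:
    """Return (likes_delta, dislikes_delta) for a reaction change.

    `previous` and `new` are "like", "dislike", or None (no reaction).
    """
    # Booleans subtract as integers: True - False == 1, False - True == -1.
    likes = (new == "like") - (previous == "like")
    dislikes = (new == "dislike") - (previous == "dislike")
    return likes, dislikes
```

For example, switching from a like to a dislike yields `(-1, 1)`, while repeating the same action yields `(0, 0)`, so a retried request does not inflate the totals.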

Architectural Approaches

When designing a scalable like/dislike system, several architectural strategies are available. Each approach has its own pros and cons; let's look at them in turn:

1. Synchronous Updates (Direct Write to DB and Counters Together)

How it works: Every user like/dislike action is immediately written to the database, updating both the individual user-action record and incrementing/decrementing the total like counter in a single, synchronous operation (often within a transaction). For example, clicking "like" will insert a like record and update the post’s like count column in the same request.
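As an illustration, the synchronous write can be sketched with Python's built-in `sqlite3` module. The table and column names mirror the data model above but are assumptions, as is the simplification that only likes are handled (no dislikes or removals).

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with simplified, assumed table definitions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE user_likes (
            user_id TEXT, content_id TEXT, action_type TEXT,
            PRIMARY KEY (user_id, content_id)
        );
        CREATE TABLE content_stats (
            content_id TEXT PRIMARY KEY,
            total_likes INTEGER NOT NULL DEFAULT 0
        );
    """)
    return conn


def like_sync(conn: sqlite3.Connection, user_id: str, content_id: str) -> None:
    """Record a like and bump the counter in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT OR IGNORE INTO user_likes VALUES (?, ?, 'like')",
            (user_id, content_id),
        )
        if cur.rowcount == 0:
            return  # this user already has a record for the item; do not double count
        # Make sure a counter row exists, then increment it atomically.
        conn.execute(
            "INSERT OR IGNORE INTO content_stats (content_id) VALUES (?)",
            (content_id,),
        )
        conn.execute(
            "UPDATE content_stats SET total_likes = total_likes + 1 WHERE content_id = ?",
            (content_id,),
        )
```

Every like on the same item updates the same `content_stats` row, which is why this design becomes a bottleneck for very popular content.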

Pros:

Cons:

Use Cases: This approach is best in scenarios with low to moderate traffic or where absolute consistency is paramount. For example, a small community forum or an internal application can use direct DB updates for simplicity. It’s also acceptable when the rate of likes/dislikes is low enough that the database can easily handle it. However, it becomes problematic at large scale (millions of likes) where the single counter update is the choke point.

High-level Design: Direct Write to DB and Counters Together

2. Hybrid Approach (User Action Stored Immediately, Counts Updated Asynchronously)

How it works: In this approach, the application records each like/dislike action synchronously but updates the aggregate counts asynchronously, using a message queue (Kafka) to apply count changes in batches:

  1. Immediate Write of Action: When a user likes or dislikes a content item, the individual action is recorded immediately in the database (e.g. inserting a row in a Likes table with user_id, item_id, action_type). This ensures the source of truth for individual actions is always up-to-date.
  2. Publish Event to Kafka: Instead of updating the aggregate like/dislike count on that content item synchronously, the service publishes an event to a Kafka topic (e.g. “like_events”). The event contains details such as the item ID, whether it was a like or dislike, etc.
  3. Kafka Consumers Aggregate Counts: One or more Kafka consumer workers read from the topic. These workers accumulate events and periodically update the total counts. For example, a worker might keep an in-memory counter for each item or buffer a batch of events. Workers can be configured to batch updates: for instance, after every 100 events or every X seconds, they compute the net change per item. This batching dramatically reduces write load on the primary database by coalescing many increments into a single update.
  4. Batched Database Update: When a worker’s threshold is met, it writes the aggregated count to the database. This could mean updating a counter field in an “Items” or “Posts” table (e.g. incrementing the like_count by 100). The update can be done with an atomic operation (like SQL UPDATE ... SET like_count = like_count + 100) to avoid race conditions between batches. By processing in batches, the system can handle high throughput of likes/dislikes while keeping database load manageable.
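The batching in steps 3 and 4 can be sketched as follows. This is a simplified, in-process illustration: events arrive as plain tuples rather than from a real Kafka client, and `BatchAggregator` and its `flush` callback are hypothetical names. Only the count-based threshold is shown; a time-based flush would be added alongside it.

```python
from collections import Counter
from typing import Callable, Dict, Tuple

Event = Tuple[str, int]  # (content_id, like_delta), e.g. ("abc123", +1)


class BatchAggregator:
    """Coalesce per-item deltas and flush them after `batch_size` events."""

    def __init__(self, flush: Callable[[Dict[str, int]], None], batch_size: int = 100):
        self._flush = flush            # e.g. issues one UPDATE ... + n per item
        self._batch_size = batch_size
        self._pending: Counter = Counter()
        self._seen = 0

    def handle(self, event: Event) -> None:
        content_id, delta = event
        self._pending[content_id] += delta
        self._seen += 1
        if self._seen >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Skip items whose likes and unlikes cancelled out within the batch.
        deltas = {cid: d for cid, d in self._pending.items() if d != 0}
        if deltas:
            self._flush(deltas)
        self._pending.clear()
        self._seen = 0
```

With `batch_size=100`, a hundred likes on one video become a single `like_count = like_count + 100` write. A production consumer would typically commit its Kafka offsets only after the flush succeeds, so that a crash replays events rather than dropping them.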

Pros:

This Kafka-based asynchronous approach offers several benefits at large scale:

Cons:

Use Cases: The hybrid approach is useful when you need to handle a high volume of likes without slowing down the user’s action, while still durably logging every action. It suits products that can tolerate slightly delayed counts. For example, an application might show counts that update every few seconds or on page refresh, which is acceptable in social apps where seeing the count “eventually” is good enough.

High-level Design: Hybrid Asynchronous Approach

3. Asynchronous Count Updates with Kafka and Caching (Revised Third Approach)

How it works: This approach builds on the Kafka asynchronous update mechanism but adds a caching layer to provide instant feedback and fast reads. The steps are:

  1. Immediate Write of Action to DB: Just like the previous approach, each like/dislike action is recorded as a separate entry in the database right away. This ensures durability and a trace of each user’s action.
  2. Update Cache’s Count: The system then updates the cache for the total like/dislike count of that item immediately. For example, if a post had 50 likes in cache, and a new like comes in, the application or a caching layer will increment the cached count to 51. This gives real-time feedback – the next time someone fetches the like count (even the same user immediately after liking), they will see “51” from the cache without waiting for the backend aggregation. This cache is typically a fast in-memory store like Redis or Memcached. The update can be done with an atomic operation (e.g., Redis INCR) to handle concurrent updates safely – ensuring two simultaneous likes both get applied without losing one.
  3. Publish Event to Kafka: In parallel, the service still publishes an event to the Kafka topic for likes (so the asynchronous pipeline is informed of the new like). The event will be used by background workers to eventually reconcile the persistent count in the database.
  4. Kafka Workers Update Database: Kafka consumer workers operate as they do in Approach 2 (the hybrid approach): they consume like events in batches. The difference now is that these updates to the database’s aggregate count are somewhat redundant in the short term (because the cache already has the latest count), but they serve to persist the aggregated count for long-term consistency and as a fallback. Workers might batch 100 events and then do an SQL UPDATE posts SET like_count = like_count + 100 WHERE post_id=.... Over time (every few seconds or minutes), the database’s stored count catches up with what the cache shows. If the cache was updated for each like, the DB update should match that total after processing the batch.
  5. Cache and DB Convergence: The cache entry for the count can be given a short TTL (Time To Live), say 5 seconds, or some small window. This means every few seconds, if no new updates happen, the cache will expire and the next read will fetch the count from the database (which by then should include all recent updates from the Kafka consumers). This helps correct any discrepancy that might have occurred between cache and database. Alternatively, the system can invalidate the cache or refresh it when the database is updated by the worker (a mini write-through on the aggregated count update).
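The cache behaviour in steps 2 and 5 (atomic increment plus a short TTL) can be sketched with an in-process stand-in for Redis or Memcached. `CounterCache` is a hypothetical class, not a real client API. Two choices here are assumptions: each increment refreshes the TTL, and an increment on a missing or expired key returns `None` instead of starting from zero, so the caller can reload the true count from the database first.

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple


class CounterCache:
    """In-process stand-in for a cache with atomic increments and a TTL."""

    def __init__(self, ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[int, float]] = {}  # key -> (count, expires_at)

    def put(self, key: str, count: int) -> None:
        with self._lock:
            self._entries[key] = (count, self._clock() + self._ttl)

    def incr(self, key: str, delta: int = 1) -> Optional[int]:
        """Add `delta` under the lock; return None if the key is missing or expired."""
        with self._lock:
            now = self._clock()
            entry = self._entries.get(key)
            if entry is None or entry[1] <= now:
                self._entries.pop(key, None)
                return None
            count = entry[0] + delta
            self._entries[key] = (count, now + self._ttl)
            return count
```

The lock plays the role that a single atomic command such as Redis INCR plays in a real deployment: two simultaneous likes both land, and neither overwrites the other.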

Read Path: For any client retrieving the like/dislike count, the application will read from the cache. If the cache has a value (not expired), it returns that almost instantly. This ensures that users always see the most up-to-date count (including recent likes) with low latency, as long as the cache is being kept in sync. If the cache entry expired or is missing (cache miss), the service can fall back to the database: fetch the persisted count from DB (which might be slightly behind), return it, and repopulate the cache with that value (cache-aside pattern). However, because of the short TTL and continuous updates, such cache misses for a hot item would be rare. Essentially, the cache acts as the primary source for reads, with the DB as backup and long-term storage.
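A minimal sketch of this cache-aside read, using a plain dictionary as the cache. `read_count` and the `load_from_db` callback are hypothetical names, and the 5-second default TTL is taken from the example above.

```python
import time
from typing import Callable, Dict, Tuple

CacheEntry = Tuple[int, float]  # (count, expires_at)


def read_count(
    cache: Dict[str, CacheEntry],
    content_id: str,
    load_from_db: Callable[[str], int],
    ttl_seconds: float = 5.0,
    now: Callable[[], float] = time.monotonic,
) -> int:
    """Return the count from cache, falling back to the database on a miss."""
    entry = cache.get(content_id)
    if entry is not None and entry[1] > now():
        return entry[0]  # cache hit: no database access
    count = load_from_db(content_id)  # miss or expired: read the persisted count
    cache[content_id] = (count, now() + ttl_seconds)  # repopulate for later readers
    return count
```

For a hot item, many concurrent readers can miss at the same moment when the entry expires; this sketch does nothing to collapse those duplicate database reads.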

Pros:

The Kafka + Cache approach provides the best of both worlds – responsive updates and scalable processing:

Cons:

While this approach is powerful, it adds more components to manage:

Use Cases: The revised third approach combines Kafka for scalable, reliable processing with caching for instant user feedback and fast reads. It tolerates failures: an outage of any single component (cache or worker) does not permanently lose data or leave counts wrong for long. The use of short TTLs, cache invalidation, and idempotent message processing collectively address the consistency challenges, making this approach suitable for large-scale systems where both performance and accuracy are required. Users get a snappy experience, and the system can handle the load and recover from issues gracefully, at the cost of a more complex architecture that must be carefully managed.

High-level Design: Asynchronous with Caching

Database Sharding & Partitioning

Both the UserLikes store and the ContentStats store must be partitioned to scale out. We cannot rely on a single monolithic database.
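One common way to spread records across shards is to hash a partition key. The sketch below assumes hash partitioning with a fixed shard count, keyed by `user_id` for UserLikes and by `content_id` for ContentStats. Both the key choices and the `shard_for` helper are illustrative assumptions rather than a prescribed scheme.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a partition key to a shard index in [0, num_shards)."""
    # Use a stable hash: Python's built-in hash() is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Assumed partition keys: user_id for UserLikes, content_id for ContentStats.
user_likes_shard = shard_for("alice", 16)
content_stats_shard = shard_for("abc123", 16)
```

Plain modulo hashing remaps most keys when the shard count changes; consistent hashing is a frequently used alternative when shards are added or removed often.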

UserLikes (Per-User actions) Partitioning:

ContentStats (Counts) Partitioning:
