
Designing Google News (medium)


Let's design a news aggregator similar to Google News. This system will collect news articles from many external sources (newspapers, news channels, blogs, etc.) and present them to users in one consolidated feed. The goal is to deliver a personalized, near-real-time news feed based on each user’s preferences such as subscribed channels, categories (like politics, sports, tech), or particular entities (e.g. a celebrity, sports team, or artist). The system should support both pull (user refreshing a feed) and push (notifications for new articles) delivery modes, ensuring users quickly see breaking news or updates in their subscribed topics. We aim to achieve this at web-scale, meaning the design must accommodate hundreds of millions of users and very high data throughput, all while maintaining low latency and high availability.

Entities and Concepts

Google News - Key Entities

Example Scenario: A user subscribes to “Tech” category and “Apple Inc.” entity. When Apple releases a new product and various news sources publish articles about it, the aggregator ingests those articles, classifies them as “Tech” and tags “Apple Inc.” as an entity. The system will then deliver these to the user’s feed. If push is enabled, the user might get a notification on their phone saying “New article: Apple launches X” within seconds of publication.

2.1 Functional Requirements

2.2 Non-Functional Requirements

With requirements established, we can now estimate the scale in concrete terms to guide design decisions.

Before designing the architecture, it’s important to gauge the scale of data and traffic the system must handle. Below are rough capacity estimates for a large-scale news aggregator:

| Aspect | Estimate (assumption) |
| --- | --- |
| Daily Active Users (DAU) | ~10 million global DAUs (order of millions). |
| Peak Concurrent Users | On the order of hundreds of thousands concurrently (spread globally). Peak traffic likely during morning/evening news hours in each region. |
| Feed Requests (QPS) | ~1,000–5,000 queries/sec (globally) during peak. Each active user might refresh or scroll their feed multiple times per day. For example, 10M users * ~5 feed refreshes/day = 50M fetches/day (≈580 QPS average; peaks higher). |
| Push Notifications | Potential bursts of millions of notifications for major breaking news. E.g., a global breaking story might trigger a push to 1M+ users. The system’s notification service must fan out to many devices efficiently (likely via OS push services in batches). |
| New Articles Ingested | 50,000 – 100,000 articles per day (estimated). Google News, for example, aggregates content from 20,000+ publishers. If each publisher posts several articles daily, the ingestion rate could average ~1 article per second, with peaks during news events. |
| Max Ingestion Throughput | During breaking news bursts, the system may see spikes (e.g. hundreds of articles per minute across sources). The ingestion pipeline should handle thousands of articles per hour. |
| Data Storage (Articles) | Storing full content for millions of articles. Assuming ~100K articles/day, that’s ~36 million/year. If an article record (metadata + text) averages 5 KB, one year of news is ~180 GB. For safety, plan for terabytes of storage over years (including media). Older articles might be archived to colder storage after 1-2 months (Google News, e.g., focuses on ~44 days). |
| Data Storage (Users) | With 10M users, storing profiles, preferences, and feed pointers is required. If each user record is small (a few KB including subscription lists), that’s on the order of tens of GB for user data. |
| Subscription Graph | Each user might subscribe to, say, 10-20 topics/authors on average. That’s ~100M subscription relationships. Storing this in a database or cache is feasible (hundreds of millions of rows). Hot topics (like “World News”) might have millions of followers, which is crucial for fan-out considerations. |
| Read/Write Ratio | The system is read-heavy. For each article ingested (write), there are many feed reads. Write QPS (ingestion) might be ~1-5 QPS average, whereas read QPS (feed fetches, search queries) is in the thousands. This justifies heavy caching for reads. |
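
The arithmetic behind these estimates can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the inputs are the assumptions stated above (10M DAU, ~5 refreshes per user per day, the upper bound of 100K articles/day, 5 KB per article record, 10 subscriptions per user), and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Feed read traffic (assumed inputs from the estimates above).
dau = 10_000_000
refreshes_per_user_per_day = 5
feed_fetches_per_day = dau * refreshes_per_user_per_day
avg_feed_qps = feed_fetches_per_day / SECONDS_PER_DAY  # roughly 579 QPS on average

# Ingestion and article storage, using the upper bound of 100K articles/day.
articles_per_day = 100_000
avg_ingest_per_sec = articles_per_day / SECONDS_PER_DAY  # a bit over 1 article/sec
articles_per_year = articles_per_day * 365
article_size_kb = 5
article_storage_gb_per_year = articles_per_year * article_size_kb / 1_000_000  # KB -> GB

# Subscription graph, using the lower bound of 10 subscriptions per user.
subs_per_user = 10
subscription_rows = dau * subs_per_user

print(f"feed fetches/day:        {feed_fetches_per_day:,}")
print(f"average feed QPS:        {avg_feed_qps:.0f}")
print(f"average ingest rate:     {avg_ingest_per_sec:.2f} articles/sec")
print(f"articles/year:           {articles_per_year:,}")
print(f"article storage/year:    {article_storage_gb_per_year:.1f} GB")
print(f"subscription rows:       {subscription_rows:,}")
```

Peak figures are not derived here; the 1,000–5,000 QPS peak range is a multiple of the average that would need to be validated against real traffic patterns.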

Implications: The high read volume and global distribution mean we need robust caching and a content delivery network. The large number of subscriptions suggests that a naive approach of immediately pushing every new article to every follower could overwhelm the system (especially for topics with millions of followers). We will need to design an efficient fan-out/fan-in mechanism and possibly limit how we propagate updates for very popular content (see Hybrid push/pull in Section 6). The storage estimates show that text content is manageable, but media (images/videos) should be offloaded to cloud storage or CDNs. Additionally, the system’s databases must be sharded or clustered to handle the volume of data and traffic.

At a high level, our news aggregator is composed of two main pipelines that work in tandem:

  1. Content Ingestion & Processing Pipeline – Responsible for collecting articles from external sources, processing them (normalize, deduplicate, categorize), and storing them in a content repository. This pipeline ensures we have a clean, up-to-date collection of news articles with metadata.
  2. Feed Generation & Delivery Pipeline – Responsible for delivering the right content to each user. It takes the processed articles plus the user’s subscription info to produce a personalized feed. It supports on-demand fetch (pull) and real-time push notifications. It includes services for subscription management, feed assembly, and notification dispatch.

Major Components Overview:

1. Ingestion Layer (Content Collection): External news sources feed into the system through a scalable ingestion framework:

High-Level Architecture of Google News

2. Processing & Storage Layer: A set of backend services consume the article events from the queue and perform processing:

At the end of processing, the system has persisted the new article and tagged it appropriately. Now it’s ready to be delivered to users who want such content.

3. Feed Delivery Layer (Personalization & Distribution): This is the heart of how we get relevant content to users.

Data Flow Summary: When a news article is published by a source, our ingestion layer picks it up (within seconds or minutes), processes it, and stores it. This triggers updates: the feed service (push component) may update many user feeds or relevant caches, and the notification service may alert users. When a user opens their app (pull), the feed service quickly retrieves a personalized list (either from the precomputed feed or by querying recent content) and returns it. The user sees a timeline of articles and can click through to read one (likely fetched from cache or the source). If a new article arrives while they’re browsing, a live update or notification can appear.
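
The pull path in this summary (serve the precomputed feed if one exists, otherwise query recent content) might look roughly like the sketch below. The dictionaries are hypothetical in-memory stand-ins for the feed cache, the per-category article index, and the user preference store; the names `precomputed_feeds`, `recent_by_category`, `user_categories`, and `get_feed` are illustrative, not part of the design above.

```python
from typing import Dict, List

# Hypothetical stand-ins: a real system would back these with a cache and a database.
precomputed_feeds: Dict[str, List[str]] = {"u1": ["a3", "a2"]}
recent_by_category: Dict[str, List[str]] = {"tech": ["a3", "a1"], "sports": ["a2"]}
user_categories: Dict[str, List[str]] = {"u1": ["tech"], "u2": ["tech", "sports"]}


def get_feed(user_id: str, limit: int = 20) -> List[str]:
    """Return article IDs for a user, preferring the precomputed (push-built) feed."""
    cached = precomputed_feeds.get(user_id)
    if cached:
        return cached[:limit]

    # Fallback (pull): merge recent articles from each subscribed category,
    # dropping duplicates while preserving first-seen order.
    seen = set()
    merged: List[str] = []
    for category in user_categories.get(user_id, []):
        for article_id in recent_by_category.get(category, []):
            if article_id not in seen:
                seen.add(article_id)
                merged.append(article_id)
    return merged[:limit]
```

The fallback here concatenates categories rather than sorting by time or score; a real feed service would rank the merged candidates before returning them.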

This high-level design is modular: each component (crawling, processing, feed assembly, notifications) can be scaled independently.

Article Data: Each news article can be represented by a data object with fields like: article_id, title, content_snippet, url, source_id, publish_time, category, keywords, image_url, etc. We might also store a vector for content (if doing recommendations) or flags (trending, breaking, etc.).
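
As one possible shape for this record, the fields listed above map naturally onto a small data class. The field names come from the list above; the types, the defaults, and the `embedding`, `is_trending`, and `is_breaking` names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Article:
    article_id: str
    title: str
    content_snippet: str
    url: str
    source_id: str
    publish_time: datetime
    category: str
    keywords: List[str] = field(default_factory=list)
    image_url: Optional[str] = None
    # Optional extras mentioned above; names and types are assumed.
    embedding: Optional[List[float]] = None  # content vector for recommendations
    is_trending: bool = False
    is_breaking: bool = False


example = Article(
    article_id="a1",
    title="Apple launches X",
    content_snippet="Apple announced a new product today...",
    url="https://example.com/apple-launches-x",
    source_id="example-news",
    publish_time=datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    category="Tech",
    keywords=["Apple Inc."],
)
```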

User Data: Each user has a profile with preferences. This data is likely stored in a relational database, since user data is relatively small in scale and needs ACID guarantees for things like credentials. We can have tables:

To support fan-out (push model), we need efficient lookup of “which users are subscribed to X category or source.” This is essentially the inverse mapping of UserPreferences:
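
A minimal sketch of that inverse mapping, assuming preferences are available as a user-to-topics dictionary. The topic key format (`"category:tech"`, `"source:nytimes"`) and the function name are hypothetical; in production this mapping would live in a database or cache rather than in process memory.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_inverse_index(user_preferences: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Invert user -> topics into topic -> subscribed users."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for user_id, topics in user_preferences.items():
        for topic in topics:
            index[topic].add(user_id)
    return index


# Usage: look up who should receive a new "Tech" article.
prefs = {
    "u1": ["category:tech", "source:nytimes"],
    "u2": ["category:tech"],
    "u3": ["category:sports"],
}
subscribers = build_inverse_index(prefs)
tech_followers = subscribers["category:tech"]  # the set {"u1", "u2"}
```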

Summary of storage tech and purpose:

6.1 News Ingestion and Deduplication Pipeline

Overview: The content ingestion subsystem is essentially the “web crawler” and parser for the aggregator. Its job is to gather fresh articles from all the external sources continuously and robustly.

  1. Source Discovery & Administration: The system maintains a catalog of sources – RSS feed URLs, API endpoints, sitemaps, etc. An admin interface can add new sources or categories. This metadata (mapping of publishers to feed URLs, categories, and possibly crawling schedules) is stored in a configuration DB. Administrators can map each feed to a category or region (e.g., “NYTimes Technology RSS” -> category Tech, region USA).

  2. Fetching (Crawlers): A fleet of crawler services continuously checks each source for updates. For RSS feeds and APIs, this might be a periodic poll. For example, each RSS URL could be polled every few minutes (the frequency might depend on the publisher’s update rate). To reduce redundant fetching, a Feed Manager can issue conditional requests using the HTTP Last-Modified or ETag headers, or compute a hash of the feed body to see if it changed. For sites without RSS, web scrapers parse HTML pages or use sitemaps. The crawlers push any newly found article reference into the pipeline:

    • Each new article (identified by URL or ID) results in a message (e.g. JSON containing the URL and possibly partial info) being published to the crawl queue (or directly to the processing queue if simpler). The system uses a Bloom filter cache to quickly check if a URL was seen before, avoiding duplicate crawling of the same article. This in-memory Bloom filter (or a distributed cache of recent URLs) prevents re-processing content and saves database lookups. Because a Bloom filter can report false positives but never false negatives, a small fraction of new URLs may be wrongly treated as already seen unless a “seen” answer is confirmed against the database.
  3. Content Retrieval: For each new article URL enqueued, a worker (let’s call it Article Fetcher) retrieves the full content. This might involve fetching the HTML page or calling a content API. The raw content (HTML, images, etc.) is saved to a Raw Content Blob Store temporarily. Storing raw HTML is useful for parsing and also as a backup (and for legal/archive purposes).
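
The URL-level check described in the fetching step can be sketched with a small hand-rolled Bloom filter. This is an illustration under assumptions, not a production implementation: the bit-array size, the number of hash functions, the use of seeded SHA-256, and the `is_new_url` helper are all choices made for this example.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def is_new_url(url: str, seen: BloomFilter) -> bool:
    """Return True (and record the URL) only if the filter has not seen it."""
    if url in seen:
        return False  # probably seen; could be a rare false positive
    seen.add(url)
    return True
```

With `seen = BloomFilter()`, the first call to `is_new_url(url, seen)` for a given URL returns `True` and enqueues it for crawling; repeated calls return `False`. The filter is sized generously here; in practice the bit count and hash count would be derived from the expected number of URLs and the acceptable false-positive rate.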

Content Ingestion Pipeline
  4. Parsing & Extraction: The Content Parser workers take the raw content and extract structured data: title, body text, author (if available), publish time, and media URLs. This step may use HTML parsing libraries or publisher-specific rules (since not all sites follow the same format). It cleans the content (removing HTML tags, ads, etc.) to produce clean text. The parser also applies the categorization logic (possibly invoking an NLP classifier or simply using the source’s known category from the RSS feed). The result is a structured article object ready to store.

  5. Deduplication and Merge: After parsing, as an extra safeguard, the system can check whether a very similar article already exists (e.g., two sources sometimes post the same news agency text). A content hash or similarity check can be used. True duplicates are discarded or merged (for example, by incrementing a count of sources). In most cases, the earlier URL-based dedup suffices.

  6. Storing Processed Article: The structured article is stored in the Article Database/Index. This is a critical write path. The storage operation includes indexing the article by keywords (for search) and by category. It also writes the mapping of the article to its content blob (if full content is stored separately). If using Elasticsearch, for example, this is an index write; if using a SQL or NoSQL store, it’s an insert. We also update any necessary secondary indexes (like a reverse index for search terms or a time-sorted list for the category).

  7. Publishing New-Article Event: Once an article is stored, an event is published on an internal New Article Pub/Sub topic (e.g. a Kafka topic). This event contains the article ID and metadata (category, maybe a short summary). It is used to notify other parts of the system (feed generator, cache updater, notifications). The event could be consumed by:

    • Feed update service (to handle fan-out).
    • Search indexer (if indexing is not done synchronously, it can be done by a consumer).
    • Analytics (e.g. to count new articles toward trending topics).
    In short, the pipeline so far is: crawl -> parse -> store -> notify.
  8. Third-Party API Ingestion: In cases where we use an external news API (one that provides news in bulk, e.g. “give me the latest 100 news items in Tech”), the flow is a bit different: a scheduler calls the API, gets a batch of articles, and then runs each article through the dedup, parse (the data might already be structured by the API), and store steps. This can bypass some crawling steps.

This pipeline must be fault-tolerant. Using message queues ensures that if a component fails, messages aren’t lost and can be retried. Because retries mean the same message may be processed more than once, parsing and storing should be idempotent: using the URL or an external ID as a unique key ensures a repeated message does not duplicate the article in the DB.
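
The idempotent store and the content-hash check from the deduplication step can be sketched together. The example uses SQLite from the Python standard library purely as a stand-in for the Article Database; the table layout, the normalization rule (lowercase, collapsed whitespace), and the function names are assumptions for illustration.

```python
import hashlib
import sqlite3


def content_hash(text: str) -> str:
    """Hash of the body after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # The URL is the unique key, so re-processing a message cannot add a second row.
    conn.execute(
        "CREATE TABLE articles ("
        "url TEXT PRIMARY KEY, title TEXT NOT NULL, body_hash TEXT NOT NULL)"
    )
    return conn


def store_article(conn: sqlite3.Connection, url: str, title: str, body: str) -> bool:
    """Insert once per URL; returns True only when a new row was written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title, body_hash) VALUES (?, ?, ?)",
        (url, title, content_hash(body)),
    )
    conn.commit()
    return cur.rowcount == 1


def is_duplicate_body(conn: sqlite3.Connection, body: str) -> bool:
    """True if an already stored article has the same normalized body text."""
    row = conn.execute(
        "SELECT 1 FROM articles WHERE body_hash = ? LIMIT 1", (content_hash(body),)
    ).fetchone()
    return row is not None
```

An exact hash only catches identical text after normalization. Detecting near-duplicates (lightly edited wire copy, for example) would need a similarity technique such as shingling or MinHash, which is outside this sketch.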

6.2 Real-Time Feed Generation Engine (Pull & Push)

Feed Assembly: When it’s time to deliver content to a user, we need to gather all relevant articles for that user’s interests and sort them. There are two primary approaches to assemble the feed:

Given our scale, we likely use a hybrid approach:

Ranking/Personalization: Once we have the candidate articles for a user (from either method), we need to rank them. A basic ranking algorithm could be:

This ranking can be implemented via a scoring function. For example: score = w1*Recency + w2*PersonalInterest + w3*GlobalTrending, where weights tune what matters. In advanced systems, a machine learning model (trained on past engagement) could predict the probability of the user clicking each candidate, and rank by that. But that requires collecting interaction data and retraining models – an advanced extension.
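
As a concrete illustration of the weighted formula above, the sketch below scores candidates and sorts them. The exponential half-life decay for recency, the 0..1 normalization of the interest and trending signals, and the default weights are all assumptions chosen for the example, not values prescribed by the design.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    article_id: str
    age_hours: float
    personal_interest: float  # assumed to be normalized to 0..1
    global_trending: float  # assumed to be normalized to 0..1


def recency(age_hours: float, half_life_hours: float = 6.0) -> float:
    """Decay from 1.0 toward 0.0, halving every half_life_hours."""
    return 0.5 ** (age_hours / half_life_hours)


def score(c: Candidate, w1: float = 0.5, w2: float = 0.3, w3: float = 0.2) -> float:
    """score = w1*Recency + w2*PersonalInterest + w3*GlobalTrending"""
    return w1 * recency(c.age_hours) + w2 * c.personal_interest + w3 * c.global_trending


def rank(candidates: List[Candidate]) -> List[Candidate]:
    return sorted(candidates, key=score, reverse=True)
```

With these weights, a brand-new article of modest interest can outrank a 12-hour-old article that matches the user's interests closely; shifting weight from `w1` to `w2` reverses that preference.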

Personalization Pipeline: To support the above, we’d include:

For this design, it suffices to mention the concept; implementing it fully is complex. The system can start with simple heuristics and evolve to ML-based ranking as data grows.

Feed Update Mechanism: When using the push model, feed assembly happens continuously. For example, upon a new article event, the system fetches the list of subscribers and updates each subscriber’s feed data (for instance, by pushing the article ID onto a sorted feed list). This could be done by a Feed Update Service that consumes the article event queue. For efficiency, this service might chunk the fan-out, e.g. retrieving 1000 subscriber IDs at a time and doing bulk insertions. Using in-memory data structures or fast caches is key to handling high fan-out.
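
The chunked fan-out described here can be sketched as follows. The in-memory `feeds` dictionary stands in for whatever cache holds per-user feed lists, and the cap of 500 entries per feed, the helper names, and the newest-first ordering are assumptions for this example.

```python
from collections import defaultdict, deque
from itertools import islice
from typing import Deque, Dict, Iterable, Iterator, List

MAX_FEED_LEN = 500  # assumed cap on stored feed entries per user

# Hypothetical in-memory feed store: user_id -> newest-first article IDs.
feeds: Dict[str, Deque[str]] = defaultdict(lambda: deque(maxlen=MAX_FEED_LEN))


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fan_out(article_id: str, subscriber_ids: Iterable[str], chunk_size: int = 1000) -> int:
    """Push an article onto each subscriber's feed, one chunk at a time.

    Returns the number of chunks processed.
    """
    batches = 0
    for batch in chunked(subscriber_ids, chunk_size):
        # In a real service this inner loop would be one bulk write to the feed cache.
        for user_id in batch:
            feeds[user_id].appendleft(article_id)
        batches += 1
    return batches
```

Because each feed is a bounded deque, pushing a new article onto a full feed silently drops the oldest entry, which keeps memory per user constant.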

6.3 Push Notifications & Real-Time Updates

The system supports two forms of pushing updates to users:

Both real-time updates and push notifications need to be robust. If a user’s device is offline, the push is queued by the OS push service; if our own service is temporarily down, we might lose some real-time messages, which is acceptable because the user can still pull manually. We ensure that on reconnect the client does a full sync.

6.4 APIs for Frontend/Mobile Clients

The system will expose a set of APIs (or GraphQL endpoints) for client applications to interact with. We should design these APIs to be efficient and secure, and consider the high QPS they will face.

Key APIs and their design:

6.5 Example Data Flows

To illustrate, here are a couple of concrete scenarios with our design:

Designing for web-scale involves applying many techniques to handle high load and ensure reliability. Here we highlight the key strategies and justify our choices:

Trade-offs discussion: In our design, we traded some consistency and complexity for performance:

In conclusion, this design leverages a pipeline architecture, distributed data stores, caching, and a careful choice of push vs pull to meet the requirements. It should be able to deliver personalized news to users in near real-time (a few seconds of latency) even under high load, with eventual consistency ensuring the system remains robust and available. The architecture is modular and scalable, much like those of large-scale social feed and aggregator systems in production today, and reflects best practices a senior engineer would apply to such a problem.
