
Designing a Notification Service (difficulty: medium)

Let's design a notification system that works seamlessly for various applications, whether it's a social media platform, an e-commerce site, or a productivity tool.


A notification system is a centralized platform that delivers alerts and messages (notifications) to users on behalf of multiple separate applications (tenants). It acts as a shared service that any application – be it a social network, an e-commerce site, or a productivity tool – can use to send notifications to its users. For example, think of a system like Amazon Simple Notification Service (SNS) or Firebase Cloud Messaging, which many different apps leverage to inform users about events (like a new message, an order shipped, or a calendar reminder).

Real-world analogy: Consider how Facebook notifies you of a friend’s post, Amazon notifies you about your package delivery, and Slack alerts you about a new mention. A multi-tenant notification service is a single system capable of handling all those scenarios for many companies at once, ensuring each tenant’s data and users stay isolated from each other.

Key Entities/Terms:

Before diving into design, let’s clarify what this notification service must do and the constraints it must meet.

Functional Requirements (Features):

Non-Functional Requirements (Scale, Performance, Constraints):

With these requirements, we have a clear target: a fast, reliable notification service that can handle all tenants’ events and deliver via the appropriate channels, respecting user preferences, even at a massive scale.

Let’s estimate the expected scale to ensure our design can handle the load. (These are rough numbers to guide our choices.)

These estimations highlight the need for aggressive horizontal scaling (lots of parallel processing), efficient data partitioning, and asynchronous processing. The system must be distributed to handle peak loads and large storage.

Now let’s design the architecture with these numbers in mind.

We expose a REST API.

1. Send Notification

Request body:
{
  "user_id": "u_12345",
  "template_id": "order_shipped_v1",
  "priority": "high",  // or "low"
  "idempotency_key": "uuid-gen-by-client",
  "data": { "order_id": "999", "name": "Alice" }
}

Response:
{ "notification_id": "notif_abc123" }
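A minimal sketch of how the service might validate this request body before accepting it. The field names come from the example payload; the `validate_send_request` function, the default priority, and the error messages are illustrative assumptions rather than part of a defined API.

```python
from typing import Any

# Assumption: the example payload only shows "high" and "low" as priorities.
ALLOWED_PRIORITIES = ("high", "low")
# Assumption: these three string fields are mandatory.
REQUIRED_FIELDS = ("user_id", "template_id", "idempotency_key")


def validate_send_request(body: dict[str, Any]) -> list[str]:
    """Return a list of validation errors; an empty list means the body is acceptable."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = body.get(field)
        if not isinstance(value, str) or not value:
            errors.append(f"{field} must be a non-empty string")
    # Assumption: priority defaults to "low" when omitted.
    if body.get("priority", "low") not in ALLOWED_PRIORITIES:
        errors.append("priority must be 'high' or 'low'")
    if not isinstance(body.get("data", {}), dict):
        errors.append("data must be an object")
    return errors


request = {
    "user_id": "u_12345",
    "template_id": "order_shipped_v1",
    "priority": "high",
    "idempotency_key": "uuid-gen-by-client",
    "data": {"order_id": "999", "name": "Alice"},
}
print(validate_send_request(request))
print(validate_send_request({"priority": "urgent", "data": []}))
```

Returning every error at once, rather than failing on the first, lets a caller fix a malformed request in one round trip.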

2. Get Status

3. Update Preferences

4. High-Level Design

At a high level, the notification system will be designed as a distributed, event-driven pipeline with multiple components cooperating asynchronously. The core idea is to decouple the act of generating a notification from the act of delivering it using a reliable queue, which allows the system to scale and to buffer bursts. Different components will handle: intake of requests, scheduling, queueing, processing per channel, and persistence. Each component can scale horizontally and be managed independently. Here's an outline of the architecture:

[Figure: High-level Design]

Data Flow (Notification Request to Delivery)

Let's walk through how a notification travels through the system in real-time mode, and in batch mode:

  1. Request Ingestion: An external application (producer) makes an API call to send a notification. For example, an e-commerce platform calls POST /sendNotification with the user’s ID, notification type (say, "OrderShipped"), and perhaps some content or template data. This request hits the API Gateway, which authenticates it (ensuring the caller is allowed to send notifications) and forwards it to one of the Notification Service instances.
  2. Request Handling in Notification Service: The Notification Service receives the request. It validates the payload – confirming required information is present (such as a recipient identifier, message content or template, and at least one channel) and that it’s well-formed. It might also perform authentication/authorization checks if not done earlier. After validation, it quickly acknowledges receipt (especially if this is a synchronous API call) so the upstream caller can proceed, while the notification is handled asynchronously thereafter.
  3. Preference & Routing Logic: The Notification Service then determines how to route this notification. It queries the User Preference Service (or reads from a cached preferences store) to get the user’s current notification settings. This yields information such as: which channels the user has enabled, any channel overrides for this notification type, and whether the user has hit any rate limits or snooze settings. For example, if the user opted out of email for promos, and this is a promotional notification, the service would exclude email from the channels. It also checks system-wide rules (e.g., not sending more than N marketing notifications to a user per day) to decide if this notification should be throttled or dropped.
  4. Message Preparation: Now, the service knows which channels to send the notification through (based on the request and preferences). It prepares the content for each channel. If templates are used, the service fetches the template for, say, email and fills in the dynamic fields (like user name, order details). Each channel may require a slightly different payload – e.g., an email needs a subject and HTML body, whereas an SMS needs a short text, and a push notification might have a title, body, and custom data for the app. The Notification Service can transform the request into channel-specific messages.
  5. Enqueueing Messages: The service then enqueues the notification for delivery. Since there’s no priority among channels, it enqueues a separate message for each channel the notification should go out on. This can be done in two ways: (a) using a unified queue/topic with the channel as part of the message data, or (b) using separate queues or topics for each channel (e.g., an "Email" queue, "SMS" queue, etc.). A common design is option (b), for instance one Kafka topic per channel with each topic split into partitions, so that channel workers can consume their respective messages independently. For example, if a notification needs to go via Email, SMS, and Push, three messages are produced: one to the Email topic, one to the SMS topic, one to the Push topic. Each message includes the content and metadata needed for that channel (like the email subject or the SMS text, destination addresses, etc., along with a notification ID and perhaps a retry count). By pushing messages to a queue, we decouple the immediate request handling from the slower delivery process: the API call can complete quickly, and actual sending happens asynchronously.
  6. (Scheduled Notifications): If the request included a future send time or it’s part of a batch campaign, the Notification Service would invoke the Scheduler instead of immediately enqueueing for delivery. For a scheduled notification, the service stores the request (for example, in a Scheduler Service database or a delayed-queue) with the desired send timestamp. The Scheduler continuously scans or waits until the send time is reached, then moves those pending notifications into the regular delivery queue (feeding into step 5 at the appropriate time). This way, scheduled notifications join the same pipeline as real-time ones when their time comes. Batched campaigns might be handled by inserting many notifications into the scheduler or queue in bulk (possibly with their own tools to generate a large list of user-specific messages, as was done in the Duolingo Super Bowl campaign using precomputed user lists in S3).
  7. Queue Processing and Channel Delivery: The various Channel Processor services are subscribed to the Notification Queue (or specific topics). Each channel processor continuously pulls messages intended for its channel. For example, an Email Worker instance will read from the Email topic/queue. Once a message is pulled, the worker processes it: it establishes a connection to the relevant delivery provider or service and attempts to send the notification. This involves using channel-specific protocols or third-party APIs:
    • The Email Processor might call an email sending service or SMTP server (such as SendGrid, Mailgun, or Amazon SES) to send out the email. It will format the email (HTML or text) if not already formatted, handle attachments or images, and send the message to the recipient’s email address.
    • The SMS Processor will send the SMS text via an SMS gateway API (like Twilio or Nexmo). It may need to ensure the text fits within SMS length limits or split long messages, handle country codes, etc.
    • The Push Notification Processor uses push services such as Firebase Cloud Messaging (FCM) for Android and Apple Push Notification service (APNs) for iOS. It prepares the payload (title, body, icon, any custom data) and sends it to the respective service, which will then deliver it to the app on the user’s device.
    • The In-App Notification Processor will write the notification to the in-app notifications store (database) and if the user is currently online/connected, it can also deliver it in real-time via a persistent connection (for example, via a WebSocket message to the user’s session). This ensures the notification appears inside the app immediately if the user is active, and is stored for later retrieval if the user is offline.
      Each of these processors operates independently, so an email notification could be sending at the same time as the SMS and push for the same event – there’s no requirement to do them in sequence. This parallelism improves overall latency for multi-channel notifications.
  8. Delivery Confirmation & Logging: After attempting to send, each channel processor gets a result – success or failure. For channels with synchronous APIs, the result is immediate; for some (like email), there might be callbacks or events for delivery/bounce later, but at least initial send status is known. The processor then logs the outcome of the notification into a Notification Logs store. A log entry typically includes: notification ID, user ID, channel, timestamp, status (sent, delivered, failed, etc.), and perhaps an error code or provider response if failed. These logs enable tracking and auditing – e.g. for customer support queries “I didn’t get this email” or for system health metrics. If the send was successful, the log is marked delivered; if it failed, the system may mark it for retry (and log an initial failure with maybe a pending status). All log entries are stored in a database built to handle high write volume. This could be a relational DB table or a time-series store or even a log indexing system, depending on query needs.
  9. User Notification Retrieval: For in-app notifications, the final step is allowing the user to read their notifications. The app’s client (e.g., mobile app or web frontend) can call an API (possibly the Notification Service or a dedicated Notification Query Service) to fetch recent notifications for that user. The service will query the Notifications DB for that user’s records (potentially using a cache for speed) and return them so the client can display, for example, a list of notifications (read/unread status, etc.). This read path is separate from the write path of sending notifications to maintain high throughput – often even separated into a different service or read replica to not impact send performance. The query service can also support features like marking notifications as read, or searching within notifications, but those are extensions beyond core delivery.
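Steps 3 to 5 above (preference check, then one queued message per enabled channel) can be sketched roughly as follows. The `ChannelMessage` type, the `route` function, and the `notifications.<channel>` topic naming are hypothetical names chosen for illustration; a real service would publish each message to its queue instead of returning a list.

```python
from dataclasses import dataclass


@dataclass
class ChannelMessage:
    """One queued message per channel, carrying the shared notification ID."""
    notification_id: str
    user_id: str
    channel: str
    topic: str
    attempt: int = 0  # retry count travels with the message


def route(notification_id, user_id, requested_channels, preferences):
    """Keep only channels the user has enabled and build one message per channel."""
    messages = []
    for channel in requested_channels:
        # Assumption: a channel missing from the preferences is treated as disabled.
        if preferences.get(channel, False):
            messages.append(
                ChannelMessage(
                    notification_id=notification_id,
                    user_id=user_id,
                    channel=channel,
                    topic=f"notifications.{channel}",  # hypothetical topic name
                )
            )
    return messages


prefs = {"email": False, "sms": True, "push": True}
msgs = route("notif_abc123", "u_12345", ["email", "sms", "push"], prefs)
print([m.topic for m in msgs])
```

Because each message is self-contained, the email, SMS, and push workers can process their copies without coordinating with one another.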

Throughout this flow, no single channel is prioritized or blocks another – each channel’s delivery is handled independently via the queue. This design ensures the fastest possible dispatch on all channels and isolates any slowness (for instance, an email service lag) to that channel’s worker without holding up others.

6. Data Storage and Schema

In this hybrid notification system, SQL tables store structured, relational data (like user info, preferences, and templates) for consistency and ease of management, while NoSQL tables handle high-volume, unstructured event data (notification history and statuses) for scalability. This approach leverages the strengths of both: relational databases ensure ACID compliance for critical metadata, and NoSQL databases provide flexible schemas and horizontal scaling to handle millions of notification records efficiently. Below is the schema design with each table’s fields, data types, and descriptions in tabular format, along with explanations of design considerations before and after the tables for clarity.

SQL Tables

The SQL tables manage static or slowly changing data such as tenant configurations, user profiles, notification preferences, templates, scheduled jobs, and audit logs. These tables are normalized and use clear relationships (foreign keys) to maintain data integrity. Each table is outlined with its fields:

tenants

Stores tenant (application) details. Each tenant represents a client application or organization using the notification system.

| Field Name | Data Type | Description |
| --- | --- | --- |
| tenant_id | INT (PK) | Unique identifier for the tenant (primary key). |
| name | VARCHAR(255) | Name of the tenant application or organization. |
| contact_email | VARCHAR(255) | Contact email for the tenant’s administrator or support. |
| created_at | DATETIME | Timestamp when the tenant was registered. |
| status | VARCHAR(50) | Status of the tenant (e.g., “active”, “inactive”), used to enable/disable notifications for this tenant. |

Explanation: The tenants table ensures multi-tenancy support by segregating data per application. It includes basic identification and contact info for each tenant, along with a status flag to control activity. This structured information is well-suited for a relational table since it’s relatively static and requires consistency.

users

Stores users who receive notifications. Users are associated with tenants.

| Field Name | Data Type | Description |
| --- | --- | --- |
| user_id | INT (PK) | Unique identifier for the user (primary key). |
| tenant_id | INT (FK) | Identifier of the tenant this user belongs to (foreign key to tenants). |
| name | VARCHAR(100) | Full name of the user. |
| email | VARCHAR(255) | Email address of the user (used for email notifications). |
| phone | VARCHAR(20) | Phone number of the user (used for SMS or phone notifications). |
| created_at | DATETIME | Timestamp when the user account was created. |
| status | VARCHAR(50) | Account status of the user (e.g., “active”, “disabled”). |

Explanation: The users table holds profile and contact information for notification recipients. It relates to the tenants table via tenant_id to ensure each user is tied to the correct tenant (application). Storing users in SQL guarantees referential integrity (e.g., notifications link to valid users) and makes it easy to join with preferences or templates if needed.

user_preferences

Stores per-user notification preferences, such as which types of notifications or channels a user has enabled or disabled.

| Field Name | Data Type | Description |
| --- | --- | --- |
| user_id | INT (PK, FK) | Identifier of the user (primary key, foreign key to users). Each user has one preferences record. |
| email_notifications | BOOLEAN | Whether the user wants to receive email notifications (true = opt-in). |
| sms_notifications | BOOLEAN | Whether the user wants to receive SMS/text notifications. |
| push_notifications | BOOLEAN | Whether the user wants to receive push/in-app notifications. |
| language_preference | VARCHAR(10) | (Optional) Preferred language or locale for notifications (e.g., "en", "es"). |
| updated_at | DATETIME | Timestamp of the last update to this user’s preferences. |

Explanation: The user_preferences table captures notification settings for each user. By separating preferences from the main user profile, the system can easily extend or modify preference fields without altering core user data. Each user has one row containing various preference flags (for different channels or categories of notifications). In a relational model, this structured approach simplifies queries for preference-controlled sends (for example, checking if a user has opted into email before sending).

notification_templates

Stores reusable message templates for notifications. Templates allow consistent formatting of messages and reuse across multiple notifications.

| Field Name | Data Type | Description |
| --- | --- | --- |
| template_id | INT (PK) | Unique identifier for the notification template. |
| tenant_id | INT (FK) | Tenant that owns this template (foreign key to tenants), allowing tenants to have custom templates. |
| name | VARCHAR(100) | Template name or key (e.g., “WelcomeEmail”, “PasswordReset”). |
| subject | VARCHAR(255) | Subject line for the notification (if applicable, e.g., for email notifications). |
| body | TEXT | Body content of the template, with placeholders for dynamic data (e.g., “Hello {username}, ...”). |
| channel | VARCHAR(50) | Channel/type of notification this template is for (e.g., “email”, “sms”, “push”). |
| created_at | DATETIME | Timestamp when the template was created. |
| updated_at | DATETIME | Timestamp when the template was last modified. |

Explanation: The notification_templates table enables defining message formats once and reusing them. Each template can be specific to a channel and personalized with placeholders. Storing templates in SQL ensures they can be updated transactionally (e.g., updating the content or disabling a template), with versioning if needed. Since templates are relatively static and shared data, an SQL table is appropriate for strong consistency (so that notifications always use the latest approved content for compliance).
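As a rough illustration of filling the `{username}`-style placeholders mentioned in the table, the sketch below uses Python's built-in `str.format_map`. The `render_template` helper and the sample template text are assumptions for illustration; a production system would more likely use a sandboxed template engine, since format strings from tenants should not be trusted blindly.

```python
def render_template(subject, body, data):
    """Fill {placeholder} fields in a template's subject and body.

    Raises ValueError naming the first placeholder that has no value.
    """
    try:
        # subject may be empty or None for channels such as SMS
        return (subject or "").format_map(data), body.format_map(data)
    except KeyError as exc:
        raise ValueError(f"missing template field: {exc.args[0]}") from exc


subject, body = render_template(
    "Order {order_id} has shipped",
    "Hello {username}, your order {order_id} is on its way.",
    {"username": "Alice", "order_id": "999"},
)
print(subject)
print(body)
```

Failing loudly on a missing field is a deliberate choice here: sending "Hello {username}" to a real user is usually worse than rejecting the request.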

scheduled_notifications

Stores notifications that are scheduled for future delivery (e.g., reminders or announcements that should be sent at a later time or on a specific schedule).

| Field Name | Data Type | Description |
| --- | --- | --- |
| schedule_id | INT (PK) | Unique identifier for the scheduled notification entry. |
| tenant_id | INT (FK) | Tenant context for the notification (foreign key to tenants), helpful for multi-tenant filtering. |
| user_id | INT (FK) | The intended recipient user’s ID (foreign key to users). |
| template_id | INT (FK) | Reference to the notification template to use (foreign key to notification_templates). |
| scheduled_time | DATETIME | Date and time when the notification is scheduled to be sent. |
| status | VARCHAR(50) | Current status of the scheduled notification (e.g., “scheduled”, “sent”, “canceled”). |
| created_at | DATETIME | Timestamp when this schedule was created. |
| created_by | VARCHAR(50) | Identifier of who scheduled the notification (could be a user ID or system/admin user). |

Explanation: The scheduled_notifications table holds messages that need to be sent in the future. Each entry ties a user with a template to send at a specific time. Using SQL here allows for reliable scheduling (with transactions to avoid duplicate scheduling or race conditions) and easy querying (e.g., find all notifications to send in the next hour). The status field tracks whether the job is still pending, already sent, or cancelled, and can be updated as the job is processed. When the scheduled time arrives, the system will retrieve these entries and then create actual notification events (which get recorded in the NoSQL notifications history).

audit_logs

Stores logs for compliance and tracking important system events. This includes records of notifications sent, preference changes, template edits, or any administrative actions, to provide an audit trail.

| Field Name | Data Type | Description |
| --- | --- | --- |
| log_id | BIGINT (PK) | Unique identifier for the audit log entry. |
| tenant_id | INT (FK) | Tenant associated with the event (foreign key to tenants), if applicable. |
| user_id | INT (FK) | User associated with the event (if applicable, e.g., the user who was notified or whose preferences changed). |
| action | VARCHAR(100) | Description of the action or event (e.g., “NOTIFICATION_SENT”, “PREFERENCE_UPDATED”, “TEMPLATE_MODIFIED”). |
| details | TEXT | Additional details about the event (e.g., notification content snippet, old vs new values for changes, error messages if any). |
| performed_by | VARCHAR(100) | Who performed the action (could be a user id, admin id, or system process name). |
| timestamp | DATETIME | Date and time when the event was logged. |

Explanation: The audit_logs table provides a historical record of significant events for compliance auditing and debugging. Storing these in SQL gives durable, ACID-backed writes. Immutability is not automatic, so the table should be treated as append-only (for example, by granting the application only INSERT and SELECT rights), which is important for compliance requirements. For example, whenever a notification is sent, an entry could be logged here (in addition to the NoSQL history) to satisfy regulatory needs that require easily queryable records of communication. It can also log user preference changes or template changes to track who did what and when. This structured log data can be filtered by tenant or user for audit reports.

NoSQL Tables

The NoSQL collections handle the high-scale data: the actual notifications sent to users and the tracking of their delivery/read status. Using a NoSQL database (like MongoDB or Cassandra) for this part of the system allows it to scale horizontally to millions of records and high write throughput with flexible schema. Each notification event is stored as a document, and any updates to its delivery status are recorded separately.

notifications

(NoSQL collection) Stores the notification history for users – each record represents a notification event (a message sent to a user).

| Field Name | Data Type | Description |
| --- | --- | --- |
| notification_id | UUID/String (PK) | Unique identifier for the notification event (could be a UUID or auto-generated ObjectID in a document store). |
| tenant_id | INT or String | Tenant identifier for context (duplicates the tenant for quick filtering in NoSQL, since joins aren’t used). |
| user_id | INT or String | ID of the user who received the notification. |
| template_id | INT or String | Reference to the template used (if any) for this notification. |
| message | TEXT/JSON | The content of the notification sent (could be text, JSON for structured content, etc.). |
| channel | STRING | Delivery channel used (e.g., “email”, “SMS”, “push”). |
| sent_at | DATETIME | Timestamp when the notification was sent (or created in the system). |
| metadata | JSON (optional) | Additional metadata for the notification (e.g., context info like subject line, recipient info, or template parameters used). |

Explanation: The notifications collection holds each notification that has been generated and sent to users. Storing these in a NoSQL database enables handling a massive volume of events: for example, every single notification for every user can be logged without impacting the performance of the transactional system. Each document includes the content and context of the notification, which is useful for building a notification history or inbox feature for users. We include references like template_id to know which template was used (if needed for reconstructing or analyzing messages), and store the actual message content so that the history is preserved even if templates change over time. The channel helps identify how it was delivered (for example, to filter user history by channel). Storing tenant_id and user_id with each record (even though that duplicates relational data) is intentional in NoSQL design to optimize queries by those fields (common access patterns are fetching all notifications for a given user or tenant). This denormalization is acceptable given NoSQL’s focus on read performance and scalability. The goal is that each notification event captures all the data needed to display it without joins, which is achieved by including message content and context directly in the document.

notification_status

(NoSQL collection) Tracks the delivery status of notifications, separate from the notification content. This is used to monitor whether notifications have been delivered, read, failed, etc., and to update their status over time.

| Field Name | Data Type | Description |
| --- | --- | --- |
| status_id | UUID/String (PK) | Unique identifier for this status record (could be a unique ID or use composite key of notification+timestamp). |
| notification_id | UUID/String | Identifier of the notification this status update corresponds to (foreign key reference to a record in notifications collection). |
| user_id | INT or String | ID of the user who is the recipient of the notification (for convenient querying by user, duplicates from notifications). |
| status | STRING | Delivery status value (e.g., “sent”, “delivered”, “opened”, “failed”, “read”). |
| updated_at | DATETIME | Timestamp when this status was recorded (when the status changed or was reported). |
| details | JSON (optional) | Additional details about the status, if any (e.g., error message if failed, device info for delivery, or read timestamp). |

Explanation: The notification_status collection is designed to keep track of the evolving status of each notification. By separating status tracking into its own collection, the system can record multiple status changes over time without rewriting the main notification record. For example, a notification might be sent at time A, delivered at time B, and read by the user at time C – each of these can be a separate entry in notification_status. This separation improves write performance (as status updates are high-frequency and can be appended as new documents) and avoids contention on the main notification record. It also allows querying the status history independently (for analytics or troubleshooting). Each status entry references the notification_id and the user_id for ease of lookup. This design is similar to having a NotificationReceivers/Statuses collection in other systems, which links a notification to its recipient and status. By logging statuses (like read/unread or delivered/failed) separately, we can efficiently track who has seen or received each notification and scale to large numbers of events and users without heavy joins.
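To make the append-only idea concrete, here is a small sketch in which each status change is a separate record and the current status is derived by taking the most recent one. The in-memory list stands in for the notification_status collection, and `latest_status` is a hypothetical helper; the field names follow the table above.

```python
from datetime import datetime, timezone

# Each status change is appended as its own record; nothing is updated in place.
status_events = [
    {"notification_id": "notif_abc123", "status": "sent",
     "updated_at": datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)},
    {"notification_id": "notif_abc123", "status": "delivered",
     "updated_at": datetime(2024, 1, 1, 12, 0, 2, tzinfo=timezone.utc)},
    {"notification_id": "notif_abc123", "status": "read",
     "updated_at": datetime(2024, 1, 1, 12, 5, 0, tzinfo=timezone.utc)},
]


def latest_status(events, notification_id):
    """Return the most recent status for a notification, or None if there is none."""
    matching = [e for e in events if e["notification_id"] == notification_id]
    if not matching:
        return None
    return max(matching, key=lambda e: e["updated_at"])["status"]


print(latest_status(status_events, "notif_abc123"))
```

In a real store the same lookup would typically be a query on notification_id sorted by updated_at descending with a limit of one.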

7. Detailed Component Design

Notification Scheduler & Batch Processing

To support scheduled and batch notifications, a Scheduler component is included. This could be as simple as a database of future notifications with a background job, or a distributed scheduling system. One implementation: use a table ScheduledNotifications(id, user, content, channels, send_time) where notifications are stored if send_time is in the future. A Scheduler Service periodically scans for due notifications (or a lightweight cron job queries for items where send_time <= now()), then enqueues them to the Notification Queue for delivery. For precision (when a large batch needs to go out exactly at a specific time), the system might pre-fetch those notifications slightly before due time and stage them. In high-scale cases, a more sophisticated approach is needed: e.g., use a delayed queue or priority queue data structure. Some message queues (like RabbitMQ or AWS SQS with delay) support delayed messages natively. Another approach is to partition scheduled jobs by time (e.g. per minute) and have distributed workers pick up the correct bucket at the right time. The key is that scheduling logic runs in a fault-tolerant way: possibly multiple scheduler nodes for redundancy, using locks or leader election to avoid duplicate sends. If a scheduled notification is missed (e.g., scheduler down at that minute), the system should detect and recover or have a manual fallback.
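The polling variant of the scheduler can be sketched with an in-memory SQLite table. The column names follow the scheduled_notifications schema; the `claim_due` function and the intermediate "enqueued" status are assumptions added for illustration. The conditional UPDATE is what keeps two scheduler nodes from claiming the same row.

```python
import sqlite3


def claim_due(conn, now_iso, limit=100):
    """Claim scheduled rows whose time has come; return the claimed schedule_ids."""
    rows = conn.execute(
        "SELECT schedule_id FROM scheduled_notifications "
        "WHERE status = 'scheduled' AND scheduled_time <= ? "
        "ORDER BY scheduled_time LIMIT ?",
        (now_iso, limit),
    ).fetchall()
    claimed = []
    for (schedule_id,) in rows:
        # Only succeeds if no other scheduler node changed the status first.
        cur = conn.execute(
            "UPDATE scheduled_notifications SET status = 'enqueued' "
            "WHERE schedule_id = ? AND status = 'scheduled'",
            (schedule_id,),
        )
        if cur.rowcount == 1:
            claimed.append(schedule_id)  # a real service would enqueue it here
    conn.commit()
    return claimed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduled_notifications ("
    "schedule_id INTEGER PRIMARY KEY, user_id INTEGER, template_id INTEGER, "
    "scheduled_time TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO scheduled_notifications VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 7, "2024-01-01T11:59:00", "scheduled"),  # due
        (2, 11, 7, "2024-01-01T12:30:00", "scheduled"),  # not yet due
        (3, 12, 7, "2024-01-01T11:00:00", "canceled"),   # due but canceled
    ],
)
first = claim_due(conn, "2024-01-01T12:00:00")
second = claim_due(conn, "2024-01-01T12:00:00")
print(first, second)
```

Timestamps are stored as ISO-8601 strings in a single fixed format so that string comparison matches chronological order.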

For batch campaigns (e.g., sending a marketing email to 10 million users), it’s inefficient for a client to call the API 10 million times. Instead, we might provide a bulk interface or offline loading mechanism. For instance, an admin could upload a list of target user IDs for the campaign. The system (as described by the Duolingo case) might pre-fetch those users’ data, store the list in a file or DB, and then when triggered, rapidly push all those notifications through the pipeline. The workers and queue must be scaled up to handle this burst. To avoid overwhelming downstream providers (like an email service), we might intentionally throttle the consumption rate or use multiple provider accounts to spread the load.
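One common way to throttle consumption toward a downstream provider is a token bucket. The sketch below is a single-process illustration with an injected clock so its behavior is easy to follow; the `TokenBucket` class and the rates used are assumptions, and a multi-worker deployment would need a shared limiter instead.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.last = now

    def try_acquire(self, now):
        """Return True and consume a token if a send is allowed at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_sec=2, capacity=2)
burst = [bucket.try_acquire(0.0) for _ in range(3)]  # third send exceeds the burst
later = bucket.try_acquire(0.5)                       # one token refilled after 0.5 s
print(burst, later)
```

A worker that gets False would leave the message on the queue (or sleep briefly) rather than drop it, so throttling delays the campaign without losing notifications.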

Worker Architecture and Queuing Mechanisms

The Channel Processors (workers) are designed to be stateless and horizontally scalable. Each worker instance is a consumer from the queue. For a high-throughput system, a distributed log/queue like Kafka is a good choice because it can handle very high message rates and partition the stream across many consumers. We would create (for example) separate Kafka topics for Email, SMS, Push, etc., each with multiple partitions. The number of partitions defines the maximum parallelism, because within a consumer group each partition is read by at most one consumer. If we anticipate 100k notifications/sec at peak, and a single consumer can process say 1k messages/sec, we’d need on the order of 100 consumers working in parallel across partitions. We might allocate, say, 50 partitions for email, 20 for SMS (if volume is lower), 30 for push, etc., based on expected load per channel. Each partition is processed by one consumer instance, and we can add consumers up to the partition count. Workers can be auto-scaled based on queue backlog or system load. Each message includes a unique notification ID and perhaps a deduplication key (especially with at-least-once delivery semantics, so that if the same message reappears due to a retry, the worker can detect it has seen that ID before and skip or update instead of sending a duplicate). Workers also track a retry count or delivery status, either in memory or via the log metadata.
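The sizing arithmetic above can be written out as a small calculation. The per-channel peak rates below are assumptions chosen to match the 50/20/30 split in the text; `consumers_needed` is a hypothetical helper that rounds up so a partial consumer's worth of load still gets a consumer.

```python
def consumers_needed(peak_per_sec: int, per_consumer_per_sec: int) -> int:
    """Ceiling division: the minimum number of consumers to keep up at peak."""
    return -(-peak_per_sec // per_consumer_per_sec)


PER_CONSUMER = 1_000  # messages/sec one consumer can process (from the text)

total = consumers_needed(100_000, PER_CONSUMER)

# Assumed per-channel peak rates (messages/sec) that add up to 100k.
channel_peaks = {"email": 50_000, "sms": 20_000, "push": 30_000}
partitions = {ch: consumers_needed(rate, PER_CONSUMER) for ch, rate in channel_peaks.items()}

print(total)
print(partitions)
```

In practice one would add headroom on top of these minimums, since a partition count is awkward to change later and consumers rarely sustain their benchmark rate.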

Queue Semantics: The queue should guarantee at-least-once delivery of messages to the workers, meaning a message will be redelivered if a worker crashes mid-processing. Kafka provides at-least-once delivery when consumers commit offsets only after processing a message. We could also use a system like RabbitMQ, which can requeue un-ACKed messages. In either case, if a worker fails after partially sending, we need idempotency to avoid duplicates. Alternatively, an exactly-once pipeline could be built with more complexity (e.g., Kafka transactions, idempotent producers/consumers). Many large systems choose at-least-once with de-duplication on the consumer side as needed. Deduplication could involve checking a cache or database of recently sent notification IDs.

No Prioritization: Since “no prioritization between channels” is required, we are not weighting the queue processing in favor of any channel. All channel topics are processed as fast as possible. Priorities can still be handled within a channel if needed (for example, maybe an SMS could have normal vs high priority messages in different queues), but the problem statement says no prioritization across channels, so we treat them equally.

Ordering Considerations: Generally, notifications to the same user via the same channel should be delivered in order of generation. Using a partitioning scheme where the partition key is the user ID (or some hash of it) can ensure that all notifications for a given user go to the same partition (and thus are processed in sequence). This prevents, say, an “Order Shipped” notification from overtaking the “Order Placed” notification for the same user. Cross-user ordering doesn’t matter. Partitioning by user also distributes load roughly evenly if user IDs are random. If certain users generate a lot of notifications (e.g. a very active user), that partition could see heavier load – in extreme cases we might partition by a composite key (user and maybe notification type) or just accept some imbalance for ordering guarantees. Another approach is partition by notification type or channel-specific logic (ensuring, for example, all promotional emails are spread out). The design can incorporate sharding at multiple levels: e.g., user-based sharding for preferences DB and partitioning by time for logs, but for the queue, user-based partitioning is a simple strategy to preserve order per recipient.
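A minimal sketch of user-based partitioning: hashing the user ID and taking it modulo the partition count sends every notification for a given user to the same partition. The `partition_for` function is hypothetical, and MD5 is used here only as a stable, well-distributed hash; it is a stand-in, not the partitioner any particular queue uses internally.

```python
import hashlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user ID to a partition index in [0, num_partitions)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a partition.
    return int.from_bytes(digest[:8], "big") % num_partitions


p1 = partition_for("u_12345", 50)
p2 = partition_for("u_12345", 50)
print(p1 == p2)  # the same user always maps to the same partition
```

Python's built-in `hash()` is avoided on purpose: it is randomized per process for strings, so two producers could disagree on the partition for the same user. Note also that changing the partition count remaps users, which can briefly break per-user ordering during the change.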

Retry and Failure Handling

Retry Policies: Each channel processor will implement a retry mechanism for transient failures. A common strategy is exponential backoff: if a send fails, wait a short interval and try again, increasing the wait time on each failure. For example, after the 1st failure wait 1s, after the 2nd wait 5s, then 30s, etc. We will configure a maximum number of retries (say 3 attempts) per notification per channel. Some channels might have specific retry rules: e.g., for SMS, if the first attempt fails due to a carrier issue, it might be pointless to retry quickly, so we might retry after a longer delay or use an alternate SMS provider if available. For email, a common pattern is to retry a couple of times over a few minutes for temporary SMTP issues. Push notifications typically either succeed or not; if a device token is invalid, a retry won’t help (the failure is permanent for that token). So the worker might classify errors as permanent or transient. Permanent errors (invalid address, unregistered device, etc.) should not be retried; they are logged and dropped immediately. Transient ones (server timeouts, rate limit exceeded, etc.) trigger the retry logic.
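The retry decision and backoff delay might look like the sketch below. The error code strings, the `should_retry` and `backoff_delay` helpers, and the factor of 5 with a 60-second cap are all illustrative assumptions; the factor gives delays of 1s, 5s, 25s, which only approximates the 1s/5s/30s example in the text. Real deployments usually add random jitter to the delay so that many failed messages do not retry at the same instant.

```python
# Hypothetical error codes; a real worker would map provider responses to these.
PERMANENT_ERRORS = {"invalid_address", "unregistered_device"}
TRANSIENT_ERRORS = {"timeout", "rate_limited"}


def should_retry(error_code: str, retries_done: int, max_retries: int = 3) -> bool:
    """Retry only transient errors, and only while retries remain."""
    return error_code in TRANSIENT_ERRORS and retries_done < max_retries


def backoff_delay(retries_done: int, base: float = 1.0,
                  factor: float = 5.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * factor**retries_done, capped at `cap` seconds."""
    return min(cap, base * factor ** retries_done)


delays = [backoff_delay(n) for n in range(4)]
print(delays)
print(should_retry("timeout", 0), should_retry("invalid_address", 0))
```

Unknown error codes are not retried in this sketch; whether to treat them as transient instead is a policy choice that depends on the provider.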

The system can implement retries in a few ways. One is in the worker process itself, with an in-memory timer or loop. A more robust approach at scale is to publish the failed message to a retry queue (or back onto the main queue with a delay). For instance, if using Kafka, we might have separate topics like Email_Retry, or carry an attempt count in the message and have workers requeue it with the count incremented and a "not before" timestamp that tells consumers when it may next be processed. Some setups add a dead-letter queue (DLQ): after N failed attempts, the message goes to the DLQ instead of being retried further.
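A minimal in-memory sketch of the requeue-with-attempt-count approach. A heap ordered by "not before" time stands in for a real delayed queue or retry topic; the `Message` fields and the `RetryQueue` class are hypothetical names used only for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # assumed limit before dead-lettering


@dataclass
class Message:
    notification_id: str
    channel: str
    attempt: int = 0          # number of failed attempts so far
    not_before: float = 0.0   # earliest time the message may be processed


class RetryQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker so messages are never compared
        self.dead_letters = []

    def requeue(self, msg: Message, now: float, delay_s: float) -> None:
        """Record a failed attempt; schedule a retry or move the message to the DLQ."""
        msg.attempt += 1
        if msg.attempt >= MAX_ATTEMPTS:
            self.dead_letters.append(msg)
            return
        msg.not_before = now + delay_s
        heapq.heappush(self._heap, (msg.not_before, next(self._seq), msg))

    def pop_due(self, now: float):
        """Return all messages whose not_before time has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due
```

Passing `now` explicitly keeps the sketch deterministic; a real worker would read the clock and poll `pop_due` periodically.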

Deduplication: With at-least-once delivery and retries, deduplicating notifications is important so users don’t get spammed with duplicates. We handle this by attaching a unique notification ID to each notification request (either supplied by the API caller or generated by the system). This ID travels with all channel messages. Workers or the Notification Service can keep a cache of recently seen IDs or check the database before sending. For example, the Notification Service could store a short-lived "notification request processed" record keyed by that ID, so if the same request arrives again (due to a client retry or message duplication), it is ignored. On the consumer side, if a worker sees a message ID that was already processed (checked against a Redis set or an entry in the logs), it skips sending. Idempotent operations are also key: some email providers can detect duplicates if we pass a message ID, but we shouldn’t rely solely on that. Within our system, careful commit handling in the queue combined with idempotency checks prevents duplicates in normal operation. The Duolingo example explicitly mentioned ensuring no duplicate messages even if two triggers fired, highlighting the need for idempotency in the face of concurrent events.
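The consumer-side check can be sketched with a short-lived "seen" cache. This is a single-process illustration with hypothetical names (`DedupCache`, `send_once`); in a distributed deployment the same idea would typically rely on an atomic set-if-absent with expiry in a shared store such as Redis, which is an assumption about the deployment rather than something the design specifies.

```python
import time


class DedupCache:
    """Remembers (notification_id, channel) keys for ttl_s seconds."""

    def __init__(self, ttl_s: float = 3600.0, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._seen = {}  # key -> expiry time

    def first_time(self, notification_id: str, channel: str) -> bool:
        """Return True and record the key if it has not been seen recently."""
        now = self._clock()
        # Purge expired entries (linear scan, kept simple for the sketch).
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        key = (notification_id, channel)
        if key in self._seen:
            return False
        self._seen[key] = now + self._ttl_s
        return True


def send_once(cache: DedupCache, notification_id: str, channel: str, send) -> bool:
    """Call send() only for the first occurrence of this ID on this channel."""
    if not cache.first_time(notification_id, channel):
        return False  # duplicate: skip sending
    send(notification_id, channel)
    return True
```

Note that the key is recorded before the send, so a failed send would need to clear the key (or use a separate "in progress" state) before a retry is allowed; that handling is omitted here.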

Failure Handling and Fallbacks: Not all failures are temporary. If a channel is down or a third-party service is having an outage, we may want to failover or fallback. For instance, if our primary SMS provider fails, a secondary provider could be used. The system can be configured with multiple integrations per channel, ranked by preference. On failure to deliver via the first, it can try the second. This improves reliability at the cost of complexity and possibly higher cost. Large systems often integrate multiple vendors for critical channels (e.g., two email service providers) and implement logic to route around outages. Similarly, if sending push notifications, if one region of FCM is unresponsive, we might try another region’s endpoint. These fallbacks should still obey the no-duplication rule (only one should ultimately succeed).
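A minimal sketch of ranked provider failover. The provider callables and the `ProviderError` exception are hypothetical stand-ins for real vendor integrations.

```python
class ProviderError(Exception):
    """Raised by a provider integration when a send attempt fails."""


def send_with_fallback(providers, message):
    """Try providers in preference order and stop at the first success.

    `providers` is a list of (name, send_callable) pairs, highest preference first.
    Returns the name of the provider that accepted the message.
    Raises ProviderError if every provider fails.
    """
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name  # stop here so only one provider delivers
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

One caveat: a timeout from the first provider does not prove the message was not delivered, so a fallback after an ambiguous failure can still produce a duplicate. Reusing the same notification ID with every provider, plus the deduplication checks described earlier, limits the impact.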

If a notification ultimately fails after retries (e.g., email bounced, or we exhausted retries for a transient error), the system should record that failure in the log and possibly move the notification to a Dead Letter Queue for manual review. Operators can inspect DLQ messages and decide to reprocess them later or alert the source service that the notification was not delivered. For example, if an email keeps bouncing, perhaps the user’s email is invalid – the system could flag that user’s email contact in the preferences as invalid to avoid future attempts.

Graceful Degradation: Under extreme load or partial failures, the system should degrade gracefully. For instance, if the notification database is down, the system might still attempt to send notifications but skip logging to the DB (or log to an in-memory buffer or fallback store and flush later). If the queue is backed up (indicating downstream slowness), the rate of accepting new requests can be throttled at the API gateway or through backpressure signals. The idea is to prevent total crashes by shedding non-critical load when needed, for example by dropping or delaying lower-priority notifications. The requirements do not call for prioritizing channels, so prioritizing notification types during overload would be an extension.
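As an illustration of the load-shedding extension, the sketch below decides whether to admit a request based on queue depth. The thresholds, the priority labels, and the `admit` function are all hypothetical; the requirements themselves do not define priorities.

```python
def admit(queue_depth: int, priority: str,
          soft_limit: int = 10_000, hard_limit: int = 50_000) -> str:
    """Return "accept", "defer", or "reject" for an incoming notification request.

    Below soft_limit everything is accepted. Between the limits, low-priority
    requests are deferred. At or above hard_limit, all requests are rejected so
    that callers back off.
    """
    if queue_depth >= hard_limit:
        return "reject"
    if queue_depth >= soft_limit and priority == "low":
        return "defer"
    return "accept"
```

An API gateway could map "reject" to an HTTP 429 response, signaling callers to slow down.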

Additional Features

Rate Limiting per User/Channel: To prevent a user from being overwhelmed by notifications (especially promotional), the system should enforce limits. For example, not more than X promotional notifications per day per user. This requires counting notifications sent (which could be done via the logs or a separate counter service) and checking before sending. If a limit is exceeded, the system can defer or drop some notifications (or combine them into a digest). This is typically part of the business logic in the Notification Service or User Preferences evaluation.
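A minimal sketch of a per-user daily cap, assuming an in-memory counter keyed by user, category, and day. `DailyLimiter` is a hypothetical name; a production system would more likely keep these counters in a shared store with an expiry so old days are cleaned up automatically.

```python
from collections import defaultdict
from datetime import date


class DailyLimiter:
    """Allows at most limit_per_day notifications per (user, category) per day."""

    def __init__(self, limit_per_day: int):
        self._limit = limit_per_day
        self._counts = defaultdict(int)  # (user_id, category, day) -> sent count

    def allow(self, user_id: str, category: str, today: date) -> bool:
        """Return True and count the send if the user is still under the limit."""
        key = (user_id, category, today)
        if self._counts[key] >= self._limit:
            return False  # caller may defer, drop, or fold into a digest
        self._counts[key] += 1
        return True
```

The check runs before the notification is queued for a channel, so a rejected notification never consumes worker capacity.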

Notification Format and Personalization: The system design allows personalization by using templates and data. It also should handle localization (different languages for notifications, possibly choosing template based on user locale). Templates and content might be stored or managed through a CMS that the Notification Service can query, but at design time it suffices to note templates exist and are fetched when composing messages.
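Template selection by locale can be sketched as a lookup with fallback. The template IDs, the template text, and the `render` function are hypothetical; an in-memory dictionary stands in for whatever template store or CMS is actually used.

```python
from string import Template

# Hypothetical template store keyed by (template_id, locale).
TEMPLATES = {
    ("order_shipped", "en"): Template("Hi $name, your order $order_id has shipped."),
    ("order_shipped", "es"): Template("Hola $name, tu pedido $order_id ha sido enviado."),
}
DEFAULT_LOCALE = "en"


def render(template_id: str, locale: str, data: dict) -> str:
    """Render a template, trying the exact locale, then its language, then the default."""
    candidates = [locale, locale.split("-")[0], DEFAULT_LOCALE]
    for candidate in candidates:
        template = TEMPLATES.get((template_id, candidate))
        if template is not None:
            return template.substitute(data)
    raise KeyError(f"no template for {template_id!r}")
```

For example, a user with locale `es-MX` falls back to the `es` template, and a user with an unsupported locale receives the default-language version.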

Security Considerations: Ensure that the Notification Service APIs are protected (e.g., with OAuth tokens or API keys) so that only authorized systems (like the company’s own backends) can send notifications; otherwise spammers could abuse the service. Also, validate and escape content to prevent injection (especially if content might be HTML for email, or if we allow rich content in notifications). The system must also not leak data: authorization checks should ensure that one user cannot query another user’s notifications.
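Two of these checks can be sketched briefly: escaping caller-supplied text before embedding it in an HTML email, and an ownership check before returning a notification. The function names and the notion of a simple owner ID are assumptions made for illustration.

```python
import html


def render_email_body(user_supplied_text: str) -> str:
    """Escape caller-supplied text before placing it inside HTML markup."""
    return "<p>" + html.escape(user_supplied_text) + "</p>"


def can_read(requester_id: str, notification_owner_id: str) -> bool:
    """Allow reads only when the requester owns the notification."""
    return requester_id == notification_owner_id
```

In a multi-tenant deployment, the ownership check would also need to compare tenant identifiers so that users of one tenant can never read another tenant's notifications.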

8. Scalability and Performance Considerations

Designing for 3B notifications/day requires careful attention to scalability. Key strategies include distributing load, parallelizing work, and eliminating bottlenecks:
