Designing Payment System

Step 1: System Definition

Design a payment processing platform (similar to Stripe) that enables merchants (businesses) to accept online payments from customers securely and reliably. The system will handle the entire lifecycle of a payment transaction – from capturing payment details to authorizing the transaction, transferring funds, and handling post-payment events (like refunds or chargebacks). Ultimately, this “Stripe-like” system serves as a payment service provider that combines the functionality of a payment gateway and a payment processor in one integrated platform.

Core Entities and Roles:

Merchant: The business or seller using our platform to charge customers. Merchants integrate with our system (via API) to process payments for goods or services. Each merchant typically has a merchant account in our system where their transactions and balances are recorded.
Customer: The end-user or buyer who wants to pay the merchant. The customer provides a payment method (e.g., credit card details, bank account, etc.) to complete a transaction.
Payment Gateway: A service/component that securely transmits payment information from the customer to the payment processor and back. It acts as a bridge connecting the merchant, customer, and bank networks. In our system, the API and front-end components play the role of the payment gateway, ensuring sensitive data (like card numbers) is captured and transported securely (via encryption and tokenization).
Payment Processor: A service that actually processes transactions by communicating with financial networks. It handles authorization with the customer’s bank and ensures funds are moved from the customer’s account to the merchant’s account. Our platform will fulfill the payment processor role by connecting to card networks and banking systems to authorize and settle transactions.
Issuing Bank (Issuer): The customer’s bank that issued their credit/debit card. The issuer represents the customer in the transaction and is responsible for approving or declining the transaction based on the customer’s account status. If a customer disputes a charge (chargeback), the issuer is the entity that initially refunds the customer and evaluates the claim.
Acquiring Bank (Acquirer): The merchant’s bank that processes credit card payments on the merchant’s behalf. The acquirer represents the business in the transaction, acquiring money from the issuer and eventually depositing it into the merchant’s account.
Card Networks: The credit card networks (Visa, Mastercard, American Express, etc.) that relay transaction information between the acquirer and issuer. They set rules and ensure standardized communication for authorizations, refunds, chargebacks, etc. (These networks are part of the external ecosystem our system must interact with.)
Payment Token: A surrogate identifier for sensitive payment details. Our system will use tokenization to exchange actual card data for a secure token. For example, the first time a customer’s card is used, the system can store it securely and return a token to the merchant. Subsequent transactions can reference the token. This way, the merchant’s servers never need to store raw card numbers.
Fraud Detection Service: A component or subsystem that evaluates transactions for potential fraud. This typically involves analyzing transaction data (amount, location, past behavior, device info, etc.) and possibly using rules or machine learning to flag high-risk payments. If a payment is suspected fraudulent, the system may decline it or mark it for review.
Chargeback: A transaction reversal initiated when a customer disputes a charge through their bank. In a chargeback, the issuer withdraws the funds from the acquirer (who in turn may deduct from the merchant) and returns them to the customer. The system needs to log chargebacks, notify the merchant, and possibly allow the merchant to submit evidence. (The acquirer is typically liable for the chargeback amount while it’s resolved.)
Refund: A post-settlement action where the merchant (or system) returns funds to the customer for a transaction (e.g., customer requested a return). The system should support full and partial refunds by creating reverse transactions through the payment network.
Settlement/Payout: Once payments are captured, the system will eventually transfer the accumulated funds to the merchant’s actual bank account (minus fees).

System Scope: Our payment system will expose APIs for merchants to perform actions like creating a payment charge, refunding a charge, etc., similar to Stripe’s API. It will internally handle authorization with banks, storing transaction records, managing customer payment details, fraud checks, and sending notifications (like webhooks or emails).

Step 2: Requirements Clarification

Functional Requirements

Payment Processing (Authorization & Capture): The core functionality is to process a payment transaction. The system should accept a payment request (e.g., an API call to charge a customer’s card for a certain amount) and handle the entire authorization flow. This involves validating the request, performing fraud checks, communicating with the external payment network (acquirer/issuer), and returning the result (approved or declined) to the merchant. If approved, the system will mark the funds to be captured and later settled to the merchant.
Payment Methods: Initially, support credit and debit card payments (the most common use-case). The design should be extensible to other payment methods such as bank transfers (ACH), digital wallets (Apple Pay, Google Pay), or alternative methods.
Recurring Payments & Saved Cards: Allow merchants to charge returning customers without asking for card details each time. This means supporting creation of customer profiles and payment method tokens. For example, an API to save a customer’s card (which returns a token or customer ID), and a way to charge that token in the future. This involves securely storing payment details (tokenization) and recurring billing logic (subscriptions, though subscription management might be an extension of core payments).
Refunds: Provide an API to issue refunds on previous charges. A refund reverses a payment (full or partial). The system must record the refund transaction, adjust the merchant’s balance, and initiate the refund through the payment network (so the money goes back to the customer’s card or account). The refund outcome (success/failure) should be communicated back and logged.
Chargebacks/Disputes: The system must handle chargebacks, which often occur asynchronously (days or weeks after a charge). This includes receiving notifications from the payment network or acquirer about disputes, updating the status of the original transaction to “chargeback” or “disputed,” and exposing data to the merchant (so they can see the dispute and submit evidence). It should also adjust balances (funds might be held back from the merchant). While the actual dispute resolution is partly offline (banks and networks involved), our system needs to log it and react accordingly (e.g., alert the merchant via a webhook or dashboard).
Idempotency & Retries: Provide safe retry mechanisms for merchants. In practice, network or server issues can cause a merchant to not receive a response. The API should support an idempotency key on requests: a unique client-provided key that ensures if the same operation is received twice, it’s only executed once. For example, if a merchant sends a charge request and doesn’t get a response due to a timeout, they can retry with the same key – the system will recognize it and return the original outcome without double-charging the customer. This is crucial for a robust API.
Logging and Audit Trail: Every transaction event should be logged. The system must maintain an audit trail of actions (payments, refunds, chargeback updates) for compliance and debugging. This includes detailed transaction records (timestamps, involved IDs, status changes, error codes from banks, etc.). In finance, auditability is critical – one should be able to trace what happened to the money at each step.
Notification and Webhooks: After certain events (payment succeeded, payment failed, refund processed, chargeback filed, etc.), the system should notify interested parties. For merchants’ applications, provide webhooks – HTTP callbacks that the system sends to merchant-defined URLs with event data. This way, the merchant’s system can react (e.g., fulfill an order after payment success). Also, send email receipts to customers if the merchant opts in (Stripe can send receipts on the merchant’s behalf).

Non-Functional Requirements

Scalability: Sustain about 2,000 transactions per second (TPS) with headroom for sales peaks, scaling to billions of transactions yearly.
Low Latency: Less than 2 seconds response time for payment authorization.
High Availability (HA): 99.99% uptime with redundancy and failover.
Consistency and Correctness: Financial transactions require strong consistency in record-keeping. We must ensure that we don’t lose or double-count money. This means database writes for transactions should be atomic (ACID properties) and once a payment is recorded as successful, it is durable. Balances (the money owed to merchants) must be correctly updated with each transaction. Inconsistencies (e.g., a payment captured but not recorded due to a crash) are unacceptable – the system should employ mechanisms (like transaction logs, two-phase commits, or reliable messaging) to avoid this.
Security: Extremely critical for a payment system. All data in transit must be encrypted (TLS for API calls). Sensitive data at rest (like card numbers or personal info) must be encrypted and access controlled.
Observability: Comprehensive logging, monitoring, and distributed tracing for debugging and audits.

Step 3: Back-of-the-Envelope Capacity Estimation

Scale: ~100,000+ merchants; peak loads of 5,000 TPS during major sales.
Storage: ~300 bytes per transaction record. For 1 billion transactions/year, that's roughly 300 GB of storage per year.
Read vs. Write Load: Approximately 70% writes (transaction inserts/updates) and 30% reads (queries). Use read replicas or caching to optimize frequent queries.
Network Throughput: 2,000 TPS of API calls is roughly 32 Mbps of traffic (2,000 × 2 KB × 8 bits, assuming ~2 KB per request/response). External calls to banks/card networks might handle ~50 TPS per connection, so about 40 concurrent connections are needed at 2,000 TPS.
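
The arithmetic behind these estimates can be checked with a short script. This is only a sketch: the inputs are the figures quoted above, while the decimal conventions (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and the variable names are assumptions made for illustration.

```python
# Inputs come from the estimates above; decimal KB/GB units are an assumption.
BYTES_PER_TXN = 300              # size of one transaction record
TXNS_PER_YEAR = 1_000_000_000    # 1 billion transactions per year
API_TPS = 2_000                  # baseline API calls per second
PEAK_TPS = 5_000                 # peak load during major sales
BYTES_PER_CALL = 2_000           # ~2 KB per request/response
TPS_PER_CONNECTION = 50          # throughput of one external connection


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


storage_gb_per_year = BYTES_PER_TXN * TXNS_PER_YEAR / 1e9
bandwidth_mbps = API_TPS * BYTES_PER_CALL * 8 / 1e6
connections_baseline = ceil_div(API_TPS, TPS_PER_CONNECTION)
connections_peak = ceil_div(PEAK_TPS, TPS_PER_CONNECTION)
average_tps = TXNS_PER_YEAR / (365 * 24 * 3600)

print(f"storage per year:     {storage_gb_per_year:.0f} GB")
print(f"API bandwidth:        {bandwidth_mbps:.0f} Mbps")
print(f"connections (2k TPS): {connections_baseline}")
print(f"connections (5k TPS): {connections_peak}")
print(f"average TPS at 1B/yr: {average_tps:.1f}")
```

With these inputs the bandwidth works out to about 32 Mbps, and 1 billion transactions per year averages only about 32 TPS, which suggests the TPS targets describe peak capacity rather than average load.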

Step 4: High-Level System Design

Architecture Overview

API Gateway / Load Balancer: Serves as the entry point for all requests. It handles authentication, rate limiting, and routing to internal services.
Payment Service (or Payment Processing Service): The core backend service that handles the main payment logic. This service (which could itself be composed of multiple microservices, but initially think of it as one logical unit) is responsible for orchestrating a payment request. It will validate the request, coordinate with other components (like the fraud detection service, card vault, and external payment processor), and update the transaction state in the database. The Payment Service exposes endpoints like “Charge Payment,” “Refund Payment,” etc., which the API gateway passes through. It contains the business logic for payments.
Card Vault Service: A secure service for storing sensitive payment data (card numbers, bank account numbers, etc.). When a merchant wants to save a card or when a new card is used, the Payment Service will interact with the Card Vault. The vault will tokenize and store the card details (fully encrypted). It might return a token or card ID which is used in place of the actual card for future charges. The vault service ensures that even if other parts of the system are compromised, raw card data is protected.
Fraud Detection Service: Analyzes transactions using machine learning models and rule-based heuristics to flag high-risk payments before processing.
External Payment Network Integration (Acquirer Gateway): Handles communication with external payment processors, acquirer banks, and card networks (e.g., Visa/Mastercard APIs).
Transaction Database (Ledger): Stores all transactions, balances, and logs. It will be a sharded relational database for strong consistency on transactional data, supplemented by a NoSQL store for extensive logging or analytics.
Messaging/Queue System: An asynchronous messaging system (e.g., Kafka) to decouple processes. Events for successful payments, refunds, or chargebacks are published to be handled asynchronously. For example, after a transaction is processed, an event “PaymentSucceeded” can be published to a topic. Other services (notifications, ledger updates, analytics pipelines) can consume that without the Payment Service having to synchronously call each.
Notification Service: A service responsible for sending notifications such as email receipts to customers or SMS alerts. The Payment Service can offload this duty by emitting an event or enqueuing a job, so the Notification Service picks it up.
Webhook Service: Listens to events and sends real-time notifications to merchants (via webhooks) about status changes (payment succeeded, failed, dispute opened, etc.).
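
To make the Card Vault's tokenization role concrete, here is a minimal in-memory sketch. The class and method names (`InMemoryCardVault`, `tokenize`, `detokenize`, `display_fields`) and the `tok_` prefix are hypothetical, and the plain dictionary stands in for what would be encrypted, access-controlled storage in a real vault.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    number: str
    exp_month: int
    exp_year: int


class InMemoryCardVault:
    """Hypothetical stand-in for the Card Vault Service.

    A real vault would encrypt card data at rest and restrict access;
    this sketch keeps it in a dict purely to show the token exchange.
    """

    def __init__(self) -> None:
        self._cards: dict[str, Card] = {}

    def tokenize(self, card: Card) -> str:
        # The token is random, so it reveals nothing about the card number.
        token = "tok_" + secrets.token_urlsafe(16)
        self._cards[token] = card
        return token

    def detokenize(self, token: str) -> Card:
        # Internal-only call; raises KeyError for an unknown token.
        return self._cards[token]

    @staticmethod
    def display_fields(card: Card) -> dict:
        # Non-sensitive fields that can be stored alongside the token.
        return {
            "card_last4": card.number[-4:],
            "card_exp_month": card.exp_month,
            "card_exp_year": card.exp_year,
        }
```

Only the token and the display fields would leave the vault; the Payment Service would call `detokenize` just before contacting the acquirer.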

Data Flow for a Typical Transaction (Credit Card Charge): To illustrate, consider a customer buying a product on a merchant’s website using a credit card:

Client-Side Tokenization: It’s common (as Stripe does) that the merchant’s front-end uses a JavaScript library or SDK provided by the payment system to collect card details and send them directly to the payment system, getting back a token. This way, the merchant’s backend never sees the raw card number. For example, the browser calls our API to tokenize the card (via the Card Vault), and gets a token or payment method ID. This token is then sent to the merchant’s server.
Charge Request: The merchant’s server (or client) calls our Charge API (e.g., POST /v1/charges) via the API Gateway. They include the amount, currency, and either the token from the client-side tokenization step or some payment instrument details, plus possibly an idempotency key.
API Gateway: Validates the merchant’s API key/auth token, checks request size/rate limits, then forwards the request to the Payment Service’s endpoint for creating a charge.
Payment Service – Request Validation: The Payment Service first parses the request. It verifies the merchant is allowed to make this charge (e.g., checks account status, whether currency is supported, etc.). It also ensures required fields are present (amount, etc.) and that the amount is positive and within allowed limits. If an idempotency key is provided, it will do a lookup in the Idempotency store to see if this key has been seen for this merchant.
- If the key exists and a result is stored, it will short-circuit and return the stored result (thus avoiding duplicate processing).
- If not, it will record this key as in-progress (to prevent a race if a retry comes in while processing).
Fraud Check: The Payment Service calls the Fraud Service to assess risk. If the transaction is high-risk, it may be declined or flagged for manual review.
Card Data Retrieval: The Payment Service needs the card details to send to the acquirer (unless the merchant provided raw card info directly). If a token was provided, the Payment Service calls the Card Vault Service to get the actual card number, expiry, and possibly CVV (though often CVV isn’t stored, it’s provided by customer at time of transaction and not retained). The Card Vault returns the decrypted card data securely. This call is internal and secured – we never expose card data outside.
External Authorization (Acquirer/Issuer): Now the Payment Service prepares a request to the external payment gateway/processor (which could be an acquirer’s API or a payment network connection). This might be a JSON API call: it sends card number, expiry, amount, merchant ID (or acquirer merchant ID), etc., to the acquirer.
- This is typically the slowest step, as it involves leaving our system to call the bank network.
- The response comes back with either approved (and an auth code, transaction ID) or declined (with a reason code), or an error (maybe timeout or network issue). The Payment Service receives this.
Processing Response: If approved, the Payment Service marks the transaction as approved. It writes a transaction record to the database (if not already created) with status = “approved” (or “succeeded”) and creates a related ledger entry crediting the merchant’s balance with the amount (minus fees). If declined, it records the transaction as failed (with the reason). An error with no definitive response is the tricky case: we might mark the transaction as “pending” or unknown and trigger a status check or manual follow-up. When a call times out, you typically do not immediately retry the charge (to avoid a double charge); instead, you query the status or rely on idempotency, so that a retry with the same key is either processed once or recognized as a duplicate if the first attempt actually went through.
Post-Processing: After updating our records, the Payment Service will generate a response to return to the API caller (merchant). Typically, the response includes the transaction status (succeeded or failed), a unique charge ID in our system, and maybe details like captured amount, fees, etc. If failed, include an error message or code. This response goes back through the API Gateway to the merchant. From the merchant’s perspective, the API call to charge is now complete with a result. The customer at checkout sees “payment approved” (or error if declined).
Asynchronous Events: Meanwhile, our system triggers follow-up processes:
- The Payment Service (after committing the transaction) publishes a “Payment Succeeded” event to the internal Event Bus or sends a message to a queue that a payment is done. This event contains the charge ID, merchant, amount, etc.
- The Webhook Service, subscribed to such events, will pick it up and look for any webhooks that the merchant has registered for “payment_succeeded”. It will then send an HTTP POST to the merchant’s callback URL with the data. This might happen within seconds of the transaction. If the merchant’s server is down, it will retry a few times over, say, the next hour. This ensures the merchant’s system is notified.
- The Notification Service might see the event and if configured, send an email receipt to the customer.
- A separate Analytics or Reporting Service could log the transaction to a data warehouse for long-term analysis (via an event or by tailing the transaction DB).
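
The synchronous part of the flow above (validation, fraud check, card retrieval, external authorization, response handling) can be sketched as a single function. This is an illustration under assumptions: `process_charge` and its callback parameters are hypothetical names, the collaborators are injected as plain callables, and a high-risk result is simply declined here even though the text also allows flagging for manual review.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ChargeResult:
    status: str        # "succeeded", "failed", or "pending"
    reason: str = ""   # auth code, decline code, or error label


def process_charge(
    amount_cents: int,
    token: str,
    is_high_risk: Callable[[int, str], bool],
    detokenize: Callable[[str], object],
    authorize: Callable[[object, int], Tuple[bool, str]],
) -> ChargeResult:
    # Request validation: the amount must be positive.
    if amount_cents <= 0:
        return ChargeResult("failed", "invalid_amount")

    # Fraud check before any external call is made.
    if is_high_risk(amount_cents, token):
        return ChargeResult("failed", "fraud_suspected")

    # Card data retrieval from the vault (internal call only).
    card = detokenize(token)

    # External authorization: the slowest step, and the one that can time out.
    try:
        approved, code = authorize(card, amount_cents)
    except TimeoutError:
        # No definitive answer: do not blindly retry, since the first
        # attempt may have gone through. Leave the charge pending.
        return ChargeResult("pending", "acquirer_timeout")

    return ChargeResult("succeeded" if approved else "failed", code)
```

A pending result would then be resolved by a status query or by a retry carrying the same idempotency key, rather than by a fresh authorization attempt.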

The refund flow is as follows:

Refund Flow: A merchant calls the Refund API with the original transaction ID and the amount to refund. The Payment Service verifies that the transaction exists and succeeded, and ensures the refund amount does not exceed the original amount. It then calls the external API to issue a refund (or a reversal if the payment wasn’t captured yet). If approved, it records the refund, adjusts the ledger (debiting the merchant’s balance by that amount), and returns the result. Webhooks and notifications follow similarly (e.g., a “payment_refunded” event). Refunds might not be instantaneous in the banking sense (card refunds often take a day or more to appear), but we treat a refund as done from our system’s perspective once the acquirer confirms it.
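
The validation step of the refund flow can be sketched as below. The function name `validate_refund` is hypothetical, and tracking the cumulative amount already refunded (so that several partial refunds cannot exceed the original charge) is an assumption that extends the single-refund check described above.

```python
def validate_refund(
    original_amount: int,
    original_status: str,
    already_refunded: int,
    requested: int,
) -> int:
    """Validate a refund request; all amounts are in cents.

    Returns the amount that remains refundable after this refund,
    or raises ValueError if the request must be rejected.
    """
    if original_status != "succeeded":
        raise ValueError("only successful transactions can be refunded")
    if requested <= 0:
        raise ValueError("refund amount must be positive")
    if already_refunded + requested > original_amount:
        raise ValueError("refund exceeds the original amount")
    return original_amount - already_refunded - requested
```

For example, a 5,000-cent charge with 2,000 cents already refunded accepts a further refund of up to 3,000 cents and rejects anything larger.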

Step 5: Database Schema

Merchants

Stores merchant (business) account details. This table is relatively small and can be kept on a primary shard or globally.

Field Name	Data Type	Description
merchant_id (PK)	BIGINT	Unique merchant identifier (primary key).
name	VARCHAR(255)	Merchant’s business name.
email	VARCHAR(255)	Contact email (unique).
status	VARCHAR(50)	Account status (e.g., `active`, `suspended`).
created_at	TIMESTAMP	Timestamp when the merchant account was created.
available_balance	BIGINT	Current available balance for the merchant (in cents). Denormalized for quick access; updated via ledger entries.
pending_balance	BIGINT	Funds pending settlement (if applicable). Denormalized.

Indexes: Unique index on email for login/account lookup.
Sharding: Merchants table can remain unsharded or lightly partitioned (small size). Other tables use merchant_id to distribute data per merchant.

Customers

Stores end-customer profiles for each merchant (buyers who saved payment info or were charged).

Field Name	Data Type	Description
customer_id (PK)	BIGINT	Unique customer identifier.
merchant_id (FK)	BIGINT	Merchant who owns this customer. FK to `Merchants(merchant_id)`.
name	VARCHAR(255)	Customer name.
email	VARCHAR(255)	Customer email (could be NULL if not provided).
phone	VARCHAR(50)	Customer phone number.
created_at	TIMESTAMP	Profile creation timestamp.
updated_at	TIMESTAMP	Last update timestamp.
default_payment_method	BIGINT	(Optional) FK to default payment method in `PaymentMethods`.

Indexes: Index on (merchant_id, email) for quick lookup of customer by email per merchant (ensure unique per merchant).
Sharding: Partitioned by merchant_id (all customers of a merchant stored together).

Payment Methods

Stores tokenized payment details (cards, bank accounts) for customers.

Field Name	Data Type	Description
payment_method_id (PK)	BIGINT	Unique payment method identifier.
merchant_id (FK)	BIGINT	Owner merchant. FK to `Merchants(merchant_id)`.
customer_id (FK)	BIGINT	Customer who owns this payment method. FK to `Customers(customer_id)`.
type	VARCHAR(50)	Payment type (`card`, `bank_account`, etc).
token	VARCHAR(255)	Token/reference to payment info (e.g., vaulted card token).
card_brand	VARCHAR(50)	If type=card: Card network (Visa, Mastercard, etc).
card_last4	VARCHAR(10)	If type=card: Last 4 digits of card number.
card_exp_month	INT	If type=card: Expiration month.
card_exp_year	INT	If type=card: Expiration year.
bank_name	VARCHAR(100)	If type=bank: Bank name (optional).
bank_last4	VARCHAR(10)	If type=bank: Last 4 of bank account.
created_at	TIMESTAMP	When the payment method was added.
updated_at	TIMESTAMP	Last update timestamp.
is_active	BOOLEAN	Whether the payment method is active (not deleted/invalid).

Indexes: Index on (merchant_id, customer_id) to quickly fetch all payment methods of a customer; unique index on token (globally or per merchant) to prevent duplicate tokens.
Sharding: Sharded by merchant_id. Payment methods reside on the same shard as their customer.

Transactions

Core table for all payment transactions (charges, payments). This table is high-volume and critical.

Field Name	Data Type	Description
transaction_id (PK)	BIGINT	Unique transaction ID.
merchant_id (FK)	BIGINT	Merchant who received the payment. FK to `Merchants(merchant_id)`.
customer_id (FK)	BIGINT	Customer who made the payment. FK to `Customers(customer_id)`.
payment_method_id (FK)	BIGINT	Payment method used. FK to `PaymentMethods(payment_method_id)`.
amount	BIGINT	Transaction amount in cents (e.g., $10 = 1000 cents).
currency	VARCHAR(10)	Currency code (e.g., USD, EUR).
status	VARCHAR(50)	Transaction status (`pending`, `succeeded`, `failed`, etc).
type	VARCHAR(50)	Transaction type (`charge`, `auth`, `capture`, etc.).
description	VARCHAR(255)	Description or order info (optional).
reference_code	VARCHAR(100)	External reference (e.g., order ID from merchant system).
processed_at	TIMESTAMP	When the transaction was processed (authorized/captured).
settled_at	TIMESTAMP	When funds settled (if applicable, e.g., for ACH or delayed capture).
created_at	TIMESTAMP	Creation timestamp (initial request time).
updated_at	TIMESTAMP	Last update timestamp.

Indexes:
- Index on (merchant_id, created_at) for retrieving recent transactions per merchant (common query).
- Index on (merchant_id, status) to find pending or failed transactions quickly (for retries or review).
Sharding/Partitioning: Sharded by merchant_id – each merchant’s transactions are stored on a designated shard or partition. Within each shard, the table can be partitioned by date (e.g., by month or quarter) to optimize queries on date ranges and purge old data without affecting current data.
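
One way to route a row under this scheme is sketched below. Everything here is an assumption for illustration: the shard count, the hash-modulo mapping (a lookup table of designated shards would work equally well), and the monthly partition naming.

```python
import hashlib
from datetime import date

NUM_SHARDS = 16  # assumed shard count


def shard_for(merchant_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a merchant to a shard with a stable hash of merchant_id."""
    digest = hashlib.sha256(str(merchant_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition_for(created: date) -> str:
    """Name of the monthly partition holding a transaction (assumed naming)."""
    return f"transactions_{created.year}_{created.month:02d}"
```

Because the shard depends only on merchant_id, a merchant's transactions, ledger entries, and refunds land on the same shard and can share a single local ACID transaction.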

Transactional Integrity: Inserting a new transaction and updating balances/ledgers are done within a single ACID transaction to ensure all-or-nothing updates (money movement is never partially recorded).

Ledgers

Records financial entries for merchants – every credit or debit affecting a merchant’s balance (payments, refunds, chargebacks, payouts, fees). This provides an audit trail and running balance.

Field Name	Data Type	Description
ledger_id (PK)	BIGINT	Unique ledger entry ID.
merchant_id (FK)	BIGINT	Merchant to whom this ledger entry belongs. FK to `Merchants(merchant_id)`.
transaction_id	BIGINT	Related transaction (if applicable). FK to `Transactions(transaction_id)`.
type	VARCHAR(50)	Entry type: e.g., `payment_credit`, `refund_debit`, `chargeback_debit`, `payout_debit`, `fee_debit`, etc.
amount	BIGINT	Amount of this entry (in cents). Credits (incoming funds) are positive; debits (outgoing) are negative amounts or recorded separately by type.
currency	VARCHAR(10)	Currency (should match merchant’s transaction currency).
balance_after	BIGINT	Merchant’s balance after this entry was applied.
description	VARCHAR(255)	Description or reference (e.g., “Charge ID X”, “Payout to bank”, etc.).
created_at	TIMESTAMP	When the ledger entry was recorded.
settled_at	TIMESTAMP	If applicable, when the entry was settled (e.g., payout completion date).

Indexes:
- Index on (merchant_id, created_at) to retrieve ledger entries by merchant in chronological order (e.g., for statements).
- Index on (merchant_id, type) if querying specific types (e.g., all payouts for a merchant).
Sharding/Partitioning: Sharded by merchant_id, same shard as transactions for consistency. Could be partitioned by date as well (to efficiently query or archive older entries).
Data Integrity: Ledger entries are written within the same transaction as the corresponding Transactions/Refunds updates to maintain consistency (ensures balances are correct). The balance_after is denormalized for convenience so the current balance can be obtained by looking at the latest entry, rather than summing all entries. This speeds up balance queries at the cost of storing redundant data (which is safe with transaction consistency controls).
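
The all-or-nothing write described above can be sketched with Python's built-in sqlite3 module. This is a simplified illustration, not the production design: the tables carry only a subset of the schema's columns, `record_payment` and `current_balance` are hypothetical helpers, and SQLite stands in for the sharded relational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    merchant_id    INTEGER NOT NULL,
    amount         INTEGER NOT NULL,   -- cents
    status         TEXT    NOT NULL
);
CREATE TABLE ledgers (
    ledger_id      INTEGER PRIMARY KEY,
    merchant_id    INTEGER NOT NULL,
    transaction_id INTEGER,
    type           TEXT    NOT NULL,
    amount         INTEGER NOT NULL,   -- cents; debits are negative
    balance_after  INTEGER NOT NULL
);
""")


def current_balance(conn, merchant_id):
    """Read the balance from the latest ledger entry instead of summing."""
    row = conn.execute(
        "SELECT balance_after FROM ledgers WHERE merchant_id = ? "
        "ORDER BY ledger_id DESC LIMIT 1",
        (merchant_id,),
    ).fetchone()
    return row[0] if row else 0


def record_payment(conn, merchant_id, amount_cents, fee_cents):
    """Insert the transaction and its ledger entries atomically."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute(
            "INSERT INTO transactions (merchant_id, amount, status) "
            "VALUES (?, ?, 'succeeded')",
            (merchant_id, amount_cents),
        )
        txn_id = cur.lastrowid
        balance = current_balance(conn, merchant_id)
        for entry_type, delta in (
            ("payment_credit", amount_cents),
            ("fee_debit", -fee_cents),
        ):
            balance += delta
            conn.execute(
                "INSERT INTO ledgers (merchant_id, transaction_id, type, "
                "amount, balance_after) VALUES (?, ?, ?, ?, ?)",
                (merchant_id, txn_id, entry_type, delta, balance),
            )
    return txn_id


record_payment(conn, 1, 10_000, 320)
print(current_balance(conn, 1))  # 9680 (10,000 credited, 320 fee debited)
```

If any statement inside the `with conn:` block fails, the transaction row is rolled back along with the ledger entries, so a payment is never recorded without its balance effect.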

Refunds

Tracks refunds issued for transactions.

Field Name	Data Type	Description
refund_id (PK)	BIGINT	Unique refund identifier.
merchant_id (FK)	BIGINT	Merchant who issued the refund. FK to `Merchants(merchant_id)`.
transaction_id (FK)	BIGINT	The original transaction being refunded. FK to `Transactions(transaction_id)`.
amount	BIGINT	Refunded amount (in cents).
status	VARCHAR(50)	Refund status (`pending`, `succeeded`, `failed`).
reason	VARCHAR(255)	Reason for refund (customer request, product return, etc.).
created_at	TIMESTAMP	When the refund was initiated.
processed_at	TIMESTAMP	When the refund was completed (money actually refunded).

Indexes: Index on transaction_id to quickly find refunds for a given transaction. Also index on (merchant_id, created_at) to list recent refunds per merchant.
Sharding: Sharded by merchant_id (same shard as the original transaction).

Webhooks

Stores outgoing webhook events to notify merchants of relevant events (e.g., a transaction succeeded, a refund completed).

Field Name	Data Type	Description
webhook_id (PK)	BIGINT	Unique webhook event ID.
merchant_id (FK)	BIGINT	Merchant that should receive the webhook. FK to `Merchants(merchant_id)`.
event_type	VARCHAR(100)	Type of event (e.g., `transaction.succeeded`, `refund.created`).
event_data	TEXT	Payload data (JSON or serialized) relevant to the event.
status	VARCHAR(50)	Delivery status (`pending`, `sent`, `failed`).
attempts	INT	Number of delivery attempts made.
next_retry_at	TIMESTAMP	Next scheduled retry time if last attempt failed.
created_at	TIMESTAMP	When the event was generated.
delivered_at	TIMESTAMP	When the event was successfully delivered (if at all).

Indexes:
- Index on (merchant_id, status) to find all pending webhooks for a merchant.
- Index on next_retry_at for scheduling retries (find the next events due).
Sharding: Sharded by merchant_id (to keep events with related data). However, dispatching systems may also aggregate across shards for processing all pending webhooks.
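
The `attempts` and `next_retry_at` columns suggest a retry scheduler along the following lines. This is a sketch under assumptions: exponential backoff is one possible policy, and the base delay and attempt cap are illustrative values chosen so that retries span about an hour.

```python
from datetime import datetime, timedelta, timezone

MAX_ATTEMPTS = 5                     # assumed cap on delivery attempts
BASE_DELAY = timedelta(minutes=4)    # assumed delay after the first failure


def schedule_retry(attempts: int, now: datetime):
    """Decide what happens after a failed delivery attempt.

    `attempts` counts deliveries tried so far, including the one that
    just failed. Returns (status, next_retry_at).
    """
    if attempts >= MAX_ATTEMPTS:
        return "failed", None
    # Exponential backoff: 4, 8, 16, 32 minutes after attempts 1 to 4.
    delay = BASE_DELAY * (2 ** (attempts - 1))
    return "pending", now + delay


now = datetime.now(timezone.utc)
status, next_retry_at = schedule_retry(1, now)
```

A dispatcher would then poll for rows where status is `pending` and `next_retry_at` has passed, which is what the index on next_retry_at supports.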

Event Logs

Captures all major system events for audit and troubleshooting (this could include security events, system errors, or high-level actions).

Field Name	Data Type	Description
event_id (PK)	BIGINT	Unique event log ID.
merchant_id	BIGINT	Merchant related to the event (nullable if global event).
event_type	VARCHAR(100)	Type of event (e.g., `transaction.created`, `refund.processed`, `login`, `api_call`).
event_details	TEXT	Details about the event (could be JSON or message text).
created_at	TIMESTAMP	Timestamp of the event.
user_id	BIGINT	(Optional) User or staff who triggered the event (if applicable).
source	VARCHAR(50)	Source of event (`system`, `merchant_portal`, `API`, etc.).

Indexes:
- Index on (merchant_id, created_at) for retrieving events per merchant chronologically.
- Index on event_type for filtering specific events (e.g., all api_call events for monitoring).
Sharding/Partitioning: Could be partitioned by date (since this table can grow very large). If events are mostly tied to merchants, shard by merchant_id; otherwise, a centralized logging service or separate database might handle this.
Note: This table may accumulate a huge number of rows (especially if logging every API call or transaction event). In practice, one might move this to a separate logging system or use a time-series DB. Here, we include it for completeness, stored with ACID compliance (useful for audit trails).

Step 6: Detailed Component Design

Now we’ll explore key components in detail.

6.1 API Endpoints

POST /v1/charges – create a charge
GET /v1/charges/{id} – retrieve charge status
POST /v1/charges/{id}/refund – create a refund (alternatively, a separate POST /v1/refunds endpoint)
GET /v1/balance – get current merchant balance
GET /v1/transactions?filter... – list transactions (possibly paginated, etc.)
Additional endpoints cover saving cards, managing customers, etc.

These all funnel through the gateway to appropriate internal handlers.

6.2 Payment Service (Core Orchestration)

This service is the brain of the operation. We can consider it as a payment orchestration layer that coordinates between the database, external services, and internal auxiliary services.

Internal Structure: We could design the Payment Service itself in a modular way or even as multiple microservices. For instance, some organizations would split the responsibilities: an Orchestrator service, a Connector service for external calls, etc. For clarity, we’ll discuss it as one service with distinct sub-components/tasks:

Request Handler: This is the part that receives the API call (once passed through the gateway). If using a web framework, this is the controller that handles /charges. It calls other components or modules to perform the steps like idempotency check, etc. It can use a thread pool or async event loop to handle many concurrent requests.
Idempotency Handler: When a request with an idempotency key comes in, the service will check a fast storage (could be an in-memory cache or a Redis cluster, or our DB) to see if the key exists. We might choose to use Redis for quick global lookup of idempotency keys (with entries expiring after, say, 24 hours or 7 days to limit storage). Alternatively, a DB table with a unique index on (merchant_id, idem_key) could serve the same purpose – insertion will fail if duplicate (so we know it’s a retry) and we can then fetch the result.
Business Logic: The service implements rules like “is the currency supported?”, “does the amount exceed the merchant’s processing limit?”, “is this merchant active?”, etc., to validate before processing. Then it proceeds through the steps: contact fraud service, vault, etc. If any step returns an error or disallows, it handles that by failing the transaction with appropriate status.
External Communication: The Payment Service will call out to the external payment network via an integration module. We’ll likely implement this with a robust HTTP client or an SDK provided by the acquirer. We must handle timeouts and error responses carefully (with retries, or by marking the transaction as being in an uncertain state). External calls should run on an async client or a separate worker thread so they do not block the request-handling thread.
Database Operations: Using a Data Access Layer, the Payment Service will create/update records. For example, when a charge request comes in, we might immediately create a transaction record in the DB with status “pending” (to reserve an ID and have a record in case of failure mid-way). Then, after the external auth, we update it to “succeeded” or “failed”. This helps if something goes wrong after the external call – we still have a record of the attempt. Each of these steps should run in its own short DB transaction: the “pending” row must be committed before the external call so that it survives a crash, and a DB transaction should not be held open while we wait on the acquirer.
Concurrent Handling: The Payment Service must be thread-safe and must handle the same customer or merchant running multiple transactions in parallel. Generally, each request is independent, but something like updating a merchant’s balance concurrently could cause race conditions. In a relational DB, we can rely on row-level locking – e.g., updating a balance with balance = balance + X safely within a transaction. Or we use a separate append-only ledger that just inserts entries (which avoids contention on a single balance row). We will want to avoid coarse-grained locking in the service; it is better to let the DB handle it or to design the data model to minimize conflicts.
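The DB-table variant of the Idempotency Handler described above can be sketched as follows. This is a minimal illustration using SQLite from Python's standard library, not a production design; the table name, the charge_once helper, and the process callback are all hypothetical names introduced for the example.

```python
import sqlite3

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE idempotency_keys (
               merchant_id TEXT NOT NULL,
               idem_key    TEXT NOT NULL,
               response    TEXT,   -- stored result; NULL while the first attempt is in flight
               UNIQUE (merchant_id, idem_key)
           )"""
    )
    return conn

def charge_once(conn, merchant_id, idem_key, process):
    """Run process() at most once per (merchant_id, idem_key). Hypothetical helper."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO idempotency_keys (merchant_id, idem_key) VALUES (?, ?)",
                (merchant_id, idem_key),
            )
    except sqlite3.IntegrityError:
        # The unique index rejected the insert, so this request is a retry.
        row = conn.execute(
            "SELECT response FROM idempotency_keys WHERE merchant_id = ? AND idem_key = ?",
            (merchant_id, idem_key),
        ).fetchone()
        return row[0]  # None means the original request has not finished yet

    result = process()  # the actual charge logic, supplied by the caller
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET response = ? WHERE merchant_id = ? AND idem_key = ?",
            (result, merchant_id, idem_key),
        )
    return result
```

A real handler would also need to decide what to return while the first attempt is still in flight (for example, a conflict response) and how to expire old keys; both are left out of this sketch.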

Data Model & Storage Choice: For core transactional data, a relational database is a solid choice due to its ACID guarantees. Financial transactions require atomicity and durability. A SQL database (like Postgres or MySQL) can enforce constraints (like unique idempotency keys, referential integrity between a charge and a refund record, etc.) and can do multi-row transactions easily.
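To make the constraint idea concrete, here is a small sketch of a charge/refund schema with referential integrity, again using SQLite for illustration. The table layout, column names, and the refund_charge helper are assumptions made for this example rather than a prescribed schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE charges (
    id          TEXT PRIMARY KEY,
    merchant_id TEXT NOT NULL,
    amount      INTEGER NOT NULL CHECK (amount > 0),  -- minor units, e.g. cents
    status      TEXT NOT NULL CHECK (status IN ('pending', 'succeeded', 'failed'))
);
CREATE TABLE refunds (
    id        TEXT PRIMARY KEY,
    charge_id TEXT NOT NULL REFERENCES charges(id),
    amount    INTEGER NOT NULL CHECK (amount > 0)
);
"""

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces REFERENCES only when enabled
    conn.executescript(SCHEMA)
    return conn

def refund_charge(conn, refund_id, charge_id, amount):
    """Insert a refund only if the charge succeeded and is not over-refunded."""
    with conn:  # everything below commits or rolls back as one unit
        row = conn.execute(
            "SELECT amount, status FROM charges WHERE id = ?", (charge_id,)
        ).fetchone()
        if row is None or row[1] != "succeeded":
            raise ValueError("charge is not refundable")
        refunded = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM refunds WHERE charge_id = ?",
            (charge_id,),
        ).fetchone()[0]
        if refunded + amount > row[0]:
            raise ValueError("refund exceeds charge amount")
        conn.execute(
            "INSERT INTO refunds (id, charge_id, amount) VALUES (?, ?, ?)",
            (refund_id, charge_id, amount),
        )
```

On a multi-writer database such as Postgres or MySQL, the read of the charge row would typically also lock it (for example with SELECT ... FOR UPDATE) so that two concurrent refunds cannot both pass the check.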

Synchronous vs Asynchronous Processing: The Payment Service does most steps synchronously to provide a result in the API response. We choose sync for the primary flow because merchants (and customers) expect an immediate answer to a payment attempt. However, behind the scenes we use asynchronous processing for things that need not block the customer’s request: emailing receipts, notifying external systems, etc., are done async via events after the main flow commits.

Card Vault Service: Let’s detail this component since it’s critical for security:

It has its own secure database: an encrypted store, with the encryption keys typically held in an HSM (Hardware Security Module). Data is encrypted with strong keys; even DBAs can’t read it directly. When storing a card, the service generates a random token (a UUID, or a prefixed format such as “card_abcdef12345”, similar to Stripe’s) and associates it with the encrypted data. Only the vault service can decrypt.
When retrieving, the Payment Service provides the token and some auth (the vault will ensure the requesting service is allowed and perhaps that this token belongs to that merchant or is globally unique anyway). The vault returns the card info (PAN: Primary Account Number, expiry, maybe cardholder name if stored).
This service should be minimal: store and retrieve. It might also support deleting a card (if a customer wants their data removed).
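The store/retrieve/delete surface of the vault can be sketched as below. This is a toy in-memory model under stated assumptions: the CardVault class and its method names are hypothetical, and encryption is represented by injected encrypt/decrypt callables because a real vault would delegate that to an HSM-backed key service rather than implement it inline.

```python
import secrets

class CardVault:
    """Toy in-memory vault mapping opaque tokens to encrypted card records."""

    def __init__(self, encrypt, decrypt, allowed_callers):
        self._encrypt = encrypt            # assumed: bytes -> bytes, key held elsewhere
        self._decrypt = decrypt            # assumed inverse of encrypt
        self._allowed = set(allowed_callers)
        self._store = {}                   # token -> (merchant_id, ciphertext)

    def store(self, merchant_id, pan, expiry):
        # The token is random, so it carries no information about the PAN.
        token = "card_" + secrets.token_hex(12)
        record = f"{pan}|{expiry}".encode()
        self._store[token] = (merchant_id, self._encrypt(record))
        return token

    def retrieve(self, caller, merchant_id, token):
        if caller not in self._allowed:
            raise PermissionError("caller may not read card data")
        owner, ciphertext = self._store[token]
        if owner != merchant_id:
            raise PermissionError("token does not belong to this merchant")
        pan, expiry = self._decrypt(ciphertext).decode().split("|")
        return {"pan": pan, "expiry": expiry}

    def delete(self, token):
        # Removing the record makes the token permanently unusable.
        self._store.pop(token, None)
```

Keeping the interface this small is what limits the amount of code and infrastructure that ever touches raw card data.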

External Payment Integration Service:

If we design this as a separate microservice (say call it “Acquirer Connector”), it would abstract the details of talking to different payment networks. It might have methods like authorizeCard(cardData, amount, merchantAccount). Under the hood, it formats the request according to the acquirer’s API spec, handles sending it over, and parses the response. If we connect to multiple providers, this service could have a routing logic: e.g., if card is Amex, maybe use a different processing path. Or if region is EU, use our EU acquirer vs US acquirer. That logic can be encapsulated here, keeping Payment Service simpler.
However, adding an extra network hop (Payment Service -> Connector Service -> external) might add a bit of latency. It’s a trade-off: modularity vs latency. If in the same data center, an RPC call is maybe 1-2ms, which is negligible compared to the external call. So it’s fine.
This service should also implement retry logic for external calls carefully. If a call times out, should it retry? It must be careful not to double-charge. Typically, you’d only retry if you are sure the first attempt didn’t go through. Often a better approach is to query the status (if the API offers that) rather than blindly resubmit. Some acquirers offer an operation to look up a transaction by its idempotency key or reference, to see whether it was processed. If it was not, then you can retry. If no such API exists, we can rely on our idempotency keys and unique transaction IDs so that the acquirer rejects a true duplicate anyway. We might implement limited retries for transient network errors. We should also use a circuit breaker: if the acquirer’s endpoint is down, then after a few failures we stop sending requests for a short time, which allows a fallback path or at least avoids wasting time on calls that will fail.
If an external integration needs to be asynchronous (some gateways might send a callback later), then this service would handle that complexity (storing a pending state and awaiting a callback or polling a queue).
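The circuit breaker mentioned above could look roughly like this. It is a minimal, single-threaded sketch; the CircuitBreaker class, its parameter names, and the default thresholds are illustrative assumptions, and a production version would need locking and metrics.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self._clock = clock                         # injectable for tests
        self._failures = 0
        self._opened_at = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("acquirer circuit is open")
            # Cool-down elapsed: allow a trial call through (half-open state).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The connector would wrap each acquirer call, for example breaker.call(send_authorization, request), where send_authorization is whatever function performs the external request. A CircuitOpenError can then trigger routing to a fallback acquirer or an immediate error to the caller.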

6.3 Retryable vs. Non-Retryable Errors

Retryable Errors are issues that might be resolved if tried again. These fall into two categories:

Transient Failures: Temporary glitches that often succeed on retry. Examples include network timeouts, connection resets, or a payment processor being momentarily unavailable. These faults are often self-correcting, so repeating the action after a short delay can succeed. Rate-limited requests (e.g. HTTP 429 Too Many Requests) also fall here – waiting and retrying later may work once the limit resets.
Soft Declines: Payment failures due to temporary customer or bank issues, which may succeed upon retry. Common cases are insufficient funds, a temporarily blocked card, or a failed authentication step such as 3D Secure that the customer can resolve and attempt again. Soft declines are considered reversible. By commonly cited industry estimates, roughly 80–90% of declines are soft and can be retried once the underlying issue is fixed (e.g. the customer adds funds or unblocks their card).

Non-Retryable Errors are failures where retries won’t help, so the transaction should be considered a permanent failure:

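Hard Declines: Permanent authorization failures such as stolen or canceled cards, invalid card details, or closed accounts. These are irreversible and should not be retried. For example, a “card reported stolen” or “account closed” response is a final decision from the issuer. (A generic “do not honor” response is ambiguous; many processors treat it as a soft decline, so it should be classified per acquirer guidance rather than assumed to be final.)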
Compliance or Fraud Rejections: If the payment is flagged for fraud or violates compliance rules, the system should not attempt it again without changes. Retrying could risk further flags or legal issues.
Merchant Configuration Errors: Cases like an invalid API request, currency mismatch, or missing configuration are logic errors that must be fixed (retrying the same request won’t succeed until the issue is resolved by the merchant/developer).
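One way to encode the categories above is a lookup table from error code to retry class. The codes below are hypothetical internal names invented for this sketch; real acquirer and network codes differ by provider and would be mapped onto these classes in the connector.

```python
from enum import Enum

class RetryClass(Enum):
    TRANSIENT = "transient"          # retry soon, with backoff
    SOFT_DECLINE = "soft_decline"    # retry later, once the customer or bank issue clears
    NON_RETRYABLE = "non_retryable"  # permanent failure, do not retry

# Hypothetical internal error codes, for illustration only.
_CLASSIFICATION = {
    "network_timeout": RetryClass.TRANSIENT,
    "connection_reset": RetryClass.TRANSIENT,
    "rate_limited": RetryClass.TRANSIENT,
    "insufficient_funds": RetryClass.SOFT_DECLINE,
    "card_temporarily_blocked": RetryClass.SOFT_DECLINE,
    "authentication_failed": RetryClass.SOFT_DECLINE,
    "card_reported_stolen": RetryClass.NON_RETRYABLE,
    "account_closed": RetryClass.NON_RETRYABLE,
    "fraud_rejected": RetryClass.NON_RETRYABLE,
    "invalid_request": RetryClass.NON_RETRYABLE,
}

def classify(error_code):
    # Unknown codes default to non-retryable so they are investigated
    # rather than retried blindly.
    return _CLASSIFICATION.get(error_code, RetryClass.NON_RETRYABLE)

def is_retryable(error_code):
    return classify(error_code) is not RetryClass.NON_RETRYABLE
```

Defaulting unknown codes to non-retryable is a conservative choice made for this sketch; a system that sees many unmapped transient codes might prefer a single cautious retry instead.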

Retry Strategies

Here’s a concise overview of the common retry strategies:

Fixed Interval
- Retries happen at a constant time interval (e.g., every 2 minutes).
- Simple to implement, but doesn’t adapt to changing load or error conditions.
Exponential Backoff
- Each retry increases the wait time exponentially (e.g., 1 min, then 2, 4, 8...).
- Gives downstream systems time to recover from overload or transient failures.
Linear Backoff
- The delay increases by a fixed increment each time (e.g., retry after 5 min, then 10, then 15...).
- Less aggressive than exponential but still avoids constant hammering at short intervals.
Jitter-Based (Randomized) Backoff
- Adds randomness to the delay (e.g., wait = base_delay * (1 + random factor)).
- Helps prevent synchronized retry “storms” when multiple clients fail at once.
Delayed/Deferred Queues
- Schedules retries for a specific future time rather than immediately (e.g., next day for “insufficient funds”).
- Useful for “soft decline” scenarios where waiting a longer period is more likely to succeed.

Often, real systems combine these strategies. For example, an initial quick retry (after a few seconds) in case of transient network blips, then use exponential backoff with jitter for subsequent attempts, and for certain decline reasons like “insufficient funds,” schedule a much later retry attempt (e.g. 24 hours later).
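The delay calculations behind these strategies are small enough to sketch directly. The function names, default intervals, and the next_retry_delay policy below are assumptions chosen to mirror the examples in the text, not recommended production values. Delays are in seconds and attempt numbers start at 1.

```python
import random

def fixed_delay(attempt, interval=120.0):
    # Same wait every time, e.g. every 2 minutes.
    return interval

def linear_delay(attempt, step=300.0):
    # 5 min, 10 min, 15 min, ...
    return step * attempt

def exponential_delay(attempt, base=60.0, cap=3600.0):
    # 1 min, 2, 4, 8, ... capped so waits do not grow without bound.
    return min(cap, base * 2 ** (attempt - 1))

def with_jitter(delay, factor=0.5, rng=random):
    # wait = base_delay * (1 + random factor), with the factor drawn from [0, factor].
    return delay * (1 + rng.uniform(0, factor))

def next_retry_delay(attempt, error_class, rng=random):
    """Combined policy: one quick retry, then exponential backoff with jitter;
    soft declines are deferred for a day."""
    if error_class == "soft_decline":
        return 24 * 3600.0
    if attempt == 1:
        return 5.0
    return with_jitter(exponential_delay(attempt - 1), rng=rng)
```

For example, next_retry_delay(1, "transient") returns 5.0, while the second transient retry lands somewhere between 60 and 90 seconds depending on the random draw.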

Failure Handling After Exhausting Retries

If all retry attempts are exhausted and the transaction still fails, the system should gracefully handle the permanent failure:

Mark Transaction as Failed: Sets the transaction status to a final “FAILED,” indicating no more automatic attempts will occur.
Notify Merchant/Customer: Sends a webhook or email notification explaining the final failure status.
Log for Audit & Analytics: Records all retry attempts, timestamps, and error codes for future reference.
Dead Letter Queue (DLQ): Optionally places the permanently failed transaction into a specialized queue if manual investigation or corrective actions are needed (e.g., an unknown error code from a payment processor that requires deeper follow-up).
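The exhaustion handling above can be sketched as a single retry driver. Everything here is illustrative: the transaction is a plain dict, and attempt_fn, notify, is_retryable, and needs_review are hypothetical callables supplied by the caller. The waits between attempts are omitted to keep the example short.

```python
import time

def run_with_retries(txn, attempt_fn, max_attempts, notify, dead_letter,
                     is_retryable, needs_review):
    """Try attempt_fn up to max_attempts times, then finalize the transaction.

    attempt_fn(txn) returns None on success or an error code string on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    txn["attempts"] = []  # audit trail: every attempt with timestamp and error code
    error = None
    for attempt in range(1, max_attempts + 1):
        error = attempt_fn(txn)
        txn["attempts"].append({"attempt": attempt, "at": time.time(), "error": error})
        if error is None:
            txn["status"] = "SUCCEEDED"
            return txn
        if not is_retryable(error):
            break
        # A real worker would sleep or re-enqueue with a delay before the next attempt.
    txn["status"] = "FAILED"      # final state: no more automatic attempts
    notify(txn)                   # e.g. send a webhook or email
    if needs_review(error):
        dead_letter.append(txn)   # park for manual investigation
    return txn
```

In this sketch dead_letter is any object with an append method, which stands in for publishing to a real dead letter queue.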

6.4 Asynchronous Workflow and Communication

As noted, our design uses asynchronous messaging for certain tasks to improve throughput and decouple services. Let’s clarify how and where we use asynchronous processes:

Event Bus (Kafka or similar): After a transaction is processed, publishing an event (with all necessary details) allows other subscribers to react. This follows an event-driven architecture where producers of events don’t need to know who will consume them, reducing coupling. For example, the Payment Service just produces “PaymentSucceeded” event; the Webhook and Notification services independently handle it. This makes the Payment Service code simpler (it doesn’t have to call those services directly or wait for them). It also means if one of those services is down, it doesn’t block payments – the event will be in the queue and can be processed when they recover, improving fault tolerance.
Webhook Delivery Retries: The Webhook Service’s usage of asynchronous retry is notable: it likely has an internal schedule or queue for pending webhooks. If a webhook fails, it schedules a retry a few minutes later, and so on (exponential backoff). This is internal to that service, but it’s another async mechanism.

Consistency Consideration: Whenever we introduce asynchrony, we have to think about consistency. For example, if the Payment Service commits a transaction and then publishes an event, what if the event publish fails after the DB commit? We could end up with a transaction that is never announced. One solution is to use the DB as the source of truth and have a separate process scan for new transactions and emit events (which makes event publishing retryable, provided consumers tolerate duplicates). Another is the transactional outbox pattern: write the event to a table in the same DB transaction as the business change, and have an event relay service read from that table and publish to Kafka, so that no event is lost if a failure occurs between the commit and the publish. Given the complexity, we’ll assume robust event publishing, most likely via the outbox approach; Kafka’s transactional feature on its own covers only writes within Kafka, not the database commit.
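A minimal sketch of the outbox idea follows, using SQLite and a caller-supplied publish function in place of a real Kafka producer. The table names and the mark_succeeded and relay_once helpers are assumptions made for the example.

```python
import json
import sqlite3

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE transactions (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            seq       INTEGER PRIMARY KEY AUTOINCREMENT,
            payload   TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)
    return conn

def mark_succeeded(conn, txn_id):
    # The business row and the event row commit together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO transactions (id, status) VALUES (?, 'succeeded')", (txn_id,)
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "PaymentSucceeded", "transaction_id": txn_id}),),
        )

def relay_once(conn, publish):
    """Publish pending events in order; mark each one only after publish() returns."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))  # may raise; the row then stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

If the relay crashes after publishing but before marking the row, the event is sent again on the next pass. Delivery is therefore at-least-once, and consumers such as the Webhook Service should deduplicate on the transaction ID.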

Step 7: Scalability and Performance Strategies

Horizontal Scaling: All stateless services (API Gateway, Payment Service, Fraud Service, Webhook Service, etc.) run in a cluster and can scale out by adding more instances behind load balancers. During peak loads (e.g., Black Friday sales), auto-scaling can spin up additional instances to handle increased TPS.
Database Sharding and Replication: Partition the Transaction Database by merchant or region to spread write load. Use read replicas to serve heavy read queries (e.g., for generating reports) without impacting writes. Ensure replication lag is minimal to keep data fresh across replicas.
Caching: Implement caching for frequently accessed data and ephemeral states. For example, use Redis or an in-memory cache for storing recent transaction statuses, exchange rates, or idempotency keys (to quickly detect duplicate requests).
Idempotency Keys: Use unique identifiers for each payment request so that retries (due to network issues or client timeouts) don’t result in double charges. The Payment Service can store recent idempotency keys in a cache with a bounded TTL (e.g., the 24-hour window mentioned earlier) to quickly detect repeats, with the database’s unique constraint remaining the authoritative check.
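For the sharding-by-merchant strategy above, routing can be as simple as a stable hash of the merchant ID. The shard_for function is a hypothetical helper for illustration; it uses SHA-256 because Python's built-in hash() for strings is salted per process and would not route consistently across instances.

```python
import hashlib

def shard_for(merchant_id, num_shards):
    """Map a merchant to a shard index in [0, num_shards) using a stable hash."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo hashing remaps most merchants when the shard count changes, so a system expecting to add shards would more likely use consistent hashing or a merchant-to-shard directory table.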
