Knowledge Guide
HomeSystem DesignSystem Design Problems

medium Designing Code Judging System

Image
Image

Step 1: System Definition

An online coding judge platform (similar to LeetCode) is a web-based system where users solve programming problems by writing code and receiving automated feedback. The primary purpose is to evaluate user-submitted code against predefined test cases in a secure, controlled environment and return the results quickly. Key entities in this system include:

In summary, this system enables millions of users to practice coding problems and get instant verdicts on their solutions in various programming languages, while ensuring each submission runs safely in isolation.

Key Entities
Key Entities

Step 2. Requirements Analysis

Functional Requirements

  1. Problem Catalog & Details: Users can browse a list of coding problems and view individual problem details (description, input/output format, constraints, example cases).

  2. Code Submission: Users can write and submit code for a chosen problem in multiple programming languages (e.g., Python, Java, C++, JavaScript, Go, C#). The system must accept the code, associate it with the problem and user, and queue it for evaluation.

  3. Code Execution & Evaluation: The system compiles the code if needed (for compiled languages) and runs it against the problem’s test cases. It enforces runtime limits (time per test) and memory limits during execution. The execution for a single submission’s test cases will be sequential (no parallelization of test cases for a submission), but the system will handle many different user submissions concurrently.

  4. Result Feedback: Within a few seconds of submission, the user receives feedback. The feedback includes whether the solution passed all tests or failed (with error type: e.g., compilation error, wrong answer on a specific test, time limit exceeded, memory exceeded, runtime error). Ideally, results are delivered in near “real-time” (e.g,. ~3–4 seconds) for a smooth user experience.

  5. Competitive Programming Features: There should be support for timed contests:

    • Ability to create contests with a set of problems and a fixed duration.
    • Leaderboards: During a contest, the system tracks each participant’s score (for example, number of problems solved and time taken) and provides a live leaderboard that updates as submissions are judged. Users can view rankings in real-time or near real-time.
    • Rankings and Metrics: After contests, results are stored (who solved how many, rankings). The system may also have global ranking or user skill level metric based on contest performance or number of problems solved.
  6. Additional Features: (Optional scope) Users should be able to see their submission history and problem discussion or editorial. In a contest scenario, the system should handle scoring and ranking users, but the core design will focus on the submission judge functionality.

Non-Functional Requirements

Step 3. Capacity Estimation

To design for “web-scale,” we estimate the expected load and data volumes:

These estimations guide us in designing a system that can scale horizontally. We anticipate needing dozens of application servers and a scalable cluster of worker machines for code execution to handle peak loads, along with a high-capacity database or a partitioned data store to hold problems and submissions.

The system will adopt a distributed, modular architecture with clear separation of concerns. At a high level, it consists of the following major components:

1. Client Interface (Presentation Layer): The user interface (web browser or mobile app) provides problem listings, problem detail pages with an editor, and views for results and leaderboards. When a user writes code and hits “Run” or “Submit,” the client will send an HTTP request (or WebSocket message) to the backend. For example, a POST request to submit code for a problem. The client might then display a loading indicator while waiting for results. (We will use an asynchronous pattern: the client gets a submission token and polls for result, but from the user’s perspective, it’s still a quick turnaround.)

2. API Gateway and Load Balancer: All incoming requests go through a load balancer (e.g., Nginx, AWS ELB) or API Gateway. This ensures requests are distributed across multiple instances of our API servers and can also handle concerns like SSL termination and rate limiting. We might deploy API servers in multiple regions or availability zones for reliability. The API Gateway can also handle authentication tokens (verifying user identity) before forwarding requests.

3. API Servers / Backend Services: These are the stateless application servers that implement the REST (or GraphQL) API endpoints. We could structure them as microservices or as a single monolithic backend that internally calls modules. Given the scale, a microservices approach is beneficial to scale and develop independently. Key backend services include:

All these services should be stateless (they don’t keep session info in memory between requests). State (like user sessions, submission data, etc.) is stored in databases or caches, so any instance can serve any request. This allows horizontal scaling by simply adding more instances behind the load balancer.

4. Cache Layer: To improve read performance and reduce load on databases, an in-memory cache (e.g., Redis or Memcached) can be used. Frequently accessed data like the list of problems, problem details, or even recently fetched test cases can be cached. This is especially useful for hot contests where many users request the same problem data simultaneously. Additionally, caching recent submission results or statuses might help if users poll frequently for updates.

High-level Design
High-level Design

Data Flow: A typical request flow for code execution is:

  1. Submit Phase: A user submits code from the frontend. The Submission Service receives the submission, assigns it an ID, and stores a record (status = “pending”) in the submissions table. It then publishes a message to the Submission Queue with details (submission ID, problem ID, language, etc.). The user is immediately informed that the submission was received (the frontend might start polling for result using the submission ID).
  2. Execution Phase: A Code Execution worker subscribed to the queue picks up the job. It uses the problem ID to fetch the problem’s test cases (from the database or a file store). It then launches a sandbox environment (e.g., a Docker container or a secure process jail) for the submission. Inside the sandbox, the worker prepares the source code: if it’s a compiled language (C++, Java, Go, C#), it invokes the compiler (capturing compilation errors if any); for interpreted languages (Python, JavaScript), it may skip straight to execution or do a syntax check. After a successful compile, the worker runs the program against each test case input sequentially, enforcing the time and memory limits for each run. The output from the code is compared with the expected output for each test. If any test fails (wrong output, time limit exceeded, etc.), the worker can decide to stop further testing (typically, judges stop at the first failing test to save time). It then marks the submission’s result (e.g., “Wrong Answer on test 3” or “Time Limit Exceeded on test 7” or “Accepted” if all passed). Throughout this process, the worker monitors execution time and memory.
  3. Result Phase: The worker updates the submission record in the database with the result status, runtime, and memory usage stats, and possibly the output or error message if needed (e.g., for compile errors or runtime errors, the error message is saved for the user to view). The worker may also push the result to a lightweight notification service or cache to immediately inform the waiting frontend. The frontend either receives a push notification or more commonly, the next time it polls the getSubmissionStatus API, it will see the updated result. The user’s interface then displays the outcome (and any error details). The entire cycle typically completes within a few seconds, thanks to the parallelism and efficiency of the components.

This architecture is asynchronous and decoupled, which is key to scaling. The API servers (Submission Service) never execute code directly (they remain responsive to handle user interactions), and the heavy lifting is done by the background worker cluster. The use of a queue and separate workers adds a slight complexity (and a tiny delay in handing off jobs), but it dramatically improves reliability and throughput. If the submission rate spikes, jobs will queue up and workers will work through them; the system can auto-scale workers if the queue grows. If a worker machine crashes mid-execution, the job message can be re-queued or picked up by another worker (ensuring no submission is lost). This design also isolates failures – a crash in a sandbox or worker doesn’t bring down the main site. We also ensure components (API, queue, workers, DB) can be distributed across multiple servers and zones for high availability.

Step 5. Data Storage and Indexing

Problem Data: Problems can be stored in a database optimized for read performance. A NoSQL document store (like DynamoDB or MongoDB) is a good choice to store problem statements along with an array of test cases. Each problem entry contains fields like id, title, description, difficulty, allowed languages, and a list of test cases (each with input and expected output). This schema allows retrieving everything needed to evaluate that problem in one go. We index the id (primary key) and perhaps attributes like difficulty or tags to support filtering (e.g., list all easy problems). Because problem data changes rarely, we can also leverage caching and CDN distribution (for static content like images in problem descriptions). A search service (like Elasticsearch) could be added if we need text search on problems, but that’s optional.

Submission Data: Submissions (attempts by users) are stored for record-keeping, debugging, and features like user submission history or leaderboards. Each submission record includes userId, problemId, timestamp, code (or a reference to stored code), selected language, and the result (status, time used, memory used, etc.). This can be stored in a relational database (which makes querying easy, e.g., “find all submissions of user X for problem Y”). We need indexes on userId and problemId for retrieval. The volume of submission data can be high (imagine millions of submissions), so partitioning will be needed. We can partition by problem or by time (e.g., archive old submissions) to keep tables manageable. For global deployment, a single distributed database (e.g., CockroachDB or Cosmos DB) could keep submissions in sync worldwide, or we could isolate submissions per region and aggregate as needed. Consistency isn’t as critical for submissions except in competitions where a centralized leaderboard is needed (in that case, we ensure all regions report to one leaderboard service).

Test Cases Storage: Large test input/output files (for big problems) might be stored in an object storage (like S3 or HDFS) rather than in the database, with the problem entry containing pointers. This way, the execution engines can stream test files directly. We should store expected outputs securely (not exposed to users) to compare results. To speed up judging, these test cases could be cached on the execution servers or in a distributed filesystem accessible by all worker nodes.

Indexing & Search: We ensure quick lookup of problems by ID (primary key access) – this is straightforward. If we need to list problems by difficulty or tag, we can design queries or secondary indexes for that. For submissions, indexing by user enables showing “my submissions” quickly. Index by problem+status could help generate stats (like how many users solved it), though that’s optional.

Caching Layer: We use a fast in-memory cache for hot data. For example, when a problem is requested, cache it so subsequent requests (perhaps by other users opening the same problem) hit the cache. Similarly, cache the list of problems for the homepage. The cache can also store recently used test cases to avoid reading from disk repeatedly.

Step 6. Detailed Component Design

a. Submission Queue (Message Queue)

A critical component for decoupling is the queue that sits between the Submission service and the Code Execution workers. When a user submits code, the request is turned into a job message that is placed onto a queue (for example, RabbitMQ, Kafka, AWS SQS, etc.). The job message includes all information needed to run the code: submission ID, code, language, problem ID (to fetch test cases), and maybe the input test files or a reference to them.

Why a queue? Because execution can take a couple of seconds, and we don’t want the web server to be tied up waiting. The queue also buffers load – if there’s a spike of submissions, they queue up and workers consume as fast as possible. The queue ensures reliable delivery (e.g., if a worker crashes after taking a job, the job can be retried by another worker). For scale, we can have multiple queue instances or topics:

Detailed Design
Detailed Design

b. Code Execution Engine Design

The Code Execution Engine is the heart of the judge system – it safely runs user code and produces results. Key aspects of its design include the execution sandbox, handling multiple languages, enforcing limits, and scaling out workers.

Sandbox Environment: We use containerization to sandbox code. Each submission is executed inside a container that isolates it from other processes and the host system. Containers are much lighter weight than full virtual machines, allowing fast startup and teardown while still providing a secure isolation (shared OS kernel but process-level isolation). We will prepare container images for each supported language runtime. For example:

The sandbox will be configured with:

Job Processing: Each worker machine runs a service that constantly polls the job queue. When it receives a submission job, it processes it as follows:

  1. Fetch Data: It may need to fetch the problem’s test cases and expected outputs. We can optimize this by caching frequently used test cases locally. Each worker can have a cache (in-memory or on disk) keyed by ProblemID. If not present, it will query the Problem DB or a distributed cache to get the test cases. Since the set of problems is finite and not extremely large, workers could even pre-fetch all test cases for all problems in memory (a few GB at most) if memory allows, to make execution faster.
  2. Spin up Container: Use Docker (or another container runtime) to run the code. There are two approaches:
    • One-container-per-job: Launch a new container instance for this submission, with the appropriate language image, passing the code to it. This could involve writing the user code to a file in a shared volume or sending it as stdin to the container’s entrypoint. The container then compiles/runs the code.
    • Persistent Containers: Alternatively, keep a pool of warm containers or sandboxes. For example, each worker might have one container per language always running, and for each submission it injects the code into that container to execute. However, managing state reset between runs is tricky (you’d need to clean any file changes, memory, processes). It’s often simpler to start a fresh container for each submission to ensure a clean state, at the cost of a small startup latency (~100-200ms typically for a container launch with a pre-pulled image).
    • We can mitigate container startup overhead by reusing the image (which stays cached on the host) and possibly using fast storage. Also, since many code runs are short (<< 3s), the container startup becomes a significant portion; hence we might experiment with keeping containers alive. But for security and simplicity, assume one container per submission that ends when execution is done.
  3. Compilation: Inside the container, the first step is compiling the code if the language is compiled (C++, Java, etc.). The container’s script will attempt to compile and capture any compiler errors. If compilation fails, we skip running test cases. The compiler error output is truncated if huge and returned to the user.
    • We should set a compilation time limit as well (perhaps part of the overall 3s, or maybe separate like 3s for compile and 3s for run). In practice, most solutions compile quickly; if someone submits a ridiculously large code that compiles slowly, we might treat that as exceeding time.
    • The compilation occurs inside the sandbox too, so even malicious code trying to abuse the compiler is contained.
  4. Execution of Tests: We have a list of test inputs (and expected outputs). The container will run the user program on each test. This could be done by repeatedly invoking the program for each input file (which resets state each time) or by feeding all tests sequentially to a single run of the program. Most competitive judges run each test separately to isolate them and stop on the first failure (also to measure per-test time). We will:
    • For each test case: feed input to the program (for example, if the program reads from standard input, we pipe the input in; if the problem requires reading from file, the container script might place the input in a known file path).
    • Read the program’s output and compare with expected output. Use an exact match or allow minor differences based on problem spec (e.g. whitespace tolerance if specified).
    • If output doesn’t match, mark the result as “Wrong Answer” and record which test failed. Abort further testing to save time.
    • If the program hasn’t terminated by the time limit, force-kill it (Time Limit Exceeded).
    • If it crashes (segfault, runtime exception), capture that (Runtime Error).
    • If it passes the test, move to the next.
    • We count the execution time and memory usage. If multiple tests, we may take the maximum time or sum, depending on how we define runtime to user (usually either total or worst-case time).
    • If all tests pass, result = Accepted.
  5. Result Handling: The worker collects the outcome. For an accepted solution, we might store the total runtime and memory used. For a failure, we store the failure type and perhaps the message (like “Wrong answer on test 7” or the stderr output for a runtime error or compile error).
  6. Update Data Stores: The worker then updates the system with the results:
    • Update the Submissions DB: Mark that the submission is now completed, with fields like status=“Done”, verdict (“AC”/“WA”/“TLE”/etc.), runtime, etc. This makes the result durable and accessible.
    • Remove or acknowledge the message on the queue (so it’s not retried). If using a system like RabbitMQ, the message is consumed; if using Kafka, we commit the offset.
    • The worker can now pick up the next job from the queue and repeat.

Concurrency and Scaling in Execution Engine: We will run many workers in parallel. Each worker can also be multi-threaded or handle multiple jobs if the machine has multiple cores:

Handling Failures in Execution: If something goes wrong during execution (e.g., the container runtime hangs or the worker itself crashes), we need resilience:

Language Support: Adding a new language means:

Optimizations for Speed: To meet the 3-second end-to-end target consistently:

c. Leaderboard and Ranking Mechanism

The contest leaderboard has its own logic:

In summary, the leaderboard mechanism uses real-time updates to an in-memory cache for speed, combined with reliable storage in a database for permanent record. This ensures efficient retrieval even under heavy load of many users refreshing the standings.

d. Trade-offs and Alternatives

In architecture decisions, we made certain trade-offs:

In conclusion, the design uses a combination of distributed systems techniques: asynchronous processing via queues, horizontal scaling of stateless services, in-memory caching for hot data, and containerization for isolation and scalability of execution. This architecture can efficiently handle high concurrency and provide fast feedback to users, while maintaining security and reliability. By addressing each requirement with appropriate technology and considering future growth (horizontal scaling, sharding) and failures (redundancy, fault tolerance), the system will be robust and performant for a large user base similar to LeetCode.

🤖 Don't fully get this? Learn it with Claude

Stuck on Designing Code Judging System? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.

🪜 Hint ladder (no spoilers)

Progressively stronger hints — you still solve it.

I'm working on the problem **Designing Code Judging System** (System Design). Give me a HINT LADDER: start with the tiniest nudge, then wait. Only reveal the next, stronger hint when I ask. Do NOT show the full solution unless I type 'show solution'. Keep me doing the thinking. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🎨 Explain the approach visually

See the technique, not just code.

Explain the optimal approach to **Designing Code Judging System** with a VISUAL walkthrough: trace it on a small concrete example using ASCII art / a step-by-step diagram, narrate what changes each step, then give time & space complexity with a one-line derivation. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🔍 Review my solution

Catch bugs, edge cases, sub-optimality.

I'll paste my solution to **Designing Code Judging System**. Review it for correctness, missed edge cases, and time/space complexity, then coach me toward the optimal — don't just rewrite it. Ask me to paste my code now. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🔁 Drill the pattern

Lock in recognition with look-alikes.

Give me 2 problems that use the SAME underlying pattern as **Designing Code Judging System**. For each, let me attempt first, then review my answer and name the trigger signal that reveals the pattern. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

📝 My notes