Knowledge Guide
HomeConcurrencyConcurrency Problems

Problem 16 Scenario Multithreaded Web Crawler

Overview

A web crawler is a tool designed to browse the internet methodically and systematically to gather information from web pages. When deploying a multithreaded web crawler, one can expect significant performance improvements due to the concurrent fetching of multiple web pages. However, concurrency introduces challenges, especially when multiple threads try to access shared resources like a queue containing URLs. The primary concern is avoiding duplicate work, where two or more threads might end up processing the same URL because of simultaneous access.

Step-by-step Algorithm

Algorithm Walkthrough

Let's walk through a dry run of the web crawler example you provided, paying special attention to the synchronization aspects.

Initialization

Execution

The execution for each crawler agent can be divided into several steps:

  1. Get URL from Queue

    • Acquire queueLock to get exclusive access to toVisitList.
    • If toVisitList is not empty, pop a URL from the queue and release queueLock.
    • If toVisitList is empty, release queueLock and exit the loop (terminate the thread).
  2. Check if URL is Visited

    • Acquire visitedLock to check visitedRecord.
    • If URL is in visitedRecord, release visitedLock and go back to step 1.
    • If URL is not in visitedRecord, insert it and release visitedLock.
  3. Process Page: Print "Processing: URL" and sleep for 1 second to simulate processing time.

  4. Add New Links to Queue

    • Simulate fetching links from the page using getLinksFromPage.
    • Acquire queueLock.
    • For each link fetched:
      • Check if it is not in visitedRecord.
      • If it’s not, add it to toVisitList.
    • Release queueLock.
  5. Repeat: Go back to step 1.

Synchronization Aspects

Final Result

Through the use of mutex locks (queueLock and visitedLock), the program ensures that the shared resources (toVisitList and visitedRecord) are accessed in a thread-safe manner, preventing race conditions and preserving data integrity. The program simulates a web crawler where multiple agents are concurrently processing URLs, fetching new links, and ensuring that each URL is only processed once.

Flow Chart
Flow Chart
java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Solution {

  private static final int NUM_AGENTS = 5; // Number of crawler agents
  private final Queue<String> toVisitList = new LinkedList<>(); // URLs waiting to be visited
  private final Set<String> visitedRecord = new HashSet<>(); // URLs that have been visited

  // Locks to protect shared resources
  private final Lock queueLock = new ReentrantLock();
  private final Lock visitedLock = new ReentrantLock();

  public static void main(String[] args) {
    Solution crawler = new Solution();
    crawler.startCrawling("http://example.com");
  }

  public void startCrawling(String initialUrl) {
    // Seed the toVisitList with an initial URL
    toVisitList.add(initialUrl);

    // Create and start crawler agents
    Thread[] agents = new Thread[NUM_AGENTS];
    for (int i = 0; i < NUM_AGENTS; i++) {
      agents[i] = new Thread(this::crawlerAgent);
      agents[i].start();
    }

    // Wait for all agents to finish
    for (Thread agent : agents) {
      try {
        agent.join();
      } catch (InterruptedException e) {
        e.printStackTrace();
      }
    }

    System.out.println("Crawling finished.");
  }

  public void crawlerAgent() {
    while (true) {
      String url;

      // Try to get a URL from the queue
      queueLock.lock();
      try {
        if (toVisitList.isEmpty()) {
          break;
        }
        url = toVisitList.poll();
      } finally {
        queueLock.unlock();
      }

      // Check if URL has been visited
      visitedLock.lock();
      try {
        if (visitedRecord.contains(url)) {
          continue;
        }
        visitedRecord.add(url);
      } finally {
        visitedLock.unlock();
      }

      // Simulate processing the page
      System.out.println("Processing: " + url);
      try {
        Thread.sleep(1000); // Simulate a delay for processing the page
      } catch (InterruptedException e) {
        e.printStackTrace();
      }

      // Simulated function to get links from a web page.
      String[] links = getLinksFromPage(url);

      // Add new links to the queue
      queueLock.lock();
      try {
        for (String link : links) {
          if (!visitedRecord.contains(link)) {
            toVisitList.add(link);
          }
        }
      } finally {
        queueLock.unlock();
      }
    }
  }

  public String[] getLinksFromPage(String url) {
    // For the sake of this example, return some dummy links.
    return new String[] { "link1", "link2", "link3" };
  }
}
🤖 Don't fully get this? Learn it with Claude

Stuck on Problem 16 Scenario Multithreaded Web Crawler? Open Claude, copy a block below, and it'll teach you this exact concept — visually and interactively.

🪜 Hint ladder (no spoilers)

Progressively stronger hints — you still solve it.

I'm working on the problem **Problem 16 Scenario Multithreaded Web Crawler** (Concurrency). Give me a HINT LADDER: start with the tiniest nudge, then wait. Only reveal the next, stronger hint when I ask. Do NOT show the full solution unless I type 'show solution'. Keep me doing the thinking. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🎨 Explain the approach visually

See the technique, not just code.

Explain the optimal approach to **Problem 16 Scenario Multithreaded Web Crawler** with a VISUAL walkthrough: trace it on a small concrete example using ASCII art / a step-by-step diagram, narrate what changes each step, then give time & space complexity with a one-line derivation. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🔍 Review my solution

Catch bugs, edge cases, sub-optimality.

I'll paste my solution to **Problem 16 Scenario Multithreaded Web Crawler**. Review it for correctness, missed edge cases, and time/space complexity, then coach me toward the optimal — don't just rewrite it. Ask me to paste my code now. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.
🔁 Drill the pattern

Lock in recognition with look-alikes.

Give me 2 problems that use the SAME underlying pattern as **Problem 16 Scenario Multithreaded Web Crawler**. For each, let me attempt first, then review my answer and name the trigger signal that reveals the pattern. If you're unsure or a claim isn't standard, say so and reason from first principles instead of guessing.

📝 My notes