The web crawler is responsible for collecting documents from the web. Key features include:
- Avoiding Re-visits: Ensuring the crawler does not visit the same page more than once.
- URL Normalization: Checking if different URLs refer to the same page.
- Document Type Handling: Limiting crawling to specific document types (HTML for this project).
- State Maintenance: Allowing the crawler to resume from where it left off after interruptions.
- Robots.txt Compliance: Respecting rules set by web administrators to exclude certain pages.
- Multithreading: Supporting a user-defined number of threads with proper synchronization (see the sketch after this list).
- Seed Management: Careful selection and management of seed URLs.
- Crawl Limit: Capable of crawling up to 6000 pages.
- Visit Order: Utilizing appropriate data structures to determine the order of page visits.
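To make the thread coordination concrete, here is a minimal sketch of a shared crawl frontier that combines the visited set, FIFO visit order, and the 6000-page limit. It is an illustrative assumption, not the project's actual code; the class and method names (CrawlerFrontier, offer, nextUrl) are hypothetical, and the URL normalization shown is deliberately simplified.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a thread-safe crawl frontier shared by worker threads.
public class CrawlerFrontier {
    private static final int MAX_PAGES = 6000;

    private final Queue<String> frontier = new ConcurrentLinkedQueue<>(); // FIFO => breadth-first visit order
    private final Set<String> visited = ConcurrentHashMap.newKeySet();    // avoids re-visits
    private final AtomicInteger crawled = new AtomicInteger(0);

    // Simplified normalization so different spellings of the same URL collide;
    // a real crawler would also strip fragments, default ports, etc.
    private String normalize(String url) {
        String u = url.trim().toLowerCase();
        return u.endsWith("/") ? u.substring(0, u.length() - 1) : u;
    }

    // Next URL to crawl, or null when the page limit is reached or the frontier is empty.
    public String nextUrl() {
        return crawled.get() >= MAX_PAGES ? null : frontier.poll();
    }

    // Enqueue a discovered URL only if it has never been seen before.
    public void offer(String url) {
        String u = normalize(url);
        if (visited.add(u)) {   // Set.add is atomic: false means already present
            frontier.offer(u);
        }
    }

    // Called by a worker thread after successfully downloading a page.
    public void markCrawled() {
        crawled.incrementAndGet();
    }
}
```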
The indexer processes the downloaded documents to facilitate fast and efficient querying. Features include:
- Persistence: Maintaining the index in secondary storage (file structure or database).
- Fast Retrieval: Optimized for quick response to queries for specific words or sets of words.
- Incremental Updates: Capability to update the index with new documents without rebuilding from scratch (see the sketch after this list).
- Design Consideration: Ensuring compatibility with the ranker and search modules.
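As a rough illustration of a structure that could satisfy these requirements, here is a minimal in-memory inverted index; a persistent version would serialize the postings map to files or a database table keyed by term. All names below are assumptions for the sketch, not the project's actual schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an inverted index: term -> list of (url, frequency).
public class InvertedIndex {
    public record Posting(String url, int termFrequency) {}

    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Incremental update: indexing a new document only appends postings;
    // nothing already in the index is rebuilt.
    public void addDocument(String url, List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : terms) counts.merge(t, 1, Integer::sum);
        counts.forEach((term, tf) ->
            postings.computeIfAbsent(term, k -> new ArrayList<>())
                    .add(new Posting(url, tf)));
    }

    // Fast retrieval: answering a single-word query is one hash lookup.
    public List<Posting> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }
}
```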
The query processor handles user search queries with the following features:
- Preprocessing: Preparing search queries for efficient processing.
- Stemming: Matching words with the same root (e.g., "travel" matches "traveler", "traveling").
- Phrase Searching: Supporting phrase searches with quotation marks, ensuring precise order matching (see the sketch after this list).
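A minimal sketch of this preprocessing, assuming a toy suffix-stripping stemmer as a stand-in for a real Porter/Snowball stemmer; the class name and the suffix list are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of query preprocessing: phrase detection plus stemming.
public class QueryPreprocessor {
    // Toy stemmer: strips a few common English suffixes so that
    // "traveler" and "traveling" both reduce to "travel".
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "er", "ed", "s"}) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // A quoted query is kept as one exact phrase (order preserved, no stemming);
    // anything else is tokenized and stemmed term by term.
    static List<String> preprocess(String query) {
        String q = query.trim();
        if (q.length() > 1 && q.startsWith("\"") && q.endsWith("\"")) {
            return List.of(q.substring(1, q.length() - 1)); // single phrase token
        }
        List<String> terms = new ArrayList<>();
        for (String token : q.split("\\s+")) terms.add(stem(token));
        return terms;
    }
}
```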
The ranker sorts search results based on relevance and popularity:
- Relevance: Calculated using methods like tf-idf, considering word occurrence in titles, headers, and body text (a sketch follows this list).
- Popularity: Measured independently of the query, using algorithms like PageRank (detailed below).
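For the relevance side, here is a minimal tf-idf sketch with position-based boosts; the specific weights (3x for titles, 2x for headers) are assumptions for illustration, not the project's values.

```java
// Hypothetical sketch of the relevance score: tf-idf where a term's
// frequency is boosted by where it occurs in the document.
public class RelevanceSketch {
    // Assumed position weights: a title hit counts more than a body hit.
    public static double weightedTf(int titleHits, int headerHits, int bodyHits) {
        return 3.0 * titleHits + 2.0 * headerHits + 1.0 * bodyHits;
    }

    // tf-idf: weighted term frequency times inverse document frequency,
    // where df = documents containing the term and N = total documents.
    public static double tfIdf(double weightedTf, int df, int N) {
        return df == 0 ? 0.0 : weightedTf * Math.log((double) N / df);
    }
}
```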
Details of the PageRank computation:

PR(i) = (1 - d) + d * Σ (PR(j) / Outlinks(j)), summed over all pages j that link to i

where d is the damping factor, approximately 0.85 (so the teleport probability 1 - d is about 0.15).

Another view of this equation is in matrix form. First, initialize M:

M = d * A + (1 - d) * B

where:
- d is the damping factor,
- S is the number of URLs on the web,
- B is an S x S matrix filled with the float value 1/S,
- A is an S x S transition matrix that indicates the relations between every URL and the other URLs outgoing from it.

Final equation 💡

X = M.T * X

The number of iterations is determined by the degree of precision required. Precision criterion: ⚡

| norm(X after the multiplication) - norm(X before the multiplication) | < Precision Factor
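A compact sketch of this power iteration, using d, S, A, and B exactly as defined above; representing A as a dense array and measuring convergence with the Euclidean norm are assumptions made for the sketch.

```java
// Hypothetical sketch of the power iteration X = M.T * X described above.
public class PageRankSketch {
    // A[i][j] = 1.0 / Outlinks(i) if page i links to page j, else 0.
    public static double[] pageRank(double[][] A, double d, double precision) {
        int S = A.length;
        double[] X = new double[S];
        java.util.Arrays.fill(X, 1.0 / S);              // uniform initial rank vector

        while (true) {
            double[] next = new double[S];
            for (int i = 0; i < S; i++) {               // next = M.T * X, with M = d*A + (1-d)*B
                for (int j = 0; j < S; j++) {
                    double m = d * A[j][i] + (1 - d) / S;   // B entries are all 1/S
                    next[i] += m * X[j];
                }
            }
            // Stop once |norm(X_new) - norm(X_old)| < precision, per the criterion above.
            if (Math.abs(norm(next) - norm(X)) < precision) return next;
            X = next;
        }
    }

    private static double norm(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }
}
```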
The web interface provides user interaction with the search engine:
- Query Handling: Receives and processes user queries.
- Result Display: Shows search results with snippets highlighting query words.
- Pagination: Handles large result sets by dividing them into pages.
Technologies:
- Java
- SpringBoot
- ReactJS

Examples:
- "http://localhost:8090/ranker/rank" -- for applying PageRank algorithm
True or False "if there is an error"
- "http://localhost:8090/ranker/search"
{ "query": "al-Khwarizmi" }
Ranked URLs with appropriate information.
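A hypothetical Spring Boot controller matching these endpoints; only the paths and the request body come from this README, while the HTTP methods, DTOs, and service wiring are assumptions.

```java
import java.util.List;
import org.springframework.web.bind.annotation.*;

// Hypothetical sketch of the ranker endpoints described above.
@RestController
@RequestMapping("/ranker")
public class RankerController {

    public record SearchRequest(String query) {}
    public record RankedResult(String url, String title, String snippet, double score) {}

    // /ranker/rank -- runs the PageRank algorithm; the boolean reports whether an error occurred.
    @PostMapping("/rank")
    public boolean rank() {
        try {
            // pageRankService.recompute();  // assumed service call
            return false;                    // no error
        } catch (Exception e) {
            return true;                     // error occurred
        }
    }

    // /ranker/search -- body: { "query": "al-Khwarizmi" }; returns ranked URLs.
    @PostMapping("/search")
    public List<RankedResult> search(@RequestBody SearchRequest request) {
        // return searchService.search(request.query());  // assumed service call
        return List.of();                                  // placeholder
    }
}
```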
MIT © ahmed-kamal2004