The web crawler is responsible for collecting documents from the web. Key features include:
- Avoiding Re-visits: Ensuring the crawler does not visit the same page more than once.
- URL Normalization: Checking if different URLs refer to the same page.
- Document Type Handling: Limiting crawling to specific document types (HTML for this project).
- State Maintenance: Allowing the crawler to resume from where it left off after interruptions.
- Robots.txt Compliance: Respecting rules set by web administrators to exclude certain pages.
- Multithreading: Supporting a user-defined number of threads with proper synchronization (see the sketch after this list).
- Seed Management: Careful selection and management of seed URLs.
- Crawl Limit: Capable of crawling up to 6000 pages.
- Visit Order: Utilizing appropriate data structures to determine the order of page visits.
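To make the thread coordination concrete, here is a minimal sketch of a shared crawl frontier that combines the visited set, FIFO visit order, and the 6000-page limit. It is an illustrative assumption, not the project's actual code; the class and method names (CrawlerFrontier, offer, nextUrl) are hypothetical, and the URL normalization shown is deliberately simplified.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a thread-safe crawl frontier shared by worker threads.
public class CrawlerFrontier {
    private static final int MAX_PAGES = 6000;

    private final Queue<String> frontier = new ConcurrentLinkedQueue<>(); // FIFO => breadth-first visit order
    private final Set<String> visited = ConcurrentHashMap.newKeySet();    // avoids re-visits
    private final AtomicInteger crawled = new AtomicInteger(0);

    // Simplified normalization so different spellings of the same URL collide;
    // a real crawler would also strip fragments, default ports, etc.
    private String normalize(String url) {
        String u = url.trim().toLowerCase();
        return u.endsWith("/") ? u.substring(0, u.length() - 1) : u;
    }

    // Next URL to crawl, or null when the page limit is reached or the frontier is empty.
    public String nextUrl() {
        return crawled.get() >= MAX_PAGES ? null : frontier.poll();
    }

    // Enqueue a discovered URL only if it has never been seen before.
    public void offer(String url) {
        String u = normalize(url);
        if (visited.add(u)) {   // Set.add is atomic: false means already present
            frontier.offer(u);
        }
    }

    // Called by a worker thread after successfully downloading a page.
    public void markCrawled() {
        crawled.incrementAndGet();
    }
}
```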
The indexer processes the downloaded documents to facilitate fast and efficient querying. Features include:
- Persistence: Maintaining the index in secondary storage (file structure or database).
- Fast Retrieval: Optimized for quick response to queries for specific words or sets of words.
- Incremental Updates: Capability to update the index with new documents without rebuilding from scratch (see the sketch after this list).
- Design Consideration: Ensuring compatibility with the ranker and search modules.
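As a rough illustration of a structure that could satisfy these requirements, here is a minimal in-memory inverted index; a persistent version would serialize the postings map to files or a database table keyed by term. All names below are assumptions for the sketch, not the project's actual schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an inverted index: term -> list of (url, frequency).
public class InvertedIndex {
    public record Posting(String url, int termFrequency) {}

    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Incremental update: indexing a new document only appends postings;
    // nothing already in the index is rebuilt.
    public void addDocument(String url, List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : terms) counts.merge(t, 1, Integer::sum);
        counts.forEach((term, tf) ->
            postings.computeIfAbsent(term, k -> new ArrayList<>())
                    .add(new Posting(url, tf)));
    }

    // Fast retrieval: answering a single-word query is one hash lookup.
    public List<Posting> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }
}
```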
The query processor handles user search queries with the following features:
- Preprocessing: Preparing search queries for efficient processing.
- Stemming: Matching words with the same root (e.g., "travel" matches "traveler", "traveling").
- Phrase Searching: Supporting phrase searches with quotation marks, ensuring precise order matching (see the sketch after this list).
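A minimal sketch of this preprocessing, assuming a toy suffix-stripping stemmer as a stand-in for a real Porter/Snowball stemmer; the class name and the suffix list are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of query preprocessing: phrase detection plus stemming.
public class QueryPreprocessor {
    // Toy stemmer: strips a few common English suffixes so that
    // "traveler" and "traveling" both reduce to "travel".
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "er", "ed", "s"}) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // A quoted query is kept as one exact phrase (order preserved, no stemming);
    // anything else is tokenized and stemmed term by term.
    static List<String> preprocess(String query) {
        String q = query.trim();
        if (q.length() > 1 && q.startsWith("\"") && q.endsWith("\"")) {
            return List.of(q.substring(1, q.length() - 1)); // single phrase token
        }
        List<String> terms = new ArrayList<>();
        for (String token : q.split("\\s+")) terms.add(stem(token));
        return terms;
    }
}
```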
The ranker sorts search results based on relevance and popularity:
- Relevance: Calculated using methods like tf-idf, considering word occurrence in titles, headers, and body text (a sketch follows this list).
- Popularity: Measured independently of the query, using algorithms like PageRank (detailed below).
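For the relevance side, here is a minimal tf-idf sketch with position-based boosts; the specific weights (3x for titles, 2x for headers) are assumptions for illustration, not the project's values.

```java
// Hypothetical sketch of the relevance score: tf-idf where a term's
// frequency is boosted by where it occurs in the document.
public class RelevanceSketch {
    // Assumed position weights: a title hit counts more than a body hit.
    public static double weightedTf(int titleHits, int headerHits, int bodyHits) {
        return 3.0 * titleHits + 2.0 * headerHits + 1.0 * bodyHits;
    }

    // tf-idf: weighted term frequency times inverse document frequency,
    // where df = documents containing the term and N = total documents.
    public static double tfIdf(double weightedTf, int df, int N) {
        return df == 0 ? 0.0 : weightedTf * Math.log((double) N / df);
    }
}
```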
Details of the PageRank computation:

PR(i) = (1 - d) + d * Σ (PR(j) / Outlinks(j)), summed over all pages j that link to i

where d is the damping factor, approximately 0.85 (so the teleport probability 1 - d is about 0.15).

Another view of this equation is in matrix form. First, initialize M:

M = d * A + (1 - d) * B

where:
- d is the damping factor,
- S is the number of URLs on the web,
- B is an S x S matrix filled with the float value 1/S,
- A is an S x S transition matrix that indicates the relations between every URL and the other URLs outgoing from it.

Final equation 💡

X = M.T * X

The number of iterations is determined by the degree of precision required. Precision criterion: ⚡

| norm(X after the multiplication) - norm(X before the multiplication) | < Precision Factor
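A compact sketch of this power iteration, using d, S, A, and B exactly as defined above; representing A as a dense array and measuring convergence with the Euclidean norm are assumptions made for the sketch.

```java
// Hypothetical sketch of the power iteration X = M.T * X described above.
public class PageRankSketch {
    // A[i][j] = 1.0 / Outlinks(i) if page i links to page j, else 0.
    public static double[] pageRank(double[][] A, double d, double precision) {
        int S = A.length;
        double[] X = new double[S];
        java.util.Arrays.fill(X, 1.0 / S);              // uniform initial rank vector

        while (true) {
            double[] next = new double[S];
            for (int i = 0; i < S; i++) {               // next = M.T * X, with M = d*A + (1-d)*B
                for (int j = 0; j < S; j++) {
                    double m = d * A[j][i] + (1 - d) / S;   // B entries are all 1/S
                    next[i] += m * X[j];
                }
            }
            // Stop once |norm(X_new) - norm(X_old)| < precision, per the criterion above.
            if (Math.abs(norm(next) - norm(X)) < precision) return next;
            X = next;
        }
    }

    private static double norm(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }
}
```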
The web interface provides user interaction with the search engine:
- Query Handling: Receives and processes user queries.
- Result Display: Shows search results with snippets highlighting query words.
- Pagination: Handles large result sets by dividing them into pages.
Technologies:
- Java
- SpringBoot
- ReactJS

Examples:
- "http://localhost:8090/ranker/rank" -- for applying PageRank algorithm
True or False "if there is an error"
- "http://localhost:8090/ranker/search"
{ "query": "al-Khwarizmi" }
Ranked URLs with appropriate information.
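A hypothetical Spring Boot controller matching these endpoints; only the paths and the request body come from this README, while the HTTP methods, DTOs, and service wiring are assumptions.

```java
import java.util.List;
import org.springframework.web.bind.annotation.*;

// Hypothetical sketch of the ranker endpoints described above.
@RestController
@RequestMapping("/ranker")
public class RankerController {

    public record SearchRequest(String query) {}
    public record RankedResult(String url, String title, String snippet, double score) {}

    // /ranker/rank -- runs the PageRank algorithm; the boolean reports whether an error occurred.
    @PostMapping("/rank")
    public boolean rank() {
        try {
            // pageRankService.recompute();  // assumed service call
            return false;                    // no error
        } catch (Exception e) {
            return true;                     // error occurred
        }
    }

    // /ranker/search -- body: { "query": "al-Khwarizmi" }; returns ranked URLs.
    @PostMapping("/search")
    public List<RankedResult> search(@RequestBody SearchRequest request) {
        // return searchService.search(request.query());  // assumed service call
        return List.of();                                  // placeholder
    }
}
```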
MIT © ahmed-kamal2004