Submission for HackDataKIBots 2018 - Web crawler combined with document analysis

manuel-lang/Autonomous-Semantic-Search-Engine

Autonomous Semantic Search Engine

A search engine that autonomously crawls documents from a given domain and its subdomains, analyzes them, and presents the results in a search frontend. This implementation demonstrates the functionality using Stanford University's website. The project was implemented during the Next Iteration Hackathon 2018.

Web crawler

Python implementation with Scrapy here.
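
As an illustration of the "given domain and its subdomains" rule, here is a minimal standard-library sketch of a scope check a crawler could apply to each discovered link. The function name `in_scope` and the example URLs are assumptions made for this sketch and are not taken from the project's crawler. Scrapy's `allowed_domains` spider attribute provides comparable filtering.

```python
from urllib.parse import urlparse


def in_scope(url: str, root_domain: str) -> bool:
    """Return True if `url` is hosted on `root_domain` or one of its subdomains.

    Hypothetical helper for illustration; not part of the project's code.
    """
    # hostname is None for URLs without a network location (e.g. mailto:)
    host = (urlparse(url).hostname or "").lower()
    root = root_domain.lower().lstrip(".")
    # Exact match, or a suffix match on a label boundary (".stanford.edu"),
    # so that "notstanford.edu" and "stanford.edu.evil.com" are rejected.
    return host == root or host.endswith("." + root)


if __name__ == "__main__":
    for link in (
        "https://www.stanford.edu/",
        "https://cs.stanford.edu/people",
        "https://stanford.edu.evil.com/",
    ):
        print(link, in_scope(link, "stanford.edu"))
```

Matching on a label boundary rather than a plain substring keeps look-alike hosts out of the crawl.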

Document analysis

Python implementation using Watson NLU (for named entities and keywords), gensim (for summarization and semantic representation), and a custom document type classifier (Random Forest, built with sklearn). The title, a thumbnail, and embedded images are also extracted from each document. See notebooks for specific implementations.
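
To make the output of the analysis step concrete, the sketch below models one analyzed document as a record that could be handed to the frontend as JSON. The class name `AnalyzedDocument` and all field names are assumptions chosen for illustration. The actual schema used in the notebooks may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class AnalyzedDocument:
    """Hypothetical per-document record; field names are illustrative only."""

    url: str
    title: str
    doc_type: str                      # label from the document type classifier
    summary: str                       # text produced by the summarization step
    keywords: List[str] = field(default_factory=list)   # e.g. from Watson NLU
    entities: List[str] = field(default_factory=list)   # named entities
    thumbnail: Optional[str] = None    # path or URL of the thumbnail, if any
    images: List[str] = field(default_factory=list)     # embedded image URLs

    def to_json(self) -> str:
        """Serialize the record so a frontend can consume it."""
        return json.dumps(asdict(self), ensure_ascii=False)


if __name__ == "__main__":
    doc = AnalyzedDocument(
        url="https://example.org/report.pdf",  # placeholder URL
        title="Example report",
        doc_type="report",
        summary="A short placeholder summary.",
        keywords=["example"],
    )
    print(doc.to_json())
```

Keeping every analysis result in one flat record means the search frontend only needs a single lookup per hit.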

Web frontend

A React frontend that displays the analyzed information, enriched with additional images from Bing Image Search, here.

brew Dependencies