For the full technical report, please visit: https://docs.google.com/file/d/0B436PtOU57sJZkc5anMyNDZPaHM/edit?usp=sharing
File List
FocusedCrawler.py ● Driver class for this project ● Responsible for creating the configuration and classifier objects and calling the crawler
crawler.py ● Crawler class responsible for collecting and exploring new URLs to find relevant pages ● Takes a priority queue and a scoring object that provides a calculate_score(text) method (see the scorer sketch after this list)
classifier.py ● Parent class of the non-VSM classifiers, including NaiveBayesClassifier and SVMClassifier ● Contains the code for tokenizing and vectorizing document text using sklearn ● Child classes only have to assign self.model (see the subclass sketch after this list)
config.ini ● Configuration file for focused crawler in INI format
fcconfig.py ● Class responsible for reading the configuration file, using ConfigParser ● Adds all configuration options to its internal dictionary (e.g. config["seedFile"]; see the loading sketch after this list)
fcutils.py ● Contains various utility functions relating to reading files and sanitizing/tokenizing text
html_files.txt ● List of local files to act as training/testing set for the classifier (“repository docs”) ● Default name, but can be changed in configuration
labels.txt ● 1-to-1 correspondence with the lines of html_files.txt, assigning a numerical category label to each repository doc ● 1 for relevant, 0 for irrelevant ● Optional
lsiscorer.py ● Subclass of Scorer representing an LSI vector space model
NBClassifier.py ● Subclass of Classifier, representing a Naïve Bayes classifier
priorityQueue.py ● Simple implementation of a priority queue using a heap (see the heapq sketch after this list)
scorer.py ● Parent class of scorers, which are non-classifier models, typically VSM
seeds.txt ● Contains URLs of relevant pages from which the focused crawler starts ● Default name, but can be modified in config.ini
SVMClassifier.py ● Subclass of Classifier, representing an SVM classifier
tfidfscorer.py ● Subclass of Scorer, representing a tf-idf vector space model
webpage.py ● Uses BeautifulSoup and nltk to extract webpage text (see the extraction sketch after this list)
README.txt ● Documentation about Focused Crawler and its usage
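
Illustrative Sketches

config.ini is plain INI and fcconfig.py loads it with ConfigParser, exposing every option through dictionary-style access such as config["seedFile"]. Below is a minimal sketch of that pattern; the section name, the FCConfig class name as written here, and every option name except seedFile are assumptions rather than the project's actual keys.

    # Minimal sketch of loading an INI file into a dict-like config object.
    # Example config.ini contents (the [crawler] section and all option
    # names except seedFile are illustrative assumptions):
    #   [crawler]
    #   seedFile = seeds.txt
    #   repositoryDocs = html_files.txt
    from configparser import ConfigParser

    class FCConfig:
        def __init__(self, ini_path="config.ini"):
            parser = ConfigParser()
            parser.optionxform = str          # preserve case, e.g. "seedFile"
            parser.read(ini_path)
            # Flatten every section into one internal dictionary so callers
            # can simply write config["seedFile"].
            self._options = dict(parser.defaults())
            for section in parser.sections():
                self._options.update(parser.items(section))

        def __getitem__(self, key):
            return self._options[key]

    # Usage: config = FCConfig("config.ini"); seeds_path = config["seedFile"]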
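
classifier.py keeps tokenization and sklearn vectorization in the parent class so that a concrete classifier only has to assign self.model. The sketch below shows that split; apart from the class names and self.model, the method names (train, classify) and the choice of CountVectorizer are assumptions.

    # Sketch of the Classifier / child-class split described above.
    # Only self.model and the class names come from the file list; the
    # method names and vectorizer choice are assumptions.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    class Classifier:
        """Parent class: owns tokenization/vectorization, delegates to self.model."""
        def __init__(self):
            self.vectorizer = CountVectorizer(stop_words="english")
            self.model = None  # children assign a concrete sklearn estimator

        def train(self, documents, labels):
            X = self.vectorizer.fit_transform(documents)
            self.model.fit(X, labels)

        def classify(self, text):
            X = self.vectorizer.transform([text])
            return self.model.predict(X)[0]

    class NaiveBayesClassifier(Classifier):
        def __init__(self):
            super().__init__()
            self.model = MultinomialNB()      # the only line a child must supply

    class SVMClassifier(Classifier):
        def __init__(self):
            super().__init__()
            self.model = LinearSVC()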
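
crawler.py only requires a scoring object with a calculate_score(text) method, which is what scorer.py and its subclasses (e.g. tfidfscorer.py) provide. Here is a sketch of how a tf-idf scorer might satisfy that interface; only the calculate_score(text) name comes from the descriptions above, while the constructor signature and the cosine-similarity scoring rule are assumptions.

    # Hedged sketch of the scorer interface the crawler relies on.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    class Scorer:
        """Parent class for non-classifier (vector space) relevance models."""
        def calculate_score(self, text):
            raise NotImplementedError

    class TfidfScorer(Scorer):
        def __init__(self, relevant_docs):
            # Fit a tf-idf space on the known-relevant repository documents
            # (constructor argument is an assumption).
            self.vectorizer = TfidfVectorizer(stop_words="english")
            self.doc_matrix = self.vectorizer.fit_transform(relevant_docs)

        def calculate_score(self, text):
            # Score a candidate page by its best cosine similarity to any
            # relevant document; higher means more promising to crawl.
            vec = self.vectorizer.transform([text])
            return float(cosine_similarity(vec, self.doc_matrix).max())

With such an object, the crawler can, roughly, pop the most promising URL from the priority queue, extract the page text, and enqueue its out-links weighted by calculate_score.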
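
priorityQueue.py is described as a simple heap-based priority queue; below is a minimal heapq sketch. Treating it as a max-priority queue (highest score popped first, via negated scores) is an assumption about the intended behaviour.

    # Minimal heap-backed priority queue sketch.
    import heapq

    class PriorityQueue:
        def __init__(self):
            self._heap = []

        def push(self, score, url):
            # Negate the score so the highest-scoring URL is popped first
            # (heapq is a min-heap).
            heapq.heappush(self._heap, (-score, url))

        def pop(self):
            neg_score, url = heapq.heappop(self._heap)
            return -neg_score, url

        def __len__(self):
            return len(self._heap)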
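
webpage.py extracts page text with BeautifulSoup and nltk. The sketch below shows one common way to combine the two (dropping script/style tags, then word-tokenizing); the exact cleaning steps used by the project are assumptions.

    # Sketch of BeautifulSoup + nltk text extraction.
    from bs4 import BeautifulSoup
    import nltk

    def extract_text(html):
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()                    # remove non-visible content
        text = soup.get_text(separator=" ")
        tokens = nltk.word_tokenize(text)      # requires the 'punkt' tokenizer data
        return " ".join(tokens)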