This is homework III of the NCKU course WEB RESOURCE DISCOVERY AND EXPLOITATION. The target is to create a crawler application that crawls millions of webpages.
- Crawl millions of webpages
- Remove non-HTML pages (a filtering sketch follows this list)
- Optimize performance (throughput settings are sketched after the robots.txt step)
- How many pages can be crawled per hour
- Total time to crawl millions of pages
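The repository's exact approach to removing non-HTML pages is not shown here, so the following is a minimal sketch, assuming a Scrapy downloader middleware that drops any response whose Content-Type is not `text/html`. The class name `NonHtmlFilterMiddleware` and its registration are illustrative, not taken from the project.

```python
# middlewares.py -- a minimal sketch (assumed, not from the repo): drop any
# response whose Content-Type is not HTML before it reaches the spider.
from scrapy.exceptions import IgnoreRequest


class NonHtmlFilterMiddleware:
    def process_response(self, request, response, spider):
        content_type = response.headers.get("Content-Type", b"").decode("utf-8", "ignore")
        if "text/html" not in content_type:
            spider.logger.debug("Dropping non-HTML response: %s (%s)", response.url, content_type)
            raise IgnoreRequest(f"non-HTML response: {response.url}")
        return response


# settings.py -- enable the middleware (the priority value here is arbitrary)
# DOWNLOADER_MIDDLEWARES = {
#     "millions_crawler.middlewares.NonHtmlFilterMiddleware": 544,
# }
```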
Spider for 台灣 E 院 (tweh)
Spider for 問 8 健康諮詢 (w8h)
Spider for Wikipedia (wiki)
- Skip robots.txt
# edit settings.py
ROBOTSTXT_OBEY = False
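Related to the "optimize performance" goal above: throughput is mostly tuned in the same settings.py. The values below are a hedged sketch of commonly adjusted Scrapy settings, not necessarily the ones this project used.

```python
# settings.py -- illustrative throughput settings; tune per target site.
CONCURRENT_REQUESTS = 64             # parallel downloads overall (default 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-site concurrency cap (default 8)
DOWNLOAD_DELAY = 0                   # no extra politeness delay
DOWNLOAD_TIMEOUT = 30                # give up on slow responses (default 180)
RETRY_TIMES = 1                      # fewer retries per failed request (default 2)
LOG_LEVEL = "INFO"                   # avoid per-request DEBUG logging overhead
```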
- Use a random User-Agent for every request
pip install fake-useragent
# edit middlewares.py
from fake_useragent import UserAgent
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class FakeUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Attach a freshly chosen random User-Agent to each outgoing request.
        ua = UserAgent()
        request.headers['User-Agent'] = ua.random

# edit settings.py
DOWNLOADER_MIDDLEWARES = {
    "millions_crawler.middlewares.FakeUserAgentMiddleware": 543,
}
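To confirm the rotation actually takes effect, one option (not part of the original project; the httpbin.org echo endpoint is just a convenient assumption) is a throwaway spider that fetches a header-echo page a few times and logs what the server saw:

```python
# spiders/ua_check.py -- a hypothetical sanity-check spider, not in the repo.
import scrapy


class UACheckSpider(scrapy.Spider):
    name = "ua_check"

    def start_requests(self):
        # dont_filter lets the same URL be fetched repeatedly, so each request
        # passes through FakeUserAgentMiddleware and gets a fresh User-Agent.
        for _ in range(3):
            yield scrapy.Request("https://httpbin.org/headers", dont_filter=True)

    def parse(self, response):
        # httpbin echoes the request headers back as JSON; the User-Agent value
        # should differ between the three responses.
        self.logger.info(response.text)
```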
| Spider | Total Pages | Total Time (hrs) | Pages per Hour |
|---|---|---|---|
| tweh | 152,958 | 1.3 | 117,409 |
| w8h | 4,759 | 0.1 | 32,203 |
| wiki* | 13,000,320 | 43 | 30,240 |

| Spider | Total Pages | Total Time (hrs) | Pages per Hour |
|---|---|---|---|
| tweh | 153,288 | 0.52 | - |
| w8h | 4,921 | 0.16 | - |
| wiki* | 4,731,249 | 43.2 | 109,492 |
- Create a .env file (an illustrative sketch of its contents appears after these steps)
bash create_env.sh
- Install Redis
sudo apt-get install redis-server
- Install MongoDB
sudo apt-get install mongodb
- Run Redis
redis-server
- Run MongoDB
sudo service mongod start
- Install the Python dependencies
pip install -r requirements.txt
- Run a spider
cd millions-crawler
scrapy crawl [$spider_name] # $spider_name = tweh, w8h, wiki
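For reference, a .env along the lines of what the create_env.sh step produces might look like the sketch below. The variable names (MONGO_URI, REDIS_URL) are assumptions for illustration, not taken from the repository; defer to whatever the script actually writes.

```
# .env -- illustrative sketch only; create_env.sh generates the real file,
# and these variable names are assumed, not from the repo.
MONGO_URI=mongodb://localhost:27017
REDIS_URL=redis://localhost:6379
```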