CreatureKing edited this page Mar 9, 2015 · 4 revisions

---

If you're interested in participating in GSoC 2015 as a student, join the portia-crawler mailing list and post your questions and ideas there. All Portia development happens in the Portia repository on GitHub.

| Field | Details |
| --- | --- |
| Brief explanation | With Portia, a user currently has to browse a website inside the Portia web app and scrape from there. This project aims to let users define new spiders and templates while browsing the web normally, without having to open the website within Portia. With a browser add-on, a user could launch the Portia toolboxes at the click of a button and start scraping straight away. |
| Required skills | JavaScript |
| Difficulty level | Intermediate |
| Mentor(s) | Joaquin Sargiotto, Ruairi Fahy |


| Field | Details |
| --- | --- |
| Brief explanation | One problem with traditional scraping based on XPath and CSS selectors is that spiders may stop working when a website changes its layout. This project aims to use crawl datasets to build new Portia spiders from website content and extracted data, repair spiders when the website layout has changed, and then merge the templates used by the spiders into a small, manageable number. |
| Required skills | Python |
| Difficulty level | Advanced |
| Mentor(s) | Ruairi Fahy, Shane Evans |
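
To make the selector brittleness behind the second project concrete, here is a minimal sketch using only the Python standard library. The markup, the selectors, and the `extract_price` helper are all hypothetical and are not taken from Portia; the pages are written as well-formed XML so that `xml.etree.ElementTree` can parse them.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page before and after a redesign (assumed markup).
OLD_LAYOUT = (
    "<html><body><div id='product'>"
    "<span class='price'>9.99</span>"
    "</div></body></html>"
)
NEW_LAYOUT = (
    "<html><body><section id='product'>"
    "<p>Price: <b class='price'>9.99</b></p>"
    "</section></body></html>"
)

# A position-dependent selector of the kind a hand-written spider might use.
STRICT_XPATH = "./body/div[@id='product']/span[@class='price']"
# A looser selector that only relies on the class attribute.
LOOSE_XPATH = ".//*[@class='price']"


def extract_price(markup, xpath):
    """Return the matched element's text, or None if nothing matches."""
    root = ET.fromstring(markup)
    node = root.find(xpath)
    return node.text if node is not None else None


old_price = extract_price(OLD_LAYOUT, STRICT_XPATH)    # selector matches
new_price = extract_price(NEW_LAYOUT, STRICT_XPATH)    # selector no longer matches
repaired_price = extract_price(NEW_LAYOUT, LOOSE_XPATH)

print(old_price, new_price, repaired_price)
```

The strict selector finds the price in the old layout but returns nothing once the wrapping elements change, while the looser one still matches. Repairing spiders automatically, as the project proposes, would presumably mean detecting this kind of failure and deriving a working extraction rule from previously extracted data.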