CodeTab Scoopi Guide
Scoopi is a tool to extract and transform data from web pages.
Libraries such as JSoup and HtmlUnit make it quite easy to scrape web pages in Java, but while they work well for a limited set of pages, things get pretty complicated when you start to scrape thousands of them. Scoopi is built on JSoup and HtmlUnit, and the functionality it offers includes:
- Scoopi is fully definition driven. Data structure, task workflow and pages to scrape are defined with a set of YML definition files, and no coding skill is required (see the definition sketch after this list)
- It can be configured to use either JSoup or HtmlUnit as the scraper
- Queries can be written using either Selectors (JSoup) or XPath (HtmlUnit)
- Scoopi is a multithreaded application that processes pages in parallel for maximum throughput; even on a low-end system with a Core 2 Duo processor, it can load, parse and transform around 1000 pages in under two minutes
- Scoopi ships as a Docker image so that it can run without any cumbersome installation
- Scoopi persists pages and data to the file system so that it can recover from a failed state without repeating tasks already completed
- Can transform, filter and sort the data before output
- Ships with built-in appenders such as FileAppender, DBAppender and ListAppender.
- ScoopiEngine can be embedded in other programs, and scraped data can be accessed through a ListAppender (see the embedding sketch after this list)
- Flexible workflow allows one to change the sequence of steps
- Scoopi is extensible. Developers can extend the predefined base steps or even create new ones with different functionality and weave them into the workflow
- Scoopi Cluster
  - In cluster mode, Scoopi scales horizontally by distributing tasks across multiple nodes
  - Designed to run in various environments: in a bare JVM, in Docker containers, or on high-end container orchestration platforms such as Kubernetes
  - For clustering, Scoopi Cluster uses Hazelcast IMDG, a fault-tolerant distributed in-memory computing platform
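
To give a flavour of the definition-driven approach, here is a minimal sketch of what a YML definition might look like. The structure and key names below (locatorGroups, url, dataDefs, query, selector) are assumptions for illustration only, not Scoopi's exact schema; see the Guide for the real definition format.

```yaml
# Illustrative sketch only - the keys below are assumed, not Scoopi's actual schema.
locatorGroups:
  quotes:
    locators:
      - name: acme
        url: "https://example.com/quotes/acme"   # page to scrape

dataDefs:
  price:
    query:
      # With JSoup the queries are CSS selectors; with HtmlUnit they are XPath.
      block: "div.quote"
      selector: "span.price"
```

The same definition drives the whole run: Scoopi loads each locator URL, applies the data definition's queries to extract fields, and hands the resulting data to the configured appenders.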
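
For embedded use, the sketch below shows how a host program might drive ScoopiEngine and then read the scraped data from a ListAppender. Apart from the class names ScoopiEngine and ListAppender, every method name here (run, getData) is a hypothetical placeholder rather than the actual API; consult the Guide for the real embedding interface.

```java
import java.util.List;

// Illustrative sketch of embedding Scoopi in another program. Method names
// are hypothetical placeholders, not Scoopi's actual embedding API.
public class EmbeddedScoopi {
    public static void main(String[] args) {
        // hypothetical: create the engine pointing at the YML definitions
        ScoopiEngine engine = new ScoopiEngine("defs/quotes");

        // run the scrape tasks to completion
        engine.run();

        // hypothetical: a ListAppender configured in the definitions collects
        // output rows in memory instead of writing them to a file or database
        List<String> rows = engine.getData("listAppender");
        rows.forEach(System.out::println); // consume scraped data in the host app
    }
}
```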
To install and run Scoopi, refer to the Quickstart and Guide.