Busyboi

Light, fast and scalable web scraper for structured data.

Features

Structured data
Reads scraping jobs from RabbitMQ
Respects robots.txt
Renders JavaScript
Regex matching option
Re-schedules jobs when the URL is unreachable
Allows grouping of elements (parent -> child)
Config based
Concurrency limiter (configurable, default 10)
In-memory cache (configurable, default 15 minutes)
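
The concurrency limiter listed above can be pictured with a buffered channel used as a semaphore. This is only a minimal sketch of the general technique, not the code in worker.go; `runLimited` and the empty stand-in jobs are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// runLimited runs every job, but never more than limit at the same time.
// It returns the highest number of jobs observed running at once.
// Hypothetical helper: the real limiter in busyboi may look different.
func runLimited(jobs []func(), limit int) int {
	sem := make(chan struct{}, limit) // one slot per allowed concurrent job
	var wg sync.WaitGroup
	var mu sync.Mutex
	running, peak := 0, 0

	for _, job := range jobs {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit jobs are already running
		go func(job func()) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when the job ends

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			job()

			mu.Lock()
			running--
			mu.Unlock()
		}(job)
	}
	wg.Wait()
	return peak
}

func main() {
	jobs := make([]func(), 50)
	for i := range jobs {
		jobs[i] = func() {} // stand-in for a crawling job
	}
	fmt.Println("peak concurrency:", runLimited(jobs, 10))
}
```

The reported peak varies between runs, but it cannot exceed the limit because a slot is taken before each goroutine starts and released only after the job has finished.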

How to install?

$ git clone git@github.com:kamilernerd/busyboi.git
$ cd busyboi
$ make

CLI arguments

Usage of ./busyboi:
 -concurreny_limit int
   	Limits how many crawling jobs can run simultaneously (default 10)
 -queue_host string
   	Hostname for rabbitmq (default "localhost")
 -queue_name string
   	Queue name for rabbitmq (default "busyboi")
 -queue_password string
   	Password for rabbitmq (default "guest")
 -queue_port string
   	Port for rabbitmq (default "5672")
 -queue_user string
   	User for rabbitmq (default "guest")
 -cache_ttl int
   	How long cache entries are kept, in seconds (default is 15 minutes)
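
To show how these flags fit together, here is a rough sketch that parses them with Go's standard `flag` package and assembles an AMQP connection URI from the queue settings. The flag names and defaults are copied from the usage text above; the `options` struct, `parseOptions`, `amqpURL`, and the 900-second default (15 minutes expressed in seconds) are assumptions made for illustration, and main.go may handle them differently.

```go
package main

import (
	"flag"
	"fmt"
	"net"
	"net/url"
	"os"
	"time"
)

// options mirrors the documented flags. The struct itself is hypothetical.
type options struct {
	ConcurrencyLimit int
	QueueHost        string
	QueueName        string
	QueuePassword    string
	QueuePort        string
	QueueUser        string
	CacheTTL         time.Duration
}

// parseOptions reads the documented flags from args.
func parseOptions(args []string) (options, error) {
	fs := flag.NewFlagSet("busyboi", flag.ContinueOnError)
	var o options
	var ttlSeconds int
	// The flag is spelled "concurreny_limit" in the usage text, so the sketch keeps it.
	fs.IntVar(&o.ConcurrencyLimit, "concurreny_limit", 10, "max simultaneous crawling jobs")
	fs.StringVar(&o.QueueHost, "queue_host", "localhost", "RabbitMQ hostname")
	fs.StringVar(&o.QueueName, "queue_name", "busyboi", "RabbitMQ queue name")
	fs.StringVar(&o.QueuePassword, "queue_password", "guest", "RabbitMQ password")
	fs.StringVar(&o.QueuePort, "queue_port", "5672", "RabbitMQ port")
	fs.StringVar(&o.QueueUser, "queue_user", "guest", "RabbitMQ user")
	// Assumed default: 15 minutes written as 900 seconds.
	fs.IntVar(&ttlSeconds, "cache_ttl", 900, "cache lifetime in seconds")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	o.CacheTTL = time.Duration(ttlSeconds) * time.Second
	return o, nil
}

// amqpURL builds an AMQP URI of the form amqp://user:password@host:port/.
func amqpURL(o options) string {
	u := url.URL{
		Scheme: "amqp",
		User:   url.UserPassword(o.QueueUser, o.QueuePassword),
		Host:   net.JoinHostPort(o.QueueHost, o.QueuePort),
		Path:   "/",
	}
	return u.String()
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2) // the flag package has already printed the error
	}
	fmt.Println("queue:", o.QueueName, "at", amqpURL(o))
	fmt.Println("limit:", o.ConcurrencyLimit, "cache TTL:", o.CacheTTL)
}
```

Run without arguments, the sketch falls back to the documented defaults, so the queue settings point at a local RabbitMQ instance with the guest account.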

Example config

{
   "collection": "some_random",
   "url": "somerandomurl.com/some/directory/index.html",
   "fields": [
       {
           "name": "some html element",
           "selector": "body > p[class=\"hello_world\"]"
       },
       {
           "name": "some html a element",
           "selector": "body > a[class=\"hello_world\"]",
           "attr": "href"
       },
       {
           "name": "a group of elements within a parent",
           "selector": ".some-parent-element",
           "children": [
               {
                   "name": "a link within a parent element",
                   "selector": "a[class=\"some_class\"]",
                   "attr": "href"
               }
           ]
       },
       {
           "name": "a group of elements within a parent",
           "selector": ".some-parent-element",
           "children": [
               {
                   "name": "a link within a parent element",
                   "selector": "a[class=\"some_class\"]",
                   "attr": "href",
                   "regex": "(http|s)+"
               }
           ]
       },
       {
           "name": "a group of elements within a parent",
           "selector": ".some-parent-element",
           "regex": "(http|s)+",
           "children": [
               {
                   "name": "a link within a parent element",
                   "selector": "a[class=\"some_class\"]",
                   "attr": "href"
               }
           ]
       }

   ]
}
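
A config like the one above maps naturally onto a small recursive struct. The following is a sketch under assumptions: the JSON keys come from the example, while the Go type names `Job` and `Field`, the helper `parseJob`, and the shortened `sample` document are hypothetical and may not match what parser.go defines.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Field describes one element to extract. Children reuse the same shape,
// which models the parent -> child grouping.
type Field struct {
	Name     string  `json:"name"`
	Selector string  `json:"selector"`
	Attr     string  `json:"attr,omitempty"`
	Regex    string  `json:"regex,omitempty"`
	Children []Field `json:"children,omitempty"`
}

// Job is one scraping job as it might arrive from the queue.
type Job struct {
	Collection string  `json:"collection"`
	URL        string  `json:"url"`
	Fields     []Field `json:"fields"`
}

// parseJob decodes a JSON config into a Job.
func parseJob(data []byte) (Job, error) {
	var job Job
	err := json.Unmarshal(data, &job)
	return job, err
}

// sample is a shortened version of the example config.
const sample = `{
  "collection": "some_random",
  "url": "somerandomurl.com/some/directory/index.html",
  "fields": [
    {
      "name": "some html a element",
      "selector": "body > a[class=\"hello_world\"]",
      "attr": "href"
    },
    {
      "name": "a group of elements within a parent",
      "selector": ".some-parent-element",
      "children": [
        {
          "name": "a link within a parent element",
          "selector": "a[class=\"some_class\"]",
          "attr": "href",
          "regex": "(http|s)+"
        }
      ]
    }
  ]
}`

func main() {
	job, err := parseJob([]byte(sample))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	for _, f := range job.Fields {
		fmt.Printf("%s: %s (%d children)\n", f.Name, f.Selector, len(f.Children))
	}
}
```

Decoding the config this way also catches quoting mistakes in selectors early, because `json.Unmarshal` returns an error for malformed escapes instead of silently accepting them.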

TODO

Some long-term storage support (MAYBE!)
