A toy web crawler that crawls the given domain and produces a sitemap (printed in a tree format) listing the links between pages and the static assets each page uses.
The crawler does not follow external links; it only visits links belonging to the same domain.
- Assumes a standard installation of Go 1.8.3
- Tested on amd64 Linux
> make install
> make test
> make build
> ./crawler -depth <Level> -delay <Milliseconds> -output <File> <URL>
> cat <File>
Flag | Default | Description |
---|---|---|
`-depth <Level>` | 3 | Depth up to which the crawler should crawl from the given URL |
`-delay <Milliseconds>` | 0 | Politeness delay between consecutive requests to the domain |
`-staticAsset <Boolean>` | true | Whether static assets should be listed in the sitemap |
`-output <File>` | stdout | File to which the sitemap is written |
`-h` | - | Prints the command-line utility usage |
> ./crawler -output sitemap.txt tomblomfield.com
> cat sitemap.txt
> ./crawler -h
- Each URL is fetched by a separate goroutine, since fetching is an I/O-intensive operation, so Go's concurrency model can be leveraged to increase performance.
- Spawns only `runtime.NumCPU() * 3` worker goroutines, to keep control over the number of parallel requests and to avoid the memory-allocation issues that could arise when crawling a huge website if every URL spawned its own goroutine (a combined worker-pool and HTTP-client sketch appears further below).
- Sets a timeout of 20 seconds on the HTTP client; the default in the `net/http` package is no timeout, which can lead to unresponsive behavior.
- Uses a generic `DataStore` interface, which can later be implemented by a persistent or in-memory data store such as SQLite/Redis instead of the `LocalStore` (a map guarded by a mutex) used right now (a sketch follows the code block below).
- Byte-slice to string conversion is done using the `reflect` and `unsafe` packages by reinterpreting the `SliceHeader` as a `StringHeader`. In comparison to a plain `string(bArray)` conversion, this approach requires no copy of the data and is therefore more performant:
```go
// ByteSliceToString reinterprets the slice's backing array as a string,
// avoiding the copy a regular string conversion would make.
func ByteSliceToString(bArray []byte) string {
	sliceHeader := (*reflect.SliceHeader)(unsafe.Pointer(&bArray))
	stringHeader := reflect.StringHeader{
		Data: sliceHeader.Data,
		Len:  sliceHeader.Len,
	}
	stringPointer := (*string)(unsafe.Pointer(&stringHeader))
	return *stringPointer
}
```
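
For illustration, here is a minimal sketch of the `DataStore` abstraction and the `LocalStore` (map with mutex) mentioned above; the method names are assumptions for the sketch, not the crawler's actual interface:

```go
import "sync"

// DataStore is the storage abstraction; a SQLite/Redis-backed store would only
// need to satisfy this interface. Method names are illustrative assumptions.
type DataStore interface {
	Put(url string, links []string)
	Get(url string) ([]string, bool)
}

// LocalStore is the in-memory implementation: a map guarded by a mutex.
type LocalStore struct {
	mu    sync.RWMutex
	pages map[string][]string
}

func NewLocalStore() *LocalStore {
	return &LocalStore{pages: make(map[string][]string)}
}

func (s *LocalStore) Put(url string, links []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pages[url] = links
}

func (s *LocalStore) Get(url string) ([]string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	links, ok := s.pages[url]
	return links, ok
}
```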
- As all requests go to the same domain, a politeness delay is added by sleeping the goroutine for a random duration via `time.Sleep(time.Millisecond * time.Duration(rand.Intn(*delay)+*delay/2))`, configured with the `-delay <Milliseconds>` flag. If nothing is specified, no delay is induced by default.
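
A minimal sketch of that delay logic, assuming `d` holds the flag value in milliseconds; the zero-value guard is an assumption about how the crawler skips the delay (and avoids `rand.Intn(0)`) when the flag is left at its default:

```go
import (
	"math/rand"
	"time"
)

// politenessDelay sleeps for a random duration in [d/2, 3d/2) milliseconds.
// With the default d == 0 it returns immediately, so no delay is induced.
// Hypothetical helper name; the crawler inlines equivalent logic per request.
func politenessDelay(d int) {
	if d <= 0 {
		return
	}
	time.Sleep(time.Millisecond * time.Duration(rand.Intn(d)+d/2))
}
```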
- Sets `InsecureSkipVerify: true` in the `TLSClientConfig`, which skips TLS certificate verification for now (see the client configuration sketch below).
- Cannot parse dynamically generated webpages as of now. This could be addressed by injecting a JavaScript engine or a headless Chrome instance to render dynamic content.
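
Below is a combined sketch of the HTTP client configuration (20-second timeout, `InsecureSkipVerify`) and the bounded worker pool described above. Function and channel names are assumptions, and the real crawler feeds newly discovered links back into the queue with depth tracking rather than working from a fixed slice:

```go
import (
	"crypto/tls"
	"net/http"
	"runtime"
	"sync"
	"time"
)

// fetchAll fans the given URLs out to a fixed pool of worker goroutines that
// share a single HTTP client. Illustrative sketch only.
func fetchAll(urls []string) {
	client := &http.Client{
		Timeout: 20 * time.Second, // net/http's default is no timeout at all
		Transport: &http.Transport{
			// TLS certificate verification is skipped for now.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	queue := make(chan string, len(urls))
	for _, u := range urls {
		queue <- u
	}
	close(queue)

	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU()*3; i++ { // fixed worker count instead of one goroutine per URL
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range queue {
				resp, err := client.Get(u)
				if err != nil {
					continue
				}
				// Parse resp.Body for links and static assets here.
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}
```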
- Cannot resolve URLs that ultimately point to the same page (a possible partial mitigation for the redirect case is sketched after this list), e.g.:
  - tomblomfield.com/page/1 being the same page as tomblomfield.com
  - https://www.sitemaps.org/index.php resolving to https://www.sitemaps.org/index.html and https://www.sitemaps.org
- Lack of integration tests.
- The user experience is not great, as there is no progress bar showing the crawler's status.
- The design is not extensible enough to build a distributed crawler.
- The Makefile is not up to standard, and a dependency management tool like dep should be added. Currently all external dependencies are listed in the Makefile.
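
One possible partial mitigation for the URL-resolution limitation above (a sketch, not the crawler's current behavior): since `net/http` follows redirects automatically, the final URL can be read back from the response and used as the canonical sitemap key, which would collapse redirect aliases such as index.php resolving to index.html:

```go
import "net/http"

// canonicalURL returns the URL the server ultimately serves after redirects,
// so redirect aliases map to a single sitemap node. Sketch only.
func canonicalURL(client *http.Client, raw string) (string, error) {
	resp, err := client.Get(raw)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return resp.Request.URL.String(), nil // final URL after all redirects
}
```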
- gotree - used for printing the sitemap in tree format
- goquery - used for parsing HTML pages
- govalidator - used for validating URLs
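
These are presumably fetched in the Makefile via plain `go get`; the import paths below are the commonly used upstream ones and are only an assumption about what the Makefile actually contains:
> go get github.com/disiqueira/gotree
> go get github.com/PuerkitoBio/goquery
> go get github.com/asaskevich/govalidator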