Replies: 2 comments 2 replies
-
This is a great idea, but far removed from what curl is designed to be. With
few exceptions, curl transfers data but never looks at it. This kind of tool
would need to parse and understand data to extract the useful information. A
new tool using libcurl would make sense, but likely not under the curl umbrella.
The new trurl tool for parsing URLs was just spawned from curl because
libcurl already provides the URL parsing API it requires, but the curl tool
wasn't really the appropriate place to add functions that would make full use
of that API.
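For a quick taste, here's one real trurl invocation that exercises that URL API from the command line (see the trurl documentation for the full option set):

```
$ trurl --url https://curl.se/we/are.html --get '{host}'
curl.se
```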
I've often thought a high-level framework for using scrapers would be really
useful. I never used it while it was around, but Yahoo Pipes sounded sort of
like what I had in mind. Having a set of modules available that scrape or
extract data from a variety of web services, and being able to easily combine
them to mash up and manipulate data from different sources on the Internet,
could open up all kinds of interesting possibilities. Naturally, being
able to write your own modules to extract data from new sources would be
necessary, and making the framework open would attract people to publish their
own modules, making the whole greater than the sum of its parts.
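To make that concrete, here is a rough Python sketch of how such composable modules might fit together; every module name in it is invented for illustration:

```python
import json
from urllib.request import urlopen


def fetch(url: str) -> str:
    """Source module: fetch raw content for downstream extractors."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_titles(html: str) -> list[dict]:
    """Stand-in extractor module; a published one would use a real HTML parser."""
    return [{"title": line.strip()} for line in html.splitlines() if "<title>" in line]


def to_json(records: list[dict]) -> str:
    """Sink module: serialize whatever the pipeline produced."""
    return json.dumps(records, indent=2)


# The "mashup": chain independently published modules into one pipeline.
print(to_json(extract_titles(fetch("https://example.com"))))
```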
I haven't really looked in depth to see if something like this is available
today, but I'd start looking at things like WSO2 Mashup Server, Zapier,
IFTTT, Quadrigram, Huginn, although many of those aren't Open Source.
I also suspect that a DSL for the tasks you mention would need to be powerful
enough to handle so many edge cases that the resulting language would end up
nearly as complicated as Python or Ruby or Lua or some other language that's
already established in this area, albeit likely with a nice, simple syntax for
the easy cases.
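For comparison, the easy case is already close to trivial in an established language. A minimal sketch, assuming the third-party requests and beautifulsoup4 packages are installed and a page that uses a hypothetical .price class:

```python
import requests                 # pip install requests
from bs4 import BeautifulSoup   # pip install beautifulsoup4

resp = requests.get("https://example.com/products")
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# ".price" is a made-up CSS selector; adjust it for the real page.
prices = [el.get_text(strip=True) for el in soup.select(".price")]
print(prices)
```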
-
Interesting idea! Implementing a DSL for web scraping in curl could definitely streamline the process, and it could enhance the readability and maintainability of scraping scripts. Integrating support for HTML/XML and JSON parsing, along with first-class object/dictionary support, would make it quite powerful. Adding features like JSON export and a package manager for sharing scrapers would be valuable too. Regarding the name, perhaps something like 'CrawlScript' could work. You might also want to check out Crawlbase for inspiration or collaboration.
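For reference, those building blocks already exist in stock Python; a minimal sketch of JSON parsing, first-class dictionary access, and JSON export using only the standard library (the GitHub API URL is just a convenient public endpoint):

```python
import json
from urllib.request import urlopen

# JSON parsing straight into first-class dictionaries.
with urlopen("https://api.github.com/repos/curl/curl") as resp:
    repo = json.load(resp)

# Field extraction is direct once the data is a dict.
summary = {"name": repo["name"], "stars": repo["stargazers_count"]}

# JSON export comes for free.
print(json.dumps(summary, indent=2))
```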
-
I often write web scrapers like this in Python, sometimes in JavaScript or Go. Considering how good and ubiquitous curl is at data transfer, wouldn't it be a good idea to implement a DSL in curl specifically for writing web scrapers? Something that could be run like this (the tool name and flags below are placeholders, since nothing like it exists yet):
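```
# hypothetical tool and flags, invented purely for illustration
$ webscrape products.ws --json > products.json
```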
It would need at least the following features:
- HTML/XML and JSON parsing
- first-class object/dictionary support
- JSON export of the extracted data
- a package manager for sharing scrapers
There is already a language called Curl, so we would need to come up with a new name.
What do you guys think about this?