Replies: 2 comments 2 replies
-
This is a great idea, but far removed from what curl is designed to be. With
few exceptions, curl transfers data but never looks at it. This kind of tool
would need to parse and understand data to extract the useful information. A
new tool using libcurl would make sense, but likely not under the curl umbrella.
The new trurl tool for parsing URLs was just spawned from curl because
libcurl already provides the URL parsing API it requires, but the curl tool
wasn't really the appropriate place to add functions that would make full use
of that API.
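For a quick taste, here's one real trurl invocation that exercises that URL API from the command line (see the trurl documentation for the full option set):

```
$ trurl --url https://curl.se/we/are.html --get '{host}'
curl.se
```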
I've often thought a high-level framework for using scrapers would be really
useful. I never used it while it was around, but Yahoo Pipes sounded sort of
like what I had in mind. Having a set of modules available that scrape or
extract data from a variety of web services, and being able to easily combine
them to mash up and manipulate data from different sources on the Internet,
could open up all kinds of interesting possibilities. Naturally, being
able to write your own modules to extract data from new sources would be
necessary, and making the framework open would attract people to publish their
own modules, making the whole greater than the sum of its parts.
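To make that concrete, here is a rough Python sketch of how such composable modules might fit together; every module name in it is invented for illustration:

```python
import json
from urllib.request import urlopen


def fetch(url: str) -> str:
    """Source module: fetch raw content for downstream extractors."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_titles(html: str) -> list[dict]:
    """Stand-in extractor module; a published one would use a real HTML parser."""
    return [{"title": line.strip()} for line in html.splitlines() if "<title>" in line]


def to_json(records: list[dict]) -> str:
    """Sink module: serialize whatever the pipeline produced."""
    return json.dumps(records, indent=2)


# The "mashup": chain independently published modules into one pipeline.
print(to_json(extract_titles(fetch("https://example.com"))))
```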
I haven't really looked in depth to see if something like this is available
today, but I'd start looking at things like WSO2 Mashup Server, Zapier,
IFTTT, Quadrigram, Huginn, although many of those aren't Open Source.
I also suspect that a DSL for the tasks you mention would need to be powerful
enough to handle so many edge cases that the resulting language would end up
nearly as complicated as Python or Ruby or Lua or some other language that's
already established in this area, albeit likely with a nice, simple syntax for
the easy cases.
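For comparison, the easy case is already close to trivial in an established language. A minimal sketch, assuming the third-party requests and beautifulsoup4 packages are installed and a page that uses a hypothetical .price class:

```python
import requests                 # pip install requests
from bs4 import BeautifulSoup   # pip install beautifulsoup4

resp = requests.get("https://example.com/products")
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# ".price" is a made-up CSS selector; adjust it for the real page.
prices = [el.get_text(strip=True) for el in soup.select(".price")]
print(prices)
```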
-
Interesting idea! Implementing a DSL for web scraping in curl could definitely streamline the process, and it could enhance the readability and maintainability of scraping scripts. Integrating support for HTML/XML and JSON parsing, along with first-class object/dictionary support, would make it quite powerful. Adding features like JSON export and a package manager for sharing scrapers would be valuable too. Regarding the name, perhaps something like 'CrawlScript' could work. You might also want to check out Crawlbase for inspiration or collaboration.
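For reference, those building blocks already exist in stock Python; a minimal sketch of JSON parsing, first-class dictionary access, and JSON export using only the standard library (the GitHub API URL is just a convenient public endpoint):

```python
import json
from urllib.request import urlopen

# JSON parsing straight into first-class dictionaries.
with urlopen("https://api.github.com/repos/curl/curl") as resp:
    repo = json.load(resp)

# Field extraction is direct once the data is a dict.
summary = {"name": repo["name"], "stars": repo["stargazers_count"]}

# JSON export comes for free.
print(json.dumps(summary, indent=2))
```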
-
I often write web scrapers like this in Python, sometimes in JavaScript or Go. Considering how good and ubiquitous curl is at data transfer, wouldn't it be a good idea to implement a DSL in curl specifically for writing web scrapers? Something that could be run like this (the tool name and flags below are placeholders, since nothing like it exists yet):
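```
# hypothetical tool and flags, invented purely for illustration
$ webscrape products.ws --json > products.json
```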
It would need at least the following features:
- HTML/XML and JSON parsing
- first-class object/dictionary support
- JSON export of the extracted data
- a package manager for sharing scrapers
There is already a language called Curl, so we would need to come up with a new name.
What do you guys think about this?