Following our earlier conversation @ikreymer, I'm posting the complex behaviour I've been working on here. I've been using the Chrome Tampermonkey extension to manage the development of the behaviour as a user script.
The target is the current https://sounds.bl.uk site, which uses lots of complex widgets to play audio tracks and build large trees of links that are loaded dynamically.
The script looks like this:
    // ==UserScript==
    // @name         sounds.bl.uk-auto
    // @namespace    http://tampermonkey.net/
    // @version      0.1
    // @description  Try to archive something totally tricky.
    // @author       Andrew Jackson <[email protected]>
    // @match        https://sounds.bl.uk/*
    // @icon         https://www.google.com/s2/favicons?sz=64&domain=bl.uk
    // @grant        none
    // ==/UserScript==

    // Implements automation of complex AJAX widgets on https://sounds.bl.uk/
    // A good, complex example: https://sounds.bl.uk/Arts-literature-and-performance/Theatre-Archive-Project/

    (async function() {
        'use strict';

        // Resolve after `ms` milliseconds.
        async function sleep(ms) {
            return new Promise(function(resolve) {
                setTimeout(resolve, ms);
            });
        }

        // Keep clicking closed list items in the visible tab until none remain.
        // Opening an item can reveal further closed items, hence the outer loop.
        async function open_all_lists() {
            while (true) {
                const l = document.querySelectorAll('div[aria-hidden="false"] li[class="closed"] a');
                console.log(l);
                if (l.length > 0) {
                    for (const e of l) {
                        e.click();
                        await sleep(1000);
                    }
                } else {
                    break;
                }
            }
        }

        // Note that this doesn't really work in Chrome because of its autoplay policy
        // (https://developer.chrome.com/blog/autoplay/), so that needs to be overridden
        // for the automation to work fully.
        async function run_all_players() {
            const ps = document.querySelectorAll(".playable");
            for (const button of ps) {
                button.click();
                await sleep(1000);
            }
        }

        // Give the page time to finish its initial loading.
        await sleep(4000);

        // Run players:
        await run_all_players();

        // Open all lists on first tab:
        await open_all_lists();

        // Iterate over other tabs:
        const tabs = document.querySelectorAll(".tabbedContent > ul > li > a");
        for (const tab of tabs) {
            // Switch tab:
            tab.click();
            await sleep(2000);
            // Iterate over closed list items:
            await open_all_lists();
        }
    })();
It's still not perfect as a crawl script. I have tried using it in a Scrapy crawler running behind PyWB in archiving proxy mode, and it struggles with the audio files. There can be quite a few per page, and they load quickly in normal use because the site uses HTTP range requests. Archiving HTTP 206 (Partial Content) responses doesn't work, so PyWB fetches each file in full with a 200 and then serves the requested chunks from it. This makes the timing tricky to get right.
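One possible way to make the timing less fragile is to wait for a condition rather than for a fixed delay. The sketch below is only an illustration: `waitUntil` and `clickAndWait` are hypothetical helpers that are not part of the script above, and the assumption that the player exposes a standard `<audio>` element whose `readyState` can be checked has not been verified against sounds.bl.uk.

```javascript
// Poll `predicate` until it returns a truthy value or `timeoutMs` elapses.
// Resolves to true if the condition was met, false if the wait timed out.
async function waitUntil(predicate, timeoutMs = 30000, intervalMs = 250) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await predicate()) {
      return true;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}

// Hypothetical use inside the behaviour: after clicking a player button,
// wait until an audio element reports buffered data instead of sleeping 1 s.
// ASSUMPTION: the page uses a plain <audio> element; adjust the selector
// and the readiness check to whatever the real widget exposes.
async function clickAndWait(button, doc, timeoutMs = 30000) {
  button.click();
  return waitUntil(() => {
    const audio = doc.querySelector("audio");
    // readyState >= 2 is HAVE_CURRENT_DATA on HTMLMediaElement.
    return audio !== null && audio.readyState >= 2;
  }, timeoutMs);
}
```

In the user script this would be called as `await clickAndWait(button, document)` in place of `button.click(); await sleep(1000);`. The timeout still bounds how long a slow full-file fetch can stall the crawl.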
This ticket is not so much about archiving this particular site, but more about how best to develop new behaviours like this, and how best to test them and test the integration of them into Browsertrix Crawler.
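As one possible starting point for the testing side, the behaviour logic can be written so that the document and the sleep function are passed in, which lets it run against a stub outside the browser. This is a minimal sketch, not a description of how Browsertrix Crawler loads or tests behaviours: `openAllLists`, `makeFakeDocument`, `selfTest` and the `maxPasses` guard are all hypothetical names introduced here for illustration.

```javascript
// Same logic as open_all_lists() in the script, but with the document and
// the sleep function injected so it can be exercised without a browser.
// `maxPasses` is an added guard (not in the original script) so that a link
// which never opens cannot cause an endless loop. Returns the click count.
async function openAllLists(doc, sleep, maxPasses = 50) {
  const selector = 'div[aria-hidden="false"] li[class="closed"] a';
  let clicks = 0;
  for (let pass = 0; pass < maxPasses; pass++) {
    const links = doc.querySelectorAll(selector);
    if (links.length === 0) {
      break;
    }
    for (const link of links) {
      link.click();
      clicks++;
      await sleep(1000);
    }
  }
  return clicks;
}

// A hand-rolled stand-in for the page: each fake link "opens" when clicked,
// and the first one reveals a nested closed item, mimicking dynamic loading.
function makeFakeDocument() {
  const closed = [];
  const makeLink = (onOpen) => {
    const link = {
      click() {
        const i = closed.indexOf(link);
        if (i !== -1) closed.splice(i, 1); // no longer matches "closed"
        if (onOpen) onOpen();
      },
    };
    closed.push(link);
    return link;
  };
  makeLink(() => makeLink()); // parent whose click reveals one child
  makeLink();                 // plain leaf item
  // The stub ignores the selector and returns a snapshot of closed links.
  return { querySelectorAll: () => closed.slice() };
}

// Run against the stub with an instant sleep so the check finishes at once.
async function selfTest() {
  return openAllLists(makeFakeDocument(), async () => {});
}
```

In the browser the same function would be called as `await openAllLists(document, sleep)`. A stub like this only checks the control flow; it says nothing about whether the selectors still match the live site, so a real-page check is still needed.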
anjackson changed the title from "How to develop and test complex behaviours" to "How to develop and test complex behaviours?" on Nov 10, 2022