Description
Following our earlier conversation, @ikreymer, I'm posting the complex behaviour I've been working on here. I've been using the Tampermonkey extension for Chrome to manage the development of the behaviour as a user script.
The target is the current https://sounds.bl.uk site, which uses lots of complex widgets to play audio tracks and build large trees of links that are loaded dynamically.
The script looks like this:
// ==UserScript==
// @name sounds.bl.uk-auto
// @namespace http://tampermonkey.net/
// @version 0.1
// @description Try to archive something totally tricky.
// @author Andrew Jackson <[email protected]>
// @match https://sounds.bl.uk/*
// @icon https://www.google.com/s2/favicons?sz=64&domain=bl.uk
// @grant none
// ==/UserScript==
// Implements automation of complex AJAX widgets on https://sounds.bl.uk/
// A good, complex example: https://sounds.bl.uk/Arts-literature-and-performance/Theatre-Archive-Project/
(async function() {
    'use strict';

    // Simple promise-based delay helper:
    async function sleep(ms) {
        return new Promise((resolve) => setTimeout(resolve, ms));
    }

    // Keep clicking the links of any still-closed list items in the visible tab,
    // until no closed items remain:
    async function open_all_lists() {
        while (true) {
            var l = document.querySelectorAll('div[aria-hidden="false"] li[class="closed"] a');
            console.log(l);
            if (l.length > 0) {
                for (var e of l) {
                    e.click();
                    await sleep(1000);
                }
            } else {
                break;
            }
        }
    }

    // Note that this doesn't really work in Chrome because of the autoplay policy
    // (https://developer.chrome.com/blog/autoplay/), so that needs to be overridden
    // for the automation to work fully: see the launch-flag sketch below.
    async function run_all_players() {
        var ps = document.querySelectorAll(".playable");
        for (var button of ps) {
            button.click();
            await sleep(1000);
        }
    }

    await sleep(4000);

    // Run players:
    await run_all_players();

    // Open all lists on first tab:
    await open_all_lists();

    // Iterate over other tabs:
    var tabs = document.querySelectorAll(".tabbedContent > ul > li > a");
    for (var tab of tabs) {
        // Switch tab:
        tab.click();
        await sleep(2000);
        // Iterate over closed list items:
        await open_all_lists();
    }
})();
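As noted in the comment above run_all_players(), Chrome's autoplay policy blocks programmatic playback without a user gesture, so the behaviour only works fully if the browser is launched with that policy relaxed. A minimal sketch of what that could look like when driving the page from Node with Puppeteer (an assumption on my part; Browsertrix Crawler manages its own browser launch, so the exact wiring there will differ):

const puppeteer = require('puppeteer');

(async () => {
    // Launch Chrome with the gesture requirement for autoplay switched off,
    // so clicking the .playable buttons from script actually starts audio.
    const browser = await puppeteer.launch({
        args: ['--autoplay-policy=no-user-gesture-required'],
    });
    const page = await browser.newPage();
    await page.goto('https://sounds.bl.uk/Arts-literature-and-performance/Theatre-Archive-Project/');
    // The body of the user script above (minus the Tampermonkey header) could
    // then be injected and run here, e.g. via page.evaluate().
    await browser.close();
})();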
It's still not perfect as a crawl script. I have tried using it in a Scrapy crawler running behind PyWB in archiving proxy mode, and it struggles with the audio files. There can be quite a few per page, and in normal use they load quickly because the site uses HTTP range requests. Archiving HTTP 206 responses doesn't work, so PyWB fetches each file in full with a 200 and then serves chunks back from that, which makes the timing tricky to get right.
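One way to make the timing less fragile might be to replace the fixed sleep() after each click with a wait on the players' underlying <audio> elements (assuming the widgets use <audio>; I haven't confirmed that, so this is only a sketch reusing the sleep() helper from the script above):

// Wait until every <audio> element reports it has buffered enough to play
// through (readyState 4, HAVE_ENOUGH_DATA), or give up after timeout_ms.
async function wait_for_audio(timeout_ms) {
    const start = Date.now();
    while (Date.now() - start < timeout_ms) {
        const audios = Array.from(document.querySelectorAll('audio'));
        if (audios.length > 0 && audios.every(a => a.readyState === 4)) {
            return true;
        }
        await sleep(500);
    }
    return false; // Timed out; the archive may still be fetching the file.
}

run_all_players() could then call something like await wait_for_audio(30000) after each click instead of a fixed one-second sleep, which would give slow full-file fetches through PyWB time to complete before moving on.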
This ticket is not so much about archiving this particular site as about how best to develop new behaviours like this, how best to test them, and how best to test their integration into Browsertrix Crawler.