How to develop and test complex behaviours? #28

Closed
@anjackson

Description

Following our earlier conversation @ikreymer, I'm posting the complex behaviour I've been working on here. I've been using the Chrome Tampermonkey extension to manage the development of the behaviour as a user script.

The target is the current https://sounds.bl.uk site, which uses lots of complex widgets to play audio tracks and build large trees of links that are loaded dynamically.

The script looks like this:

// ==UserScript==
// @name         sounds.bl.uk-auto
// @namespace    http://tampermonkey.net/
// @version      0.1
// @description  Try to archive something totally tricky.
// @author       Andrew Jackson <[email protected]>
// @match        https://sounds.bl.uk/*
// @icon         https://www.google.com/s2/favicons?sz=64&domain=bl.uk
// @grant        none
// ==/UserScript==

// Implements automation of complex AJAX widgets on https://sounds.bl.uk/
// A good, complex example: https://sounds.bl.uk/Arts-literature-and-performance/Theatre-Archive-Project/
(async function() {
    'use strict';

    async function sleep(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
    }

    // Keep clicking any visible "closed" list items until none remain, pausing
    // between clicks so the dynamically loaded sub-trees have time to arrive.
    async function open_all_lists() {
        while(true) {
            var l = document.querySelectorAll('div[aria-hidden="false"] li[class="closed"] a');
            console.log(l);
            if ( l.length > 0 ) {
                for ( var e of l ) {
                    e.click();
                    await sleep(1000);
                }
            } else {
                break;
            }
        }
    }

    // Click every playable element in turn to trigger the audio requests.
    // Note that this doesn't really work in Chrome by default because of the autoplay
    // policy (https://developer.chrome.com/blog/autoplay/), so that needs to be
    // overridden for the automation to work fully.
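    // (One way, as an assumption about the setup here: launch the browser with
    // --autoplay-policy=no-user-gesture-required.)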
    async function run_all_players() {
        var ps = document.querySelectorAll(".playable");
        for (var button of ps) {
            button.click();
            await sleep(1000);
        }
    }


    await sleep(4000);

    // Run players:
    await run_all_players();

    // Open all lists on first tab:
    await open_all_lists();

    // Iterate over other tabs:
    var tabs = document.querySelectorAll(".tabbedContent > ul > li > a");
    for( var tab of tabs ) {
        // Switch tab:
        tab.click();
        await sleep(2000);
        // Iterate over closed list items:
        await open_all_lists();
    }

})();

It's still not perfect as a crawl script. I have tried using it in a Scrapy crawler running behind PyWB in archiving proxy mode, and it struggles with the audio files. There can be quite a few per page, and they load quickly in normal use because the site serves them via HTTP range requests. Archiving HTTP 206 responses doesn't work, so PyWB fetches each file in full with a 200 and then serves chunks back, which makes the fixed sleep timings in the script tricky to get right.
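One possible refinement would be to replace some of the fixed sleeps with a wait on the media elements themselves. This is just a sketch, assuming the players create standard audio/video elements (the selector and readyState threshold are guesses), and it reuses the sleep() helper from the script above:

    async function wait_for_media(timeout_ms) {
        var start = Date.now();
        while (Date.now() - start < timeout_ms) {
            var media = Array.from(document.querySelectorAll('audio, video'));
            // readyState >= 3 (HAVE_FUTURE_DATA) means enough data is buffered to keep playing.
            if (media.length > 0 && media.every(m => m.readyState >= 3)) {
                return true;
            }
            await sleep(500);
        }
        return false; // timed out; the caller can decide whether to carry on anyway
    }

run_all_players() could then await wait_for_media(30000) after each click instead of a fixed sleep(1000), so the slower proxied fetches get more time when they need it.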

This ticket is not so much about archiving this particular site; it's more about how best to develop new behaviours like this, how best to test them, and how to test their integration into Browsertrix Crawler.
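For development and iteration, one low-friction option (only a sketch, and not how Browsertrix Crawler itself integrates behaviours) is to drive the page with Puppeteer, inject the user script body, and capture its console output, so the behaviour can be run from the command line. The file name, URL and launch flags below are assumptions:

    const puppeteer = require('puppeteer');
    const fs = require('fs');

    (async () => {
        const browser = await puppeteer.launch({
            headless: false,
            args: ['--autoplay-policy=no-user-gesture-required'],
        });
        const page = await browser.newPage();
        page.on('console', msg => console.log('[page]', msg.text()));

        await page.goto('https://sounds.bl.uk/Arts-literature-and-performance/Theatre-Archive-Project/', {
            waitUntil: 'networkidle2',
        });

        // Inject the behaviour (the IIFE above, saved to a local file) and wait for it
        // to finish; page.evaluate() resolves once the behaviour's promise settles.
        const script = fs.readFileSync('sounds-bl-uk-behaviour.js', 'utf8');
        await page.evaluate(script);

        await browser.close();
    })();

The same harness could also be pointed at a handful of representative pages to check that the selectors still match after a site change.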
