Expose the instance of WTF to the options so that it's possible to use .extend() #106

Open
andremacola opened this issue May 8, 2023 · 7 comments

@andremacola
Contributor

Expose the instance of WTF to the options so that it's possible to use wtf.extend().

Despite WTF being up to date, I couldn't find any way to extend the functionality of WTF through dumpster-dive.

@andremacola
Contributor Author

andremacola commented May 8, 2023

Allowing this led me to another issue: we need to be able to pass async functions, and the JSONfn library is outdated in that regard. Here is an updated version that supports async functions:

const JSONfn = {};

(function () {
  JSONfn.stringify = function (obj) {
    return JSON.stringify(obj, function (key, value) {
      if (typeof value === 'function') {
        if (value.constructor.name === 'AsyncFunction') {
          return '__async_fn__:' + value.toString();
        } else {
          return value.toString();
        }
      } else {
        return value;
      }
    });
  };

  JSONfn.parse = function (str) {
    return JSON.parse(str, function (key, value) {
      if (typeof value !== 'string') return value;

      if (value.substring(0, 13) === '__async_fn__:') {
        return eval('(' + value.substring(13) + ')');
      } else if (value.substring(0, 8) === 'function') {
        return eval('(' + value + ')');
      } else {
        return value;
      }
    });
  };
})();

exports.JSONfn = JSONfn;

It's also necessary to make parseWiki an async function so that it can call data = await options.custom(doc, options, wtf);.

It would be useful to pass not only the doc but also the options and the wtf instance, so that the custom function can call things like wtf.getCategoryPages(term) and read any options it needs.

Additionally, I think the driver object passed to sundayDriver will need an async function for its each property.

All of this is needed to use WTF extensions, especially those that make secondary requests to fetch data that is not in the dump, such as pageviews, some images, etc.
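
To make the proposed call shape concrete, here is a minimal sketch of awaiting a custom callback that receives doc, options and wtf. The names runCustom, fakeDoc and fakeWtf are hypothetical stand-ins for illustration and are not part of dumpster-dive or wtf_wikipedia; a real run would receive a wtf_wikipedia Doc and the wtf instance inside the worker.

```javascript
// Stand-ins for illustration only (hypothetical, not real library objects).
const fakeWtf = {
  // Pretends to make a secondary request, like an extension method would.
  getCategoryPages: async (term) => [`${term} page A`, `${term} page B`]
};
const fakeDoc = { title: () => 'Example' };

// Hypothetical shape of the async step inside parseWiki: it awaits the
// user-supplied callback and hands it doc, options and wtf.
async function runCustom(doc, options, wtf) {
  if (typeof options.custom !== 'function') {
    return null;
  }
  // `await` accepts both plain return values and promises, so sync
  // callbacks keep working.
  return await options.custom(doc, options, wtf);
}

const options = {
  custom: async (doc, opts, wtf) => ({
    title: doc.title(),
    categoryData: await wtf.getCategoryPages(doc.title())
  })
};

const pending = runCustom(fakeDoc, options, fakeWtf);
pending.then((data) => console.log(data));
```

Because the result is awaited, whatever calls this step (for example the each handler mentioned above) would also need to be async, or at least to handle the returned promise.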

Here is an example of my use:

  const data = {
    pageID: doc.pageID(),
    wdID: doc.wikidata(),
    title: doc.title(),
    url: doc.url(),
    image: '',
    redirectTo: isRedirect,
    isDisambiguation: isDisambiguation,
    ranking: 1,
    categories: isArticle || isCategoryTerm ? doc.categories() : [],
    terms: [],
    links: isDisambiguation
      ? doc
          .links()
          .filter(l => l.type() === 'internal' && l.text().length)
          .map(l => l.page() || l.text())
      : [],
    content: doc.content(isArticle),
    metadata: options.wikidata && isArticle && !isCategoryTerm ? await doc.getMetadata() : [],
    categoryData: isCategoryTerm ? await wtf.getCategoryPages(term) : []
  }

  try {
    if (isArticle || isDisambiguation) {
      const irt = await doc.getImageViewsRedirects()
      data.image = irt.image
      data.ranking = irt.ranking
      data.terms = irt.redirects
      // data.content = irt.content
    }
  } catch (err) {
    throw new Error('Failed to fetch extra data from Wikipedia: ' + err.message)
  }

PS: If you pass an async custom function, the stdout logging needs a better approach too.

@spencermountain
Owner

hey André - yup, i agree this is a bad limitation. You're right that any callback sent to the worker needs to go through JSONfn and this really reduces what is possible to do.

What about something like this?

dumpster({ file: './myfile.xml.bz2', wtf_lib:'path/to/lib.js' })

then you can extend, or change wtf in any way.

would that work for you?
cheers

@andremacola
Contributor Author

@spencermountain I even made a fork for custom and it worked, but it was kind of a "hack".

You should update the wtf and mongodb versions too. I can send a PR for this if you want. The Mongo update requires changes to a few files.

I think your suggestion would also solve it. Would it work asynchronously?

@spencermountain
Owner

ha, love it. Of course - prs welcome!
go nuts - i have no idea.

lemme say - if i were you, and had a dump working, i wouldn't call the wikimedia api anymore.
You can just do one pass for the categories, and one for the pages, or something. it's all sitting there!

cheers

@andremacola
Contributor Author

The problem is that I need Pageviews for my current project. Some images on the Portuguese Wikipedia don't come with the dump, and I don't know why. Wikimedia doesn't make things easy, wikitext is a nightmare, my extension is called "wtf from hell" haha.

I gave up on working with data from Wiktionary, it's too much of a headache. You are a warrior for maintaining wtf_wikipedia.

@andremacola
Contributor Author

Just an update: I changed Dumpster to accept "extension: wikiFromHell" in its parameters and switched to a more current JSONfn approach that supports async and arrow functions.
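
The updated JSONfn code is not shown in this thread, so the following is only a sketch of one way such support could work: tag every function with a marker on the way out and revive anything carrying that marker on the way in. FN_PREFIX, stringifyWithFunctions and parseWithFunctions are hypothetical names for illustration, not the JSONfn API.

```javascript
// Hypothetical marker; any string unlikely to appear in real data would do.
const FN_PREFIX = '__fn__:';

function stringifyWithFunctions(obj) {
  // Replace each function with its marked source text.
  return JSON.stringify(obj, (key, value) =>
    typeof value === 'function' ? FN_PREFIX + value.toString() : value
  );
}

function parseWithFunctions(str) {
  return JSON.parse(str, (key, value) => {
    if (typeof value === 'string' && value.startsWith(FN_PREFIX)) {
      // eval runs arbitrary code: only use it on strings you produced yourself.
      // The indirect call (0, eval) evaluates in the global scope.
      return (0, eval)('(' + value.slice(FN_PREFIX.length) + ')');
    }
    return value;
  });
}

const revived = parseWithFunctions(
  stringifyWithFunctions({
    plain: function (a) { return a + 1; },
    arrow: (a) => a * 2,
    asyncArrow: async (a) => a * 3
  })
);

console.log(revived.plain(1), revived.arrow(2));
```

Marking by prefix avoids guessing from the source text whether a string is a function, which is what breaks for arrow functions. One limitation of this sketch: method shorthand such as `{ foo(a) { ... } }` does not stringify to a standalone expression, so it would fail to revive.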

The big problem, as far as I can tell, is that when passed into Dumpster, wikiFromHell loses its references to functions that were brought in via require/import.

const dumpsterOptions = {
  file: './dumps/wikipedia/wiki.dump.xml',
  db: 'wikipedia',
  db_url: 'mongodb://' + process.env.MONGO_HOST,
  skip_redirects: false,
  skip_disambig: false,
  wikidata: false,
  extension: wikiFromHell,
  custom: wtfHellDocument
}

In parseWiki.js:

if (options.extension) {
  wtf.extend(options.extension);
}

wikiFromHell:

const isTrueRedirectFN = require('./isTrueRedirect.js');

const wikiFromHell = (models) => {
  const doc = models.Doc.prototype;
  const wtf = models.wtf;

  doc.isTrueRedirect = function () {
    return isTrueRedirectFN();
  };
}

In this scenario, the reference to isTrueRedirectFN is lost, so I had to put the whole function body directly inside doc.isTrueRedirect. This is probably because the functions go through stringify/parse.
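
The lost reference can be reproduced without dumpster-dive at all. The sketch below assumes the worker rebuilds the function from its source text; it uses new Function to stand in for that step, and the helper and Doc constructor are simplified stand-ins for the real ones.

```javascript
function demo() {
  // Stand-in for a helper brought in with require('./isTrueRedirect.js').
  const isTrueRedirectFN = () => true;

  // The extension closes over isTrueRedirectFN, as in the example above.
  const extension = (models) => {
    models.Doc.prototype.isTrueRedirect = function () {
      return isTrueRedirectFN();
    };
  };

  // Simulate the trip to the worker: only the source text survives, and the
  // function is rebuilt in a scope that never saw isTrueRedirectFN.
  const source = extension.toString();
  const revivedExtension = new Function('return (' + source + ')')();

  function Doc() {}
  revivedExtension({ Doc });

  try {
    return new Doc().isTrueRedirect();
  } catch (err) {
    return err.name; // the closure variable did not travel with the text
  }
}

console.log(demo()); // ReferenceError
```

Function.prototype.toString returns only the function's own source, not the variables it closes over, so anything defined outside the function body is missing once it is rebuilt elsewhere. That matches why inlining the helper into doc.isTrueRedirect works.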

For now I'm using a patched dumpster-dive with wikiFromHell directly extended inside parseWiki.

Is there a way to bypass the worker so that parseWiki can apply the extension directly?

@spencermountain
Owner

yeah, i've seen some have luck with using function(){} instead of arrow functions - but I think you're right, the better solution would be to use const wtf = import(options.wtf_path) or something like that in the worker -
so you can casually import whatever sort of fancy-business you'd like

hope to get some time to look at this properly, next week or so
cheers
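
A rough sketch of the path-based idea, assuming an option named wtf_path as in the comment above (the earlier suggestion called it wtf_lib; neither is confirmed as an existing dumpster-dive option). loadWtf is a hypothetical helper, not existing worker code.

```javascript
const path = require('node:path');
const { pathToFileURL } = require('node:url');

// Hypothetical worker-side helper. Only a path string crosses the process
// boundary, so nothing is stringified and the module keeps its own imports.
async function loadWtf(options, fallback) {
  if (!options.wtf_path) {
    return fallback; // e.g. the stock wtf_wikipedia instance
  }
  const absolute = path.resolve(options.wtf_path);
  // import() returns a promise, so it has to be awaited.
  const mod = await import(pathToFileURL(absolute).href);
  // CommonJS modules expose module.exports as `default` under import().
  return mod.default ?? mod;
}
```

The file at that path could require wtf_wikipedia, call wtf.extend() with any plugins, and export the extended instance. Since the worker loads the file itself, closures and require calls inside it stay intact.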
