Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace the XML SAX parser with a DOM parser #242

Open
palemieux opened this issue Mar 16, 2023 · 10 comments
Open

Replace the XML SAX parser with a DOM parser #242

palemieux opened this issue Mar 16, 2023 · 10 comments
Labels
Milestone

Comments

@palemieux
Copy link
Contributor

palemieux commented Mar 16, 2023

Currently, imscJS uses sax.js to parse IMSC documents. This allows the same code to be used in both node and in the browser. This is also doubles the size of the packaged browser JS module.

An alternative approach is to move to a DOM parser, so that the native parser can be used in browsers, while reverting to a external DOM parser in node.

@bbert would this help dash.js?

@gkatsev any further thoughts on this topic?

@nigelmegitt any strong feelings?

@bbert
Copy link

bbert commented Mar 17, 2023

@dsilhavy any opinion?
If it can reduce the bundle size, it can help, even if sax minified bundle is quite small.

FYI, in dash.js the metadataHandler in fromXML() is used: https://github.com/Dash-Industry-Forum/dash.js/blob/54612bb541c43cb7ce900f0c49111999ff817ae2/src/streaming/utils/TTMLParser.js#L130

@gkatsev
Copy link

gkatsev commented Mar 17, 2023

The sax parser is almost a 3rd of the minified-gzipped size of this project (checked via bundlephobia.com).

In another project, I've used https://github.com/xmldom/xmldom for doing the parsing in node and then in the bundle for browsers, it rewrites the import to be return window, since xmldom returns an object with a DOMParser property on it.

While I do think that worrying about a few kb here and there isn't a big deal when you're going to be streaming large video files, it's still helpful to reduce filesize when and where we can, particularly for code that ships to the client side. This seems like a good candidate to help reduce the bundled size of the project.

@nigelmegitt
Copy link
Contributor

@nigelmegitt any strong feelings?

From the documentation for xmldom:

xmldom has an own SAX parser implementation to do the actual parsing

I suspect that changing sax.js for xmldom would likely increase the parsing duration and the minified size, though I haven't checked. Maybe it needs a "spike" experiment.

Another XML parser that could be worth considering is fast-xml-parser, though again, I haven't tried it. It appears to support all the required features, but again, I haven't looked at the size.

If there is a library that uses the native parser when available, and can be compiled into both a "native" and a "not native" variant where the native one has a lower size and hopefully better performance, then I'm all in favour.

@gkatsev
Copy link

gkatsev commented Oct 30, 2023

xmldom is only one suggestion. Any other suitable project is fine. This would likely involve creating two build outputs. One to be used by browsers and one by node.
For example, with mpd-parser, we import xmldom https://github.com/videojs/mpd-parser/blob/56eee17cfbdfbbbdb03c443776b4033405c5448a/src/stringToMpdXml.js#L1

import {DOMParser} from '@xmldom/xmldom';

And in the rollup config, for the "globals" build, we have @xmldom/xmldom return window https://github.com/videojs/mpd-parser/blob/56eee17cfbdfbbbdb03c443776b4033405c5448a/scripts/rollup.config.js#L36

    defaults.browser['@xmldom/xmldom'] = 'window';

Unless we wanted to do dynamic code loading, there's probably no way to handle this without multiple builds.

@palemieux
Copy link
Contributor Author

I am fine with multiple builds.

@bbert
Copy link

bbert commented Oct 31, 2023

@nigelmegitt any strong feelings?

From the documentation for xmldom:

xmldom has an own SAX parser implementation to do the actual parsing

I suspect that changing sax.js for xmldom would likely increase the parsing duration and the minified size, though I haven't checked. Maybe it needs a "spike" experiment.

Another XML parser that could be worth considering is fast-xml-parser, though again, I haven't tried it. It appears to support all the required features, but again, I haven't looked at the size.

If there is a library that uses the native parser when available, and can be compiled into both a "native" and a "not native" variant where the native one has a lower size and hopefully better performance, then I'm all in favour.

Another XML parser to consider that claims to be faster than fast-xml-parser and with very smal bundle size: tXml.

This tXml parser is included in next dash.js v5 for MPD parsing.

@palemieux palemieux added the 2.0 label May 2, 2024
@palemieux palemieux added this to the 2.0 Release milestone May 2, 2024
littlespex added a commit to littlespex/imscJS that referenced this issue Jun 19, 2024
@littlespex
Copy link

littlespex commented Jun 19, 2024

After a few attempts at refactoring fromXML to use DOMParser instead of sax, I found it simpler to have a shim parser that acts like sax but uses DOMParser under the hood. Here is a working example:

littlespex#2

NOTE: This is built on top of #258

@nigelmegitt
Copy link
Contributor

I chatted with @dsilvahy and others about this in the context of dash.js last week. I think it would really make sense to profile imscJS in usage to identify any performance issues - I'm not sure if that has been done? My reasoning is that I think there's a high chance that it's the callbacks that are taking time to complete, rather than than the sax parsing itself.

As an example, we recently discovered an issue in TVs where switching from one string handling method to another, both native JS, made a huge performance difference - I can't exactly remember, but I think one option was to replace the beginning of a long string with an empty string, the other option was to split the long string and use the second part. It was surprising how much the performance was impacted; I wonder if similar small changes could make a big difference in overall parsing performance in imscJS.

@littlespex
Copy link

littlespex commented Jun 19, 2024

I agree that the performance of the parsing should be looked into, but maybe a new issue should be created to track that. For us, the main benifit for removing the parser is the reduction in the number of dependencies in players that use imscJs. For instance, dash.js already uses tXml for parsing, which means you need to bundle two parsers when using the library in its current state. The change above allows for any parser to be passed into the fromXML function with a simple wrapper.

@nigelmegitt
Copy link
Contributor

Thanks for clarifying your motivation @littlespex .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants