Handling of malformed iframe tags #86

Open
rushter opened this issue Jul 13, 2021 · 5 comments


rushter commented Jul 13, 2021

I've noticed a pretty annoying problem on some websites (I think there are at least a thousand of them in Alexa 1M).

An unclosed iframe tag breaks all the HTML below it.

Here is an example:

<noscript>
    <iframe
            height="0" width="0" data-src="https://www.googletagmanager.com/ns.html?id=GTM-M5RK4MW" class="lazyload"
            src="">
        <noscript>
            <iframe src="https://www.googletagmanager.com/ns.html?id=GTM-M5RK4MW"
                    height="0" width="0">
        </noscript>
    </iframe>
</noscript>

Only one of the two iframes is closed, but the markup still parses fine with Modest.

But for some reason, if you open it in Chrome (to render the JavaScript parts) and dump the HTML, you get this:

<noscript>
<iframe 
      height="0" width="0" data-src="https://www.googletagmanager.com/ns.html?id=" class="lazyload"
       src="">
       <noscript>
        <iframe src="https://www.googletagmanager.com/ns.html?id=" height="0" width="0">
</noscript>

Now neither iframe has a closing tag.

The problem with this is that Modest will ignore everything after such a tag:

<noscript>
<iframe data-src="https://www.googletagmanager.com/ns.html?id=">
</noscript>


<script></script>
<script></script>
<script></script>

Searching for script nodes using myhtml_get_nodes_by_name or using CSS selectors returns no results.
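For what it's worth, this looks consistent with how the HTML parsing spec treats iframe, as far as I understand it: the element's content is raw text, so the tokenizer keeps consuming characters until it sees a matching end tag, and if there is none, everything up to the end of the input becomes iframe text. Below is a toy C model of that scan. It is not Modest's actual code, and `starts_with_ci` and `rawtext_end` are hypothetical helper names made up for this sketch.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive check that `s` starts with `prefix`. */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Toy model of raw-text scanning. Starting at offset `from`, return the
 * offset of the first "</name" that is followed by whitespace, '/' or '>'.
 * If no such end tag exists, return strlen(html): the rest of the input
 * is swallowed as text.
 */
static size_t rawtext_end(const char *html, size_t from, const char *name)
{
    size_t len = strlen(html);
    size_t name_len = strlen(name);

    for (size_t i = from; i + 2 + name_len <= len; i++) {
        if (html[i] != '<' || html[i + 1] != '/')
            continue;
        if (!starts_with_ci(html + i + 2, name))
            continue;
        /* html[len] is the terminating '\0', so this read is in bounds. */
        char next = html[i + 2 + name_len];
        if (next == '>' || next == '/' || isspace((unsigned char)next))
            return i;
    }
    return len;
}
```

In this model, calling `rawtext_end(html, 0, "iframe")` on input that never contains a closing iframe tag returns `strlen(html)`, which would explain why no script nodes are created after the unclosed tag.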

@lexborisov Are there any ways to improve this? Other parsers can still handle this.

rushter changed the title from "Handling of malformed HTML" to "Handling of malformed iframe tags" on Jul 13, 2021
Contributor

vtorri commented Jul 13, 2021

Author

rushter commented Jul 13, 2021

> @rushter try https://github.com/lexbor/lexbor

I maintain a Python binding for Modest, and lexbor is not ready to replace it yet.

@omerh2802

Any updates on this?

@lexborisov
Owner

Hi @rushter @omerh2802

Maybe we should tell the parser that scripting is enabled (the SCRIPT flag)?
Then everything inside the noscript tag will be treated as text, and nothing else will be affected.

tree->flags |= MyHTML_TREE_FLAGS_SCRIPT

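    /* Assumes `myhtml` is an initialized myhtml_t*, and `html`/`length` hold the input. */
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);

    /* Mark scripting as enabled so that noscript content is parsed as text. */
    tree->flags |= MyHTML_TREE_FLAGS_SCRIPT;

    myhtml_parse(tree, MyENCODING_UTF_8, html, length);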
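To illustrate why the flag should help, here is a toy C model of the two behaviours. It is only a sketch, not MyHTML's real tokenizer: `skip_rawtext` and `count_script_elements` are hypothetical names, and the model assumes lowercase tag names, end tags written exactly as `</name>`, and no tag-like text inside attribute values. With scripting off, noscript content is parsed as markup, so an unclosed iframe swallows the rest of the input. With scripting on, noscript itself is raw text up to its end tag, so the iframe is never opened.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Skip raw text starting at `p`. Returns a pointer just past the first
 * occurrence of `end_tag`, or a pointer to the end of the input if the
 * end tag never appears.
 */
static const char *skip_rawtext(const char *p, const char *end_tag)
{
    const char *end = strstr(p, end_tag);
    return end ? end + strlen(end_tag) : p + strlen(p);
}

/* Count the script elements that this toy parser would actually create. */
static size_t count_script_elements(const char *html, bool scripting)
{
    size_t count = 0;
    const char *p = html;

    while ((p = strchr(p, '<')) != NULL) {
        if (strncmp(p, "<iframe", 7) == 0) {
            /* iframe content is raw text whether or not scripting is on. */
            p = skip_rawtext(p, "</iframe>");
        } else if (scripting && strncmp(p, "<noscript", 9) == 0) {
            /* With scripting on, noscript content is raw text too. */
            p = skip_rawtext(p, "</noscript>");
        } else if (strncmp(p, "<script", 7) == 0) {
            count++;
            p = skip_rawtext(p, "</script>");
        } else {
            p++; /* Any other tag: step past '<' and keep scanning. */
        }
    }
    return count;
}
```

For the stripped-down example from the original report (a noscript containing one unclosed iframe, followed by three empty script elements), this model finds 0 script elements with `scripting == false` and 3 with `scripting == true`.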

@omerh2802

Hi @lexborisov, that sounds like a great idea!
