Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update article link converting for stardict #1695

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

bergentroll
Copy link

PR fixes two minor links matching issues I had met:

  • Expressions did not match HTML attributes with omitted quotes which is a case of minimized HTML. Now match correctly with no false-positives.
  • Expressions did not process <script src=...> tags. Now expression for img tag matches also script tag.

stardict.cc Outdated Show resolved Hide resolved
@vedgy
Copy link
Member

vedgy commented Apr 27, 2024

Can you attach or link to a dictionary that requires these fixes and post steps to reproduce the fixed bugs?

@bergentroll
Copy link
Author

  1. Get dpd_sample.tar.gz and add to GoldenDict
  2. Search for some entry, e. g. akāsatha

How it supposed to look:
dpd_ok

How it look:
dpd_bad

@vedgy
Copy link
Member

vedgy commented Apr 30, 2024

I downloaded dpd-goldendict.zip from https://github.com/digitalpalidictionary/digitalpalidictionary/releases and it works as supposed without your change:
DPD-js

Does the released dictionary employ some workaround in order to work correctly in current GoldenDict master?

@bergentroll
Copy link
Author

Does the released dictionary employ some workaround in order to work correctly in current GoldenDict master?

Yes, indeed. It skips header part minification and uses following snippet to source a script:

    <link id="bttn_pre" href="buttons.js" rel="preload">

    <script>
    {
        let s=document.createElement('script');
        s.src=document.getElementById('bttn_pre').getAttribute('href');
        document.head.appendChild(s);
    }
    </script>

Copy link
Member

@vedgy vedgy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now match correctly with no false-positives.

What is the "no false-positives" claim based on? Have you tested many Stardict dictionaries?

stardict.cc Outdated Show resolved Hide resolved
stardict.cc Outdated
const char * srcRegexpStr = "(<\\s*img\\s+[^>]*src\\s*=\\s*[\"']+)(?!(?:data|https?|ftp):)";
const char * hrefRegexpStr = "(<\\s*link\\s+[^>]*href\\s*=\\s*[\"']+)(?!(?:data|https?|ftp):)";
const char * srcRegexpStr = "(<\\s*(?:img|script)\\s[^>]*src\\s*=\\s*[\"']*)(?![^>]*(?:data|https?|ftp):)";
const char * hrefRegexpStr = "(<\\s*link\\s[^>]*href\\s*=\\s*[\"']*)(?![^>]*(?:data|https?|ftp):)";
Copy link
Member

@vedgy vedgy May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't replacing [\"']* with [\"']? (zero or one match) be safer?

EDIT: OK, * replaces a +, so it's safer than ?. But do you know why + was here in the first place? Why match multiple quote characters?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. * was chosen because it may be inner quotes, but we actually do not want to match it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we want to match inner quotes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because inner quotes is generally a part of a value, so if it is used for some reason, it should not be truncated from the value anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also do not think we should ignore inner quotes because HTML does not just ignore redundant quotes, or does it?

f29e5b1 originally introduced [\"']*. Then 85250bb introduced the negative look ahead and replaced it with [\"']+.

@Abs62, are you OK with not matching more than one quote character here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also do not think we should ignore inner quotes because HTML does not just ignore redundant quotes, or does it?

Well, not ignores, but decides as a part of value. An attribute value may be wrapped with single or double qoutes (and also may be not wrapped at all). What is inside quotes is a value.

stardict.cc Outdated Show resolved Hide resolved
stardict.cc Outdated Show resolved Hide resolved
stardict.cc Outdated Show resolved Hide resolved
stardict.cc Outdated Show resolved Hide resolved
stardict.cc Show resolved Hide resolved
@vedgy
Copy link
Member

vedgy commented May 20, 2024

The fix looks good to me, except that the 6 commits should be turned back into the original two (amended).

@bergentroll, how did you test this? Any StarDict dictionaries that rely on the previously existing feature? Which Qt versions of GoldenDict? I can test any Qt version if you can provide or link to a necessary StarDict dictionary and the relevant article(s) within it.

stardict.cc Outdated
@@ -466,22 +466,22 @@ string StardictDictionary::handleResource( char type, char const * resource, siz
{
QString articleText = QString( "<div class=\"sdct_h\">" ) + QString::fromUtf8( resource, size ) + "</div>";

const QString srcRegexpStr(
"(<\\s*(?:img|script)\\s[^>]*src\\s*=)(?!\\s*[\"']?(?:data|https?|ftp):)(\\s*[\"']?)" );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more question, is <img src= http://example.com/a.png> valid HTML? (that is, with a space after = and without quotes around the URL). If not, should matching be prevented somehow?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is valid:

followed by zero or more ASCII whitespace, followed by the attribute value

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now there is another thought. Why there is a negative lookup only for http/https/ftp/data protocols? If it will be a gopher:// or let us say a tftp:// link, well, it possibly will not work anyway, but is it a reason to prepend a bres:// scheme?

May be it should be a negative lookup for ://?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be it should be a negative lookup for ://?

Do you know where precisely GoldenDict should and should not insert bres://? I don't know for sure, but bres links are processed in mainwindow.cc, articleview.cc and article_netmgr.cc to extract resources from a dictionary file. Therefore I suppose bres:// is needed or at least acceptable for whatever can be handled by code in those files and is potentially harmful whenever a web browser can follow the unaltered link.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract resources from a dictionary file

Can a resource has e.g. bres://{id}/file://filename.suf form? It is a valid but malformed URL and also contains invalid file path (as for POSIX).

I am sure the substitution is not a security issues solver, because it handles only some specific cases excluding any form of http:// links.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can test to answer your own questions. For StarDict, check what resourceName ends up in StardictResourceRequest::run() for different tricky URLs and figure out whether this run() function can possibly handle such a name.

stardict.cc Outdated Show resolved Hide resolved
@bergentroll
Copy link
Author

commits should be turned back into the original two

Done.

how did you test this? ... I can test any Qt version if you can provide or link to a necessary StarDict dictionary and the relevant article(s) within it.

My investigations was not deep and confined with Qt v5.15. So thank you, if I will find good samples, I will share here.

@vedgy
Copy link
Member

vedgy commented May 25, 2024

commits should be turned back into the original two

Done.

The renaming back to linkRe* and the type change to QString should be transferred to the first commit.

  - Regexps matches attributes with missing qoutes
  - Match also "<script src="
  - Match attributes exactly, not e.g. "pref-src"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants