Update article link converting for stardict #1695

bergentroll · 2024-04-22T15:34:42Z

This PR fixes two minor link-matching issues I ran into:

Expressions did not match HTML attributes with omitted quotes, which is the case in minified HTML. Now match correctly with no false-positives.
Expressions did not process <script src=...> tags. Now the expression for the img tag also matches the script tag.

vedgy · 2024-04-27T16:11:33Z

Can you attach or link to a dictionary that requires these fixes and post steps to reproduce the fixed bugs?

bergentroll · 2024-04-29T18:29:35Z

1. Get dpd_sample.tar.gz and add it to GoldenDict.
2. Search for some entry, e.g. akāsatha.

How it is supposed to look:

How it looks:

vedgy · 2024-04-30T07:02:55Z

I downloaded dpd-goldendict.zip from https://github.com/digitalpalidictionary/digitalpalidictionary/releases and it works as expected without your change:

Does the released dictionary employ some workaround in order to work correctly in current GoldenDict master?

bergentroll · 2024-04-30T17:42:01Z

> Does the released dictionary employ some workaround in order to work correctly in current GoldenDict master?

Yes, indeed. It skips minification of the header part and uses the following snippet to load the script:

    <link id="bttn_pre" href="buttons.js" rel="preload">

    <script>
    {
        let s=document.createElement('script');
        s.src=document.getElementById('bttn_pre').getAttribute('href');
        document.head.appendChild(s);
    }
    </script>

vedgy

> Now match correctly with no false-positives.

What is the "no false-positives" claim based on? Have you tested many Stardict dictionaries?

vedgy · 2024-05-01T16:37:22Z

stardict.cc

- const char * srcRegexpStr = "(<\\s*img\\s+[^>]*src\\s*=\\s*[\"']+)(?!(?:data|https?|ftp):)";
- const char * hrefRegexpStr = "(<\\s*link\\s+[^>]*href\\s*=\\s*[\"']+)(?!(?:data|https?|ftp):)";
+ const char * srcRegexpStr = "(<\\s*(?:img|script)\\s[^>]*src\\s*=\\s*[\"']*)(?![^>]*(?:data|https?|ftp):)";
+ const char * hrefRegexpStr = "(<\\s*link\\s[^>]*href\\s*=\\s*[\"']*)(?![^>]*(?:data|https?|ftp):)";


Wouldn't replacing [\"']* with [\"']? (zero or one match) be safer?

EDIT: OK, * replaces a +, so it's safer than ?. But do you know why + was here in the first place? Why match multiple quote characters?

bergentroll · 2024-05-01

You are right. * was chosen because there may be inner quotes, but we actually do not want to match them.

vedgy · 2024-05-01

Why don't we want to match inner quotes?

bergentroll · 2024-05-01

Because inner quotes are generally part of the value, so if they are used for some reason, they should not be stripped from the value anyway.

vedgy · 2024-05-02

I also do not think we should ignore inner quotes because HTML does not just ignore redundant quotes, or does it?

f29e5b1 originally introduced [\"']*. Then 85250bb introduced the negative lookahead and replaced it with [\"']+.

@Abs62, are you OK with not matching more than one quote character here?

bergentroll · 2024-05-15

> I also do not think we should ignore inner quotes because HTML does not just ignore redundant quotes, or does it?

Well, it does not ignore them, but treats them as part of the value. An attribute value may be wrapped in single or double quotes (or not wrapped at all). Whatever is inside the quotes is the value.
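To make the difference between the two versions of srcRegexpStr in the diff above concrete, here is a minimal sketch using std::regex instead of Qt. The two patterns are copied from the diff; the helper rewriteLinks, the replacement string and the "dict-id" placeholder are assumptions for illustration only and are not taken from stardict.cc.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Patterns copied from the diff above (raw strings, so no C++ escaping).
const std::string oldSrcPattern =
    R"re((<\s*img\s+[^>]*src\s*=\s*["']+)(?!(?:data|https?|ftp):))re";
const std::string newSrcPattern =
    R"re((<\s*(?:img|script)\s[^>]*src\s*=\s*["']*)(?![^>]*(?:data|https?|ftp):))re";

// Hypothetical helper: insert a bres:// base right after the matched prefix.
// "dict-id" is a placeholder, not a real GoldenDict dictionary id.
std::string rewriteLinks(const std::string& html, const std::string& pattern)
{
    const std::regex re(pattern);  // ECMAScript grammar by default
    return std::regex_replace(html, re, "$1bres://dict-id/");
}
```

Under these assumptions, the old pattern leaves `<img src=pic.png>` and any `<script src=...>` tag untouched, while the new one rewrites both and still skips absolute http(s) links.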

vedgy · 2024-05-20T19:05:36Z

The fix looks good to me, except that the 6 commits should be turned back into the original two (amended).

@bergentroll, how did you test this? Any StarDict dictionaries that rely on the previously existing feature? Which Qt versions of GoldenDict? I can test any Qt version if you can provide or link to a necessary StarDict dictionary and the relevant article(s) within it.

vedgy · 2024-05-21T07:45:06Z

stardict.cc

@@ -466,22 +466,22 @@ string StardictDictionary::handleResource( char type, char const * resource, siz
 {
 QString articleText = QString( "<div class=\"sdct_h\">" ) + QString::fromUtf8( resource, size ) + "</div>";

+ const QString srcRegexpStr(
+ "(<\\s*(?:img|script)\\s[^>]*src\\s*=)(?!\\s*[\"']?(?:data|https?|ftp):)(\\s*[\"']?)" );


One more question, is <img src= http://example.com/a.png> valid HTML? (that is, with a space after = and without quotes around the URL). If not, should matching be prevented somehow?

bergentroll · 2024-05-22

Yes, it is valid:

> followed by zero or more ASCII whitespace, followed by the attribute value
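A quick way to check how the reviewed pattern treats whitespace after `=` is to run it outside GoldenDict. The sketch below uses std::regex rather than QRegularExpression; the pattern is copied from the diff hunk above, while the function toBres, the replacement string and the "dict-id" placeholder are assumptions made for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Pattern copied from the reviewed diff: the negative lookahead sits right
// after '=', and the optional whitespace and quote form a second group.
const std::regex srcRe(
    R"re((<\s*(?:img|script)\s[^>]*src\s*=)(?!\s*["']?(?:data|https?|ftp):)(\s*["']?))re");

// Hypothetical replacement: keep both captured groups, then add a bres:// base.
// "dict-id" is a placeholder, not a real GoldenDict dictionary id.
std::string toBres(const std::string& html)
{
    return std::regex_replace(html, srcRe, "$1$2bres://dict-id/");
}
```

With this pattern, an unquoted absolute URL preceded by a space is left alone because the lookahead consumes the whitespace before testing the scheme, while a relative unquoted value after a space still gets the prefix.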

bergentroll · 2024-05-22

Now there is another thought. Why is there a negative lookahead only for the http/https/ftp/data schemes? If it is a gopher:// or, let us say, a tftp:// link, it possibly will not work anyway, but is that a reason to prepend a bres:// scheme?

Maybe it should be a negative lookahead for ://?

vedgy · 2024-05-23

> Maybe it should be a negative lookahead for ://?

Do you know where precisely GoldenDict should and should not insert bres://? I don't know for sure, but bres links are processed in mainwindow.cc, articleview.cc and article_netmgr.cc to extract resources from a dictionary file. Therefore I suppose bres:// is needed or at least acceptable for whatever can be handled by code in those files and is potentially harmful whenever a web browser can follow the unaltered link.

bergentroll · 2024-05-24

> extract resources from a dictionary file

Can a resource have e.g. the form bres://{id}/file://filename.suf? It is a valid but malformed URL and also contains an invalid file path (as far as POSIX is concerned).

I am sure the substitution does not solve security issues, because it handles only some specific cases, excluding any form of http:// links.

vedgy · 2024-05-24

You can test to answer your own questions. For StarDict, check what resourceName ends up being in StardictResourceRequest::run() for different tricky URLs and figure out whether this run() function can possibly handle such a name.
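The scheme question can also be explored in isolation before touching StardictResourceRequest::run(). The sketch below compares the lookahead from the reviewed pattern with one possible reading of the "negative lookahead for ://" idea. The second pattern is hypothetical (it was not proposed in this form in the thread and keeps an explicit data: exception, since data URIs contain no ://), and rewriteWith and "dict-id" are illustration-only names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Lookahead from the reviewed pattern: only data/http/https/ftp are skipped.
const std::regex listedSchemesRe(
    R"re((<\s*(?:img|script)\s[^>]*src\s*=)(?!\s*["']?(?:data|https?|ftp):)(\s*["']?))re");

// Hypothetical variant: skip data: URIs and anything shaped like scheme://.
const std::regex anySchemeRe(
    R"re((<\s*(?:img|script)\s[^>]*src\s*=)(?!\s*["']?(?:data:|[A-Za-z][A-Za-z0-9+.-]*://))(\s*["']?))re");

// Hypothetical helper; "dict-id" is a placeholder dictionary id.
std::string rewriteWith(const std::regex& re, const std::string& html)
{
    return std::regex_replace(html, re, "$1$2bres://dict-id/");
}
```

Under these assumptions, the listed-schemes pattern turns a gopher:// link into a bres://dict-id/gopher://... URL, which is the malformed shape asked about above, while the variant leaves it untouched. Whether either result is acceptable still depends on what the resource-loading code does with it.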

bergentroll · 2024-05-25T13:24:43Z

> commits should be turned back into the original two

Done.

> how did you test this? ... I can test any Qt version if you can provide or link to a necessary StarDict dictionary and the relevant article(s) within it.

My investigation was not deep and was confined to Qt v5.15. So thank you; if I find good samples, I will share them here.

vedgy · 2024-05-25T13:50:37Z

> > commits should be turned back into the original two
>
> Done.

The renaming back to linkRe* and the type change to QString should be transferred to the first commit.

vedgy requested changes May 1, 2024

bergentroll force-pushed the master branch from 22fd842 to 1ef070b · May 1, 2024 18:25

bergentroll force-pushed the master branch from 30ec38f to 426d0be · May 25, 2024 13:13

bergentroll added 2 commits May 25, 2024 18:55

Move 2 common regexps to vars

0cd5872

Update stardict link converting regexps

5cabf35

- Regexps match attributes with missing quotes
- Match also "<script src="
- Match attributes exactly, not e.g. "pref-src"

bergentroll force-pushed the master branch from 426d0be to 5cabf35 · May 25, 2024 15:57
