-
Notifications
You must be signed in to change notification settings - Fork 720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update article link converting for stardict #1695
base: master
Are you sure you want to change the base?
Conversation
Can you attach or link to a dictionary that requires these fixes and post steps to reproduce the fixed bugs? |
|
I downloaded dpd-goldendict.zip from https://github.com/digitalpalidictionary/digitalpalidictionary/releases and it works as supposed without your change: Does the released dictionary employ some workaround in order to work correctly in current GoldenDict master? |
Yes, indeed. It skips header part minification and uses following snippet to source a script: <link id="bttn_pre" href="buttons.js" rel="preload">
<script>
{
let s=document.createElement('script');
s.src=document.getElementById('bttn_pre').getAttribute('href');
document.head.appendChild(s);
}
</script> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now match correctly with no false-positives.
What is the "no false-positives" claim based on? Have you tested many Stardict dictionaries?
stardict.cc
Outdated
const char * srcRegexpStr = "(<\\s*img\\s+[^>]*src\\s*=\\s*[\"']+)(?!(?:data|https?|ftp):)"; | ||
const char * hrefRegexpStr = "(<\\s*link\\s+[^>]*href\\s*=\\s*[\"']+)(?!(?:data|https?|ftp):)"; | ||
const char * srcRegexpStr = "(<\\s*(?:img|script)\\s[^>]*src\\s*=\\s*[\"']*)(?![^>]*(?:data|https?|ftp):)"; | ||
const char * hrefRegexpStr = "(<\\s*link\\s[^>]*href\\s*=\\s*[\"']*)(?![^>]*(?:data|https?|ftp):)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't replacing [\"']*
with [\"']?
(zero or one match) be safer?
EDIT: OK, *
replaces a +
, so it's safer than ?
. But do you know why +
was here in the first place? Why match multiple quote characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. *
was chosen because it may be inner quotes, but we actually do not want to match it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we want to match inner quotes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because inner quotes is generally a part of a value, so if it is used for some reason, it should not be truncated from the value anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also do not think we should ignore inner quotes because HTML does not just ignore redundant quotes, or does it?
f29e5b1 originally introduced [\"']*
. Then 85250bb introduced the negative look ahead and replaced it with [\"']+
.
@Abs62, are you OK with not matching more than one quote character here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also do not think we should ignore inner quotes because HTML does not just ignore redundant quotes, or does it?
Well, not ignores, but decides as a part of value. An attribute value may be wrapped with single or double qoutes (and also may be not wrapped at all). What is inside quotes is a value.
The fix looks good to me, except that the 6 commits should be turned back into the original two (amended). @bergentroll, how did you test this? Any StarDict dictionaries that rely on the previously existing feature? Which Qt versions of GoldenDict? I can test any Qt version if you can provide or link to a necessary StarDict dictionary and the relevant article(s) within it. |
stardict.cc
Outdated
@@ -466,22 +466,22 @@ string StardictDictionary::handleResource( char type, char const * resource, siz | |||
{ | |||
QString articleText = QString( "<div class=\"sdct_h\">" ) + QString::fromUtf8( resource, size ) + "</div>"; | |||
|
|||
const QString srcRegexpStr( | |||
"(<\\s*(?:img|script)\\s[^>]*src\\s*=)(?!\\s*[\"']?(?:data|https?|ftp):)(\\s*[\"']?)" ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more question, is <img src= http://example.com/a.png>
valid HTML? (that is, with a space after =
and without quotes around the URL). If not, should matching be prevented somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is valid:
followed by zero or more ASCII whitespace, followed by the attribute value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now there is another thought. Why there is a negative lookup only for http/https/ftp/data protocols? If it will be a gopher://
or let us say a tftp://
link, well, it possibly will not work anyway, but is it a reason to prepend a bres://
scheme?
May be it should be a negative lookup for ://
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be it should be a negative lookup for
://
?
Do you know where precisely GoldenDict should and should not insert bres://
? I don't know for sure, but bres
links are processed in mainwindow.cc, articleview.cc and article_netmgr.cc to extract resources from a dictionary file. Therefore I suppose bres://
is needed or at least acceptable for whatever can be handled by code in those files and is potentially harmful whenever a web browser can follow the unaltered link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract resources from a dictionary file
Can a resource has e.g. bres://{id}/file://filename.suf
form? It is a valid but malformed URL and also contains invalid file path (as for POSIX).
I am sure the substitution is not a security issues solver, because it handles only some specific cases excluding any form of http://
links.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can test to answer your own questions. For StarDict, check what resourceName
ends up in StardictResourceRequest::run()
for different tricky URLs and figure out whether this run()
function can possibly handle such a name.
Done.
My investigations was not deep and confined with Qt v5.15. So thank you, if I will find good samples, I will share here. |
The renaming back to |
- Regexps matches attributes with missing qoutes - Match also "<script src=" - Match attributes exactly, not e.g. "pref-src"
PR fixes two minor links matching issues I had met:
<script src=...>
tags. Now expression forimg
tag matches alsoscript
tag.