
Updating URL Regex #275

Closed
wants to merge 1 commit into from

Conversation

@s-rah (Member) commented Oct 13, 2015

The old URL regex had a few issues which were revealed by fuzzing,
the biggest being that it accepted non-printable characters (e.g.
0x00 or 0x01) as part of the URL.

This created the scenario where a URL of https://example.com/[0x00]
would be rendered as %2 (and attempting to open the link would give
a value like "https://example.com https://example.com " due to some
odd interaction with the regex that I haven't quite worked out).

The new regex appears to work with all the iterations I have tried
and rejects non-printable characters.

Below are two screenshots: the first taken before the change with the old regex, and the second with the new regex.

The test strings were:

       ricochet.SendMessage("https://test.com/"+string(0x00)+"LOL", 5)
       ricochet.SendMessage("https://google.com", 5)
       ricochet.SendMessage("https://ع.com", 5)
       ricochet.SendMessage("https://عععععع<b>test</b>عععععععععع.com", 5)
       ricochet.SendMessage("https://عععععععععععععععع.com", 5)
       ricochet.SendMessage("https://ععع234233344ععععععععععععع.com/342?test=5&7=hello~", 5)

[Screenshot: before]
[Screenshot: after]
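
For reference, a minimal standalone sketch of the same check, assuming Qt 5's QRegularExpression API (the main() wrapper and input string are illustrative, not the actual test harness):

#include <QRegularExpression>
#include <QString>
#include <QDebug>

int main()
{
    // Old pattern: anything that isn't whitespace is accepted, so the 0x00 byte
    // ends up inside the match.
    QRegularExpression oldRegex(
        QStringLiteral("([a-z]{3,9}:|www\\.)([^\\s,.);!>]|[,.);!>](?!\\s|$))+"),
        QRegularExpression::CaseInsensitiveOption);

    // New pattern: only letters, digits and common URL punctuation are accepted,
    // so the match stops before the non-printable character.
    QRegularExpression newRegex(
        QStringLiteral("([a-z]{3,9}:|www\\.)([\\p{L}\\p{N}\\-\\._~:=&/?#]+)"),
        QRegularExpression::CaseInsensitiveOption);

    // Fuzz-style input with an embedded 0x00 byte.
    const QString input = QStringLiteral("https://test.com/")
                          + QChar(QChar::Null)
                          + QStringLiteral("LOL");

    qDebug() << "old regex matched:" << oldRegex.match(input).captured(0);
    qDebug() << "new regex matched:" << newRegex.match(input).captured(0);
    return 0;
}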

@@ -41,7 +41,7 @@ LinkedText::LinkedText(QObject *parent)
     : QObject(parent)
 {
     // Select things that look like URLs of some kind and allow QUrl::fromUserInput to validate them
-    linkRegex = QRegularExpression(QStringLiteral("([a-z]{3,9}:|www\\.)([^\\s,.);!>]|[,.);!>](?!\\s|$))+"), QRegularExpression::CaseInsensitiveOption);
+    linkRegex = QRegularExpression(QStringLiteral("([a-z]{3,9}:|www\\.)([\\p{L}\\p{N}\\-\\._~:=&/?#]+)"), QRegularExpression::CaseInsensitiveOption);
Member (review comment)
Hmm, we are losing a (useful?) trick here.

This section would capture all non-whitespace characters, excluding the characters ,.);!> only when they appear at the end of the URL. The idea is that if you type "I use http://ricochet.im/. It's neat.", that trailing period doesn't become part of the link.

Is there a way we can duplicate that effect here, by excluding those characters from the first class and adding them in a second alternative guarded by the negative lookahead?
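
One way that combination might look, as a drop-in replacement for the assignment in the diff above (an untested sketch only; the exact character classes are an assumption, not a reviewed pattern):

// Hypothetical combination: keep the Unicode-aware classes from this PR, but move the
// trailing punctuation ,.);!> into a second alternative that only matches when the
// character is not followed by whitespace or end-of-input.
linkRegex = QRegularExpression(
    QStringLiteral("([a-z]{3,9}:|www\\.)([\\p{L}\\p{N}\\-_~:=&/?#]|[,.);!>](?!\\s|$))+"),
    QRegularExpression::CaseInsensitiveOption);

Note that this would also allow those punctuation characters in the middle of a URL, which the pattern proposed in this PR currently rejects.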

@special (Member) commented Nov 3, 2015

Also, nice catch - thanks for taking the time to fuzz this.

@s-rah (Member, Author) commented Nov 9, 2015

Closing in favor of new PR #302

@s-rah closed this on Nov 9, 2015