URLs are not matched #15

lemon234071 · 2021-04-06T12:29:13Z

import cleantext

text = "郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈http://t.c"
cleantext.replace_urls(text, "XXX")

Output:

郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈哈http://t.c

Expected:

郭麒麟打卡,且听他分享防疫小知识XXX哈哈XXX哈哈哈哈XXX

jfilter · 2021-08-26T00:30:36Z

Hey @lemon234071, thanks for reporting. I'm not sure how to handle this. Right now, a URL has to be separated from other tokens somehow (e.g. by a preceding space). In your string, the URLs could be detected by looking at the ASCII characters in the string. Maybe that could be the basis for special handling of Chinese texts? I would not adapt the current URL regex for English (etc.). https://github.com/jfilter/clean-text/blob/master/cleantext/constants.py#L62
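
For illustration, the ASCII-based idea could look roughly like the sketch below. It is not part of clean-text: `ASCII_URL` and `replace_ascii_urls` are hypothetical names, and the pattern is an assumption (a scheme followed by a run of ASCII characters that are allowed in URLs). The colon after the scheme is made optional only so that the malformed `http//t.cn/...` in the report is caught as well.

```python
import re

# Hypothetical pattern, not the regex used by clean-text.
# A URL is taken to be "http" or "https", an optional colon, "//",
# and then the longest run of ASCII characters that may appear in a URL.
# Any non-ASCII character (e.g. a Chinese character) ends the match.
ASCII_URL = re.compile(r"https?:?//[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def replace_ascii_urls(text, replacement="<URL>"):
    """Replace ASCII-only URLs that are not delimited by whitespace."""
    return ASCII_URL.sub(replacement, text)


text = "郭麒麟打卡,且听他分享防疫小知识https://www.zhihu.com/qution/319823639哈哈http//t.cn/a67ov8bt哈哈哈http://t.c"
print(replace_ascii_urls(text, "XXX"))
```

Because ASCII punctuation is part of the character class, a comma or period typed directly after a URL would be swallowed too, which is one reason to keep this separate from the default regex for English text.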
