Revisit similarity measure #53

Open
MrOrz opened this issue Feb 3, 2018 · 10 comments
Comments

@MrOrz
Member

MrOrz commented Feb 3, 2018

There are many near-duplicate messages in the database.

Most of them contain the same sentences, but they are considered different by the user.

Maybe we should lower the threshold so that more documents are considered identical.

Identical documents:

👇👇永久限定米奇貼圖限時下載 ("👇👇 Permanent limited-edition Mickey stickers, download for a limited time")

https://cofacts.g0v.tw/article/AWFbPonmhutQxxU6tq_J
https://cofacts.g0v.tw/article/AWFaPnhqhutQxxU6tq-P
https://cofacts.g0v.tw/article/AWFWnVozhutQxxU6tq7Z
https://cofacts.g0v.tw/article/AWFaFN4whutQxxU6tq-C
(Notice the "similar articles" section)

https://cofacts.g0v.tw/article/AWFzR6zShutQxxU6trZZ
https://cofacts.g0v.tw/article/AWFvKvIbhutQxxU6trTp

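超可愛 快點網址下載 不是詐騙 我下載到了 好開心哦 聖誕節才有 點網址就能套用主題了 ("Super cute, hurry and download from the link, it's not a scam, I got it, so happy, only available at Christmas, just tap the link to apply the theme")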

https://cofacts.g0v.tw/article/AWCL78pVyCdS-nWhulxu
https://cofacts.g0v.tw/article/AWB5Q9nKyCdS-nWhulki

Counter-example (the stickers really exist; these fall under advertising):
https://cofacts.g0v.tw/replies?before=&after=&filter=NOT_RUMOR&mine=&q=%E8%B2%BC%E5%9C%96

台北捷運伴你同行 貼圖 (Taipei Metro "伴你同行" stickers; this campaign really exists)

https://cofacts.g0v.tw/article/AWFvYRb3hutQxxU6trUF
https://cofacts.g0v.tw/article/AWFu2p62hutQxxU6trSv
https://cofacts.g0v.tw/article/AWFuY5lshutQxxU6trRj
https://cofacts.g0v.tw/article/AWFvFI8-hutQxxU6trTa
https://cofacts.g0v.tw/article/AWFvWoO3hutQxxU6trT-
https://cofacts.g0v.tw/article/AWFvQX0PhutQxxU6trT3

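這部影片在YouTube今天被檢舉下架了,可能因為擋人財路,還好我有先下載下來,再PO一次 ("This video was reported and taken down from YouTube today, probably because it got in the way of someone's profits. Luckily I downloaded it first, so I'm posting it again")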

https://cofacts.g0v.tw/article/AV2mDUCkyCdS-nWhudgw
https://cofacts.g0v.tw/article/AV27poFRyCdS-nWhudzs

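臺電 節電獎勵金~撥打電話1911告訴服務人員你家的"電號"就可以完成登陸了!晚上也有服務哦! ("Taipower energy-saving reward: call 1911 and tell the service staff your household's electricity account number to complete registration! Service is available in the evening too!")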

https://cofacts.g0v.tw/article/AWFOkLYGhutQxxU6tqwd
https://cofacts.g0v.tw/article/AWFUmpH_hutQxxU6tq4E
https://cofacts.g0v.tw/article/AWFRgDbDhutQxxU6tq0A

Other related articles that contain "台電" (Taipower) and "登錄" (register): https://cofacts.g0v.tw/?before=&after=&q=%E5%8F%B0%E9%9B%BB%20%E7%99%BB%E9%8C%84&filter=all

Keywords = 甘南地方 武警 封鎖 (Gannan area, armed police, cordon off)

https://cofacts.g0v.tw/article/AWFmXqeShutQxxU6trJL
https://cofacts.g0v.tw/article/AWFq8zOGhutQxxU6trNw
https://cofacts.g0v.tw/article/AWFguJQLhutQxxU6trDz
https://cofacts.g0v.tw/article/AWFfFhIghutQxxU6trCW
https://cofacts.g0v.tw/article/AWFlYBIUhutQxxU6trIE

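「不要打開,立即刪除」 ("Do not open, delete immediately")

女孩面色痛苦、略帶啜泣臥倒坐在牆邊 ("A girl with a pained expression, sobbing slightly, slumped against a wall"; long text)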

https://cofacts.g0v.tw/article/nemyj4xlfnl1
https://cofacts.g0v.tw/article/1wqix3n4uswys

Same URL, political donations (one version has two extra paragraphs)

https://cofacts.g0v.tw/article/9mhsa87lqm7i
https://cofacts.g0v.tw/article/fd4im4wkrx3

板橋四維公園有1名小孩被2名婦人偷抱走 ("A child was snatched by two women at Siwei Park in Banqiao")

https://cofacts.g0v.tw/article/1on4z88bzzkwi
https://cofacts.g0v.tw/article/6hhaxmk177hb

@MrOrz
Member Author

MrOrz commented Mar 1, 2018

Related to cofacts/rumors-api#66

@MrOrz
Member Author

MrOrz commented Mar 12, 2018

This happened after I had already replied to this article:
https://cofacts.g0v.tw/article/2lp0017ewtas6

[Screenshot: screenshot_20180312-155415]

@MrOrz
Member Author

MrOrz commented Mar 13, 2018

Another variant was submitted to the database:

https://cofacts.g0v.tw/article/jzmgzs21wsh4

@MrOrz
Member Author

MrOrz commented Mar 14, 2018

@MrOrz
Member Author

MrOrz commented Mar 20, 2018

Analysis 1

Input:

在德國,一台宝宝車不慎滑落鉄軌,火車快到了,宝宝的媽媽哭喊著、已經失去了希望,这个叙利亞青年的英勇行动拯救了孩子。德国政府為了表揚他的勇敢行为,給了这个敘利亞人一个德国公民的身份。請仔細看,什麼叫千鈞一髮!

(Translation: "In Germany, a baby stroller accidentally rolled onto the railway tracks as a train was approaching. The baby's mother cried out, having lost all hope, and this Syrian young man's heroic action saved the child. To commend his bravery, the German government granted this Syrian German citizenship. Watch closely to see what a hair's-breadth escape looks like!" The original mixes simplified and traditional characters.)

moreLikeThis query selected 25 terms:

行为 了希 勇行 叙利 动拯 揚他 落鉄 鈞一 千鈞 敢行 个德 慎滑 宝的 鉄軌 叫千 个叙 宝車 媽哭 台宝 亞青 个敘 利亞 这个 德国 宝宝
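
The selected terms look like overlapping two-character bigrams of the input. The analyzer configuration is not shown in this thread, so the following is only a minimal sketch of that kind of tokenization, assuming CJK runs are split into overlapping character pairs; `cjk_bigrams` is a hypothetical helper, not code from the project.

```python
import re

# Matches runs of common CJK ideographs (assumption: this basic range is
# enough for illustration; a real analyzer covers more scripts).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def cjk_bigrams(text):
    """Return overlapping two-character bigrams for each CJK run in text.

    Non-CJK characters (digits, punctuation, Latin letters) only act as
    separators here, and single-character runs produce no bigram.
    """
    bigrams = []
    for run in CJK_RUN.findall(text):
        for i in range(len(run) - 1):
            bigrams.append(run[i:i + 2])
    return bigrams


print(cjk_bigrams("宝宝車不慎滑落鉄軌"))
# ['宝宝', '宝車', '車不', '不慎', '慎滑', '滑落', '落鉄', '鉄軌']
```

Several of these bigrams (宝宝, 宝車, 慎滑, 落鉄, 鉄軌) appear in the selected term list, which suggests that a variant written with different characters (for example 寶寶 instead of 宝宝) would share noticeably fewer terms.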

25 terms * 70% = 17.5 terms

but
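
The threshold arithmetic above can be sketched as follows. This assumes the 70% figure is Elasticsearch's `minimum_should_match`, whose percentage form is documented as being rounded down, so 17.5 would become 17 required terms. `required_matches` and `passes_threshold` are hypothetical helpers for illustration, and the matching is simplified to plain set overlap.

```python
def required_matches(num_terms, percent):
    """Number of selected terms a document must contain.

    Integer division rounds down, mirroring how a percentage threshold
    is assumed to be applied.
    """
    return num_terms * percent // 100


def passes_threshold(selected_terms, doc_terms, percent=70):
    """True if the document shares enough of the selected terms."""
    selected = set(selected_terms)
    matched = len(selected & set(doc_terms))
    return matched >= required_matches(len(selected), percent)


print(required_matches(25, 70))  # 17
print(required_matches(25, 50))  # 12
```

Under these assumptions, a variant that shares only 12 to 16 of the 25 selected terms is missed at 70% but would be retrieved at 50%.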

@MrOrz
Member Author

MrOrz commented Mar 20, 2018

Analysis 2

Input:

昨天上午發生在中國甘南地區的奇特現象 。車子開開就漂起來了。當天下午中科院的研究人員到場,武警部隊封鎖了方圓5公里的地域 。

(Translation: "A strange phenomenon occurred yesterday morning in the Gannan region of China: cars floated up as they drove along. That afternoon, researchers from the Chinese Academy of Sciences arrived at the scene, and the armed police cordoned off the area within a 5 km radius.")

moreLikeThis query selected 25 terms:

里的 的奇 到場 南地 開開 中科 員到 奇特 了方 方圓 科院 子開 甘南 鎖了 武警 開就 午發 午中 地域 國甘 警部 隊封 漂起 特現 就漂

25 terms * 70% = 17.5 terms

but

@MrOrz
Member Author

MrOrz commented Mar 21, 2018

Analysis 3 (long article, retrieval success)

Input: https://cofacts.g0v.tw/article/5487586967338-rumor

moreLikeThis query selected 25 terms:

人數 健保 個允 世界 國家 高達 台灣 我們 解放 性戀 結婚 同婚 死亡 萬人 性解 染愛 68 防疫 全民 感染 同性 男男 滋病 南非 愛滋

25 terms * 70% = 17.5 terms

These documents exceed the criterion, so they are retrieved:

@MrOrz
Member Author

MrOrz commented Mar 21, 2018

Elasticsearch has built-in retrieval techniques such as tf-idf and the vector space model.

However, it does not handle semantics / embeddings, as mentioned in https://stackoverflow.com/questions/8772692/semantic-search-with-nlp-and-elasticsearch .

Some work on integrating LSA with Elasticsearch:
https://opensourceconnections.com/blog/2016/03/29/semantic-search-with-latent-semantic-analysis/
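
For reference, the tf-idf / vector space idea mentioned above can be sketched in a few lines. This is a textbook formulation with a smoothed idf, not Elasticsearch's actual scoring formula; the token lists and function names are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in tf}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    ["甘南", "南地", "武警"],  # hypothetical bigram tokens of one message
    ["甘南", "南地", "封鎖"],  # a variant sharing two bigrams
    ["同性", "結婚", "健保"],  # an unrelated message
]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]) > cosine(v[0], v[2]))  # True
```

Because the vectors are built from surface tokens only, two messages that say the same thing in different characters or words score low, which is the gap that semantic approaches such as LSA try to close.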

@MrOrz
Member Author

MrOrz commented Oct 21, 2020

From the 2020-10-14 meeting notes: since we now support highlighting, users can identify similar messages more easily.

We can therefore simply lower the search-hit threshold.
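
A minimal sketch of what lowering the threshold could look like, assuming the search uses an Elasticsearch `more_like_this` query as in the analyses above. The field name `text`, the `min_term_freq` / `min_doc_freq` values, and the 50% figure are illustrative assumptions; the project's actual query is not shown in this thread.

```python
def build_mlt_query(text, minimum_should_match="70%", field="text"):
    """Build a more_like_this query body as a plain dict.

    Only the threshold is meant to vary here; every other value is an
    assumption made for illustration.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": text,
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "max_query_terms": 25,
                "minimum_should_match": minimum_should_match,
            }
        }
    }


current = build_mlt_query("昨天上午發生在中國甘南地區的奇特現象")
lowered = build_mlt_query("昨天上午發生在中國甘南地區的奇特現象", "50%")
print(lowered["query"]["more_like_this"]["minimum_should_match"])  # 50%
```

A lower percentage returns more candidates, including weaker matches, so it relies on highlighting to let users judge which hits are really the same message.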
