Revisit similarity measure #53
Comments
Related to cofacts/rumors-api#66 |
This happens after I already answered this article: |
Another variant sent to db |
Testing a field's analyzer: https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html
Testing how a doc responds to a query: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html |
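The two checks linked above can be sketched as plain request descriptions. This is a minimal sketch, not the project's actual code: the index name `articles` and the field name `text` are assumptions made for illustration, and the exact URL layout of the explain endpoint differs between Elasticsearch versions (older releases include a type segment in the path).

```python
# Hypothetical names for illustration only; they are not taken from this issue.
INDEX = "articles"
FIELD = "text"


def analyze_request(text):
    """Describe an _analyze call: which tokens does the field's analyzer emit?"""
    return {
        "method": "POST",
        "path": f"/{INDEX}/_analyze",
        "body": {"field": FIELD, "text": text},
    }


def explain_request(doc_id, query):
    """Describe an _explain call: how does one document score against a query?"""
    return {
        "method": "GET",
        # Typeless path layout; older Elasticsearch versions use a different layout.
        "path": f"/{INDEX}/_explain/{doc_id}",
        "body": {"query": query},
    }
```

The returned dictionaries can be sent with any HTTP client; keeping them as data makes it easy to compare the analyzer output and the score explanation for two near-duplicate messages side by side.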
Here is some info to brush up our data mining knowledge: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html
Term vector of a doc: |
Analysis 1
Input:
moreLikeThis query selected 25 terms:
25 terms * 70% = 17.5 terms but
|
Analysis 2
Input:
moreLikeThis query selected 25 terms:
25 terms * 70% = 17.5 terms but
|
Analysis 3 (long article, retrieval success)
Input: https://cofacts.g0v.tw/article/5487586967338-rumor
moreLikeThis query selected 25 terms:
25 terms * 70% = 17.5 terms
These documents exceed the criteria, so they are returned:
|
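The "25 terms * 70% = 17.5 terms" arithmetic in the analyses above can be made concrete. This is a minimal sketch that assumes the 70% is applied as a positive percentage `minimum_should_match` on the moreLikeThis query; Elasticsearch documents that a count computed from a percentage is rounded down. The function name is hypothetical.

```python
def required_matches(selected_terms, percent):
    """Number of selected terms a document must contain.

    Assumes a positive percentage minimum_should_match, which is rounded down.
    Integer arithmetic avoids floating-point surprises.
    """
    return selected_terms * percent // 100


# 25 selected terms at 70% gives 17.5, which rounds down to 17 required terms.
print(required_matches(25, 70))
```

Under this assumption, a stored document has to share at least 17 of the 25 selected terms with the input to count as a hit.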
Elasticsearch has retrieval techniques built in, such as tf-idf and the vector space model. However, it does not handle semantics or embeddings, as mentioned in https://stackoverflow.com/questions/8772692/semantic-search-with-nlp-and-elasticsearch . Some works integrating LSA with Elasticsearch: |
From the 20201014 meeting note: since we now support highlighting, users can identify similar messages more easily, so we can directly lower the threshold for search hits. |
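Lowering the threshold could look roughly like the following. This is a sketch under stated assumptions, not the query the project actually sends: the field name `text` and the example value `"50%"` are hypothetical, while `fields`, `like`, `max_query_terms`, and `minimum_should_match` are standard parameters of the Elasticsearch `more_like_this` query.

```python
def more_like_this_query(text, minimum_should_match="70%", max_query_terms=25):
    """Build a more_like_this search body with a configurable match threshold."""
    return {
        "query": {
            "more_like_this": {
                "fields": ["text"],  # hypothetical field name
                "like": text,
                "max_query_terms": max_query_terms,
                "minimum_should_match": minimum_should_match,
            }
        }
    }


# Threshold discussed in this issue, and a lowered one (the 50% value is only an example).
current = more_like_this_query("message text here")
lowered = more_like_this_query("message text here", minimum_should_match="50%")
```

Keeping the threshold as a parameter makes it possible to run the same input at several values and compare which of the duplicate articles listed in this issue are retrieved.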
There are many near-duplicate messages in the database.
Most of them contain the same sentences, but users still consider them different.
Maybe we should lower the threshold so that more documents are considered identical.
Identical documents:
👇👇 Permanent limited-edition Mickey stickers, limited-time download
https://cofacts.g0v.tw/article/AWFbPonmhutQxxU6tq_J
https://cofacts.g0v.tw/article/AWFaPnhqhutQxxU6tq-P
https://cofacts.g0v.tw/article/AWFWnVozhutQxxU6tq7Z
https://cofacts.g0v.tw/article/AWFaFN4whutQxxU6tq-C
(Notice the "similar articles" section)
https://cofacts.g0v.tw/article/AWFzR6zShutQxxU6trZZ
https://cofacts.g0v.tw/article/AWFvKvIbhutQxxU6trTp
So cute, hurry and download from the link, it's not a scam, I got mine, so happy, only available at Christmas, tap the link to apply the theme
https://cofacts.g0v.tw/article/AWCL78pVyCdS-nWhulxu
https://cofacts.g0v.tw/article/AWB5Q9nKyCdS-nWhulki
Counterexample (the stickers really exist; this falls under advertising):
https://cofacts.g0v.tw/replies?before=&after=&filter=NOT_RUMOR&mine=&q=%E8%B2%BC%E5%9C%96
"Taipei Metro accompanies you" stickers (this campaign really exists)
https://cofacts.g0v.tw/article/AWFvYRb3hutQxxU6trUF
https://cofacts.g0v.tw/article/AWFu2p62hutQxxU6trSv
https://cofacts.g0v.tw/article/AWFuY5lshutQxxU6trRj
https://cofacts.g0v.tw/article/AWFvFI8-hutQxxU6trTa
https://cofacts.g0v.tw/article/AWFvWoO3hutQxxU6trT-
https://cofacts.g0v.tw/article/AWFvQX0PhutQxxU6trT3
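This video was reported and taken down from YouTube today, probably because it got in the way of someone's profits. Luckily I downloaded it first, so I'm posting it again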
https://cofacts.g0v.tw/article/AV2mDUCkyCdS-nWhudgw
https://cofacts.g0v.tw/article/AV27poFRyCdS-nWhudzs
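Taipower energy-saving reward~ Call 1911 and tell the service staff your household's "electricity account number" to complete registration! Service is available in the evening too!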
https://cofacts.g0v.tw/article/AWFOkLYGhutQxxU6tqwd
https://cofacts.g0v.tw/article/AWFUmpH_hutQxxU6tq4E
https://cofacts.g0v.tw/article/AWFRgDbDhutQxxU6tq0A
Other related articles that contain "台電" (Taipower) and "登錄" (register): https://cofacts.g0v.tw/?before=&after=&q=%E5%8F%B0%E9%9B%BB%20%E7%99%BB%E9%8C%84&filter=all
Keyword = 甘南地方 武警 封鎖 (Gannan area, armed police, blockade)
https://cofacts.g0v.tw/article/AWFmXqeShutQxxU6trJL
https://cofacts.g0v.tw/article/AWFq8zOGhutQxxU6trNw
https://cofacts.g0v.tw/article/AWFguJQLhutQxxU6trDz
https://cofacts.g0v.tw/article/AWFfFhIghutQxxU6trCW
https://cofacts.g0v.tw/article/AWFlYBIUhutQxxU6trIE
"Do not open, delete immediately"
A girl with a pained expression, sobbing slightly, slumped against the wall (long article)
https://cofacts.g0v.tw/article/nemyj4xlfnl1
https://cofacts.g0v.tw/article/1wqix3n4uswys
Same URL, political donations (with two extra paragraphs)
https://cofacts.g0v.tw/article/9mhsa87lqm7i
https://cofacts.g0v.tw/article/fd4im4wkrx3
A child was snatched by two women at Siwei Park in Banqiao
https://cofacts.g0v.tw/article/1on4z88bzzkwi
https://cofacts.g0v.tw/article/6hhaxmk177hb