TGNews

Links

Description in English: https://medium.com/@phoenixilya/news-aggregator-in-2-weeks-5b38783b95e3
Description in Russian: https://habr.com/ru/post/487324/

Demo

Russian: https://ilyagusev.github.io/tgcontest/ru/main.html
English: https://ilyagusev.github.io/tgcontest/en/main.html

Install

Prerequisites: CMake, Boost, JsonCpp, libuuid, Protobuf. On Debian/Ubuntu:

$ sudo apt-get install cmake libboost-all-dev build-essential libjsoncpp-dev uuid-dev protobuf-compiler libprotobuf-dev

For macOS:

$ brew install boost jsoncpp ossp-uuid protobuf
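
Before building, it may help to confirm that the required tools are actually on PATH. The helper below is a minimal sketch: `missing_tools` is a hypothetical name, not part of this repository, and the tool names passed to it in the usage note are assumptions based on the packages listed above.

```shell
# Hypothetical helper (not part of the repository):
# print the name of every requested tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the tool can be resolved by the shell
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools cmake make g++ protoc` prints nothing when all four tools are installed and otherwise lists the missing ones, one per line.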

If you have the zip archive, skip to building the binary.

To download the code and models:

$ git clone https://github.com/IlyaGusev/tgcontest
$ cd tgcontest
$ git submodule update --init --recursive
$ bash download_models.sh
$ wget https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.5.0%2Bcpu.zip
$ unzip libtorch-cxx11-abi-shared-with-deps-1.5.0+cpu.zip

For macOS, use https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.5.0.zip instead.

To build the binary (from the "tgcontest" directory):

$ mkdir build && cd build && Torch_DIR="../libtorch" cmake -DCMAKE_BUILD_TYPE=Release .. && make -j4

To download datasets:

$ bash download_data.sh

Run on a sample (from the repository root):

$ ./build/tgnews top data --ndocs 10000

Training

Russian FastText vectors training: VectorsRu.ipynb
Russian FastText category classifier training: CatTrainRu.ipynb
Russian text embedder with triplet loss training (v3):
English FastText vectors training: VectorsEn.ipynb
English FastText category classifier training: CatTrainEn.ipynb
English text embedder with triplet loss training (v3):
PageRank rating calculation: PageRankRating.ipynb
Russian ELMo-based sentence embedder training (not used):
XLM-RoBERTa pseudo-labeling for categorization:

Models

Language detection model (round 2): lang_detect_v10.ftz
Russian FastText vectors (round 2): ru_vectors_v3.bin
Russian categories detection model (round 2): ru_cat_v5.ftz
English FastText vectors (round 2): en_vectors_v3.bin
English categories detection model (round 2): en_cat_v5.ftz
PageRank-based agency rating: pagerank_rating.txt
Alexa agency rating: alexa_rating_4_fixed.txt
XLM-RoBERTa for categorization (pytorch-lightning checkpoint): xlmr_en_ru_cat_v1.tar.gz

Data

Russian news from 11.01.2019 to 10.05.2020 with gaps: ru_tg_1101_0510.jsonl.tar.gz
Russian news from 11.05.2020 to 17.05.2020: ru_tg_0511_0517.jsonl.tar.gz
English news from 11.01.2019 to 10.05.2020 with gaps: en_tg_1101_0510.jsonl.tar.gz
English news from 11.05.2020 to 17.05.2020: en_tg_0511_0517.jsonl.tar.gz
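
The archives above hold JSON Lines files, that is, one JSON document per line. As a quick sanity check after downloading, the sketch below counts the records in an archive without unpacking it to disk. `count_jsonl_records` is a hypothetical helper, not part of the repository, and it assumes the archive contains only `.jsonl` content.

```shell
# Hypothetical helper (not part of the repository):
# count the non-empty lines (records) across all files in a .tar.gz archive.
count_jsonl_records() {
    # -O streams the extracted file contents to stdout instead of writing files
    tar -xzOf "$1" | awk 'NF { n++ } END { print n + 0 }'
}
```

Usage, assuming the archive sits in the current directory: `count_jsonl_records ru_tg_0511_0517.jsonl.tar.gz`.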

Markup

Russian categories raw train markup: ru_cat_v4_train_raw_markup.tsv
Russian categories aggregated train markup: ru_cat_v4_train_annot.json
Russian categories aggregated train markup in fastText format: ft_ru_cat_v4_train.txt
Russian categories manual train markup: ru_cat_v4_train_manual_annot.json
Russian categories manual train markup in fastText format: ft_ru_cat_v4_train_manual.txt
Russian categories raw test markup: ru_cat_v4_test_raw_markup.tsv
Russian categories aggregated test markup: ru_cat_v4_test_annot.json
Russian categories aggregated test markup in fastText format: ft_ru_cat_v4_test.txt
English categories aggregated train markup: en_cat_v4_train_annot.json
English categories aggregated train markup in fastText format: ft_en_cat_v4_train.txt
English categories aggregated test markup: en_cat_v4_test_annot.json
English categories aggregated test markup in fastText format: ft_en_cat_v4_test.txt
Russian clustering pairs: ru_pairs_raw_markup.tsv
English clustering pairs: en_pairs_raw_markup.tsv
Russian clustering pairs for one day (0517): ru_clustering_0517.tsv
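
To check the class balance of the fastText-format files listed above, a label tally can be useful. The sketch below assumes the standard fastText supervised layout (one example per line, labels prefixed with `__label__`); that the `ft_*.txt` files follow this layout is an assumption, and `count_labels` is a hypothetical helper rather than something shipped with the repository.

```shell
# Hypothetical helper (not part of the repository):
# print "<count> <label>" for every __label__ token in a fastText-format file,
# most frequent label first, ties broken alphabetically.
count_labels() {
    awk '{
        for (i = 1; i <= NF; i++)
            if (index($i, "__label__") == 1) counts[$i]++
    }
    END { for (label in counts) print counts[label], label }' "$1" \
        | sort -k1,1nr -k2,2
}
```

Usage, assuming the file is in the current directory: `count_labels ft_ru_cat_v4_train.txt`.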

Misc

Flamegraph: https://ilyagusev.github.io/tgcontest/flamegraph.svg

Other contestants

Round 2
- II place
  - Daring Frog: https://github.com/a-l-e-x-k/data_clustering_contest, article: https://medium.com/@alexkuznetsov/2nd-place-solution-for-telegram-data-clustering-contest-f28d55b98d30
  - Swift Skunk: https://github.com/sorrge/tg_news_cluster
- III place
  - Mindful Kitten: https://danlark.org/2020/07/31/news-aggregator-from-scratch-in-2-weeks/
- IV place
  - Bossy Gnu: https://github.com/maxoodf/tgnews
- Other:
  - Large Crab: https://github.com/ilya-ustinov/tgcontest
Round 1
- III place
  - Kooky Dragon: https://github.com/nick-baliesnyi/tgnews
- IV place
  - Sharp Sloth: https://github.com/thehemen/telegram-data-clustering
- Other
  - Desert Python: https://github.com/crazyleg/telegram_data_clustering_2019
  - Funky Peacock: https://github.com/Stepka/telegram_clustering_contest
  - Unknown animal: https://github.com/roman-rybalko/telegram-data-clustering-contest
  - Unknown animal: https://github.com/MarcoBuster/data-clustering-contest
  - Unknown animal: https://github.com/sudevschiz/tgnews
  - Unknown animal: https://github.com/77ph/tgnews
  - Unknown animal: https://github.com/akash-joshi/telegram-cluster
  - Unknown animal: https://github.com/dremovd/telegram-clustering

Contacts

Telegram: @YallenGusev
