Finding the top news stories of 2022 among 54,000+ news items on AI, ML, NLP, data science, and related fields.

fredriko/metacurate-regularly

metacurate-regularly: clustering of news headlines.

TL;DR: This repository contains an experiment for embedding and clustering news headlines, as well as for describing the resulting clusters, and plotting them on a timeline.

The screenshot below shows the output of the clustering exercise: the top 50 news stories of 2022 on AI, machine learning, data science, and related fields, based on data collected by metacurate.io. Here is the live graph showing the top 50 news stories, and here is a list of the top 200 stories, including all constituent headlines.

Top 50 AI/ML/data science news 2022 according to metacurate.io

In 2022, my hobby project metacurate.io collected 54k+ news items from sources related to artificial intelligence, machine learning, natural language processing, data science, and other tech news. This repository contains code for experimenting with the clustering of headlines, and describing the clusters.
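To make the idea of clustering headlines concrete, here is a minimal toy sketch. It is not the method used in this repository: it swaps in bag-of-words counts for real embeddings and a greedy, threshold-based pass for a proper clustering algorithm. The function names, the example headlines, and the threshold of 0.5 are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def vectorize(headline):
    """Turn a headline into a bag-of-words count vector (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", headline.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(headlines, threshold=0.5):
    """Greedy single-pass clustering: a headline joins the first cluster whose seed is similar enough."""
    clusters = []  # each entry: (seed_vector, [member headlines])
    for headline in headlines:
        vec = vectorize(headline)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(headline)
                break
        else:
            # No existing cluster was close enough, so this headline seeds a new one.
            clusters.append((vec, [headline]))
    return [members for _, members in clusters]


# Made-up headlines, for illustration only.
headlines = [
    "OpenAI releases ChatGPT",
    "ChatGPT released by OpenAI",
    "Stable Diffusion 2.0 is out",
    "Stability AI ships Stable Diffusion 2.0",
]
for group in cluster(headlines):
    print(group)
```

Word overlap only groups headlines that share vocabulary; dense sentence embeddings are what allow differently worded headlines about the same story to end up together.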

The input data is available in data/metacurate_news_2022.csv. Example output is available in data/output/2022_1/. The output folder contains:

Installation with virtualenv

Requirements:

  • git
  • Python 3.9 or newer (earlier versions might work, but have not been tested)
  • pip
  • virtualenv
  • An API key from Cohere
  • Optional: Plotly Chart Studio credentials
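If you want to verify the requirements above before installing, a small preflight check along these lines may help. This is a sketch that is not part of the repository; the function names are hypothetical, and it only checks that the command-line tools are on your PATH and that the running interpreter is recent enough.

```python
import shutil
import sys


def missing_prerequisites(tools=("git", "pip", "virtualenv"), which=shutil.which):
    """Return the names of required command-line tools that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_supported(version=sys.version_info, minimum=(3, 9)):
    """Check that the (major, minor) version is at least the tested minimum."""
    return tuple(version[:2]) >= minimum


print("Missing tools:", missing_prerequisites())
print("Python version OK:", python_is_supported())
```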

Set up and activate a virtual Python environment by executing the following commands at a terminal prompt:

mkdir -p ~/venv
virtualenv -p python3 ~/venv/metacurate-regularly/
source ~/venv/metacurate-regularly/bin/activate

Clone the source code to your local machine and install its dependencies:

git clone git@github.com:fredriko/metacurate-regularly.git
cd metacurate-regularly
pip install -r requirements.txt

Get and set up a Cohere API Key

To use Topically to describe the clusters, you need an API key from Cohere. Get one by following the instructions in the Topically repository. Take note of the key and set the environment variable COHERE_API_KEY like so:

export COHERE_API_KEY=<your_key>
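A missing key is easiest to catch before a long run starts. The sketch below shows one way to read the variable from Python and fail early with a helpful message; the helper name get_cohere_api_key is hypothetical and not part of this repository or of the Cohere client.

```python
import os


def get_cohere_api_key(env=None):
    """Return the key stored in COHERE_API_KEY, or raise with a hint if it is missing."""
    env = os.environ if env is None else env
    key = env.get("COHERE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "COHERE_API_KEY is not set; run: export COHERE_API_KEY=<your_key>"
        )
    return key


# Demonstration with a stand-in environment; call get_cohere_api_key() to use the real one.
print(get_cohere_api_key({"COHERE_API_KEY": "example-key"}))
```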

Optional: Get and set up Plotly Chart Studio credentials

To publish the generated Plotly plot to the web (Plotly Chart Studio), you need an account and locally stored credentials. Follow the instructions for getting an account here and edit the file src/set_up_plotly_credentials.py to include your username and api_key.

Then run the file to generate and store the credentials. This only has to be done once:

python src/set_up_plotly_credentials.py

Run the code

To run the code, simply issue the following:

python main.py -c configs/metacurate_news_2022_1.json

NOTE that this is a long-running process: the vectorization step will take a long time if you're running on a CPU, and the clustering takes quite some time too.
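For readers unfamiliar with the -c pattern used in the command above, the sketch below shows how a script can accept a path to a JSON config file and load it. It is an illustration only, not the repository's main.py: the long option --config, the function names, and the "input" key in the demo config are assumptions.

```python
import argparse
import json
import tempfile
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; -c points at a JSON config file."""
    parser = argparse.ArgumentParser(description="Run an experiment described by a JSON config.")
    parser.add_argument("-c", "--config", type=Path, required=True, help="path to a JSON config file")
    return parser.parse_args(argv)


def load_config(path):
    """Read the JSON config file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Demonstration with a throwaway config file; the "input" key is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.json"
    demo.write_text(json.dumps({"input": "data/metacurate_news_2022.csv"}), encoding="utf-8")
    args = parse_args(["-c", str(demo)])
    print(load_config(args.config))
```

Keeping all run parameters in one config file makes it possible to rerun an experiment with exactly the same settings, or to compare runs by comparing their configs.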