Classifying users on social media, using text embeddings from OpenAI and others

s2t2/openai-embeddings-2023

openai-embeddings-2023

OpenAI Text Embeddings for User Classification in Social Networks

Setup

Virtual Environment

Create and activate a virtual environment:

conda create -n openai-env python=3.10
conda activate openai-env

Install package dependencies:

pip install -r requirements.txt

OpenAI API

Obtain an OpenAI API Key (i.e. OPENAI_API_KEY). We initially fetched embeddings from the OpenAI API via the notebooks. The service code was later re-implemented in this repo, in case you want to experiment with obtaining your own embeddings.

Users Sample

Obtain a copy of the "botometer_sample_openai_tweet_embeddings_20230724.csv.gz" CSV file, and store it in the "data/text-embedding-ada-002" directory in this repo. This file was generated by the notebooks, and is excluded from version control because it contains user identifiers.
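To confirm the file is in place and readable before running the app, a quick check along these lines may help. This is a minimal sketch using only the standard library; the `preview_gzipped_csv` helper is hypothetical (not part of this repo), and it makes no assumptions about the file's column names.

```python
import csv
import gzip
from itertools import islice

# Path taken from the instructions above.
CSV_PATH = "data/text-embedding-ada-002/botometer_sample_openai_tweet_embeddings_20230724.csv.gz"


def preview_gzipped_csv(path, n_rows=5):
    """Return (column_names, first n_rows rows as dicts) from a gzip-compressed CSV."""
    # Open in text mode so the csv module receives decoded strings.
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows
```

Calling `preview_gzipped_csv(CSV_PATH)` returns the header and a handful of rows without loading the whole file into memory.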

Cloud Storage

We save trained models to Google Cloud Storage. You will need to create a project on Google Cloud, and enable the Cloud Storage API as necessary. Then create a service account, download its JSON credentials file, and store it in the root directory of this repo as "google-credentials.json". This file is excluded from version control.

From the cloud storage console, create a new bucket, and note its name (i.e. BUCKET_NAME).

Environment Variables

Create a local ".env" file and add contents like the following:

# this is the ".env" file...

OPENAI_API_KEY="sk__________"

GOOGLE_APPLICATION_CREDENTIALS="/path/to/openai-embeddings-2023/google-credentials.json"
BUCKET_NAME="my-bucket"

DATASET_ADDRESS="my_project.my_dataset"
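As a sanity check on the ".env" file, the sketch below parses it and reports which of the variables listed above are still unset. It is a standard-library illustration only; `parse_env_file` and `missing_vars` are hypothetical helpers, not this repo's own loading code, and the parser handles only simple `KEY="value"` lines.

```python
import os
from pathlib import Path

# Variable names taken from the example ".env" contents above.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "BUCKET_NAME",
    "DATASET_ADDRESS",
)


def parse_env_file(path=".env"):
    """Parse simple KEY="value" lines, skipping blank lines and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """List required names that are set neither in `values` nor in the process environment."""
    return [name for name in required if not values.get(name) and not os.getenv(name)]
```

For example, `missing_vars(parse_env_file())` returns an empty list once all four variables are defined.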

Usage

OpenAI Service

Fetch some example embeddings from the OpenAI API:

python -m app.openai_service
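For readers who want to see roughly what fetching embeddings involves, here is a minimal sketch. It is not the code in `app.openai_service`; the `fetch_embeddings` function is hypothetical, and it assumes a client object that follows the `openai` Python package's v1-style interface (`client.embeddings.create`). The model name matches the "text-embedding-ada-002" data directory mentioned above.

```python
def fetch_embeddings(texts, client, model="text-embedding-ada-002"):
    """Return one embedding (a list of floats) per input text, in input order.

    `client` is assumed to expose `embeddings.create(model=..., input=...)`
    and return an object whose `.data` items have `.index` and `.embedding`.
    """
    response = client.embeddings.create(model=model, input=list(texts))
    # Sort by index so the output order matches the input order.
    items = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in items]
```

With the `openai` package (version 1.0 or later) installed and OPENAI_API_KEY set, a client could be created with `from openai import OpenAI; client = OpenAI()` and passed in as `fetch_embeddings(["Hello world"], client)`. Passing the client as a parameter keeps the function easy to test with a stand-in object.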

Embeddings per User (v1)

Verify that the dataset loads:

python -m app.dataset

Perform machine learning and other analyses on the data:

OpenAI Embeddings:

Word2Vec Embeddings:

Embeddings per Tweet (v1)

OpenAI Embeddings:

Testing

pytest --disable-warnings