OpenAI Text Embeddings for User Classification in Social Networks
Create and/or activate a virtual environment:
conda create -n openai-env python=3.10
conda activate openai-env
Install package dependencies:
pip install -r requirements.txt
Obtain an OpenAI API Key (i.e. OPENAI_API_KEY). We initially fetched embeddings from the OpenAI API via the notebooks, but the service code has since been re-implemented here, in case you want to experiment with obtaining your own embeddings.
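If you do want to fetch your own embeddings, the request looks roughly like the following (a minimal, hypothetical sketch using the openai Python package's v1+ client; the actual app.openai_service module may differ):

```python
# Hypothetical sketch, not the actual app.openai_service code.
# Assumes the openai package (v1+ client) and OPENAI_API_KEY in the environment.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def fetch_embeddings(texts, model="text-embedding-ada-002"):
    """Return one embedding vector (a list of floats) per input text."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

if __name__ == "__main__":
    vectors = fetch_embeddings(["example tweet text", "another example tweet"])
    print(len(vectors), "embeddings of length", len(vectors[0]))
```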
Obtain a copy of the "botometer_sample_openai_tweet_embeddings_20230724.csv.gz" CSV file, and store it in the "data/text-embedding-ada-002" directory in this repo. This file was generated by the notebooks, and is ignored from version control because it contains user identifiers.
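To spot-check the file once it is in place, something like the following should work (a minimal sketch; the app.dataset module is the supported loader):

```python
# Hypothetical sketch for inspecting the pre-computed embeddings file.
import pandas as pd

CSV_FILEPATH = "data/text-embedding-ada-002/botometer_sample_openai_tweet_embeddings_20230724.csv.gz"

df = pd.read_csv(CSV_FILEPATH)  # pandas infers gzip compression from the ".gz" extension
print(df.shape)
print(df.columns.tolist()[:10])
```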
We are saving trained models to Google Cloud Storage. You will need to create a project on Google Cloud, and enable the Cloud Storage API as necessary. Then create a service account, download its JSON credentials file, and store it in the root directory of this repo as "google-credentials.json". This file is ignored from version control.
From the Cloud Storage console, create a new bucket, and note its name (i.e. BUCKET_NAME).
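For reference, uploading a saved model to the bucket looks roughly like this (a minimal, hypothetical sketch using the google-cloud-storage package; the repo's own storage code may differ, and the object path and filename below are made up):

```python
# Hypothetical sketch of uploading a trained model file to Cloud Storage.
# Assumes GOOGLE_APPLICATION_CREDENTIALS and BUCKET_NAME are set in the environment.
import os
from google.cloud import storage

client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS automatically
bucket = client.bucket(os.getenv("BUCKET_NAME"))

blob = bucket.blob("models/example/model.joblib")  # hypothetical object path
blob.upload_from_filename("model.joblib")          # hypothetical local file
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```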
Create a local ".env" file and add contents like the following:
# this is the ".env" file...
OPENAI_API_KEY="sk__________"
GOOGLE_APPLICATION_CREDENTIALS="/path/to/openai-embeddings-2023/google-credentials.json"
BUCKET_NAME="my-bucket"
DATASET_ADDRESS="my_project.my_dataset"
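The app reads these settings from the environment; loading them in Python looks roughly like this (a minimal sketch assuming the python-dotenv package; the repo's actual config code may differ):

```python
# Hypothetical sketch of loading the ".env" settings at runtime.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the local ".env" file into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GOOGLE_APPLICATION_CREDENTIALS = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
BUCKET_NAME = os.getenv("BUCKET_NAME")
DATASET_ADDRESS = os.getenv("DATASET_ADDRESS")
```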
Fetch some example embeddings from the OpenAI API:
python -m app.openai_service
Demonstrate the ability to load the dataset:
python -m app.dataset
Perform machine learning and other analyses on the data, using either the OpenAI embeddings or the Word2Vec embeddings (an illustrative sketch is shown below).
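As a rough illustration of what the embedding-based user classification looks like, here is a minimal, hypothetical sketch (the "is_bot" label column, the "embedding_" column prefix, and the model choice are all assumptions, not taken from the repo's actual analysis code):

```python
# Hypothetical sketch: train a classifier on the pre-computed embedding features.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

CSV_FILEPATH = "data/text-embedding-ada-002/botometer_sample_openai_tweet_embeddings_20230724.csv.gz"
df = pd.read_csv(CSV_FILEPATH)

feature_cols = [col for col in df.columns if col.startswith("embedding_")]  # assumed naming
X, y = df[feature_cols], df["is_bot"]  # assumed label column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```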
Run the tests:
pytest --disable-warnings