This project provides a pipeline, orchestrated by run.py, to transform Reddit comments sourced from the Pushshift archive into TEI-XML format.
For detailed documentation, including in-depth descriptions of processing steps, data structure, and configuration options, please see the RedTEI Wiki.
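To make the transformation concrete, here is a minimal sketch of turning one Pushshift comment record into an XML element, using only the Python standard library. The field names `id`, `author`, `created_utc` and `body` are standard Pushshift comment fields. The function name `comment_to_tei`, the `c_` identifier prefix and the element and attribute layout are assumptions made for illustration; they are not validated against the TEI schema and do not reflect RedTEI's actual output structure, which is described in the Wiki.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def comment_to_tei(comment: dict) -> ET.Element:
    """Build a <div> for one comment (hypothetical layout, not RedTEI's)."""
    # Pushshift stores the creation time as a Unix timestamp in seconds.
    when = datetime.fromtimestamp(int(comment["created_utc"]), tz=timezone.utc)
    div = ET.Element(
        "div",
        {
            "type": "comment",
            "xml:id": f"c_{comment['id']}",  # assumed ID scheme
            "who": comment["author"],
            "when": when.isoformat(),
        },
    )
    # One <p> per non-empty line of the comment body.
    for line in comment["body"].splitlines():
        if line.strip():
            ET.SubElement(div, "p").text = line.strip()
    return div


# A decompressed Pushshift dump stores one JSON object per line.
raw = (
    '{"id": "abc123", "author": "example_user", '
    '"created_utc": 1700000000, "body": "Hallo!\\n\\nWie geht es?"}'
)
element = comment_to_tei(json.loads(raw))
print(ET.tostring(element, encoding="unicode"))
```

Reading the compressed `.zst` file itself is left out here, since it requires a third-party decompression library.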
To set up the environment, make sure you are using Python 3.12+:
python3.12 --version
Create and activate the virtual environment:
python3 -m venv .venv # ensure Python 3.12+ is used
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
To run the main script, use:
python run.py path/to/subreddit.zst
# or to process (and save) each comment individually
python run.py --no-group path/to/subreddit.zst
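The difference between the two modes can be pictured with a small sketch. It assumes that grouping means collecting comments by the submission they belong to, identified by the Pushshift field `link_id`; the README does not state the grouping key, so this is an illustration rather than a description of what run.py does. The helper name `group_by_thread` is hypothetical.

```python
from collections import defaultdict


def group_by_thread(comments: list[dict]) -> dict[str, list[dict]]:
    """Collect comments under their submission (assumed key: link_id), oldest first."""
    threads = defaultdict(list)
    for comment in comments:
        threads[comment["link_id"]].append(comment)
    for thread in threads.values():
        thread.sort(key=lambda c: int(c["created_utc"]))
    return dict(threads)


comments = [
    {"id": "c2", "link_id": "t3_aaa", "created_utc": 20},
    {"id": "c1", "link_id": "t3_aaa", "created_utc": 10},
    {"id": "c3", "link_id": "t3_bbb", "created_utc": 15},
]

# Grouped mode: one unit per submission.
for link_id, thread in group_by_thread(comments).items():
    print(link_id, [c["id"] for c in thread])

# Ungrouped mode (as with --no-group): every comment is its own unit.
for comment in comments:
    print(comment["id"])
```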
If you use this work, please cite:
@misc{goettel2025redtei,
title = {{Reddit als (Text-)Ressource: Erstellung und Nachnutzbarkeit eines deutschsprachigen Reddit-Korpus.}},
author = {G{\"o}ttel, Sebastian and K{\"o}rber, Lydia and Barbaresi, Adrien},
type = {Poster},
month = mar,
year = "2025",
url = "https://zenodo.org/records/14944553",
doi = "10.5281/zenodo.14944553",
note = {Poster presented at DHd 2025 Under Construction (DHd2025)}
}
This repository is licensed under the GNU General Public License v3.0.