This Go application processes Wiktionary data dumps to extract the International Phonetic Alphabet (IPA) transcriptions of words. It parses the latest Wiktionary dump, processes the data, and outputs the IPA transcriptions in a JSONL format. The project also includes a nightly job that checks for new Wiktionary dumps and automatically processes them if available.
- Automated Data Extraction: Extracts IPA transcriptions from Wiktionary dumps.
- JSONL Output: Writes one JSON object per line, each holding a word and its IPA transcriptions.
- Nightly Checks: Automatically checks for new data dumps from Wiktionary and processes them if new data is available.
- GitHub Release Integration: Publishes processed data as a release on GitHub, allowing easy access to the latest IPA data.
If you simply want the latest processed IPA data, you can head over to the Releases section of this repository and download the latest JSONL file. No need to install or build the application unless you want to process the data yourself.
The output is in JSONL (JSON Lines) format: each line is a self-contained JSON object holding one word and its IPA transcriptions. Here's an example of what the output looks like:
{"word":"aardvark","ipa":[{"lang":"en","IPA":["/ˈɑːd.vɑːk/"],"variant":"a=RP"}]}
{"word":"month","ipa":[{"lang":"en","IPA":["/mʌnθ/"]},{"lang":"en","IPA":["/mʌnθ/"]}]}
Each entry contains:
- word: The word being transcribed.
- ipa: A list of IPA transcriptions for different languages or pronunciation variants (e.g., RP for Received Pronunciation, US for American English).
  - lang: The language code (e.g., "en" for English).
  - IPA: A list of phonetic transcriptions.
  - variant: Optional, specifies pronunciation variant (e.g., "RP", "US").
- Install the application using Go:
go install github.com/terwey/wiktionary-ipa-extract/cmd/wiktionary-ipa-extract@latest
- Ensure Go is properly set up and available in your system’s PATH.
To process the latest Wiktionary dump and extract IPA transcriptions, run the following command:
curl https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2 | ./wiktionary-ipa-extract --bz -o ipa-data.jsonl
This will download the latest dump, process it, and output the extracted IPA transcriptions into a JSONL file.
If you want to process a specific dump (or one downloaded manually), you can provide the file directly using the --input flag. Here’s how to do it:
- Download the Wiktionary dump you wish to process from Wiktionary Dumps.
- Run the command below, replacing <dump-file> with the path to your downloaded file and <output-file> with the desired output filename:
./wiktionary-ipa-extract --input <dump-file> --bz -o <output-file>.jsonl
Example:
./wiktionary-ipa-extract --input ~/downloads/enwiktionary-20230101-pages-articles.xml.bz2 --bz -o ipa-data.jsonl
This project uses GitHub Actions to automate nightly runs that check for new Wiktionary dumps via an RSS feed. If a new dump is found, it is automatically processed, and the extracted IPA data is released on GitHub in JSONL format.
Contributions are welcome! Please submit a pull request or open an issue to discuss improvements or report bugs.
This project is licensed under the MIT License. See the LICENSE file for more details.