Massive German Text Corpus released #4
Hi @PhilipMay , thanks for that hint! The corpus looks really interesting and really awesome! I'll definitely work with this corpus in the near future 🤗
Hi @PhilipMay , just one question: I've downloaded the HEAD and MIDDLE archives (using the URLs provided in the *.txt files), but the total size doesn't seem to match the announced 450 GB 🤔 Thanks!
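For context, a minimal sketch of how the archives can be fetched from one of the provided URL lists (the filename `gc4_corpus_middle_urls.txt` appears in a later comment in this thread; `wget` is just one possible tool):

```bash
# Illustrative download sketch, not the official instructions:
# -i reads one URL per line from the list file, -c resumes interrupted downloads.
wget -c -i gc4_corpus_middle_urls.txt
```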
Hmm - maybe 450 GB was a rather inaccurate estimate. What do you think @Phil1108 ? I would do 2 things: check that the number of downloaded files matches the URL lists, and compare each file's size on disk against the server's Content-Length header.
The number of files is correct (I checked both *.txt files and the links on the website). I will now check the Content-Length header of the provided files, e.g.:

```
curl -I https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/middle/de_middle_0000_2015-48.tar.gz
HTTP/1.1 200 OK
Date: Thu, 22 Apr 2021 06:32:46 GMT
Server: Apache/2.4.41 (Ubuntu)
Last-Modified: Sat, 31 Oct 2020 13:12:16 GMT
ETag: "65419493-5b2f7431a0c00"
Accept-Ranges: bytes
Content-Length: 1698796691
Content-Type: application/x-gzip
```

And the `ls -l` output for that file:

```
ls -l de_middle_0000_2015-48.tar.gz
-rw-r--r-- 1 stefan users 1698796691 Okt 31 14:12 de_middle_0000_2015-48.tar.gz
```

I'll report back if I find some broken tar archives 😅
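As a side note, the archives can also be tested for corruption without extracting them. This is only a sketch (it assumes GNU gzip/tar and the `de_middle_*` naming used above):

```bash
# Check every downloaded archive for corruption without unpacking it.
# gzip -t validates the compressed stream; tar -tzf additionally walks the member list.
for archive in de_middle_*.tar.gz; do
  if gzip -t "$archive" 2>/dev/null && tar -tzf "$archive" > /dev/null 2>&1; then
    echo "OK      $archive"
  else
    echo "BROKEN  $archive"
  fi
done
```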
With some bash magic:

```bash
for url in $(cat gc4_corpus_middle_urls.txt)
do
  filename=$(echo $url | cut -d "/" -f 8)
  disk_size=$(stat -c "%s" $filename)
  download_size=$(curl --silent -I $url | grep "Content-Length:" | cut -d " " -f 2)
  echo $filename $disk_size $download_size
done
```

Files for head and middle: comparison_head.txt

So it turns out that all downloaded files have exactly the same file size as their Content-Length header 🤗
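For anyone reproducing this check, here is a slightly more defensive variant of the same loop (a sketch, assuming GNU coreutils and curl): `basename` avoids hard-coding the URL depth, and stripping the carriage return that curl keeps at the end of each HTTP header line avoids comparing `1698796691\r` against `1698796691` in a string comparison:

```bash
# Compare local file size against the server-reported Content-Length for each URL.
while read -r url; do
  filename=$(basename "$url")
  disk_size=$(stat -c "%s" "$filename")
  download_size=$(curl --silent --head "$url" | grep -i "^Content-Length:" | tr -d '\r' | cut -d " " -f 2)
  if [ "$disk_size" = "$download_size" ]; then
    echo "OK       $filename $disk_size"
  else
    echo "MISMATCH $filename disk=$disk_size remote=$download_size"
  fi
done < gc4_corpus_middle_urls.txt
```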
And I calculated the total number of downloaded bytes. So I guess everything was ok! Thanks for providing this massive corpus, I will extract all archives now :)
Good luck and thanks for reporting back.
@stefan-it Yeah, sorry, that was the usual 1000 vs. 1024 issue; I've edited that in the Readme. I usually never extract the data, to keep disk usage low.
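The 1000-vs-1024 point is easy to verify locally. A minimal sketch, assuming GNU coreutils and downloaded archives matching `de_*.tar.gz`:

```bash
# The same byte count looks several percent smaller in binary (GiB) than in decimal (GB) units:
# 450 * 10^9 bytes / 2^30 bytes per GiB ≈ 419 GiB.
total_bytes=$(du -cb de_*.tar.gz | tail -n 1 | cut -f 1)   # apparent total size in bytes
numfmt --to=si  "$total_bytes"   # decimal units, 1 GB  = 10^9 bytes
numfmt --to=iec "$total_bytes"   # binary units,  1 GiB = 2^30 bytes
```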
Hi @PhilipMay and @Phil1108 , thanks again for providing the corpus (and the cool filtering script). I've trained an ELECTRA model with a larger subword vocab than usual (a 32k-vocab model is coming this week or next week). I've done some preliminary experiments (GermEval 2014 and 2018) and the results are better than GELECTRA (base). Here's the repo with all 11 checkpoints (one every 100k steps, for a model trained for 1M steps in total): https://github.com/stefan-it/gc4lm (Spoiler: the 900k checkpoint works best for NER in my experiments 😅)
Hi @stefan-it
I just wanted to bring to your attention the release of "our" German colossal, cleaned Common Crawl corpus: https://german-nlp-group.github.io/projects/gc4-corpus.html
It is a massive (450 GB zipped) dataset based on Common Crawl with careful preprocessing and deduplication.
The main work was done by Philipp Reißel. Many thanks to iisys (the Institute of Information Systems at Hof University) for hosting this dataset.
Maybe you want to use it with your next models... ;-)