Massive German Text Corpus released #4

Open
PhilipMay opened this issue Apr 18, 2021 · 9 comments
@PhilipMay

Hi @stefan-it

I just wanted to bring your attention to the release of "our" German colossal, cleaned Common Crawl corpus: https://german-nlp-group.github.io/projects/gc4-corpus.html

It is a massive (450 GB zipped) dataset based on Common Crawl with careful preprocessing and deduplication.

The main work was done by Philipp Reißel. Many thanks to iisys (the Institute of Information Systems Hof University) for hosting this dataset.

Maybe you want to use it with your next models... ;-)

@stefan-it
Owner

Hi @PhilipMay ,

thanks for that hint! The corpus looks really interesting, and:

This preprocessing is filtering duplicates only inside the same dump. This step took approx. 50,000 CPU hours and 400 TB of network traffic to the common crawl s3 bucket.

is really awesome! I'll definitely work with this corpus in the near future 🤗
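
Just to make the per-dump idea concrete, here is a toy sketch (not the actual GC4 pipeline; de_dump_2015-48.txt is a made-up file with one document per line):

# Keep only the first occurrence of each document within a single dump;
# duplicates across different dumps are left untouched.
awk '!seen[$0]++' de_dump_2015-48.txt > de_dump_2015-48.dedup.txt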

@stefan-it
Owner

Hi @PhilipMay ,

just one question: I've downloaded the HEAD and MIDDLE archives (using the URLs provided in gc4_corpus_head_urls.txt and gc4_corpus_middle_urls.txt). However, a du -sh shows "only" 418 GB in total. Can you confirm that, or how can I check whether something went wrong? Here's my ls -hl of all files:

listing.txt

🤔

Thanks!

@PhilipMay
Author

Hmm, maybe 450 GB was a rather inaccurate estimate. What do you think, @Phil1108?
Or did the folks at iisys somehow lose files?

I would do two things:
count the files and check whether they match our number of links, and then use gzip to test the archives.
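
A minimal sketch of both checks (assuming the HEAD archives follow a de_head_*.tar.gz naming analogous to de_middle_*.tar.gz):

# 1) Compare the expected number of archives (from the URL lists) with what is on disk.
cat gc4_corpus_head_urls.txt gc4_corpus_middle_urls.txt | wc -l
ls de_head_*.tar.gz de_middle_*.tar.gz | wc -l

# 2) Test the gzip integrity of every archive; no output means no corruption was found.
for f in de_*.tar.gz; do gzip -t "$f" || echo "broken: $f"; done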

@stefan-it
Owner

The number of files is correct (I checked both *.txt files against the links on the website).

I will now check the Content-Length header of the provided files, e.g.:

curl -I https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/middle/de_middle_0000_2015-48.tar.gz
HTTP/1.1 200 OK
Date: Thu, 22 Apr 2021 06:32:46 GMT
Server: Apache/2.4.41 (Ubuntu)
Last-Modified: Sat, 31 Oct 2020 13:12:16 GMT
ETag: "65419493-5b2f7431a0c00"
Accept-Ranges: bytes
Content-Length: 1698796691
Content-Type: application/x-gzip

The Content-Length header reports the file size, which is identical to the size on disk:

ls -l de_middle_0000_2015-48.tar.gz
-rw-r--r-- 1 stefan users 1698796691 Okt 31 14:12 de_middle_0000_2015-48.tar.gz

I'll report back if I find any broken tar archives 😅

@stefan-it
Owner

stefan-it commented Apr 22, 2021

With some bash magic:

for url in $(cat gc4_corpus_middle_urls.txt)
do
  # Derive the local filename from the URL and read its size on disk.
  filename=$(echo $url | cut -d "/" -f 8)
  disk_size=$(stat -c "%s" $filename)

  # Ask the server for the expected size; strip the trailing \r from the header line.
  download_size=$(curl --silent -I $url | grep "Content-Length:" | cut -d " " -f 2 | tr -d '\r')
  echo $filename $disk_size $download_size
done

Files for head and middle:

comparison_head.txt
comparison_middle.txt

So it turns out that all downloaded files have exactly the size reported by their Content-Length header 🤗
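
A small follow-up one-liner could flag any mismatch automatically (assuming the three-column filename/disk_size/download_size format of the listings above, and stripping a possible trailing carriage return from the header value):

awk '{ gsub(/\r/, "") } $2 != $3 { print "MISMATCH:", $0 }' comparison_head.txt comparison_middle.txt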

@stefan-it
Owner

stefan-it commented Apr 22, 2021

And I calculated the total number of downloaded bytes: 448598516042, which is pretty close to 450 GB 😅

More precisely: 194227285957 (HEAD) + 254371230085 (MIDDLE) = 448598516042 in total.
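
One way to reproduce that total is to sum the on-disk sizes of all archives directly (again assuming the de_head_*/de_middle_* naming):

stat -c "%s" de_head_*.tar.gz de_middle_*.tar.gz | awk '{ sum += $1 } END { print sum }'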

So I guess everything was ok! Thanks for providing this massive corpus, I will extract all archives now :)

@PhilipMay
Author

Good luck and thanks for reporting back.

@Phil1108

Phil1108 commented Apr 22, 2021

@stefan-it Yeah, sorry, that was the usual 1000 vs. 1024 issue (decimal GB vs. binary GiB); I've edited that in the README.
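
For the record, the two unit conventions explain the gap (quick check with bc):

echo "scale=2; 448598516042 / 1000^3" | bc   # 448.59 -> roughly the advertised 450 GB (decimal)
echo "scale=2; 448598516042 / 1024^3" | bc   # 417.79 -> the ~418 GB that du -sh reports (GiB)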

I usually never extract the data, to keep disk usage low.
I've just added another subtopic here, https://german-nlp-group.github.io/projects/gc4-corpus.html#necessary-steps-before-usage, with a short gist linked for custom filtering.
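
For anyone following along, a minimal sketch of inspecting the archives without extracting them to disk (GNU tar; the member names and formats are not shown here, so adapt to the filtering gist as needed):

# List the members of one archive without unpacking it.
tar -tzf de_middle_0000_2015-48.tar.gz | head

# Stream member contents straight to stdout (and into any downstream filter).
tar -xzOf de_middle_0000_2015-48.tar.gz | head -n 5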

@stefan-it
Owner

Hi @PhilipMay and @Phil1108 ,

thanks again for providing the corpus (and the cool filtering script). I've trained an ELECTRA model (with a larger subword vocab than usual; a 32k version is coming this week or next week).

I've done some preliminary experiments (GermEval 14 and 18) and the results are better than GELECTRA (base). Here's the repo with all 11 checkpoints (one every 100k steps, for a model trained for 1M steps in total):

https://github.com/stefan-it/gc4lm

(Spoiler: the 900k checkpoint works best for NER in my experiments 😅)
