Massive German Text Corpus released #4

Open
PhilipMay opened this issue Apr 18, 2021 · 9 comments
@PhilipMay

Hi @stefan-it

I just wanted to bring your attention to the release of "our" German colossal, cleaned Common Crawl corpus: https://german-nlp-group.github.io/projects/gc4-corpus.html

It is a massive (450 GB zipped) dataset based on Common Crawl with careful preprocessing and deduplication.

The main work was done by Philipp Reißel. Many thanks to iisys (the Institute of Information Systems Hof University) for hosting this dataset.

Maybe you want to use it with your next models... ;-)

@stefan-it
Owner

Hi @PhilipMay ,

thanks for that hint! The corpus looks really interesting, and:

This preprocessing is filtering duplicates only inside the same dump. This step took approx. 50,000 CPU hours and 400 TB of network traffic to the common crawl s3 bucket.

is really awesome! I'll definitely work with this corpus in the near future 🤗
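
Just to make the per-dump idea concrete, here is a toy sketch (not the actual GC4 pipeline; de_dump_2015-48.txt is a made-up file with one document per line):

# Keep only the first occurrence of each document within a single dump;
# duplicates across different dumps are left untouched.
awk '!seen[$0]++' de_dump_2015-48.txt > de_dump_2015-48.dedup.txt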

@stefan-it
Owner

Hi @PhilipMay ,

just one question: I've downloaded the HEAD and MIDDLE archives (using the URLs provided in gc4_corpus_head_urls.txt and gc4_corpus_middle_urls.txt). However, a du -sh shows "only" 418 GB in total. Can you confirm that, or how can I check whether something went wrong? Here's my ls -hl of all files:

listing.txt

🤔

Thanks!

@PhilipMay
Author

Hmm, maybe 450 GB was a rather inaccurate estimate. What do you think, @Phil1108?
Or did the folks at iisys somehow lose files?

I would do two things:
count the files and check whether they match our number of links, and then use gzip to test the archives.
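
A minimal sketch of both checks (assuming the HEAD archives follow a de_head_*.tar.gz naming analogous to de_middle_*.tar.gz):

# 1) Compare the expected number of archives (from the URL lists) with what is on disk.
cat gc4_corpus_head_urls.txt gc4_corpus_middle_urls.txt | wc -l
ls de_head_*.tar.gz de_middle_*.tar.gz | wc -l

# 2) Test the gzip integrity of every archive; no output means no corruption was found.
for f in de_*.tar.gz; do gzip -t "$f" || echo "broken: $f"; done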

@stefan-it
Owner

The number of files is correct (I checked both *.txt files against the links on the website).

I will now check the Content-Length header of the provided files, e.g.:

curl -I https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/middle/de_middle_0000_2015-48.tar.gz
HTTP/1.1 200 OK
Date: Thu, 22 Apr 2021 06:32:46 GMT
Server: Apache/2.4.41 (Ubuntu)
Last-Modified: Sat, 31 Oct 2020 13:12:16 GMT
ETag: "65419493-5b2f7431a0c00"
Accept-Ranges: bytes
Content-Length: 1698796691
Content-Type: application/x-gzip

The Content-Length header reports the file size, which is identical to the size on disk:

ls -l de_middle_0000_2015-48.tar.gz
-rw-r--r-- 1 stefan users 1698796691 Okt 31 14:12 de_middle_0000_2015-48.tar.gz

I'll report back if I find any broken tar archives 😅

@stefan-it
Owner

stefan-it commented Apr 22, 2021

With some bash magic:

for url in $(cat gc4_corpus_middle_urls.txt)
do
  # Derive the local filename from the URL and read its size on disk.
  filename=$(echo $url | cut -d "/" -f 8)
  disk_size=$(stat -c "%s" $filename)

  # Ask the server for the expected size; strip the trailing \r from the header line.
  download_size=$(curl --silent -I $url | grep "Content-Length:" | cut -d " " -f 2 | tr -d '\r')
  echo $filename $disk_size $download_size
done

Files for head and middle:

comparison_head.txt
comparison_middle.txt

So it turns out that all downloaded files have exactly the size reported by their Content-Length header 🤗
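
A small follow-up one-liner could flag any mismatch automatically (assuming the three-column filename/disk_size/download_size format of the listings above, and stripping a possible trailing carriage return from the header value):

awk '{ gsub(/\r/, "") } $2 != $3 { print "MISMATCH:", $0 }' comparison_head.txt comparison_middle.txt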

@stefan-it
Owner

stefan-it commented Apr 22, 2021

And I calculated the total number of downloaded bytes: 448598516042, which is pretty close to 450 GB 😅

More precisely: 194227285957 (HEAD) + 254371230085 (MIDDLE) = 448598516042 in total.
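
One way to reproduce that total is to sum the on-disk sizes of all archives directly (again assuming the de_head_*/de_middle_* naming):

stat -c "%s" de_head_*.tar.gz de_middle_*.tar.gz | awk '{ sum += $1 } END { print sum }'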

So I guess everything was ok! Thanks for providing this massive corpus, I will extract all archives now :)

@PhilipMay
Author

Good luck and thanks for reporting back.

@Phil1108

Phil1108 commented Apr 22, 2021

@stefan-it Yeah, sorry, that was the usual 1000 vs. 1024 issue (decimal GB vs. binary GiB); I've edited that in the README.
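
For the record, the two unit conventions explain the gap (quick check with bc):

echo "scale=2; 448598516042 / 1000^3" | bc   # 448.59 -> roughly the advertised 450 GB (decimal)
echo "scale=2; 448598516042 / 1024^3" | bc   # 417.79 -> the ~418 GB that du -sh reports (GiB)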

I usually never extract the data, to keep disk usage low.
I've just added another subtopic here, https://german-nlp-group.github.io/projects/gc4-corpus.html#necessary-steps-before-usage, with a short gist linked for custom filtering.
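
For anyone following along, a minimal sketch of inspecting the archives without extracting them to disk (GNU tar; the member names and formats are not shown here, so adapt to the filtering gist as needed):

# List the members of one archive without unpacking it.
tar -tzf de_middle_0000_2015-48.tar.gz | head

# Stream member contents straight to stdout (and into any downstream filter).
tar -xzOf de_middle_0000_2015-48.tar.gz | head -n 5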

@stefan-it
Owner

Hi @PhilipMay and @Phil1108 ,

thanks again for providing the corpus (and the cool filtering script). I've trained an ELECTRA model (with a larger subword vocab than usual; a 32k version is coming this week or next week).

I've done some preliminary experiments (GermEval 14 and 18) and the results are better than GELECTRA (base). Here's the repo with all 11 checkpoints (one every 100k steps, for a model trained for 1M steps in total):

https://github.com/stefan-it/gc4lm

(Spoiler: the 900k checkpoint works best for NER in my experiments 😅)
