A collection of scripts for automatic corpus generation on Google Cloud Platform.
- Convert XLSX to newline-delimited JSON.
- Upload data to Cloud Storage.
- Create external tables in BigQuery from data in Cloud Storage.
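
The conversion step can be illustrated with a minimal sketch. It assumes the rows of a sheet have already been read into dictionaries (the reading step is omitted here), and the function name `to_jsonl` and the sample rows are made up for illustration; the actual scripts may work differently.

```python
import json
from datetime import date


def to_jsonl(rows):
    """Serialize an iterable of dicts as newline-delimited JSON (one object per line)."""

    def encode(value):
        # date values are written as ISO strings, e.g. "2020-01-02"
        if isinstance(value, date):
            return value.isoformat()
        raise TypeError(f"cannot serialize {type(value).__name__}")

    # ensure_ascii=False keeps accented Hungarian characters readable in the output
    return "".join(
        json.dumps(row, ensure_ascii=False, default=encode) + "\n" for row in rows
    )


# Hypothetical rows; in the real scripts these would come from an XLSX sheet.
rows = [
    {"cap_id": 1, "nev_angol": "Macroeconomics", "nev_magyar": "Makroökonómia"},
    {"cap_id": 2, "nev_angol": "Civil Rights", "nev_magyar": None},
]
jsonl = to_jsonl(rows)
```

Empty cells are written as `null`, which matches the NULLABLE mode used for most columns in the schemas.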
- existing project on Google Cloud Platform.
- APIs enabled
- private key for a Google Cloud Service Account
  - open Credentials
  - create or open a service account
  - go to KEYS
  - ADD KEY > Create new key
  - in the dialog window, create a JSON key
  - place the JSON file in the same folder as the script
  - rename it to "gcp_key.json"
- for the NLP script, you have to download a HuSpaCy model
  - the larger model (hu_core_news_lg) is recommended
  - it can be installed quickly with pip:
    `pip install https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl`
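
Before running the scripts, it may help to check that the key file and the model are in place. The following is a minimal sketch and not part of the scripts themselves: the function names `check_key_file` and `model_installed` are hypothetical, and the field list covers only a few of the fields found in a service account JSON key.

```python
import importlib.util
import json
from pathlib import Path

# A few of the fields found in a service account key downloaded as JSON.
EXPECTED_FIELDS = {"type", "project_id", "private_key", "client_email"}


def check_key_file(path="gcp_key.json"):
    """Return a list of problems with the key file (empty if it looks usable)."""
    key_path = Path(path)
    if not key_path.is_file():
        return [f"{key_path} not found"]
    try:
        key = json.loads(key_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{key_path} is not valid JSON ({exc})"]
    if not isinstance(key, dict):
        return [f"{key_path} does not contain a JSON object"]
    problems = [
        f"missing field: {name}" for name in sorted(EXPECTED_FIELDS - set(key))
    ]
    if key.get("type") != "service_account":
        problems.append("field 'type' is not 'service_account'")
    return problems


def model_installed(name="hu_core_news_lg"):
    """True if the model package can be found in the current environment."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    for problem in check_key_file():
        print("key file:", problem)
    if not model_installed():
        print("model hu_core_news_lg is not installed")
```

The check only inspects the file contents locally; it does not verify that the key is still valid on the Google Cloud side.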
You can modify the parameters in config.ini, which is structured as follows:
parameter | expected string |
---|---|
raw | folder containing raw XLSX data (relative/absolute path) |
json | output folder for newline-delimited JSON (relative/absolute path) |
schemas | folder containing schema information as JSON (relative/absolute path) |
parameter | expected string |
---|---|
project | GCP project ID |
bucket | Cloud Storage bucket name |
dataset | BigQuery dataset name |
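
These three values are enough to describe an external table. As a rough sketch, the request body for the BigQuery REST API `tables.insert` method could be assembled as below. The function name, the table name `cap`, and the assumption of one `<table>.jsonl` file per table in the bucket root are all illustrative; the actual scripts may use the client library and a different file layout.

```python
def external_table_resource(project, dataset, bucket, table, schema_fields):
    """Build a BigQuery Table resource for an external table over a JSONL file."""
    return {
        "tableReference": {
            "projectId": project,
            "datasetId": dataset,
            "tableId": table,
        },
        "externalDataConfiguration": {
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            # Assumed naming: one file per table, named after the table.
            "sourceUris": [f"gs://{bucket}/{table}.jsonl"],
            "schema": {"fields": schema_fields},
        },
    }


# Placeholder values standing in for the project, dataset and bucket parameters.
resource = external_table_resource(
    project="my-project-id",
    dataset="my_dataset",
    bucket="my-bucket",
    table="cap",
    schema_fields=[{"name": "cap_id", "type": "INTEGER", "mode": "REQUIRED"}],
)
```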
parameter | expected string |
---|---|
xlsx2jsonl | True/False (turns conversion on/off) |
storage | True/False (turns Cloud Storage upload on/off) |
bigquery | True/False (turns BigQuery table generation on/off) |
parameter | expected string |
---|---|
nlp | True/False (turns text analysis on/off) |
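
A file with these parameters can be read with Python's `configparser`. The sketch below is only an illustration: the section names (`paths`, `gcp`, `steps`, `analysis`) and all values are assumptions, so check the config.ini shipped with the scripts for the real section names.

```python
import configparser

# Hypothetical config.ini content; section names and values are assumptions.
SAMPLE = """
[paths]
raw = data/raw
json = data/json
schemas = schemas

[gcp]
project = my-project-id
bucket = my-bucket
dataset = my_dataset

[steps]
xlsx2jsonl = True
storage = True
bigquery = False

[analysis]
nlp = False
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # with a real file: config.read("config.ini")

raw_dir = config["paths"]["raw"]
# getboolean() turns the strings "True"/"False" into real booleans
enabled = {name: config.getboolean("steps", name) for name in config["steps"]}
run_nlp = config.getboolean("analysis", "nlp")
```

Using `getboolean` rather than comparing strings avoids treating the non-empty string `"False"` as true.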
Schemas are loaded from JSON files. The table structure is almost identical to that of cap_pilot_benchmark_xlsx_json.
name | type | mode |
---|---|---|
cap_id | INTEGER | REQUIRED |
nev_angol | STRING | NULLABLE |
nev_magyar | STRING | NULLABLE |
name | type | mode |
---|---|---|
szo_id | INTEGER | REQUIRED |
szoalak | STRING | NULLABLE |
lemma | STRING | NULLABLE |
entity_IOB | STRING | NULLABLE |
POS | STRING | NULLABLE |
morf_analysis | STRING | NULLABLE |
dependencia_el | STRING | NULLABLE |
mondat_id | INTEGER | NULLABLE |
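
The columns of this table line up naturally with spaCy token attributes. The mapping below is a guess based on the column names, not taken from the NLP script, and `token_to_row` is a hypothetical helper. A stand-in object replaces a real token so the sketch runs without the model installed.

```python
from types import SimpleNamespace


def token_to_row(token, szo_id, mondat_id):
    """Map one spaCy-style token to a row of the word table (assumed mapping)."""
    return {
        "szo_id": szo_id,
        "szoalak": token.text,              # surface form
        "lemma": token.lemma_,
        "entity_IOB": token.ent_iob_,       # "I", "O" or "B"
        "POS": token.pos_,
        "morf_analysis": str(token.morph),  # e.g. "Case=Nom|Number=Plur"
        "dependencia_el": token.dep_,       # dependency label
        "mondat_id": mondat_id,
    }


# With the model installed, tokens would come from something like:
#   import spacy
#   nlp = spacy.load("hu_core_news_lg")
#   for sent in nlp(text).sents:
#       for token in sent: ...
# Here a stand-in with made-up values is used instead of a real Token.
fake_token = SimpleNamespace(
    text="házak",
    lemma_="ház",
    ent_iob_="O",
    pos_="NOUN",
    morph="Case=Nom|Number=Plur",
    dep_="nsubj",
)
row = token_to_row(fake_token, szo_id=1, mondat_id=1)
```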
name | type | mode |
---|---|---|
text_id | INTEGER | REQUIRED |
text_type | INTEGER | NULLABLE |
exact_date | DATE | NULLABLE |
cycle_number | INTEGER | NULLABLE |
parliamentary_id | STRING | NULLABLE |
text | STRING | NULLABLE |
napirendi_pont | STRING | NULLABLE |
video_felszolalas_ido | STRING | NULLABLE |
video_feszolalas_url | STRING | NULLABLE |
felszolalas_url | STRING | NULLABLE |
tokenszam | STRING | NULLABLE |
major_topic | STRING | NULLABLE |
COVID | STRING | NULLABLE |
text_id_old | STRING | NULLABLE |
name | type | mode |
---|---|---|
parliamentary_id | STRING | REQUIRED |
surname | STRING | NULLABLE |
first_name | STRING | NULLABLE |
birth_year | FLOAT | NULLABLE |
birth_place | STRING | NULLABLE |
sex | INTEGER | NULLABLE |
death_date | FLOAT | NULLABLE |
death_place | STRING | NULLABLE |
PERSON_POS | INTEGER | NULLABLE |
change_name | STRING | NULLABLE |
surname_new | STRING | NULLABLE |
surname_from | DATE | NULLABLE |
id_old | STRING | NULLABLE |
name | type | mode |
---|---|---|
mondat_id | INTEGER | REQUIRED |
raw_text | STRING | NULLABLE |
text_id | STRING | NULLABLE |
name | type | mode |
---|---|---|
cycle_number | INTEGER | REQUIRED |
cycle_years_from | INTEGER | NULLABLE |
cycle_years_to | INTEGER | NULLABLE |
cycle_from | DATE | NULLABLE |
cycle_to | DATE | NULLABLE |
name | type | mode |
---|---|---|
party_id | INTEGER | REQUIRED |
party_name_full_HUN | STRING | NULLABLE |
party_name_full_HUN_from | DATE | NULLABLE |
party_name_full_HUN_to | DATE | NULLABLE |
party_name2_full_HUN | STRING | NULLABLE |
party_name2_full_HUN_from | DATE | NULLABLE |
party_name2_full_HUN_to | DATE | NULLABLE |
party_name3_full_HUN | STRING | NULLABLE |
party_name_full3_HUN_from | DATE | NULLABLE |
party_name_full3_HUN_to | DATE | NULLABLE |
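
Each table above corresponds to a JSON schema file. As a minimal sketch, assuming the files use the usual BigQuery layout (a list of objects with `name`, `type` and `mode`), a record can be checked against a schema before upload. The `validate` helper is hypothetical and only checks field names and REQUIRED fields, not types.

```python
import json

# Assumed content of a schema file for the sentence table.
SCHEMA_JSON = """
[
  {"name": "mondat_id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "raw_text", "type": "STRING", "mode": "NULLABLE"},
  {"name": "text_id", "type": "STRING", "mode": "NULLABLE"}
]
"""


def validate(record, schema):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    known = {field["name"] for field in schema}
    for field in schema:
        # REQUIRED fields must be present and not null
        if field["mode"] == "REQUIRED" and record.get(field["name"]) is None:
            problems.append(f"missing required field: {field['name']}")
    for name in record:
        if name not in known:
            problems.append(f"unknown field: {name}")
    return problems


schema = json.loads(SCHEMA_JSON)
ok = validate({"mondat_id": 1, "raw_text": "Példa mondat.", "text_id": None}, schema)
bad = validate({"raw_text": "Példa mondat.", "extra": 1}, schema)
```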