
Commit

[infra] Python 1.6.2 (#1109)
* feat(infra): create version 1.6.2

* feat(infra): create version 1.6.2

* feat(infra): create version 1.6.2

* [infra] python-v1.6.2 (#1089)

* [infra] fix dataset_config.yaml folder path (#1067)

* feat(infra): merge master

* [infra] conform Metadata to new metadata changes (#1093)

* [dados-bot] br_ms_vacinacao_covid19 (2022-01-23) (#1086)

Co-authored-by: terminal_name <github_email>

* [dados] br_bd_diretorios_brasil.etnia_indigena (#1087)

* Upload the etnia_indigena directory

* Update table_config.yaml

* Update table_config.yaml

* feat: conform Metadata's schema to new one

* fix: conform yaml generation to new schema

* fix: delete test_dataset folder

Co-authored-by: Lucas Moreira <[email protected]>
Co-authored-by: Gustavo Aires Tiago <[email protected]>

Co-authored-by: Ricardo Dahis <[email protected]>
Co-authored-by: Lucas Moreira <[email protected]>
Co-authored-by: Gustavo Aires Tiago <[email protected]>

* feat(infra): 1.6.2a3 version

* feat(infra): 1.6.2a3 version

* fix(infra): edit partitions and update_locally

* feat(infra): update_columns new fields and accepts local files

* [infra] option to make dataset public (#1020)

* feat(infra): option to make dataset public

* feat(infra): fix None data

* fix(infra): roll back

* fix(infra): fix retry in storage upload

* fix(infra): add option to dataset data location

* feat(infra): make staging dataset not public

* feat(infra): make staging dataset not public

* fix(infra): change bd version in actions

* fix(infra): add toml to install in ci

* fix(infra): remove a forgotten print

* fix(infra): fix location

* fix(infra): fix dataset description

* feat(infra): bump-version

* feat(infra): temporal coverage as list in update_columns

* feat(infra): add new parameters to cli

* feat(infra): fix cli options

* [infra] change download functions to consume CKAN endpoints #1129  (#1130)

* [infra] add function to wrap bd_dataset_search endpoint

* Update download.py

* [infra] modify list_datasets function to consume CKAN endpoint

* [infra] fix list_dataset function to include limit and remove order_by

* [infra] change function list_dataset_tables to use CKAN endpoint

* [infra] apply PEP8 to list_dataset_tables and respective tests

* add get_dataset_description, get_table_description, get_table_columns

* [infra] fix dataset_config.yaml folder path (#1067)

* feat(infra): merge master

* fix files organization to match master

* remove download.py

* remove test_download

* Delete test_download.py

* remove test files

* remove test_download.py

* remove test_download.py

* remove test_download.py

* remove test_download.py

* add tests metadata

* remove test_download.py

* remove unused imports

* [infra] add _safe_fetch and get_table_size functions

Co-authored-by: lucascr91 <[email protected]>

* fix(infra): pass an empty list when there is no partition

* [infra] Add Avro and Parquet support (#1145)

* add Avro and Parquet upload support

* Adds test for source formats

* [infra] update tests for avro, parquet, and csv upload

Co-authored-by: Gabriel Gazola Milan <[email protected]>
Co-authored-by: Isadora Bugarin <[email protected]>
Co-authored-by: lucascr91 <[email protected]>

* [infra] Feedback messages in upload methods [issue #1059] (#1085)

* Creating dataclass config

* Success messages - create and update (table.py) using loguru

* feat: improve log level control

* refa: move logger config to Base.__init__

* Improving log level control

* Adjusting log level control function in base.py

* Fixing repeated 'DELETE' messages every time Table is replaced.

* Importing 'dataclass' from 'dataclasses' to make config work.

* Fixing repeated 'UPDATE' messages inside other functions.

* Defining a new script message format.

* Defining standard log messages for 'dataset.py' functions

* Defining standard log messages for 'storage.py' functions

* Defining standard log messages for 'table.py' functions

* Defining standard log messages for 'metadata.py' functions

* Adds standard configuration to billing_project_id in download.py

* Configuring billing_project_id in download.py

* Configuring config_path in base.py

Co-authored-by: Guilherme Salustiano <[email protected]>
Co-authored-by: Isadora Bugarin <[email protected]>

* update toml

Co-authored-by: Ricardo Dahis <[email protected]>
Co-authored-by: Lucas Moreira <[email protected]>
Co-authored-by: Gustavo Aires Tiago <[email protected]>
Co-authored-by: lucascr91 <[email protected]>
Co-authored-by: Isadora Bugarin <[email protected]>
Co-authored-by: Gabriel Gazola Milan <[email protected]>
Co-authored-by: Isadora Bugarin <[email protected]>
Co-authored-by: Guilherme Salustiano <[email protected]>
Co-authored-by: Isadora Bugarin <[email protected]>
9 people authored Mar 14, 2022
1 parent 86b12f3 commit a9b5807
Showing 22 changed files with 1,706 additions and 1,187 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/python-ci.yml
@@ -61,7 +61,7 @@ jobs:
run: |
cd python-package
pip install -r requirements-dev.txt
pip install coveralls
pip install coveralls toml
shell: bash
- name: Install package
run: |
@@ -109,7 +109,7 @@ jobs:
run: |
cd python-package
pip install -r requirements-dev.txt
pip install coveralls
pip install coveralls toml
shell: cmd
- name: Install package
run: |
3 changes: 2 additions & 1 deletion bases/br_bd_diretorios_brasil/dataset_config.yaml
@@ -45,4 +45,5 @@ github_url:

# Não altere esse campo.
# Data da última modificação dos metadados gerada automaticamente pelo CKAN.
metadata_modified: '2022-02-09T21:59:32.440801'

metadata_modified: '2022-02-09T21:59:32.440801'
7 changes: 0 additions & 7 deletions bases/test_dataset/README.md

This file was deleted.

7 changes: 7 additions & 0 deletions python-package/README.md
@@ -37,3 +37,10 @@ Publique nova versão
poetry version [patch|minor|major]
poetry publish --build
```

Versão Alpha e Beta

```
version = "1.6.2-alpha.3"
version = "1.6.2-beta.3"
```
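The alpha/beta strings above are PEP 440 pre-releases, so `1.6.2-alpha.3` is the same version as the `1.6.2a3` mentioned in the commits. A minimal check, sketched with the `packaging` library (an illustration only, not a project dependency):

```python
# Sketch: PEP 440 treats these alpha/beta spellings as the same version.
from packaging.version import Version

assert Version("1.6.2-alpha.3") == Version("1.6.2a3")
assert Version("1.6.2-beta.3") == Version("1.6.2b3")
assert Version("1.6.2a3") < Version("1.6.2b3") < Version("1.6.2")

print(Version("1.6.2-alpha.3"))  # prints the normalized form: 1.6.2a3
```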
3 changes: 2 additions & 1 deletion python-package/basedosdados/__init__.py
@@ -21,4 +21,5 @@
get_dataset_description,
get_table_columns,
get_table_size,
)
search
)
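The `__init__.py` change above exposes the new CKAN-backed helpers (`get_dataset_description`, `get_table_columns`, `get_table_size`) and `search` at package level. A hedged usage sketch — the argument names and the example ids are assumptions, not taken from this diff:

```python
# Hypothetical calls to the newly exported helpers; argument names and the
# dataset/table ids below are illustrative assumptions.
import basedosdados as bd

results = bd.search("vacinacao covid19")  # CKAN full-text search
description = bd.get_dataset_description("br_ms_vacinacao_covid19")
columns = bd.get_table_columns("br_ms_vacinacao_covid19", "microdados_vacinacao")
print(description)
print(columns)
```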
107 changes: 83 additions & 24 deletions python-package/basedosdados/cli/cli.py
@@ -77,10 +77,25 @@ def mode_text(mode, verb, obj_id):
default="raise",
help="[raise|update|replace|pass] if dataset alread exists",
)
@click.option(
"--dataset_is_public",
default=True,
help="Control if prod dataset is public or not. By default staging datasets like `dataset_id_staging` are not public.",
)
@click.option(
"--location",
default=None,
help="Location of dataset data. List of possible region names locations: https://cloud.google.com/bigquery/docs/locations",
)
@click.pass_context
def create_dataset(ctx, dataset_id, mode, if_exists):
def create_dataset(ctx, dataset_id, mode, if_exists, dataset_is_public, location):

Dataset(dataset_id=dataset_id, **ctx.obj).create(mode=mode, if_exists=if_exists)
Dataset(dataset_id=dataset_id, **ctx.obj).create(
mode=mode,
if_exists=if_exists,
dataset_is_public=dataset_is_public,
location=location,
)

click.echo(
click.style(
@@ -96,9 +111,9 @@ def create_dataset(ctx, dataset_id, mode, if_exists):
"--mode", "-m", default="all", help="What datasets to create [prod|staging|all]"
)
@click.pass_context
def update_dataset(ctx, dataset_id, mode):
def update_dataset(ctx, dataset_id, mode, location):

Dataset(dataset_id=dataset_id, **ctx.obj).update(mode=mode)
Dataset(dataset_id=dataset_id, **ctx.obj).update(mode=mode, location=location)

click.echo(
click.style(
@@ -110,10 +125,17 @@ def update_dataset(ctx, dataset_id, mode):

@cli_dataset.command(name="publicize", help="Make a dataset public")
@click.argument("dataset_id")
@click.option(
"--dataset_is_public",
default=True,
help="Control if prod dataset is public or not. By default staging datasets like `dataset_id_staging` are not public.",
)
@click.pass_context
def publicize_dataset(ctx, dataset_id):
def publicize_dataset(ctx, dataset_id, dataset_is_public):

Dataset(dataset_id=dataset_id, **ctx.obj).publicize()
Dataset(dataset_id=dataset_id, **ctx.obj).publicize(
dataset_is_public=dataset_is_public
)

click.echo(
click.style(
@@ -168,7 +190,12 @@ def cli_table():
help="[raise|replace|pass] actions if table config files already exist",
)
@click.option(
"--columns_config_url",
"--source_format",
default="csv",
help="Data source format. Only 'csv' is supported. Defaults to 'csv'.",
)
@click.option(
"--columns_config_url_or_path",
default=None,
help="google sheets URL. Must be in the format https://docs.google.com/spreadsheets/d/<table_key>/edit#gid=<table_gid>. The sheet must contain the column name: 'coluna' and column description: 'descricao'.",
)
@@ -180,14 +207,16 @@ def init_table(
data_sample_path,
if_folder_exists,
if_table_config_exists,
columns_config_url,
source_format,
columns_config_url_or_path,
):

t = Table(table_id=table_id, dataset_id=dataset_id, **ctx.obj).init(
data_sample_path=data_sample_path,
if_folder_exists=if_folder_exists,
if_table_config_exists=if_table_config_exists,
columns_config_url=columns_config_url,
source_format=source_format,
columns_config_url_or_path=columns_config_url_or_path,
)

click.echo(
@@ -232,9 +261,24 @@ def init_table(
help="[raise|replace|pass] actions if table config files already exist",
)
@click.option(
"--columns_config_url",
"--source_format",
default="csv",
help="Data source format. Only 'csv' is supported. Defaults to 'csv'.",
)
@click.option(
"--columns_config_url_or_path",
default=None,
help="Path to the local architeture file or a public google sheets URL. Path only suports csv, xls, xlsx, xlsm, xlsb, odf, ods, odt formats. Google sheets URL must be in the format https://docs.google.com/spreadsheets/d/<table_key>/edit#gid=<table_gid>.",
)
@click.option(
"--dataset_is_public",
default=True,
help="Control if prod dataset is public or not. By default staging datasets like `dataset_id_staging` are not public.",
)
@click.option(
"--location",
default=None,
help="google sheets URL. Must be in the format https://docs.google.com/spreadsheets/d/<table_key>/edit#gid=<table_gid>",
help="Location of dataset data. List of possible region names locations: https://cloud.google.com/bigquery/docs/locations",
)
@click.pass_context
def create_table(
@@ -247,7 +291,10 @@ def create_table(
force_dataset,
if_storage_data_exists,
if_table_config_exists,
columns_config_url,
source_format,
columns_config_url_or_path,
dataset_is_public,
location,
):

Table(table_id=table_id, dataset_id=dataset_id, **ctx.obj).create(
@@ -257,7 +304,10 @@ def create_table(
force_dataset=force_dataset,
if_storage_data_exists=if_storage_data_exists,
if_table_config_exists=if_table_config_exists,
columns_config_url=columns_config_url,
source_format=source_format,
columns_config_url_or_path=columns_config_url_or_path,
dataset_is_public=dataset_is_public,
location=location,
)

click.echo(
@@ -297,23 +347,32 @@ def update_table(ctx, dataset_id, table_id, mode):
@click.argument("dataset_id")
@click.argument("table_id")
@click.option(
"--columns_config_url",
"--columns_config_url_or_path",
default=None,
help="""\nGoogle sheets URL. Must be in the format https://docs.google.com/spreadsheets/d/<table_key>/edit#gid=<table_gid>.
\nThe sheet must contain the columns:\n
- nome: column name\n
- descricao: column description\n
- tipo: column bigquery type\n
- unidade_medida: column mesurement unit\n
- dicionario: column related dictionary\n
- nome_diretorio: column related directory in the format <dataset_id>.<table_id>:<column_name>
help="""\nFills columns in table_config.yaml automatically using a public google sheets URL or a local file. Also regenerate
\npublish.sql and autofill type using bigquery_type.\n
\nThe sheet must contain the columns:\n
- name: column name\n
- description: column description\n
- bigquery_type: column bigquery type\n
- measurement_unit: column measurement unit\n
- covered_by_dictionary: column related dictionary\n
- directory_column: column related directory in the format <dataset_id>.<table_id>:<column_name>\n
- temporal_coverage: column temporal coverage\n
- has_sensitive_data: the column has sensitive data\n
- observations: column observations\n
\nArgs:\n
\ncolumns_config_url_or_path (str): Path to the local architecture file or a public google sheets URL.\n
Path only supports csv, xls, xlsx, xlsm, xlsb, odf, ods, odt formats.\n
Google sheets URL must be in the format https://docs.google.com/spreadsheets/d/<table_key>/edit#gid=<table_gid>.\n
""",
)
@click.pass_context
def update_columns(ctx, dataset_id, table_id, columns_config_url):
def update_columns(ctx, dataset_id, table_id, columns_config_url_or_path):

Table(table_id=table_id, dataset_id=dataset_id, **ctx.obj).update_columns(
columns_config_url=columns_config_url,
columns_config_url_or_path=columns_config_url_or_path,
)

click.echo(
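To make the new `--columns_config_url_or_path` flow concrete, here is a hedged sketch of a local architecture file using the column headers listed in the help text above, passed to `Table.update_columns`. The dataset/table ids and cell values are placeholders, and it assumes a plain CSV with exactly those headers is accepted and that the default project config is already in place:

```python
# Sketch: build an "architecture" file with the documented columns and feed it
# to update_columns. Ids and values below are illustrative only.
import pandas as pd
from basedosdados import Table

architecture = pd.DataFrame(
    {
        "name": ["ano", "sigla_uf"],
        "description": ["Ano de referência", "Sigla da unidade federativa"],
        "bigquery_type": ["INT64", "STRING"],
        "measurement_unit": ["year", ""],
        "covered_by_dictionary": ["no", "no"],
        "directory_column": ["", "br_bd_diretorios_brasil.uf:sigla"],
        "temporal_coverage": ["2001(1)2020", ""],
        "has_sensitive_data": ["no", "no"],
        "observations": ["", ""],
    }
)
architecture.to_csv("architecture.csv", index=False)

# Regenerates table_config.yaml columns (and publish.sql) from the local file.
Table(dataset_id="my_dataset", table_id="my_table").update_columns(
    columns_config_url_or_path="architecture.csv"
)
```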
@@ -48,8 +48,8 @@ Email: {{ data_cleaned_by.email }}
{% call input(partitions) -%}
Partições (Filtre a tabela por essas colunas para economizar dinheiro e tempo)
---------
{% if (partitions.split(',') is not none) -%}
{% for partition in partitions.split(',') -%}
{% if (partitions is not none) -%}
{% for partition in partitions -%}
- {{ partition }}
{% endfor -%}
{%- endif %}
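The template hunk above switches `partitions` from a comma-separated string to a list, so the loop iterates over it directly. A self-contained render of the same snippet (assuming `jinja2` is installed) shows the new behaviour:

```python
# Minimal reproduction of the template change: `partitions` is now a list,
# so the template loops over it instead of calling .split(',').
from jinja2 import Template

snippet = (
    "Partições (Filtre a tabela por essas colunas "
    "para economizar dinheiro e tempo)\n"
    "---------\n"
    "{% if (partitions is not none) -%}\n"
    "{% for partition in partitions -%}\n"
    "- {{ partition }}\n"
    "{% endfor -%}\n"
    "{%- endif %}"
)

print(Template(snippet).render(partitions=["ano", "sigla_uf"]))
# renders one "- <partition>" line per element, e.g. "- ano" and "- sigla_uf"
```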
10 changes: 9 additions & 1 deletion python-package/basedosdados/constants.py
@@ -1,6 +1,14 @@
__all__ = ["constants"]
__all__ = ["config", "constants"]

from enum import Enum
from dataclasses import dataclass


@dataclass
class config:
verbose: bool = True
billing_project_id: str = None
project_config_path: str = None


class constants(Enum):
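The new `config` dataclass holds package-level defaults that other modules read as class attributes. A short sketch of setting them once per session (the project id is a placeholder, and the effect of `verbose` on the loguru setup is an assumption based on the field name):

```python
# Set package-level defaults once; later calls can omit billing_project_id.
from basedosdados.constants import config

config.billing_project_id = "my-gcp-project"  # placeholder GCP project id
config.verbose = False                        # assumed to quiet log output
```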
25 changes: 19 additions & 6 deletions python-package/basedosdados/download/download.py
@@ -18,6 +18,7 @@
BaseDosDadosInvalidProjectIDException,
BaseDosDadosNoBillingProjectIDException,
)
from basedosdados.constants import config, constants
from pandas_gbq.gbq import GenericGBQException


@@ -49,6 +50,10 @@ def read_sql(
Query result
"""

# fall back to the package-level config when billing_project_id is not given
if billing_project_id is None:
billing_project_id = config.billing_project_id

try:
# Set a two hours timeout
bigquery_storage_v1.client.BigQueryReadClient.read_rows = partialmethod(
@@ -86,8 +91,8 @@ def read_sql(
def read_table(
dataset_id,
table_id,
query_project_id="basedosdados",
billing_project_id=None,
query_project_id="basedosdados",
limit=None,
from_file=False,
reauth=False,
@@ -101,10 +106,10 @@ def read_table(
table_id (str): Optional.
Table id available in basedosdados.dataset_id.
It should always come with dataset_id.
query_project_id (str): Optional.
Which project the table lives. You can change this you want to query different projects.
billing_project_id (str): Optional.
Project that will be billed. Find your Project ID here https://console.cloud.google.com/projectselector2/home/dashboard
query_project_id (str): Optional.
Which project the table lives in. Change this if you want to query a different project.
limit (int): Optional.
Number of rows to read from table.
from_file (boolean): Optional.
@@ -122,6 +127,10 @@
Query result
"""

# fall back to the package-level config when billing_project_id is not given
if billing_project_id is None:
billing_project_id = config.billing_project_id

if (dataset_id is not None) and (table_id is not None):
query = f"""
SELECT *
@@ -147,8 +156,8 @@ def download(
query=None,
dataset_id=None,
table_id=None,
query_project_id="basedosdados",
billing_project_id=None,
query_project_id="basedosdados",
limit=None,
from_file=False,
reauth=False,
@@ -180,10 +189,10 @@ def download(
table_id (str): Optional.
Table id available in basedosdados.dataset_id.
It should always come with dataset_id.
query_project_id (str): Optional.
Which project the table lives. You can change this you want to query different projects.
billing_project_id (str): Optional.
Project that will be billed. Find your Project ID here https://console.cloud.google.com/projectselector2/home/dashboard
query_project_id (str): Optional.
Which project the table lives in. Change this if you want to query a different project.
limit (int): Optional
Number of rows.
from_file (boolean): Optional.
@@ -201,6 +210,10 @@
"Either table_id, dataset_id or query should be filled."
)

# fall back to the package-level config when billing_project_id is not given
if billing_project_id is None:
billing_project_id = config.billing_project_id

client = google_client(query_project_id, billing_project_id, from_file, reauth)

# makes sure that savepath is a filepath and not a folder
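With the fallback above, the download helpers pick up `config.billing_project_id` whenever the `billing_project_id` argument is omitted. A hedged sketch using the `read_table` signature shown in this diff (the ids are illustrative, and it assumes `read_table` is re-exported at package level):

```python
# Uses the package-level default billing project set via config;
# dataset/table ids are illustrative.
import basedosdados as bd
from basedosdados.constants import config

config.billing_project_id = "my-gcp-project"  # placeholder

df = bd.read_table(
    dataset_id="br_bd_diretorios_brasil",
    table_id="etnia_indigena",
    limit=10,
)
print(df.head())
```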
