Concatenate datasets #17

Open
wants to merge 4 commits into
base: master
13 changes: 13 additions & 0 deletions README.md
@@ -19,6 +19,19 @@ source environment/bin/activate
pip install -r requirements.txt
```

### Downloading the proposition CSVs

```
python scripts/concatenate_datasets.py --start=<starting_year> --end=<ending_year>
```

Note that, for some reason, the dataset downloaded by this script can only be read back when the parameters sep='|' and lineterminator='\n' are passed, as shown below.

```
import pandas as pd
df = pd.read_csv("data/propositions.csv", sep='|', lineterminator="\n")
```
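
As a hedged illustration of the call above, the sketch below reads a small in-memory sample instead of `data/propositions.csv`; the sample rows are made up. It assumes the file was written with the DataFrame index included, which is what `to_csv` does by default. In that case the index comes back as an extra unnamed first column unless `index_col=0` is passed.

```python
import io

import pandas as pd

# Made-up sample in the layout the script writes: '|' separators,
# '\n' line endings, and a leading unnamed index column.
sample = "|id|siglaTipo\n0|1|PL\n1|2|PEC\n"

# index_col=0 turns the unnamed first column back into the index
# (an assumption about the desired result, not part of the README).
df = pd.read_csv(io.StringIO(sample), sep="|", lineterminator="\n", index_col=0)
print(df.columns.tolist())  # ['id', 'siglaTipo']
```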

### Extracting justifications

After downloading the proposition files:
50 changes: 45 additions & 5 deletions scripts/concatenate_datasets.py
@@ -10,15 +10,55 @@
start = args.start
end = args.end

if start < 2000 or end < 2000:
    raise ValueError("There are only propositions from 2000 forward")

if start > 2020 or end > 2020:

Review comment: I think we could get the current year here with a library such as time.
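
A minimal sketch of this suggestion, using the standard-library `datetime` module rather than `time`. The `validate_years` helper, the `FIRST_YEAR` constant, and the `today` parameter are hypothetical names introduced for illustration; they are not part of the PR.

```python
from datetime import date

FIRST_YEAR = 2000  # earliest year accepted by the checks in this PR


def validate_years(start, end, today=None):
    """Raise ValueError if the year range is invalid.

    `today` is a hypothetical hook that lets callers (and tests) pin the
    current date; by default the real current date is used.
    """
    current_year = (today or date.today()).year

    if start < FIRST_YEAR or end < FIRST_YEAR:
        raise ValueError("There are only propositions from 2000 forward")

    # The upper bound is computed at run time instead of hard-coding 2020.
    if start > current_year or end > current_year:
        raise ValueError("Cannot get propositions from the future")

    if end < start:
        raise ValueError("Start year must not be greater than end year")
```

With this shape the upper bound moves forward on its own each year, and passing a fixed `today` keeps the check deterministic in tests.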

    raise ValueError("Cannot get propositions from the future")

if end < start:
    raise ValueError("Start date should be smaller than end date")
    sys.exit()

df = None
for year in range(start, end):

dtypes = {
    'id': 'int64',
    'uri': 'object',
    'siglaTipo': 'object',
    'numero': 'int64',
    'ano': 'int64',
    'codTipo': 'int64',
    'descricaoTipo': 'object',
    'ementa': 'object',
    'ementaDetalhada': 'object',
    'keywords': 'object',
    'dataApresentacao': 'object',
    'uriOrgaoNumerador': 'object',
    'uriPropAnterior': 'float64',
    'uriPropPrincipal': 'object',
    'uriPropPosterior': 'object',
    'urlInteiroTeor': 'object',
    'urnFinal': 'float64',
    'ultimoStatus_dataHora': 'object',
    'ultimoStatus_sequencia': 'int64',
    'ultimoStatus_uriRelator': 'object',
    'ultimoStatus_idOrgao': 'float64',
    'ultimoStatus_siglaOrgao': 'object',
    'ultimoStatus_uriOrgao': 'object',
    'ultimoStatus_regime': 'object',
    'ultimoStatus_descricaoTramitacao': 'object',
    'ultimoStatus_idTipoTramitacao': 'int64',
    'ultimoStatus_descricaoSituacao': 'object',
    'ultimoStatus_idSituacao': 'float64',
    'ultimoStatus_despacho': 'object',
    'ultimoStatus_url': 'object',
}

for year in range(start, end+1):
    url_base = f'https://dadosabertos.camara.leg.br/arquivos/proposicoes/csv/proposicoes-{year}.csv'
    if df is None:
        df = pd.read_csv(url_base, sep=';')
        df = pd.read_csv(url_base, sep=';', dtype=dtypes)
    else:
        df = pd.concat([pd.read_csv(url_base, sep=';'), df])
        df = pd.concat([pd.read_csv(url_base, sep=';', dtype=dtypes), df])

df.to_csv("data/propositions.csv", sep=';')
df.to_csv("data/propositions.csv", sep='|', line_terminator='\n')
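
The loop in this hunk grows `df` by calling `pd.concat` once per year. A possible alternative, sketched below under stated assumptions, is to collect the yearly frames in a list and concatenate once. The `concatenate_years` helper and the in-memory sample data are hypothetical and stand in for the yearly files fetched from the URL; unlike the PR, which prepends each newer year, this sketch keeps the years in ascending order.

```python
import io

import pandas as pd


def concatenate_years(sources, dtypes=None):
    """Read each ';'-separated source and stack the rows in the given order."""
    frames = [pd.read_csv(src, sep=";", dtype=dtypes) for src in sources]
    # ignore_index=True renumbers the rows 0..n-1, so the combined frame
    # has no repeated index labels from the individual yearly files.
    return pd.concat(frames, ignore_index=True)


# Made-up stand-ins for two yearly files.
csv_2019 = "id;siglaTipo;ano\n1;PL;2019\n2;PEC;2019\n"
csv_2020 = "id;siglaTipo;ano\n3;REQ;2020\n"

combined = concatenate_years(
    [io.StringIO(csv_2019), io.StringIO(csv_2020)],
    dtypes={"id": "int64", "siglaTipo": "object", "ano": "int64"},
)
print(combined)
```

A note on the final `to_csv` call: from pandas 1.5 onward the keyword is spelled `lineterminator`, and the `line_terminator` spelling used in the PR was removed in pandas 2.0, so the call as written depends on the installed pandas version.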
2 changes: 2 additions & 0 deletions scripts/download_propositions.py
@@ -23,6 +23,8 @@ def subset_dataset(df, start, end):
    df = df[df['siglaTipo'].isin(['PEC', 'PL', 'PLP',
                                  'MPV', 'PLV', 'PDL',
                                  'PRC', 'REQ', 'RIC'])]

    df = df.reset_index(drop=True)
    return df

def main():
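
To illustrate what the added `reset_index(drop=True)` changes in `subset_dataset`, here is a small sketch on made-up data. The type codes `XYZ` and `ABC` are invented placeholders for rows that the filter discards.

```python
import pandas as pd

# Made-up rows; 'XYZ' and 'ABC' stand for types outside the kept list.
df = pd.DataFrame({"siglaTipo": ["PL", "XYZ", "PEC", "ABC", "REQ"]})

kept = df[df["siglaTipo"].isin(["PEC", "PL", "REQ"])]
print(list(kept.index))  # [0, 2, 4]: filtering leaves gaps in the index

# drop=True discards the old labels instead of keeping them as a column.
kept = kept.reset_index(drop=True)
print(list(kept.index))  # [0, 1, 2]
```

Without the reset, positional assumptions such as `df.loc[i]` for `i` in `range(len(df))` would fail on the filtered frame.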