data.org, in partnership with a global financial services institution, Harvard OpenDP, and University of Javariana, has launched a Privacy Enhancing Technologies (PETs) for Public Health Challenge. Up to five winners will be awarded $50,000 each.
This pioneering competition invites academic innovators (Masters, PhDs, Postdocs, faculty, etc.) in differential privacy, epidemiology, data science, and machine learning, etc. to create privacy solutions that will help unlock sensitive data for public health advancements and drive social impact.
You can also find more information about the challenge, timing, and funding awards, etc. by visiting Privacy-Enhancing Technologies (PETs) for Public Health Challenge - https://data.org/initiatives/pets-challenge/
In this repository you will find examples for some of the Epidemiological decision-making policy scenario of the challenge using open source databases available online for the 4 selected locations.The notebooks include examples on:
- Effective reproduction number estimations using {EpiEstim} (A. Cori, et al. 2013) for Bogotá D.C., Medellín and Brasília
- Nowcasting of COVID-19 cases for Bogotá D.C. using {EpiNow2}
- Forecasting of deaths in Bogotá D.C. using {sktime}
The sources for both Bogotá D.C. and Medellín cities are alocated by the National Institute of Health. There you will be able to find the updated cases dataset and a historical (legacy) list of datasets corresponding to snapshots of the data available in real time.
In the following table you can see a summary of some variables of interests in the updated dataset:
Name | Description | Type |
---|---|---|
fecha reporte web | Date of publication in the website | POSIXct |
fecha de notificación | Notification date in the SIVIGILA platform | POSIXct |
Código DIVIPOLA municipio | Municipality code (11001 for Bogotá, 5001 for Medellín) |
Integer |
Nombre municipio | Municipality name | Character |
Edad | Age | Integer |
Unidad de medida de edad | Age measurement unit (1-years, 2-months, 3-days) | Integer |
Sexo | Sex | Character |
Estado | Patient state (used to filter deceased cases) | Character |
Recuperado | Recuperado (Recovered), Fallecido (Deceased),N/A (Deceased by other causes than COVID) | Character |
Fecha de inicio de síntomas | Date of onset | POSIXct |
Fecha de muerte | Date of death | POSIXct |
Fecha de diagnóstico | Laboratory confirmation date | POSIXct |
Fecha de recuperación | Date of recovery | POSIXct |
This data ranges from March 2020 up to January 2024. You can find a pipeline to read and group this dataset in the script
download_covid19_data.R
.
The available legacy data consists of individual tables for each date. During
early stages of the pandemic, public health agencies had not agreed yet on what
structure should the data have, which is why not all the tables have the same
structure. Similar variables to those in the table above may be available on
each snapshot of the data under different names and with different data types.
The script download_col_legacy_data.R
reads the legacy data directly from the
source,
looks for the notification and onset dates, and concatanates the snapshots in
a single data frame labelling each snapshot by its register date. This is done
for one snapshot per week from April to October 2020. Then, the incidences by
notification and onset are computed grouping the data. This data is used to
correct right truncation bias due to notfication delay in the
Nowcasting notebook.
Additional spatial information at the individual level can be found in the confirmed cases for Bogotá city published by datosabiertos.bogota.gov.co, where the column Localidad
refers to the residence area of each case. NOTE: As of 14/08/2024, this dataset is now presented as a yearly count of deaths and cases by localidad, age group and sex. Each residence area can correspond to several postcodes according to the following table (extracted from here):
Localidad | Códigos Postales |
---|---|
Usaquén | 110111-110151 |
Chapinero | 110211-110231 |
Santa Fe | 110311-110321 |
San Cristóbal | 110411-110441 |
Usme | 110511-110571 |
Tunjuelito | 110611-110621 |
Bosa | 110711-110741 |
Kennedy | 110811-110881 |
Fontibón | 110911-110931 |
Engativá | 111011-111071 |
Suba | 111111-111176 |
Barrios Unidos | 111211-111221 |
Teusaquillo | 111311-111321 |
Los Mártires | 111411 |
Antonio Nariño | 111511 |
Puente Aranda | 111611-111631 |
La Candelaria | 111711 |
Rafael Uribe Uribe | 111811-111841 |
Ciudad Bolívar | 111911-111981 |
Sumapaz | 112011-112041 |
Otherwise, the dataset is fairly similar to that from the INS referred above. Which you can get by filtering:
zipcodes_pets <- df_zipcodes %>%
filter((`country code` == "CO" & `admin name1` == "Antioquia" & `admin name2` == "Medellín"))
Similarly for all the locations:
zipcodes_pets <- df_zipcodes %>%
filter(
(`country code` == "BR" & `admin name1` == "Distrito Federal") |
(`country code` == "CL" & `admin name1` == "Región Metropolitana") |
(`country code` == "CO" & `admin name1` == "Bogota, D.C.") |
(`country code` == "CO" & `admin name1` == "Antioquia" & `admin name2` == "Medellín")
)
A compelling compilation of COVID-19 related data for Brazil is available in the
covid19br GitHub repository by Wesley Cota.
The data set cases-brazil-states.csv
contains the following relevant variables
(among others):
name | description | Type |
---|---|---|
epi_week | Epidemiological week | Integer |
date | Notification date | Date |
country | Name of the country (always "Brazil" ) |
Character |
state | Name of the federative unit ("DF" for Brazilia) |
Character |
city | Name of the municipality | Character |
newDeaths | Number of reported new deaths | Integer |
deaths | Total number of deaths | Integer |
newCases | Number of reported new cases | Integer |
totalCases | Total number of cases | Integer |
This data ranges from March 2020 up to March 2023. A complete description of the dataset can be found in the English version of the README in the source repository.
Although real-time snapshots of the data are not directly available, tt may be
possible to extract them from the git history of the repository by searching for
old versions of the cases-brazil-states.csv
. Moreover, a complete example
nowcasting for Brazil can be found in the
Observatório Covid-19 BR project, where the
Nowcaster package is employed.
Individual level data about deaths caused by COVID-19 in Santiago de Chile can be found in Centralized open repository of state, which contains a cumulative register of deceases. The following table is a summarized data dictionary for this dataset:
name | description | Type |
---|---|---|
FECHA_DEF | Date of death | Date |
SEXO_NOMBRE | Biological sex | Character |
EDAD_TIPO | Age measurement unit (1-years, 2-months) | Integer |
EDAD_CANT | Age | Character |
CODIGO_COMUNA | Code of the residence commune of the diseased | Integer |
COMUNA | Residence commune of the diseased ("Santiago" for Santiago de Chile) |
Character |
GLOSA_SUBCATEGORIA_DIAG1 | Cause of death | Character |
CODIGO_CATEGORIA_DIAG1 | Code of the cause of death ("U071" for identified covid19 cases) |
Integer |
This data ranges from April 2020 up to February 2024. A complete data dictionary can be downloaded from the source. Similarly as for
the INS' data from Colombia, in the download_covid19_data.R
script you can find
a simple pipeline to clean and group this dataset to obtain the daily incidence
of deaths.