The repository forms part of the LIFE STRESS project. The LIFE STRESS project has received funding from the LIFE Programme of the European Union. The contents of this publication are the sole responsibility of Theia Finance Labs and do not necessarily reflect the opinion of the European Union. Project: 101074271 — LIFE21-GIC-DE-Stress.
The preprocessing functions included in this repo have for purpose to generate dataframes to use as an input of the climate stress testing tool TRISK available on this repository :
TRISK input dataframes follow this structure :
The goal of STDataMGMT is to provide the input data required for the stress test that cannot be shared as part of a public repository. The 2DII stress test needs several external data inputs that usually require a license which prohibits redistributing the data. 2DII does not have the right to publicly share that data though, which is why this repository needs to stay private.The repository will be used to point the main stress testing code to, when said data needs to be loaded. This allows publishing the code containing 2DII’s main climate stress testing methodology open source.
The underlying logic of the data preparation is the following quintet:
-
raw data are stored with the affix raw_ in the data-raw folder
-
in the R folder the data preparation function is stored with the affix prepare_/aggregate_
-
in the data-raw folder the file that calls the data preparation function is stored with the affix run_
-
prepared data are stored in the data-raw folder without affix
-
in the test folder the test of the data preparation function is stored with the affix test_
- To generate all stress test input files, create the required inputs in
data-raw/DBs
, then run thedata-raw/run_workflow.R
file and retrieve the newly generated files indata-raw
. - An example usage of the preprocessing functions using synthetic data
can be found in
vignettes/run_synthetic_workflow.Rmd
- generated by:
- data_raw/prep_datalake.R
- description:
- Contains the matching between IDs of different sources.
- unique index :
- isin
- company_id
- columns:
- isin: ISIN of a company. Multiple ISINs can match the same company_id. Can be NA if and only if company_id is not NA
- company_name: name of the company. Unique for each company_id
- company_id: id of a company. can be NA if and only if isin is not NA
- generated by:
- data_raw/prep_datalake.R
- description:
- Contains financial features for different ISINs.
- unique index:
- isin
- columns:
- isin: ISIN of company
- ald_sector: location of the asset, in ISO2c format. Can be obtained from the first 2 characters of the ISIN.
- net_profit_margin: net profit margin (between 0 and 1)
- pd: probability of default of an asset (between 0 and 1)
- debt_equity_ratio : debt equity ratio of an asset (between 0 and 1)
- volatility: volatility of an asset (between 0 and 1)
- asset_drift: drift of an asset (between 0 and 1)
- generated by:
- data_raw/prep_datalake.R
- description:
- contains the production information for each company
- unique index:
- company_id
- ald_sector
- ald_business_unit
- ald_location
- columns:
- company_id: id of the company
- ald_sector: sector of the asset (ex: Power, Oil&Gas.. ) see
STDataMgmt::production_types$ald_sector
for all possible values - ald_business_unit: business unit of the asset (ex: RenewablesCap,
OilCap) see
STDataMgmt::production_types$ald_business_unit
for all possible values. - ald_location: country where is located the asset, in ISO2c format
- activity_unit: unit of the production (ex: MW, GJ, …) see
STDataMgmt::production_types$ald_production_unit
for all possible values. - Equity Ownership YYYY : amount of production for a given year. 1 column per year.
- generated by:
- data_raw/prep_datalake.R
- description:
- contains the emissions information for each company
- unique index:
- company_id
- ald_sector
- ald_business_unit
- ald_location
- columns:
- company_id: id of the company
- ald_sector: sector of the asset (ex: Power, Oil&Gas.. ) see
STDataMgmt::production_types$ald_sector
for all possible values - ald_business_unit: business unit of the asset (ex: RenewablesCap,
OilCap) see
STDataMgmt::production_types$ald_business_unit
for all possible values. - ald_location: country where is located the asset, in ISO2c format
- activity_unit: unit of the production (ex: tCO2, tCO2e) see
STDataMgmt::production_types$emissions_factor_unit
for all possible values. - Equity Ownership YYYY : amount of emissions for a given year. 1 column per year.
- generated by:
- data_raw/prep_datalake.R
- description:
- Contains the linking information between companies, parent and subsidiaries.
- unique index:
- parent_company_id
- subsidiary_company_id
- columns:
- parent_company_id: the parent company_id
- subsidiary_company_id: the subsidiary company_id
- ownership_level: rank of the subsidiary towards the parent company. If equal to 0, then parent_company_id == subsidiary_company_id. If equal to 1, then the parent company is the highest parent of this subsidiary.
- generated by :
- data-raw/run_prepare_prewrangled_financial_data_stress_test.R
- description:
- Financial data input for the stress test.
- unique index:
- company_id
- columns:
- company_id: the company id
- pd: the probability of default
- net_profit_margin: the net profit margin of a company
- debt_equity_ratio: the debt equity ratio of a company
- volatility: the volatility of a company
- asset_drift: the asset drift of a company
- net_profit_margin_rawdata_ratio : the ratio between financial indicators coming from the raw input data, and how many came from averages. The final indicator of a company can be mixed, as it can be an aggregate of indicators from multiple countries if a company owns assets in different countries.
- debt_equity_ratio_rawdata_ratio: idem for debt_equity_ratio
- volatility_rawdata_ratio: idem for volatiliy
- asset_drift_rawdata_ratio: idem for asset_drift
- generated by :
- data-raw/run_prepare_abcd_tress_test_input.R
- description :
- production and emissions factor for different companies. An asset here refers to the eventual aggregation of the production and emissions of multiple ISINs from the raw company data (DB_company_activities and DB_company_emissions)
- unique index:
- company_id
- ald_sector
- ald_business_unit
- scenario_geography
- year
- columns:
- company_id: id of the company
- company_name: name of the company
- ald_sector: sector of the asset
- ald_business_unit: business unit of the asset
- scenario_geography: scenario geography of the asset
- year: year of production
- ald_production_unit: production unit of the asset
- emissions_factor_unit: unit of the emission factor
- plan_tech_prod: production in the ald_sector/ald_business_unit, for a given year
- plan_emission_factor: emission in the ald_sector/ald_business_unit, for a given year
- plan_sec_prod: production in the ald_sector, for a given year
In order to get the scenario data into the required format as defined in
r2dii.climate.stress.test
, the function prepare_scenario_data()
assumes that input in the format of data found on the dropbox under
Processed Data is available. The input argument should point to a
corresponding file. The start year of the scenario file must be defined,
as does the output directory (NOT the file name). Please do not write
the output data to any directory in this repository, unless you are
specifically tasked with updating the default auxiliary files for the
2DII stress test.
ADD INFO ON PACTA ROUTINE ONCE CLEAR
Production data preparation uses the PAMS dataset from Asset Resolution as an input, which is stored on the dropbox under ST_INPUTS_PRODUCTION. Start year, as well as time-horizon, scenario sources, and sectors are defined manually.Data are aggregated to company, technology, geography and year level and production as well as emission factors are retained.
We remove entries with faulty or missing ownership information and sectors that we cannot cover in the stress test as of now. Entries with missing information on ALD locations are removed as well.For all entries that have missing emission factors, we mean impute based on the corresponding combination of country, technology, year or sub region, technology, year.
The function prepare_prewrangled_capacity_factors_WEOYYYY()
reads in
raw power generation and capacity data and wrangles that data into the
required format for usage in r2dii.climate.stress.test
. It uses input
files that are to be found on the dropbpox under Raw_Data, under
WEOYYYY_Raw_Data.csv
, with YYYY being the year of the release of the
report.The user furthermore has to set a start year and an end year of
the preparation time frame. The start year should correspond with the
year of publication and the end year should be the same as the one used
in r2dii.climate.stress.test
.
Beside wrangling the data into the expected format in terms of pivoting
and renaming columns, the function will linearly interpolate missing
values between the start and end years of the preparation time frame for
power generation and capacity. It will transform both indicators into
the same unit (GW) and calculate capacity factors per scenario, scenario
geography, sector, technology and year. It also ensures the
sector/technology breakdown is the same as in
r2dii.climate.stess.test
, as long as the input file is an IEA WEO
release.
Lastly, the directory to which the resulting file will be written has to be set. By default, the function will not overwrite any file within this repository. So if you are trying to update the contents of the repository, you will have to do that explicitly.
The function prepare_price_data_long_WEOYYYY()
reads price data based
on the raw data format provided by IEA WEO and wrangles that data into
clean long format.The start year has to be set manually. For fossil
fuels, price data by scenario is derived from Ask Antonio. For power
generation, data on the levelized cost of electricity (LCOE) by scenario
is taken from
https://www.iea.org/articles/levelised-cost-of-electricity-calculator
Ask Antonio. The function prepare_lcoe_adjusted_price_data_weo()
reads in the long format prepared price data from IEA scenarios and
filters LCOE for the power sector. Using the LCOE for power generation
implies that this indicator must be transformed into effective prices
for each company, by applying the following transformation:
- LCOE = (1 - net_profit_margin) * price
- price = LCOE / (1 - net_profit_margin)
This transformation happens in r2dii.cliamte.stress.test
though, as it
requires company level information.
We want to be able to use all data at least on a global granularity, with additional layers of granularity being a nice-to-have. For some variables, the raw data only contain a list of selected regional data. In these cases, we approximate the global price as a simple arithmetic mean of the regional prices. A more sophisticated approximation would probably take into account the sizes of the markets and add some missing regions, but for our purposes, this is still a reasonable approximation. Specifically, we use this approximation for Coal and Gas extraction as well as for all given types of power generation.
In order to obtain all relevant types of financial information for the companies for which we have production data (abcd_stress_test_input), we need to merge and clean the raw financial data from Eikon with the companies data provided by Asset Resolution.
prepare_financial_data()
requires the following inputs : -
financial_data the merge on isin between DB_ids and DB_assets_eikon.
Not all isin have to match a company_id, but the highest the matching
rate, the better quality will the output data be. - companies_data
DB_companies_activities. Brings the information on missing company_id
and ald_sector column. - ownership_tree DB_ownership_tree -
minimum_sample_size minimum sample size tolerated to use an
average - minimum_ratio_sample minimum ratio of a sample tolerated
to use an average - allowed_range_npm tolerated range for the
net_profit margin, c(min_npm, max_npm)
prepare_financial_data()
works by applying the following steps : -
Remove implausible values from the raw data, by setting the cell where
it appears to NA . Those values will be filled with averages later on. -
Using the country associated to each asset, map each country to a world
region. We’re not using the scenarios regions, because that would create
duplicate rows. - When a company owns several assets in different
countries, but same region, aggregate the financial indicators from
country to region. - Add new rows with NA financial value for the
missing company_id from companies_data. - If the ownership tree is
provided as an input, match missing financial values of subsidiaries to
their closest parent compahy, if the parent company has non-NA financial
values in the raw data. - Compute averages, and pick the most granular
one possible to fill missing values - As financial indicators are still
at the (company_id, ald_sector, region) granularity level, aggregate
those to the company level to be used in the stress test.
EU LIFE Project Grant
Co-funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or CINEA. Neither the European Union nor the granting authority can be held responsible for them.
Scientific Transition Risk Exercises for Stress tests & Scenario Analysis has received funding from the European Union’s Life programme under Grant No. LIFE21-GIC-DE-Stress under the LIFE-2021-SAP-CLIMA funding call.