Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement "best case" and "worst case" for emission_profile() on product-level #242

Closed
maurolepore opened this issue Sep 13, 2023 · 19 comments
Assignees
Labels
enhancement New feature or request

Comments

@maurolepore
Copy link
Contributor

maurolepore commented Sep 13, 2023

Currently the tiltIndicators are calculated using only an equal-weight approach which assumes that the company produces products with low, medium, and high risk equally.

Anne shared code (here) what implements it. And I reflected on it here.

@maurolepore maurolepore self-assigned this Sep 13, 2023
@maurolepore maurolepore changed the title Add the ability to calculate indicators using "best case", "worst case", and "equal weight" (current default) Implement "best case", "worst case" Sep 13, 2023
@maurolepore maurolepore changed the title Implement "best case", "worst case" Implement "best case" and "worst case" Sep 13, 2023
@AnneSchoenauer
Copy link

AnneSchoenauer commented Sep 28, 2023

Update

@maurolepore please note that this issue refers to all indicators. So to the emission profile, emission upstream profile, sector profile, sector upstream profile. You can find how I would implement it in this output table here always on the company level data for the emission profile and sector profile. For the upstream profiles I still need to clarify the methodological questions.

Please keep this still in the backlog and we will also have further disucssions once we decide to focus on this again.

@AnneSchoenauer AnneSchoenauer added the enhancement New feature or request label Jan 4, 2024
@AnneSchoenauer
Copy link

For emission_share_wc find this reprex:

library(dplyr)
#> 
#> Attache Paket: 'dplyr'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#> 
#>     filter, lag
#> Die folgenden Objekte sind maskiert von 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

# Sample data
data <- tribble(
  ~companies_id, ~company_name, ~country, ~emission_share_ew, ~emission_category, ~benchmark,
  "040-tegelzetter_00000005336021-634273001", "040 tegelzetter", "netherlands", 0,   "high", "all",
  "040-tegelzetter_00000005336021-634273001", "040 tegelzetter", "netherlands", 1,   "medium", "all",
  "040-tegelzetter_00000005336021-634273001", "040 tegelzetter", "netherlands", 0,   "low", "all",
  "a-beverwijk-en-zonen-bv_nld009707-00101",  "a. beverwijk en zonen b.v.", "netherlands", 0.5, "high", "all",
  "a-beverwijk-en-zonen-bv_nld009707-00101",  "a. beverwijk en zonen b.v.", "netherlands", 0,   "medium", "all",
  "a-beverwijk-en-zonen-bv_nld009707-00101",  "a. beverwijk en zonen b.v.", "netherlands", 0.5, "low", "all"
)

# Automatically calculate emission_share_wc
add_worst_case_emissions <- function(data) {
  risk_order <- c("low", "medium", "high")
  
  grouped_data <- data %>%
    group_by(benchmark, companies_id, company_name) %>%
    mutate(
      max_risk_category = risk_order[max(match(emission_category[emission_share_ew > 0], risk_order))],
      emission_share_wc = ifelse(emission_category == max_risk_category & emission_share_ew > 0,
                                  1, 0)
    ) %>%
    ungroup()
  
  return(grouped_data)
}


result <- add_worst_case_emissions(data)
print(result)
#> # A tibble: 6 × 8
#>   companies_id          company_name country emission_share_ew emission_category
#>   <chr>                 <chr>        <chr>               <dbl> <chr>            
#> 1 040-tegelzetter_0000… 040 tegelze… nether…               0   high             
#> 2 040-tegelzetter_0000… 040 tegelze… nether…               1   medium           
#> 3 040-tegelzetter_0000… 040 tegelze… nether…               0   low              
#> 4 a-beverwijk-en-zonen… a. beverwij… nether…               0.5 high             
#> 5 a-beverwijk-en-zonen… a. beverwij… nether…               0   medium           
#> 6 a-beverwijk-en-zonen… a. beverwij… nether…               0.5 low              
#> # ℹ 3 more variables: benchmark <chr>, max_risk_category <chr>,
#> #   emission_share_wc <dbl>

Created on 2024-01-16 with reprex v2.0.2

For emission_share_bc find this reprex:

#for the best case

library(dplyr)
#> 
#> Attache Paket: 'dplyr'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#> 
#>     filter, lag
#> Die folgenden Objekte sind maskiert von 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

# Sample data
data <- tribble(
  ~companies_id, ~company_name, ~country, ~emission_share_ew, ~emission_category, ~benchmark,
  "040-tegelzetter_00000005336021-634273001", "040 tegelzetter", "netherlands", 0,   "high", "all",
  "040-tegelzetter_00000005336021-634273001", "040 tegelzetter", "netherlands", 1,   "medium", "all",
  "040-tegelzetter_00000005336021-634273001", "040 tegelzetter", "netherlands", 0,   "low", "all",
  "a-beverwijk-en-zonen-bv_nld009707-00101",  "a. beverwijk en zonen b.v.", "netherlands", 0.5, "high", "all",
  "a-beverwijk-en-zonen-bv_nld009707-00101",  "a. beverwijk en zonen b.v.", "netherlands", 0,   "medium", "all",
  "a-beverwijk-en-zonen-bv_nld009707-00101",  "a. beverwijk en zonen b.v.", "netherlands", 0.5, "low", "all"
)

# Automatically calculate emission_share_wc
add_worst_case_emissions <- function(data) {
  risk_order <- c("high", "medium", "low")
  
  grouped_data <- data %>%
    group_by(benchmark, companies_id, company_name) %>%
    mutate(
      max_risk_category = risk_order[max(match(emission_category[emission_share_ew > 0], risk_order))],
      emission_share_wc = ifelse(emission_category == max_risk_category & emission_share_ew > 0,
                                  1, 0)
    ) %>%
    ungroup()
  
  return(grouped_data)
}


result <- add_worst_case_emissions(data)
print(result)
#> # A tibble: 6 × 8
#>   companies_id          company_name country emission_share_ew emission_category
#>   <chr>                 <chr>        <chr>               <dbl> <chr>            
#> 1 040-tegelzetter_0000… 040 tegelze… nether…               0   high             
#> 2 040-tegelzetter_0000… 040 tegelze… nether…               1   medium           
#> 3 040-tegelzetter_0000… 040 tegelze… nether…               0   low              
#> 4 a-beverwijk-en-zonen… a. beverwij… nether…               0.5 high             
#> 5 a-beverwijk-en-zonen… a. beverwij… nether…               0   medium           
#> 6 a-beverwijk-en-zonen… a. beverwij… nether…               0.5 low              
#> # ℹ 3 more variables: benchmark <chr>, max_risk_category <chr>,
#> #   emission_share_wc <dbl>

Created on 2024-01-16 with reprex v2.0.2

Is this making it a bit clearer?

@Tilmon Tilmon assigned kalashsinghal and unassigned maurolepore Feb 6, 2024
@kalashsinghal
Copy link
Collaborator

@AnneSchoenauer I have some issues in understanding the above reprexes.

  1. You only added the function add_worst_case_emissions in both reprexes above. Please review above reprexes and add the add_best_case_emissions function in the reprex so that I can understand the methodology better.

  2. We have emissions_share column in the company level output and not the emissions_share_ew. It seems like emissions_share_ew is the output column after applying the code. If yes, then please update the column names.

  3. Please note that the input to the function add_worst_case_emissions is same as the output. Hence, its difficult to understand what the code is actually changing. Please use different input so that its easy to understand the code.

Thanks for spending time on creating the reprex! :)

@AnneSchoenauer
Copy link

Hi @kalashsinghal, yes you are right - this was a very very bad reprex. Please have a look if this works for you below. Please note that I didn't write any code but showed you in a sample data set how the result should look like.

library(dplyr)
#> 
#> Attache Paket: 'dplyr'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#> 
#>     filter, lag
#> Die folgenden Objekte sind maskiert von 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)

# Create the initial dataset
companies_co2 <- tibble(
  companies_id = rep("a", 13),
  grouped_by = c("all", "isic_4digit", "tilt_sector", "unit", "unit_isic_4digit", "unit_tilt_sector", "NA", 
                 "all", "isic_4digit", "tilt_sector", "unit", "unit_isic_4digit", "unit_tilt_sector"),
  risk_category = c("high", NA, "high", "high", NA, "high", NA, "medium", "medium", "medium", "medium", "medium", "medium"),
  profile_ranking = c(1, NA, 1, 1, NA, 1, NA, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5),
  clustered = c(rep("a", 6), "b", rep("c", 6)),
  activity_uuid_product_uuid = c(rep("a", 6), NA, rep("c", 6)),
  co2_footprint = c(rep(1, 6), NA, rep(0.5, 6))
)

# Function to add amount of distinct products
add_amount_of_distinct_products <- function(data) {
  products <- data %>% 
    group_by(companies_id) %>% 
    summarise(amount_of_distinct_products = n_distinct(clustered))
  
  data <- data %>% 
    left_join(products, by = "companies_id")
  
  return(data)
}

companies_co2 <- add_amount_of_distinct_products(companies_co2)

# You can directly use 'companies_co2' to further process or create new datasets as needed.
# Make sure to adjust the final dataset creation part as per your specific requirements.

Created on 2024-02-13 with reprex v2.0.2

Linda used with financial data similiar functions. Here are the functions that she used:

add_worst_case_finance <- function(data) {
  risk_order <- c("low", "medium", "high")
  
  grouped_data <- data %>%
    group_by(company_name) %>%
    mutate(
      max_risk_category = risk_order[max(match(pctr_risk_category, risk_order))],
      product_count = sum(pctr_risk_category == max_risk_category),
      worst_case_finance = ifelse(pctr_risk_category == max_risk_category,
                                  amount_total / product_count, 0)
    ) %>%
    ungroup() %>%
    select(-product_count, -max_risk_category)
  
  return(grouped_data)
}
add_best_case_finance <- function(data) {
  risk_order <- c("high", "medium", "low")
  
  grouped_data <- data |> 
    group_by(company_name) |> 
    mutate(
      max_risk_category = risk_order[max(match(pctr_risk_category, risk_order))],
      product_count = sum(pctr_risk_category == max_risk_category),
      best_case_finance = ifelse(pctr_risk_category == max_risk_category,
                                  amount_total / product_count, 0)
    ) |> 
    ungroup() |> 
    select(-product_count, -max_risk_category) 
  
  return(grouped_data)
}

You can find her work here in this zip folder:
weighing_loanbook (2).zip

However, please note that she used the following toy dataset here which also has financial data. The logic however is the same. In the worst case scenario you only count the worst case products (so with the highest emission profile) and in the best case you only count the products with the lowest emission profile.

Don't know how best to implement this in code. But let me know if there are questions.
Thanks!

@Tilmon
Copy link
Collaborator

Tilmon commented Feb 20, 2024

Hi @kalashsinghal , just FYI, @lindadelacombaz already worked with best case and worst case toy data a lot in tiltPlot and can support you with code or general knowledge if needed!

@kalashsinghal
Copy link
Collaborator

Hi @AnneSchoenauer, What does the variable amount_total signifies in the above reprex?

@AnneSchoenauer
Copy link

This is only important when it comes to credit data, so with financial data. So without credit data you can leave it like this.

@kalashsinghal
Copy link
Collaborator

This is only important when it comes to credit data, so with financial data. So without credit data you can leave it like this.

@AnneSchoenauer So then what should be the replacement of this expression amount_total / product_count if I don't have to use the amount_total?

@AnneSchoenauer
Copy link

Amount total is the total amount of the products of the company, and product count the amount of the products in a certain risk category. So for example if the company has 4 products in total and in the equal weight scenario 2 are in the high and 2 are in the low risk category, you would have 2/4 for low, and 2/4 for high. If it is the best case it is 4/4 for low and 4/4 for high.

@kalashsinghal
Copy link
Collaborator

@maurolepore Output of this task is not required for Bundesbank data, however it is required to create toy dataset for financials . I have to delegate this work to you too. Thanks!

@Tilmon
Copy link
Collaborator

Tilmon commented Apr 9, 2024

@maurolepore see the file "weighing_loanbook (2).zip" in this comment #242

@AnneSchoenauer
Copy link

UPDATE:

I read through the chain I noticed that over the months this issue and the understanding of the complexity changed. Therfore, here is now the final ticket that we need to look for.

Starting point: emission_profile product_level. Tickets will follow for sector profile and transition risk as well as how to calculate the company level. There will also be edge cases with NAs which I would like to discuss if you say that you understand this ticket here.

Task Breakdown:

  • Calculate Equal Weights:

For each product in a company, calculate the equal_weight by dividing one by the number of distinct products per company. This represents the relative weight of each product.

  • Determine Emission Risk Extremes:
    Identify the lowest and highest emission profiles (e.g., "low", "medium", "high") for each company.

  • Calculate Best and Worst Case Scenarios:
    Best Case: Compute the proportion of each product that matches the lowest emission profile in its company.
    Worst Case: Compute the proportion of each product that matches the highest emission profile in its company.

  • Implement Additional Columns:

  • amount_of_distinct_products: count all ep_products

  • equal_weight: calculate 1/all ep_products

  • dummy_best: Binary column (1 or 0) indicating if the product's emission profile matches the company's lowest risk category.

  • dummy_worst: Binary column (1 or 0) indicating if the product's emission profile matches the company's highest risk category.

  • count_best_case_products: Count of all products that match the lowest emission profile within the company.

  • count_worst_case_products: Count of all products that match the highest emission profile within the company.

  • best_case: Calculate as dummy_best divided by count_best_case_products.

  • worst_case: Calculate as dummy_worst divided by count_worst_case_products.

Please note that the dummy_best solution and the count solution was my solution on how I would implement the best_case and worst_case column. The dummy columns are not needed as well as the count columns for best case and worst case. However in the new outputs we should see amount_of_distinct_products, equal_weight, best_case and worst_case.

Here is the reprex to the above:

# Load necessary libraries
library(dplyr)
#> 
#> Attache Paket: 'dplyr'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#> 
#>     filter, lag
#> Die folgenden Objekte sind maskiert von 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Example data
data <- tibble(
  company_name = c("bruno", "bruno", "bruno", "mauro", "mauro", "mirja", "mirja", "pasant", "pasant", "peter", "peter", "peter peasant", "peter peasant", "tilman", "tilman", "tilman", "tilman", "tilman", "tilman"),
  postcode = c(27568, 27568, 27568, 39221, 39221, 34117, 34117, 80337, 80337, 88131, 88131, 53773, 53773, 12043, 12043, 12043, 12043, 12043, 12043),
  emission_profile = c("high", "high", "medium", "high", "high", "high", "high", "high", "low", "high", "medium", "high", "high", "high", "high", "high", "low", "low", "medium")
)

# Calculate the number of products per company
products <- data %>% 
  group_by(company_name) %>% 
  summarise(amount_of_distinct_products = n()) %>%
  ungroup()

# Join this count back to the main data frame
data <- data %>%
  left_join(products, by = "company_name") %>%
  mutate(equal_weight = 1 / amount_of_distinct_products)

# Function to add case scenarios and calculate best_case and worst_case
add_case_scenarios <- function(data) {
  risk_order <- c("low", "medium", "high")
  
  data %>%
    group_by(company_name) %>%
    mutate(
      max_risk_category = risk_order[max(match(emission_profile, risk_order))],
      min_risk_category = risk_order[min(match(emission_profile, risk_order))],
      count_best_case_products = sum(emission_profile == min_risk_category),
      count_worst_case_products = sum(emission_profile == max_risk_category),
      dummy_best = as.integer(emission_profile == min_risk_category),
      dummy_worst = as.integer(emission_profile == max_risk_category),
      best_case = dummy_best / count_best_case_products,
      worst_case = dummy_worst / count_worst_case_products
    ) %>%
    ungroup()
}

# Apply function
data <- add_case_scenarios(data)

# Print data to check the new columns
print(data)
#> # A tibble: 19 × 13
#>    company_name  postcode emission_profile amount_of_distinct_pro…¹ equal_weight
#>    <chr>            <dbl> <chr>                               <int>        <dbl>
#>  1 bruno            27568 high                                    3        0.333
#>  2 bruno            27568 high                                    3        0.333
#>  3 bruno            27568 medium                                  3        0.333
#>  4 mauro            39221 high                                    2        0.5  
#>  5 mauro            39221 high                                    2        0.5  
#>  6 mirja            34117 high                                    2        0.5  
#>  7 mirja            34117 high                                    2        0.5  
#>  8 pasant           80337 high                                    2        0.5  
#>  9 pasant           80337 low                                     2        0.5  
#> 10 peter            88131 high                                    2        0.5  
#> 11 peter            88131 medium                                  2        0.5  
#> 12 peter peasant    53773 high                                    2        0.5  
#> 13 peter peasant    53773 high                                    2        0.5  
#> 14 tilman           12043 high                                    6        0.167
#> 15 tilman           12043 high                                    6        0.167
#> 16 tilman           12043 high                                    6        0.167
#> 17 tilman           12043 low                                     6        0.167
#> 18 tilman           12043 low                                     6        0.167
#> 19 tilman           12043 medium                                  6        0.167
#> # ℹ abbreviated name: ¹​amount_of_distinct_products
#> # ℹ 8 more variables: max_risk_category <chr>, min_risk_category <chr>,
#> #   count_best_case_products <int>, count_worst_case_products <int>,
#> #   dummy_best <int>, dummy_worst <int>, best_case <dbl>, worst_case <dbl>

Created on 2024-04-15 with reprex v2.0.2

@Tilmon
Copy link
Collaborator

Tilmon commented Apr 23, 2024

Conclusion from Sprint 2024-04-23
@kalashsinghal to come up with first draft PR and share it with @maurolepore for review.

@AnneSchoenauer
Copy link

@Tilmon please find here an XLS sheet (sheet 1), where I implemented the best case and worst case for the sector profile and transition risk score (@kalashsinghal please ignore this comment for now please). I included now the NAs for the ew but I excluded the NAs for best_case and worst_case. This could lead to the fact that the best_case are worse than the equal_weight. It would be great if we could discuss this pretty soon so that we are both aligned on this issue and that we can tell the tech team how to handle NAs :) I put it to discuss it on Wednesday!

@kalashsinghal
Copy link
Collaborator

count_best_case_products: Count of all products that match the lowest emission profile within the company.
count_worst_case_products: Count of all products that match the highest emission profile within the company.

@AnneSchoenauer A single product have six benchmarks (like: "all", "tilt_sector", ..etc.). Each benchmark can have different risk categories. So, on the basis of which benchmark will we judge that the product matches the lowest risk category of the company? For example:

Risk categories of a product with benchmark "all" is low and for benchmark "tilt_sector" is high, and company's lowest risk category is high, then for benchmark "all", the product doesn't match to the company's high risk category, but for benchmark "tilt_sector" it does match. Is this an issue? or am i missing something?

@AnneSchoenauer
Copy link

Hi @kalashsinghal we will do it for all benchmarks individually so before you do it you would need to group it for each benchmark.

@AnneSchoenauer
Copy link

@kalashsinghal --> push already a draft work without expectations that it is clean but at least the latest work is already pushed.

@maurolepore
Copy link
Contributor Author

From our conversation during sprint planning I understand this issue moves from tiltIndicator to tiltIndicatorAfter.

Moving it now ...

@Tilmon
Copy link
Collaborator

Tilmon commented Jun 4, 2024

@kalashsinghal @AnneSchoenauer adjusted the ticket name to reflect that it's only for product-level required

@Tilmon Tilmon closed this as completed Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants