New summarize_range() and jitter_range() #622

Merged
merged 127 commits into from
Nov 23, 2023

Conversation

maurolepore
Contributor

@maurolepore maurolepore commented Nov 17, 2023

Closes #554

This PR adds new helpers to (1) summarize the range of a column in a dataframe by groups, and (2) jitter the range, expanding it towards the left and right of the range.

NOTE: the amount of jitter is hard to set precisely via function arguments. Instead we may call the function and then calculate the average deviation of the jitter. This PR does not formally introduce a helper to do that. See #624.
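To make the two steps concrete, here is a minimal base R sketch. The names `sketch_summarize_range()` and `sketch_jitter_range()` are hypothetical stand-ins and not the functions added in this PR. The noise model (a uniform draw from 0 to `amount`, subtracted from the min and added to the max) is an assumption, chosen only because it guarantees the range never shrinks.

```r
# Hypothetical stand-ins for illustration only; not the tiltIndicator code.

# Summarize the min and max of column `col` for each level of column `by`.
sketch_summarize_range <- function(data, col, by) {
  groups <- split(data[[col]], data[[by]])
  out <- data.frame(
    group = names(groups),
    min = vapply(groups, min, numeric(1)),
    max = vapply(groups, max, numeric(1)),
    row.names = NULL
  )
  names(out)[1] <- by
  out
}

# Expand the range: move `min` to the left and `max` to the right by a
# random distance drawn uniformly from [0, amount] (an assumed noise model).
sketch_jitter_range <- function(data, amount = 0.1) {
  n <- nrow(data)
  data$min_jitter <- data$min - runif(n, 0, amount)
  data$max_jitter <- data$max + runif(n, 0, amount)
  data
}

set.seed(1)
# Toy data with invented values.
toy <- data.frame(
  risk_category = rep(c("high", "low"), each = 5),
  co2 = runif(10, 1, 100)
)
ranges <- sketch_jitter_range(sketch_summarize_range(toy, "co2", "risk_category"))
ranges
```

The result has one row per group with the columns `min`, `max`, `min_jitter` and `max_jitter`, the same shape as the output discussed in this conversation.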


TODO

  • Link related issue/PR.
  • Describe the goal of the PR. Avoid details that are clear in the diff.
  • Mark the PR as draft.
  • Include a unit test.
  • Review your own PR in "Files changed".
  • Ensure the PR branch is updated.
  • Ensure the checks pass.
  • Change the status from draft to ready.
  • Polish the PR title and description.
  • Assign a reviewer.

EXCEPTIONS

  • Slide here any item that you intentionally choose to not do.

@maurolepore maurolepore changed the title from "New jitter_co2_range" to "New jitter_co2_range()" Nov 17, 2023
@maurolepore maurolepore changed the title from "WIP: New summarize_range() and jitter_range()" to "New summarize_range() and jitter_range()" Nov 20, 2023
@maurolepore maurolepore marked this pull request as ready for review November 20, 2023 13:49
@AnneSchoenauer

AnneSchoenauer commented Nov 20, 2023

@maurolepore. Could you please explain to me the following:

Why don't the values in the first row differ? That is, why do we get the same values for min and min_jitter, and for max and max_jitter? If 10% is too small, we should definitely increase the value. Do you have a recommendation for a good one? I see here that you added some plots showing the uncertainty; maybe you could do this for 0.2, 0.5 and 0.7 so we can understand the differences.

#>    grouped_by       risk_category   min    max min_jitter max_jitter
#>    <chr>            <chr>         <dbl>  <dbl>      <dbl>      <dbl>
#>  1 all              high          58.1  176.        58.1      176.  

Also, it seems that max_jitter does not always go in the right direction: it does not always increase the value. See for example row 5. Here the max value is 176. and the max_jitter is 175., which is lower than the actual value.

#>    grouped_by       risk_category   min    max min_jitter max_jitter
#>  5 unit_isic_4digit high           2.07 176.         2.03     175.

And a last comment: the min and max values are only the min and max values from the europages products matched to ecoinvent, right? This is actually similar to the benchmark: we would want to include the ranges from all ecoinvent data. So the question is whether you could get the raw ecoinvent data with the benchmarks, calculate the ranges, and then pick the ep_products based on the europages companies.

@maurolepore
Contributor Author

maurolepore commented Nov 20, 2023

@AnneSchoenauer,

Thanks for your sharp eye!

  1. I owe you the plot.

  2. Very weird. Here is a reprex with 1 million values and I don't see that problem. My only guess is that I shared the reprex above before I fully finished the logic. In any case I'll need to test a bit more to make sure I got it right.

devtools::load_all()
#> ℹ Loading tiltIndicator
library(dplyr, warn.conflicts = FALSE)
set.seed(123)
options(width = 200)

n <- 1e6
data <- tibble(x = rnorm(n), y = 1:n)

out <- data |> 
  summarize_range(x, .by = y) |> 
  jitter_range() |> 
  mutate(min_jitter_is_smaller = min_jitter < min) |> 
  mutate(max_jitter_is_bigger = max_jitter > max)
out
#> # A tibble: 1,000,000 × 7
#>        y     min     max min_jitter max_jitter min_jitter_is_smaller max_jitter_is_bigger
#>    <int>   <dbl>   <dbl>      <dbl>      <dbl> <lgl>                 <lgl>               
#>  1     1 -0.560  -0.560     -0.661     -0.538  TRUE                  TRUE                
#>  2     2 -0.230  -0.230     -0.366     -0.181  TRUE                  TRUE                
#>  3     3  1.56    1.56       1.51       1.56   TRUE                  TRUE                
#>  4     4  0.0705  0.0705    -0.0763     0.0941 TRUE                  TRUE                
#>  5     5  0.129   0.129      0.0850     0.231  TRUE                  TRUE                
#>  6     6  1.72    1.72       1.70       1.79   TRUE                  TRUE                
#>  7     7  0.461   0.461      0.444      0.626  TRUE                  TRUE                
#>  8     8 -1.27   -1.27      -1.32      -1.22   TRUE                  TRUE                
#>  9     9 -0.687  -0.687     -0.963     -0.645  TRUE                  TRUE                
#> 10    10 -0.446  -0.446     -0.508     -0.305  TRUE                  TRUE                
#> # ℹ 999,990 more rows

any(!out$min_jitter_is_smaller)
#> [1] FALSE

any(!out$max_jitter_is_bigger)
#> [1] FALSE

Created on 2023-11-20 with reprex v2.0.2

  3. Yeah, the toy data in the reprex above imitates the products dataset -- not the ecoinvent dataset. But the new functions are completely general. summarize_range(data, col, .by) takes any dataset in data, any column in col, and any groups in .by (or group_by()). So we just need to call these functions with the required data.

@maurolepore
Contributor Author

Answering 1, on average the jitter increases with amount but you will still see individual values of jitter that are close to the original value.

library(tidyverse)
devtools::load_all()
#> ℹ Loading tiltIndicator

# Helper
plot_jitter_amount <- function(amount) {
  data <- tibble(x = rnorm(100))
  
  jitter <- data |> 
    summarize_range(x, .by = x) |> 
    jitter_range(amount)
  
  ggplot(jitter) +
    geom_line(aes(x, x)) +
    geom_point(aes(min_jitter, x), color = "blue") +
    geom_point(aes(max_jitter, x), color = "red")
}

c(0.1, 0.2, 0.5, 0.7) |> 
  set_names() |> 
  map(plot_jitter_amount)
#> $`0.1`

#> 
#> $`0.2`

#> 
#> $`0.5`

#> 
#> $`0.7`

Created on 2023-11-20 with reprex v2.0.2

@maurolepore
Contributor Author

As for a recommendation, a quick experiment suggests the average difference between the original values (min, max) and the jitter values (min_jitter, max_jitter) is not what I expected (e.g. 10% for amount = 0.1) and also that the percent difference is higher for min than for max.

I'll experiment a bit more and try to find an algorithm and a value of amount that moves each side of the range about 10% on average and evenly for both sides of the range.

@AnneSchoenauer

  1. Thanks @mauro - I would then trust your recommendation. Otherwise I would implement 0.2 or 0.3. But maybe easier to communicate to the banks that on average the deviation will be 10% of the original value so if you can implement your recommendation we will go with this!
  2. Thanks for checking - after you tested it we are good to go :)
  3. I guess this is then a workflow issue. We just need to make sure that when Kalash gives you the ecoinvent data, you run it on the ecoinvent data and only afterwards filter to the ones that have also been mapped to europages. @kalashsinghal said he will follow up with you on this workflow, and it may be best to coordinate at the tech meeting. In any case we make sure that this code here now stays in R :)
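The order of operations in point 3 can be sketched as follows, using base R only. The data frames, column names and values are invented for illustration, and `aggregate()` stands in for `summarize_range()`. The point is only that the range is computed on all rows first and the filter to matched products happens second.

```r
# Toy stand-ins; the column names and values are invented for illustration.
ecoinvent <- data.frame(
  activity_id = c("a", "b", "c", "d", "e", "f"),
  unit = c("kg", "kg", "kg", "m3", "m3", "m3"),
  co2 = c(1, 5, 9, 2, 4, 20)
)
# Hypothetical subset of ecoinvent activities matched to europages products.
matched_ids <- c("a", "b", "d")

# 1. Compute the range per group on ALL ecoinvent rows.
range_by_unit <- aggregate(co2 ~ unit, data = ecoinvent, FUN = range)
full_range <- data.frame(
  unit = range_by_unit$unit,
  min = range_by_unit$co2[, 1],
  max = range_by_unit$co2[, 2]
)

# 2. Only then keep the matched rows and attach the full-data range.
matched <- ecoinvent[ecoinvent$activity_id %in% matched_ids, ]
result <- merge(matched, full_range, by = "unit")
result
```

In this toy data the max for "kg" is 9, which comes from the unmatched activity "c". Filtering before summarizing would have reported 5 instead.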

@maurolepore
Contributor Author

maurolepore commented Nov 22, 2023

Thanks @AnneSchoenauer,

I tried a couple of ideas and realized that the algorithm can be a little tricky. For example I tried to sample a random number from 1-100 and use that as a percent-displacement of the points we want to jitter. That sounded neat to me but then I realized the case when a point is 0 would not be handled gracefully -- for example adding 10% of 0 to 0 is still 0.
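A small sketch of that percent-displacement idea shows the problem at 0. The variable names are made up for this example and the code is not part of the PR.

```r
set.seed(42)
x <- c(0, 2, -3)

# Draw a random percent from 1 to 100 for each point.
percent <- sample(1:100, length(x), replace = TRUE)

# Displace each point by that percent of its own value.
displaced <- x + x * percent / 100

# Any percent of 0 is 0, so the point at 0 never moves.
displaced[x == 0]
#> [1] 0
```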

So I decided to be conservative and use the base R implementation of jitter() -- which considers the case when x is 0:

?jitter():

amount: numeric; if positive, used as amount (see below), otherwise, if = 0 the default is factor * z/50.
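For comparison, here is a minimal check that base R `jitter()` does displace points at 0 when given a positive `amount`. The last step, which keeps only the size of the noise to push values in one direction, is an assumption for illustration and not necessarily what `jitter_range()` does.

```r
set.seed(123)
zeros <- rep(0, 1000)

# With a positive `amount`, jitter() adds uniform noise in [-amount, amount],
# so points at 0 are displaced too.
jittered <- jitter(zeros, amount = 0.1)
range(jittered)

# Assumed approach: use the absolute size of the noise so that the lower
# bound only moves left and the upper bound only moves right.
noise <- abs(jittered - zeros)
min_jitter <- zeros - noise
max_jitter <- zeros + noise
```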

Although we still lack precise control of the average jitter as a percent, we can calculate it after the fact. For example, this reprex shows a little helper to calculate the mean deviation for each side of the range. This would allow us to report the jitter range along with the calculated mean deviation of min and max.

library(tidyverse)
devtools::load_all()
#> ℹ Loading tiltIndicator

set.seed(123)

data <- tibble(x = rnorm(1000))
amount <- 0.2

mean_jitter_percent <- function(x, jitter) {
  deviation <- abs(abs(x) - abs(jitter))
  percent <- deviation * 100 / abs(x)
  mean(percent[is.finite(percent)])
}

jitter <- data |>
  summarize_range(x, .by = x) |>
  jitter_range(amount = amount)

mean_jitter_percent(jitter$min, jitter$min_jitter)
#> [1] 48.40077
mean_jitter_percent(jitter$max, jitter$max_jitter)
#> [1] 54.14247

ggplot(jitter) +
  geom_line(aes(x, x)) +
  geom_point(aes(min_jitter, x), color = "blue") +
  geom_point(aes(max_jitter, x), color = "red")

Created on 2023-11-22 with reprex v2.0.2

@AnneSchoenauer

Sounds really good Mauro! Thanks :)

@maurolepore maurolepore merged commit 79f3aea into main Nov 23, 2023
8 checks passed
@maurolepore maurolepore deleted the 554_show_co2-range branch November 23, 2023 10:12
Development

Successfully merging this pull request may close these issues.

In emissions_profile*() inform the co2 range for each risk_category
2 participants