New `summarize_range()` and `jitter_range()` #622

maurolepore · 2023-11-17T17:34:53Z

Closes #554

This PR adds new helpers to (1) summarize the range of a column in a dataframe by groups, and (2) jitter the range, expanding it towards the left and right of the range.

NOTE: the amount of jitter is hard to set precisely via function arguments. Instead we may call the function and then calculate the average deviation of the jitter. This PR does not formally introduce a helper to do that. See #624.

AnneSchoenauer · 2023-11-20T16:17:48Z

@maurolepore. Could you please explain to me the following:

Why don't the values in the first row differ? That is, why do we have the same values for min and min_jitter, and for max and max_jitter? If that is because 10% is quite small, we should definitely increase the value. Do you have a recommendation for a good one? I see here that you added some plots showing the uncertainty - maybe you could do this for 0.2, 0.5 and 0.7 so we can understand the differences.

#>   grouped_by       risk_category   min    max min_jitter max_jitter
#>   <chr>            <chr>         <dbl>  <dbl>      <dbl>      <dbl>
#> 1 all              high          58.1   176.       58.1       176.

Also, it seems that max_jitter does not always go in the right direction, i.e. it does not always increase the value. See for example row 5: the max value is 176. and max_jitter is 175., which is lower than the actual value.

#>   grouped_by       risk_category   min    max min_jitter max_jitter
#> 5 unit_isic_4digit high           2.07   176.       2.03       175.

And a last comment: the min and max values are only the min and max values from the europages products matched to ecoinvent, right? Actually, this is similar to the benchmark: we would want to include the ranges from all ecoinvent data. So the question is whether you could get the raw ecoinvent data with the benchmarks, calculate the ranges, and then pick the ep_products based on the europages companies.

maurolepore · 2023-11-20T18:09:45Z

@AnneSchoenauer,

Thanks for your sharp eye!

I owe you the plot.
Very weird. Here is a reprex with 1 million values and I don't see that problem. My only guess is that I shared the reprex above before I fully finished the logic. In any case I'll need to test a bit more to make sure I got it right.

devtools::load_all()
#> ℹ Loading tiltIndicator
library(dplyr, warn.conflicts = FALSE)
set.seed(123)
options(width = 200)

n <- 1e6
data <- tibble(x = rnorm(n), y = 1:n)

out <- data |> 
  summarize_range(x, .by = y) |> 
  jitter_range() |> 
  mutate(min_jitter_is_smaller = min_jitter < min) |> 
  mutate(max_jitter_is_bigger = max_jitter > max)
out
#> # A tibble: 1,000,000 × 7
#>        y     min     max min_jitter max_jitter min_jitter_is_smaller max_jitter_is_bigger
#>    <int>   <dbl>   <dbl>      <dbl>      <dbl> <lgl>                 <lgl>               
#>  1     1 -0.560  -0.560     -0.661     -0.538  TRUE                  TRUE                
#>  2     2 -0.230  -0.230     -0.366     -0.181  TRUE                  TRUE                
#>  3     3  1.56    1.56       1.51       1.56   TRUE                  TRUE                
#>  4     4  0.0705  0.0705    -0.0763     0.0941 TRUE                  TRUE                
#>  5     5  0.129   0.129      0.0850     0.231  TRUE                  TRUE                
#>  6     6  1.72    1.72       1.70       1.79   TRUE                  TRUE                
#>  7     7  0.461   0.461      0.444      0.626  TRUE                  TRUE                
#>  8     8 -1.27   -1.27      -1.32      -1.22   TRUE                  TRUE                
#>  9     9 -0.687  -0.687     -0.963     -0.645  TRUE                  TRUE                
#> 10    10 -0.446  -0.446     -0.508     -0.305  TRUE                  TRUE                
#> # ℹ 999,990 more rows

any(!out$min_jitter_is_smaller)
#> [1] FALSE

any(!out$max_jitter_is_bigger)
#> [1] FALSE

Created on 2023-11-20 with reprex v2.0.2

Yeah, the toy data in the reprex above imitates the products dataset -- not the ecoinvent dataset. But the new functions are completely general. `summarize_range(data, col, .by)` takes any dataset in `data`, any column in `col`, and any groups in `.by` (or `group_by()`). So we just need to call these functions with the required data.
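To make that generality concrete, here is a minimal sketch of what a grouped range summary could look like in plain dplyr. `summarize_range_sketch()` is a hypothetical stand-in written for illustration, not the tiltIndicator implementation. It assumes dplyr >= 1.1.0 (for `.by`), the output column names `min` and `max` seen in the reprex above, and that missing values are dropped, which is a guess.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for illustration only; not the package's code.
# Dropping NAs with na.rm = TRUE is an assumption.
summarize_range_sketch <- function(data, col, .by = NULL) {
  summarize(
    data,
    min = min({{ col }}, na.rm = TRUE),
    max = max({{ col }}, na.rm = TRUE),
    .by = {{ .by }}
  )
}

# Toy data: any dataset, any numeric column, any grouping column.
toy <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# One row per group, with the min and max of `value`.
out <- summarize_range_sketch(toy, value, .by = group)
out
```

Swapping `toy` for another dataset, such as one derived from ecoinvent, would only change the arguments, not the function.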

maurolepore · 2023-11-20T20:00:06Z

Answering 1: on average the jitter increases with amount, but you will still see individual jittered values that are close to the original value.

library(tidyverse)
devtools::load_all()
#> ℹ Loading tiltIndicator

# Helper
plot_jitter_amount <- function(amount) {
  data <- tibble(x = rnorm(100))
  
  jitter <- data |> 
    summarize_range(x, .by = x) |> 
    jitter_range(amount)
  
  ggplot(jitter) +
    geom_line(aes(x, x)) +
    geom_point(aes(min_jitter, x), color = "blue") +
    geom_point(aes(max_jitter, x), color = "red")
}

c(0.1, 0.2, 0.5, 0.7) |> 
  set_names() |> 
  map(plot_jitter_amount)
[Plots not shown: one per amount (0.1, 0.2, 0.5, 0.7), each with min_jitter in blue and max_jitter in red.]

Created on 2023-11-20 with reprex v2.0.2
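The same point can be checked numerically without plots. The sketch below uses base R's `jitter()` directly as a stand-in, since `jitter_range()` needs the package loaded; that substitution is an assumption, and the variable names are made up for illustration. It compares the mean and the smallest absolute displacement for each amount.

```r
set.seed(123)

x <- rnorm(1000)
amounts <- c(0.1, 0.2, 0.5, 0.7)

# Absolute displacement of every point, one vector per amount.
# Assumption: base jitter() with a positive `amount` stands in for
# whatever jitter_range() does internally.
devs <- lapply(amounts, function(a) abs(jitter(x, amount = a) - x))

mean_dev <- vapply(devs, mean, numeric(1))
min_dev <- vapply(devs, min, numeric(1))

# The mean displacement grows with amount, while the smallest
# displacement stays close to zero for every amount.
data.frame(amount = amounts, mean_dev = mean_dev, min_dev = min_dev)
```

With a positive `amount`, base `jitter()` adds uniform noise in `[-amount, amount]`, so the average displacement scales with `amount` while some draws inevitably land near zero.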

maurolepore · 2023-11-20T20:29:17Z

As for a recommendation, a quick experiment suggests that the average difference between the original values (min, max) and the jittered values (min_jitter, max_jitter) is not what I expected (e.g. 10% for amount = 0.1), and also that the percent difference is higher for min than for max.

I'll experiment a bit more and try to find an algorithm and a value of amount that moves each side of the range about 10% on average, and evenly for both sides of the range.

AnneSchoenauer · 2023-11-21T08:09:49Z

Thanks @mauro - I would then trust your recommendation. Otherwise I would implement 0.2 or 0.3. But it may be easier to communicate to the banks that on average the deviation will be 10% of the original value, so if you can implement your recommendation we will go with that!
Thanks for checking - once you have tested it we are good to go :)
I guess this is then a workflow issue - we just need to make sure that when Kalash gives you the ecoinvent data, you run this on the ecoinvent data and only afterwards filter for the ones that have also been mapped to europages. @kalashsinghal said he will follow up with you on this workflow, and it may be best to coordinate at the tech meeting - in any case we will make sure that this code stays in R :)

maurolepore · 2023-11-22T08:05:13Z

Thanks @AnneSchoenauer,

I tried a couple of ideas and realized that the algorithm can be a little tricky. For example, I tried to sample a random number from 1-100 and use that as a percent displacement of the points we want to jitter. That sounded neat to me, but then I realized that the case when a point is 0 would not be handled gracefully -- for example, adding 10% of 0 to 0 still gives 0.

So I decided to be conservative and use the base R implementation of jitter() -- which considers the case when x is 0:

From ?jitter:

factor: numeric.
amount: numeric; if positive, used as amount (see below), otherwise, if = 0 the default is factor * z/50.
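A small sketch of the zero problem and of how base `jitter()` avoids it. The percent-displacement code is a hypothetical reconstruction of the idea described above, not code from this PR, and the toy vector `x` is made up for illustration.

```r
set.seed(123)

x <- c(0, 2, 50)

# Hypothetical percent-displacement idea: shift each point by a random
# percentage (1 to 100) of its own value.
percent <- sample(1:100, length(x), replace = TRUE)
displaced <- x + x * percent / 100

# The point at 0 does not move, whatever percentage is drawn.
displaced[1] == 0

# Base jitter() with a positive amount adds uniform noise in
# [-amount, amount], independent of the size of x, so 0 moves too.
jittered <- jitter(x, amount = 0.2)

# With amount = 0 the half-width defaults to factor * z / 50, where z is
# the range of x (here 50, so the half-width is 1 for factor = 1).
jittered_default <- jitter(x, amount = 0)
```

Because the noise is additive rather than proportional, every point is displaced on the same absolute scale, which is also why the displacement expressed as a percent of the original value is hard to control.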

Although we still lack precise control of the average percent jitter, we can calculate it after the fact. For example, this reprex shows a little helper to calculate the mean deviation for each side of the range. This would allow us to report the jitter range along with the calculated mean deviation of min and max.

library(tidyverse)
devtools::load_all()
#> ℹ Loading tiltIndicator

set.seed(123)

data <- tibble(x = rnorm(1000))
amount <- 0.2

mean_jitter_percent <- function(x, jitter) {
  deviation <- abs(abs(x) - abs(jitter))
  percent <- deviation * 100 / abs(x)
  mean(percent[is.finite(percent)])
}

jitter <- data |>
  summarize_range(x, .by = x) |>
  jitter_range(amount = amount)

mean_jitter_percent(jitter$min, jitter$min_jitter)
#> [1] 48.40077
mean_jitter_percent(jitter$max, jitter$max_jitter)
#> [1] 54.14247

ggplot(jitter) +
  geom_line(aes(x, x)) +
  geom_point(aes(min_jitter, x), color = "blue") +
  geom_point(aes(max_jitter, x), color = "red")

Created on 2023-11-22 with reprex v2.0.2
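One detail of `mean_jitter_percent()` worth noting: because it compares absolute values, it can understate the displacement when jitter pushes a value across zero. The sketch below contrasts it with a hypothetical variant, `mean_jitter_percent_signed()`, that measures `abs(jitter - x)` instead. The variant and the toy inputs are assumptions for illustration, not part of this PR.

```r
# Helper as written in the reprex above.
mean_jitter_percent <- function(x, jitter) {
  deviation <- abs(abs(x) - abs(jitter))
  percent <- deviation * 100 / abs(x)
  mean(percent[is.finite(percent)])
}

# Hypothetical variant: measures the actual distance moved, so a value
# that crosses zero is not counted as unchanged.
mean_jitter_percent_signed <- function(x, jitter) {
  percent <- abs(jitter - x) * 100 / abs(x)
  mean(percent[is.finite(percent)])
}

# Toy inputs: the first point crosses zero, the second moves by 10%.
x <- c(0.05, 1)
jittered <- c(-0.05, 1.1)

original <- mean_jitter_percent(x, jittered)        # about 5
signed <- mean_jitter_percent_signed(x, jittered)   # about 105
```

Which definition is appropriate depends on what should be reported. The two agree whenever the original and jittered values share a sign.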

AnneSchoenauer · 2023-11-22T17:40:55Z

Sounds really good Mauro! Thanks :)

maurolepore added 15 commits November 17, 2023 17:30

crucial

c4e3435

Crucial names

82fe270

wip

ed58434

refactor

7e166c1

wip

7bed28e

wip

98988bf

wip

338560d

wip

53ea649

wip

41b8df2

wip

1d2fabb

wip

7404a3c

wip

5c231d1

wip

3e04346

wip

844a8ef

Merge branch 'main' into 554_show_co2-range

96777d4

maurolepore changed the title ~~New jitter_co2_range~~ New jitter_co2_range() Nov 17, 2023

maurolepore added 14 commits November 17, 2023 19:10

tdd-fail

87741ac

wip

f758134

wip

440a9ce

Handles NA

64d9c07

wip

0e2682b

wip

87a206f

wip

38e8f2c

wip

ecb9a1b

wip

e0b33a3

wip

a34005b

wip

786b3c6

wip

6f96b83

wip

9c59594

wip

9f58ed2

wip

c862551

maurolepore changed the title ~~WIP: New summarize_range() and jitter_range()~~ New summarize_range() and jitter_range() Nov 20, 2023

maurolepore added 5 commits November 20, 2023 13:37

styoe

59c637b

Move-as-helpers

1a5c090

Title

231af97

wip

f35a4c9

wip

6bbd803

maurolepore marked this pull request as ready for review November 20, 2023 13:49

maurolepore requested a review from AnneSchoenauer November 20, 2023 13:49

maurolepore added 3 commits November 22, 2023 07:40

use base jitter()

e7d5c41

document

cfa89c7

wip

4613e13

maurolepore added 5 commits November 22, 2023 11:14

wip

a8c14d9

refactor

982015f

wip

2eea863

wip

532c448

wip

4dea023

AnneSchoenauer approved these changes Nov 22, 2023

maurolepore added 2 commits November 23, 2023 10:00

Polish tests

bb4a373

style

3441b22

maurolepore mentioned this pull request Nov 23, 2023

New helper to calculate the mean jitter percent #624

Closed

maurolepore merged commit 79f3aea into main Nov 23, 2023
8 checks passed

maurolepore deleted the 554_show_co2-range branch November 23, 2023 10:12
