New summarize_range() and jitter_range() #622
@maurolepore, could you please explain the following: why don't the values in the first column differ? That is, why do we have the same values for min, max, min_jitter and max_jitter? If 10% is too small, we should definitely increase it. Do you have a recommendation for a good value? I see that you added some plots showing the uncertainty; maybe you could do the same for 0.2, 0.5 and 0.7 so we can understand the differences.
Also, max_jitter does not always seem to go in the right direction: it does not always increase the values. See, for example, column 5, where the max value is 176. and max_jitter is 175., which is lower than the actual value.
One last comment: the min and max values come only from the europages products matched to ecoinvent, right? This is similar to the benchmark: we would want to include the ranges from all ecoinvent data. So the question is whether you could take the raw ecoinvent data with the benchmarks, calculate the ranges, and then pick the ep_products based on the europages companies.
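The order of operations suggested in the last comment (ranges first, filtering second) can be sketched roughly as follows. This is only an illustration in base R: the data frames, column names (`activity_uuid`, `benchmark_group`, `co2_footprint`, `ep_product`) and values are all made up and are not taken from tiltIndicator or ecoinvent.

```r
# Toy stand-in for the full ecoinvent data; every name and value is hypothetical.
ecoinvent <- data.frame(
  activity_uuid = c("a1", "a2", "a3", "a4", "a5"),
  benchmark_group = c("g1", "g1", "g1", "g2", "g2"),
  co2_footprint = c(10, 50, 120, 3, 8)
)

# Toy stand-in for the europages products matched to ecoinvent.
ep_products <- data.frame(
  ep_product = c("p1", "p2"),
  activity_uuid = c("a2", "a4")
)

# 1. Compute the range per group from ALL ecoinvent rows.
mins <- aggregate(co2_footprint ~ benchmark_group, data = ecoinvent, FUN = min)
maxs <- aggregate(co2_footprint ~ benchmark_group, data = ecoinvent, FUN = max)
names(mins)[2] <- "min"
names(maxs)[2] <- "max"
ranges <- merge(mins, maxs, by = "benchmark_group")

# 2. Only then keep the rows that belong to matched europages products.
groups <- ecoinvent[c("activity_uuid", "benchmark_group")]
matched <- merge(ep_products, groups, by = "activity_uuid")
result <- merge(matched, ranges, by = "benchmark_group")
result
```

Here the range for group `g1` is 10 to 120 even though only `a2` (value 50) is matched to a europages product, which is the behaviour the comment asks for.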
Thanks for your sharp eye!
devtools::load_all()
#> ℹ Loading tiltIndicator
library(dplyr, warn.conflicts = FALSE)
set.seed(123)
options(width = 200)
n <- 1e6
data <- tibble(x = rnorm(n), y = 1:n)
out <- data |>
summarize_range(x, .by = y) |>
jitter_range() |>
mutate(min_jitter_is_smaller = min_jitter < min) |>
mutate(max_jitter_is_bigger = max_jitter > max)
out
#> # A tibble: 1,000,000 × 7
#> y min max min_jitter max_jitter min_jitter_is_smaller max_jitter_is_bigger
#> <int> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
#> 1 1 -0.560 -0.560 -0.661 -0.538 TRUE TRUE
#> 2 2 -0.230 -0.230 -0.366 -0.181 TRUE TRUE
#> 3 3 1.56 1.56 1.51 1.56 TRUE TRUE
#> 4 4 0.0705 0.0705 -0.0763 0.0941 TRUE TRUE
#> 5 5 0.129 0.129 0.0850 0.231 TRUE TRUE
#> 6 6 1.72 1.72 1.70 1.79 TRUE TRUE
#> 7 7 0.461 0.461 0.444 0.626 TRUE TRUE
#> 8 8 -1.27 -1.27 -1.32 -1.22 TRUE TRUE
#> 9 9 -0.687 -0.687 -0.963 -0.645 TRUE TRUE
#> 10 10 -0.446 -0.446 -0.508 -0.305 TRUE TRUE
#> # ℹ 999,990 more rows
any(!out$min_jitter_is_smaller)
#> [1] FALSE
any(!out$max_jitter_is_bigger)
#> [1] FALSE
Created on 2023-11-20 with reprex v2.0.2
Answering 1, on average the jitter increases with
library(tidyverse)
devtools::load_all()
#> ℹ Loading tiltIndicator
# Helper
plot_jitter_amount <- function(amount) {
data <- tibble(x = rnorm(100))
jitter <- data |>
summarize_range(x, .by = x) |>
jitter_range(amount)
ggplot(jitter) +
geom_line(aes(x, x)) +
geom_point(aes(min_jitter, x), color = "blue") +
geom_point(aes(max_jitter, x), color = "red")
}
c(0.1, 0.2, 0.5, 0.7) |>
set_names() |>
map(plot_jitter_amount)
#> $`0.1`
Created on 2023-11-20 with reprex v2.0.2
As for a recommendation, a quick experiment suggests the average difference between the original values ( I'll experiment a bit more and try to find an algorithm and a value of amount that moves each side of the range about 10% on average, and evenly for both sides of the range.
Thanks @AnneSchoenauer, I tried a couple of ideas and realized that the algorithm can be a little tricky. For example, I tried to sample a random number from 1-100 and use that as a percent displacement of the points we want to jitter. That sounded neat to me, but then I realized that the case where a point is 0 would not be handled gracefully: for example, adding 10% of 0 to 0 is still 0. So I decided to be conservative and use the base R implementation of
Although we still lack precise control of the "percent" average jitter, we can calculate it after the fact. For example, this reprex shows a little helper to calculate the mean deviation for each side of the range. This would allow us to report the jitter range along with the calculated mean deviation of
library(tidyverse)
devtools::load_all()
#> ℹ Loading tiltIndicator
set.seed(123)
data <- tibble(x = rnorm(1000))
amount <- 0.2
mean_jitter_percent <- function(x, jitter) {
deviation <- abs(abs(x) - abs(jitter))
percent <- deviation * 100 / abs(x)
mean(percent[is.finite(percent)])
}
jitter <- data |>
summarize_range(x, .by = x) |>
jitter_range(amount = amount)
mean_jitter_percent(jitter$min, jitter$min_jitter)
#> [1] 48.40077
mean_jitter_percent(jitter$max, jitter$max_jitter)
#> [1] 54.14247
ggplot(jitter) +
geom_line(aes(x, x)) +
geom_point(aes(min_jitter, x), color = "blue") +
geom_point(aes(max_jitter, x), color = "red")
Created on 2023-11-22 with reprex v2.0.2
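The zero case described above can be seen in a small base R sketch. It assumes the base R function in question is `jitter()`, which the thread does not spell out; the vector `x` and the amount `0.2` are arbitrary illustrative choices.

```r
set.seed(123)
x <- c(0, 1, 10)

# Percent displacement: move each point by a random 1-100% of its own value.
percent <- sample(1:100, length(x), replace = TRUE)
multiplicative <- x + x * percent / 100
multiplicative[1]
#> [1] 0

# Additive noise: with amount > 0, base R jitter() adds uniform noise
# drawn from [-amount, amount], so it does not depend on the size of x.
additive <- jitter(x, amount = 0.2)
abs(additive - x) <= 0.2
#> [1] TRUE TRUE TRUE
```

The percent approach leaves the zero untouched whatever percent is drawn, while the additive noise can move every point, including the zero, by up to `amount`.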
Sounds really good Mauro! Thanks :)
Closes #554
This PR adds new helpers to (1) summarize the range of a column in a dataframe by groups, and (2) jitter the range, expanding it towards the left and right of the range.
NOTE: the amount of jitter is hard to set precisely via function arguments. Instead, we may call the function and then calculate the average deviation of the jitter. This PR does not formally introduce a helper to do that. See #624.
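For readers without the package at hand, the two ideas in this description can be approximated in base R. This is a rough sketch under stated assumptions, not the implementation in this PR: `summarize_range_sketch()` and `jitter_range_sketch()` are hypothetical names, and widening the range by the absolute value of `jitter()` noise is just one possible way to make the range expand on both sides.

```r
set.seed(123)

# Hypothetical stand-in for summarize_range(): min and max of `x` per group.
summarize_range_sketch <- function(x, by) {
  mins <- tapply(x, by, min)
  maxs <- tapply(x, by, max)
  data.frame(group = names(mins), min = as.vector(mins), max = as.vector(maxs))
}

# Hypothetical stand-in for jitter_range(): push `min` to the left and `max`
# to the right by the absolute size of the noise that jitter() would add.
jitter_range_sketch <- function(range, amount = 0.1) {
  noise_min <- abs(jitter(range$min, amount = amount) - range$min)
  noise_max <- abs(jitter(range$max, amount = amount) - range$max)
  range$min_jitter <- range$min - noise_min
  range$max_jitter <- range$max + noise_max
  range
}

data <- data.frame(x = rnorm(100), y = rep(letters[1:10], each = 10))
out <- jitter_range_sketch(summarize_range_sketch(data$x, data$y), amount = 0.2)

# The jittered range should enclose the original range on both sides.
all(out$min_jitter <= out$min)
all(out$max_jitter >= out$max)

# Average deviation of each side, calculated after the fact (percent of original).
mean(abs(out$min_jitter - out$min) * 100 / abs(out$min))
mean(abs(out$max_jitter - out$max) * 100 / abs(out$max))
```

Because the noise is additive, the percent deviation depends on the magnitude of the original values, which is why it is reported after the call rather than set through `amount`.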