Minor spelling / mathematical updates to the Scoring rules vignette. (#969)
- A bug was fixed where `crps_sample()` could fail in edge cases.
- Implemented a new forecast class, `forecast_ordinal`, with appropriate metrics. Ordinal forecasts are a form of categorical forecast. The main difference between ordinal and nominal forecasts is that the outcomes are ordered, rather than unordered.

This update represents a major rewrite of the package and introduces breaking changes. If you want to keep using the older version, you can download it using `remotes::install_github("epiforecasts/[email protected]")`.
The update aims to make the package more modular and customisable, and overall cleaner and easier to work with. In particular, we aimed to make the suggested workflows for evaluating forecasts more explicit and easier to follow (see visualisation below). To do that, we clarified input formats and made them consistent across all functions. We refactored many functions into S3 methods and introduced `forecast` objects with separate classes for different types of forecasts. A new set of `as_forecast_<type>()` functions was introduced to validate the data and convert inputs into a `forecast` object (a `data.table` with a `forecast` class and an additional class corresponding to the forecast type (see below)). Another major update is the possibility for users to pass their own scoring functions into `score()`. We updated and improved all function documentation and added new vignettes to guide users through the package. Internally, we refactored the code, improved input checks, updated notifications (which now use the `cli` package) and increased test coverage.
The most comprehensive documentation for the new package after the rewrite is the revised version of our original `scoringutils` paper.
- The main function of the package is still `score()`. However, we reworked the function and updated and clarified its input requirements.
- The previous columns "true_value" and "prediction" were renamed: `score()` now requires columns called "observed" and "predicted" (some functions still assume the existence of a `model` column as a default, but this is not a strict requirement). The column `quantile` was renamed to `quantile_level` and `sample` was renamed to `sample_id`.
- `score()` is now a generic. It has S3 methods for the classes `forecast_point`, `forecast_binary`, `forecast_quantile`, `forecast_sample`, and `forecast_nominal`, which correspond to the different forecast types that can be scored with `scoringutils`.
- `score()` now calls `na.omit()` on the data, instead of only removing rows with missing values in the columns `observed` and `predicted`. This is because `NA` values in other columns can also interfere with, for example, the grouping of forecasts according to the unit of a single forecast.
- `score()` and many other functions now require a validated `forecast` object. `forecast` objects can be created using the functions `as_forecast_point()`, `as_forecast_binary()`, `as_forecast_quantile()`, and `as_forecast_sample()` (which replace the previous `check_forecast()`). A forecast object is a `data.table` with class `forecast` and an additional class corresponding to the forecast type (e.g. `forecast_quantile`).
- `score()` now returns objects of class `scores` with a stored attribute `metrics` that holds the names of the scoring rules that were used. Users can call `get_metrics()` to access the names of those scoring rules.
- `score()` now returns one score per forecast, instead of one score per sample or quantile. For binary and point forecasts, the columns "observed" and "predicted" are now removed for consistency with the other forecast types.
- Users can now also use their own scoring rules (via the `metrics` argument, which takes a named list of functions). Default scoring rules can be accessed using the function `get_metrics()`, which is a generic with S3 methods for each forecast type. It returns a named list of scoring rules suitable for the respective forecast object. For example, you could call `get_metrics(example_quantile)`. Column names of scores in the output of `score()` correspond to the names of the scoring rules (i.e. the names of the functions in the list of metrics).
- Instead of supplying arguments to `score()` to manipulate individual scoring rules, users should now manipulate the metric list being supplied, using `purrr::partial()` and `select_metric()`. See `?score()` for more information; a sketch is given below.
- The CRPS is now reported as a decomposition into dispersion, overprediction and underprediction.
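A minimal sketch of the customisable-metrics workflow described above, assuming a validated forecast object. Treating `weigh` as an argument of `wis()` is an assumption made for illustration; any argument of any scoring rule could be fixed this way.

```r
library(scoringutils)
library(purrr)

forecast_quantile <- as_forecast_quantile(example_quantile)

# Start from the default scoring rules for this forecast type (a named list)
metrics <- get_metrics(forecast_quantile)

# Fix an argument of one scoring rule via purrr::partial()
# (`weigh = FALSE` is an assumed example argument, not a recommendation)
metrics$wis <- purrr::partial(wis, weigh = FALSE)

scores <- score(forecast_quantile, metrics = metrics)
```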
- Functionality to calculate the Probability Integral Transform (PIT) has been deprecated and replaced by functionality to calculate PIT histograms, using the `get_pit_histogram()` function. As part of this change, non-randomised PITs can now be calculated for count data, and this is done by default.
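A minimal sketch of the replacement workflow, assuming a sample-based forecast object; the `by` argument is an assumption based on the other `get_*()` functions described in this changelog.

```r
# Compute PIT histogram values per model from a sample-based forecast
forecast_sample <- as_forecast_sample(example_sample_continuous)
pit_hist <- get_pit_histogram(forecast_sample, by = "model")
```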
- The `as_forecast_<type>()` functions create a forecast object and validate it. They also allow users to rename or specify required columns and to specify the forecast unit in a single step, taking over the functionality of `set_forecast_unit()` in most cases. See `?as_forecast()` for more information.
- Some `as_forecast_<type>()` functions, e.g. `as_forecast_point()` and `as_forecast_quantile()`, have S3 methods for converting from another forecast type to the respective forecast type. For example, `as_forecast_quantile()` has a method for converting from a `forecast_sample` object to a `forecast_quantile` object by estimating quantiles from the samples (see the sketch below).
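A minimal sketch of such a conversion; the `probs` argument name is an assumption based on the description above.

```r
# Estimate quantiles from a sample-based forecast
forecast_sample <- as_forecast_sample(example_sample_continuous)
forecast_quantile <- as_forecast_quantile(
  forecast_sample,
  probs = c(0.05, 0.25, 0.5, 0.75, 0.95)  # assumed argument: quantile levels to estimate
)
```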
- An example workflow for scoring a forecast now looks like this:

```r
forecast_quantile <- as_forecast_quantile(
  example_quantile,
  observed = "observed",
  predicted = "predicted",
  model = "model",
  quantile_level = "quantile_level",
  forecast_unit = c("model", "location", "target_end_date", "forecast_date", "target_type")
)

scores <- score(forecast_quantile)
```
- Overall, we updated the suggested workflows for how users should work with the package. The following gives an overview (see the updated paper for more details).
- We standardised input formats both for `score()` and for the scoring rules exported by `scoringutils`. The following plot gives an overview of the expected input formats for the different forecast types in `score()`.
- Support for the interval format was mostly dropped (see PR #525 by @nikosbosse, reviewed by @seabbs). The co-existence of the quantile and interval formats led to a confusing user experience, with many duplicated functions providing the same functionality. We decided to simplify the interface by focusing exclusively on the quantile format.
- The function `bias_range()` was removed (users should now use `bias_quantile()` instead).
- The function `interval_score()` was made an internal function rather than being exported to users. We recommend using `wis()` instead.
- To create and validate a new `forecast` object, users can use `as_forecast_<type>()`. To revalidate an existing `forecast` object, users can call `assert_forecast()` (which validates the input and returns `invisible(NULL)`). `assert_forecast()` is a generic with methods for the different forecast types. Alternatively, users can call `as_forecast_<type>()` again to re-validate a forecast object. Simply printing the object will also provide some additional information.
- Users can test whether an object is of class `forecast_*` using the function `is_forecast()`. Users can also test for a specific `forecast_*` class using the appropriate `is_forecast.forecast_*` method. For example, to check whether an object is of class `forecast_quantile`, you would use `scoringutils:::is_forecast.forecast_quantile()` (see the sketch below).
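A minimal sketch of validating and type-checking a forecast object, assuming it was created as in the example above.

```r
forecast_quantile <- as_forecast_quantile(example_quantile)

assert_forecast(forecast_quantile)  # errors if invalid, otherwise returns invisible(NULL)
is_forecast(forecast_quantile)      # TRUE for any forecast object
scoringutils:::is_forecast.forecast_quantile(forecast_quantile)  # TRUE for this specific class
```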
- The functionality for computing pairwise comparisons was split out of `summarise_scores()`. Instead of doing pairwise comparisons as part of summarising scores, a new function, `add_relative_skill()`, was introduced that takes summarised scores as an input and adds columns with relative skill scores and scaled relative skill scores.
- The function `pairwise_comparison()` was renamed to `get_pairwise_comparisons()`, in line with other `get_`-functions. Analogously, `plot_pairwise_comparison()` was renamed to `plot_pairwise_comparisons()`.
- Output columns for pairwise comparisons have been renamed to contain the name of the metric used for comparing (see the sketch below).
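A hedged sketch of the reworked pairwise-comparison workflow. The argument names `compare` and `metric` are assumptions based on the descriptions above, and `add_relative_skill()` is applied here to summarised scores as the changelog describes; exact usage may differ.

```r
scores <- score(as_forecast_quantile(example_quantile))
summarised <- summarise_scores(scores, by = "model")
summarised <- add_relative_skill(summarised, compare = "model", metric = "wis")

pairwise <- get_pairwise_comparisons(scores, by = "model")
plot_pairwise_comparisons(pairwise)
```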
- Replaced warnings with errors in `get_pairwise_comparisons()` to avoid returning `NULL`.
- `add_coverage()` was replaced by a new function, `get_coverage()`. This function comes with an updated workflow where coverage values are computed directly based on the original data and can then be visualised using `plot_interval_coverage()` or `plot_quantile_coverage()`. An example workflow would be `example_quantile |> as_forecast_quantile() |> get_coverage(by = "model") |> plot_interval_coverage()`.
- The function `avail_forecasts()` was renamed to `get_forecast_counts()`. This reflects a change in the naming convention whereby functions that provide the user with additional useful information about the data are prefixed with "get_". See Issues #403 and #521 and PR #511 by @nikosbosse, reviewed by @seabbs, for details.
- For clarity, the output column in `get_forecast_counts()` was renamed from "Number forecasts" to "count".
- `get_forecast_counts()` now also displays combinations where there are 0 forecasts, instead of silently dropping the corresponding rows.
- `plot_avail_forecasts()` was renamed to `plot_forecast_counts()`, in line with the change in the function name. The `x` argument no longer has a default value, as the value will depend on the data provided by the user. A sketch is given below.
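A minimal sketch of the renamed functions; the `by` argument is assumed to work as for the other `get_*()` functions.

```r
forecast_quantile <- as_forecast_quantile(example_quantile)
counts <- get_forecast_counts(forecast_quantile, by = c("model", "target_type"))
plot_forecast_counts(counts, x = "model")  # `x` must now be supplied explicitly
```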
- The function `find_duplicates()` was renamed to `get_duplicate_forecasts()`.
- Renamed `interval_coverage_quantile()` to `interval_coverage()`.
- "range" was consistently renamed to "interval_range" in the code. The "range" format (which was mostly used internally) was renamed to the "interval" format.
- Renamed `correlation()` to `get_correlations()` and `plot_correlation()` to `plot_correlations()`.
- Removed `abs_error` and `squared_error` from the package in favour of `Metrics::ae` and `Metrics::se`.
- `get_duplicate_forecasts()` now sorts outputs according to the forecast unit, making it easier to spot duplicates. In addition, there is a `counts` option that allows the user to display the number of duplicates for each forecast unit, rather than the raw duplicated rows (see the sketch below).
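A minimal sketch of the `counts` option described above:

```r
get_duplicate_forecasts(example_quantile)                 # duplicated rows, sorted by forecast unit
get_duplicate_forecasts(example_quantile, counts = TRUE)  # number of duplicates per forecast unit
```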
- Deleted the function `plot_ranges()`. If you want to continue using the functionality, you can find the function code here or in the Deprecated-visualisations Vignette.
- Removed the function `plot_predictions()`, as well as its helper function `make_NA()`, in favour of a dedicated Vignette that shows different ways of visualising predictions. For future reference, the function code can be found here (Issue #659) or in the Deprecated-visualisations Vignette.
- Removed the function `plot_score_table()`. You can find the code in the Deprecated-visualisations Vignette.
- Removed the function `merge_pred_and_obs()` that was used to merge two separate data frames with forecasts and observations. We moved its contents to a new "Deprecated functions" Vignette.
- Removed `interval_coverage_sample()` as users are now expected to convert to a quantile format first before scoring.
- The function `set_forecast_unit()` was deleted. Instead, there is now a `forecast_unit` argument in `as_forecast_<type>()` as well as in `get_duplicate_forecasts()`.
- Removed `interval_coverage_dev_quantile()`. Users can still access the difference between nominal and actual interval coverage using `get_coverage()`.
- `pit()`, `pit_sample()` and `plot_pit()` have been removed and replaced by functionality to create PIT histograms (`pit_histogram_sample()` and `get_pit_histogram()`).
- `bias_quantile()` changed the way it handles forecasts where the median is missing: the median is now imputed by linear interpolation between the innermost quantiles. Previously, we imputed the median by simply taking the mean of the innermost quantiles.
- In contrast to the previous `correlation` function, `get_correlations()` doesn't round correlations by default. Instead, `plot_correlations()` now has a `digits` argument that allows users to round correlations before plotting them. Alternatively, using `dplyr`, you could call something like `mutate(correlations, across(where(is.numeric), \(x) signif(x, digits = 2)))` on the output of `get_correlations()`.
- `wis()` now errors by default if not all quantile levels form valid prediction intervals, and returns `NA` if there are missing values. Previously, `na.rm` was set to `TRUE` by default, which could lead to unexpected results if users were not aware of this (see the sketch below).
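A minimal sketch of the new `NA` behaviour, assuming `wis()` takes a vector of observations, a matrix of predicted quantiles, and the corresponding quantile levels:

```r
observed <- c(10, 25)
quantile_level <- c(0.25, 0.5, 0.75)
predicted <- rbind(
  c(8, 11, 14),
  c(20, NA, 32)  # one missing prediction
)

wis(observed, predicted, quantile_level)                # second score is NA by default
wis(observed, predicted, quantile_level, na.rm = TRUE)  # restores the old removal behaviour
```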
- The deprecated `..density..` was replaced with `after_stat(density)` in ggplot calls.
- Files ending in ".Rda" were renamed to ".rds" where appropriate when used together with `saveRDS()` or `readRDS()`.
- Added a subsetting `[` operator for scores, so that the score name attribute gets preserved when subsetting.
- Switched to using `cli` for condition handling and signalling, and added tests for all the `check_*()` and `test_*()` functions. See #583 by @jamesmbaazam, reviewed by @nikosbosse and @seabbs.
- `scoringutils` now requires R >= 4.0.
- Updated documentation for most functions and made sure all functions have documented return statements.
- Documentation pkgdown pages are now created both for the stable and dev versions.
- Added unit tests for many functions.
- `scoringutils` now depends on R 3.6. The change was made since the packages `testthat` and `lifecycle`, which are used in `scoringutils`, now require R 3.6. We also updated the GitHub Actions CI check to work with R 3.6.
- Added a new PR template with a checklist of things to be included in PRs to facilitate the development and review process.
- Fixed a bug with `set_forecast_unit()` where the function only worked with a data.table, but not a data.frame, as input.
- The metrics table in the vignette "Details on the metrics implemented in `scoringutils`" had duplicated entries. This was fixed by removing the duplicated rows.
- This minor update fixes a few issues related to GitHub Actions and the vignettes displayed at epiforecasts.io/scoringutils. It:
  - Gets rid of the preferably package in _pkgdown.yml. The theme had a toggle between light and dark mode that didn't work properly.
  - Updates the gh-pages deploy action to v4 and also cleans up files when triggered.
  - Introduces a GitHub Action to automatically render the README from README.Rmd.
  - Removes links to vignettes that have been renamed.
This major release contains a range of new features and bug fixes that have been introduced in minor releases since `1.1.0`. The most important changes are:
- Documentation updated to reflect changes since version 1.1.0, including new transform and workflow functions.
- New `set_forecast_unit()` function allows manual setting of the forecast unit.
- `summarise_scores()` gains a new `across` argument for summarising across variables.
- New `transform_forecasts()` and `log_shift()` functions allow forecast transformations. See the documentation for `transform_forecasts()` for more details and an example use case.
- Input checks and test coverage improved for bias functions.
- Bug fix in `get_prediction_type()` for integer matrix input.
- Links to scoringutils paper and citation updates.
- Warning added in `interval_score()` for small interval ranges.
- Linting updates and improvements.

Thanks to @nikosbosse, @seabbs, and @sbfnk for code and review contributions. Thanks to @bisaloo for the suggestion to use a linting GitHub Action that only triggers on changes, and @adrian-lison for the suggestion to add a warning to `interval_score()` if the interval range is between 0 and 1.
- The documentation was updated to reflect the recent changes since `scoringutils 1.1.0`. In particular, usage of the functions `set_forecast_unit()`, `check_forecasts()` and `transform_forecasts()` is now documented in the vignettes. The introduction of these functions enhances the overall workflow and helps to make the code more readable. All functions are designed to be used together with the pipe operator. For example, one can now use something like the following:
```r
example_quantile |>
  set_forecast_unit(c("model", "location", "forecast_date", "horizon", "target_type")) |>
  check_forecasts() |>
  score()
```
Documentation for `transform_forecasts()` has also been extended. This function allows the user to easily add transformations of forecasts, as suggested in the paper "Scoring epidemiological forecasts on transformed scales". In an epidemiological context, for example, it may make sense to apply the natural logarithm before scoring forecasts, in order to obtain scores that reflect how well models are able to predict exponential growth rates, rather than absolute values. Users can now do something like the following to score a transformed version of the data in addition to the original one:
```r
data <- example_quantile[true_value > 0, ]
data |>
  transform_forecasts(fun = log_shift, offset = 1) |>
  score() |>
  summarise_scores(by = c("model", "scale"))
```
Here we use the `log_shift()` function to apply a logarithmic transformation to the forecasts. This function was introduced in `scoringutils 1.1.2` as a helper function that acts just like `log()`, but has an additional argument `offset` that can add a number to every prediction and observed value before applying the log transformation.
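Concretely, `log_shift(x, offset = 1)` is equivalent to `log(x + 1)`:

```r
log_shift(c(0, 1, 9), offset = 1)
#> [1] 0.0000000 0.6931472 2.3025851
```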
- Made `check_forecasts()` and `score()` pipeable (see issue #290). This means that users can now directly use the output of `check_forecasts()` as input for `score()`. As `score()` otherwise runs `check_forecasts()` internally anyway, this simply makes the step explicit and helps writing clearer code.
Release by @seabbs in #305. Reviewed by @nikosbosse and @sbfnk.
- The `prediction_type` argument of `get_forecast_unit()` has been dropped. Instead, a new internal function `prediction_is_quantile()` is used to detect if a quantile variable is present. Whilst this is an internal function, it may impact some users, as it is accessible via `find_duplicates()`.
- Made imputation of the median in `bias_range()` and `bias_quantile()` more obvious to the user, as this may cause unexpected behaviour.
- Simplified `bias_range()` so that it uses `bias_quantile()` internally.
- Added additional input checks to `bias_range()`, `bias_quantile()`, and `check_predictions()` to make sure that the input is valid.
- Improved the coverage of unit tests for `bias_range()`, `bias_quantile()`, and `bias_sample()`.
- Updated pairwise comparison unit tests to use more realistic data.
- Fixed a bug in `get_prediction_type()` which left it unable to correctly detect integer forecasts (instead categorising them as continuous) when the input was a matrix. This issue affected `bias_sample()` and also `score()` when used with integer forecasts, resulting in lower bias scores than expected.
- Added a new argument, `across`, to `summarise_scores()`. This argument allows the user to summarise scores across different forecast units as an alternative to specifying `by`. See the documentation for `summarise_scores()` for more details and an example use case; a sketch is also given below.
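A minimal sketch of the two alternatives, assuming the example data contains a `horizon` column that is part of the forecast unit:

```r
scores <- score(example_quantile)

# Specify what to keep ...
summarise_scores(scores, by = c("model", "target_type"))

# ... or specify what to summarise across:
summarise_scores(scores, across = "horizon")
```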
- Added a new function, `set_forecast_unit()`, that allows the user to set the forecast unit manually. The function removes all columns that are not relevant for uniquely identifying a single forecast. If not done manually, `scoringutils` attempts to determine the unit of a single forecast automatically by simply assuming that all column names are relevant to determining the forecast unit. This can lead to unexpected behaviour, so setting the forecast unit explicitly can help make the code easier to debug and easier to read (see issue #268). When used as part of a workflow, `set_forecast_unit()` can be directly piped into `check_forecasts()` to check that everything is in order.
- Added links to the scoringutils paper "Evaluating Forecasts with scoringutils in R" to the package.
- Updated citation formatting to comply with newer standards.
- Added a warning to `interval_score()` if the interval range is between 0 and 1. Thanks to @adrian-lison (see #277) for the suggestion.
- Switched to a linting GitHub Action that only triggers on changes. Inspired by @bisaloo's recent contribution to the `epinowcast` package.
- Updated package linters to be more extensive. Inspired by @bisaloo's recent contribution to the `epinowcast` package.
- Resolved all flagged linting issues across the package.
- Added a new function, `transform_forecasts()`, to make it easy to transform forecasts before scoring them, as suggested in Bosse et al. (2023), https://www.medrxiv.org/content/10.1101/2023.01.23.23284722v1.
- Added a function, `log_shift()`, that implements the default transformation function. The function allows adding an offset before applying the logarithm.
- Added a small change to `interval_score()` which explicitly converts the logical vector to a numeric one. This should happen implicitly anyway, but is now done explicitly in order to avoid issues that may come up if the input vector has a type that doesn't allow the implicit conversion.
A minor update to the package with some bug fixes and minor changes.
- Removed the on-attach message which warned of breaking changes in `1.0.0`.
- Renamed the `metric` argument of `summarise_scores()` to `relative_skill_metric`. The old argument is now deprecated and will be removed in a future version of the package. Please use the new argument instead.
- Updated the documentation for `score()` and related functions to make the soft requirement for a `model` column in the input data more explicit.
- Updated the documentation for `score()`, `pairwise_comparison()` and `summarise_scores()` to make it clearer what the unit of a single forecast is that is required for computations.
- Simplified the function `plot_pairwise_comparison()`, which now only supports plotting mean score ratios or p-values, and removed the hybrid option to print both at the same time.
- Missing baseline forecasts in `pairwise_comparison()` now trigger an explicit and informative error message.
- The requirements table in the getting started vignette is now correct.
- Added support for an optional `sample` column when using a quantile forecast format. Previously this resulted in an error.
Major update to the package and most package functions with lots of breaking changes.
- New and updated Readme and vignette.
- The proposed scoring workflow was reworked. Functions were changed so they can easily be piped and have simplified arguments and outputs.
- The function `eval_forecasts()` was replaced by a function `score()` with a much reduced set of function arguments.
- Functionality to summarise scores and to add relative skill scores was moved to a function `summarise_scores()`.
- New function `check_forecasts()` to analyse input data before scoring.
- New function `correlation()` to compute correlations between different metrics.
- New function `add_coverage()` to add coverage for specific central prediction intervals.
- New function `avail_forecasts()` to visualise the number of available forecasts.
- New function `find_duplicates()` to find duplicate forecasts, which cause an error.
- All plotting functions were renamed to begin with `plot_`. Arguments were simplified.
- The function `pit()` now works based on data.frames. The old `pit` function was renamed to `pit_sample()`. PIT p-values were removed entirely.
- The function `plot_pit()` now works directly with input as produced by `pit()`.
- Many data-handling functions were removed and input types for `score()` were restricted to sample-based, quantile-based or binary forecasts.
- The function `brier_score()` now returns all Brier scores, rather than taking the mean before returning an output.
- `crps()`, `dss()` and `logs()` were renamed to `crps_sample()`, `dss_sample()`, and `logs_sample()`.
- Testing was expanded.
- Minor bugs were fixed, for example a bug in the `as_forecast_quantile()` function (#223).
- Package data is now based on forecasts submitted to the European Forecast Hub (https://covid19forecasthub.eu/).
- All example data files were renamed to begin with `example_`.
- A new data set, `summary_metrics`, was included that contains a summary of the metrics implemented in `scoringutils`.
- The 'sharpness' component of the weighted interval score was renamed to 'dispersion'. This was done to make clearer what the component represents and to maintain consistency with what is used in other places.
- Added a function `check_forecasts()` that runs some basic checks on the input data and provides feedback.
- Minor bug fixes (previously, 'interval_score' needed to be among the selected metrics).
- All data.tables are now returned as `table[]` rather than as `table`, so that they don't have to be called twice to display their contents.
- Added a function, `pairwise_comparison()`, that runs pairwise comparisons between models on the output of `eval_forecasts()`.
- Added functionality to compute relative skill within `eval_forecasts()`.
- Added a function to visualise pairwise comparisons.
- The WIS definition change introduced in version 0.1.5 was partly corrected, such that the difference in weighting is only introduced when summarising over scores from different interval ranges.
- "sharpness" was renamed to "mad" in the output of `score()` for sample-based forecasts.
- `eval_forecasts()` can now handle a separate forecast and truth data set as input.
- `eval_forecasts()` now supports scoring point forecasts alongside quantiles in a quantile-based format. Currently the only metric used is the absolute error.
- Many functions, especially `eval_forecasts()`, got a major rewrite. While functionality should be unchanged, the code should now be easier to maintain.
- Some of the data-handling functions got renamed, but old names are supported as well for now.
- Changed the default definition of the weighted interval score. Previously, the median prediction was counted twice, but it is now only counted once. If you want to go back to the old behaviour, you can call the `interval_score()` function with the argument `count_median_twice = TRUE`.
- Added basic plotting functionality to visualise scores. You can now easily obtain diagnostic plots based on scores as produced by `score()`.
- `correlation_plot()` shows correlation between metrics.
- `plot_ranges()` shows the contribution of different prediction intervals to some chosen metric.
- `plot_heatmap()` visualises scores as a heatmap.
- `plot_score_table()` shows a coloured summary table of scores.
- Renamed "calibration" to "coverage".
- Renamed "true_values" to "true_value" in data.frames.
- Renamed "predictions" to "prediction" in data.frames.
- Renamed "is_overprediction" to "overprediction".
- Renamed "is_underprediction" to "underprediction".
- The `by` argument in `score()` now has a slightly changed meaning. It now denotes the lowest possible grouping unit, i.e. the unit of one observation, and needs to be specified explicitly. The default is now `NULL`. The reason for this change is that most metrics need scoring on the observation level, and this is the most consistent implementation of this principle. The `pit` function now receives its grouping from `summarise_by`. In a similar spirit, `summarise_by` has to be specified explicitly and e.g. no longer assumes that you want 'range' to be included.
- For the interval score, `weigh = TRUE` is now the default option.
is now the default option. - Renamed true_values to true_value and predictions to prediction.
- Updated quantile evaluation metrics in `score()`. Bias as well as calibration now take all quantiles into account.
- Included an option to summarise scores according to a `summarise_by` argument in `score()`. The summary can return the mean, the standard deviation, as well as an arbitrary set of quantiles.
- `score()` can now return PIT histograms.
- Switched to `ggplot2` for plotting.
- All scores in `score()` were consistently renamed to lower case. `Interval_score` is now `interval_score`, `CRPS` is now `crps`, etc.
- Included support for grouping scores according to a vector of column names in `score()`.
- Included support for passing down arguments to lower-level functions in `score()`.
- Included support for three new metrics to score quantiles with `score()`: bias, sharpness and calibration.
- Example data now has a horizon column to illustrate the use of grouping.
- Documentation updated to explain the above listed changes.
- Included support for long as well as wide input formats for quantile forecasts that are scored with `score()`.
- Updated documentation for `score()`.
- Added badges to the `README`.