Skip to content

Commit

Permalink
Merge pull request #21 from reconhub/sha-option
Browse files Browse the repository at this point in the history
add hash function option for hash_names
  • Loading branch information
zkamvar authored May 13, 2019
2 parents d9abd97 + 0fec040 commit e8aabf1
Show file tree
Hide file tree
Showing 11 changed files with 234 additions and 123 deletions.
9 changes: 7 additions & 2 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,13 @@
`fit_disc_gamma()` to fit a discretized gamma distribution to the empirical
incubation distribution results (@ffinger, #13).

## New Features

- `clean_labels()` gains the `protect` argument to protect meaningful symbols
in the data.
- `hash_names()` now has the `hashfun` option that allows users to specify
either a "fast" or "secure" hashing function to use (@zkamvar, #21).

## New Vignettes

- "Estimating Incubation Period" shows the use of empirical incubation dist.
Expand All @@ -22,8 +29,6 @@
- `clean_labels()` can now handle non-latin characters and gains the `trans_id`
argument, which allows the user to customise the transformations
(see https://github.com/reconhub/epitrix/issues/19 for details).
- `clean_labels()` gains the `protect` argument to protect meaningful symbols
in the data.

# epitrix 0.2.2

Expand Down
71 changes: 53 additions & 18 deletions R/hash_names.R
Original file line number Diff line number Diff line change
Expand Up @@ -5,20 +5,22 @@
#' then each entry is hashed. The function can either return a full detailed
#' output, or short labels ready to use for 'anonymised data'.
#' Before concatenation (using "_" as a separator) to form labels,
#' inputs are modified using \code{\link{clean_labels}}.
#' inputs are modified using [clean_labels()]
#'
#' The argument \code{salt} should be used for salting the algorithm, i.e. adding
#' The argument `salt` should be used for salting the algorithm, i.e. adding
#' an extra input to the input fields (the 'salt') to change the resulting hash
#' and prevent identification of individuals via pre-computed hash
#' tables.
#'
#' It is highly recommend to choose a secret, random salt in order make it harder
#' for an attacker to decode the hash.
#'
#' @seealso \code{\link{clean_labels}}, used to clean labels prior to hashing.
#' @seealso [clean_labels()], used to clean labels prior to hashing\cr
#' [sodium::hash()] for available hashing functions.
#'
#' @author Thibaut Jombart \email{thibautjombart@@gmail.com},
#' Dirk Shchumacher \email{mail@@dirk-schumacher.net}
#' Dirk Shchumacher \email{mail@@dirk-schumacher.net},
#' Zhian N. Kamvar \email{zkamvar@@gmail.com}
#'
#' @export
#'
Expand All @@ -30,9 +32,17 @@
#' \code{data.frame}, including original labels, shortened hash, and full
#' hash.
#'
#' @param hashfun This defines the hashing function to be used. If you specify
#' "secure" (default), it will use [sodium::scrypt()], which will be secure,
#' but will be slow for large data sets. For fast hashing with no colisions,
#' you can sepecify "fast", and it will use [sodium::sha256()], which is
#' several orders of magnitude faster than [sodium::scrypt()]. You can also
#' specify a hashing function that takes and returns a [raw][base::raw]
#' vector of bytes that can be converted to character with [rawToChar()].
#'
#' @param salt An optional object that can be coerced to a character
#' to be used to 'salt' the hashing algorithm (see details).
#' Ignored if \code{NULL} (default).
#' Ignored if `NULL`.
#'
#' @param clean_labels A logical indicating if labels of variables should be
#' standardized; defaults to `TRUE`
Expand All @@ -43,18 +53,25 @@
#' last_name <- c("Doe", "Smith", "Dupont")
#' age <- c(25, 69, 36)
#'
#' hash_names(first_name, last_name, age)
#' # secure hashing
#' hash_names(first_name, last_name, age, hashfun = "secure")
#'
#' # fast hashing
#' hash_names(first_name, last_name, age,
#' size = 8, full = FALSE)
#' size = 8, full = FALSE, hashfun = "fast")
#'
#'
#' ## salting the hashing (more secure!)
#'
#' hash_names(first_name, last_name) # unsalted - less secure
#' hash_names(first_name, last_name, salt = 123) # salted with an integer
#' hash_names(first_name, last_name, salt = "foobar") # salted with an character
#'
#' ## using a different hash algorithm if you want things to run faster
#'
#' hash_names(first_name, last_name, hashfun = "fast") # use sha256 algorithm

hash_names <- function(..., size = 6, full = TRUE, salt = NULL, clean_labels = TRUE) {
hash_names <- function(..., size = 6, full = TRUE, hashfun = "secure", salt = NULL, clean_labels = TRUE) {
x <- list(...)
x <- lapply(x, function(e) paste(unlist(e)))

Expand All @@ -65,8 +82,11 @@ hash_names <- function(..., size = 6, full = TRUE, salt = NULL, clean_labels = T
lab <- do.call(paste_, x)


## create the hashing function that returns a character string
hashfun <- hash(salt, f = hashfun)
## hash it all
hash <- vapply(lab, hash(salt), NA_character_)
hash <- vapply(lab, hashfun, NA_character_)
## trim the results
hash_short <- substr(hash, 1, size)

if (full) {
Expand All @@ -82,15 +102,30 @@ hash_names <- function(..., size = 6, full = TRUE, salt = NULL, clean_labels = T
return(out)
}

hash <- function(salt = NULL) {
stopifnot(is.null(salt) || length(salt) == 1L)
salt <- if (is.null(salt)) {
raw(32L)
} else {
sodium::hash(charToRaw(as.character(salt)))
hash <- function(salt = NULL, f = sodium::scrypt) {
if (is.character(f)) {
f <- match.arg(tolower(f), c("secure", "fast"))
f <- if (f == "secure") sodium::scrypt else sodium::sha256
}
function(x) {
stopifnot(is.character(x))
sodium::bin2hex(sodium::scrypt(charToRaw(x), salt = salt))
# First check if the hashing function has "salt" in the arguments
if (any(names(formals(f)) == "salt")) {
# if it does, create a salt
stopifnot(is.null(salt) || length(salt) == 1L)
salt <- if (is.null(salt)) {
raw(32L)
} else {
sodium::hash(charToRaw(as.character(salt)))
}
function(x) {
stopifnot(is.character(x))
sodium::bin2hex(f(charToRaw(x), salt = salt))
}
} else {
# if it does not, append the salt (if applicable)
function(x, s = salt) {
x <- if (is.null(s)) x else paste(x, s, sep = "_")
stopifnot(is.character(x))
sodium::bin2hex(f(charToRaw(x)))
}
}
}
32 changes: 18 additions & 14 deletions docs/articles/epitrix.html

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit e8aabf1

Please sign in to comment.