Simulating Likert-type questionnaire data with R

I have been working on multiple projects that involved analysing questionnaires with “Likert scales”, i.e. ordinal variables, or pseudo-continuous variables obtained by summing the ordinal items. Before undertaking the experiments proper, we should conduct robust power analyses, code testing, sanity checks, etc., to ensure that data collection and analysis go as smoothly as possible. In complex settings such as multivariate analyses or multilevel modelling, simulation is a powerful tool for these tasks, since it lets us test intricate computations on synthetic data.

I found out (after the deed, of course) that several packages already existed that could have eased my work. The closest to what I coded here is the LikertMakeR package, which is really comprehensive. Also check out latent2likert for a more item-based simulation approach. Note: this is not a standalone package, because I don’t think the functions add a significant improvement over those from the packages above, but I like the straightforward solutions I came up with, which is why I document them here.1

source("scripts/simulate_questionnaires.R")
source("scripts/plot_questionnaires.R")

Simulate items for a given score

My initial problem was to find a way to simulate a fixed number of bounded ordinal variables (questionnaire items) that sum to a given score. I wanted to be able to simulate the score distributions of a whole population on a multi-item questionnaire based on literature or assumptions, then simulate the individual items making up the scale (the reverse of more “item-based” approaches like latent2likert). This resulted in the simulate_items function2: given a score, a number of items, and minimum and maximum item values, it returns a vector of simulated items.

# Subject with a score of 32 on a 12-item questionnaire ranging from 1 to 5
simulate_items(score = 32, n_items = 12, min_item = 1, max_item = 5)
 [1] 1 4 2 2 3 2 3 3 2 1 4 5
# Subject with a score of 35 on an 8-item questionnaire ranging from 1 to 7
simulate_items(35, 8, 1, 7, verbose = TRUE)
Item scores are  7 7 2 2 4 4 5 4  with a total score of  35 

[1] 7 7 2 2 4 4 5 4
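
For intuition, here is a minimal sketch of one way such a function could work — an illustration only, not necessarily the repository’s actual algorithm: draw random items, then nudge randomly chosen items up or down until they sum to the target.

# Illustrative sketch only; the actual simulate_items may differ
simulate_items_sketch <- function(score, n_items, min_item, max_item) {
  # The target score must be attainable within the item bounds
  stopifnot(score >= n_items * min_item, score <= n_items * max_item)
  items <- sample(min_item:max_item, n_items, replace = TRUE)
  # Nudge randomly chosen items up or down until the sum hits the target
  while (sum(items) != score) {
    i <- sample(n_items, 1)
    if (sum(items) < score && items[i] < max_item) {
      items[i] <- items[i] + 1
    } else if (sum(items) > score && items[i] > min_item) {
      items[i] <- items[i] - 1
    }
  }
  items
}

simulate_items_sketch(score = 32, n_items = 12, min_item = 1, max_item = 5)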

The function can be mapped over a distribution of scores to simulate the items for a whole sample of subjects.

library(tidyverse)  # tibble, dplyr and tidyr are used below (if not already loaded by the sourced scripts)

n_subjects <- 50

# Normal distribution of scores (with mean 32 and SD 5, draws outside the
# attainable 12-60 range are vanishingly unlikely)
df <- 
  tibble(
    subject = 1:n_subjects,
    score = rnorm(n_subjects, mean = 32, sd = 5) |> round()
  ) |>
  rowwise() |> 
  mutate(item = list(simulate_items(score, n_items = 12, min_item = 1, max_item = 5))) |> 
  unnest_wider(item, names_sep = "_")

display(head(df))
subject score item_1 item_2 item_3 item_4 item_5 item_6 item_7 item_8 item_9 item_10 item_11 item_12
1 35 4 1 3 2 1 4 1 5 2 4 3 5
2 23 3 2 5 2 1 1 1 1 1 1 3 2
3 30 2 2 3 2 1 2 4 2 3 5 2 2
4 34 2 4 3 2 4 3 5 3 2 3 1 2
5 27 2 2 1 3 2 5 3 2 2 1 3 1
6 28 4 1 2 1 4 1 3 3 2 3 1 3

On this basis, I created the simulate_questionnaires function, which simulates scores:

  • For a given number of subjects.

  • For several scales or sub-scales with different, optionally skewed, distributions, which can be correlated3 (see the sketch after this list).

  • Using either the sample mean of total scores (the sum of all items, e.g. M = 45/80) or the sample mean of item scores (e.g. M = 3.5/5); the two are equivalent up to a factor of the number of items.

  • Optionally simulating every individual item for each subject with simulate_items.
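
Correlated scores of this kind are typically obtained from a multivariate normal draw whose covariance matrix is built from the requested correlations and standard deviations. Below is a minimal sketch using MASS::mvrnorm; this is an assumption about the mechanism, and simulate_questionnaires may proceed differently, in particular for skewed distributions.

# Sketch of correlated score generation (assumed mechanism, normal case only)
means <- c(33.2, 24.5)
sds   <- c(8.2, 4.4)
R     <- matrix(c(1, 0.02, 0.02, 1), nrow = 2)  # requested correlation matrix
Sigma <- diag(sds) %*% R %*% diag(sds)          # corresponding covariance matrix

scores <- MASS::mvrnorm(1000, mu = means, Sigma = Sigma) |> round()
round(cor(scores), 2)  # should be close to the requested correlation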

df <- 
  simulate_questionnaires(
    n_subjects = 1000,
    names = c("scale_1", "scale_2"),  
    distrib = c("skew_normal", "normal"), 
    method = "score_means",
    means = c(33.2, 24.5),   
    sds = c(8.2, 4.4),       
    skews = c(-0.4, 0),
    corrs = c(0.02),
    n_items = c(10, 10),
    min_item = c(1, 1),
    max_item = c(5, 5),
    add_items = TRUE,
    print_corrs = TRUE
    )
Expected correlations:
     [,1] [,2]
[1,] 1.00 0.02
[2,] 0.02 1.00

Simulated correlations:
              score_scale_1 score_scale_2
score_scale_1          1.00          0.03
score_scale_2          0.03          1.00
Data frame:
display(head(df))
subject score_scale_1 score_scale_2 mean_scale_1 mean_scale_2 scale_1_item_1 scale_1_item_2 scale_1_item_3 scale_1_item_4 scale_1_item_5 scale_1_item_6 scale_1_item_7 scale_1_item_8 scale_1_item_9 scale_1_item_10 scale_2_item_1 scale_2_item_2 scale_2_item_3 scale_2_item_4 scale_2_item_5 scale_2_item_6 scale_2_item_7 scale_2_item_8 scale_2_item_9 scale_2_item_10
1 47 23 4.70 2.30 5 3 5 5 5 5 5 4 5 5 2 3 3 2 2 3 2 2 2 2
2 31 33 3.10 3.30 3 4 2 3 2 2 2 5 3 5 4 4 1 3 3 4 5 3 3 3
3 39 28 3.90 2.80 1 5 2 3 5 5 5 5 5 3 2 5 4 3 2 4 2 3 2 1
4 39 20 3.90 2.00 5 2 4 5 4 4 2 3 5 5 2 2 2 2 2 2 1 3 3 1
5 26 13 2.60 1.30 1 3 1 4 4 3 1 3 2 4 2 1 2 1 1 1 2 1 1 1
6 45 28 4.50 2.80 5 5 2 5 5 4 5 5 5 4 3 4 2 3 3 3 3 1 4 2

A plotting function is also provided:

scores <- plot_questionnaires(df, var = "score", questionnaire = "survey")
means  <- plot_questionnaires(df, var = "mean",  questionnaire = "survey") 

# Laying out the plots with the `patchwork` package
scores + means + plot_layout(guides = "collect")

Example simulation of the OSIVQ

Let’s see a more realistic simulation for a cognitive styles questionnaire I use often: the Object-Spatial Imagery and Verbal Questionnaire (OSIVQ, Blazhenkova & Kozhevnikov, 2009). The OSIVQ has 45 items divided into three scales with 15 items each and the following properties:

  • Object scale (OSIVQ-O), M = 3.63, SD = 0.62
  • Spatial scale (OSIVQ-S), M = 2.83, SD = 0.66
  • Verbal scale (OSIVQ-V), M = 3.00, SD = 0.68

Blazhenkova and Kozhevnikov also specify that the OSIVQ-O is negatively skewed (skewness = -0.392), the other two scales being normally distributed. They report correlations of -0.03 between O and S, 0.12 between O and V, and -0.18 between S and V. (With the item_means method used below, an item mean of 3.63 on 15 items corresponds to a mean total score of 3.63 × 15 ≈ 54.5.)

df_osivq <- 
  simulate_questionnaires(
    n_subjects = 1000,
    names = c("osivq_o", "osivq_s", "osivq_v"),  
    distrib = c("skew_normal", "normal", "normal"), 
    method = "item_means",
    means = c(3.63, 2.83, 3),   
    sds = c(0.62, 0.66, 0.68),       
    skews = c(-0.392, 0, 0),
    corrs = c(-0.03, 0.12, -0.18),
    n_items = c(15, 15, 15),
    min_item = c(1, 1, 1),
    max_item = c(5, 5, 5),
    add_items = FALSE,
    print_corrs = TRUE
    )
Expected correlations:
      [,1]  [,2]  [,3]
[1,]  1.00 -0.03  0.12
[2,] -0.03  1.00 -0.18
[3,]  0.12 -0.18  1.00

Simulated correlations:
             mean_osivq_o mean_osivq_s mean_osivq_v
mean_osivq_o         1.00        -0.04         0.07
mean_osivq_s        -0.04         1.00        -0.15
mean_osivq_v         0.07        -0.15         1.00
Data frame:
display(head(df_osivq))
subject score_osivq_o score_osivq_s score_osivq_v mean_osivq_o mean_osivq_s mean_osivq_v
1 60 44 40 4.00 2.95 2.72
2 53 69 47 3.58 4.62 3.17
3 51 42 47 3.45 2.85 3.18
4 59 44 54 3.95 2.96 3.66
5 55 41 45 3.72 2.75 3.06
6 53 48 65 3.58 3.26 4.39
osivq_scores <- plot_questionnaires(df_osivq, var = "score", questionnaire = "OSIVQ")
osivq_means  <- plot_questionnaires(df_osivq, var = "mean", questionnaire = "OSIVQ")

osivq_scores + osivq_means + plot_layout(guides = "collect")

This looks pretty similar to the data presented in the original article! :tada:

VVIQ

This repository also contains code to simulate data for the Vividness of Visual Imagery Questionnaire (VVIQ, Marks, 1973). This case was quite specific: the VVIQ distribution is not normal, and I did not have precise statistical parameters from the literature, only general visual imagery prevalence data. According to Wright et al., 2024, the VVIQ distribution has the following characteristics:

  • 0.9% of the population score 16 (aphantasia)
  • 3.3% score between 17 and 32 (hypophantasia)
  • 89.7% score between 33 and 74 (typical imagery)
  • 6.1% score between 75 and 80 (hyperphantasia)

… And that’s it. I solved this by creating four distributions, one for each group, and then sampling from this mixture of distributions. This resulted in the simulate_vviq function, which creates a data frame with a given number of subjects, simulates VVIQ total and mean scores for the four groups, and optionally simulates each item’s scores with the simulate_items function.
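
The mixture idea can be sketched as follows: draw a group for each subject with the reported prevalences, then draw a score within that group’s band. This is illustrative only; simulate_vviq presumably uses smoother within-group distributions than the uniform bands assumed here.

# Illustrative mixture sketch; simulate_vviq's internals may differ
n <- 1000
groups <- sample(
  c("aph", "hypo", "typical", "hyper"),
  size = n, replace = TRUE,
  prob = c(0.009, 0.033, 0.897, 0.061)  # prevalences from Wright et al., 2024
)

# Crude placeholder: uniform scores within each group's score band
bands <- list(aph = 16:16, hypo = 17:32, typical = 33:74, hyper = 75:80)
draw_score <- function(g) {
  band <- bands[[g]]
  if (length(band) == 1) band else sample(band, 1)
}
scores <- vapply(groups, draw_score, integer(1))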

df_vviq <- simulate_vviq(n_subjects = 1000, add_items = TRUE)

df_vviq |> 
  group_by(group) |> 
  slice(1) |> 
  display()
subject group score_vviq mean_vviq vviq_item_1 vviq_item_2 vviq_item_3 vviq_item_4 vviq_item_5 vviq_item_6 vviq_item_7 vviq_item_8 vviq_item_9 vviq_item_10 vviq_item_11 vviq_item_12 vviq_item_13 vviq_item_14 vviq_item_15 vviq_item_16
105 aph 16 1.00 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
17 hypo 32 2.00 1 2 5 1 1 1 1 4 2 5 2 1 2 2 1 1
1 typical 67 4.19 2 5 5 5 5 2 2 5 5 5 5 4 5 5 2 5
24 hyper 78 4.88 5 5 3 5 5 5 5 5 5 5 5 5 5 5 5 5

A plot_vviq function is also provided to plot the distributions of the scores and means of the VVIQ.

vviq_scores <- df_vviq |> plot_vviq(var = "score", print = FALSE)
vviq_means  <- df_vviq |> plot_vviq(var = "mean", print = FALSE) + 
  labs(title = NULL, y = NULL)

vviq_scores + vviq_means + plot_layout(guides = "collect") & theme(legend.position = "bottom")

I believe the structures presented in these scripts could be useful to anyone who needs to simulate Likert-type questionnaire data, or could at least provide some inspiration for doing so. I hope they serve you well! :cherry_blossom:

Note: this repository is a Quarto project equipped with a renv R environment to keep the package versions stable. The repository is based on this Quarto project template: you can find a quick tutorial on using this project structure and an in-depth explanation of its elements in the template’s README.

Footnotes

  1. I still added them to my (secret 👀) personal package though, to access them quickly without copy-pasting.

  2. The function is close in purpose to the makeItemsScale function from the LikertMakeR package.

  3. Conceptually, this could be a simulation of multiple correlated questionnaires (Q1 with several items on a construct correlated with Q2 on another construct) or a questionnaire with correlated sub-scales… Or both at the same time, just name your scales however you want.
