Add metrics to evaluate fidelity of longitudinal datasets #198
Labels
data:sequential
Related to timeseries datasets
feature request
Request for a new feature
under discussion
Issue is currently being discussed
Suggested tests
Conditional probability distribution in simulated vs. original - the conditional probability of Event B given Event A (B|A) is calculated as the probability of seeing Event B within X days of Event A’s start.
Bag of words - Event frequencies can be used to define a feature matrix per person, with centroids created for the original dataset. The number of people assigned to each of these centroids in the original vs. the simulated dataset can then be compared using a distance metric.
Event durations -
Time to event analysis - This test requires an event to be marked as a reference event (e.g. the first event that is recorded for a subject). The reference event occurs in each subject’s timeline.
Event sequence length distribution - Event sequence length is the number of events recorded for each person. The distance between the event sequence length distributions of the original and simulated datasets can then be calculated.
N-gram frequency - An event sequence can be generated per person by listing the events experienced by that person/unit, ordered by event start date. N-grams can then be computed from this per-person sequence of events. Fidelity is quantified using MSE/R² on the N-gram frequencies in the simulated vs. original datasets, for varying values of N.
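As a rough illustration of the N-gram frequency test, here is a minimal Python sketch using only the standard library. The function names (`event_sequences`, `ngram_frequencies`, `compare_ngrams`) and the `(subject_id, event, start_date)` tuple layout are assumptions made for this example, not an existing API. It also assumes frequencies are normalized by the total N-gram count and that an N-gram missing from one dataset counts as frequency 0; other conventions are possible.

```python
from collections import Counter
from datetime import date


def event_sequences(records):
    """Group (subject_id, event, start_date) tuples into per-subject
    event lists ordered by start date."""
    by_subject = {}
    for subject_id, event, start in records:
        by_subject.setdefault(subject_id, []).append((start, event))
    return {
        sid: [event for _, event in sorted(rows, key=lambda r: r[0])]
        for sid, rows in by_subject.items()
    }


def ngram_frequencies(sequences, n):
    """Relative frequency of each N-gram across all subjects."""
    counts = Counter()
    for seq in sequences.values():
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def compare_ngrams(original, simulated, n):
    """Return (MSE, R2) between N-gram frequencies of the two datasets.

    The original frequencies are treated as the reference values.
    """
    f_orig = ngram_frequencies(event_sequences(original), n)
    f_sim = ngram_frequencies(event_sequences(simulated), n)
    grams = sorted(set(f_orig) | set(f_sim))
    if not grams:
        return float("nan"), float("nan")
    y = [f_orig.get(g, 0.0) for g in grams]
    y_hat = [f_sim.get(g, 0.0) for g in grams]
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    mean_y = sum(y) / len(y)
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    mse = ss_res / len(grams)
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return mse, r2


# Toy data, purely illustrative.
original = [
    ("s1", "A", date(2020, 1, 1)),
    ("s1", "B", date(2020, 1, 5)),
    ("s1", "C", date(2020, 1, 9)),
    ("s2", "A", date(2020, 2, 1)),
    ("s2", "B", date(2020, 2, 3)),
]
simulated = [
    ("x1", "B", date(2020, 1, 4)),
    ("x1", "A", date(2020, 1, 1)),
]

for n in (1, 2):
    print(n, compare_ngrams(original, simulated, n))
```

Running the comparison over a range of N gives one (MSE, R²) pair per N. How those pairs would be aggregated into a single score is left open here.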
Definition
SubjectID = Identifier for a subject
Event = An identifier to define the type of event
Start date = date of event starting
End date = date of event ending
Event Duration = days between Start date and End date
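To make the definitions concrete, the sketch below models one record as a small Python dataclass and derives the event duration from it. It also sketches the conditional probability test on top of those records. `EventRecord` and `conditional_probability` are hypothetical names chosen for this example. The sketch assumes the probability is taken over occurrences of Event A, that Event B must belong to the same subject, and that B must start between 0 and X days (inclusive) after A's start. The issue text does not pin these choices down.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EventRecord:
    subject_id: str   # SubjectID
    event: str        # Event type identifier
    start_date: date  # Start date
    end_date: date    # End date

    @property
    def duration_days(self) -> int:
        """Event Duration: days from start date to end date."""
        return (self.end_date - self.start_date).days


def conditional_probability(records, event_a, event_b, window_days):
    """Share of event_a occurrences that are followed, for the same
    subject, by an event_b starting 0..window_days days after
    event_a's start (assumed interpretation)."""
    a_total = 0
    a_followed = 0
    for a in records:
        if a.event != event_a:
            continue
        a_total += 1
        followed = any(
            b is not a
            and b.subject_id == a.subject_id
            and b.event == event_b
            and 0 <= (b.start_date - a.start_date).days <= window_days
            for b in records
        )
        if followed:
            a_followed += 1
    return a_followed / a_total if a_total else float("nan")


# Toy data, purely illustrative.
records = [
    EventRecord("s1", "A", date(2020, 1, 1), date(2020, 1, 5)),
    EventRecord("s1", "B", date(2020, 1, 10), date(2020, 1, 12)),
    EventRecord("s2", "A", date(2020, 1, 1), date(2020, 1, 2)),
    EventRecord("s2", "B", date(2020, 3, 1), date(2020, 3, 4)),
]

print(records[0].duration_days)
print(conditional_probability(records, "A", "B", window_days=30))
```

The same function would be evaluated on the original and the simulated dataset for each event pair of interest, and the resulting probabilities compared.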