Add metrics to evaluate fidelity of longitudinal datasets #198

Open
ashafquat-mdsol opened this issue Aug 25, 2022 · 3 comments
Labels: data:sequential, feature request, under discussion

Comments

ashafquat-mdsol commented Aug 25, 2022

Suggested tests

  • Conditional probability distribution in simulated vs. original - the conditional probability of Event A|B is calculated as the probability of seeing Event B within X days of Event A’s start.

    • Differences between the original and simulated probability distributions can be computed.
    • Conditional probabilities that appear in simulated but not in original can be flagged as artifacts, and conditional probabilities that exist in original but not in simulated can be flagged as missing.
  • Bag of words - Event frequencies can be used to define a per-person feature matrix, and centroids can be created for the original dataset. The number of people assigned to each centroid in the original vs. simulated dataset can then be compared using a distance metric.

  • Event durations -

    • A t-test/KS test can be used to compare the distribution of event durations in original vs. simulated. Where the differences are significant, these can be flagged.
    • All missing event durations can be flagged.
    • Mean, median, percentiles, standard deviation, min, and max of event durations per event type are calculated and plotted on a line plot. The MSE/R2 per plot quantifies the alignment between original and simulated.
  • Time to event analysis - This test requires an event to be marked as a reference event (e.g. the first event that is recorded for a subject). The reference event must occur in each subject’s timeline.

    • Time to event for event X is calculated as time between reference event occurring and event X occurring.
    • Mean, median, percentiles, standard deviation, min, and max of time to event per event type are calculated and plotted on a line plot. The MSE/R2 per plot quantifies the alignment between original and simulated.
    • t-test/KS test can be used to compare the distribution of time to event in original vs. simulated per event type. Where the differences are significant these can be flagged.
    • Survival probability/log-rank test per event type can be used to identify differences in original vs. simulated. (For the model, Event = 1 if Event X (e.g. Death) occurs in the subject's timeline, 0 otherwise. Time to Event = time between the reference event and Event X occurring if Event = 1; otherwise, time between the reference event and the last event observed.)
  • Event sequence length distribution - where event sequence length is the number of events recorded for each person. Distance in the distribution of event sequence length between original and simulated can then be calculated.

  • N-grams frequency - An event sequence can be generated per person by listing the events experienced by that person/unit, ordered by event start date. N-grams can then be extracted from each person's sequence of events. Fidelity is quantified using MSE/R2 comparing N-gram frequencies in the simulated vs. original datasets for varying values of N.
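To make the N-gram idea concrete, here is a minimal sketch in plain Python. The function names (`ngram_frequencies`, `ngram_mse`) are hypothetical and not part of any existing library, and it assumes each person's events are already sorted by start date. It covers only the MSE variant, scoring N-grams that are absent from one dataset as frequency 0.

```python
from collections import Counter


def ngram_frequencies(sequences, n):
    """Relative frequency of every n-gram across all per-person event sequences."""
    counts = Counter()
    for events in sequences:
        # Slide a window of size n over one person's ordered events.
        for i in range(len(events) - n + 1):
            counts[tuple(events[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def ngram_mse(original, simulated, n):
    """Mean squared difference in n-gram frequency (assumed scoring rule).

    N-grams seen in only one dataset contribute with frequency 0 on the other side.
    """
    freq_original = ngram_frequencies(original, n)
    freq_simulated = ngram_frequencies(simulated, n)
    grams = set(freq_original) | set(freq_simulated)
    if not grams:
        return 0.0
    squared_errors = [
        (freq_original.get(g, 0.0) - freq_simulated.get(g, 0.0)) ** 2 for g in grams
    ]
    return sum(squared_errors) / len(grams)


# Toy data: one list of event names per person.
original = [["A", "B", "C"], ["A", "B"]]
simulated = [["A", "B"], ["A", "B"]]
for n in (1, 2):
    print(n, ngram_mse(original, simulated, n))
```

The same two frequency dictionaries could also be used to list N-grams that exist only in simulated (artifacts) or only in original (missing).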

Definitions
SubjectID = identifier for a subject
Event = identifier for the type of event
Start date = date the event starts
End date = date the event ends
Event duration = number of days between start date and end date
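As a rough illustration of the event duration comparison using the fields defined above, the sketch below assumes records are tuples of (SubjectID, Event, Start date, End date). All function names are hypothetical. It computes only the two-sample KS statistic in plain Python; for a p-value to decide significance, something like `scipy.stats.ks_2samp` could be used instead.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


def durations_by_event(records):
    """Map each event type to its list of durations in days."""
    durations = defaultdict(list)
    for _subject_id, event, start, end in records:
        durations[event].append((end - start).days)
    return dict(durations)


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in set(xs) | set(ys)
    )


def compare_durations(original, simulated):
    """Return per-event KS statistics plus event types present on one side only."""
    d_orig = durations_by_event(original)
    d_sim = durations_by_event(simulated)
    missing = sorted(set(d_orig) - set(d_sim))    # in original, not in simulated
    artifacts = sorted(set(d_sim) - set(d_orig))  # in simulated, not in original
    ks = {e: ks_statistic(d_orig[e], d_sim[e]) for e in set(d_orig) & set(d_sim)}
    return ks, missing, artifacts


# Made-up toy records, for illustration only.
original = [
    ("s1", "fever", date(2022, 1, 1), date(2022, 1, 3)),
    ("s2", "fever", date(2022, 1, 1), date(2022, 1, 5)),
    ("s1", "rash", date(2022, 1, 2), date(2022, 1, 3)),
]
simulated = [
    ("t1", "fever", date(2022, 2, 1), date(2022, 2, 3)),
    ("t2", "fever", date(2022, 2, 1), date(2022, 2, 3)),
]
ks, missing, artifacts = compare_durations(original, simulated)
print(ks, missing, artifacts)
```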

ashafquat-mdsol added the feature request and new labels Aug 25, 2022
npatki (Contributor) commented Aug 31, 2022

Hi @ashafquat-mdsol, thanks for filing this issue. I suggest we align the terminology to what the PAR model uses:

  • Entity columns: Identify which row belongs to which sequence
  • Sequence index: (Optional) Identifies the order of the sequence, e.g. a datetime column

I see many similarities between the metrics you describe and the MSAS algorithm, described in the most recent PAR model paper (http://arxiv.org/abs/2207.14406). At a high level, the algorithm works in the following way:

  1. Compute a metric for every sequence in the real data to get a distribution X
  2. Compute the same metric for every sequence in the synthetic data to get a distribution X’
  3. Return the KSComplement score, which quantifies the similarity between distributions X and X’
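The three steps above can be sketched as follows. This is not the SDMetrics implementation: the function names are hypothetical, rows are assumed to be dictionaries with an entity column, and the score is taken to be 1 minus the two-sample KS statistic. The per-sequence statistic defaults to sequence length, which matches the sequence length distribution metric discussed in this thread.

```python
from bisect import bisect_right
from collections import defaultdict


def group_sequences(rows, entity_column):
    """Split rows into sequences, one per distinct entity value."""
    sequences = defaultdict(list)
    for row in rows:
        sequences[row[entity_column]].append(row)
    return list(sequences.values())


def ks_complement(x, y):
    """1 - two-sample KS statistic, so 1.0 means identical empirical distributions."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    statistic = max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in set(xs) | set(ys)
    )
    return 1.0 - statistic


def sequence_score(real_rows, synthetic_rows, entity_column, statistic=len):
    """Steps 1-3: per-sequence statistic on real and synthetic data, then compare."""
    real = [statistic(s) for s in group_sequences(real_rows, entity_column)]
    synthetic = [statistic(s) for s in group_sequences(synthetic_rows, entity_column)]
    return ks_complement(real, synthetic)


# Made-up toy rows; "subject" plays the role of the entity column.
real_rows = [
    {"subject": "a", "event": "X"}, {"subject": "a", "event": "Y"},
    {"subject": "b", "event": "X"}, {"subject": "b", "event": "Y"},
    {"subject": "b", "event": "Z"},
]
synthetic_rows = [
    {"subject": "p", "event": "X"}, {"subject": "p", "event": "Y"},
    {"subject": "q", "event": "X"}, {"subject": "q", "event": "Z"},
]
print(sequence_score(real_rows, synthetic_rows, "subject"))
```

Passing a different `statistic` callable (for example, one that returns the mean of a numeric column within a sequence) would reuse the same comparison for other per-sequence metrics.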

Ideally, we should create 1 issue per metric, as we always aim to have each pull request close a specific issue. I suggest that we can start with the sequence length distribution metric, as it seems the simplest of the ones you have listed so far. We can file a new issue for it, and I can provide some feedback about the API (metric, parameter name, etc.) before implementation. Does that sound good?

Based on how that goes, we can repeat the process for the other metrics.

npatki added the under discussion and data:sequential labels and removed the new label Aug 31, 2022
npatki mentioned this issue Aug 31, 2022
ashafquat-mdsol (Author) commented

@npatki that sounds perfect. We wanted to reach alignment on which metrics to implement, so we can use this issue to discuss the full set. I will create a separate ticket for the sequence length distribution metric and we can go from there.

ashafquat-mdsol (Author) commented

Just created this issue: #203 for the sequence length distribution.
