Add internal links
woodthom2 committed Jun 21, 2024
1 parent 364e525 commit f09f849
Showing 7 changed files with 18 additions and 18 deletions.
Original file line number Diff line number Diff line change
@@ -15,13 +15,13 @@ One increasingly popular option is retrospective harmonization. This involves ta

However, not all constructs can be measured with such simple, categorical questions. Take the above outcome variable (depression) for instance. Depression is a complex, heterogeneous experience, characterized by a multitude of symptoms that can be experienced to various degrees and in different combinations. In large-scale surveys, depression is typically measured with standardized questionnaires – participants are asked to report on a range of symptoms, their responses are assigned numerical values, and these are summed to form a “total depression score” for each individual. Although this remains the most viable and plausible strategy for measuring something as complex as depression, there is no “gold standard” questionnaire that is universally adopted by researchers. Instead, there are well over 200 established depression scales. In a [recent review](https://www.closer.ac.uk/wp-content/uploads/210715-Harmonisation-measurement-properties-mental-health-measures-british-cohorts.pdf) (McElroy et al., 2020), we noted that the content of these questionnaires can differ markedly, e.g. different symptoms are assessed, or different response options are used.

How can researchers harmonize such complex measures? One option would be to standardize scores within each data set, thus transforming everyone’s raw score to a rank ordering within their given sample. Although straightforward, this approach has a number of weaknesses. First and foremost, you are assuming that both questionnaires are measuring the same underlying construct, and are measuring it equally well. Second, by standardizing a measure within a cohort, you are removing all information about the mean and standard deviation, making it impossible to compare the average level of a construct across datasets.
How can researchers harmonize such complex measures? One option would be to standardize scores within each data set, thus transforming everyone’s raw score to a rank ordering within their given sample. Although straightforward, this approach has a number of weaknesses. First and foremost, you are assuming that both questionnaires are measuring the same underlying construct, and are measuring it equally well. Second, by standardizing a measure within a [cohort](/item-harmonisation/harmony-a-free-ai-tool-to-merge-cohort-studies), you are removing all information about the mean and standard deviation, making it impossible to compare the average level of a construct across datasets.

An alternative approach is to apply retrospective harmonization at the item-level. Although questionnaires can differ considerably on the number and nature of questions asked, there is often considerable overlap at the [semantic](https://harmonydata.ac.uk/semantic-text-matching-with-deep-learning-transformer-models)/content level. Let’s return to our earlier example of depression. Although there are many different questionnaires that can be used to assess this experience, they often ask the same types of questions. Below is an example of content overlap in two of the most common measures of psychological [distress](https://harmonydata.ac.uk/how-far-can-we-go-with-harmony-testing-on-kufungisisa-a-cultural-concept-of-distress-from-zimbabwe) used in children, the Revised Children’s Anxiety and Depression Scale (RCADS), and the Mood and Feelings Questionnaire (MFQ).

{{< image src="images/blog/blog-pic-1.png" alt="img" >}}

By identifying, recoding, and testing the equivalence of subsets of items from different questionnaires (for guidelines see our previous report), researchers can derive harmonized sub-scales that are directly comparable across studies. Our group has previously used this approach to study trends in mental health across different generations (Gondek et al., 2021), and examine how socio-economic deprivation impacted adolescent mental health across different cohorts (McElroy et al., 2022).
By identifying, recoding, and testing the equivalence of subsets of items from different questionnaires (for guidelines see our previous report), researchers can derive harmonized sub-scales that are directly comparable across studies. Our group has previously used this approach to study trends in mental health across different generations (Gondek et al., 2021), and examine how socio-economic deprivation impacted adolescent mental health across different [cohorts](/item-harmonisation/harmony-a-free-ai-tool-for-cross-cohort-research) (McElroy et al., 2022).

One of the main challenges to retrospectively harmonizing questionnaire data is identifying the specific items that are comparable across the measures. In the above example, we used expert opinion to match candidate items based on their content, and used psychometric tests to determine how plausible it was to assume that matched items were directly comparable. Although our results were promising, this process was time-consuming, and the reliance on expert opinion introduces an element of human [bias](https://fastdatascience.com/how-can-we-eliminate-bias-from-ai-algorithms-the-pen-testing-manifesto) – i.e. different experts may disagree on which items match. As such, we are currently working on a [project](https://fastdatascience.com/starting-a-data-science-project) supported by Wellcome, in which we aim to develop an online tool, ‘Hamony’, that uses machine learning to help researchers match items from different questionnaires based on their underlying meaning. Our overall aim is to streamline and add consistency and replicability to the harmonization process. We plan to test the utility of this tool by using it to harmonize measures of mental health and social connectedness across two cohort of young people from the UK and and Brazil.

6 changes: 3 additions & 3 deletions content/en/blog/clinical-trial-research-data-harmonisation.md
@@ -41,7 +41,7 @@ Clinical data harmonization plays a pivotal role in advancing clinical research
- Standardizing data elements and definitions across trials is essential for ensuring consistency in measurements and assessments. Consistent data facilitates accurate comparisons between studies, allowing researchers to draw meaningful conclusions and make informed decisions. Without harmonization, variations in data definitions and measurement units could lead to misinterpretations and compromises in the reliability of research outcomes.

3. **Improving Data Quality:**
- Harmonization acts as a quality assurance mechanism by identifying and rectifying discrepancies in data. By establishing standardized data elements and validation processes, harmonization contributes to enhanced data quality. High-quality data is imperative for the credibility of clinical research, ensuring that the results accurately reflect the true impact of medical interventions and interventions.
- Harmonization acts as a quality assurance mechanism by identifying and rectifying discrepancies in data. By establishing standardized data elements and [validation](/harmonisation-validation/clinical-global-impression-of-change-cgic) processes, harmonization contributes to enhanced data quality. High-quality data is imperative for the credibility of clinical research, ensuring that the results accurately reflect the true impact of medical interventions and interventions.

4. **Enhancing Research Reproducibility:**
- Reproducibility is a cornerstone of scientific inquiry, and harmonized data sets the stage for more reproducible research. A clear and standardized framework for data analysis, achieved through harmonization, enables researchers to replicate studies with confidence. Reproducible research is critical for validating study findings, building a robust scientific knowledge base, and instilling confidence in the broader scientific community.
@@ -87,7 +87,7 @@ The Clinical Data Harmonization Playbook, developed by the Center for Data to He

6. **Sustainability and Scalability:**
- **Adaptability of Frameworks:** Designing scalable and adaptable frameworks allows for the incorporation of evolving research needs, technologies, and standards. This ensures that harmonization efforts remain relevant over time.
- **Long-Term Funding and Incentives:** Establishing sustainable funding mechanisms and incentives is crucial for the ongoing maintenance and enhancement of harmonized datasets and infrastructure. This long-term support ensures the longevity and impact of harmonization initiatives.
- **Long-Term Funding and Incentives:** Establishing [sustainable](/making-harmony-sustainable-long-term) funding mechanisms and incentives is crucial for the ongoing maintenance and enhancement of harmonized datasets and infrastructure. This long-term support ensures the longevity and impact of harmonization initiatives.

By adhering to these principles, the Clinical Data Harmonization Playbook provides a solid foundation for effective, ethical, and impactful clinical data harmonization in the realm of clinical research.

@@ -110,7 +110,7 @@ While the benefits of data harmonization are clear, the process is not without i
- Establishing standardized protocols for data collection, measurement, and reporting is a key aspect of harmonization. However, reaching a consensus on these standards can be challenging, especially when dealing with diverse medical specialties and research domains.

5. **Temporal and Longitudinal Variability:**
- Clinical data often span different time periods and may involve longitudinal studies. Managing temporal variability and ensuring the consistency of data over time present challenges in harmonization, as data collection methods and technologies may evolve.
- Clinical data often span different time periods and may involve [longitudinal studies](/item-harmonisation/harmony-a-free-ai-tool-for-longitudinal-study). Managing temporal variability and ensuring the consistency of data over time present challenges in harmonization, as data collection methods and technologies may evolve.

6. **Data Quality and Missing Values:**
- Variability in data quality across different sources poses a significant challenge. Addressing missing or incomplete data requires careful consideration and may involve the development of imputation strategies to enhance the reliability of harmonized datasets.
2 changes: 1 addition & 1 deletion content/en/blog/data-harmonisation-healthcare.md
@@ -39,7 +39,7 @@ Harmonisation methods in data science and healthcare research aim to standardize
### Ensuring Reliable Data for Algorithms

#### Quality Over Quantity
In the realm of data harmonisation, the emphasis on quality over quantity cannot be overstated. Accurate and reliable data is paramount for developing algorithms that are truly effective and can lead to meaningful insights and outcomes. The focus should be on ensuring that each data point collected and integrated into larger datasets meets a high standard of quality, as this will significantly impact the performance of machine learning models and AI algorithms. Poor-quality data can lead to inaccurate predictions, biased outcomes, and ultimately, decisions that may not be in the best interest of patients or research objectives.
In the realm of data harmonisation, the emphasis on quality over quantity cannot be overstated. Accurate and reliable data is paramount for developing algorithms that are truly effective and can lead to meaningful insights and outcomes. The focus should be on ensuring that each data point collected and integrated into larger datasets meets a high standard of quality, as this will significantly impact the [performance](/measuring-the-performance-of-nlp-algorithms) of machine learning models and AI algorithms. Poor-quality data can lead to inaccurate predictions, biased outcomes, and ultimately, decisions that may not be in the best interest of patients or research objectives.


#### Role of Machine Learning and AI
6 changes: 3 additions & 3 deletions content/en/blog/how-does-harmony-work.md
@@ -19,9 +19,9 @@ Harmony uses techniques from the field of [natural language processing](https://

There are a number of approaches to quantify the [similarity](https://fastdatascience.com/finding-similar-documents-nlp) between strings of text. The simplest approach is known as the Bag-of-Words approach. This is *not* how Harmony currently works, but it is one of the first things we tried!

If we want to compare the GAD-7 question 4 (*Trouble relaxing*) to the Beck’s Anxiety Inventory question 4 (*Unable to relax*), we would break down each text into the words present. We usually remove suffixes like *ing* at this stage (this is called lemmatisation).
If we want to compare the [GAD-7](/ces-d-vs-gad-7) question 4 (*Trouble relaxing*) to the [Beck](/harmonisation-validation/beck-depression-inventory-ii-bdi-ii)’s Anxiety Inventory question 4 (*Unable to relax*), we would break down each text into the words present. We usually remove suffixes like *ing* at this stage (this is called lemmatisation).

| | GAD-7 Q4 | Beck Q4 |
| | [GAD-7](/gad-7-vs-ghq-12) Q4 | Beck Q4 |
| ---------- | -------- | ------- |
| trouble | 1 | 0 |
| relax(ing) | 1 | 1 |
@@ -44,7 +44,7 @@ The obvious drawbacks of the Jaccard method are that
- It ignores syntax (the order of the words in the texts).
- It cannot cope with synonyms.
- It won’t notice negation (*I was not happy* and *I was very happy* both equally match *you were happy*).
- Most crucially, our remit for the Harmony [project](https://fastdatascience.com/starting-a-data-science-project) is that we want to harmonise data from different languages, such as Portuguese and English. Clearly the bag-of-words approach would not work when the texts are in different languages, unless you translated them first.
- Most crucially, our remit for the Harmony [project](https://fastdatascience.com/starting-a-data-science-project) is that we want to harmonise data from different [languages](/harmony-supports-over-8-languages), such as Portuguese and English. Clearly the bag-of-words approach would not work when the texts are in different languages, unless you translated them first.

{{< image src="images/blog/Jaccard-checklist.drawio-min-768x634.png" alt="Jaccard checklist" >}}
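
For illustration only (this sketch is not part of the commit, and the post itself says this is *not* how Harmony currently works): the bag-of-words comparison described above can be reduced to a Jaccard score, i.e. the number of shared words divided by the number of distinct words across both texts. The function names `tokenise` and `jaccard` are hypothetical, and stripping a trailing *ing* is only a crude stand-in for the lemmatisation step mentioned in the post.

```python
import re


def tokenise(text: str) -> set[str]:
    """Lower-case the text, split it into words, and strip a trailing 'ing'.

    The suffix stripping is a crude stand-in for real lemmatisation.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return {w[:-3] if w.endswith("ing") and len(w) > 5 else w for w in words}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two texts."""
    set_a, set_b = tokenise(a), tokenise(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty texts: define similarity as 0
    return len(set_a & set_b) / len(union)


# GAD-7 Q4 vs Beck Anxiety Inventory Q4: one shared word ("relax")
# out of four distinct words (trouble, relax, unable, to).
print(jaccard("Trouble relaxing", "Unable to relax"))  # 0.25

# The negation drawback: both sentences score the same against the target.
print(jaccard("I was not happy", "you were happy")
      == jaccard("I was very happy", "you were happy"))  # True
```

The second call reproduces the negation drawback listed above: because only word overlap is counted, *not* and *very* are interchangeable as far as the score is concerned.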

Original file line number Diff line number Diff line change
@@ -45,9 +45,9 @@ Shona (_chiShona_) is spoken in Zimbabwe and belongs to the Bantu language famil

In Shona, derived verbs can be created from simple verbs by attaching suffixes to the verb stem.

I tried using Harmony to see how it would harmonise “kufungisisa” (thinking too much) to a Western instrument such as GHQ-12.
I tried using Harmony to see how it would harmonise “kufungisisa” (thinking too much) to a Western instrument such as [GHQ-12](/gad-7-vs-ghq-12).

Although English is the best-resource language for [natural language processing](https://naturallanguageprocessing.com/), [multilingual NLP techniques](https://fastdatascience.com/multilingual-natural-language-processing/) are catching up even for lower-resourced languages. There exist some [NLP](https://fastdatascience.com/portfolio/nlp-consultant/) [models](https://harmonydata.ac.uk/semantic-text-matching-with-deep-learning-transformer-models) for Shona. I used the sentence [transformer](https://harmonydata.ac.uk/how-does-harmony-work) model `Davlan/xlm-roberta-base-finetuned-shona` which is a modification of ROBERTA trained on Shona texts[7]. I plugged one into Harmony and tried to match the [Shona symptom questionnaire for the detection of depression and anxiety](https://depts.washington.edu/edgh/zw/hit/web/project-resources/shona_symptom_questionnaire.pdf), which is used in Zimbabwe[6].
Although English is the best-resource language for [natural language processing](https://naturallanguageprocessing.com/), [multilingual NLP techniques](https://fastdatascience.com/multilingual-natural-language-processing/) are catching up even for lower-resourced [languages](/harmony-supports-over-8-languages). There exist some [NLP](https://fastdatascience.com/portfolio/nlp-consultant/) [models](https://harmonydata.ac.uk/semantic-text-matching-with-deep-learning-transformer-models) for Shona. I used the sentence [transformer](https://harmonydata.ac.uk/how-does-harmony-work) model `Davlan/xlm-roberta-base-finetuned-shona` which is a modification of ROBERTA trained on Shona texts[7]. I plugged one into Harmony and tried to match the [Shona symptom questionnaire for the detection of depression and anxiety](https://depts.washington.edu/edgh/zw/hit/web/project-resources/shona_symptom_questionnaire.pdf), which is used in Zimbabwe[6].

{{< image src="images/blog/Screenshot-from-2023-07-13-12-34-30.png" alt="img" >}}
