RE datasets #162
Hello,
I'm using the GAD and EUADR datasets for relation extraction, and I'm noticing contradictory annotations in both sets.
Here is an example extracted from the EUADR test set:
How are these sentences actually annotated?
Thank you for your help,
Luana
Comments
Same issue here, have you got an answer regarding this? |
Not yet! I decided to do some cleaning up manually/using grep |
Hi all, it seems like these examples can be categorized as annotation errors (and also follow from the inherent nature of the dataset; please see the second section of this reply, or this reply). The given examples are from PMID 18347176: https://pubmed.ncbi.nlm.nih.gov/18347176/
The example sentence is from the end of the abstract. Taking a look at the original EUADR corpus from https://biosemantics.erasmusmc.nl/index.php/resources/euadr-corpus , we can find an annotation file "18347176.txt" in it.
From the relevant lines of that file, you can see that the original annotation says T393C (id 27, from the 9th column) and oropharyngeal (28) have a relation (TRUE), but that T393C (27) and hypopharyngeal cancer (29) do not, which seems to be an error to me as well.

(Added after the discussion; thanks for the constructive discussion, James Morrill, luana-be, and Amir Kadivar!) The GAD and EUADR datasets can be classified as weakly labeled (distant supervision) datasets, which are notably noisy. As mentioned in this reply, since we now have multiple high-quality BioRE datasets, I personally suggest that we refrain from using weakly labeled datasets and move to others such as ChemProt, DrugProt, or other human-labeled datasets for evaluating BioLMs. As a BioNLP researcher, I think RE datasets are difficult to build and sometimes contain erroneous examples, as they require extensive manual work by healthcare professionals. Thank you!

Here is a Python script I used to check the dataset:
inptext = open("18347176.txt").read()  # read the EUADR annotation file for PMID 18347176
inpTok = [ele.split("\t") for ele in inptext.splitlines()]  # split each tab-separated line into fields
print(len(inpTok)) # should be about 52
entities = {int(ele[8]): "entity : '%s', " % ele[3] + ele[7] for ele in inpTok if ele[2] == "concept"}  # map entity id (9th column) to its text and attributes
print(len(entities)) # should be 36
entities[28] # > "entity : 'oropharyngeal', ['sda/1', 'sda/10', 'sda/15']" |
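A possible follow-up check, as a sketch only: it assumes the relation annotations are the non-"concept" rows of the same file and that they reference the entity ids as standalone fields, which should be verified against the actual EUADR file layout.
relations = [ele for ele in inpTok if len(ele) > 2 and ele[2] != "concept"]  # assumed: non-"concept" rows hold the relation annotations
for row in relations:
    if "27" in row:  # rows that reference the T393C entity (id 27); row is a list of fields
        print("\t".join(row))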
The dataset does not seem to have just a few poorly labelled samples; it seems to be extremely poor overall. Could something else have gone wrong somewhere? For example, the model fails on even the most obvious changes to the input sentence.
In all the cases I've tried, the model predicts almost zero difference between a positive association and a negated version. This suggests to me it's not an artifact of just some poor labels. I think it's also worth noting that the poor labels, which occur continually, are not challenging to annotate (by and large). I don't quite buy that it's because RE is difficult to do. |
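A minimal sketch of this kind of negation probe. The checkpoint path is hypothetical, and the @GENE$/@DISEASE$ masking follows the style of the preprocessed RE data; both should be adjusted to the actual setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "./biobert-gad-finetuned"  # hypothetical path to a model fine-tuned on the GAD RE data
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

positive = "The @GENE$ T393C polymorphism was associated with @DISEASE$."
negated = "The @GENE$ T393C polymorphism was not associated with @DISEASE$."

for sent in (positive, negated):
    logits = model(**tok(sent, return_tensors="pt")).logits
    probs = torch.softmax(logits, dim=-1)[0].tolist()
    print(probs, sent)  # a model that has learned the relation should separate these two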
@luana-be if you did end up doing some manual processing, is there any chance you would share said labels? |
@jambo6 my manual processing was not enough to get good results. I totally gave up on these RE datasets! Sorry :-( |
Ah, no worries. Thanks for the reply. Are you familiar with BioNLP by any chance? I know there exist good gene-gene relationship databases, and I was wondering: is there anything that would cause a model trained on gene-gene relationships to perform poorly on gene-disease relations? The language feels relatively similar to me. Obviously it's not ideal, and I'm sure you'd miss certain things, but I wouldn't have thought it would be too bad... |
That's a good idea @jambo6 ! Thanks for the insight :-) |
@wonjininfo thank you for looking into this. I also ran into the exact same issue described here and documented my findings in #153. I agree with @jambo6 that this is a serious issue, beyond the usual "labeled data is hard". I'd even add that the scope goes beyond BioBert and this repository per se. At this point, this GAD RE dataset has become a de facto benchmark used by a lot of other folks, and BioBert is now part of the genealogy of the dataset (e.g. see BLURB). I encourage you to look at my findings in #153; a summary follows.

I set up a tiny experiment: I picked 20 random examples from the official BioBert GAD RE dataset and verified their prescribed labels. I found that the true/false labels were basically no better than a coin toss.

I then tried to trace the genealogy of the GAD RE dataset, which roughly goes something like this: Becker et al. (2004), Bravo et al. (2015), and Lee, Yoon, et al. (2019) (i.e. BioBert). My conclusion so far is that the main problem with the dataset we have now is the very definition of the true/false labels. Becker et al. (2004)'s dataset was a good old, manually curated RE dataset with 5,000 data points, each being a tuple. It does sound crazy, but it's my best theory of what's gone wrong and how badly. |
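For reference, a spot check like the one described can be done in a few lines. This is a sketch only: the file path and the column layout of the BioBERT GAD data are assumptions and should be adjusted to the actual files.
import random

with open("GAD/1/train.tsv") as f:  # assumed path; adjust to wherever the GAD RE data lives
    rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]

for row in random.sample(rows, 20):
    print(row[-1], row[0][:120])  # assumed layout: sentence in the first field, label in the last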
Yeah, you are right on this. I just read Gu et al., "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing", which describes the dataset as follows: "The Genetic Association Database corpus was created semi-automatically using the Genetic Association Archive. Specifically, the archive contains a list of gene-disease associations, with the corresponding sentences in the PubMed abstracts reporting the association studies. Bravo et al. used a biomedical NER tool to identify gene and disease mentions, and create the positive examples from the annotated sentences in the archive, and negative examples from gene-disease co-occurrences that were not annotated in the archive." Interestingly, this odd labelling method did not appear to be a cause for concern. |
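To make the consequence of that procedure concrete, here is a toy sketch of the labeling logic as described above. It is illustrative only; the function and data structures are invented for this example and are not taken from the Bravo et al. pipeline.
def weak_label(sentence_mentions, archive_pairs):
    """sentence_mentions: (sentence, gene, disease) triples found by an NER tool.
    archive_pairs: set of (gene, disease) pairs listed in the Genetic Association Archive."""
    labeled = []
    for sentence, gene, disease in sentence_mentions:
        if (gene, disease) in archive_pairs:
            labeled.append((sentence, 1))  # positive: the pair is listed in the archive
        else:
            # labeled negative only because the pair is absent from the archive,
            # even if the sentence itself asserts a real association (a silent false negative)
            labeled.append((sentence, 0))
    return labeled
The else branch is exactly where the noisy negatives discussed in this thread come from.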
Thanks for the exploration, everyone! I came across the same issue and was glad that I'm not the only one. By the way, does anyone know whether there exists any high-quality dataset for RE tasks? Thanks in advance! |
I used this one recently and it worked quite well: https://github.com/sujunhao/RENET2 I stuck a fine-tuned model on huggingface if you are interested in trying it out: https://huggingface.co/jambo/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext-finetuned-renet |
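In case it helps anyone trying that checkpoint, a minimal loading sketch, assuming it is a standard sequence-classification model on the Hub; the example sentence and the label names it returns are illustrative and not guaranteed to match the RENET2 setup.
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="jambo/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext-finetuned-renet",
)
print(clf("Mutations in BRCA1 are associated with hereditary breast cancer."))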
Appreciate your links so much, @jambo6! But the dataset link in that repo (http://www.bio8.cs.hku.hk/RENET2/renet2_data_models.tar.gz) doesn't seem to work :( I also like your idea of utilizing gene-gene relationship databases. In fact, I did a little investigation and found that DisGeNET may serve as a good resource. Taking Alzheimer's disease as an example, this page https://www.disgenet.org/browser/0/1/1/C0002395/ lists all the association types and the evidence (free text) related to it. I was just wondering whether we could construct a corpus based on that and set up a multi-class classification NLP task. |
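Sketching that idea in a few lines: the export file name and column names below are purely hypothetical, since the actual DisGeNET export or API schema would need to be checked against their documentation.
import csv
from collections import Counter

with open("disgenet_evidence_C0002395.tsv") as f:  # hypothetical evidence export for Alzheimer's disease
    rows = list(csv.DictReader(f, delimiter="\t"))

labels = sorted({r["association_type"] for r in rows})          # assumed column name
label2id = {lab: i for i, lab in enumerate(labels)}
examples = [(r["sentence"], label2id[r["association_type"]]) for r in rows]  # assumed "sentence" column
print(Counter(r["association_type"] for r in rows))  # class balance before training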
It works fine for me; have you tried it in another browser? DisGeNET is good; however, their association types and evidence are only based on NLP models, not expert curation, so anything trained on their labels is only likely to be as good as models trained on the original datasets they used. In fact, I think the RE dataset they used is precisely this problematic GAD one (see the paper, where they explicitly mention the GAD dataset). |
@jambo6 Thanks, I tried IE and it works! That's insane... I can't imagine that this problematic dataset has been utilized by such a highly cited work... A few days ago a friend of mine recommended another source of biomedical NLP datasets: https://www.i2b2.org/NLP/DataSets/Main.php. You can also have a look. |
Hello all, I carefully read this issue thread and also drew on my experience with the aforementioned RE challenge; I think I overlooked the situation in my previous comments. As @amirkdv suggested, I agree that the GAD dataset clearly seems to be weakly labeled (distant supervision).

In my defence, what I remember about the time we selected the RE dataset is that we did not have an abundant choice of BioRE datasets when we wrote the paper. We selected the dataset by popularity, i.e. the number of citing papers, since we thought that being highly cited would reflect its "reputation" and quality. (This approach was too naive, and I feel responsible for the studies that followed.) Back then, high-quality BioRE datasets were rare (at least to the best of our knowledge), and GAD seems to have been one of the most widely used RE datasets.

To conclude, I agree that the GAD and EUADR datasets are weakly supervised (distant supervision) datasets. Since we now have multiple high-quality BioRE datasets, I personally suggest that we refrain from using weakly labeled datasets and move to others such as ChemProt, DrugProt, or other human-labeled datasets for evaluating BioLMs. Thank you all very much for your constructive comments; I deeply appreciate them!
ps2) @amirkdv I think your experiment on 20 random examples is very interesting.
|