Incomplete gender perturbations in the subject dataset #84

Open
ege-erdogan opened this issue Sep 23, 2024 · 1 comment

Comments

@ege-erdogan

ege-erdogan commented Sep 23, 2024

Hi, and thanks for the nice work. We've found some samples in the subject dataset (downloaded from OSF) where the subject's gender cannot be clearly distinguished between the male and female versions of the same sentence, and where the ground truth labels do not cover all the words corresponding to the subject. Two examples (the words 'Matilda' and 'herself' belong to or refer to the subject but stay the same in both versions):

(MALE) Because his father works with horses , **Matilda** demands the definition of a horse .
(FEMALE) Because her father works with horses , **Matilda** demands the definition of a horse .

and

(MALE) Zain seeks escape in an ultimate manner by committing suicide , drowning **herself** in the waters of the Gulf of Mexico.
(FEMALE) Chloe seeks escape in an ultimate manner by committing suicide , drowning **herself** in the waters of the Gulf of Mexico.

This appears to be a human labeling error, judging by A.1.1 in the paper, but we wanted to notify you and ask whether you were aware of it or have already updated the dataset to fix it.

Edit: to clarify, in the second example 'herself' is not part of the grammatical subject but refers to it, so it should be modified accordingly to stay consistent, while in the first example 'Matilda' is the subject itself.
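
For anyone who wants to look for similar pairs, here is a minimal sketch of one way to flag them. It is not the authors' pipeline. It assumes the male and female versions align token by token after whitespace splitting, and the word list `GENDERED_PRONOUNS`, the function name `unchanged_gendered_tokens`, and the optional `names` argument are all hypothetical choices made for illustration.

```python
# Hypothetical pronoun list; the dataset's real perturbation vocabulary is not known here.
GENDERED_PRONOUNS = {
    "he", "him", "his", "himself",
    "she", "her", "hers", "herself",
}


def unchanged_gendered_tokens(male_sentence, female_sentence, names=None):
    """Return (position, token) pairs for gendered tokens that are identical
    in the male and female versions of a sentence.

    `names` is an optional set of lowercase first names to treat as gendered
    (an assumption; no name list ships with this sketch).
    """
    gendered = GENDERED_PRONOUNS | set(names or ())
    male_tokens = male_sentence.lower().split()
    female_tokens = female_sentence.lower().split()
    if len(male_tokens) != len(female_tokens):
        raise ValueError("expected the two versions to align token by token")
    return [
        (i, m)
        for i, (m, f) in enumerate(zip(male_tokens, female_tokens))
        if m == f and m in gendered
    ]


# The two pairs quoted above.
male_1 = "Because his father works with horses , Matilda demands the definition of a horse ."
female_1 = "Because her father works with horses , Matilda demands the definition of a horse ."
male_2 = ("Zain seeks escape in an ultimate manner by committing suicide , "
          "drowning herself in the waters of the Gulf of Mexico.")
female_2 = ("Chloe seeks escape in an ultimate manner by committing suicide , "
            "drowning herself in the waters of the Gulf of Mexico.")

print(unchanged_gendered_tokens(male_2, female_2))
print(unchanged_gendered_tokens(male_1, female_1, names={"matilda"}))
```

With pronouns alone, only the second pair is flagged (the unchanged 'herself'). The first pair is caught only if 'matilda' is passed in `names`, so a check like this depends on having some name-to-gender lookup.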

Best,
Ege

@rickwg
Collaborator

rickwg commented Sep 25, 2024

Hey Ege, thanks for bringing that to our attention - great catch! We definitely could've been clearer about how we put together the 'subject' dataset. Let me break it down:
For the 'subject' dataset, we're only labeling the first part of the grammatical subject. If there's a second part, we're leaving it out. As for those sentences you pointed out, we're altering the bold words specifically for the 'all' dataset.
Just so you know, we're actually in the process of updating our datasets. We've realized that using names to determine gender is a critical weakness in our current setup, so we're working on fixing that.
I'll make sure to close this issue once we've got the updated versions published and ready to go.
Really appreciate you flagging this. If you have any other questions or spot anything else, feel free to let me know.
