Transcript review before deposit: 16 Milan. Spanish #23

Closed
Camille-Gonzalez-VA opened this issue Mar 4, 2025 · 11 comments
Comments

@Camille-Gonzalez-VA

Transcript Language:

Spanish

Task:

  1. Comment on the issue to say you can pick this up. An OLS team member will assign it to you, but you can begin work immediately without waiting for the GitHub assignment.
  2. Please review the transcript in question, and keep an eye open for anything that could reveal WHO the transcript participant actually is. Don't comment here - instead please speak to an OLS team member to ask them to redact it.
@rivaquiroga

I'll pick this up

@rivaquiroga

Done!
There are some linguistic features that make it possible to infer the interviewee's country of origin. I'm not sure if you want to anonymize that too.

@yochannah
Member

yochannah commented Mar 7, 2025 via email

@iramosp
Member

iramosp commented Mar 7, 2025

@rivaquiroga, do you have suggestions on how to do this? Since you are our expert linguist, I'm happy to follow your lead on this. Sorry I said it wasn't needed before.
@yochannah, I don't think this is trivial (@rivaquiroga, correct me if I'm wrong!) and we'll have to recheck the other Spanish transcripts for consistency.

@rivaquiroga

The thing with anonymization is that it is usually the sum of small linguistic features that seem trivial on their own and random pieces of information that allow you to reconstruct who the person is. For example, I think I have a pretty good guess of who this person might be, considering gender markers, their variety of Spanish, a mention of a specific programming language, the fact that they are not in the capital city of their country, and a mention of something related to their project (which I suggested anonymizing).
Considering that the pool of mentees from the Spanish-speaking world is relatively small, my suggestion here is to take extra measures to make it safer. I can re-check the four transcripts if needed.

@iramosp
Member

iramosp commented Mar 7, 2025

Agree!
But I'm still not sure how to approach this. For example, would it be ok to switch linguistic features to another Spanish variant, say my own?
For consistency, @rivaquiroga, do you think we should also de-identify gender?

@rivaquiroga

You need to remove the features that are characteristic of that variant of Spanish, not change them. In most cases, these include filler words, certain adjectives, nouns, etc. For example, in Chilean Spanish, I could say something like: "Llegué a mi casa y me comí un pan batido con palta." (Some people might even be able to identify the city where I'm based if they have knowledge of Chilean breads.) The edited version could be: "Llegué a mi casa y comí [producto local]." If you change it to your variety, it would be: "Llegué a mi casa y me comí un pan con aguacate." But that is not what I said. So the approach is to de-identify, not to modify.
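As a rough illustration of "de-identify, not modify", the sketch below replaces a listed variety-specific expression with a neutral bracketed placeholder instead of swapping it for another variety's wording. The `REGIONAL_TERMS` lexicon, the `deidentify` function, and the placeholder label are hypothetical names made up for this example. A real review would still rely on a linguist's judgment, since a fixed word list will not catch grammar-level cues (the manual edit above also dropped the pronoun "me", which this sketch does not do).

```python
import re

# Hypothetical lexicon: variety-specific expressions mapped to neutral
# placeholders. The single entry is taken from the example in this thread;
# it is an illustration, not a real word list.
REGIONAL_TERMS = {
    "un pan batido con palta": "[producto local]",
}


def deidentify(text, lexicon=REGIONAL_TERMS):
    """Replace each listed expression with its placeholder.

    Longer phrases are handled first so that a short entry cannot
    break up a longer one that contains it.
    """
    for phrase in sorted(lexicon, key=len, reverse=True):
        # Match the phrase only when it is not embedded in a larger word.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE
        )
        placeholder = lexicon[phrase]
        # A function replacement avoids any escape processing of the placeholder.
        text = pattern.sub(lambda _match: placeholder, text)
    return text


original = "Llegué a mi casa y me comí un pan batido con palta."
print(deidentify(original))
```

Text with no listed expressions passes through unchanged, so the function can be run over a whole transcript and the placeholders reviewed by hand afterwards.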
Gender as an indirect identifier is a complicated issue in languages with morphological gender, like Spanish (i.e., it affects the way nouns and adjectives are written).
I also don't know if gender is a relevant variable in the paper you are writing.
If you are careful with all the other direct and indirect identifiers, gender alone won't be enough to re-identify participants. But you also have to take into account who your "population" is. For example, if there are only two men from Spanish-speaking countries in the OLS cohorts and you interviewed one of them, it would be very easy to re-identify him if you are not extremely cautious about other identifiers. And even if it is not possible to identify with 100% certainty which of the two you interviewed, both can be affected by the consequences of opening the interview.
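The point about population size can be checked mechanically. The sketch below counts how many people in a population share each combination of indirect identifiers and flags combinations shared by fewer than `k` people, in the spirit of a k-anonymity check. The records, the field choice, the threshold, and the `small_groups` function are all invented for illustration and do not describe the actual OLS cohorts.

```python
from collections import Counter

# Hypothetical population: one (gender, language background) tuple per person.
# These values are invented for illustration only.
population = (
    [("man", "Spanish-speaking")] * 2
    + [("woman", "Spanish-speaking")] * 5
)


def small_groups(records, k):
    """Return identifier combinations shared by fewer than k people."""
    counts = Counter(records)
    return {combo: n for combo, n in counts.items() if n < k}


# With k=3, any combination shared by only one or two people is flagged.
print(small_groups(population, k=3))
```

A flagged combination does not mean the transcript cannot be shared. It means the other identifiers in that transcript need extra care, because everyone in the small group is exposed to re-identification, not only the person interviewed.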

@iramosp
Member

iramosp commented Mar 21, 2025

> linguistic features that make it possible to infer the country of origin of the interviewee. I'm not sure if you want to anonymize that too

@rivaquiroga, if you still have some time, it'd be good to anonymize this too. I tried doing it for the other Spanish transcripts, and will give them another pass after looking at your suggestions for this one.

@rivaquiroga

@iramosp, I'll work on that this week and let you know when I'm done.

@rivaquiroga

@iramosp, done!

@iramosp
Member

iramosp commented Apr 2, 2025

Thank you so much!

This was a very meticulous review, and I learned a lot from it 🙌

@iramosp iramosp closed this as completed Apr 2, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Impact Paper momentum Apr 2, 2025