Duplicates in Output Files #115

Open
AnneSchoenauer opened this issue Jan 3, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@AnneSchoenauer

Dear @SKruthoff and @ysherstyuk,

Thanks a lot for your work on this!
@Tilmon noticed the following in the emission profile output files:

Duplicates: Running dplyr::distinct() on the datasets emission_profile_company.csv, emission_profile_product.csv, and emission_profile_upstream_at_company_level.csv shows that all three contain duplicate rows. Only these three datasets were tested. All output datasets should be tested for duplicates, and duplicates should be avoided. For example, the companies_id "adolf-wurth-gmbh-co-kg_00000004971238-001" has every row twice in emission_profile_product.csv.

Could you double-check whether a quality check is included that would prevent this? Do you know where the duplicates come from? Is this an issue in the code on GitHub, or does something happen on Databricks that causes it? If it is due to the code on GitHub, we would need to investigate where it comes from.

Best
Anne
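
The duplicate check described above was run with dplyr::distinct() in R. As a rough illustration of the same idea, here is a minimal Python sketch using only the standard library. The function name find_duplicate_rows, the column "value", and the sample rows are hypothetical and made up for illustration; only the column name companies_id comes from this thread.

```python
import csv
import io
from collections import Counter


def find_duplicate_rows(csv_file):
    """Return the header and a {row: count} dict for rows seen more than once."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Count each complete row; identical rows collapse onto the same key.
    counts = Counter(tuple(row) for row in reader)
    duplicates = {row: n for row, n in counts.items() if n > 1}
    return header, duplicates


# Made-up sample standing in for an output file (not real data).
sample = io.StringIO(
    "companies_id,value\n"
    "company-a,high\n"
    "company-a,high\n"
    "company-b,low\n"
)

header, duplicates = find_duplicate_rows(sample)
for row, n in duplicates.items():
    print(dict(zip(header, row)), "appears", n, "times")
```

To run this against a real file, one would pass an open file handle (for example one opened with newline="") instead of the in-memory sample. An empty duplicates dict would mean no fully identical rows were found.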

AnneSchoenauer added the bug label Jan 3, 2024
@Tilmon
Collaborator

Tilmon commented Jan 16, 2024

@SKruthoff and @ysherstyuk do you have an update on this issue yet? If not, will you be able to work on it this week? Thanks!
cc @AnneSchoenauer

@SKruthoff
Collaborator

Hi,

Yes, I am working on it as part of comparing the outputs. The duplicates seem to come from the column extra_rowid. Kalash is currently working on removing this column from the final output, as he said it should not be part of the user-facing output.

Once this column is removed, I will rerun the package and double-check whether the duplicate issue is solved.
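
One way to do the recheck mentioned above is to count duplicate rows with and without the extra_rowid column and compare the two numbers. The following Python sketch shows that comparison using only the standard library. The thread does not say how extra_rowid leads to duplicates, so the helper count_duplicate_rows, the column "value", and the sample rows are hypothetical and exist only to demonstrate the comparison.

```python
import csv
import io
from collections import Counter


def count_duplicate_rows(rows, ignore_columns=()):
    """Count surplus rows (copies beyond the first) after dropping ignore_columns."""
    counts = Counter(
        tuple((key, value) for key, value in row.items() if key not in ignore_columns)
        for row in rows
    )
    return sum(n - 1 for n in counts.values())


# Made-up sample (not real data): two fully identical rows, plus one row
# that differs from them only in extra_rowid.
sample = io.StringIO(
    "companies_id,extra_rowid,value\n"
    "company-a,1,high\n"
    "company-a,1,high\n"
    "company-a,2,high\n"
    "company-b,3,low\n"
)
rows = list(csv.DictReader(sample))

with_column = count_duplicate_rows(rows)
without_column = count_duplicate_rows(rows, ignore_columns=("extra_rowid",))
print("duplicates with extra_rowid:", with_column)
print("duplicates without extra_rowid:", without_column)
```

If the two counts differ on real output, that would suggest the column affects which rows look identical. If they match, the duplicates would have to be explained some other way.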

@Tilmon
Collaborator

Tilmon commented Jan 17, 2024

Sounds good, thanks for the update!

@Tilmon
Collaborator

Tilmon commented Aug 27, 2024

@SKruthoff @ysherstyuk has this issue been resolved? If so, please close the ticket.

Projects
None yet
Development

No branches or pull requests

4 participants