Look at distributions of features in each batch #3

Open
4 tasks
bethac07 opened this issue Feb 16, 2022 · 1 comment
Labels
Experiments Tracking experimental questions, results, or analysis

Comments

@bethac07

The question was raised as to whether the feature composition of each batch is the same. I think this will be hard to answer at the per-feature level because of the random dropout of highly correlated features, but there are a few metrics we can easily generate for each batch based only on the columns present in each CSV:

  • How many total features did this batch use?
  • What percent of features are Cells vs Nuclei vs Cytoplasm? (These should add to 100)
  • What percent of features are Texture vs Neighbors vs AreaShape etc? (These should add to 100)
  • What percent of features are RNA vs DNA vs ER vs Mito vs AGP vs BF? (These will not typically add to 100, though they may by coincidence, since AreaShape features have no channel and Colocalization features have two)
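
To illustrate why the compartment percentages add to 100 while the channel percentages need not, here is a small worked sketch. The five column names are made up for illustration (they only assume a `Compartment_MeasurementType_..._Channel` naming style), and `percent_containing` is a hypothetical helper, not something from this project.

```python
# Hypothetical feature columns, invented for illustration only.
columns = [
    "Cells_AreaShape_Area",                # no channel
    "Nuclei_AreaShape_Eccentricity",       # no channel
    "Nuclei_Texture_Contrast_DNA_3_00",    # one channel
    "Cytoplasm_Colocalization_RNA_ER",     # two channels
    "Cells_Intensity_MeanIntensity_Mito",  # one channel
]

compartments = ["Cells", "Nuclei", "Cytoplasm"]
channels = ["RNA", "DNA", "ER", "Mito", "AGP", "BF"]


def percent_containing(columns, token):
    """Percent of columns whose name contains `token` (case-sensitive substring match)."""
    hits = [c for c in columns if token in c]
    return 100 * len(hits) / len(columns)


compartment_percents = {name: percent_containing(columns, name) for name in compartments}
channel_percents = {name: percent_containing(columns, name) for name in channels}

# Every column has exactly one compartment, so these sum to 100.
compartment_total = sum(compartment_percents.values())

# Two columns have no channel and one has two, so here the sum falls short of 100.
channel_total = sum(channel_percents.values())

print(compartment_percents, compartment_total)
print(channel_percents, channel_total)
```

With a different mix of channel-free and two-channel features the channel total could land above 100, or on exactly 100 by coincidence.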
@bethac07 bethac07 added the Experiments Tracking experimental questions, results, or analysis label Feb 16, 2022
@bethac07
Author

bethac07 commented Feb 16, 2022

(I am not 100% convinced this will tell us anything super useful, but since it's really easy (see note) we might as well do it. I would do #2 first though, because if we decide that, e.g., we want to use subsampled data for all our comparisons to make things more apples-to-apples, we'd then have to do this again.)

Note: this should be <30 lines of code. Make a list of all of the compartments + channels + measurement types, open each per-batch CSV, grab the total column count, and then for each "thing to check for" report len([x for x in columns if thing_to_check_for in x]). (I'm sure there are more efficient ways to do this if we had to check millions of columns times thousands of factors, but this should be enough for our small purposes.)
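
A minimal sketch of the approach in the note, under several assumptions that are not stated in this issue: the per-batch CSVs sit together in one directory, the first row of each CSV is the header, and columns starting with `Metadata_` are not features and are skipped. The function names and the `MEASUREMENTS` list (which only covers the three types named above and would need extending for the "etc") are hypothetical.

```python
import csv
from pathlib import Path

COMPARTMENTS = ["Cells", "Nuclei", "Cytoplasm"]
CHANNELS = ["RNA", "DNA", "ER", "Mito", "AGP", "BF"]
# Only the measurement types named in the issue; extend as needed.
MEASUREMENTS = ["Texture", "Neighbors", "AreaShape"]


def read_columns(csv_path):
    """Read only the header row, so large CSVs are not loaded into memory."""
    with open(csv_path, newline="") as handle:
        return next(csv.reader(handle))


def summarize_columns(columns):
    """Total feature count plus the percent of features containing each term."""
    # Assumption: "Metadata_" columns are annotations, not features.
    features = [c for c in columns if not c.startswith("Metadata_")]
    total = len(features)
    summary = {"total_features": total}
    for thing_to_check_for in COMPARTMENTS + MEASUREMENTS + CHANNELS:
        count = len([x for x in features if thing_to_check_for in x])
        summary[thing_to_check_for] = round(100 * count / total, 1) if total else 0.0
    return summary


def summarize_batches(csv_dir, pattern="*.csv"):
    """Map each CSV file name in `csv_dir` to its column summary."""
    return {
        path.name: summarize_columns(read_columns(path))
        for path in sorted(Path(csv_dir).glob(pattern))
    }
```

Calling `summarize_batches` on the directory holding the per-batch CSVs would return one dictionary per batch that can be compared side by side. Because the match is a plain case-sensitive substring test, as in the note, a short term such as `ER` or `BF` would also count any column whose name happens to contain those letters, so it may be worth spot-checking the matches on real column names.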
