Lesson design: dedicated organization/workflow section #75

rkdarst · 2019-08-11T17:40:01Z

I've suggested that this lesson be split into two parts: one on workflows and one on environments. Part of this comes from user interviews I have done, which indicate that arranging files, code, etc. is one of the most important untaught skills. Plus, the lesson is kind of long and already has two distinct parts.

This goes a bit beyond "code" and is more about data. It's up for debate if this is on topic for us.

Here is my current backwards lesson design on the new first half (on workflows):

For who:
a) new researcher who is starting from nowhere, and needs to organize their work and use the different systems available to them properly (they have many choices).
b) existing researcher who has stuff spread all over and has made a mess
c) group leader who needs to keep their group's stuff in line.

Misc topics which may need covering, unordered:

How to arrange stuff
- each "topic" gets a short name (slug)
- you have different super-directories that can contain projects: ~/git, cluster:/scratch/, cluster:/project/, version control host, etc. Each possible machine/location has different trade-offs: backed up or not, long-term or not, shareable or not.
- flat organization: each system/filesystem has one place to put stuff, non-nested.
- A directory can be single use or multi-use:
  - single-use cases: code, software package repo, data
  - project dir: has subdirs for different purposes: code, data, scratch, results, papers
- names should be unique and shouldn't be reused for different purposes. But you can reuse a name for directories in different locations that belong to the same project, e.g. if in some locations the directory holds only the data. Ideally no duplicate files unless they are the same.
- how to synchronize things across systems:
  - small stuff: version control. This is always preferable
  - original data: could manually be done. Try to always avoid manually syncing things that can change.
  - other synchronizers: unison, but is there anything more modern?
  - try as hard as you can to avoid
- How the named directories can relate:
  - One can use another as a code library
  - One can use another as a data source
  - ...?
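
A minimal sketch of the "project dir" layout from the list above, assuming Python. The helper name `make_project` and the example slug `wordcount` are hypothetical and only for illustration; the subdirectory names are the ones listed in the bullet (code, data, scratch, results, papers).

```python
from pathlib import Path
import tempfile

# Subdirectory names taken from the "project dir" bullet; adjust as needed.
PROJECT_SUBDIRS = ("code", "data", "scratch", "results", "papers")


def make_project(root: Path, slug: str) -> Path:
    """Create root/slug with one subdirectory per purpose and return its path.

    `root` stands for one of the flat "super-directories" (e.g. ~/git or a
    cluster project area); `slug` is the short, unique name of the topic.
    """
    project = root / slug
    for name in PROJECT_SUBDIRS:
        # exist_ok=True makes the call safe to repeat on an existing project.
        (project / name).mkdir(parents=True, exist_ok=True)
    return project


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        project = make_project(Path(tmp), "wordcount")
        print(sorted(p.name for p in project.iterdir()))
```

Using the same slug under every super-directory keeps the organization flat and makes it obvious which directories belong together.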
Multi-person projects
- sharing editable code is not a good idea. Sharing original data OK. sharing scratch data risky.
- use a version control system to sync; each person has their own working copies, e.g. user1/proj1, user2/proj1, etc.
- If you have a shared directory, each user makes their own workspace inside with their working stuff. e.g. proj/user1/, proj/user2/, etc. Each of these user dirs would have e.g. code/, scratch/, etc.
Avoid duplication and copy and paste
In order to explain the above, we need to invent consistent terms for the name directories and any other types of directories.
arranging files within directories
- If your project is anything other than trivial, you will eventually want to automate it. Plan for that already.
- types of automation: single code multiple data, multiple code single data, and combinations.
- each file has certain source files
- arrange your data into "parallel series" for which a single command generates the outputs from the inputs
- TODO this needs to be finished and we need some way to explain this.
- The snakemake example goes over the automation but, last I checked, does not cover the motivation/data setup enough. It may be enough to learn the arrangement passively, but we should make sure it is pointed out.
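
One way to picture the "parallel series" idea (single code, multiple data) is the sketch below, assuming Python. Everything in it is hypothetical and for illustration only: the `data/` and `results/` directory names, the `.txt` and `.counts` suffixes, and the `count_words` stand-in for the real analysis. It is not the lesson's snakemake example, only the same input-to-output naming pattern written out by hand.

```python
from pathlib import Path


def output_for(input_path: Path, results_dir: Path) -> Path:
    """Map data/NAME.txt to results/NAME.counts (same stem, parallel directory)."""
    return results_dir / (input_path.stem + ".counts")


def plan(data_dir: Path, results_dir: Path) -> list[tuple[Path, Path]]:
    """List (input, output) pairs for every input file, in a stable order."""
    return [(p, output_for(p, results_dir)) for p in sorted(data_dir.glob("*.txt"))]


def count_words(text: str) -> int:
    """Stand-in analysis step: number of whitespace-separated words."""
    return len(text.split())


def run(data_dir: Path, results_dir: Path) -> None:
    """Single entry point: regenerate every output from its input."""
    results_dir.mkdir(parents=True, exist_ok=True)
    for src, dst in plan(data_dir, results_dir):
        dst.write_text(f"{count_words(src.read_text())}\n")


if __name__ == "__main__":
    # Assumes the script is started from the project directory.
    run(Path("data"), Path("results"))
```

Because each output name is derived from its input name, adding a new data file needs no code change, which is the property a workflow tool relies on when it automates the same steps.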

Possible exercises:

[more needed... could include "what is wrong with this setup", "organize these files", and so on.]
Interpret several makefiles and say whether they are SIMD, MISD, or MIMD, and ask how they work.
What type of problems can easily fit in the SIMD and MISD paradigms?
Which of these are not a good system for sharing files/code/data, and what can go wrong with it: github, email, personal webspaces, archive
Who has had stories about disasters of organization?
given a list of a lot of file names for a sample project, decide which go into which subdirs.
Evaluate a sample project with some sort of flaw... perhaps no separation of data (original and scratch) and code.
One episode should explain the concept of the word count repo; an exercise could be to organize the current word count example into the necessary directories in order to do the automation using the snakemake example.

bast · 2019-12-01T10:09:49Z

There are many good thoughts here, but the risk is that this issue is too big for anybody to tackle a few days before teaching it. A symptom of this is that it has now been open for almost 4 months.

Also, before we go into a big redesign, we should check whether we are trying to reinvent something that The Turing Way is already doing, in which case we could rather contribute to their lessons.

samumantha · 2023-08-23T12:49:32Z

Are there still some of @rkdarst's really good suggestions that are not implemented yet but, in your opinion, could/should be?

bast · 2023-08-23T13:28:39Z

If I leave it open, it will stay open forever. If I close it, I risk being rude and missing some good points. I don't know; there are many good points here, but some go beyond the 2 h format we have. Maybe RD can point out the 2 or 3 most important points which we should absolutely implement?
