Release v3.4.0: Plotting class labels, RELION 3.1 support, and phase-randomization for FSCs · ml-struct-bio/cryodrgn

In this minor release we add several new features and commands, expand a few existing ones, and refactor key parts of the codebase to make these changes easier to implement.

New features

full support for RELION 3.1 .star files with optics values stored in a separate grouped table before or after the main table (#241, #40, #10)
- refactored Starfile class now has properties .apix and .resolution that return particle-wise optics values for commonly used parameters, as well as methods .get_optics_values() and .set_optics_values() for any parameter
  - these methods automatically use the optics table if available
- cryodrgn parse_ctf_star can now load all particle-wise optics values from the .star file itself; previously it relied on user input for parameters such as A/px, resolution, voltage, and spherical aberration, or simply took the first value found in the file
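To illustrate what "particle-wise optics values" means for a RELION 3.1 file, here is a minimal sketch that is not taken from the cryoDRGN source. It assumes the optics and particle tables have already been parsed into lists of dicts and that they are linked by the standard `_rlnOpticsGroup` column; the helper name `particlewise_values` is hypothetical.

```python
def particlewise_values(particles, optics, field):
    """Return one value of `field` per particle (hypothetical helper).

    Uses the main particle table when it already stores the field,
    otherwise looks the value up in the optics table via _rlnOpticsGroup.
    """
    if particles and field in particles[0]:
        return [row[field] for row in particles]

    # Map each optics group to its value for the requested field
    by_group = {row["_rlnOpticsGroup"]: row[field] for row in optics}
    return [by_group[row["_rlnOpticsGroup"]] for row in particles]


# Toy tables standing in for a parsed RELION 3.1 .star file
optics = [
    {"_rlnOpticsGroup": 1, "_rlnImagePixelSize": 1.06, "_rlnVoltage": 300.0},
    {"_rlnOpticsGroup": 2, "_rlnImagePixelSize": 0.83, "_rlnVoltage": 200.0},
]
particles = [
    {"_rlnImageName": "1@a.mrcs", "_rlnOpticsGroup": 1},
    {"_rlnImageName": "2@a.mrcs", "_rlnOpticsGroup": 2},
    {"_rlnImageName": "1@b.mrcs", "_rlnOpticsGroup": 1},
]

apix = particlewise_values(particles, optics, "_rlnImagePixelSize")
names = particlewise_values(particles, optics, "_rlnImageName")
```

In this toy case each particle receives the pixel size of its own optics group rather than the first value found in the file, which is the behavior change described above.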
backproject_voxel now computes FSC threshold values corrected for mask overfitting using high-resolution phase randomization as done in cryoSPARC, and shows FSC curves and threshold values for various types of masks
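For readers unfamiliar with this correction, the sketch below shows the commonly used noise-substitution formula, in which the masked FSC is compared against the FSC of the same half-maps after their phases have been randomized beyond a chosen shell. This is an assumption about the general technique, not a copy of the backproject_voxel code; the function and argument names are hypothetical.

```python
def corrected_fsc(fsc_masked, fsc_randomized, first_randomized_shell):
    """Correct a masked FSC curve for correlation introduced by the mask.

    For shells at or beyond `first_randomized_shell`:
        FSC_corrected = (FSC_masked - FSC_randomized) / (1 - FSC_randomized)
    Shells below that index are returned unchanged.
    """
    corrected = []
    for shell, (f_mask, f_rand) in enumerate(zip(fsc_masked, fsc_randomized)):
        if shell < first_randomized_shell or f_rand >= 1.0:
            # No randomization applied here (or the formula is undefined)
            corrected.append(f_mask)
        else:
            corrected.append((f_mask - f_rand) / (1.0 - f_rand))
    return corrected


# Toy curves: phases were randomized from shell index 2 onwards
fsc_tight = [1.0, 0.9, 0.5, 0.3]
fsc_rand = [1.0, 0.9, 0.2, 0.1]
fsc_corr = corrected_fsc(fsc_tight, fsc_rand, first_randomized_shell=2)
```

Any correlation that survives phase randomization can only come from the mask, so subtracting it lowers the tight-mask curve in the high-resolution shells.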
cryodrgn_utils plot_classes for creating plots of cryoDRGN results colored by a given set of particle class labels
- for now, only creates 2D kernel density plots of the latent space embeddings clustered using UMAP and PCA, but more plots will be added in the future:
```
$ cryodrgn_utils plot_classes 002_train-vae_dim.256 9 --labels published_labels_major.pkl --palette viridis --svg
```
  analyze.9/umap_kde_classes.png

Improvements to existing features

backproject_voxel also now creates a new directory using -o/--outdir into which it places output files, instead of naming all files after the output reconstructed volume -o/--outfile
- files within this directory will always have the same names across runs:
  - backproject.mrc the full reconstructed volume
  - half_map_a.mrc, half_map_b.mrc reconstructed half-maps using an odd/even particle split
  - fsc-vals.txt all five FSC curves in space-delimited format
  - fsc-plot.png a plot of these five FSC curves as shown above
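As a rough illustration of how a threshold value can be read off an FSC curve, the sketch below uses one simple convention: the first shell at which the curve drops below the threshold. It assumes spatial frequencies are given in units of 1/px and makes no assumption about the column layout of fsc-vals.txt; the function name is hypothetical and the convention used by backproject_voxel may differ.

```python
def resolution_at_threshold(freqs, fsc, threshold=0.143, apix=1.0):
    """Return the resolution (in Å) at the first shell where FSC < threshold.

    `freqs` are spatial frequencies in 1/px and `apix` is the pixel size in
    Å/px. Returns None if the curve never drops below the threshold.
    """
    for freq, value in zip(freqs, fsc):
        if value < threshold:
            return apix / freq if freq > 0 else float("inf")
    return None


# Toy FSC curve sampled at five shells
freqs = [0.0, 0.1, 0.2, 0.3, 0.4]
fsc = [1.0, 0.95, 0.4, 0.1, 0.02]

res_0143 = resolution_at_threshold(freqs, fsc, threshold=0.143, apix=1.5)
res_05 = resolution_at_threshold(freqs, fsc, threshold=0.5, apix=1.5)
```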
downsample can now downsample each of the individual files in a stack referenced by a .star or .txt file, returning a new .star file or .txt file referencing the new downsampled stack
- used by specifying a .star or .txt file as -o/--outfile when using a .star or .txt file as input:
```
cryodrgn downsample my_particle_stack.star -D 128 -o particles.128.star --datadir folder_with_subtilts/ --outdir my_new_datadir/
```
cryodrgn_utils fsc can now take three volumes as input, in which case the first volume will be used to generate masks to produce cryoSPARC-style FSC curve plots including phase randomization for the “tight” mask (see New features above)
cryodrgn_utils plot_fsc is now more flexible with the types of input files it can accept for plotting, including .txt files with the new type of cryoSPARC-style FSC curve output from backproject_voxel
cryodrgn filter now accepts --force for less interactivity after the selection has been made
filter_mrcs now prints both the original and new number of particles, and generates an output file name automatically if one is not given
cryodrgn abinit_het saves configs alongside model weights in weights.pkl for easier access and output checkpoint identification

Addressing bugs and other issues

better axis labels for FSC plotting, passing Apix values from backproject_voxel (#385)
cryodrgn filter doesn’t show particle indices in hover text anymore, as this proved visually distracting; we now show these indices in a text box in the corner of the plot
cryodrgn filter saves chosen indices as a np.array instead of Python standard list to prevent type issues in downstream analyses
commands_utils.translate_mrcs was not working (was assuming particles.images() returned a numpy array instead of a torch Tensor) — this has been fixed and tests added for translations of image stacks
going back to listing modules to be included in the cryodrgn and cryodrgn_utils command line interfaces explicitly, as Python will sometimes install older modules into the corresponding folders which confuses automated scanning for command modules
fixing parsing of 8bit and 16bit .mrc files produced using e.g. --outmode=int8 in EMAN2 (#113)
adding support and continuous integration testing for Python 3.11

Refactoring classes that parse input files

We wanted to make some updates to the ImageSource class and its children, which were introduced in v3.0.0 as part of a refactoring of the processes used to load and parse input datasets. We also sought to simplify and clean up the code used to parse .star and .mrcs file data in cryodrgn.starfile and cryodrgn.mrc respectively.

the code for the ImageSource base class and its child classes in cryodrgn.source has been cleaned up to improve code style, remove redundancies, and support the Starfile and mrcfile refactorings described below
- more consistent and sensible parsing of filenames with datadir for _MRCDataFrameSource classes such as TxtFileSource and StarfileSource (#386)
  - all of this logic is now contained in a new method _MRCDataFrameSource.parse_filename which is applied in __init__:
    1. If the filename by itself points to a file that exists, use filename.
    2. Otherwise, if os.path.join(datadir, filename) exists, use that.
    3. Finally, try os.path.join(datadir, os.path.basename(filename)).
    4. If that doesn’t exist, throw an error!
- adding ImageSource.orig_n attribute which is often useful for accessing the original number of particles in the stack before filtering was applied
- adding ImageSource.write_mrc(), to avoid having to use MRCFile.write() for ImageSource objects; MRCFile.write() use case for arrays has been replaced by mrcfile.write_mrc (see below)
  - see use in a refactored cryodrgn downsample for batch writing to .mrc output
- adding MRCFileSource.write(), a wrapper for mrcfile.write_mrc()
- adding MRCFileSource.apix property for convenient access to header metadata
- getting rid of ArraySource, whose behavior can be subsumed into ImageSource with lazy=False
- improving error messages in ImageSource.from_file(), ._convert_to_ndarray(), images()
- ImageSource.lazy is now a property, not an attribute, and is dynamically dependent on whether self.data has actually been loaded or not
- adding _MRCDataFrameSource.sources convenience iterator property
- StarfileSource now inherits directly from the Starfile class (as well as _MRCDataFrameSource) for better access to .star utilities than using a Starfile object as an attribute (.df in the old v3.3.3 class)
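The filename resolution order listed above for _MRCDataFrameSource.parse_filename can be sketched as a standalone function. This is only an illustration of the four steps as described in these notes, not the method's actual code; the name `resolve_filename` and the demo paths are hypothetical.

```python
import os
import tempfile


def resolve_filename(filename, datadir=None):
    """Resolve an image stack path using the order described in the notes."""
    # 1. The filename by itself points to an existing file
    if os.path.isfile(filename):
        return filename

    if datadir is not None:
        # 2. The filename relative to datadir
        candidate = os.path.join(datadir, filename)
        if os.path.isfile(candidate):
            return candidate

        # 3. Only the base name of the file, placed inside datadir
        candidate = os.path.join(datadir, os.path.basename(filename))
        if os.path.isfile(candidate):
            return candidate

    # 4. Nothing matched
    raise FileNotFoundError(f"Cannot find `{filename}` (datadir={datadir})")


# Demo: the stack was moved into a new folder, so only step 3 succeeds
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "stack.mrcs"), "w").close()
    resolved = resolve_filename("old/path/stack.mrcs", datadir=tmp)
    resolved_name = os.path.basename(resolved)
    resolved_in_datadir = os.path.dirname(resolved) == tmp
```

Trying the most specific location first means a .star file with valid paths is never silently redirected to a different file of the same name in datadir.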
.star file methods have been refactored to establish three clear ways of accessing and manipulating .star data for different levels of features, with RELION3.1 operations now implemented in Starfile class methods:
- cryodrgn.starfile.parse_star and write_star to get and perform simple operations on the main data table and/or the optics table
  e.g. in filter_star:
```
stardf, data_optics = parse_star(args.input)
...
write_star(args.o, data=filtered_df, data_optics=new_optics)
```
- cryodrgn.starfile.Starfile for access to .star file utilities like generating optics values for each particle in the main data table using parameters saved in the optics table
  e.g. in parse_ctf_star:
```
stardata = Starfile(args.star)
logger.info(f"{len(stardata)} particles")
apix = stardata.apix
resolution = stardata.resolution
...
ctf_params[:, i + 2] = (
    stardata.get_optics_values(header)
    if header not in overrides
    else overrides[header]
)
```
- cryodrgn.source.StarfileSource for access to .star file utilities along with access to the images themselves using ImageSource methods like .images()
- see our more detailed write-up for more information:
  Starfile Refactor
for .mrc files, we removed MRCFile as there are no analogues presently for the kinds of methods supported by Starfile; the operations on the image array requiring data from the image header are presently contained within MRCFileSource, reflecting the fact that .mrcs files are the image data themselves and not pointers to other files containing the data
- MRCFile, which consisted solely of static parse and write methods, has been replaced by module-level functions that bring back the old names of these methods (parse_mrc and write_mrc)
  - MRCFile.write(out_mrc, vol) → write_mrc(out_mrc, vol)
  - when vol is an ImageSource object, we now use ImageSource.write_mrc() instead
- in general, parse_mrc and write_mrc are for using the entire image stack as an array, while MRCFileSource is for accessing batches of images as tensors
- the mrc module is now named mrcfile, which is more descriptive and matches the starfile module, its parallel for processing input files
- examples from across the codebase:
  - commands_utils.add_psize
    
    old:
```
from cryodrgn.mrc import MRCFile, MRCHeader
from cryodrgn.source import ImageSource

header = MRCHeader.parse(args.input)
header.update_apix(args.Apix)

src = ImageSource.from_file(args.input)
MRCFile.write(args.o, src, header=header)
```
    new:
```
from cryodrgn.mrcfile import parse_mrc, write_mrc

vol, header = parse_mrc(args.input)
header.apix = args.Apix
write_mrc(args.o, vol, header=header)
```
  - commands_utils.flip_hand
    old:
```
src = ImageSource.from_file(args.input)
# Note: Proper flipping (compatible with legacy implementation) only happens when chunksize is equal to src.n
MRCFile.write(
    outmrc,
    src,
    transform_fn=lambda data, indices: np.array(data.cpu())[::-1],
    chunksize=src.n,
)
```
    Note that the awkward combination of MRCFileSource and MRCFile above meant having to cast the images from tensors to arrays after they were loaded!
    
    new:
```
vol, header = parse_mrc(args.input)
vol = vol[::-1]
write_mrc(outmrc, vol, header=header)
```
- also made some updates to MRCHeader for ease of use:
  - making mrc module variables like DTYPE_FOR_MODE header class attributes
  - creating properties apix and origin with .getter and .setter methods, simplifying retrieval of these values
    - e.g. header.origin = (0, -1, 0) instead of header.update_origin(0, -1, 0), and header.origin instead of header.get_origin() to get values
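The property pattern described above can be illustrated with a toy header class. This is a sketch and not the cryoDRGN MRCHeader implementation: the class name `HeaderSketch` is hypothetical, it stores only a handful of fields, and it assumes the pixel size can be derived from the x-axis cell length alone.

```python
class HeaderSketch:
    """Toy stand-in for an MRC header exposing apix and origin as properties."""

    def __init__(self, nx, xlen, origin=(0.0, 0.0, 0.0)):
        self.fields = {"nx": nx, "xlen": xlen}
        self.origin = origin  # goes through the property setter below

    @property
    def apix(self):
        # Pixel size is the cell length divided by the number of pixels
        return self.fields["xlen"] / self.fields["nx"]

    @apix.setter
    def apix(self, value):
        self.fields["xlen"] = self.fields["nx"] * value

    @property
    def origin(self):
        return tuple(self.fields[key] for key in ("xorg", "yorg", "zorg"))

    @origin.setter
    def origin(self, value):
        for key, coord in zip(("xorg", "yorg", "zorg"), value):
            self.fields[key] = coord


header = HeaderSketch(nx=128, xlen=128.0)
header.apix = 1.5            # replaces an update_apix()-style call
header.origin = (0, -1, 0)   # replaces an update_origin()-style call
```

Reading and writing then use the same attribute name, so callers no longer need to remember separate getter and updater methods.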

Code Quality Control

improving module-level docstrings with more info and usage examples
- better parsing of multi-line example usage commands split up by \ in cryodrgn.command_line when producing help messages for -h
- see e.g. cryodrgn.dataset, cryodrgn.starfile, cryodrgn.source, cryodrgn filter, cryodrgn filter_mrcs
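As an illustration of what handling usage commands split up by \ involves, the sketch below merges backslash-continued lines of a docstring example into a single command line. It is a guess at the general approach and not the cryodrgn.command_line code; the function name is hypothetical.

```python
import re


def join_continued_lines(usage_text):
    """Merge each line ending in a backslash with the line that follows it."""
    # Drop the backslash, the newline, and surrounding indentation
    return re.sub(r"\s*\\\n\s*", " ", usage_text)


doc = "$ cryodrgn downsample particles.mrcs \\\n      -D 128 -o particles.128.mrcs\n"
joined = join_continued_lines(doc)
```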
in automated CI testing, we now test 3.9 + 1.12, 3.10 + 2.1, and 3.11 + 2.4 in terms of Python version + PyTorch version, instead of doing all pairs of {3.9, 3.10} and {1.12, 2.1, 2.3}, allowing for CI testing to be expanded into Python 3.11 without running too many test jobs
better error messages for cryodrgn.pose and cryodrgn.ctf when inputs don’t match in dimension or have an unexpected format
creating new module cryodrgn.masking, moving e.g. utils.window_mask() to masking.spherical_window_mask()
bringing back unittest.sh, a set of smoke tests for reconstruction commands that can be run outside of pytest and regular automated CI testing, by replacing outdated commands (#267)
first release with regression pipeline testing, confirming that the outputs of key reconstruction commands have remained unchanged: see summary here
