Custom barcode demultiplexing forward-reverse primer combination #1253

Open
HannahBenisty opened this issue Feb 19, 2025 · 8 comments
Labels
barcode Issues related to barcoding

Comments

@HannahBenisty

Hi,

I have a custom barcode setup, where the forward and reverse primers are different from each other in each read. I haven’t been able to find information on how to format the sequence file for this type of arrangement. Could you provide guidance on how to properly prepare the file in this case?

Thank you!
Hannah

@HalfPhoton
Collaborator

Hi @HannahBenisty,

I'm assuming you're referring to the flanking regions of your barcodes when you say 'primer' in your question but please correct me if I'm wrong here.

The documentation on custom barcode configurations describes how you can define a barcoding configuration where the front and rear flanking regions are different. The sequences file is just a FASTA file, as described on the same page.

Best regards,
Rich

HalfPhoton added the "barcode" (Issues related to barcoding) label on Feb 19, 2025
@HannahBenisty
Author

Hi Rich,

Thanks a lot for the fast reply. The flanking regions of my barcodes are different for flank 1 and flank 2, and the front and rear barcodes are different as well. Where do we indicate which sample corresponds to each barcode combination?

Thanks!

@HalfPhoton
Collaborator

In the sample sheet.

@HannahBenisty
Author

HannahBenisty commented Feb 21, 2025

Hi,

I managed to demultiplex the reads; however, when I use the sample sheet, all reads end up unclassified. Could you please help me with this?

dorado demux all.fastq -o demux --kit-name custom --barcode-arrangement arrangement.toml --emit-fastq --barcode-sequences sequences --sample-sheet sample_sheet1 --barcode-both-ends

arrangement file

[arrangement]
name = "custom"
kit = "custom"

mask1_front = "CTACACGACGCTCTTCCGATCT"
mask1_rear = "AGRGTTYGATYMTGGCTCAG"
mask2_front = "AAGCAGTGGTATCAACGCAGAG"
mask2_rear = "TRGYTACCTTGTTACGACTT"

barcode1_pattern = "F%02i"
barcode2_pattern = "B%02i"
first_index = 1
last_index = 2

[scoring]
min_soft_barcode_threshold = 0.2
min_hard_barcode_threshold = 0.2
min_barcode_score_dist = 0.1

sample-sheet file

experiment_id,kit,flow_cell_id,alias,barcode
test,custom,PBA84545,1A1,F01
test,custom,PBA84545,1A1,B01
test,custom,PBA84545,1A2,F02
test,custom,PBA84545,1A2,B02

sequences file

>F01
GATCGAGTCA
>B01
TCATCGACGT
>F02
GATCGAGTCA
>B02
CATGATCGAC
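
A quick consistency check between the arrangement and the sequences file can be sketched as below. This is a minimal sketch that assumes the `barcode1_pattern` and `barcode2_pattern` values behave like printf-style format strings expanded over `first_index` to `last_index`; the helper names `expand_pattern` and `fasta_names` are hypothetical and are not part of dorado.

```python
def expand_pattern(pattern: str, first_index: int, last_index: int) -> list[str]:
    """Expand a printf-style pattern such as "F%02i" over an inclusive index range.

    Assumption: the arrangement patterns follow printf conventions.
    """
    return [pattern % i for i in range(first_index, last_index + 1)]


def fasta_names(text: str) -> list[str]:
    """Return the record names (header lines without '>') of a FASTA string."""
    return [line[1:].strip() for line in text.splitlines() if line.startswith(">")]


# Sequences file as posted in this thread.
SEQUENCES = """>F01
GATCGAGTCA
>B01
TCATCGACGT
>F02
GATCGAGTCA
>B02
CATGATCGAC
"""

# Names implied by the posted arrangement (first_index = 1, last_index = 2).
expected = expand_pattern("F%02i", 1, 2) + expand_pattern("B%02i", 1, 2)

# Names the arrangement implies that have no record in the sequences file.
missing = sorted(set(expected) - set(fasta_names(SEQUENCES)))

print(expected)  # ['F01', 'F02', 'B01', 'B02']
print(missing)   # []
```

An empty `missing` list only shows that every expanded name has a FASTA record; it says nothing about how the sample sheet refers to those barcodes.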

@HalfPhoton
Collaborator

Hi @HannahBenisty,

I think there are a few issues here.

The barcode name in the sample sheet should be the complete name, barcodeXX:

experiment_id,kit,flow_cell_id,alias,barcode
test,custom,PBA84545,1A1,barcode01
test,custom,PBA84545,1A2,barcode02
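
A sample sheet can be checked for this before rerunning demux. The sketch below is illustrative only: it assumes that "barcodeXX" means the literal word `barcode` followed by two digits, as in the corrected sheet above, and `bad_barcode_rows` is a hypothetical helper, not a dorado function.

```python
import csv
import io
import re

# Assumption: valid names are "barcode" plus exactly two digits (barcodeXX).
BARCODE_NAME = re.compile(r"barcode\d{2}")


def bad_barcode_rows(sample_sheet_text: str) -> list[tuple[str, str]]:
    """Return (alias, barcode) pairs whose barcode is not of the form barcodeXX."""
    reader = csv.DictReader(io.StringIO(sample_sheet_text))
    return [
        (row["alias"], row["barcode"])
        for row in reader
        if not BARCODE_NAME.fullmatch(row["barcode"])
    ]


# Sample sheet as originally posted in this thread.
ORIGINAL = """experiment_id,kit,flow_cell_id,alias,barcode
test,custom,PBA84545,1A1,F01
test,custom,PBA84545,1A1,B01
test,custom,PBA84545,1A2,F02
test,custom,PBA84545,1A2,B02
"""

# Corrected sample sheet from the reply above.
CORRECTED = """experiment_id,kit,flow_cell_id,alias,barcode
test,custom,PBA84545,1A1,barcode01
test,custom,PBA84545,1A2,barcode02
"""

print(bad_barcode_rows(ORIGINAL))   # four rows flagged: F01, B01, F02, B02
print(bad_barcode_rows(CORRECTED))  # []
```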

We don't support the IUPAC ambiguity codes (R/M/Y) that you've used in the mask definitions. They might still work, but they'll always count as mismatches.

mask1_rear = "AGRGTTYGATYMTGGCTCAG"
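
To see which masks are affected, the non-ACGT characters can be listed per mask. This is a small sketch using the mask values posted in this thread; `ambiguity_positions` is a hypothetical helper name.

```python
UNAMBIGUOUS = set("ACGT")


def ambiguity_positions(seq: str) -> list[tuple[int, str]]:
    """Return (0-based index, character) for every character that is not A, C, G or T."""
    return [(i, base) for i, base in enumerate(seq) if base.upper() not in UNAMBIGUOUS]


# Mask definitions from the posted arrangement file.
masks = {
    "mask1_front": "CTACACGACGCTCTTCCGATCT",
    "mask1_rear": "AGRGTTYGATYMTGGCTCAG",
    "mask2_front": "AAGCAGTGGTATCAACGCAGAG",
    "mask2_rear": "TRGYTACCTTGTTACGACTT",
}

for name, seq in masks.items():
    print(name, ambiguity_positions(seq))
# mask1_front []
# mask1_rear [(2, 'R'), (6, 'Y'), (10, 'Y'), (11, 'M')]
# mask2_front []
# mask2_rear [(1, 'R'), (3, 'Y')]
```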

Which dorado version are you using? I don't recognise these fields in [scoring] and they're not a part of the current specification:

min_soft_barcode_threshold = 0.2
min_hard_barcode_threshold = 0.2

Best regards,
Rich

@HannahBenisty
Author

Hi Rich,

Thanks a lot! Just changing the sample sheet as you indicated worked correctly.

Sorry about the scoring part of the arrangement file; it was added by mistake.

Regarding the mask definitions, do you recommend specifying them differently in the arrangement file, or should I keep them as they are?

Best,
Hannah

@HalfPhoton
Collaborator

I'm pleased to hear you're getting better results now - thanks for letting us know!


Regarding the IUPAC ambiguity codes in the masks: you might see better results if each code were replaced by one of the bases it stands for, but this depends on which permutations are actually present in your data. If all permutations are evenly represented, leaving the codes in place to count as mismatches is probably the best approach.

However, if, for example, only 2 of the possible 2**4 = 16 permutations of mask1_rear occur, then resolving 2 of the ambiguity codes to match one permutation and 2 to match the other might give better results, as you'll get 2 more matches in each case.
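
The arithmetic can be illustrated with a short sketch. It uses the standard IUPAC nucleotide codes and a plain Hamming distance as a stand-in for mismatch counting, which is an assumption made for illustration and not a description of dorado's actual scoring. The two "observed" permutations are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(seq: str) -> list[str]:
    """Enumerate every concrete sequence an IUPAC-coded sequence can stand for."""
    return ["".join(p) for p in product(*(IUPAC[base] for base in seq))]


def hamming(a: str, b: str) -> int:
    """Count positions at which two sequences differ (zip stops at the shorter one)."""
    return sum(x != y for x, y in zip(a, b))


mask = "AGRGTTYGATYMTGGCTCAG"  # mask1_rear as posted
variants = expand(mask)
print(len(variants))  # 16, i.e. 2**4

# Hypothetical case: only these two permutations occur in the data.
variant_1 = variants[0]   # every code resolved to its first base
variant_2 = variants[-1]  # every code resolved to its last base

# Compromise mask: the first two codes follow variant_1, the last two follow
# variant_2 (the codes sit at indices 2, 6, 10 and 11).
compromise = variant_1[:10] + variant_2[10:]

# Leaving the codes in place: every code is a mismatch against a concrete base.
print(hamming(mask, variant_1), hamming(mask, variant_2))              # 4 4
# Compromise mask: 2 mismatches against each observed permutation.
print(hamming(compromise, variant_1), hamming(compromise, variant_2))  # 2 2
```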

I've started the discussion internally to see if we could support ambiguity codes in a future version of Dorado.

Best regards,
Rich

@HannahBenisty
Author

Thanks!

Regarding the barcodes, is there a maximum number of barcodes? It looks like demultiplexing works with a small number of barcodes (e.g. 10), but if I increase it (e.g. to 99), then all reads are unclassified.

Best,
Hannah

2 participants