train.py error - Expected 2D array, got 1D array instead #24

Open
Rhinogradentia opened this issue Mar 17, 2025 · 11 comments

Comments

@Rhinogradentia

Rhinogradentia commented Mar 17, 2025

Hi,

Another question.
The tool was installed via conda on Python 3.7.

I have the following error when running train.py:

(plassClass) /PlasClass$ train.py -p plasmids.fasta -c genome.fasta -o train/ -n 25
Starting PlasClass training
Getting reference lengths
Sampling 96 fragments for length 1000
Getting k-mer frequencies
Learning classifier
Saving classifier
Sampling 9 fragments for length 10000
Getting k-mer frequencies
Learning classifier
Saving classifier
Sampling 0 fragments for length 100000
Getting k-mer frequencies
Learning classifier
Traceback (most recent call last):
  File "/home/<user>/miniconda3_new/envs/plassClass/bin/train.py", line 197, in <module>
    main(args)
  File "/home/<user>/miniconda3_new/envs/plassClass/bin/train.py", line 193, in main
    train(plasfile,chromfile,outdir,num_procs,ks,lens)
  File "/home/<user>/miniconda3_new/envs/plassClass/bin/train.py", line 172, in train
    scaler = StandardScaler().fit(data)
  File "/home/<user>/miniconda3_new/envs/plassClass/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 639, in fit
    return self.partial_fit(X, y)
  File "/home/<user>/miniconda3_new/envs/plassClass/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 663, in partial_fit
    force_all_finite='allow-nan')
  File "/home/<user>/miniconda3_new/envs/plassClass/lib/python3.7/site-packages/sklearn/utils/validation.py", line 521, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The FASTA files contain NCBI sequences: 4 in the plasmid file and 7 in the genome file, with no empty lines, all from a single species.

What might be the reason for this error and what can I do to solve it?

Thank you in advance.
Best,
Nadine

@dpellow
Collaborator

dpellow commented Mar 17, 2025

I believe this happens because none of the sequences fall in the 100K nt length bin. Is that correct? Are you using full-length genomes?

If so, I would modify the length bins you are using via the -l parameter.

@Rhinogradentia
Author

Yes, I used the complete sequences. I must have missed that I should bin them. Where can I find this info? Should I just split the chromosomes into 100k pieces?
Thank you.
Best,
Nadine

@dpellow
Collaborator

dpellow commented Mar 17, 2025

No, you don't need to bin the sequences; PlasClass does that. But one of the bins is empty, which is why it is giving an error. I'm not sure why that bin is empty. What are the lengths of the 11 sequences? Are you able to share the FASTA file?
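
For reference, sequence lengths can be read straight from a FASTA file. The following is a minimal sketch using only the Python standard library; the function name `fasta_lengths` and the inline demo records are illustrative and not part of PlasClass.

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence_id: length} for an iterable of FASTA-formatted lines.

    Hypothetical helper for illustration; not part of PlasClass.
    """
    lengths: Dict[str, int] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID;
            # fall back to a positional name if the header is empty.
            parts = line[1:].split()
            current = parts[0] if parts else "seq{}".format(len(lengths) + 1)
            lengths[current] = 0
        elif current is not None:
            # Sequence data may be wrapped over several lines.
            lengths[current] += len(line)
    return lengths


# Made-up demo records (not the sequences from this thread).
demo = [">seqA example header", "ACGTACGT", "ACGT", ">seqB", "GGGCC"]
for seq_id, n in fasta_lengths(demo).items():
    print(seq_id, n)
```

To run it on real files, pass an open file handle instead of the demo list, for example `fasta_lengths(open("plasmids.fasta"))`.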

@Rhinogradentia
Author

Rhinogradentia commented Mar 17, 2025

Yes, I can share them; they are public. I've attached them.

sequences.zip

genome:

NC_006077.1 Kluyveromyces lactis mitochondrion, complete genome 40291
NC_006042.1 Kluyveromyces lactis strain NRRL Y-1140 chromosome F complete sequence 2602197
NC_006041.1 Kluyveromyces lactis strain NRRL Y-1140 chromosome E complete sequence 2234072
NC_006040.1 Kluyveromyces lactis strain NRRL Y-1140 chromosome D complete sequence 1715506
NC_006039.1 Kluyveromyces lactis strain NRRL Y-1140 chromosome C complete sequence 1753957
NC_006038.1 Kluyveromyces lactis strain NRRL Y-1140 chromosome B complete sequence 1320834
NC_006037.1 Kluyveromyces lactis strain NRRL Y-1140 chromosome A complete sequence 1062590

plasmids:

M11815.1 Plasmid pGKL1 from killer yeast (K.lactis), complete 8876
X01095.1 Yeast DNA killer plasmid pGKL1 8874
X01096.1 Yeast DNA killer plasmid pGKL2 left terminal region 793
X01097.1 Yeast DNA killer plasmid pGKL2 right terminal region 1317

@dpellow
Collaborator

dpellow commented Mar 24, 2025

@Rhinogradentia you have no plasmids longer than 10K nt, so there are no positive sequences for your classifier to train on at that length. If you are only interested in training on the plasmids in your reference file, you don't need to classify any sequences > 10 kb as plasmids, so you shouldn't train a model for that length bin. You can define the length bins you need using -l.

I'm not sure what use case you are trying to train a model for. The normal use case would use a very large database of training sequences.
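
The empty-bin situation described above can be checked before training. The sketch below is a minimal illustration, not PlasClass's actual sampling code: it assumes, following the comment above, that a length bin only gets positive examples from plasmids longer than the next-smaller bin length. The helper name `sequences_per_bin` is hypothetical; the plasmid lengths and bin values are the ones quoted in this thread.

```python
from typing import Dict, Sequence


def sequences_per_bin(seq_lengths: Sequence[int], bins: Sequence[int]) -> Dict[int, int]:
    """Count, for each length bin, the sequences longer than the previous bin.

    ASSUMPTION: this bin rule is inferred from the maintainer's comment in
    this thread; PlasClass's real fragment sampling may differ.
    """
    counts: Dict[int, int] = {}
    lower = 0
    for upper in sorted(bins):
        counts[upper] = sum(1 for n in seq_lengths if n > lower)
        lower = upper
    return counts


# Plasmid lengths as listed earlier in the thread.
plasmid_lengths = [8876, 8874, 793, 1317]

# Bin lengths that appear in the training log of the failing run.
log_bins = sequences_per_bin(plasmid_lengths, [1000, 10000, 100000])
# Bin lengths for a run that stops at 10000.
shorter_bins = sequences_per_bin(plasmid_lengths, [1000, 5000, 10000])

for label, counts in (("log bins", log_bins), ("shorter bins", shorter_bins)):
    empty = [b for b, c in counts.items() if c == 0]
    print(label, counts, "empty bins:", empty)
```

Under this assumption, only the 100000 bin has no plasmid to draw from, which matches the point at which the traceback appears in the log.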

@Rhinogradentia
Author

@dpellow, thank you for this clarification. The database your model is trained on does not contain yeasts (or at least I could not find any), which I'm interested in. Therefore, my idea was to train a new model based on a known plasmid. This may be the wrong approach.

@dpellow
Collaborator

dpellow commented Mar 24, 2025

You can try it, but I would try to create a larger database for training. Are you just trying to find the specific sequences you listed in your FASTA file in a sample?

@Rhinogradentia
Author

At least a very similar one. I already tried other approaches, and there are some contigs which might be plasmids, but I'm not very convinced right now. I will try to build a larger db/set of sequences and try again. Thank you for being so helpful.

@dpellow
Collaborator

dpellow commented Mar 24, 2025

OK, you can try with -l 1000,5000,10000 and see if it works before trying a bigger database.

@dpellow
Collaborator

dpellow commented Mar 24, 2025

Also, did you try to just use PlasClass without training it? Did it work?

@Rhinogradentia
Author

Rhinogradentia commented Mar 24, 2025

Yes, there were some results, but they couldn't be circularized. Then again, I'm not even sure my plasmid is circular (they can be linear in yeasts). I will try what you suggested. I'm really thankful for your support. Thanks!
