This repository was archived by the owner on Aug 6, 2025. It is now read-only.

[Help Wanted] Generate LASER embeddings for a large number of sentences (15.7 million) #192

@NomadXD

Description

For my university final-year project on text simplification, I need to generate LASER embeddings for a large number of sentences (15.7 million). However, when I try to generate them using the SentenceEncoder in embed.py, the machine stays fully utilized for around 12 hours and then the program exits without any error (I assume because of the high CPU and GPU utilization). I'm using the SentenceEncoder in the following way.

First, I initialize the SentenceEncoder with the following parameters, using the pretrained encoder (models/bilstm.93langs.2018-12-26.pt):

from embed import SentenceEncoder  # LASER/source/embed.py

encoder = SentenceEncoder(encoder_path, max_tokens=3000, cpu=False, verbose=True)

Then I generate the LASER embeddings as follows:

embeddings = encoder.encode_sentences(read_lines(bpe_filepath))
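For 15.7 million sentences, the float32 output alone is about 15.7M × 1024 × 4 bytes ≈ 64 GB, so I was also considering encoding in chunks and writing each slice into a memory-mapped .npy file instead of building one giant in-memory array. A minimal sketch of what I have in mind (chunk_size and embeddings.npy are placeholders, and I'm assuming read_lines returns a list of sentences):

import numpy as np

chunk_size = 100_000  # placeholder; tune to available memory
dim = 1024            # bilstm.93langs produces 1024-dimensional embeddings

sentences = read_lines(bpe_filepath)  # assumed to return a list of str
out = np.lib.format.open_memmap(
    "embeddings.npy", mode="w+", dtype=np.float32,
    shape=(len(sentences), dim),
)
for start in range(0, len(sentences), chunk_size):
    batch = sentences[start:start + chunk_size]
    # encode_sentences returns a (len(batch), dim) array
    out[start:start + len(batch)] = encoder.encode_sentences(batch)
    out.flush()  # persist each slice to disk as we go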

I ran this setup with the above parameters on a GCP Compute Engine instance with 16 cores, 102 GB of memory, and one NVIDIA Tesla T4 GPU. CPU utilization reaches 100% while GPU utilization stays around 90%. It runs like that for around 12 hours and then exits without any error (nothing in nohup.out).

Any idea what could be going wrong? I've been stuck at this point for several weeks and would really appreciate any help.

cc @hoschwenk
