This repository was archived by the owner on Aug 6, 2025. It is now read-only.

[Help Wanted] Generate LASER embeddings for a large number of sentences (15.7 million) #192

@NomadXD

Description

For my university final-year project on text simplification, I need to generate LASER embeddings for a large number of sentences (15.7 million). However, when I try to generate them using the SentenceEncoder in embed.py, the machine stays fully utilized for around 12 hours and then the program exits without any error (I assume because of the high CPU and GPU utilization). I'm using the SentenceEncoder in the following way.

First, I initialize the SentenceEncoder with the following parameters, using the pretrained encoder (models/bilstm.93langs.2018-12-26.pt):

from embed import SentenceEncoder  # LASER/source/embed.py

encoder = SentenceEncoder(encoder_path, max_tokens=3000, cpu=False, verbose=True)

Then I generate the LASER embeddings as follows:

embeddings = encoder.encode_sentences(read_lines(bpe_filepath))
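For 15.7 million sentences, the float32 output alone is about 15.7M × 1024 × 4 bytes ≈ 64 GB, so I was also considering encoding in chunks and writing each slice into a memory-mapped .npy file instead of building one giant in-memory array. A minimal sketch of what I have in mind (chunk_size and embeddings.npy are placeholders, and I'm assuming read_lines returns a list of sentences):

import numpy as np

chunk_size = 100_000  # placeholder; tune to available memory
dim = 1024            # bilstm.93langs produces 1024-dimensional embeddings

sentences = read_lines(bpe_filepath)  # assumed to return a list of str
out = np.lib.format.open_memmap(
    "embeddings.npy", mode="w+", dtype=np.float32,
    shape=(len(sentences), dim),
)
for start in range(0, len(sentences), chunk_size):
    batch = sentences[start:start + chunk_size]
    # encode_sentences returns a (len(batch), dim) array
    out[start:start + len(batch)] = encoder.encode_sentences(batch)
    out.flush()  # persist each slice to disk as we go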

I ran this setup with the above parameters on a GCP Compute Engine instance with 16 cores, 102 GB of memory, and one NVIDIA Tesla T4 GPU. CPU utilization reaches 100% while GPU utilization stays around 90%. It runs like that for around 12 hours and then exits without any error (nothing in nohup.out).

Any idea what could be going wrong? I've been stuck at this point for several weeks and would really appreciate any help.

cc @hoschwenk
