Language detector always loads the small ngrams set #15

@Toflar

Hey, thanks for the cool library! I'm using it in Loupe.

I noticed a performance issue:

```php
<?php

use Nitotm\Eld\LanguageDetector;

include __DIR__ . '/vendor/autoload.php';

$languageDetector = new LanguageDetector();
$languageDetector->langSubset(['de']);

var_dump($languageDetector->detect('Guten Tag.'));
```

On the first call, this writes a subset containing only de into the /subsets directory inside vendor. This is perfect!
However, the LanguageDetector still always loads the small.php ngrams, even though I only need the German subset.

It also loads the dataset in __construct(), so you cannot instantiate a LanguageDetector without the data being loaded into memory. The data is therefore loaded even if nobody ever calls ->detect() on the object. Ideally this would be lazy: load the data only when it is first needed 😊
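
To illustrate what I mean by lazy evaluation, here is a minimal sketch. It is not ELD's actual code: `LazyNgramStore`, `useLoader()`, `get()`, and `loadCount` are hypothetical names I made up for this example, and the loaders return tiny placeholder arrays where a real one would presumably `require` the ngrams file. The constructor only stores a loader, the data is read on first access, and switching to a subset before first use means the default set is never loaded at all.

```php
<?php

declare(strict_types=1);

// Hypothetical illustration of lazy loading; not part of the ELD library.
final class LazyNgramStore
{
    /** @var array|null Loaded n-gram data, or null until first use. */
    private ?array $ngrams = null;

    /** @var callable Loader that returns the n-gram array when invoked. */
    private $loader;

    /** Number of times a loader has actually been run (for demonstration). */
    public int $loadCount = 0;

    public function __construct(callable $loader)
    {
        // Only remember how to load the data; do not load anything yet.
        $this->loader = $loader;
    }

    public function useLoader(callable $loader): void
    {
        // Switching datasets (e.g. to a subset) drops any cached data,
        // so the next get() loads from the new source.
        $this->loader = $loader;
        $this->ngrams = null;
    }

    public function get(): array
    {
        // Load on first access, then reuse the cached array.
        if ($this->ngrams === null) {
            $this->ngrams = ($this->loader)();
            $this->loadCount++;
        }

        return $this->ngrams;
    }
}

// Placeholder loaders; a real one might be: fn (): array => require $path;
$store = new LazyNgramStore(fn (): array => ['source' => 'small']);

// Selecting a subset before first use: the default loader never runs.
$store->useLoader(fn (): array => ['source' => 'de-subset']);

$data = $store->get(); // first access triggers the only load
$data = $store->get(); // served from the cached array
```

With something along these lines, constructing the object stays cheap, and calling langSubset() before detect() would only ever read the subset file.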
