SPIKE: In place enrichment #741
too scared to kill and rerun while backfilling in dev
One thing that I'm not sure about: is the memory being eaten up by the index, or by running
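One way to tell the two apart might be to measure peak Python allocations per phase with `tracemalloc`. This is only a rough sketch: `build_index` and `run_matching` are hypothetical stand-ins, not the real index or runner code, and `tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries would not show up.

```python
import tracemalloc
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def measure_peak(label: str, phase: Callable[[], T], report: Dict[str, int]) -> T:
    """Run one phase and record its peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        result = phase()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Stopping clears the traces, so the next phase is measured on its own.
        tracemalloc.stop()
    report[label] = peak
    return result


def build_index() -> Dict[int, str]:
    # Hypothetical stand-in for building the index.
    return {i: str(i) for i in range(50_000)}


def run_matching(index: Dict[int, str]) -> int:
    # Hypothetical stand-in for the step that runs against the index.
    return sum(len(value) for value in index.values())


report: Dict[str, int] = {}
index = measure_peak("index", build_index, report)
total = measure_peak("matching", lambda: run_matching(index), report)

for label, peak in report.items():
    print(f"{label}: peak {peak / 1024:.1f} KiB")
```

Swapping the two stand-ins for the real index build and the real matching loop should show which phase dominates.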
Couldn't help myself, sorry: opensanctions/nomenklatura#143
@pudo for these type issues the local enricher is gonna return zavod entities and I'm struggling to see how to tell the type system that since it's based on the zavod store we're creating, not an argument to the enricher.
could the solution be to decouple the return types from the argument types? This changes the interfaces, but I can't think of a scenario where the enricher needs what's returned to be the same type as the query entity - it just needs to be a CompositeEntity, right?
diff --git a/zavod/zavod/runner/local_enricher.py b/zavod/zavod/runner/local_enricher.py
index 53dc3687..f5f0b0a9 100644
--- a/zavod/zavod/runner/local_enricher.py
+++ b/zavod/zavod/runner/local_enricher.py
@@ -2,7 +2,7 @@ import logging
from typing import Generator, Optional
from followthemoney.namespace import Namespace
-from nomenklatura.entity import CE
+from nomenklatura.entity import CE, CompositeEntity
from nomenklatura.dataset import DS
from nomenklatura.cache import Cache
from nomenklatura.enrich.common import Enricher, EnricherConfig
@@ -12,6 +12,7 @@ from nomenklatura.matching import get_algorithm
from zavod.meta import get_catalog
from zavod.store import get_store
+from zavod.entity import Entity
log = logging.getLogger(__name__)
@@ -52,7 +53,7 @@ class LocalEnricher(Enricher):
         if self.get_config_bool("strip_namespace"):
             self._ns = Namespace()
-    def match(self, entity: CE) -> Generator[CE, None, None]:
+    def match(self, entity: CE) -> Generator[Entity, None, None]:
         for match_id, index_score in self._index.match(entity)[:MATCH_CANDIDATES]:
             match = self._view.get_entity(match_id.id)
             if match is None:
@@ -70,7 +71,7 @@ class LocalEnricher(Enricher):
             if result.score >= self._threshold:
                 yield match
-    def _traverse_nested(self, entity: CE, depth: int) -> Generator[CE, None, None]:
+    def _traverse_nested(self, entity: CE, depth: int) -> Generator[CompositeEntity, None, None]:
         if depth == 0:
             return
@@ -82,5 +83,5 @@ class LocalEnricher(Enricher):
         for prop, adjacent in self._view.get_adjacent(entity):
             yield from self._traverse_nested(adjacent, depth - 1)
-    def expand(self, entity: CE, match: CE) -> Generator[CE, None, None]:
+    def expand(self, entity: CE, match: CE) -> Generator[CompositeEntity, None, None]:
         yield from self._traverse_nested(match, 2)
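For what it's worth, here is a rough, self-contained sketch of the decoupling idea, making the base class generic over the type it yields rather than tying it to the query type. `CompositeEntity`, `Entity`, `Enricher` and `LocalEnricher` below are simplified stand-ins, not the real nomenklatura or zavod classes, and `RE` is a made-up name for the result type variable.

```python
from typing import Generator, Generic, List, TypeVar


class CompositeEntity:
    """Simplified stand-in for nomenklatura's CompositeEntity."""

    def __init__(self, id: str) -> None:
        self.id = id


class Entity(CompositeEntity):
    """Simplified stand-in for zavod's Entity."""


# Hypothetical type variable for what an enricher yields. It is bound to the
# enricher (and the store it reads from), not to the query entity.
RE = TypeVar("RE", bound=CompositeEntity)


class Enricher(Generic[RE]):
    """Accepts any CompositeEntity as the query and yields RE results."""

    def match(self, entity: CompositeEntity) -> Generator[RE, None, None]:
        raise NotImplementedError


class LocalEnricher(Enricher[Entity]):
    """Yields Entity objects from its own store, whatever the query type is."""

    def __init__(self, store: List[Entity]) -> None:
        self._store = store

    def match(self, entity: CompositeEntity) -> Generator[Entity, None, None]:
        for candidate in self._store:
            # Toy matching rule: everything except the query itself.
            if candidate.id != entity.id:
                yield candidate


query = CompositeEntity("q1")
enricher = LocalEnricher([Entity("a"), Entity("q1"), Entity("b")])
matches = list(enricher.match(query))
matched_ids = [m.id for m in matches]
```

With this shape a type checker would see `LocalEnricher.match` as returning `Entity` even when the query is a plain `CompositeEntity`, which is roughly what the diff above is after.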
python contrib/index_bench.py datasets/_externals/ext_md_companies.yml nomenklatura.index.Index
…nt outputs with expansion
ef9253d to afb586a
towards #691