Filter distributed data by one criterion using indexes #20

lecoqlibre · 2023-11-23T09:37:24Z

Let's say we have N PODs: P1, P2, P3 ... PN. Each POD contains an index dedicated to the targeted subject.

Targeted subject:

The subject is a URI (OBJ)
The subject is a literal (LIT)

Example scenario for OBJ: each POD contains some products such as carrots, tomatoes, apples and so on. Each POD also contains a product type index listing all the products of a given type. We want a client App to be able to list all the products of a given type that are hosted on P1 ... PN. When the user clicks on a product type, the App shows all the products of that type.

Example scenario for LIT: each POD contains some persons, each with a family name. Each POD also contains a family name index which groups family names by their first 3 letters. The App displays all the persons, across all the PODs, whose family name matches the one provided by the user.
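To make the LIT scenario concrete, here is a minimal sketch of a family name index grouped by the first 3 letters, and of a lookup across several PODs. Everything in it is an assumption for illustration: the in-memory dictionaries stand in for index documents that would really be fetched from each POD, and the function names and example URIs are hypothetical.

```python
from collections import defaultdict

def build_prefix_index(persons, prefix_len=3):
    """Group (family_name, uri) pairs by the lowercased first letters of the name."""
    index = defaultdict(list)
    for uri, family_name in persons:
        index[family_name[:prefix_len].lower()].append((family_name, uri))
    return dict(index)

def lookup(index, family_name, prefix_len=3):
    """Return the URIs in one POD's index whose family name matches exactly."""
    # Only the bucket for the first letters has to be read, not the whole index.
    bucket = index.get(family_name[:prefix_len].lower(), [])
    return [uri for name, uri in bucket if name.lower() == family_name.lower()]

# Hypothetical data: one prefix index per POD.
pod_indexes = {
    "P1": build_prefix_index([
        ("https://p1.example/alice", "Dupont"),
        ("https://p1.example/bob", "Durand"),
    ]),
    "P2": build_prefix_index([
        ("https://p2.example/carol", "Dupont"),
    ]),
}

# The App queries every POD's index and concatenates the matches.
matches = [uri for index in pod_indexes.values() for uri in lookup(index, "Dupont")]
print(matches)
```

The point of the 3-letter grouping is that the App only reads one small bucket per POD instead of the whole list of persons.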

I see 2 strategies here:

On The Fly (OTF): the App browses the indexes on each POD and merges them on the fly to build the response.
Agent (BOT): the App reads an already merged index stored on a POD (its own or another one); this index is maintained by a bot (agent).
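As a rough illustration of the OTF strategy for the OBJ scenario, the sketch below merges per-POD product type indexes into a single result list. It assumes each index has already been fetched and parsed into a dictionary mapping a product type URI to product URIs; the function name, the data layout and the URIs are hypothetical. A BOT could run the same merge ahead of time and store the result on a POD.

```python
def merge_type_indexes(pod_indexes, product_type):
    """Merge the entries for one product type from every POD, dropping duplicates."""
    merged = []
    seen = set()
    for index in pod_indexes.values():
        for uri in index.get(product_type, []):
            if uri not in seen:  # the same product may be listed on several PODs
                seen.add(uri)
                merged.append(uri)
    return merged

# Hypothetical indexes, already fetched from P1 and P2.
VEGETABLE = "https://example.org/types#Vegetable"
FRUIT = "https://example.org/types#Fruit"
pod_indexes = {
    "P1": {VEGETABLE: ["https://p1.example/carrot", "https://p1.example/tomato"]},
    "P2": {VEGETABLE: ["https://p2.example/leek"], FRUIT: ["https://p2.example/apple"]},
}

vegetables = merge_type_indexes(pod_indexes, VEGETABLE)
print(vegetables)
```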

Equity

How to manage "equity": the fact that we take the same number of results from every POD, like the first from P1, the first from P2, the first from P3, the second from P3, the second from P1, the second from P2, etc.
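One possible way to get this kind of equity is a round-robin interleave of the per-POD result lists. The sketch below is only an illustration: it assumes the results of each POD are already available as a list, and the `interleave` helper is a hypothetical name.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for PODs that have run out of results

def interleave(results_per_pod):
    """Take one result from each POD per round until every list is exhausted."""
    merged = []
    for round_items in zip_longest(*results_per_pod, fillvalue=_MISSING):
        merged.extend(item for item in round_items if item is not _MISSING)
    return merged

page = interleave([
    ["p1-a", "p1-b", "p1-c"],  # results from P1
    ["p2-a"],                  # results from P2
    ["p3-a", "p3-b"],          # results from P3
])
print(page)
```

PODs with fewer results simply drop out of later rounds, so no POD can dominate the beginning of the list.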

Score

Do we want to favor some PODs over others? Some of them may be more relevant in terms of speed, accuracy, update frequency, etc.
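If scoring is wanted, one option among others is a weighted round-robin: in each round a POD contributes as many results as its weight. This is a sketch under assumptions: the integer weights are hypothetical and how they would be derived (speed, accuracy, frequency) is left open.

```python
def weighted_interleave(results_per_pod, weights):
    """Each round, take up to weights[pod] results from each POD, highest weight first."""
    cursors = {pod: 0 for pod in results_per_pod}
    order = sorted(results_per_pod, key=lambda pod: weights.get(pod, 1), reverse=True)
    merged = []
    progressed = True
    while progressed:
        progressed = False
        for pod in order:
            start = cursors[pod]
            chunk = results_per_pod[pod][start:start + weights.get(pod, 1)]
            if chunk:
                merged.extend(chunk)
                cursors[pod] = start + len(chunk)
                progressed = True
    return merged

results = {"P1": ["a1", "a2", "a3"], "P2": ["b1", "b2", "b3"]}
ranked = weighted_interleave(results, weights={"P1": 1, "P2": 2})
print(ranked)
```

With all weights equal to 1 this degrades to the plain "equity" ordering.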

Scalability

What are important metrics for indexing strategies?
What metrics could be used to trigger a switch between indexing strategies, in real time or not?
What is the threshold above which it becomes critical to split a large index into smaller pieces?
What are the thresholds below which the OTF strategy remains efficient?
How to calculate the indexing cost?
How to calculate the parsing cost?
How to calculate the querying cost?
Calculate the ratio between the number of items and the size (weight) of the various index types.
What mechanisms could be used to sync the merging of distributed indexes? Polling loops? Notifications such as ActivityPub?
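As a starting point for the item/weight ratio, the sketch below measures the serialized size of an index per indexed item and compares it with a split threshold. It is only an assumption-laden illustration: JSON stands in for whatever serialization the indexes really use, and `bytes_per_item`, `should_split` and the `max_bytes` threshold are hypothetical.

```python
import json

def index_size_bytes(index):
    """Size of the index once serialized (JSON is an assumption here)."""
    return len(json.dumps(index).encode("utf-8"))

def bytes_per_item(index):
    """Ratio between the serialized size of an index and the number of items it lists."""
    item_count = sum(len(uris) for uris in index.values())
    if item_count == 0:
        return None  # the ratio is undefined for an empty index
    return index_size_bytes(index) / item_count

def should_split(index, max_bytes):
    """True when the index is larger than the chosen (hypothetical) threshold."""
    return index_size_bytes(index) > max_bytes

small_index = {"t": ["a", "b"]}
print(bytes_per_item(small_index), should_split(small_index, max_bytes=1024))
```

Running the same measurement on different index types over the same data would give comparable numbers for choosing between them.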

Security

When a distributed indexing strategy is synced in real time with the growth of the data, indexes are recomputed whenever the volume of data increases or decreases (e.g. splitting a large file, merging small files). An attacker could repeatedly change the volume of data on one or several PODs of the cluster to try to overload the server. Which mechanisms could be used to prevent this?
- introduce a delay between strategy changes
- limit the number of changes over time (e.g. at most X changes per hour)
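The two mitigations above could be combined in a small guard that is consulted before any strategy change. This is a minimal sketch, not a hardened implementation: the class name and parameters are hypothetical, and time is passed in explicitly (in seconds) so the behavior is easy to check.

```python
from collections import deque

class StrategyChangeLimiter:
    """Allow a strategy change only if it respects a minimum delay and a per-window quota."""

    def __init__(self, min_delay_s, max_changes, window_s):
        self.min_delay_s = min_delay_s
        self.max_changes = max_changes
        self.window_s = window_s
        self._changes = deque()  # timestamps of the accepted changes

    def allow(self, now):
        # Forget accepted changes that have left the sliding window.
        while self._changes and now - self._changes[0] >= self.window_s:
            self._changes.popleft()
        # Mitigation 1: enforce a delay since the last accepted change.
        if self._changes and now - self._changes[-1] < self.min_delay_s:
            return False
        # Mitigation 2: enforce at most max_changes per window.
        if len(self._changes) >= self.max_changes:
            return False
        self._changes.append(now)
        return True

limiter = StrategyChangeLimiter(min_delay_s=60, max_changes=2, window_s=3600)
decisions = [limiter.allow(t) for t in (0, 30, 100, 200, 3600)]
print(decisions)
```

Rejected requests would not be lost in practice: the index can simply be recomputed once, at the next allowed time, using the latest state of the data.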

lecoqlibre self-assigned this Nov 23, 2023

balessan added the Indexing strategy label Dec 12, 2023
