About online index updating #447
Hi there, unfortunately we do not currently support online index updates, but this is something we may be willing to support in the future. Depending on the size of the index and the frequency of the updates, different approaches may work best. Perhaps you could give some more details, and we might be able to determine whether anything is likely to be worked on in the future. Of course, we welcome collaborators, so you are welcome to work on this yourself too.
I haven't really given this feature much thought, and I don't know what approach would be best. But thinking about the easiest way to support it, here are some ideas. As you may know from the documentation, PISA has a rather unique indexing pipeline, separated into distinct stages: parsing, inverting, and compression. This diagram could help visualize it. (By the way, @JMMackenzie, this image doesn't render in the docs; we should probably fix that.)

I don't see a way to quickly update the index with a single document without serious refactoring and possibly even structural changes (but maybe I'm missing something?). However, I can see us supporting updates in batches that are reasonably fast. This is not to say a batch couldn't be just a single document, but that wouldn't be much faster than a thousand documents.

**Parsing**

Parsing is largely independent, except for one crucial piece of functionality: documents and terms are assigned IDs at this stage. The way it works now, parsing is done in batches, which are then merged together, including the ID mappings. During an update, merging the forward index could be optional: if someone doesn't care about maintaining the forward index, only the mappings would be merged. We would need to make sure that the old document IDs stay the same.

**Inverting**

We could use the newly parsed forward index to build a small inverted index and merge it with the old one. I believe we already have the mechanism for this in our code, since we invert in batches as well.

**Compression**

The compressed index should probably be rebuilt entirely; anything else would require significantly more work (I think).

**Caveats**

Note that the above approach means you need to keep your uncompressed index around and keep merging into it. This may or may not be a problem, depending on its size. It wouldn't be super fast, but it might be acceptable, depending on your update pattern.

Also, note that most of the "merging" I refer to means taking a number of files and producing a new one. The old files could be removed right after, but you still need roughly twice as much storage as your index occupies.

**Advantages**

The advantage of such a "hacky" approach is that it would be doable in reasonable time and without overhauling the entire indexing pipeline. It also shouldn't be too difficult.
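To make the inverting step above concrete, here is a minimal sketch (plain Python, not PISA code; the `merge_inverted` helper and the toy dict-based index layout are my own illustration) of merging a small batch inverted index into an existing uncompressed one. The key invariant from the parsing discussion is preserved: old document IDs never change, and new documents get IDs offset past the existing collection.

```python
def merge_inverted(base, batch, base_doc_count):
    """Merge a small batch index into a base index.

    base, batch: dicts mapping term -> sorted list of document IDs.
    Batch doc IDs are local (starting at 0), so they are shifted by
    base_doc_count; IDs already in the base index are left untouched.
    """
    merged = {term: list(postings) for term, postings in base.items()}
    for term, postings in batch.items():
        shifted = [doc_id + base_doc_count for doc_id in postings]
        # Shifted IDs are all larger than any base ID, so appending
        # keeps each posting list sorted.
        merged.setdefault(term, []).extend(shifted)
    return merged

base = {"apple": [0, 2], "pear": [1]}       # 3 documents indexed so far
batch = {"apple": [0], "plum": [1]}         # 2 new documents
merged = merge_inverted(base, batch, base_doc_count=3)
# "apple" gains new document 3; "plum" appears with new document 4
```

A real implementation would stream postings from files rather than hold dicts in memory, which is why the comment above notes you temporarily need roughly twice the storage.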
Hi, is there any progress on this? I just found this awesome project and I'm looking for some info about this feature.
Unfortunately, I haven't been able to look into this. I just graduated and started a new job, and have been busy in my personal life as well. I can't currently say when I'd have time to work on it, but if someone else took the lead, I'd probably be able to help with review and discussion.
Dear friends,
First, thank you all for the great project!
This is the most impressive search engine I've found on GitHub!
In our case, we will have streaming data that needs to be added to the existing index while queries are being served, but I can't find any description of this feature in the docs.
So, is there any plan to support this feature?