Naive TimeMap counting algorithm causes UI to display possibly inaccurate count #283

Closed
ibnesayeed opened this issue Dec 15, 2017 · 5 comments · Fixed by #284

@ibnesayeed
Contributor

mCount = out.count("memento")

This rudimentary approach of counting occurrences of the term memento in the response can produce a wrong memento count when the term also appears in URI-Ms. This can happen for two reasons:

  • the URI-R contains the term, as in mementoweb.org
  • some archives have memento in their path, such as /memento/<datetime>/<URI-R>

A more reliable approach would be to count mementos while the TM is being processed for counting archives. Currently the TM is processed twice, which can be slow for big TMs.
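To illustrate the difference, here is a minimal sketch that counts only links whose rel attribute carries the memento token, and collects archive hosts in the same pass. The function name `count_mementos`, the regular expressions, and the `archive.example` TimeMap are made up for illustration and are not the project's actual code; the sketch also assumes a well-formed Link-format TimeMap with no `<` inside parameter values.

```python
import re
from urllib.parse import urlparse

# A link target in angle brackets, followed by its parameters up to the next "<".
LINK_RE = re.compile(r'<([^>]*)>([^<]*)')
REL_RE = re.compile(r'\brel="([^"]*)"')


def count_mementos(timemap):
    """Return (memento_count, archive_hosts) from a single pass over the TimeMap."""
    count = 0
    hosts = set()
    for match in LINK_RE.finditer(timemap):
        uri, params = match.group(1), match.group(2)
        rel = REL_RE.search(params)
        # rel holds space-separated relation types, e.g. "first memento".
        if rel and "memento" in rel.group(1).split():
            count += 1
            hosts.add(urlparse(uri).netloc)
    return count, hosts


# Hypothetical TimeMap: the URI-R and the archive path both contain "memento".
SAMPLE = (
    '<http://mementoweb.org/>; rel="original",\n'
    '<http://archive.example/memento/20100101000000/http://mementoweb.org/>; '
    'rel="first memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT",\n'
    '<http://archive.example/memento/20110101000000/http://mementoweb.org/>; '
    'rel="last memento"; datetime="Sat, 01 Jan 2011 00:00:00 GMT"\n'
)

naive_count = SAMPLE.count("memento")  # also counts hits inside the URIs
memento_count, archive_hosts = count_mementos(SAMPLE)
print(naive_count, memento_count, sorted(archive_hosts))
```

Matching on the rel value rather than on the raw text keeps the count independent of whatever appears in the URI-R or the archive's path.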

@machawk1
Owner

Nice catch. This is a good candidate for a test case.

Also a good idea to reuse the archive counting process (new as of < 12 hours ago, so pardon the delay ;)) for counting URI-Ms.

machawk1 added the bug label on Dec 15, 2017
machawk1 changed the title from "Wrong memento count" to "Naive TimeMap counting algorithm causes UI to display possibly inaccurate count" on Dec 15, 2017
@machawk1
Owner

@ibnesayeed What are your thoughts on using CDXJ then performing one of the following to obtain the count?

  1. Subtract metadata lines count from line count of CDXJ TM
  2. Increment a counter for each line that starts with a digit ([0-9]), indicative of it being a memento.

This would allow us to exploit the features of CDXJ, but we would still incur the time cost of converting from Link to CDXJ in MemGator. A Link-based solution might be more efficient (at the cost of parsing the rel).
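For reference, both CDXJ options could be sketched roughly as follows. This assumes metadata lines start with `!` and memento lines start with a 14-digit datetime key; the helper names and the sample TimeMap are hypothetical, so check the prefix against real MemGator output before relying on it.

```python
import re


def count_by_subtraction(cdxj, meta_prefix="!"):
    """Option 1: non-blank line count minus metadata line count.

    meta_prefix="!" is an assumption about the CDXJ TimeMap layout.
    """
    lines = [line for line in cdxj.splitlines() if line.strip()]
    meta = sum(1 for line in lines if line.startswith(meta_prefix))
    return len(lines) - meta


def count_by_digit_prefix(cdxj):
    """Option 2: count lines that start with a digit [0-9]."""
    return sum(1 for line in cdxj.splitlines() if re.match(r"[0-9]", line))


# Hypothetical CDXJ TimeMap with four metadata lines and two mementos.
SAMPLE_CDXJ = (
    '!context ["http://tools.ietf.org/html/rfc7089"]\n'
    '!id {"uri": "http://localhost:1208/timemap/cdxj/http://mementoweb.org/"}\n'
    '!keys ["memento_datetime_YYYYMMDDhhmmss"]\n'
    '!meta {"original_uri": "http://mementoweb.org/"}\n'
    '20100101000000 {"uri": "http://archive.example/memento/20100101000000/'
    'http://mementoweb.org/", "rel": "first memento"}\n'
    '20110101000000 {"uri": "http://archive.example/memento/20110101000000/'
    'http://mementoweb.org/", "rel": "last memento"}\n'
)

print(count_by_subtraction(SAMPLE_CDXJ), count_by_digit_prefix(SAMPLE_CDXJ))
```

Option 2 needs no knowledge of the metadata prefix, so it is probably the more robust of the two if the metadata syntax ever changes.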

@ibnesayeed
Contributor Author

A PR is on its way. :)

@machawk1
Owner

Ah, ok. In the future, if you are working on a ticket, let me know and I can assign it to you so we don't waste work cycles (I was working on a solution as well). I'll defer to your upcoming PR.

@ibnesayeed
Contributor Author

> Ah, ok. In the future, if you are working on a ticket, let me know and I can assign it to you so we don't waste work cycles (I was working on a solution as well). I'll defer to your upcoming PR.

I thought I did, but apparently I forgot to mention it here. Anyway, PR #284 is now in for review.
