Aggregated TimeMap summary #97

Open
ibnesayeed opened this issue Mar 21, 2017 · 12 comments
@ibnesayeed
Member

We need an API endpoint that provides a summary of the aggregated TimeMap, preferably in JSON format. The summary can group memento counts for each upstream archive, and also a nested distribution on year and month levels.

@machawk1
Member

> preferably in JSON format

Why JSON?

Also mind the verbiage w/ "memento count" vs "URI-M count" per https://arxiv.org/abs/1703.03302

@ibnesayeed
Member Author

JSON will be directly consumable by many visualization libraries in JS or other languages. As far as "memento count" is concerned, if "URI-M count" is preferred then we might need to update the X-Memento-Count header as well.

@ibnesayeed
Member Author

Here is a sample output draft.

{
	"original_uri": "http://example.org/index.html",
	"total_mementos": 54,
	"archives": {
		"web.archive.org": {
			"count": 53,
			"first": {
				"datetime": "2002-10-16T10:13:37Z",
				"uri": "http://web.archive.org/web/20021016101337/http://example.org/index.html"
			},
			"last": {
				"datetime": "2016-04-10T22:12:45Z",
				"uri": "http://web.archive.org/web/20160410221245/http://example.org/index.html"
			}
		},
		"archive.is": {
			"count": 1,
			"first": {
				"datetime": "2013-09-16T08:37:01Z",
				"uri": "http://archive.is/20130916083701/http://example.org/index.html"
			},
			"last": {
				"datetime": "2013-09-16T08:37:01Z",
				"uri": "http://archive.is/20130916083701/http://example.org/index.html"
			}
		},
		"webarchive.org.uk": {
			"count": 0
		}
	},
	"periods": {
		"2002": {
			"10": 10,
			"12": 6
		},
		"2003": {
			"01": 1,
			"02": 3,
			"05": 2,
			"09": 1,
			"11": 4
		},
		"2005": {
			"02": 3,
			"04": 7,
			"05": 2,
			"08": 5
		},
		"2013": {
			"07": 1,
			"09": 3
		},
		"2016": {
			"02": 5,
			"04": 1
		}
	}
}
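
To make the intended shape concrete, here is a minimal sketch of how such a summary could be assembled from an aggregated list of mementos. The `summarize` function and the `(archive_id, datetime, uri_m)` tuple input are assumptions made for illustration only; they are not MemGator's actual code or data model.

```python
from collections import defaultdict
from datetime import datetime


def summarize(original_uri, mementos, archives=()):
    """Build a summary dict shaped like the draft above.

    mementos: iterable of (archive_id, datetime, uri_m) tuples (hypothetical input shape).
    archives: ids of every queried archive, so archives without mementos get a zero count.
    """
    per_archive = {archive: {"count": 0} for archive in archives}
    periods = defaultdict(lambda: defaultdict(int))
    total = 0
    # Sort by datetime so the first and last memento of each archive fall out naturally.
    for archive, dt, uri in sorted(mementos, key=lambda m: m[1]):
        entry = per_archive.setdefault(archive, {"count": 0})
        point = {"datetime": dt.strftime("%Y-%m-%dT%H:%M:%SZ"), "uri": uri}
        entry.setdefault("first", point)  # only set on the earliest memento
        entry["last"] = point             # overwritten until the latest memento
        entry["count"] += 1
        periods[f"{dt.year:04d}"][f"{dt.month:02d}"] += 1
        total += 1
    return {
        "original_uri": original_uri,
        "total_mementos": total,
        "archives": per_archive,
        "periods": {year: dict(months) for year, months in periods.items()},
    }
```

Passing the result to `json.dumps` would give output in the same layout as the draft, with one pass over the mementos after sorting.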

@machawk1
Member

@ibnesayeed total_mementos seems semantically inconsistent with other quantifiers, e.g., the count fields in the above JSON and the X-Memento-Count header.

@ibnesayeed
Member Author

ibnesayeed commented Oct 10, 2017

Thanks @machawk1, the point is taken. We can perhaps make it more coherent across the board. However, the primary goal of this sample output was to communicate the intended implementation to collect ideas of what other information can be provided to aid tools as well as what tools can be built if such information is available.

@machawk1
Member

@ibnesayeed Right. I just wanted to encourage consistency. The temporal breakdown you have will be really useful. What are your thoughts on having that same sort of breakdown (optionally, additionally, and/or in lieu of the inter-archive) on a per-archive basis?

@ibnesayeed
Member Author

ibnesayeed commented Oct 10, 2017

> What are your thoughts on having that same sort of breakdown (optionally, additionally, and/or in lieu of the inter-archive) on a per-archive basis?

That is certainly doable if it seems useful for some applications/visualizations. However, it would increase the size of the response. One might also think about the possibility of breaking down data on archives within each monthly period too. So, I think we should structure it in a way that future extensions don't break the current structure while being able to add more fine-grained breakdown information.

Additionally, a breakdown by scheme (http/https) and by hostname form (www/naked) can also be added. In the future, if TimeMaps include status codes, some stats on those can be provided as well.
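
As a rough illustration of the http/https and www/naked idea, the sketch below counts variants from a list of URI-Ms. It assumes the original URI is embedded in the URI-M as the second absolute URL, which holds for Wayback-style URI-Ms but is only a heuristic; `original_from_urim` and `variant_breakdown` are hypothetical names, not existing MemGator functions.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

_ABSOLUTE_URL = re.compile(r"https?://", re.IGNORECASE)


def original_from_urim(uri_m):
    """Heuristic: the original URI starts at the second 'http(s)://' in the URI-M."""
    starts = [m.start() for m in _ABSOLUTE_URL.finditer(uri_m)]
    return uri_m[starts[1]:] if len(starts) > 1 else None


def variant_breakdown(uri_ms):
    """Count scheme (http/https) and hostname form (www/naked) of the captured URIs."""
    counts = Counter()
    for uri_m in uri_ms:
        original = original_from_urim(uri_m)
        if original is None:
            continue  # no embedded original URI found, so skip this URI-M
        parts = urlsplit(original)
        host = (parts.hostname or "").lower()
        counts[parts.scheme.lower()] += 1
        counts["www" if host.startswith("www.") else "naked"] += 1
    return dict(counts)
```

Archives that do not embed the original URI in their URI-Ms would need a different approach, such as carrying the variant alongside each TimeMap entry.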

@machawk1
Member

Adding period information also increases the size of the response. There should probably be a way to specify the granularity of the temporal breakdown and whether per-archive information is included. The year-month choice seems arbitrary with the alternative of a by-year breakdown being a more expected default.

@ibnesayeed
Member Author

> Adding period information also increases the size of the response.

It sure does, but the number of items is capped at number of archives + 12 * number of years since archival started (the periods add at most 12 entries per year). Nesting periods under archives, or the other way around, raises the cap to number of archives * 12 * number of years since archival started. The total number of mementos is not a factor here.
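
To put numbers on these two bounds, here is a small worked calculation. The function name and the example figures (10 archives, 25 years) are made up for illustration.

```python
def max_entries(num_archives, num_years, nested=False):
    """Upper bound on summary entries; independent of the number of mementos."""
    months = 12 * num_years
    if nested:
        # One month bucket per archive per month.
        return num_archives * months
    # Archive entries plus a single shared set of month buckets.
    return num_archives + months


flat = max_entries(10, 25)                  # 10 + 12 * 25 = 310
nested = max_entries(10, 25, nested=True)   # 10 * 12 * 25 = 3000
```

With these assumed figures the nested layout can be roughly ten times larger in the worst case, although real TimeMaps are usually sparse and stay well under either bound.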

> There should probably be a way to specify the granularity of the temporal breakdown and whether per-archive information is included. The year-month choice seems arbitrary with the alternative of a by-year breakdown being a more expected default.

We can make a dedicated endpoint that accepts various parameters to let the client pick and choose what it wants. However, that would increase the complexity of the code (making it difficult to explain and maintain) and yield confusing API documentation. This is perhaps the perfect opportunity to introduce GraphQL in MemGator, but I would hold off on it because it would require some serious planning to see what other endpoints can go that route.

For now, this endpoint should give a sufficiently high-level summary of a TimeMap to help various visualization and archival exploration applications. The choice of month granularity is a good compromise between usefulness and response size, without the complexities of parameters. A client can easily accumulate yearly stats from the monthly breakdown, but the reverse would not be possible.
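
The yearly roll-up on the client side could look like the following sketch, assuming the `periods` object keeps the year-to-month-to-count layout of the draft output; `yearly_totals` is a hypothetical helper name.

```python
def yearly_totals(periods):
    """Collapse a {year: {month: count}} mapping into {year: count}."""
    return {year: sum(months.values()) for year, months in periods.items()}


# A subset of the draft "periods" object from earlier in this thread.
periods = {
    "2002": {"10": 10, "12": 6},
    "2003": {"01": 1, "02": 3, "05": 2, "09": 1, "11": 4},
}
totals = yearly_totals(periods)  # {"2002": 16, "2003": 11}
```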

@machawk1
Member

machawk1 commented Apr 18, 2024

I made some headway on this issue, see the issue-97 branch. The current output on that branch yields something like:

{
 "original_uri": "http://matkelly.com",
 "archives": {
  "web.archive.org":{
   "count": 208,
   "first":{
    "datetime": 20060514123511,
    "uri": "https://web.archive.org/web/20060514123511/http://www.matkelly.com:80/",
   }
   "last":{
    "datetime": 20240413142440,
    "uri": "https://web.archive.org/web/20240413142440/https://matkelly.com/",
   }
 },
  "archive.md":{
   "count": 18,
   "first":{
    "datetime": 20130618191814,
    "uri": "http://archive.md/20130618191814/http://matkelly.com/",
   }
   "last":{
    "datetime": 20210406203127,
    "uri": "http://archive.md/20210406203127/https://matkelly.com/",
   }
 },
  "wayback.archive-it.org":{
   "count": 3,
   "first":{
    "datetime": 20140210154006,
    "uri": "https://wayback.archive-it.org/all/20140210154006/http://matkelly.com/",
   }
   "last":{
    "datetime": 20160805024730,
    "uri": "https://wayback.archive-it.org/all/20160805024730/http://matkelly.com/",
   }
 },
  "arquivo.pt":{
   "count": 11,
   "first":{
    "datetime": 20200218230719,
    "uri": "https://arquivo.pt/wayback/20200218230719mp_/https://matkelly.com/",
   }
   "last":{
    "datetime": 20230121055854,
    "uri": "https://arquivo.pt/wayback/20230121055854mp_/http://matkelly.com/",
   }
 },
 "total_mementos": 240

The temporal breakdown still needs to be done and there are likely some formatting issues and code cleanup to do.

Task:

  • Change the "datetime" value to a long-form timestamp per @ibnesayeed's example (which uses ISO 8601, e.g. 2002-10-16T10:13:37Z, rather than RFC 1123)
  • Create summary
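
For the datetime task, a conversion from the 14-digit timestamps in the current output could be sketched as below. It assumes the timestamps are UTC, shows both the ISO 8601 form used in the earlier draft and the RFC 1123 style HTTP date, and uses hypothetical helper names.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def parse_timestamp(ts):
    """Parse a 14-digit YYYYMMDDhhmmss value (int or str), assumed to be UTC."""
    return datetime.strptime(str(ts), "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def to_iso8601(ts):
    """ISO 8601 / RFC 3339 form, as in the draft sample output."""
    return parse_timestamp(ts).strftime("%Y-%m-%dT%H:%M:%SZ")


def to_http_date(ts):
    """RFC 1123 style date, as used in HTTP headers such as Memento-Datetime."""
    return format_datetime(parse_timestamp(ts), usegmt=True)


iso = to_iso8601(20060514123511)     # "2006-05-14T12:35:11Z"
http = to_http_date(20060514123511)  # "Sun, 14 May 2006 12:35:11 GMT"
```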

@machawk1
Member

@ibnesayeed also suggested adding entries for archives that report zero mementos for the URI-R.

@machawk1
Member

Also, change count to memento_count to be consistent.
