docs: re-add section about architecture
makkus committed Jul 26, 2021
1 parent 381c313 commit 03d3428
Showing 27 changed files with 6,845 additions and 8 deletions.
2 changes: 1 addition & 1 deletion AUTHORS.rst
@@ -2,4 +2,4 @@
Contributors
============

* Markus Binsteiner <markus[email protected]>
* Markus Binsteiner <markus@frkl.io>
8 changes: 4 additions & 4 deletions dev/dev.ipynb

Large diffs are not rendered by default.

32 changes: 32 additions & 0 deletions docs/architecture/assumptions.md
@@ -0,0 +1,32 @@
# Assumptions & considerations

## Core assumptions

I consider the following assumptions a given. They are not fuelled by user stories, but are the 'minimal' requirements
that emerged after the initial presentation of the 'open questions', and from other discussions with Sean and the team.
If any of those assumptions are wrong, some of the conclusions below will have to be adjusted.

- our (only) target audience (for now) is digital historians (and maybe also other digital humanities researchers) who can't code themselves
- the most important outcome of our project is for our target audience to be able to execute workflows in order to explore, explain, transform or augment their data
- we want the creation of workflows to be as easy and frictionless as possible, although not at the expense of end-user usability
- we want our product to be used by all DH researchers around the world, independent of their affiliation(s)
- collaboration/sharing of data is not a priority; most of our target audience are either individuals or small teams (sharing of results and sharing of workflows are different issues, and not included in this assumption)

## Considerations around adoption

One way to look at how to prioritize and implement some of our user stories is through the lens of ease of adoption:
which characteristics make our application more likely to be adopted by a larger group of researchers?

These are the obvious ones (at least to me) -- in no particular order:

- ease of workflow use
- ease of file-management use
- ease of installation (if there is one involved)
- whether there is a login/account creation requirement
- how well it integrates and plays with tools researchers already use day to day
- whether it provides workflows that are relevant to them
- cost: the cheaper to use the better (free / monthly cost / pay-per-usage)
- stability / reliability
- performance (most importantly on the compute side, but also UI)
- how easy it is to create workflows, and what skills are necessary to do that (easier creation -> more workflows)
- whether and how easy it will be to share, re-use and adapt workflows (different to sharing data)

50 changes: 50 additions & 0 deletions docs/architecture/data/data_centric_approach.ipynb
@@ -0,0 +1,50 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# A (parallel?) data centric approach for kiara/lumy"
]
},
{
"cell_type": "markdown",
"source": [
"- decision between Workflow creation and Workflow execution\n",
"-"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
608 changes: 608 additions & 0 deletions docs/architecture/data/data_formats.ipynb

Large diffs are not rendered by default.

36 changes: 36 additions & 0 deletions docs/architecture/data/dev.ipynb
@@ -0,0 +1,36 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
71 changes: 71 additions & 0 deletions docs/architecture/data/index.md
@@ -0,0 +1,71 @@
From looking at the user stories, listening to the interviews Lorella conducted, and considering my own personal
experience in eResearch, I think it's safe to say that the central topic we are dealing with is data. Without data,
none of the other topics (workflows, visualisation, metadata...) would even exist. Because of its central role, I want
to lay out the different forms data comes in, and which of its characteristics are important in our context.

## What's data?

Data is created from sources. Sources come in different forms (analog, digital) and can be anything from handwritten
documents in an archive to a Twitter feed. Photos, cave paintings, what have you. I'm not Webster's dictionary, but I think
one usable working definition of data could be 'a materialized source', in our context 'a materialized source in digital form'.
From here on out, I'll assume we are talking about 'digital' data whenever I mention data.

One thing I'll leave out of this discussion is what is usually called 'dirty data' in data engineering, although it is
an important topic. Most of the issues there map fairly well onto the structured/unstructured distinction below. There are
a few differences, but in the interest of clarity let's ignore those for now...

## Structured data / Data transformations

What's important for us is that data can come in two different formats: unstructured, and, who'd have guessed... structured. The same
piece of data can theoretically be expressed in structured as well as unstructured form: the meaning to a researcher would
be 100% the same, but the ways to handle, digest and operate on the data can differ, and in most scenarios adding structure
opens up possibilities to work with the data that weren't there before. In my head I call those two forms 'useless' and
'useful' data, but researchers usually get a bit agitated when I do, so I have learned not to do that in public anymore.

For researchers, the most (and arguably only) important feature of 'structure' is that it enables them to
do *more* with the data they already possess, by means of computation. I think it's fair to say that only structured data
can be used in a meaningful way in a computational context -- with the exception that unstructured data is useful input
for creating structured data.

One more thing to mention is that the line between structured and unstructured is sometimes hard to draw,
and can depend entirely on context. "One person's structured data is another person's unstructured data", something like that.
In addition, in some instances unstructured data can be converted to structured data trivially, meaning without much effort
or any user interaction. I'd argue we can consider those sorts of datasets basically 'structured'.
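
To make the 'trivially convertible' case concrete, here is a minimal sketch (the column names and values are invented
for illustration): a plain-text blob that is really just a small CSV table, which the standard library can turn into
structured records without any user interaction.

```python
import csv
import io

# A plain-text 'source' that is only nominally unstructured: it is a CSV
# table in disguise (all names/values here are made up for illustration).
raw = """year,place,documents
1891,Utrecht,12
1902,Antwerp,7
"""

# csv.DictReader adds the 'structure' for us: one dict per row, keyed by
# the header line -- no manual cleaning or user interaction required.
records = list(csv.DictReader(io.StringIO(raw)))
print(records[0]["place"])  # -> 'Utrecht'
```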

### Example

Let's use a simple example to illustrate all that: *a digital image of a document*.

Depending on what you are interested in, such an image might already be structured data. For example, it could contain geo-tags and a
timestamp, both of which are digitally readable. If you want to visualize on a map where a document is from, you can do that instantly.
Structured data, yay!
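
As a rough sketch of what 'digitally readable' means here (assuming Pillow as the image library; the file name is
hypothetical, and which EXIF tags exist depends entirely on the device that produced the image):

```python
from PIL import Image, ExifTags

# 'scan.jpg' is a hypothetical example file.
exif = Image.open("scan.jpg").getexif()

# Map numeric EXIF tag ids to human-readable names (e.g. 306 -> 'DateTime').
named = {ExifTags.TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

print(named.get("DateTime"))  # timestamp, if one was recorded
print(named.get("GPSInfo"))   # GPS info (possibly an IFD pointer), if geo-tagged
```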

Similarly, if you are interested in the color of the document's paper (ok, I'm stretching my argument here, as this seems fairly
unlikely, but this is really just to illustrate...), you might compute the color histogram of the image (which is trivial to extract,
but needs some batch computation), and for your purposes you would also consider the image file structured data.
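
The color-histogram case is similarly small in code terms; again a sketch, assuming Pillow and the same hypothetical file:

```python
from PIL import Image

# Pillow returns 256 counts per band, so an RGB image yields 768 values.
hist = Image.open("scan.jpg").convert("RGB").histogram()
red, green, blue = hist[0:256], hist[256:512], hist[512:768]

print("most common red intensity:", red.index(max(red)))
```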

Now, if you are interested in the text content of the document, things get more interesting. You will have to jump
through some hoops and feed the image file to an OCR pipeline that will, for example, spit out a text file. The data
itself would still be the same, but now computers can access not only some probably irrelevant metadata, but also the text content,
which, in almost all cases, is where the 'soul' of the data is.
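
What such an 'image to text' preparation step could look like, as a sketch only (pytesseract is just one possible OCR
backend, not necessarily what we'd end up using; file names as above are hypothetical):

```python
from PIL import Image
import pytesseract  # thin wrapper around the Tesseract OCR engine

# Feed the (hypothetical) image through OCR and persist the result as text.
text = pytesseract.image_to_string(Image.open("scan.jpg"), lang="eng")

with open("scan.txt", "w", encoding="utf-8") as f:
    f.write(text)
```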

It could be argued that 'just' a text file is not actually structured. I'd say that groups of ASCII characters that
can be found in English-language dictionaries, separated by whitespace and newlines, can be considered a structure,
even if only barely. The new format certainly allows the researcher to interact with the data in other ways (e.g. full-text search).
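
Even a deliberately naive search illustrates the point -- something that was impossible against the raw image becomes a
one-liner against the (barely) structured text (file name hypothetical, as above):

```python
from pathlib import Path

# Whitespace-separated tokens are all the 'structure' this needs.
tokens = Path("scan.txt").read_text(encoding="utf-8").lower().split()
print("archive" in tokens)  # True if the OCR'd document mentions the word
```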

We can go further, and might be interested in characteristics of the text content (language, topics, etc.). This is where
the actual magic happens; everything before that is just rote data preparation: turning unstructured (or 'other-ly' structured)
data into (meaningful) structured data... On a technical level, those two parts (preparation/computation) of a research workflow might look (or be)
the same, but I think there is a difference worth keeping in mind. If I don't forget, I'll elaborate on that later.

## 'Big-ish' data

I'm not talking about real 'Big data'-big data here, just largish files, or lots of them, or both. I don't think we'll encounter many use-cases where we have to move
or analyze terabytes of data, but I wouldn't be surprised if we come across a few gigabytes worth of it every now and then.

There are a few things we have to be prepared for, in those cases:

- transferring that sort of data is not trivial (esp. from home internet connections with limited upload bandwidth) -- and we will most likely have to offer some sort of resumable upload (and download) option (in case of a hosted solution)
- if we offer a hosted service, we will have to take this into account and plan for it, so we don't run out of storage space (we might have to impose quotas, for example)
- computation-wise, we need to make sure we are prepared for large datasets and handle them in a smart way -- if we load a huge dataset into memory, it can crash the machine where that is done (see the chunked-reading sketch after this list)
- similarly, when we feed large datasets into a pipeline, we might not be able to just duplicate and edit the dataset like we could for small amounts of data (too expensive, storage-wise) -- so we might need different strategies for how to execute a workflow, depending on file sizes (for example some sort of copy-on-write)
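
To make the memory point concrete, here is a minimal sketch of one common mitigation: streaming a large tabular file in
chunks instead of loading it whole. pandas and the file name are just stand-ins here; the same idea applies to other
backends (e.g. Arrow/parquet-based ones).

```python
import pandas as pd

# Read a (hypothetical) large CSV in fixed-size chunks, so memory use stays
# bounded by the chunk size rather than by the size of the whole file.
total_rows = 0
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    total_rows += len(chunk)  # each chunk is an ordinary DataFrame

print("rows processed:", total_rows)
```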