docs: re-add section about architecture
makkus committed Jul 26, 2021
1 parent 381c313 commit 03d3428
Showing 27 changed files with 6,845 additions and 8 deletions.
2 changes: 1 addition & 1 deletion AUTHORS.rst
@@ -2,4 +2,4 @@
Contributors
============

* Markus Binsteiner <markus[email protected]>
* Markus Binsteiner <markus@frkl.io>
8 changes: 4 additions & 4 deletions dev/dev.ipynb

Large diffs are not rendered by default.

32 changes: 32 additions & 0 deletions docs/architecture/assumptions.md
@@ -0,0 +1,32 @@
# Assumptions & considerations

## Core assumptions

I consider the following assumptions a given. They are not fuelled by user stories, but are the 'minimal' requirements
that emerged after the initial presentation of the 'open questions', and from other discussions with Sean and the team.
If any of those assumptions are wrong, some of the conclusions below will have to be adjusted.

- our (only) target audience (for now) is digital historians (and maybe also other digital humanities researchers) who can't code themselves
- the most important outcome of our project is for our target audience to be able to execute workflows in order to explore, explain, transform or augment their data
- we want the creation of workflows to be as easy and frictionless as possible, although not at the expense of end-user usability
- we want our product to be used by all DH researchers around the world, independent of their affiliation(s)
- collaboration/sharing of data is not a priority; most of our target audience are either individuals or small teams (sharing of results and sharing of workflows are different issues, and not included in this assumption)

## Considerations around adoption

One way to look at how to prioritize and implement some of our user stories is through the lens of ease of adoption:
which characteristics make our application more likely to be adopted by a larger group of researchers?

These are the obvious ones (at least to me) -- in no particular order:

- ease of workflow use
- ease of file-management use
- ease of installation (if there is one involved)
- whether there is a login/account creation requirement
- how well it integrates and plays with tools researchers already use day to day
- whether it provides workflows that are relevant to them
- cost: the cheaper to use the better (free / monthly cost / pay-per-usage)
- stability / reliability
- performance (most importantly on the compute side, but also UI)
- how easy it is to create workflows, and what skills are necessary to do that (easier creation -> more workflows)
- whether and how easy it will be to share, re-use and adapt workflows (different to sharing data)

50 changes: 50 additions & 0 deletions docs/architecture/data/data_centric_approach.ipynb
@@ -0,0 +1,50 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# A (parallel?) data centric approach for kiara/lumy"
]
},
{
"cell_type": "markdown",
"source": [
"- decision between Workflow creation and Workflow execution\n",
"-"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
608 changes: 608 additions & 0 deletions docs/architecture/data/data_formats.ipynb

Large diffs are not rendered by default.

36 changes: 36 additions & 0 deletions docs/architecture/data/dev.ipynb
@@ -0,0 +1,36 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
71 changes: 71 additions & 0 deletions docs/architecture/data/index.md
@@ -0,0 +1,71 @@
From looking at the user stories, listening to the interviews Lorella conducted, and considering my own personal
experience in eResearch, I think it's safe to say that the central topic we are dealing with is data. Without data,
none of the other topics (workflows, visualisation, metadata...) would even exist. Because of its central role, I want
to lay out the different forms data comes in, and which of its characteristics are important in our context.

## What's data?

Data is created from sources. Sources come in different forms (analog, digital) and can be anything from handwritten
documents in an archive to a Twitter feed. Photos, cave paintings, what have you. I'm not Webster's dictionary, but I think
one usable working definition of data could be 'a materialized source', in our context 'a materialized source in digital form'.
From here on out, I'll assume we are talking about 'digital' data whenever I mention data.

One thing I'll leave out of this discussion is what is usually called 'dirty data' in data engineering, although it is
an important topic. Most of the issues there map fairly well onto the structured/unstructured distinction below. There are
a few differences, but in the interest of clarity let's ignore those for now...

## Structured data / Data transformations

What's important for us is that data can come in two different formats: unstructured, and, who'd have guessed... structured. The same
piece of data can theoretically be expressed in structured as well as unstructured form: the meaning to a researcher would
be 100% the same, but the ways to handle, digest and operate on the data can differ, and in most scenarios adding structure
opens up possibilities to work with the data that weren't there before. In my head I call those two forms 'useless' and
'useful' data, but researchers usually get a bit agitated when I do, so I have learned not to do that in public anymore.

For researchers, the most (and arguably only) important feature of 'structure' is that it enables them to
do *more* with the data they already possess, by means of computation. I think it's fair to say that only structured data
can be used in a meaningful way in a computational context -- with the exception that unstructured data is useful input
for creating structured data.

One more thing to mention is that the line between structured and unstructured is sometimes hard to draw,
and can depend entirely on context. "One person's structured data is another person's unstructured data", something like that.
In addition, in some instances unstructured data can be converted to structured data trivially, meaning without much effort
or any user interaction. I'd argue we can consider those sorts of datasets basically 'structured'.
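
To make the 'trivially convertible' case concrete, here is a minimal sketch (the column names and values are invented
for illustration): a plain-text blob that is really just a small CSV table, which the standard library can turn into
structured records without any user interaction.

```python
import csv
import io

# A plain-text 'source' that is only nominally unstructured: it is a CSV
# table in disguise (all names/values here are made up for illustration).
raw = """year,place,documents
1891,Utrecht,12
1902,Antwerp,7
"""

# csv.DictReader adds the 'structure' for us: one dict per row, keyed by
# the header line -- no manual cleaning or user interaction required.
records = list(csv.DictReader(io.StringIO(raw)))
print(records[0]["place"])  # -> 'Utrecht'
```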

### Example

Let's use a simple example to illustrate all that: *a digital image of a document*.

Depending on what you are interested in, such an image might already be structured data. For example, it could contain geo-tags and a
timestamp, both of which are digitally readable. If you want to visualize on a map where a document is from, you can do that instantly.
Structured data, yay!
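
As a rough sketch of what 'digitally readable' means here (assuming Pillow as the image library; the file name is
hypothetical, and which EXIF tags exist depends entirely on the device that produced the image):

```python
from PIL import Image, ExifTags

# 'scan.jpg' is a hypothetical example file.
exif = Image.open("scan.jpg").getexif()

# Map numeric EXIF tag ids to human-readable names (e.g. 306 -> 'DateTime').
named = {ExifTags.TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

print(named.get("DateTime"))  # timestamp, if one was recorded
print(named.get("GPSInfo"))   # GPS info (possibly an IFD pointer), if geo-tagged
```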

Similarly, if you are interested in the color of the document's paper (ok, I'm stretching my argument here, as this seems fairly
unlikely, but this is really just to illustrate...), you might compute the color histogram of the image (which is trivial to extract,
but needs some batch computation), and for your purposes you would also consider the image file structured data.
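
The color-histogram case is similarly small in code terms; again a sketch, assuming Pillow and the same hypothetical file:

```python
from PIL import Image

# Pillow returns 256 counts per band, so an RGB image yields 768 values.
hist = Image.open("scan.jpg").convert("RGB").histogram()
red, green, blue = hist[0:256], hist[256:512], hist[512:768]

print("most common red intensity:", red.index(max(red)))
```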

Now, if you are interested in the text content of the document, things get more interesting. You will have to jump
through some hoops and feed the image file to an OCR pipeline that will, for example, spit out a text file. The data
itself would still be the same, but now computers can access not only some probably irrelevant metadata, but also the text content,
which, in almost all cases, is where the 'soul' of the data is.
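
What such an 'image to text' preparation step could look like, as a sketch only (pytesseract is just one possible OCR
backend, not necessarily what we'd end up using; file names as above are hypothetical):

```python
from PIL import Image
import pytesseract  # thin wrapper around the Tesseract OCR engine

# Feed the (hypothetical) image through OCR and persist the result as text.
text = pytesseract.image_to_string(Image.open("scan.jpg"), lang="eng")

with open("scan.txt", "w", encoding="utf-8") as f:
    f.write(text)
```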

It could be argued that 'just' a text file is not actually structured. I'd say that groups of ASCII characters that
can be found in English-language dictionaries, separated by whitespace and newlines, can be considered a structure,
even if only barely. The new format certainly allows the researcher to interact with the data in other ways (e.g. full-text search).
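
Even a deliberately naive search illustrates the point -- something that was impossible against the raw image becomes a
one-liner against the (barely) structured text (file name hypothetical, as above):

```python
from pathlib import Path

# Whitespace-separated tokens are all the 'structure' this needs.
tokens = Path("scan.txt").read_text(encoding="utf-8").lower().split()
print("archive" in tokens)  # True if the OCR'd document mentions the word
```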

We can go further, and might be interested in characteristics of the text content (language, topics, etc.). This is where
the actual magic happens; everything before that is just rote data preparation: turning unstructured (or 'other-ly' structured)
data into (meaningful) structured data... On a technical level, those two parts (preparation/computation) of a research workflow might look (or be)
the same, but I think there is a difference worth keeping in mind. If I don't forget, I'll elaborate on that later.

## 'Big-ish' data

I'm not talking about real 'Big data'-big data here, just largish files, or lots of them, or both. I don't think we'll encounter many use-cases where we have to move
or analyze terabytes of data, but I wouldn't be surprised if we come across a few gigabytes worth of it every now and then.

There are a few things we have to be prepared for, in those cases:

- transferring that sort of data is not trivial (esp. from home internet connections with limited upload bandwidth) -- and we will most likely have to offer some sort of resumable upload (and download) option (in case of a hosted solution)
- if we offer a hosted service, we will have to take this into account and plan for it, so we don't run out of storage space (we might have to impose quotas, for example)
- computation-wise, we need to make sure we are prepared for large datasets and handle them in a smart way -- if we load a huge dataset into memory, it can crash the machine where that is done (see the chunked-reading sketch after this list)
- similarly, when we feed large datasets into a pipeline, we might not be able to just duplicate and edit the dataset like we could for small amounts of data (too expensive, storage-wise) -- so we might need different strategies for how to execute a workflow, depending on file sizes (for example some sort of copy-on-write)
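
To make the memory point concrete, here is a minimal sketch of one common mitigation: streaming a large tabular file in
chunks instead of loading it whole. pandas and the file name are just stand-ins here; the same idea applies to other
backends (e.g. Arrow/parquet-based ones).

```python
import pandas as pd

# Read a (hypothetical) large CSV in fixed-size chunks, so memory use stays
# bounded by the chunk size rather than by the size of the whole file.
total_rows = 0
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    total_rows += len(chunk)  # each chunk is an ordinary DataFrame

print("rows processed:", total_rows)
```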