diff --git a/README.md b/README.md index 017e7ff8..c79ffb88 100644 --- a/README.md +++ b/README.md @@ -14,14 +14,12 @@ Project Meadowlark is a research and development effort to explore potential for use of new technologies, including managed cloud services, for starting up a "cloud native" Ed-Fi compatible API. - -- [Milestone 0.3.0](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/releases/tag/v0.3.0) has been released with Docker and +* [Milestone 0.3.0](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/releases/tag/v0.3.0) has been released with Docker and real OAuth2 support. - -- [Milestone 0.4.0](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/releases/tag/v0.4.0) includes full PostgreSQL support, +* [Milestone 0.4.0](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/releases/tag/v0.4.0) includes full PostgreSQL support, load balancer support with NGINX, instructions to use Kafka and performance evaluation. -See [Project Meadowlark - Exploring Next Generation Technologies](https://techdocs.ed-fi.org/x/RwJqBw) in Tech Docs for more +👀 See [Vision](./docs/VISION.md) in Tech Docs for more information on the background and design decisions for this project. 
## Getting Started diff --git a/docker/kafka/Dockerfile b/docker/kafka/Dockerfile index 571fa130..ac718928 100644 --- a/docker/kafka/Dockerfile +++ b/docker/kafka/Dockerfile @@ -8,7 +8,7 @@ COPY --chown=gradle:gradle /ed-fi-kafka-connect-transforms /home/gradle/src WORKDIR /home/gradle/src RUN gradle installDist --no-daemon -FROM debezium/connect:2.3@sha256:dfa59c008a03f45c7b286d2874f2e6dbe04f3db6f26b6f01806c136abb07381a +FROM debezium/connect:2.7.0-Final@sha256:a69c0bf30a269a0c53a98d9caf61a45f74a7bab18ebac6081a53af64ceba78b4 LABEL maintainer="Ed-Fi Alliance, LLC and Contributors " ARG package=opensearch-connector-for-apache-kafka-3.1.0.tar diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 00000000..5df91b3d --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,282 @@ +# Meadowlark Architecture + +# Introduction + +Project Meadowlark is a proof-of-concept implementation of the Ed-Fi API Specification, currently supporting Data Standard 3.3b, built on managed services provided by AWS. This document describes the system architecture, including: managed service infrastructure and flow; frameworks used in programming the solution; and notes about potential future direction. + +*→ [More information on  Meadowlark](../project-meadowlark-documentation.md)* + +## Cloud Managed Services + +The big three cloud providers (Amazon, Google, Microsoft) all provide similar managed services that could have been used to build this application. The choice of Amazon Web Services (AWS) is not an endorsement of Amazon *per se*. Rather, the development team needed to commit to one service in order to remain focused on delivering a usable proof-of-concept without over-engineering up-front. Development of a full-fledged *product*  based on Meadowlark would require additional effort to ensure that the core software can easily be used on any cloud platform or on-premises. 
+ +→ *[More information on provider parity](../project-meadowlark-documentation/meadowlark-provider-parity-analysis.md)* + +## Infrastructure + +The following diagram illustrates the managed service infrastructure utilized by Meadowlark. + +![Infrastructure diagram](../images/infrastructure.png) + +What does each of these services provide? + +* An **API Gateway** is a front-end web server that acts as a proxy back to the separate serverless functions. With the help of the API Gateway, client applications need know only a single base URL, and the different resource endpoints can opaquely point back to different back-end services or functions. +* **Serverless Functions** are small, purpose-built, serverless runtime hosts for application code. In the AWS ecosystem, Lambda Functions serve this purpose. In the Meadowlark solution, there are ten different Lambda Functions that handle inbound requests from the API Gateway. For simplicity, only a single icon represents all ten in the diagram above. +* **Database** services are provisioned with a NoSQL document store. For ease of use, the Meadowlark project used Amazon's **DynamoDB**. One of the powerful features of many NoSQL databases is **Change Data Capture (CDC) Streaming**: each change to an item stored in the database creates an event on a stream. Another serverless function detects this event to provide post-processing for saving into another datastore. + * ![(info)](https://edfi.atlassian.net/wiki/s/695013191/6452/be943731e17d7f4a2b01aa3e67b9f29c0529a211/_/images/icons/emoticons/information.png) + + This CDC triggering of a serverless function is an incredibly powerful extension point for adding downstream post-processing of any kind. Examples: generate notifications, initiate second-level business rule validation. 
+ * ![(warning)](https://edfi.atlassian.net/wiki/s/695013191/6452/be943731e17d7f4a2b01aa3e67b9f29c0529a211/_/images/icons/emoticons/warning.png) + + *Any serious attempt to turn Meadowlark into a full-fledged project would require moving away from DynamoDB to an open source document storage database, such as MongoDB, Cassandra, or ScyllaDB.* +* **OpenSearch** is an open source NoSQL database originally based on ElasticSearch, providing high-performance indexing and querying capabilities. All of the "GET by query" (aka "GET by example") client requests are served by this powerful search engine. +* **Log Monitoring** supports centralized collection, monitoring, and alerting on logs. In the Meadowlark implementation on AWS, **CloudWatch** provides that functionality. + +### Utilizing Multiple Databases + +In traditional application development, including the Ed-Fi ODS/API Platform, all Create-Read-Update-Delete (CRUD) operations are served by a single database instance. Project Meadowlark has instead adopted the strategy of choosing database engines that are a good fit-to-purpose. "NoSQL"  databases are a good fit for online transaction processing (OLTP) because they enable storage of the raw API payloads (JSON) directly in the database. This improves both the write speed and the speed of retrieving a single object from the database, since there are no joins to perform between different tables. In the current implementation, AWS's native and proprietary DynamoDB was selected as the primary transaction database for the simple reason that its architecture was interesting to explore. There are other document and key-value storage systems that could easily be used instead of DynamoDB. + +A key difference between this document storage approach, compared to relational database modeling, comes in the form of searchability. Many key-value and document databases have the ability to add "secondary indexes" that can help find individual items by some global criteria. 
But these are limited and very different from the indexes found in a relational database, which can be tuned to identify items based on any column. In other words, when storing an entire document, most key-value and document databases fare poorly when trying to search by query terms (e.g. "get all students with last name Doe"). + +This is where OpenSearch shines. Based on ElasticSearch, OpenSearch is also a NoSQL document store. The key difference is that it indexes everything in the document, and has a powerful search engine across the indexes. OpenSearch is not designed to be a robust solution for high performance write operations, so it does not make sense to write directly to it. + +To streamline the Meadowlark API functionality, the API code writes to only one database (DynamoDB). It then uses that database's *change data capture* trigger or stream to push new data out to another serverless function. That next function writes data to OpenSearch in a completely asynchronous / non-blocking process, and likewise deletes removed objects from OpenSearch. + +> [!TIP] +> Early on, the development team also experimented with writing the item out to blob storage (S3) in addition to OpenSearch. With JSON objects stored in S3, it was incredibly easy to build a simple analytics dashboard in Amazon QuickSight, with Amazon Athena sitting in the middle as the query engine. +> The S3 work was removed for expediency after an upgrade to the AWS SDK broke the code, and it may be restored in the future. Additionally, it may be useful to explore having the "GET by ID" requests served from blob storage instead of from the transaction database to take advantage of lower-cost reads; this can also be combined with CDN caching (for example with descriptors) to further improve performance and potentially lower the per-transaction cost. 
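The write path described above (the API writes to one database, and a stream consumer keeps the search index in step) can be sketched in a few lines. This is a minimal in-memory illustration, not Meadowlark's implementation: `ChangeEvent`, `SearchIndex`, and `applyChange` are hypothetical names, and the index is a plain `Map` standing in for OpenSearch. The event kinds mirror the `INSERT`, `MODIFY`, and `REMOVE` event names used by DynamoDB Streams.

```typescript
// Hypothetical shapes for illustration; these are not AWS SDK or Meadowlark types.
type ChangeKind = "INSERT" | "MODIFY" | "REMOVE";

interface ChangeEvent {
  kind: ChangeKind;
  id: string; // stands in for the item's resource ID
  document?: Record<string, unknown>; // present for INSERT and MODIFY
}

// In-memory stand-in for the search index (OpenSearch in the text above).
class SearchIndex {
  private readonly docs = new Map<string, Record<string, unknown>>();

  upsert(id: string, doc: Record<string, unknown>): void {
    this.docs.set(id, doc);
  }

  remove(id: string): void {
    this.docs.delete(id);
  }

  // Return the IDs of all documents whose top-level field equals the value.
  query(field: string, value: unknown): string[] {
    const hits: string[] = [];
    this.docs.forEach((doc, id) => {
      if (doc[field] === value) hits.push(id);
    });
    return hits;
  }
}

// Apply one change event from the stream to the search index.
function applyChange(index: SearchIndex, event: ChangeEvent): void {
  if (event.kind === "REMOVE") {
    index.remove(event.id);
    return;
  }
  if (event.document === undefined) {
    throw new Error(`Change ${event.kind} for ${event.id} has no document`);
  }
  index.upsert(event.id, event.document);
}

// Usage: the API writes only to the primary store; the index follows the stream.
const searchIndex = new SearchIndex();
applyChange(searchIndex, { kind: "INSERT", id: "student-1", document: { lastSurname: "Doe" } });
applyChange(searchIndex, { kind: "MODIFY", id: "student-1", document: { lastSurname: "Roe" } });
console.log(searchIndex.query("lastSurname", "Roe"));
applyChange(searchIndex, { kind: "REMOVE", id: "student-1" });
console.log(searchIndex.query("lastSurname", "Roe"));
```

Because the consumer only reacts to events, the API request never waits on the index; the cost is that a query may briefly miss a just-written document.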
+ +### Eventual Consistency + +Highly scalable databases such as DynamoDB and Cassandra store multiple copies of the data for resiliency and high availability, and only one of these copies receives the initial write operation. The service guarantees that all other copies will eventually come up to date with that initial write operation: the data will *eventually be consistent*. The tradeoff is in favor of connection reliability: queries are not blocked by write operations. + +Many people find this disturbing at first, if they are used to thinking about transaction locking in relational databases. But the reality is less scary than it sounds. + +[Amazon states](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html) that it typically takes "one second or less" to bring all copies up to date. Let's compare the outcomes of the following three scenarios: + +| Time | Scenario 1 | Scenario 2 | Scenario 3 | +| --- | --- | --- | --- | +| 10:01:01.000 AM | **Client A reads a record** | Client B writes an update to that record | Client B writes an update to that record | +| 10:01:01.500 AM (half second) | Client B writes an update to that record | **Client A reads a record** | All DynamoDB copies are up-to-date | +| 10:01:02.000 AM (full second) | All DynamoDB copies are up-to-date | All DynamoDB copies are up-to-date | **Client A reads a record** | +| *Status* | *Client A has stale data* | *Client A* might *have stale data* | *Client A has current data* | + +In Scenario 1, Client A receives stale data because they requested it *half a second* before Client B writes an update. *And this is no different than in a relational database*. + +In Scenario 2, Client B writes an update *half a second* before Client A sends a read. 
Client A might coincidentally be assigned to read from the first database node that received the record, or it might read from a node that is lagging by half a second. Thus it *might* get stale data, though this is not guaranteed. + +Finally in Scenario 3, Client A asks for a record a full second after Client B had written an update, and Client A is *nearly* guaranteed to get the current (not stale) data. *Again, same as with a standard relational database*. + +The practical difference between the guaranteed consistency of a relational database and the eventual consistency of a distributed database like DynamoDB is thus more a matter of happenstance than anything else. In either case, if Client A reads from the system a millisecond before Client B writes, then Client A will have stale data. If Client A reads *after*  Client B writes, then the window of time for getting stale data goes up to perhaps a second. *But if they do get stale data, they will never know that they weren't in scenario 1.* + +Eventual consistency is likely "good enough." But it does deserve further community consideration before using it in a production system. + +### Data Duplication + +For many people, this process of copying data into two storage locations (DynamoDB and OpenSearch) may seem very strange. Programmers are taught to "write once", avoiding the costs of storing and maintaining duplicate data. + +From the storage perspective, there is a false assumption here: when a relational database table has indexes, you are already storing duplicate copies of the data. With paired DynamoDB and OpenSearch, that hidden truth simply comes to the surface. Furthermore, the cost of storage is generally much lower than computation: so one should optimize for compute time more than for storage volume (within reason). OpenSearch is computationally powerful for indexed searches, whereas DynamoDB is computationally expensive if you try a full-table scan to look for an object via *ad hoc* query. 
+ +There is also an eventual consistency challenge here, one that is more significant than with DynamoDB by itself: there is a greater probability of an error in the CDC stream → Lambda function → OpenSearch write process than in the DynamoDB node synchronization process. This too deserves further scrutiny and operational testing, after changing to a different primary transactional database. + +## Programming Framework + +### Application Code + +The application code has been written in TypeScript running on Node.js, which are [popular tools](https://insights.stackoverflow.com/survey/2021#technology-most-popular-technologies) for modern web application development. Using TypeScript/JavaScript also gives us the advantage of leveraging MetaEd, as discussed in the next section. + +The code uses the latest software development kit from Amazon, AWS SDK 3, to mediate interactions with AWS services: receiving requests from API Gateway, writing to DynamoDB, and writing to OpenSearch. + +As a proof-of-concept, the development team did not spend as much time writing unit tests as would be done in production-ready code. That said, there are unit tests to cover approximately 60% of the application code (as of initial release), with the biggest gap being in the database persistence code, which is naturally harder to unit test. + +### MetaEd + +The Ed-Fi Data Standard is defined in code through [MetaEd files](https://edfi.atlassian.net/wiki/display/EDFITOOLS/MetaEd+IDE). The MetaEd application has a *build* tool that generates JSON and SQL files that the Ed-Fi ODS/API Platform leverages for auto-generating significant portions of the platform. By leveraging that same MetaEd code base, Meadowlark is able to construct an entire API surface at runtime without having to generate source code files. 
And thanks to the (essentially) schema-less nature of the NoSQL databases, there is no need for resource-specific mapping code when performing operations on items in the databases. + +Because it translates MetaEd files directly into an API surface, Meadowlark does not have any Data Standard specific code. No code changes are needed to support a newer (or older) Data Standard, although migrating data from one standard to another would require an external process. + +### Deployment + +The [Serverless Framework](https://www.serverless.com/framework/docs) does all of the heavy lifting for packaging the source code into Lambda functions, provisioning required resources in AWS, and setting up the necessary user permissions on AWS objects. The Serverless Framework also serves as an abstraction layer that should ease the transition from one cloud provider platform to another. Even where details need to change from one provider to the next, at least Serverless gives a common YAML-based configuration language, instead of having to learn the nuances of each provider's native domain-specific deployment language (AWS CloudFormation, Azure Resource Manager, Google Cloud Deployment Manager). + +### Table Design + +*This table design is DynamoDB-specific, though it may be appropriate in similar database systems such as Cassandra. If moving to a document store such as MongoDB, Firebase, or Couchbase, then this design would need to be revisited.* + +Meadowlark uses the [single-table design](https://aws.amazon.com/blogs/compute/creating-a-single-table-design-with-amazon-dynamodb/) approach for storage in DynamoDB, with the following structure: + +| Column Name | Purpose | +| --- | --- | +| pk | Hash key (aka partition key) - one half of the primary key 
| +| sk | Range key (aka sort key) - the other half of the primary key | +| naturalKey | Plain text version of the natural key | +| info | Contains the JSON document for a resource | + +There are also a couple of experimental columns and secondary indexes for exploring relationship-based authorization. + +Meadowlark creates a unique resource ID by calculating a [SHA-3](https://en.wikipedia.org/wiki/SHA-3) (cSHAKE128) hash value from the natural key. This value is stored as the sort key, `sk`. The partition key, `pk`, contains entity type information: schema, model version, and domain entity name. + +> [!TIP] +> In DynamoDB, an "item" is analogous to a "record" in a relational database. Thus a single object being stored in a DynamoDB table is stored as "an item". + +### Referential Integrity and Item Types + +An important feature of an Ed-Fi API is the ability to enforce referential integrity, rejecting modification requests where the modified item refers to another item that does not actually exist. An Ed-Fi API also rejects attempts to delete items that are referred to by other items. + +Most NoSQL databases do not support referential integrity, whereas the ODS/API Platform leverages referential integrity checking built into the SQL database. Therefore Meadowlark had to develop its own system for referential integrity checks, in application code. In short, Meadowlark transactionally writes extra items to the transactional database with pointers to the referenced items. These items are trivial to look up. + +> [!WARNING] +> Due to eventual consistency, there is a small but real possibility of a referential integrity check *miss*. To what extent does this matter? Another question for the community to explore. + +To illustrate: assume that a Meadowlark instance already has descriptors loaded, and an API client wants to load a School and a Course that belongs to that school. 
Adding excitement to the scenario: in the Ed-Fi Data Model, a School *is an* Education Organization (extends / inherits). + +![Course dependencies diagram](../images/course-dependencies.png) + +Below is the successful POST request to create the new school: + +```none title="Request" +POST http://aws-created-url/stage-name/v3.3b/ed-fi/schools + +{ + "schoolId": 122, + "nameOfInstitution": "A School", + "educationOrganizationCategories" : [ + { + "educationOrganizationCategoryDescriptor": "uri://ed-fi.org/EducationOrganizationCategoryDescriptor#Other" + } + ], + "schoolCategories": [ + { + "schoolCategoryDescriptor": "uri://ed-fi.org/SchoolCategoryDescriptor#All Levels" + } + ], + "gradeLevels": [] +} +``` + +```none title="Response" +HTTP/1.1 201 Created +x-metaed-project-name: Ed-Fi +x-metaed-project-version: 3.3.1-b +x-metaed-project-package-name: ed-fi-model-3.3b +location: /v3.3b/ed-fi/schools/7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e +content-type: application/json; charset=utf-8 +vary: origin +access-control-allow-credentials: true +access-control-expose-headers: WWW-Authenticate,Server-Authorization +cache-control: no-cache +content-length: 0 +Date: Mon, 06 Dec 2021 14:47:42 GMT +Connection: close +``` + +Since there are two descriptors, the application code must validate that those are legitimate descriptors. 
The following DynamoDB items exist, so the POST is validated: + +* SchoolCategoryDescriptor + * pk = `TYPE#Ed-Fi#3.3.1-b#SchoolCategoryDescriptor` + * sk = `ID#0f1474d47271406f6b47eabeba2fca6dd5a8b49a3b9d4e5b8d0e87e8` + * naturalKey = `NK#uri://ed-fi.org/SchoolCategoryDescriptor#All Levels` + * info = `{"namespace":{"S":"uri://ed-fi.org/SchoolCategoryDescriptor"},"description":{"S":"All Levels"},"shortDescription":{"S":"All Levels"},"_unvalidated":{"BOOL":true},"codeValue":{"S":"All Levels"}}` +* EducationOrganizationCategoryDescriptor + * pk = `TYPE#Ed-Fi#3.3.1-b#EducationOrganizationCategoryDescriptor` + * sk = `ID#04c7f019c56684b0539135ab2d955e4c03bc85b3841cdd87fb970f35` + * naturalKey = `NK#uri://ed-fi.org/EducationOrganizationCategoryDescriptor#Other` + * info = `{"namespace":{"S":"uri://ed-fi.org/EducationOrganizationCategoryDescriptor"},"description":{"S":"Other"},"shortDescription":{"S":"Other"},"_unvalidated":{"BOOL":true},"codeValue":{"S":"Other"}}` + +Now that the POST has been accepted, Meadowlark saves the following items in a transaction: + +* School + * pk = `TYPE#Ed-Fi#3.3.1-b#School` + * sk = `ID#7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e` + * naturalKey = `NK#schoolId=122` + * info = `{"educationOrganizationCategories":{"L":[{"M":{"educationOrganizationCategoryDescriptor":{"S":"uri://ed-fi.org/EducationOrganizationCategoryDescriptor#Other"}}}]},"schoolCategories":{"L":[{"M":{"schoolCategoryDescriptor":{"S":"uri://ed-fi.org/SchoolCategoryDescriptor#All Levels"}}}]},"gradeLevels":{"L":[]},"schoolId":{"N":"122"},"nameOfInstitution":{"S":"A School"}}` +* Education Organization + * pk = `TYPE#Ed-Fi#3.3.1-b#EducationOrganization` + * sk = `ASSIGN#ID#7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e` + +The second item, of type "Assign", helps to recognize entity super types when performing referential integrity validation checks. 
Please note that the hash value in the Assign item's `sk`  matches the hash value for the individual school. + +Now that there is a school, the client next creates a new Course, which has a reference to Education Organization. In this scenario, that Education Organization will be the School that was just created. For referential integrity, Meadowlark must determine if the Education Organization Id actually exists. Based on the payload, Meadowlark doesn't "know" to look for a *School* with this particular Education Organization Id – could be a Local or State Education Agency, for example. Hence the creation of the Assign item with `TYPE#Ed-Fi#3.3.1-b#EducationOrganization`  and the School's natural key hash value, which Meadowlark uses for the integrity lookup. + +```none title="Request" +POST http://aws-created-url/stage-name/v3.3b/ed-fi/courses + +{ + "educationOrganizationReference": { + "educationOrganizationId": 122 + }, + "courseCode": "1234", + "courseTitle": "A Course", + "numberOfParts": 1, + "identificationCodes": [] +} +``` + +```none title="Response" +HTTP/1.1 201 Created +x-metaed-project-name: Ed-Fi +x-metaed-project-version: 3.3.1-b +x-metaed-project-package-name: ed-fi-model-3.3b +location: /v3.3b/ed-fi/courses/2717e6e9275502cb2da0e3bdbf5c2ba3395f9e2117bdc7e03c216138 +content-type: application/json; charset=utf-8 +vary: origin +access-control-allow-credentials: true +access-control-expose-headers: WWW-Authenticate,Server-Authorization +cache-control: no-cache +content-length: 0 +Date: Mon, 06 Dec 2021 15:32:03 GMT +Connection: close +``` + +As Course does not extend any other entity, there is no need for it to have a complementary Assign item. However, another type of referential integrity comes into play now: we must make sure that no client can delete the School without first deleting the referencing Course.  
Meadowlark handles this by creating additional items along with the Course: one pointing from Course to School and one in reverse, making it easy to look up the relationship in either direction. + +* Course + * pk = `TYPE#Ed-Fi#3.3.1-b#Course` + * sk = `ID#2717e6e9275502cb2da0e3bdbf5c2ba3395f9e2117bdc7e03c216138` + * naturalKey = `NK#courseCode=1234#educationOrganizationReference.educationOrganizationId=122` + * info = `{"courseTitle":{"S":"A Course"},"numberOfParts":{"N":"1"},"educationOrganizationReference":{"M":{"educationOrganizationId":{"N":"122"}}},"identificationCodes":{"L":[]},"courseCode":{"S":"1234"}}` +* From Course To School + * pk = `FREF#ID#2717e6e9275502cb2da0e3bdbf5c2ba3395f9e2117bdc7e03c216138` + * sk = `TREF#ID#7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e` +* To School From Course + * pk = `TREF#ID#7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e` + * sk = `FREF#ID#2717e6e9275502cb2da0e3bdbf5c2ba3395f9e2117bdc7e03c216138` + * info = `{"Type":{"S":"TYPE#Ed-Fi#3.3.1-b#Course"},"NaturalKey":{"S":"NK#courseCode=1234#educationOrganizationReference.educationOrganizationId=122"}}` + +The `info` column in the "to ... 
from" item allows Meadowlark to provide a meaningful message when it rejects a DELETE request based on referential integrity: + +```none title="Request" +DELETE http://aws-created-url/stage-name/v3.3b/ed-fi/schools/7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e +``` + +```none title="Response" +HTTP/1.1 409 Conflict +x-metaed-project-name: Ed-Fi +x-metaed-project-version: 3.3.1-b +x-metaed-project-package-name: ed-fi-model-3.3b +content-type: application/json; charset=utf-8 +vary: origin +access-control-allow-credentials: true +access-control-expose-headers: WWW-Authenticate,Server-Authorization +cache-control: no-cache +content-length: 741 +Date: Mon, 06 Dec 2021 15:51:10 GMT +Connection: close + +{ + "error": "Unable to delete this item because there are foreign keys pointing to it", + "foreignKeys": [ + { + "NaturalKey": "NK#courseCode=1234#educationOrganizationReference.educationOrganizationId=122", + "Type": "TYPE#Ed-Fi#3.3.1-b#Course" + } + ] +} +``` + +## Descriptor Support + +Meadowlark comes bundled with the default set of Data Standard 3.3.1-b descriptor XML files, which need to be uploaded at startup. See the deployment instructions for more information on running an upload script. + +At this time, the application does not support all CRUD operations on descriptors: only GET and DELETE are supported, with the latter being more circumstantial than purposeful. The upload script and endpoint process bulk XML files from the Data Standard without needing the normal API resource endpoints. This omission was for expediency and would need to be corrected in a production-ready system. 
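Returning to the "Referential Integrity and Item Types" walkthrough, the extra items can be modeled compactly. The sketch below is an in-memory illustration under stated assumptions, not Meadowlark's code: `SingleTable`, `superTypeExists`, `referenceItems`, and `blockingReferences` are hypothetical names, the table is a `Map` rather than DynamoDB, and the resource IDs are copied from the examples rather than computed from natural keys.

```typescript
// Hypothetical in-memory model of the single table; not the AWS SDK or Meadowlark code.
interface TableItem {
  pk: string;
  sk: string;
  info?: Record<string, string>;
}

class SingleTable {
  private readonly items = new Map<string, TableItem>();

  put(item: TableItem): void {
    this.items.set(`${item.pk}|${item.sk}`, item);
  }

  has(pk: string, sk: string): boolean {
    return this.items.has(`${pk}|${sk}`);
  }

  // All items sharing a partition key, similar in spirit to a query on pk.
  queryByPk(pk: string): TableItem[] {
    const found: TableItem[] = [];
    this.items.forEach((item) => {
      if (item.pk === pk) found.push(item);
    });
    return found;
  }
}

const typeKey = (entity: string): string => `TYPE#Ed-Fi#3.3.1-b#${entity}`;

// Is there an Assign item saying some subclass of this super type has the given ID?
function superTypeExists(table: SingleTable, superType: string, id: string): boolean {
  return table.has(typeKey(superType), `ASSIGN#ID#${id}`);
}

// The pair of items written alongside a referencing resource.
function referenceItems(fromId: string, toId: string, fromType: string, fromNaturalKey: string): TableItem[] {
  return [
    { pk: `FREF#ID#${fromId}`, sk: `TREF#ID#${toId}` },
    { pk: `TREF#ID#${toId}`, sk: `FREF#ID#${fromId}`, info: { Type: fromType, NaturalKey: fromNaturalKey } },
  ];
}

// Items that would block a delete of the resource with this ID.
function blockingReferences(table: SingleTable, id: string): TableItem[] {
  return table.queryByPk(`TREF#ID#${id}`);
}

// Usage with the IDs from the walkthrough (taken as given, not recomputed).
const schoolId = "7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e";
const courseId = "2717e6e9275502cb2da0e3bdbf5c2ba3395f9e2117bdc7e03c216138";
const refTable = new SingleTable();
refTable.put({ pk: typeKey("School"), sk: `ID#${schoolId}` });
refTable.put({ pk: typeKey("EducationOrganization"), sk: `ASSIGN#ID#${schoolId}` });

if (superTypeExists(refTable, "EducationOrganization", schoolId)) {
  refTable.put({ pk: typeKey("Course"), sk: `ID#${courseId}` });
  referenceItems(
    courseId,
    schoolId,
    typeKey("Course"),
    "NK#courseCode=1234#educationOrganizationReference.educationOrganizationId=122",
  ).forEach((item) => refTable.put(item));
}

// A non-empty result means a DELETE of the School should be rejected.
console.log(blockingReferences(refTable, schoolId).map((item) => item.info));
```

Keeping the reverse ("to ... from") item under its own partition key is what makes the delete check a single lookup instead of a table scan.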
diff --git a/docs/FINDINGS-AND-QUESTIONS.md b/docs/FINDINGS-AND-QUESTIONS.md new file mode 100644 index 00000000..da4e7c3d --- /dev/null +++ b/docs/FINDINGS-AND-QUESTIONS.md @@ -0,0 +1,48 @@ +# Other Findings and Questions + +The development of the Meadowlark proof-of-concept organically raised questions about alternative ODS API features or patterns that might support the Ed-Fi ecosystem equally well. This document discusses a few of these. + +## Authorization + +The ODS API's main authorization pattern is based on establishing relationships from resources to education organizations – subclasses of EducationOrganization, or EdOrg for short. API clients are assigned one or more EdOrgs and a strategy that specifies CRUD permissions over API classes for which specific resources can be traced to one of these EdOrgs. + +This strategy is powerful and logical but also complex to implement. On the implementation side, each new authorization scheme needs to be driven by relational database views that materialize how each API resource can be traced to an EdOrg. Such views are custom code. + +This strategy has also created complexity for API clients. As noted above, the relationships that drive authorizations are opaque and not easily presented to an API client. This strategy also results in strange interaction scenarios, such as the fact that a client cannot read a Student or Parent resource the client just wrote (because it has no relation to an EdOrg yet). + +As noted above, this is not to say that the ODS API approach is wrong, but only that for some cases the complexity may not be justified. 
For example, in the case of a SIS client providing data to an API where the scope is a single LEA, these permissions probably suffice: + +* *For this particular API instance, your client has the ability to Create API resources for any of the following API classes:* *(list classes here)* +* *For any resource you write, your client can also Read, Update or Delete that same resource.* + +Implementing these rules is considerably simpler and demands no customized SQL or other materialized means to connect each resource to an EdOrg. + +Clearly, in the context in which data is being read out of the API the ODS EdOrg authorization pattern becomes potentially much more useful. But in many cases of data out – particularly early on – the scope of that authorization in field work still tends to be "all district data across these API resources for school year X". + +In summary, the ODS API pattern of using EdOrg relationships to drive authorization is powerful and worth preserving, but the Meadowlark project suggests that a set of simpler patterns might eliminate complexity from many early field projects. As an implementation advances in complexity, an API host may choose to enable more powerful and complex designs. + +## Validation Flexibility + +The ODS API's use of a relational database system for storage reduces the ability of the API to adapt to disparate validation needs. This can also be seen as a strength: the ODS API generally won't accept data that has not met a fairly high benchmark for quality, and this has pushed data quality back to the source systems and responsibility for data quality back to vendors. + +Meadowlark's architecture opens up new possibilities – simple to implement – for more tunable validation. Using a document store means the product can annotate unvalidated documents for deferred validation, or provide annotations on "how validated" the document is, e.g. support Level 2-style validation as an add-on. 
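One way to picture such annotations is a validation level stored alongside each document and raised by later passes. The sketch below is purely illustrative: the `ValidationLevel` values, `StoredDocument` shape, and function names are assumptions made for this example and do not describe Meadowlark's storage format.

```typescript
// Hypothetical annotation scheme for illustration; not Meadowlark's actual storage format.
type ValidationLevel = "unvalidated" | "schemaValid" | "referencesValid";

interface StoredDocument {
  id: string;
  body: Record<string, unknown>;
  validation: ValidationLevel;
  problems: string[];
}

// A check inspects a document body and returns a list of problems (empty when it passes).
type Check = (body: Record<string, unknown>) => string[];

// Accept the payload immediately, recording that nothing has been checked yet.
function acceptUnvalidated(id: string, body: Record<string, unknown>): StoredDocument {
  return { id, body, validation: "unvalidated", problems: [] };
}

// Run a deferred check later and record the outcome on a copy of the document.
function annotate(doc: StoredDocument, check: Check, levelOnPass: ValidationLevel): StoredDocument {
  const problems = check(doc.body);
  return problems.length === 0
    ? { ...doc, validation: levelOnPass, problems: [] }
    : { ...doc, problems };
}

// Example check: a School document needs a numeric schoolId.
const requiresSchoolId: Check = (body) =>
  typeof body["schoolId"] === "number" ? [] : ["schoolId must be a number"];

const storedDoc = acceptUnvalidated("doc-1", { schoolId: 122, nameOfInstitution: "A School" });
const checkedDoc = annotate(storedDoc, requiresSchoolId, "schemaValid");
console.log(checkedDoc.validation, checkedDoc.problems);
```

A failed check leaves the level unchanged and records the problems, so downstream consumers can decide how much validation they require before using a document.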
+ +Of course, at issue here is understanding when (if ever) it is appropriate to lower data validation requirements for data coming in via API. + +## Native Storage to Support Eventing + +The ability to retain a JSON document opens many possibilities for downstream processing and eventing. As a document posted to the API represents one logical event in the operations of a school district (e.g., "student X was marked absent on day Y"), the pre-packaging of that data opens up the possibility for other data consumers to consume it as a document (e.g., the document could be posted to a log of attendance events to which other systems subscribe). Meadowlark itself uses this mechanism to index the documents in a search engine for query support. + +The relational format of the ODS data storage delivers other benefits, such as the ability to perform complex validations based on SQL, so it is not a case of one storage format being better than the other, but that there are use case benefits to each. Indeed, there are certainly also ways in which both technologies could be mixed. + +## Analytics Modules + +The Meadowlark team experimented with downstream analytics processing using the above eventing mechanism. API documents were made accessible to Amazon Athena, which allows for interactive queries with large-scale data sets. The team made simple visualizations from the API data in Athena with Amazon QuickSight, the cloud-native BI tool. + +In addition to QuickSight, tools like Power BI Desktop also include support for creating reports and dashboards driven by Athena. It would be interesting to create real use-case driven analytics modules that work with a Meadowlark framework designed for community extensibility. + +## Reuse of Meadowlark Technology + +Meadowlark makes use of MetaEd to generate API document schema validations and to locate natural key and foreign key references in API documents. 
Some of this is done in a "pre-processing" step that mirrors the behavior of a MetaEd plugin, while the rest is done at API invocation time. This could be moved entirely into MetaEd plugins that generate standard JSON Schema and JSONPath API data from a MetaEd model. This information could be used by the ODS/API platform, for example, to support its own schema validation. + +This could also be part of a broader modularization of Meadowlark to enable extensions of Meadowlark created by the Ed-Fi community. With a clean separation of Meadowlark document validation and reference extraction from a web framework, alternatives like Azure Functions or even simple on-premises web application servers become possible. Similarly, separation of Meadowlark's back-end storage, querying and reference validation could allow for community-contributed alternatives like Azure Cosmos DB or local MongoDB instances. diff --git a/docs/PARITY-GAPS.md b/docs/PARITY-GAPS.md new file mode 100644 index 00000000..085aa8a7 --- /dev/null +++ b/docs/PARITY-GAPS.md @@ -0,0 +1,98 @@ +# Meadowlark and API Parity + +## What is "API Parity"? + +Meadowlark is designed to be implemented by the platform host without causing breaking changes on the API client side: a host should be able to swap the API provided by the Ed-Fi ODS/API for the Ed-Fi API provided by Meadowlark and have API clients continue to function (without realizing that they are communicating with a different API). We refer to this as "API parity." + +API parity for the project is defined in terms of the [Meadowlark Use Cases](./use-cases.md); that is, if a feature was not critical to satisfying one of these use cases, it was generally left out. For example, extensibility, eTags, and change queries are unquestionably useful for some API clients, but the belief is that the core Meadowlark use cases do not generally depend on these features, or that those features – if used – are nice-to-haves. 
+ +Such a calculus is imperfect: there is always the possibility that some API client relies on a particular feature. + +Broadly speaking, the proof-of-concept achieves API parity according to the definition above, but with some gaps. This document provides a list of the known gaps to API parity. + +## Will these gaps be closed? + +Some may, but it is unlikely that all such gaps will be closed. Ed-Fi is both an effort to build open source data infrastructure AND an effort to provide blueprints for standardized data flows. In respect of the latter goal of standardization, it is highly useful to compare API differences across API implementations: these are opportunities to understand better what needs to be standard and what does not. + +Rather than try to close all these gaps, the goal should be to clearly define what API features are required and which should be allowed to vary. Doing so will allow for the development of alternative API implementations, whether through the open-source effort of the Ed-Fi community or through independent efforts outside of that community work. + +## List of API Parity Gaps + +### No extension support + +Meadowlark does not support API extensibility. + +Given that the Meadowlark use cases focus on LEA data sourcing where extensibility should not be needed, this feature is unlikely to be prioritized. + +Note however that the Alliance has looked to extensibility as a means to evolve the API interface, as in the case of the release of an early access, revised Finance API (see [ED-FI RFC 18 - FINANCE API](https://edfi.atlassian.net/wiki/spaces/EFDSRFC/pages/25363138/ED-FI+RFC+18+-+FINANCE+API)). If this pattern becomes standard practice, there will be more of an argument for the utility of such support. + +### Support for "link" objects in JSON + +In the ODS/API, the JSON is annotated by "link" elements that show the path to the element using a GET by the resource ID.
These elements appear like this: + +```json +"gradingPeriodReference": { + "gradingPeriodDescriptor": "uri://ed-fi.org/GradingPeriodDescriptor#First Six Weeks", + "periodSequence": 1, + "schoolId": 255901001, + "schoolYear": 2022, + "link": { + "rel": "GradingPeriod", + "href": "/ed-fi/gradingPeriods/0d4a8d72801240fd805ee118b2641b0f" + } +}, +``` + +These elements do not appear in the GET responses provided by Meadowlark. + +It is unlikely that these will be supported, and in general the direction is to continue to omit these from Ed-Fi API specifications. + +* The utility of these elements is doubtful: they seem to be an implementation feature/decision made by the ODS/API project and do not seem to be in wide use. The intention seems to be to deliver HATEOAS-type information to clients, but that model of interaction has generally not emerged as best practice in REST APIs. +* Since Meadowlark takes a document-centric approach to collection and data management, annotating the documents would create additional complexity for any APIs of this kind; without compelling value for this feature, it was judged to be better to simply omit the feature. + +### Support for "discriminator" fields on abstract class EducationOrganization + +The ODS API provides for discriminators that inform the API client what specific subclass of an abstract class is being referenced. This is done via a "link" object that includes a "rel" field that indicates the class of the referent object. See below for an example of this on the /course API resource. + +```json +{ + "id": "16904b88d3c144b4a43af2924f4c4590", + "educationOrganizationReference": { + "educationOrganizationId": 255901001, + "link": { + "rel": "School", + "href": "/ed-fi/schools/c81a158d7caf49f299ff3c22b503b334" + } + }, + "courseCode": "03100500", + "courseDefinedByDescriptor": "uri://ed-fi.org/CourseDefinedByDescriptor#SEA", + "courseDescription": "Algebra I", + ...
+} +``` + +This feature was added to the ODS API in the interest of simplifying data usage for outbound/pulling API clients, especially for cases in which there is a high priority on API simplicity, as for the roster/enrollment API. + +However, those use cases are not the focus of the initial Meadowlark scope, so it is unclear if this should be addressed. We will likely await further feedback, and if this emerges as a need, possibly look at other implementation options for solving the same problem (e.g., might it be better to ask a client to maintain a cache of EdOrgs, and possibly add support that allows them to do that more easily?). To insert the capability to annotate JSON documents would add complexity that is not clearly justified. + +### Full authentication support + +Meadowlark's current authentication is hard coded to two key/secret pairs and hard-coded claims. + +If the project development continues, this would be a candidate for further development. However, as this authentication pattern is well-known, it is not seen as an element of the proof-of-concept that there is high value in exploring. Therefore, this is likely to be a lower priority. + +### Over-posting: posting fields not part of the JSON schema + +The Ed-Fi ODS API allows for extraneous fields to be posted without error; such fields are simply ignored. In Meadowlark, these are schema violations and a 4xx error is returned. + +Allowing over-posting is generally a bad practice, as it often indicates the API client is not following the schema and can lead to hard to detect errors. However, over-posting can be employed as a simple API client strategy to support multiple versions of an API with less complexity. + +This is likely not to be prioritized, given that this permissiveness has both pros and cons and which is more important is unclear. + +> [!WARNING] +> +> To test out Meadowlark on your own: +> 1. 
Make sure that you have an AWS subscription and a user account with permissions to create resources. +> 2. Must have [Node.js 16](https://nodejs.org/) installed locally to manage the deployment. +> 3. Clone the [source code repository](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/) using Git. +> 4. Follow the [install instructions](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/tree/main/docs). diff --git a/docs/PROVIDER-PARITY.md b/docs/PROVIDER-PARITY.md new file mode 100644 index 00000000..2e9918f9 --- /dev/null +++ b/docs/PROVIDER-PARITY.md @@ -0,0 +1,36 @@ +# Meadowlark Provider Parity Analysis + +As mentioned in the [Meadowlark Architecture](./architecture.md), Meadowlark was developed on Amazon Web Services (AWS), but the principle was to only use AWS managed services that have an analogous option for other major providers. This would make it (relatively) easy to migrate from one platform to the other. On-premises options were also explored. + +The Alliance, following community feedback, has strongly factored open source availability of technology components into its technology roadmap choices (e.g., the work to port the Ed-Fi ODS platform storage to PostgreSQL). This choice has played an important role in expanding availability of the platform and lowering costs. This principle is likely to play an important role if the Meadowlark project is expanded (e.g., move to MongoDB from provider-specific options). + +This document reviews the services used and identifies the equivalent tools (or gaps) in Azure, Google Cloud, and on-premises.
+ +| Purpose | AWS Service | Azure | Google | On-Premises | Additional Notes | +| --- | --- | --- | --- | --- | --- | +| Load balancing and reverse proxy | [API Gateway](https://aws.amazon.com/api-gateway/) | [Azure Application Gateway](https://azure.microsoft.com/en-us/services/application-gateway/#overview) | [Cloud Endpoints](https://cloud.google.com/endpoints) | [NGINX](https://www.nginx.com/), among others |   | +| Serverless Application | [AWS Lambda](https://aws.amazon.com/lambda/) | [Azure Functions](https://azure.microsoft.com/en-us/services/functions/#overview) | [Google Cloud Functions](https://www.dynatrace.com/monitoring/technologies/google-cloud-monitoring/google-cloud-functions/?utm_source=google&utm_medium=cpc&utm_term=google%20cloud%20functions&utm_campaign=us-cloud-monitoring&utm_content=none&gclid=Cj0KCQiAqbyNBhC2ARIsALDwAsCT7cIo5OA8gTYttkevTd2XvydoEsrmpGTwjb712qKJlVQeW_LKXcEaAiL2EALw_wcB&gclsrc=aw.ds) | [OpenFaaS](https://www.openfaas.com/) or [Fn](http://fnproject.io/) | The Meadowlark application is written in TypeScript using the [Serverless package](https://www.npmjs.com/package/serverless), making it theoretically easy to reuse these components with any platform's serverless functions.

Could consider refactoring to OpenFaas or Fn for one system that is cloud-agnostic (runs in Kubernetes and Docker, respectively). | +| Key-value data  store and Change Data Capture | [DynamoDB](https://aws.amazon.com/dynamodb/) with [DynamoDB Change Data Capture](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/streamsmain.html) | [CosmosDB](https://azure.microsoft.com/en-us/services/cosmos-db/#overview) in Cassandra API mode with [CosmosDB Change Feed with Azure Functions](https://docs.microsoft.com/en-us/azure/cosmos-db/change-feed-functions) | [Firestore](https://cloud.google.com/firestore) ❌ see note below about change streams | [Apache Cassandra](https://cassandra.apache.org/_/index.html) with [Cassandra Triggers](https://medium.com/rahasak/publish-events-from-cassandra-to-kafka-via-cassandra-triggers-59818dcf7eed) | See detailed info below | +| Search engine | [Amazon OpenSearch](https://aws.amazon.com/opensearch-service/) | [Elastic on Azure](https://azure.microsoft.com/en-us/overview/linux-on-azure/elastic/) | [Elastic on Google Cloud Platform](https://www.elastic.co/about/partners/google-cloud-platform) | Either [ElasticSearch](https://www.elastic.co/elastic-stack/) or [OpenSearch](https://opensearch.org) can run on-premises |   | + +## Key-Value Data Detailed Notes + +The differences may be great enough that some tweaking of the storage model may be required. + +Switching to MongoDB may be a useful alternative, as it is available on all platforms: + +* [Amazon DocumentDB](https://aws.amazon.com/documentdb/) with [Change Streams](https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html) +* Azure CosmosDB has a MongoDB mode +* [MongoDB Atlas](https://www.mongodb.com/atlas/database) running on any of the three +* MongoDB can also run on-premises, with [Change Data Capture Handlers](https://docs.mongodb.com/kafka-connector/current/sink-connector/fundamentals/change-data-capture/). 
+ +Another option would be to switch to Cassandra for a single database platform available on all providers: + +* [Amazon Keyspaces](https://aws.amazon.com/keyspaces/) +* CosmosDB +* [Astra DB](https://www.datastax.com/products/datastax-astra/) from DataStax, running on any of the three +* Cassandra can also run on-premises + +## Google Firestore Warning + +Google Firestore might not have a direct equivalent of Change Data Capture; at least, searching for one does not turn up functionality that is clearly the same as in the other products. However, perhaps one of these techniques is capable of writing out to a stream: [Extend... with Cloud Functions](https://firebase.google.com/docs/firestore/extend-with-functions) or [onSnapshot](https://firebase.google.com/docs/firestore/query-data/listen). diff --git a/docs/USE-CASES.md b/docs/USE-CASES.md new file mode 100644 index 00000000..961337e0 --- /dev/null +++ b/docs/USE-CASES.md @@ -0,0 +1,39 @@ +# Meadowlark Use Cases + +## Use Cases + +The initial use cases targeted by Meadowlark are focused on K12 local education agencies (LEAs) and student information systems data. + +The use cases targeted initially by Meadowlark are: + +“*As an LEA collaborative or service provider to an LEA, I need to aggregate the most critical student performance data in order to be able to understand the efficacy of and make changes to my school district curricular programs*” + +“*As an LEA collaborative or service provider to an LEA, I want to be able to unblock myself when I encounter problems in the platform by modifying and contributing to the Ed-Fi platform solution*” + +These represent a *subset* of the current market problems and functionality delivered by the Ed-Fi ODS/API. This narrowing of overall Ed-Fi use cases was made to limit the complexity and scope of the project, and does not reflect any intentions about future investments in other use cases not included here.
+ +## Functional Areas + +Meadowlark focuses on two functions of the current Ed-Fi ODS platform: + +### Ed-Fi API surface + +On the first of these, API implementation, Meadowlark provides an API that is designed to mimic the Ed-Fi ODS API and to meet the published Ed-Fi API specifications. We refer to this as "API parity", and addressing this parity was by far the largest focus of the project. + +In some cases, the project decided not to pursue full parity with the current ODS API or the specifications. The reasons for this vary, but do not individually suggest that there are any "blockers" to using cloud technology services as the basis for an Ed-Fi data exchange architecture. Gaps to API parity are covered in this document: [API Parity Gaps](./PARITY-GAPS.md). + +### Support for data management and analytics + +The second focus was on how that data (once loaded from a vendor system) could be used by an education agency (which was – given the use cases above – an LEA). + +In the Ed-Fi ODS platform, the native storage of data sourced from vendor systems is a relational database, and this provides infrastructure that LEAs can use to query and further transform the data. + +Meadowlark takes a document-centric approach to the data loading, saving the native JSON documents into the transactional database. This eliminates the need to pre-define database schemas or to normalize data storage, thus lowering the amount of code required and improving query performance. However, this storage format is likely not useful for agencies seeking to perform analytics. Meadowlark therefore explored how to enable downstream usage. + +> [!WARNING] +> To test out Meadowlark on your own: +> +> 1. Make sure that you have an AWS subscription and a user account with permissions to create resources. +> 2. Must have [Node.js 16](https://nodejs.org/) installed locally to manage the deployment. +> 3.
Clone the [source code repository](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/) using Git. +> 4. Follow the [install instructions](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/tree/main/docs). diff --git a/docs/VISION.md b/docs/VISION.md new file mode 100644 index 00000000..25fe7eea --- /dev/null +++ b/docs/VISION.md @@ -0,0 +1,47 @@ +# What is Meadowlark? + +Project Meadowlark is a research and development (R&D) project. It is not a project to develop a "next gen" platform but rather to inform community conversations and  inform future development. + +The goal of the Meadowlark project is to look for **technology accelerators** for the current strategy of near-real time data collection and aggregation via standards-based API exchanges from vendor systems and into a central data platform, in order to support analytics to improve the performance of students. + +The Meadowlark code and releases provide a deployable, distributable, proof-of-concept for a cloud-native (i.e., built on cloud services) implementation of the Ed-Fi API surface. It therefore replicates the data collection capabilities of the Ed-Fi ODS/API, but does not replicate the database structure and storage of the ODS/API. + +## What is it? + +A research and development effort to explore *potential* for use of new technologies, including managed cloud services, for starting up an Ed-Fi compatible API. + +## It is not… + +A replacement for the Ed-Fi ODS/API. Portions of it could someday become a replacement, but we are not there. + +## Research Questions + +Some of the questions to be explored in this project include: + +* Can we build an API application that supports multiple data standards without requiring substantial coding work for each revision of the standard? +* Can an Ed-Fi API promote events to a first-class concept, supporting notifications, subscriptions, and real-time data transformations? +* How much might a fully cloud-native architecture cost to operate? 
+* Is it feasible to build a system that is both cloud-native and fully operational on-premises? +* What are the most important features and security models for unlocking widespread deployment across the education sector? + +## End State Architecture + +Taken to its logical conclusion, the end state architecture for Meadowlark would be more of a framework than a monolithic "product", with many different, competing components that could be substituted into the system to perform designated functions. For example, the Ed-Fi API is an HTTP-based service with many possible implementations. The initial implementation uses AWS Lambda Functions. Alternate implementations could use a stand-alone Node.js web server - such as Express or Fastify - or could implement the HTTP services in the functional framework for Azure, Google Cloud, etc. + +The following diagram deliberately mixes and matches generic icons with icons from Amazon Web Services, Google Cloud Platform, and Microsoft Azure, thus representing that the desired architecture is meant to be dynamic, pluggable, and platform-agnostic. + +![Meadowlark architecture diagram](../images/meadowlark-architecture.png) + +## General Principles + +1. Prefer open source components or protocols. +2. Code with strong separation of concerns in mind, enabling common business logic to interact with multiple front-end (HTTP) and back-end (data store) components. +3. Provide enough testing to prove viability, but not so much as would be required for a production-ready product. +4. Evolve toward a (database-first) Event-Driven Architecture.
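
The separation of concerns in principle 2 could be sketched as follows. This is a minimal illustration under stated assumptions, not Meadowlark's actual code: `DocumentStore`, `InMemoryDocumentStore`, `FrontendRequest`, `FrontendResponse`, and `handleRequest` are hypothetical names, and the in-memory store only stands in for a real back end.

```typescript
// Hypothetical back-end contract: any data store could sit behind this interface.
interface DocumentStore {
  upsert(resourceName: string, id: string, doc: object): Promise<void>;
  getById(resourceName: string, id: string): Promise<object | undefined>;
}

// In-memory stand-in, used here only to keep the sketch runnable.
class InMemoryDocumentStore implements DocumentStore {
  private readonly documents = new Map<string, object>();

  async upsert(resourceName: string, id: string, doc: object): Promise<void> {
    this.documents.set(`${resourceName}/${id}`, doc);
  }

  async getById(resourceName: string, id: string): Promise<object | undefined> {
    return this.documents.get(`${resourceName}/${id}`);
  }
}

// Front-end-neutral request and response shapes (hypothetical).
interface FrontendRequest {
  method: 'GET' | 'PUT';
  resourceName: string;
  id: string;
  body?: object;
}

interface FrontendResponse {
  statusCode: number;
  body?: object;
}

// Common business logic: it has no knowledge of Lambda, Express, Fastify,
// or which database implements DocumentStore.
async function handleRequest(request: FrontendRequest, store: DocumentStore): Promise<FrontendResponse> {
  if (request.method === 'PUT') {
    if (request.body === undefined) return { statusCode: 400 };
    await store.upsert(request.resourceName, request.id, request.body);
    return { statusCode: 204 };
  }
  const found = await store.getById(request.resourceName, request.id);
  return found === undefined ? { statusCode: 404 } : { statusCode: 200, body: found };
}
```

Under this arrangement, a Lambda handler or a stand-alone web server would only translate its native request into a `FrontendRequest` and call `handleRequest`, so swapping the front end or the data store would not touch the shared logic.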
+ +## Articles + +* [Architecture](./ARCHITECTURE.md) +* [Parity Gaps](./PARITY-GAPS.md) +* [Meadowlark Provider Parity Analysis](./PROVIDER-PARITY.md) +* [Other Findings and Questions](./FINDINGS-AND-QUESTIONS.md) diff --git a/docs/meadowlark-api-design/attachments/RND-16.pipeline.png b/docs/meadowlark-api-design/attachments/RND-16.pipeline.png new file mode 100644 index 00000000..d0ee9743 Binary files /dev/null and b/docs/meadowlark-api-design/attachments/RND-16.pipeline.png differ diff --git a/docs/meadowlark-api-design/attachments/did.you.mean.png b/docs/meadowlark-api-design/attachments/did.you.mean.png new file mode 100644 index 00000000..3b7e0068 Binary files /dev/null and b/docs/meadowlark-api-design/attachments/did.you.mean.png differ diff --git a/docs/meadowlark-api-design/attachments/image2020-6-9_9-33-27.png b/docs/meadowlark-api-design/attachments/image2020-6-9_9-33-27.png new file mode 100644 index 00000000..01551459 Binary files /dev/null and b/docs/meadowlark-api-design/attachments/image2020-6-9_9-33-27.png differ diff --git a/docs/meadowlark-api-design/attachments/image2020-6-9_9-38-11.png b/docs/meadowlark-api-design/attachments/image2020-6-9_9-38-11.png new file mode 100644 index 00000000..dc2bbd5e Binary files /dev/null and b/docs/meadowlark-api-design/attachments/image2020-6-9_9-38-11.png differ diff --git a/docs/meadowlark-api-design/attachments/image2022-5-20_10-31-10.png b/docs/meadowlark-api-design/attachments/image2022-5-20_10-31-10.png new file mode 100644 index 00000000..ded730a6 Binary files /dev/null and b/docs/meadowlark-api-design/attachments/image2022-5-20_10-31-10.png differ diff --git a/docs/meadowlark-api-design/attachments/schema.failure.png b/docs/meadowlark-api-design/attachments/schema.failure.png new file mode 100644 index 00000000..2c85e378 Binary files /dev/null and b/docs/meadowlark-api-design/attachments/schema.failure.png differ diff --git a/docs/meadowlark-api-design/attachments/section.endpoint.png 
b/docs/meadowlark-api-design/attachments/section.endpoint.png new file mode 100644 index 00000000..a578a0b3 Binary files /dev/null and b/docs/meadowlark-api-design/attachments/section.endpoint.png differ diff --git a/docs/meadowlark-api-design/attachments/section.identity.png b/docs/meadowlark-api-design/attachments/section.identity.png new file mode 100644 index 00000000..1b43fa8c Binary files /dev/null and b/docs/meadowlark-api-design/attachments/section.identity.png differ diff --git a/docs/meadowlark-api-design/attachments/section.metaed.png b/docs/meadowlark-api-design/attachments/section.metaed.png new file mode 100644 index 00000000..d8598ce8 Binary files /dev/null and b/docs/meadowlark-api-design/attachments/section.metaed.png differ diff --git a/docs/meadowlark-api-design/attachments/section.reference.fields.png b/docs/meadowlark-api-design/attachments/section.reference.fields.png new file mode 100644 index 00000000..7869c6eb Binary files /dev/null and b/docs/meadowlark-api-design/attachments/section.reference.fields.png differ diff --git a/docs/meadowlark-api-design/meadowlark-api-application-logging.md b/docs/meadowlark-api-design/meadowlark-api-application-logging.md new file mode 100644 index 00000000..1744136a --- /dev/null +++ b/docs/meadowlark-api-design/meadowlark-api-application-logging.md @@ -0,0 +1,89 @@ +# Meadowlark - API Application Logging + +## Purpose + +This document describes the logging policy in the Meadowlark API source code. In general, this policy seeks to balance the goals of providing sufficient information for an administrator to understand the health of the system and understand user interaction with the system with the equally important goals of protecting sensitive data and avoiding excessive log storage size. + +## Logging Principles + +* Use structured logging for integration into log-monitoring applications (LogStash, Splunk, CloudWatch, etc.). +* Do not log sensitive data. +* Use an appropriate log level. 
+* Include a correlation / trace ID wherever possible, with the ID being unique to each HTTP request. +* Provide enough information to help someone understand what is going on in the system, and where, but +* Be careful not to make the log entries too large, thus becoming a storage problem. +* Logs will be written to the console, at minimum. +* If any transformation or business logic is necessary for writing an info or debug message, use the utility `isDebugEnabled`  and `isInfoEnabled`  functions first before executing that logic. + +## Log Levels + +### Summary + +Meadowlark will utilize the following levels when logging messages. These levels help the reader to understand if any remedial action is needed, and they allow the administrator to tune the amount of data being logged. + +| Level | Description | Actionable | +| --- | --- | --- | +| ​ERROR | Either:

* Something unexpected occurred in code, which interrupts service in some way, or
* An error occurred in an external service, for example, a database server was down. | Yes:

* Submit a bug report with the Ed-Fi team ([How To: Get Technical Help or Provide Feedback](https://edfi.atlassian.net/wiki/spaces/ETKB/pages/20874815/How+To%3A+Get+Technical+Help+or+Provide+Feedback))
* Investigate the external service; report error to service provider if applicable | +| WARN | Something unexpected occurred in code, but the system is able to recover and continue. | If you see this happening frequently, consider submitting a detailed report as a Tracker ticket. There may be an opportunity for improving the code and/or providing better error handling for the situation. | +| INFO | Displays information about the state of an HTTP request, for example, which function is currently processing the request. | No | +| DEBUG | Displays additional information about the state of an HTTP request and/or state of responses from external services.

Includes anonymized HTTP request payloads for debugging integration problems. | No | + +> [!TIP] +> See below for more information on how this anonymization would work in DEBUG logging. +> When vendor API clients encounter data integration failures, the support teams often want to know what payload failed, and this information is not always readily available from the maintainers of the client application. Providing anonymized payloads meets the support need "half way" in that the system administrator and/or a support team member can see the *structure* of the messages sent, without being able to see the detailed *content*. In many cases, this will be sufficient to understand why a request failed. + +### Examples + +These examples are general guidelines and not 100% exhaustive. + +#### Error + +* Unhandled null reference +* Database connection / transaction failure after exhausting retry attempts + +#### Warning + +* A database connection / transaction failure occurred, but was recovered with an automatic retry + +#### Informational + +* Received an HTTP request + * URL + * clientId + * traceId + * verb + * contentType + * *no payload* +* Responded to an HTTP request + * URL + * response code + * clientId + * duration from time of receipt of HTTP request to response (milliseconds) + * *no payload* +* Process startup and shutdown +* Database created + +#### Debug + +* Received an HTTP request → add anonymized payload + * Replace potentially sensitive string and numeric data with `null`  before logging. + * Could hard code restrictions to "known-to-be-sensitive" attributes, for example attributes on Student, Parent, and Staff. + * However, that could fall short with a change to the data model. + * Therefore, it will be safest to replace all string and numeric data. + * One potential exception: descriptors. 
+ * descriptor values will never contain sensitive data; + * since the other string and numeric values are anonymized, the descriptor value itself does not provide a side channel to sensitive information; + * there is value to having this when debugging failed HTTP requests. +* Responded  to an HTTP request → add payload + * Will require anonymization of the natural key fields when reporting a referential integrity problem + + > [!INFO] + > Potential scenario: + > * Entity1 has natural key {personName, personId}. + > * Entity2 has a reference to Entity1 + > * Post Entity2 with a {personName, personId} that do not exist. Then the response message will have `is missing identity {\"personName\": \"the actual value\", \"personId\": ... }` + +* Entered a function +* About to connect to a service or run through an interesting algorithm +* Received information back from a service + * Metadata only diff --git a/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-14-13.png b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-14-13.png new file mode 100644 index 00000000..2a22fbd7 Binary files /dev/null and b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-14-13.png differ diff --git a/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-18-34.png b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-18-34.png new file mode 100644 index 00000000..40863144 Binary files /dev/null and b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-18-34.png differ diff --git a/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-19-38.png b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-19-38.png new file mode 100644 index 00000000..182ef657 Binary files /dev/null and 
b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-19-38.png differ diff --git a/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-30-34.png b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-30-34.png new file mode 100644 index 00000000..68e10d53 Binary files /dev/null and b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-30-34.png differ diff --git a/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-34-7.png b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-34-7.png new file mode 100644 index 00000000..377af1d2 Binary files /dev/null and b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-34-7.png differ diff --git a/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-51-31.png b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-51-31.png new file mode 100644 index 00000000..abdd90ff Binary files /dev/null and b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-51-31.png differ diff --git a/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-55-37.png b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-55-37.png new file mode 100644 index 00000000..f8a8219e Binary files /dev/null and b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_10-55-37.png differ diff --git a/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_11-3-9.png b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_11-3-9.png new file mode 100644 index 00000000..6b23dd92 Binary files /dev/null and b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/attachments/image2022-9-7_11-3-9.png differ diff --git 
a/docs/meadowlark-api-design/meadowlark-api-parity-gaps/meadowlark-get-search-pattern-gaps.md b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/meadowlark-get-search-pattern-gaps.md new file mode 100644 index 00000000..459cb7e9 --- /dev/null +++ b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/meadowlark-get-search-pattern-gaps.md @@ -0,0 +1,81 @@ +# Meadowlark - Get Search Pattern Gaps + +## Problem Statement + +The ODS/API supports a "Get search pattern" for retrieving documents for a resource using property values. However, the property values used for search often do not map neatly to properties in the resource document. This is because the search properties actually expose the column names of the underlying relational schema, and those names can follow complex rules. + +### Decision Needed + +> [!WARNING] +> **How important is it for Meadowlark to follow the "published" (de facto) Ed-Fi API specification with respect to query strings?** +> The Ed-Fi API Guidelines: [https://edfi.atlassian.net/wiki/spaces/EFAPIGUIDE/pages/24281161](https://edfi.atlassian.net/wiki/spaces/EFAPIGUIDE/pages/24281161) states "An Ed-Fi REST API *should* support querying capabilities when searching a collection of Resources." Therefore judging by the published guidelines, the query operations are not a strict requirement of an Ed-Fi compatible API. Those guidelines are neutral on the exact subject. Furthermore, the Ed-Fi API specification is not published as standard in itself; however, the API specification surfaced by the ODS/API has become a de facto standard, since it is the only widely available implementation of an API that uses the Ed-Fi Unified Data Model. +> Interestingly, those same API Guidelines suggest that an Ed-Fi API *should* implement a field selection parameter. This is not implemented by today's Ed-Fi ODS/API. + +## Examples and Analysis + +The expression of column naming in the search pattern examples below are in order of increasing complexity. 
+ +### Example 1: Simple and Role Naming on FeederSchoolAssociation + +FeederSchoolAssociation provides two naming examples around schoolId. The document itself has two schoolIds, one as part of a standard School reference named schoolReference and one as part of a School reference with a Feeder role name, named feederSchoolReference: + +![](./attachments/image2022-9-7_10-14-13.png) + +These two schoolIds are searched on in the ODS/API as schoolId and feederSchoolId: + +![](./attachments/image2022-9-7_10-19-38.png) + +This is because in the database implementation role names are used as prefixes to column names, here to differentiate the column names that refer to the two different Schools. + +| Search Field | Document Property | +| -------------- | ------------------------------ | +| schoolId | schoolReference.schoolId | +| feederSchoolId | feederSchoolReference.schoolId | + +### Example 2: Simple Merge on Section + +Section provides an example of a merge scenario where one search field maps to multiple document properties. A Section document has three schoolIds, one each for courseOfferingReference, locationReference, and locationSchoolReference: + +![](./attachments/image2022-9-7_10-30-34.png) + +These three schoolId document properties map to only two schoolIds in the search properties, schoolId and locationSchoolId: + +![](./attachments/image2022-9-7_10-34-7.png) + +This is because in the database implementation the schoolId column is unified as part of the foreign key reference to both the CourseOffering and Location tables. locationSchoolId is due to role naming as previously described in Example 1. 
+ +| Search Field | Document Property | +| ---------------- | --------------------------------------------------------------- | +| locationSchoolId | locationSchoolReference.schoolId | +| schoolId | courseOfferingReference.schoolId and locationReference.schoolId | + +### Example 3: Prefix Variations from Merge, Role Name and Name Collapsing on Grade + +Grade provides an example of four different column naming variations within a single reference. The GradingPeriod reference on Grade is itself role named with GradingPeriod. As seen in Example 1, this typically results in the prefixing of reference properties with the role name, but variations are possible. + +Note that in the document, gradingPeriodReference has the properties gradingPeriodDescriptor, periodSequence, and schoolYear: + +![](./attachments/image2022-9-7_10-55-37.png) + +Additionally, Grade has a schoolId on both gradingPeriodReference and studentSectionAssociationReference: + +![](./attachments/image2022-9-7_10-51-31.png) + +In the search properties, there is an inconsistency of prefixing in relation to these document properties. schoolId and gradingPeriodDescriptor are unchanged (for different reasons) while a "grading" prefix has been added to periodSequence and a "gradingPeriod" prefix has been added to schoolYear: + +![](./attachments/image2022-9-7_11-3-9.png) + +schoolId is unchanged because there is a merge between the schoolId in gradingPeriodReference and studentSectionAssociationReference, so they share the same column in the database. gradingPeriodDescriptor is unchanged because the role name of the reference (gradingPeriod) is collapsed on column names, causing it to effectively be ignored here. schoolYear becomes gradingPeriodSchoolYear for the normal role name reasons. periodSequence becomes gradingPeriodSequence because the overlap of the "period" between periodSequence and role name gradingPeriod is collapsed. 
+ +| Search Field | Document Property | +| ----------------------- | ------------------------------------------------------------------------------- | +| schoolId | gradingPeriodReference.schoolId and studentSectionAssociationReference.schoolId | +| gradingPeriodDescriptor | gradingPeriodReference.gradingPeriodDescriptor | +| gradingPeriodSequence | gradingPeriodReference.periodSequence | +| gradingPeriodSchoolYear | gradingPeriodReference.schoolYear | + +## Options from the Meadowlark Dev Team + +Meadowlark can use the MetaEd relational plugin to get these column names for ODS/API-like search support. We would need to build a mapping from them to document properties for querying as shown in the tables above. In the cases where there is one field for two properties, we will either need to choose one to search on or possibly "OR" the search on both. + +As Meadowlark is a document-oriented implementation, it would make sense for it to continue to support searching based on the structure of the documents themselves. It would be a behavior that would be easy for clients to understand. It's worth considering whether to support both the ODS/API search style as well as a document style. 
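
As a rough illustration of the mapping idea above, the following TypeScript sketch turns ODS/API-style search fields into document property paths, using the "OR" option for merged fields. The names `SearchFieldMapping`, `gradeSearchFields`, and `toDocumentQuery` are hypothetical and not part of Meadowlark; the mapping entries are copied from the Grade table in Example 3, and in practice they would presumably be derived from the MetaEd relational plugin rather than written by hand.

```typescript
// Hypothetical mapping from ODS/API-style search fields to document property paths.
// A search field that results from a column merge maps to more than one path.
type SearchFieldMapping = { [searchField: string]: string[] };

// Entries taken from the Grade table (Example 3); the structure itself is an assumption.
const gradeSearchFields: SearchFieldMapping = {
  schoolId: ['gradingPeriodReference.schoolId', 'studentSectionAssociationReference.schoolId'],
  gradingPeriodDescriptor: ['gradingPeriodReference.gradingPeriodDescriptor'],
  gradingPeriodSequence: ['gradingPeriodReference.periodSequence'],
  gradingPeriodSchoolYear: ['gradingPeriodReference.schoolYear'],
};

type Term = { path: string; value: string };

// Each inner array is a group of alternatives joined by OR; the groups are joined by AND.
type DocumentQuery = Term[][];

// Translates a parsed query string into a document-style query.
// Unknown search fields are rejected rather than silently ignored.
function toDocumentQuery(
  mapping: SearchFieldMapping,
  queryString: { [field: string]: string },
): DocumentQuery {
  const query: DocumentQuery = [];
  for (const field of Object.keys(queryString)) {
    if (!Object.prototype.hasOwnProperty.call(mapping, field)) {
      throw new Error(`Unknown search field: ${field}`);
    }
    const value = queryString[field];
    query.push(mapping[field].map((path) => ({ path, value })));
  }
  return query;
}
```

With this shape, a search on `schoolId` produces one group with two alternative paths, while a search on `gradingPeriodSequence` produces a group with a single path. Choosing only one of the merged properties instead of "OR"-ing them would amount to trimming each group to its first entry.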
diff --git a/docs/meadowlark-api-design/meadowlark-api-parity-gaps/meadowlark-metadata-endpoint-gaps.md b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/meadowlark-metadata-endpoint-gaps.md new file mode 100644 index 00000000..6abcc749 --- /dev/null +++ b/docs/meadowlark-api-design/meadowlark-api-parity-gaps/meadowlark-metadata-endpoint-gaps.md @@ -0,0 +1,128 @@ +# Meadowlark - Metadata Endpoint Gaps + +## Overview + +The ODS/API metadata endpoints that Meadowlark supports are as follows: + +* ODS/API [version](https://api.ed-fi.org/v5.3/api/) +* OpenAPI [endpoint list](https://api.ed-fi.org/v5.3/api/metadata/) +* OpenAPI for [resources](https://api.ed-fi.org/v5.3/api/metadata/data/v3/resources/swagger.json) +* OpenAPI for [descriptors](https://api.ed-fi.org/v5.3/api/metadata/data/v3/descriptors/swagger.json) +* ODS/API [dependency ordering](https://api.ed-fi.org/v5.3/api/metadata/data/v3/dependencies) + +As of Meadowlark 0.2.0, these endpoints are largely hardcoded to appear as ODS/API v5.3 with Data Standard 3.3.1-b, and need to be changed over to being self-generated. + +## ODS/API Version Endpoint + +The "version endpoint" is not defined as a formal standard, though this analysis has prompted a request to standardize the API. 
+ +Example from the .NET Ed-Fi ODS/API, version 5.3: + +```json +{ + "version": "5.3", + "informationalVersion": "5.3", + "suite": "3", + "build": "5.3.1434.0", + "apiMode": "Sandbox", + "dataModels": [ + { + "name": "Ed-Fi", + "version": "3.3.1-b", + "informationalVersion": "Latest Ed-Fi Data Model v3.3b" + }, + { + "name": "TPDM", + "version": "1.1.0", + "informationalVersion": "TPDM-Core" + } + ], + "urls": { + "dependencies": "https://api.ed-fi.org/v5.3/api/metadata/data/v3/dependencies", + "openApiMetadata": "https://api.ed-fi.org/v5.3/api/metadata/", + "oauth": "https://api.ed-fi.org/v5.3/api/oauth/token", + "dataManagementApi": "https://api.ed-fi.org/v5.3/api/data/v3/", + "xsdMetadata": "https://api.ed-fi.org/v5.3/api/metadata/xsd" + } +} +``` + +### Option 1: Completely Match the .NET ODS/API + +The version endpoint describes the ODS/API version, API mode (e.g. "Sandbox"), data models available (e.g. Data Standard 3.3.1-b) and a list of URLs to the other parts of the API. The ODS/API version and the available data models are the only parts that need to be made dynamic. Meadowlark has the data models available. ODS/API versions map to Ed-Fi data model versions, so we will need either a hardcoded mapping or an environment variable specifying which ODS/API version to report. An environment variable solution would avoid releasing hardcoded updates at each new ODS/API release, but would require additional knowledge on the part of the deployment team, such as which ODS/API versions are required for a particular data standard version. + +### Option 2: Redefine the Standard + +1. Required Elements + 1. **version**: software version number + 2. **dataModels**: list of the supported data models, each taking the form + 1. **name**: Ed-Fi or extension project name + 2. **version:** the data model version number + 3. **informationalVersion:** text description of the version + 3. **urls**: + 1. **dependencies**: a document that lists the dependency order of resources, e.g. 
so that a client application or human reader will be able to answer questions like "what resources must I load before I can load a StudentEducationOrganizationAssociation?" + 2. **openApiMetadata:** a URL to a JSON document that contains links to Open API specification documents + 3. **oauth:** The OAuth 2 token endpoint for authentication + 4. **dataManagementApi**: base endpoint for all Ed-Fi API routes + 5. **xsdMetadata**: a document containing links to XSD files + + > [!WARNING] + > Consider omitting XSD from the standard, although it is very useful for file uploads using the API Client Bulk Loader. + > Implication: there is only a single URL for the resources, meaning that we cannot be actively running two different data standards in the same deployment. + +2. Customized-elements + 1. For convenience, a Version endpoint may contain other fields that are not listed above, but they should not be interpreted as part of the standard API shape. + +Example: + +```json +{ + "version": "0.2.0", + "dataModels": [ + { + "name": "Ed-Fi", + "version": "3.3.1-b", + "informationalVersion": "Latest Ed-Fi Data Model v3.3b" + } + ], + "urls": { + "dependencies": "https://example.com/stg/metadata/dependencies", + "openApiMetadata": "https://example.com/stg/metadata/", + "oauth": "https://example.com/stg/oauth/token", + "dataManagementApi": "https://example.com/stg/api/v3.3-1b/", + "xsdMetadata": "https://example.com/stg/metadata/xsd" + } +} +``` + +## OpenAPI Endpoint List + +This is a simple listing of the resource and descriptors OpenAPI endpoints, and does not require updating. + +## OpenAPI for Resources and Descriptors + +This is the OpenAPI specification of the ODS/API resource and descriptors endpoints, which are very large JSON documents entirely dependent on the data models in use. As of Meadowlark 0.2.0, they are just a copy of the ODS/API v5.3 endpoint JSON. It will be a fair amount of work to generate OpenAPI dynamically, as described below. 
+ +### JSON Schema → OpenAPI + +Development versions of Meadowlark 0.3.0 currently generate JSON Schema version 2020-12 descriptions of resource documents. OpenAPI has always used a resource document description based on some version JSON Schema, but sometimes with differences. However, the most recent version of OpenAPI (v3.1 released Feb 2021) has adopted JSON Schema 2020-12 for describing resource documents, which solves an important piece of OpenAPI generation. (References: [What's New in OpenAPI 3.1](https://nordicapis.com/whats-new-in-openapi-3-1-0/), [Validating OpenAPI and JSON Schema](https://json-schema.org/blog/posts/validating-openapi-and-json-schema)) + +A missing piece of current JSON Schema generation versus the ODS/API's OpenAPI spec is that Meadowlark's JSON Schema generation does not reuse pieces of schema via "$ref" references. For example, in the ODS/API OpenAPI spec, references in a document like "schoolReference" are a $ref to an "edFi\_schoolReference" JSON schema object, while Meadowlark generates the full schoolReference definition everywhere it is used. This is fine for internal Meadowlark use but would bloat a stringified version of the schema in OpenAPI form. + +### OpenAPI Documentation + +The ODS/API provides interactive API documentation via embedded Swagger UI. However Swagger UI from SmartBear does not yet support OpenAPI 3.1, so a different API documentation provider will need to be used. Documentation providers (including the new Swagger UI in development) seem to be converging on a different look-and-feel from the original SwaggerUI. A popular one that is CDN hosted and thus trivial to use is [Stoplight Elements](https://github.com/stoplightio/elements). + +### x-Ed-Fi-isIdentity + +The ODS/API OpenAPI spec includes "x-Ed-Fi-isIdentity" as an OpenAPI extension to tag fields that are part of the document identity. This is not part of the generated JSON Schema and would need to be mixed in somehow to preserve that tagging behavior. 
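
One possible way to mix the identity tagging into generated schema is a post-processing pass over the JSON Schema, sketched below in TypeScript. This is only an illustration under stated assumptions: `SchemaNode` is a deliberately minimal stand-in for a real JSON Schema type, `tagIdentityFields` is a hypothetical helper, and the list of identity paths is assumed to come from information Meadowlark already has about each resource. Placing the extension on the property's own schema object is also an assumption about where the tag belongs.

```typescript
// Minimal JSON Schema shape for this sketch; real schemas carry many more keywords.
interface SchemaNode {
  type?: string;
  properties?: { [name: string]: SchemaNode };
  'x-Ed-Fi-isIdentity'?: boolean;
}

// Returns a deep copy of the schema with the extension set on each identity property.
// identityPaths is a hypothetical input such as ['schoolReference.schoolId'].
function tagIdentityFields(schema: SchemaNode, identityPaths: string[]): SchemaNode {
  // JSON round-trip is enough for a deep copy because schemas are plain JSON.
  const copy: SchemaNode = JSON.parse(JSON.stringify(schema));
  for (const path of identityPaths) {
    let node: SchemaNode | undefined = copy;
    // Walk down through nested "properties" objects one path segment at a time.
    for (const segment of path.split('.')) {
      node = node && node.properties ? node.properties[segment] : undefined;
    }
    if (node === undefined) {
      throw new Error(`Identity path not found in schema: ${path}`);
    }
    node['x-Ed-Fi-isIdentity'] = true;
  }
  return copy;
}
```

Because the helper returns a copy, the schema used internally for validation stays untouched and only the OpenAPI-facing version carries the extension.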
+ +### Validation of Correctness + +Once Meadowlark is generating OpenAPI we'll want to validate it for correctness in tests, much like we do with generated JSON Schema by running it through ajv. A good choice would be to use [Spectral](https://meta.stoplight.io/docs/spectral/674b27b261c3c-overview), which is a JSON linter with an OpenAPI ruleset. + +## ODS/API dependency ordering + +The ODS/API dependency ordering endpoint provides a list of all resources and descriptors in dependency order, meaning the load order required to avoid referential integrity issues. The ordering derives from both the model itself and any additional constraints from the mode of security used. Security won't be an issue for Meadowlark for the time being. + +The ODS/API provides the dependencies in groupings where any resource in the group can be loaded without issue, as opposed to an absolute ordering for each resource. This allows loaders to work in parallel for resources in the same group. It would be nice for Meadowlark to provide the same behavior, though not mandatory. diff --git a/docs/meadowlark-api-design/meadowlark-data-store-transaction-handling.md b/docs/meadowlark-api-design/meadowlark-data-store-transaction-handling.md new file mode 100644 index 00000000..ef95cab0 --- /dev/null +++ b/docs/meadowlark-api-design/meadowlark-data-store-transaction-handling.md @@ -0,0 +1,53 @@ +# Meadowlark - Data Store Transaction Handling + +## Design Question + +Meadowlark's backend design pattern relies on document-oriented datastores that provide atomic transaction support. As such, there are times when an incoming request will fail due to race conditions / locking. How should Meadowlark handle these failures? + +## Scenarios for a Single Resource + +Due to [Meadowlark's ownership authorization model](../meadowlark-security/meadowlark-data-authorization.md), these scenarios cannot occur for two different Client ID's trying to access the same resource. 
However, there exists a possibility that two different application threads could be operating in parallel with the same Client ID. + +### Upsert Failure Due to Delete + +* Thread A submits an update request for a particular resource via a POST request (upsert) +* Thread B submits a delete request for that same resource +* Thread B's request starts processing before Thread A's. + +The scenario may sound implausible at first, but consider this possibility: instead of sending a PUT request to update a record, one client application might submit a DELETE and then a POST request. If these requests are several seconds apart, then most likely they will succeed: the record is deleted, and then a new one is inserted. + +However, there is a remote possibility that the requests are processed so close in time that the POST request fails because the DELETE is still processing. + +In this case, it would be appropriate to retry the request rather than simply sending a failure message back to the client. The number of retries could be configurable, with a default value of 1. If the retries fail, then responding with 409 "conflict" would be sensible. + +→ RETRY, 409 + +### Update Failure Due to Delete + +Similar to the scenario above (Upsert Failure Due to Delete), except with a PUT request instead of a POST request. In this case, it would be more appropriate to respond to the client with 404 "does not exist". + +→ 404 + +## Scenarios Around Reference Checks + +These scenarios could be from parallel threads using the same Client ID, or from two different Client IDs: referential integrity checks are independent of the authorization scheme. 
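
The retry-then-409 handling proposed for the upsert scenario could be sketched as below. This is a simplified, synchronous TypeScript illustration, not Meadowlark's implementation: `AttemptOutcome` and `runWithRetry` are hypothetical names, and a real backend call would be asynchronous. The retry count defaults to 1, and zero is accepted to mean "do not retry".

```typescript
// Hypothetical outcome of a single datastore transaction attempt.
type AttemptOutcome = 'SUCCESS' | 'WRITE_CONFLICT';

// Runs the attempt once, then retries up to maxRetries more times on a write conflict.
// Returns the HTTP status to send: successStatus on success, 409 once retries are exhausted.
function runWithRetry(
  attempt: () => AttemptOutcome,
  successStatus: number,
  maxRetries: number = 1,
): number {
  for (let tries = 0; tries <= maxRetries; tries += 1) {
    if (attempt() === 'SUCCESS') {
      return successStatus;
    }
  }
  return 409;
}
```

With `maxRetries` set to 0 the attempt runs exactly once, which matches the "do not retry transaction" configuration; with the default of 1 it runs at most twice before the client sees 409.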
+ +### Delete Failure Due to New Reference + +* Thread A submits a delete request for a particular resource +* Thread B submits a separate request that has a reference to the to-be-deleted resource +* Thread B's request completes before Thread A + +This should result in a 409 "conflict" response with an appropriate message indicating that the resource is not deletable, unless a lock timeout occurred. If there was a lock timeout, a retry would be appropriate. + +→ RETRY, 409 + +### Update Failure Due to Deleted Reference + +The backend code logic should be locking reference records so that they cannot be deleted in the scope of the transaction, see [Meadowlark - Referential Integrity in Document Databases](../../project-meadowlark-exploring-next-generation-technologies/meadowlark-data-storage-design/meadowlark-referential-integrity-in-document-databases.md). + +### Solution + +If the transaction fails due to a lock, retry it N times. N should be configurable, but default to 1 (retry once). Zero should be a valid configuration ("do not retry transaction"). + +Then ensure that 409 is the response if retry fails. diff --git a/docs/meadowlark-api-design/meadowlark-error-messages.md b/docs/meadowlark-api-design/meadowlark-error-messages.md new file mode 100644 index 00000000..5d357f09 --- /dev/null +++ b/docs/meadowlark-api-design/meadowlark-error-messages.md @@ -0,0 +1,64 @@ +# Meadowlark - Error Messages + +As of Meadowlark 0.4.0, this is a comprehensive list of the 400 and 500 ranges of error messages Meadowlark responds with along with the reasons for those responses. This does not include 500 Internal Server Error itself as these are internal failures for a variety of reasons, usually from third-party library exceptions. This also does not include various security-related reasons behind 404 responses. 
+ +## Authorization Failure + +| Reason | HTTP Code | Message | +| --- | --- | --- | +| Invalid Authorization header | 400 | { error: 'Invalid authorization header' } | +| Provided JWT is not well-formed | 401 | None | +| Provided JWT is inactive | 401 | None | +| Oauth server is unreachable | 502 | None | +| Not authorized for the given document | 403 | None | + +## Endpoint Validation + +| Reason | HTTP Code | Message | +| --- | --- | --- | +| Resource endpoint is invalid | 404 | { error: 'Invalid resource XZY. The most similar resource is XYZ' } | + +## Body Validation + +| Reason | HTTP Code | Message | +| --- | --- | --- | +| No body on POST or PUT | 400 | { error: 'Missing body' } | +| Malformed body | 400 | { error: 'Malformed body: <>' } | +| Malformed document UUID on PUT, GET, DELETE | 404 | None | +| JSON Schema validation failure | 400 | { error: \[<>, <>, …\] }
where a validationError looks like { message: <>, path: JSONPath, context: <> } | +| Two document values must be equal but are not | 400 | { error: \['Constraint failure: document paths <> and <> must have the same values', …\] } | + +## Query Validation + +| Reason | HTTP Code | Message | +| --- | --- | --- | +| Limit and offset must be non-negative | 400 | {error: 'Must be set to a numeric value >= 0'} | +| Limit required with offset | 400 | {error: 'Limit must be provided when using offset'} | +| Query includes property not on resource | 400 | { error: 'The request is invalid.', modelState: \['<> does not include property <>**', …**\] } | +| Query server is unreachable | 502 | None | + +## Delete Failure + +| Backend Response Error | Reason | HTTP Code | Message | +| --- | --- | --- | --- | +| DELETE\_FAILURE\_REFERENCE | Attempt to delete document referenced by other documents | 409 | { error: { message: 'The resource cannot be deleted because it is a dependency of other documents', blockingUris: \['/v5.0-pre.2/edfi/students/a-referencing-document-uuid', …\] | +| DELETE\_FAILURE\_WRITE\_CONFLICT | Attempt to modify a document in a transaction that a concurrent transaction has modified | 404 | { error: { message: 'Write conflict due to concurrent access to this or related resources' } | + +## Update Failure + +| Backend Response Error | Reason | HTTP Code | Message | +| --- | --- | --- | --- | +| \- | Id not in URL | 400 | None | +| \- | Id field in document does not match id in URL | 400 | { error: { message:  'The identity of the resource does not match the identity in the updated document.' 
} | +| UPDATE\_FAILURE\_REFERENCE | Submitted document references one or more non-existent documents | 409 | { error: { message: 'Reference validation failed', failures: \[ {resourceName: 'Student', identity: { studentId: '123' }, …\] } } | +| UPDATE\_FAILURE\_IMMUTABLE\_IDENTITY | Attempt to modify document identity on a resource where the identity cannot be changed | 400 | { error: { message: 'The identity fields of the document cannot be modified' } } | +| UPDATE\_FAILURE\_CONFLICT | Attempt to change the identity of a document to the identity of an existing document with the same resource superclass | 409 | { error: { message: 'Update failed: the identity is in use by 'School' which is also a(n) 'EducationOrganization', blockingUris:  \['/v5.0-pre.2/edfi/schools/a-school-document-with-same-identity', …\] } } | +| UPDATE\_FAILURE\_WRITE\_CONFLICT | Attempt to modify a document in a transaction that a concurrent transaction has modified | 409 | { error: { message: 'Write conflict due to concurrent access to this or related resources' } | + +## Upsert Failure + +| Backend Response Error | Reason | HTTP Code | Message | +| --- | --- | --- | --- | +| UPDATE\_FAILURE\_REFERENCE /

INSERT\_FAILURE\_REFERENCE | Submitted document references one or more non-existent documents | 409 | { error: { message: 'Reference validation failed', failures: \[ {resourceName: 'Student', identity: { studentId: '123' }, …\] } } | +| INSERT\_FAILURE\_CONFLICT | Attempt to insert a document with the identity of an existing document with the same resource superclass | 409 | { error: { message: 'Insert failed: the identity is in use by 'School' which is also a(n) 'EducationOrganization', blockingUris:  \['/v5.0-pre.2/edfi/schools/a-school-document-with-same-identity', …\] } } | +| UPSERT\_FAILURE\_WRITE\_CONFLICT | Attempt to modify a document in a transaction that a concurrent transaction has modified | 409 | { error: { message: 'Write conflict due to concurrent access to this or related resources' } | diff --git a/docs/meadowlark-api-design/meadowlark-leveraging-metaed-for-api-parity.md b/docs/meadowlark-api-design/meadowlark-leveraging-metaed-for-api-parity.md new file mode 100644 index 00000000..2c17ca44 --- /dev/null +++ b/docs/meadowlark-api-design/meadowlark-leveraging-metaed-for-api-parity.md @@ -0,0 +1,63 @@ +# Meadowlark - Leveraging MetaEd for API Parity + +A primary goal of Meadowlark is to determine how to achieve rough parity with the existing ODS/API using NoSQL and cloud native technologies. The current architecture does this by leveraging Ed-Fi's MetaEd technology and taking a document-centric view of the existing API. + +## MetaEd + +Meadowlark makes extensive use of the backend [MetaEd IDE](https://docs.ed-fi.org/reference/metaed) code to interpret MetaEd model files, thus allowing the application to build an API surface from a Data Model at runtime instead of at compile time. This means that one application can support multiple data standards at the same time (that is, assuming that the [data storage design](../meadowlark-data-storage-design/meadowlark-dynamodb.md) will support this). 
+ +MetaEd is a domain-specific language created by the Ed-Fi Alliance to describe Ed-Fi core and extension data models. A single .metaed file describes a single data model entity. A MetaEd project is a set of .metaed files that together make up a complete model. MetaEd projects are versioned, and one project can extend another project. + +![](./attachments/section.metaed.png) + +## Document-centricity + +The design of the existing API surface follows the Domain-Driven Design of the Ed-Fi model. Its clean separation into domains means that it is well suited to be implemented in a document-centric fashion. A client submits a document representing an entity and gets back an ID to reference it in the future. Getting the document by ID for an entity returns the same document. Queries on entities by document fields return a series of documents in much the same way a document search engine would. + +However, the API differs from a pure document store in three fundamental ways. + +First and most obviously, the API imposes a specific structure (schema) for each document representing a particular type of entity. Each resource endpoint in the API represents a single document type, each with its own structure. + +![](./attachments/section.endpoint.png) + +Second, each document has a concept of uniqueness and identity separate from its generated resource ID. Each entity has some number of fields considered part of the identity of the entity. Together these fields uniquely identify a document. This maps to the "natural key" nature of the Ed-Fi model. + +![](./attachments/section.identity.png) + +Finally, a document that references another entity is only valid if the referenced document is already stored in the API. Document references specify the identity fields ("natural key") of the entity being referenced. The API enforces reference validity on creates and updates, and disallows deletes that would cause a reference on an existing document to become invalid. 
+ +![](./attachments/section.reference.fields.png) + +## Leveraging MetaEd for Validation + +A MetaEd project contains all of the information needed to dynamically create a document-centric API that enforces relationship integrity. The architecture leverages MetaEd's plugin technology to load MetaEd projects and construct the same internal object model used by the MetaEd IDE. Meadowlark uses this model dynamically to achieve parity with the ODS/API for any MetaEd project, such as any version of the Ed-Fi core model. + +![](./attachments/image2022-5-20_10-31-10.png) + +### Resource Endpoints + +Resource endpoint validation is the first and simplest part of achieving ODS/API parity. Meadowlark API requests are proxied to a validator that reads the URL and validates that the resource endpoint matches an entity in the internal model. Endpoints that do not are rejected. When possible, a "did you mean?" suggestion is provided in the error response. + +![](./attachments/did.you.mean.png) + +### Schema Validation + +Next, Meadowlark needs to validate that the shape of a POST/PUT body matches that of the ODS/API. The Meadowlark internal object model includes pre-computed JSON schema validators for each resource, using the Joi data validation framework. The relevant validator is applied to the POST/PUT body. A validation failure includes detailed information on how the document differs from the correct shape. + +![](./attachments/schema.failure.png) + +### Identity Extraction + +Next, Meadowlark needs to extract the identity from the POST/PUT body to compare against documents already in the datastore. The Meadowlark internal object model includes a pre-computation of the location of the identity fields for each entity body. These are extracted and form the first part of a datastore transaction that checks for existence already in the datastore. 
The current behavior is to reject POST requests where there is already an entity with that identity in the datastore, and to reject PUT requests that attempt to change an entity to an identity that already exists. + +The pre-computation consists of two phases. In the first phase (API mapping enhancement), the entire internal object model is annotated with a mapping from the MetaEd entity and property structure and naming to that of the API body for each entity. An excerpt of the annotation code for each entity and property is shown below. + +![](./attachments/image2020-6-9_9-33-27.png) + +In the second phase (reference component enhancement), every reference property in the internal model is annotated with the location in the body of the identities making up that reference. An excerpt of that annotation code is shown below. + +![](./attachments/image2020-6-9_9-38-11.png) + +### Reference Checking + +Meadowlark allows for reference checking to be enabled or disabled via a "strict-validation" header with reference checking on by default to mirror ODS/API behavior while opening up new use case possibilities. This is done using the same  pre-computation of the location of all reference identity fields for each entity body. When enabled, the reference fields of the POST/PUT body are extracted and added as condition checks to the datastore transaction. A validation failure includes the first failed reference check from the datastore. DELETE reference checking to disallow other references to become invalid is not currently implemented in the datastore but should be straightforward to add. 
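
The identity extraction step described above can be illustrated with a small TypeScript sketch. It assumes the pre-computed identity locations are available as dot-separated paths into the request body; `JsonObject`, `valueAtPath`, and `extractIdentity` are hypothetical names chosen for this example and do not reflect Meadowlark's actual internal object model.

```typescript
type JsonObject = { [key: string]: unknown };

// Reads a dot-separated path such as 'courseOfferingReference.schoolId' from a body.
// Returns undefined if any segment along the way is missing.
function valueAtPath(body: JsonObject, path: string): unknown {
  let current: unknown = body;
  for (const segment of path.split('.')) {
    if (typeof current !== 'object' || current === null) {
      return undefined;
    }
    current = (current as JsonObject)[segment];
  }
  return current;
}

// Extracts the identity of a document using paths computed ahead of time from the model.
function extractIdentity(body: JsonObject, identityPaths: string[]): JsonObject {
  const identity: JsonObject = {};
  for (const path of identityPaths) {
    const value = valueAtPath(body, path);
    if (value === undefined) {
      throw new Error(`Missing identity field: ${path}`);
    }
    identity[path] = value;
  }
  return identity;
}
```

The same path-based lookup could serve reference checking: running it over the pre-computed locations of each reference's identity fields would yield the values to add as condition checks in the datastore transaction.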
diff --git a/docs/meadowlark-api-design/meadowlark-response-codes.md b/docs/meadowlark-api-design/meadowlark-response-codes.md new file mode 100644 index 00000000..bd6a7818 --- /dev/null +++ b/docs/meadowlark-api-design/meadowlark-response-codes.md @@ -0,0 +1,22 @@ +# Meadowlark - Response Codes + +The Meadowlark API emits the following response codes: + +| Verb | Code | Scenario | How to Resolve | +| --- | --- | --- | --- | +| All | 500 | Internal server error - something unexpected happened inside the server, which the API client can't resolve. | System administrator should inspect logs to troubleshoot and try to correct the error. | +| GET | 200 "OK" | Request was accepted and an appropriate response returned.

> [!TIP]
> When performing a query, as opposed to getting a specific resource, the response will be an empty array with 200 status code if there are no documents to return, rather than a 404 "not found". | n/a​ | +| GET | 400 "Bad Request" | Invalid query string parameters | Client should inspect and correct the query string parameters | +| GET | 404 "Not Found" | The requested item does not exist. | Client may have stored a resource identifier incorrectly, and may need to lookup *all records* to find the right identifier. | +| POST | 200 "OK" | A POST request on an *existing*  item ("upsert") was processed successfully. | n/a | +| POST | 201 "Created" | The new item has been created. | n/a | +| POST | 400 "Bad Request" | The submitted document is not valid. For example:

* Invalid content-type header
* Bad document format (invalid JSON)
* A missing property
* A property with invalid data type

The response body will have a detailed message describing the error. | Client needs to inspect the detailed message and likely needs to make code-level corrections. | +| POST | 404 "Not Found" | The resource (i.e. Ed-Fi entity) does not exist | Client needs to check the URL for a typo. Can check the Open API specification - available through the root endpoint - for a list of available resources. | +| POST | 409 "Conflict" | Cannot insert or update the item, because of a missing reference (e.g. posted a StudentEducationOrganizationAssociation for a Student that does not exist yet).

or, Cannot insert the item, because the natural key already exists on a different entity in the same superclass hierarchy (e.g. cannot create a Local Education Agency with a `localEducationAgencyId`  that matches an existing School's `schoolId` ).

The response body describes the missing reference(s). | Client needs to insert the "upstream" reference first before re-trying. | +| PUT | 204 "No Content" | The item has been updated, and the response body does not contain any content. | n/a | +| PUT | 400 "Bad Request" | The submitted document is not valid. For example:

* Invalid content-type header
* Bad document format (invalid JSON)
* A missing property
* A property with invalid data type

The response body will have a detailed message describing the error. | Client needs to inspect the detailed message and likely needs to make code-level corrections. | +| PUT | 404 "Not Found" | The resource (i.e. Ed-Fi entity) does not exist, *or* the specific item does not exist | Client needs to check the URL for a typo. Can check the Open API specification - available through the root endpoint - for a list of available resources. | +| PUT | 409 "Conflict" | Cannot update the item, because of a missing reference (e.g. posted a StudentEducationOrganizationAssociation for a Student that does not exist yet).

The response body describes the missing reference(s). | Client needs to insert the "upstream" reference first before re-trying. | +| DELETE | 204 "No Content" | The item has been deleted, and the response body does not contain any content. | n/a | +| DELETE | 404 "Not Found" | The resource (i.e. Ed-Fi entity) does not exist, *or* the specific item does not exist | Client needs to check the URL for a typo. Can check the Open API specification - available through the root endpoint - for a list of available resources. | +| DELETE | 409 "Conflict" | Cannot delete the item, because it is a dependency of another item (e.g. cannot delete a Student if there is a StudentEducationOrganizationAssociation that references that student).

The conflict is described in the payload. | Client can either delete the "upstream" dependency first, or abandon the effort to delete the item. | diff --git a/docs/meadowlark-data-storage-design/attachments/identifiers b/docs/meadowlark-data-storage-design/attachments/identifiers new file mode 100644 index 00000000..5abf1ba6 --- /dev/null +++ b/docs/meadowlark-data-storage-design/attachments/identifiers @@ -0,0 +1 @@ +7VrbcqM4EP0aP9rFxRh49CXOpGqyk51kd7L7JoNsqACihBjH+/XbMuIqsLHHdmZr1nkIaqSW6HM4arU90Ofh+z1FsfdIXBwMNMV9H+iLgaaphqXBP27ZCQv8ZZYN9V1hKw3P/j9YGBVhTX0XJ7WOjJCA+XHd6JAowg6r2RClZFvvtiZBfdYYbcSMSml4dlCApW7ffJd5mdUyKr0/YX/j5TOrirgToryzMCQecsm2YtLvBvqcEsKyq/B9jgMevTwu2bhlx91iYRRHrM+AV//rYruc3ZE5mT8qv23/tt/WQ+HlOwpS8cCPmC8zQPRtoE0C8DxbUbja8Kvp04N4FrbLAwSPFfPLNAw++2sc+BG0ZjGmfogZpnAnEOan0jbbej7DzzFy+NAtcAdsHgsDaKlwCWgyBENo0Q4CFCf+aj+rAhaKnZQm/nf8FScZabiVpIzPNC/IwI1yoPKnxpTh94pJBO4eE1gn3UEXcXcITM7G7HKDKWDdlqwosPcqjDByIxJM3BTeS7DgQuB1AnaahB3Q1kkDxDAAhkIe02iVxPsQNKEMC5gfXAlT7MI7IJqEMo9sSISCu9JaxQoF/ibiMOM1d5wAqH60+bxvLSCmMxy5U/4mQnMVEOeNo0fSyMWuAAhwobtX4W7f+KsTukMs5svuhaYyUpSJnQ2iGAIGPKorRwtWwt0T8WEtha9cCAQvbHCt130kJKUOFsNKyCEkaFfpFvMOSfdEVtdEJYUylx3rNGrDQadGimYXH8uuL5ohusFMWnTpNu9I1usEs0GTy0XQz6e3LtF7jZnjgWnF43aQwBV68pfchxdjKmi6IoyRkBMVnpHlxCQwTW5b+kFQkFGi7r6L2Kesy1O0IicT5RoMVfWGhw5+Sn50zR41SKQrI7v66cWhS/FjLPGj796EOCP8GO0Bg3tThxH6cZtWk6GMxKWOvvDGQp9cc3PTGxRp3dy0m+5txkl5yQIxBL2fAUb8ayYoas7+j8lP1p7i//5l96C/fJvYY2P6afb6Z4uAJyx18V5TXOKkIRbCcKpyV1SZxBhuzVyUePt0Qoi2pM8N3e0h2IKBmYgdkyEZlx8VbXWUJygC0bHdwKmvbgMVbC7TE8NQTW2s6qZ22HGHbp+asAxVzRxZautcrTnLpTaGicS7py/PLwMuYEvsDtc+/BdMTH6uzEEEK0P2+Nb3AcnEUDOtkWqVZLLrKcHQrqcEzbyyP2kVa9R03VjtlfMLU6IRShLOBAgiilxggrQP5ar2R+r/pw5VfZPTHyUPgFpPIA2jdgqxzV5sOVmJzF7T9jtIwUPodW9m3Ztp3JSmlkRTipOYRImcCV1pb21h0tFd80aMUxt73ZAfeysfVR2fp0+25PemoNvdoOvTAS9zghRI+Fcy4SVFIT6e5nYCfErpbNx4++TEtDiAVBNT61pni3wBndGLCLtgDPPnNLn2Fg+oX6wy2Qxv3/iqTc5eLsByVTmN3X1ZUoFQhT9/CefjqjXSHmnYo4lx8hZ5VAkNu56nWZObClj+LVC1eE1xlSL/06GVDlrzuH4uAcDRSNcrFNB6EeCWFWBVriAAZBBOEmVC3X
qgWxZffgRyTi7u7LP1D8yPbpFpN45P2vjMVGeoj5Xakc887PdS5QPNslonulTtoLVmJe9c93edpYPrMu2ianaw/nU7NQMq1TE1zy5qNXWx6elyG1grUeT9q/bt1MFKwK+Kvqo0pOPckiboUFPdehYxj8MPzfI3GVn38qct+t2/ \ No newline at end of file diff --git a/docs/meadowlark-data-storage-design/attachments/identifiers.png b/docs/meadowlark-data-storage-design/attachments/identifiers.png new file mode 100644 index 00000000..b76a5a59 Binary files /dev/null and b/docs/meadowlark-data-storage-design/attachments/identifiers.png differ diff --git a/docs/meadowlark-data-storage-design/attachments/image2021-12-6_14-10-12.png b/docs/meadowlark-data-storage-design/attachments/image2021-12-6_14-10-12.png new file mode 100644 index 00000000..17ea681c Binary files /dev/null and b/docs/meadowlark-data-storage-design/attachments/image2021-12-6_14-10-12.png differ diff --git a/docs/meadowlark-data-storage-design/attachments/image2022-7-5_14-48-34.png b/docs/meadowlark-data-storage-design/attachments/image2022-7-5_14-48-34.png new file mode 100644 index 00000000..cc51b629 Binary files /dev/null and b/docs/meadowlark-data-storage-design/attachments/image2022-7-5_14-48-34.png differ diff --git a/docs/meadowlark-data-storage-design/attachments/image2023-2-22_11-26-30.png b/docs/meadowlark-data-storage-design/attachments/image2023-2-22_11-26-30.png new file mode 100644 index 00000000..06b89b20 Binary files /dev/null and b/docs/meadowlark-data-storage-design/attachments/image2023-2-22_11-26-30.png differ diff --git a/docs/meadowlark-data-storage-design/attachments/offline workflow b/docs/meadowlark-data-storage-design/attachments/offline workflow new file mode 100644 index 00000000..4b07bb00 --- /dev/null +++ b/docs/meadowlark-data-storage-design/attachments/offline workflow @@ -0,0 +1 @@ 
+3VhZU9swEP41eQzjM2kec9BjSktmgIH2TbFlR41tubJM4v76riz5UBQgMKEwfbK1Wq3k/b495IE7T3efGMrX32iIk4FjhbuBuxg4jj32J/AQkkpKPvieFMSMhEqpE1yRP1gJLSUtSYgLTZFTmnCS68KAZhkOuCZDjNGtrhbRRN81R7Ha0eoEVwFKsKF2S0K+br6ip/0Zk3jd7GxbaiZFjbISFGsU0m1P5J4P3DmjlMu3dDfHiXBe4xe57uMDs+3BGM74MQtyZ/M7GrOY3QQs+Om511+z26Gyco+SUn3wggZlKkwKc6METM9WTByfV8ono9+lOPMsohkfFjViU1CwvRxQn3Xz8BaL5wJxBPNXnDKsXNHacsAruXgNqoRkIWYuLNiuCcdXOQrExBZ4BbI1TxMY2eI4tATN8GLVClCwiZmQXpYczGAll2db2L46rGKX7dWH2/de4wrMON71RMqbnzBNMWcVqDSzrkJWUbvhxLbjyUSJ1j2KNDKkmBm3hjvw4EXh9wwsXQPLMg8Rx30YJSJhi/H+TERZillSmTObjG4zeWxjbjCewcnGCwNdHEIwqSFlfE1jmqHkvJPOmAQT5i0YdToXlOYKxl+Y80phh0pOdTYcDWxBSxbgR/znqPSCWIz50zEjvu1RmjCcIE7u9URyctR9A/UMb3vBK/wjEh8K9OidgzsIFpH9XejvB+0GV0OdPnA6aUthbrAA39eUensK4B3hd8Lcma9GP5SmeF/s1E71oOoNlpgRAAWcImUvp5J9JJW8I6mkMs7QOrNHo7FcdDS9anNTxlDVU8gpyXihF4ylkPV0aBQVmBsMbc/1ctI6Bmm/YVEdE8Q2IJ8uvxg80lnyRIl4hWw/tEd6uh/5Rrq3nQP5fvRa+d4znHguI9CSFfckxfu/r9ptf/ZWZfvDe0iZL0914xOnun9TNW2z8Q0h8wcHWiJZTh8ubhdoBXcezbsoIXEmwgPWiGoyE/QkcKmYqomUhKEEEgPd0aq2J6BUaRmM+7OBvzgKqcdYZcRFe1NSmw76l5FD8WKdOY4WME3yfl796apLG4d6GOrrX6/02GbtafocC4nl+/gzHGFwdkCyWNxjaVC8RcAC2qy66w9+dB2OGHZtTT2q+qP9xubfR//Je+YnmprDnPMsPfdP/D2GSYMn5dvYoNtlFNXlzrG+QpN905DvM8ogJ7B31/i43kiPVNc5O7b18V8rfVuGV5c31/Wt52MByY3QrBA7/Bf30smRMea8qwo7MRCaJ8T8ryRz7DTPE6iPArcHe80yTaYBdKO9clqX3iUtSL3QXawo5zQ9UG853YsTKlvOefvX8Flp8Rm3Btd78tbgnubSAMPuZ6LMXt0vWff8Lw== \ No newline at end of file diff --git a/docs/meadowlark-data-storage-design/attachments/offline workflow.png b/docs/meadowlark-data-storage-design/attachments/offline workflow.png new file mode 100644 index 00000000..50782ec2 Binary files /dev/null and b/docs/meadowlark-data-storage-design/attachments/offline workflow.png differ diff --git a/docs/meadowlark-data-storage-design/meadowlark-document-shape.md b/docs/meadowlark-data-storage-design/meadowlark-document-shape.md new file mode 100644 index 00000000..cc5227bb --- /dev/null +++ b/docs/meadowlark-data-storage-design/meadowlark-document-shape.md @@ -0,0 +1,58 @@ +# 
Meadowlark - Document Shape + +## Overview + +Meadowlark documents have a *baseline* standardized shape or scheme in data storage; however, some implementation details may vary by data storage provider. For example, the MongoDB implementation includes two arrays inside the document, which are stored as separate tables in PostgreSQL. + +## Standard Elements + +| Attribute Name | Description | Example | +| --- | --- | --- | +| **\_id** (Mongo) or **id** (PostgreSQL) | A unique [document identifier](https://edfi.atlassian.net/wiki/spaces/EXCHANGE/pages/22498135/Meadowlark+-+Document+Shape#Meadowlark-DocumentShape-DocumentIdentifiersdocument-identifier), also known as the "Meadowlark ID". Used internally by the database engine. | ZAwidGBEGsnKxQ-V1ktoecnvJ8xceXjM1jMehQ | +| documentUuid | A unique and randomly-assigned [document identifier](https://edfi.atlassian.net/wiki/spaces/EXCHANGE/pages/22498135/Meadowlark+-+Document+Shape#Meadowlark-DocumentShape-DocumentIdentifiersdocument-identifier). Used externally by API clients | 3518d452-a7b7-4f1c-aa91-26ccc48cf4b8 | +| documentIdentity | Array of key-value pairs that make up the document identity, corresponding to the Natural Key in the Ed-Fi Data Standard | ```
[
  {
    "key": "localEducationAgencyId",
    "value": "2231"
  }
]
``` | +| projectName | The project name from MetaEd. Tip: eventually this will allow Meadowlark to support Data Model extensions. | Ed-Fi | +| resourceName | The name of the resource, corresponding to the domain entity name in the Ed-Fi Data Standard | LocalEducationAgency | +| resourceVersion | The Data Standard version. Tip: eventually this will allow Meadowlark to support multiple Data Standards in the same data store. | 3.3.1-b | +| isDescriptor | Boolean | false | +| edfiDoc | The document body corresponding to the data model definition | ```
[
  {
    "localEducationAgencyId": "2231",
    "nameOfInstitution": "Grand Bend School District",
    "localEducationAgencyCategoryDescriptor": "uri://ed-fi.org/LocalEducationAgencyCategoryDescriptor#Independent",
    "categories": [ ]
  }
]
``` | + +## Document Identifiers + +Meadowlark items have two unique identifiers: + +* The `id` (aka "Meadowlark ID") value is a computed value based on the data standard version and data model natural key. +* The `documentUuid` is a v4 UUID (unique identifier) that is assigned at the time the item is first created in the API. + +Early on in project Meadowlark, the application only used the computed `id` value above. This value makes it trivial to look up an item in the data store, including for validating that a new item's references actually exist. It was also useful for allowing API clients to specify a particular resource to GET, PUT, or DELETE (e.g. `GET /ed-fi/localEducationAgency/ZAwidGBEGsnKxQ-V1ktoecnvJ8xceXjM1jMehQ`). + +The computed `id` suffers from one glaring problem, though: the natural key can change for some resource types. Implication: a changing natural key would change the `id` value and thus break any *external* client integrations that stored that calculated ID. Thus the application needs a fully stable identifier for *external* usage. This external ID is the randomly/uniquely assigned `documentUuid` that does not change even if the natural key changes. + +The following sequence diagram shows the differing internal and external uses of these two unique identifiers. + +![](./attachments/identifiers.png) + +### Meadowlark ID + +The Meadowlark ID has a one-to-one unique match with the document, based on the data model project name, the resource name, and the natural key value(s). Because this could be rather large for some resources - for example, a StudentSectionAssociation resource - Meadowlark concatenates this information into a single string, calculates a hash value on it (using SHA3-224), and then encodes the result as a Base64 string. 
The keys are sorted alphanumerically before that concatenation, to guarantee a consistent ordering when there are multiple key fields. + +Example: Course in the Ed-Fi Data Standard has a natural key composed of the `courseCode` and the associated `educationOrganization`. The document identity is thus: + +```json +[ + { + "key": "courseCode", + "value": "span-101" + }, + { + "key": "educationOrganizationReference.educationOrganizationId", + "value": "123" + } +] +``` + +To create a Meadowlark ID, the system concatenates this information into the following string and then hashes it: + +```none +Ed-Fi#Course#courseCode=span-101#educationOrganizationReference.educationOrganizationId=123 +``` diff --git a/docs/meadowlark-data-storage-design/meadowlark-dynamodb.md b/docs/meadowlark-data-storage-design/meadowlark-dynamodb.md new file mode 100644 index 00000000..e2e6ab32 --- /dev/null +++ b/docs/meadowlark-data-storage-design/meadowlark-dynamodb.md @@ -0,0 +1,236 @@ +# Meadowlark - DynamoDB + +OBSOLETE + +> [!CAUTION] +> The Meadowlark development team learned much while developing the 0.1.0 milestone against DynamoDB. During the development of the next milestone, the team realized that there is a fatal flaw: there is no native support for transactional, distributed lock management. There is a [Java-based client](https://aws.amazon.com/blogs/database/building-distributed-locks-with-the-dynamodb-lock-client/) that provides a [pessimistic offline locking](https://www.martinfowler.com/eaaCatalog/pessimisticOfflineLock.html) mechanism that might work, but this is not useful for Meadowlark. Furthermore, the development team cannot afford to invest time building a custom locking mechanism. Therefore support for DynamoDB is being removed with the milestone 0.2.0 release. +> *Also see* *[Meadowlark - Referential Integrity in Document Databases](../meadowlark-data-storage-design/meadowlark-referential-integrity-in-document-databases.md)*. 
+ +## Overview + +The development team chose AWS DynamoDB for the initial Meadowlark milestone because of its low-cost storage and serverless nature, and also did some research around potential use of Cassandra (or Cosmos DB in Cassandra mode). Cassandra and DynamoDB are based on the same original architectural design, so the team felt that cross-platform lessons could be learned even while exploring only one of them in depth. + +## Eventual Consistency + +Highly scalable databases such as DynamoDB and Cassandra store multiple copies of the data for resiliency and high availability, and only one of these copies receives the initial write operation. The service guarantees that all other copies will eventually come up to date with that initial write operation: the data will *eventually be consistent*. The tradeoff is in favor of connection reliability: queries are not blocked by write operations. + +Many people find this disturbing at first, if they are used to thinking about transaction locking in relational databases. But the reality is less scary than it sounds. + +[Amazon states](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html) that it typically takes "one second or less" to bring all copies up to date. 
Let's compare the outcomes of the following three scenarios: + +| Time | Scenario 1 | Scenario 2 | Scenario 3 | +| --- | --- | --- | --- | +| 10:01:01.000 AM | **Client A reads a record** | Client B writes an update to that record | Client B writes an update to that record | +| 10:01:01.500 AM (half second) | Client B writes an update to that record | **Client A reads a record** | All DynamoDB copies are up-to-date | +| 10:01:02.000 AM (full second) | All DynamoDB copies are up-to-date | All DynamoDB copies are up-to-date | **Client A reads a record** | +| *Status* | *Client A has stale data* | *Client A* might *have stale data* | *Client A has current data* | + +In Scenario 1, Client A receives stale data because it requested the record *half a second* before Client B wrote an update. *And this is no different than in a relational database*. + +In Scenario 2, Client B writes an update *half a second* before Client A sends a read. Client A might coincidentally be assigned to read from the first database node that received the record, or it might read from a node that is lagging by half a second. Thus it *might* get stale data, though this is not guaranteed. + +Finally in Scenario 3, Client A asks for a record a full second after Client B has written an update, and Client A is *nearly* guaranteed to get the current (not stale) data. *Again, same as with a standard relational database*. + +The practical difference between the guaranteed consistency of a relational database and the eventual consistency of a distributed database like DynamoDB is thus more a matter of happenstance than anything else. In either case, if Client A reads from the system a millisecond before Client B writes, then Client A will have stale data. If Client A reads *after* Client B writes, then the window of time for getting stale data goes up to perhaps a second. 
*But if they do get stale data, they will never know that they weren't in scenario 1.* + +Eventual consistency is likely "good enough." But it does deserve further community consideration before using it in a production system. + +## Storage Design + +Meadowlark uses the [single-table design](https://aws.amazon.com/blogs/compute/creating-a-single-table-design-with-amazon-dynamodb/) approach for storage in DynamoDB, with the following structure: + +| Column Name | Purpose | +| --- | --- | +| info | Contains the JSON document for a resource | +| pk​ | Hash key (aka partition key) - one half of the primary key​. | +| naturalKey | Plain text version of the natural key | +| sk | Range key (aka sort key) - the other half of the primary key | + +There are also a couple of experimental columns and secondary indexes for exploring relationship-based authorization. + +Meadowlark creates a unique resource ID by calculating a  [SHA-3](https://en.wikipedia.org/wiki/SHA-3) (cShake 128) hash value from the natural key. This value is stored as the sort key, `sk` . The partition key, `pk` , contains entity type information: schema, model version, and domain entity name. + +> [!TIP] +> In DynamoDB, an "item" is analogous to a "record" in a relational database. Thus a single object being stored in a DynamoDB table is stored as "an item". + +## Streaming to OpenSearch + +DynamoDB has native change data capture streaming. The change stream can trigger execution of a Lambda function. This function in turn can write data out to OpenSearch. + +## Referential Integrity + +An important feature of an Ed-Fi API is the ability to enforce referential integrity, rejecting modification requests where the modified item refers to another item that does not actually exist. An Ed-Fi API also rejects attempts to delete items that are referred to by other items. 
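The two rules in the paragraph above can be sketched with a small in-memory model. This is an illustration only, not Meadowlark's implementation: the `Store` class, its method names, the exception type, and the use of a plain Python dictionary in place of DynamoDB are all assumptions made for the example.

```python
class ReferentialIntegrityError(Exception):
    """Raised when a write or delete would break a reference."""


class Store:
    """Hypothetical in-memory stand-in for the document store."""

    def __init__(self):
        # Maps an item id to the set of ids that the item refers to.
        self.references = {}

    def upsert(self, item_id, refers_to=()):
        # Rule 1: reject a write that refers to an item that does not exist.
        missing = [ref for ref in refers_to if ref not in self.references]
        if missing:
            raise ReferentialIntegrityError(f"missing references: {missing}")
        self.references[item_id] = set(refers_to)

    def delete(self, item_id):
        # Rule 2: reject a delete while other items still refer to this one.
        referrers = [other for other, refs in self.references.items()
                     if other != item_id and item_id in refs]
        if referrers:
            raise ReferentialIntegrityError(f"still referenced by: {referrers}")
        self.references.pop(item_id, None)
```

With this model, a Course can only be stored after the School it refers to, and the School can only be deleted after the Course is gone. A real implementation also has to make each check and write atomic, which this single-process sketch does not attempt.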
+ +Most NoSQL databases do not support referential integrity, whereas the ODS/API Platform leverages referential integrity checking built into the SQL database. Therefore Meadowlark had to develop its own system for referential integrity checks, in application code. In short, Meadowlark transactionally writes extra items to the transactional database with pointers to the referenced items. These items are trivial to look up. + +> [!WARNING] +> +> Due to eventual consistency, there is a small but real possibility of a referential integrity check *miss*. To what extent does this matter? Another question for the community to explore. + +To illustrate: assume that a Meadowlark instance already has descriptors loaded, and an API client wants to load a School and a Course that belongs to that school. Adding excitement to the scenario: in the Ed-Fi Data Model, a School *is an* Education Organization (extends / inherits). + +![](./attachments/image2021-12-6_14-10-12.png) + +Below is the successful POST request to create the new school: + +**Request** + +``` +POST http://aws-created-url/stage-name/v3.3b/ed-fi/schools + +{ + "schoolId": 122, + "nameOfInstitution": "A School", + "educationOrganizationCategories" : [ + { + "educationOrganizationCategoryDescriptor": "uri://ed-fi.org/EducationOrganizationCategoryDescriptor#Other" + } + ], + "schoolCategories": [ + { + "schoolCategoryDescriptor": "uri://ed-fi.org/SchoolCategoryDescriptor#All Levels" + } + ], + "gradeLevels": [] +} +``` + +**Response** + +``` +HTTP/1.1 201 Created +x-metaed-project-name: Ed-Fi +x-metaed-project-version: 3.3.1-b +x-metaed-project-package-name: ed-fi-model-3.3b +location: /v3.3b/ed-fi/schools/7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e +content-type: application/json; charset=utf-8 +vary: origin +access-control-allow-credentials: true +access-control-expose-headers: WWW-Authenticate,Server-Authorization +cache-control: no-cache +content-length: 0 +Date: Mon, 06 Dec 2021 14:47:42 GMT 
+Connection: close + +``` + +Since the request references two descriptors, the application code must validate that those are legitimate descriptors. The following DynamoDB items exist, therefore the POST is validated: + +* SchoolCategoryDescriptor + * pk = TYPE#Ed-Fi#3.3.1-b#SchoolCategoryDescriptor + * sk = ID#0f1474d47271406f6b47eabeba2fca6dd5a8b49a3b9d4e5b8d0e87e8 + * naturalKey = NK#**uri://ed-fi.org/SchoolCategoryDescriptor#All Levels** + * info = {"namespace":{"S":"uri://ed-fi.org/SchoolCategoryDescriptor"},"description":{"S":"All Levels"},"shortDescription":{"S":"All Levels"},"\_unvalidated":{"BOOL":true},"codeValue":{"S":"All Levels"}} +* EducationOrganizationCategoryDescriptor + * pk = TYPE#Ed-Fi#3.3.1-b#EducationOrganizationCategoryDescriptor + * sk = ID#04c7f019c56684b0539135ab2d955e4c03bc85b3841cdd87fb970f35 + * naturalKey = NK#**uri://ed-fi.org/EducationOrganizationCategoryDescriptor#Other** + * info = {"namespace":{"S":"uri://ed-fi.org/EducationOrganizationCategoryDescriptor"},"description":{"S":"Other"},"shortDescription":{"S":"Other"},"\_unvalidated":{"BOOL":true},"codeValue":{"S":"Other"}} + +Now that the POST has been accepted, Meadowlark saves the following items in a transaction: + +* School + * pk = TYPE#Ed-Fi#3.3.1-b#School + * sk = ID#7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e + * naturalKey = NK#schoolId=122 + * info = {"educationOrganizationCategories":{"L":\[{"M":{"educationOrganizationCategoryDescriptor":{"S":"uri://ed-fi.org/EducationOrganizationCategoryDescriptor#Other"}}}\]},"schoolCategories":{"L":\[{"M":{"schoolCategoryDescriptor":{"S":"uri://ed-fi.org/SchoolCategoryDescriptor#All Levels"}}}\]},"gradeLevels":{"L":\[\]},"schoolId":{"N":"122"},"nameOfInstitution":{"S":"A School"}} +* Education Organization + * pk = TYPE#Ed-Fi#3.3.1-b#EducationOrganization + * sk = ASSIGN#ID#7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e + +The second item, of type "Assign", helps to recognize entity super types when performing 
referential integrity validation checks. Please note that the hash value in the Assign item's `sk`  matches the hash value for the individual school. + +Now that there is a school, the client next creates a new Course, which has a reference to Education Organization. In this scenario, that Education Organization will be the School that was just created. For referential integrity, Meadowlark must determine if the Education Organization Id actually exists. Based on the payload, Meadowlark doesn't "know" to look for a *School* with this particular Education Organization Id – could be a Local or State Education Agency, for example. Hence the creation of the Assign item with `TYPE#Ed-Fi#3.3.1-b#EducationOrganization`  and the School's natural key hash value, which Meadowlark uses for the integrity lookup. + +**Request** + +``` +POST http://aws-created-url/stage-name/v3.3b/ed-fi/courses + +{ + "educationOrganizationReference": { + "educationOrganizationId": 122 + }, + "courseCode": "1234", + "courseTitle": "A Course", + "numberOfParts": 1, + "identificationCodes": [] +} +``` + +**Response** + +``` +HTTP/1.1 201 Created +x-metaed-project-name: Ed-Fi +x-metaed-project-version: 3.3.1-b +x-metaed-project-package-name: ed-fi-model-3.3b +location: /v3.3b/ed-fi/courses/2717e6e9275502cb2da0e3bdbf5c2ba3395f9e2117bdc7e03c216138 +content-type: application/json; charset=utf-8 +vary: origin +access-control-allow-credentials: true +access-control-expose-headers: WWW-Authenticate,Server-Authorization +cache-control: no-cache +content-length: 0 +Date: Mon, 06 Dec 2021 15:32:03 GMT +Connection: close +``` + +As Course does not extend any other entity, there is no need for it to have a complementary Assign item. However, another type of referential integrity comes into play now: we must make sure that no client can delete the School without first deleting the referencing Course.  
Meadowlark handles this by creating additional items along with the Course: one pointing from Course to School and one in reverse, making it easy to look up the relationship in either direction. + +* Course + * pk = TYPE#Ed-Fi#3.3.1-b#Course + * sk = ID#2717e6e9275502cb2da0e3bdbf5c2ba3395f9e2117bdc7e03c216138 + * naturalKey = NK#courseCode=1234#educationOrganizationReference.educationOrganizationId=122 + * info = {"courseTitle":{"S":"A Course"},"numberOfParts":{"N":"1"},"educationOrganizationReference":{"M":{"educationOrganizationId":{"N":"122"}}},"identificationCodes":{"L":\[\]},"courseCode":{"S":"1234"}} +* From Course To School + * pk = FREF#ID#2717e6e9275502cb2da0e3bdbf5c2ba3395f9e2117bdc7e03c216138 + * sk = TREF#ID#7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e +* To School From Course + * pk = TREF#ID#7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e + * sk = FREF#ID#2717e6e9275502cb2da0e3bdbf5c2ba3395f9e2117bdc7e03c216138 + * info = {"Type":{"S":"TYPE#Ed-Fi#3.3.1-b#Course"},"NaturalKey":{"S":"NK#courseCode=1234#educationOrganizationReference.educationOrganizationId=122"}} + +The `info` column in the "to ... 
from" item allows Meadowlark to provide a meaningful message when it rejects a DELETE request based on referential integrity: + +**Request** + +``` +DELETE http://aws-created-url/stage-name/v3.3b/ed-fi/schools/7a5cf3f4a68015c0922e24c73401a21e9fd1767ef60c0b3300f2301e +``` + +**Response** + +``` +HTTP/1.1 409 Conflict +x-metaed-project-name: Ed-Fi +x-metaed-project-version: 3.3.1-b +x-metaed-project-package-name: ed-fi-model-3.3b +content-type: application/json; charset=utf-8 +vary: origin +access-control-allow-credentials: true +access-control-expose-headers: WWW-Authenticate,Server-Authorization +cache-control: no-cache +content-length: 741 +Date: Mon, 06 Dec 2021 15:51:10 GMT +Connection: close + +{ + "error": "Unable to delete this item because there are foreign keys pointing to it", + "foreignKeys": [ + { + "NaturalKey": "NK#courseCode=1234#educationOrganizationReference.educationOrganizationId=122", + "Type": "TYPE#Ed-Fi#3.3.1-b#Course" + } + ] +} +``` + +## References + +* Alex DeBrie's [DynamoDB Guide](https://www.dynamodbguide.com/) and [The DynamoDB Book](https://www.dynamodbbook.com/) +* [Single table design with DynamoDB](https://www.youtube.com/watch?v=BnDKD_Zv0og). "Covers a fair amount of his book content". 
+* [re:Invent 2019 - DynamoDB Deep Dive](https://www.youtube.com/watch?v=6yqfmXiZTlM) +* [re:Invent 2020 - DynamoDB Advanced Design Patterns, part 1](https://www.youtube.com/watch?v=MF9a1UNOAQo) +* [re:Invent 2020 - DynamoDB Advanced Design Patterns, part 2](https://www.youtube.com/watch?v=_KNrRdWD25M) diff --git a/docs/meadowlark-data-storage-design/meadowlark-experiments-in-pagination.md b/docs/meadowlark-data-storage-design/meadowlark-experiments-in-pagination.md new file mode 100644 index 00000000..4f795730 --- /dev/null +++ b/docs/meadowlark-data-storage-design/meadowlark-experiments-in-pagination.md @@ -0,0 +1,60 @@ +# Meadowlark - Experiments in Pagination + +## Overview + +For large databases, pagination with LIMIT x OFFSET y queries becomes very slow as you go deeper and deeper into the search results. What options does Meadowlark's usage of OpenSearch, MongoDB, and PostgreSQL provide? What performance characteristics can we expect? + +Note: pagination in the Ed-Fi API specification always has the potential to go hand-in-hand with queries, therefore this article assumes that both must be supported at the same time. + +## Standard Techniques + +### LIMIT x OFFSET y + +Fetches X number of rows, skipping Y number of rows. The skipping part is what can take a very long time as you go deeper into the result set. There is also the danger that the data changes between page requests. + +### Keyset Pagination + +A better technique can be to fetch a small number of records, and use a monotonically-increasing key field in a WHERE clause to select the next group of records. This is referred to as either the *seek method* or *keyset pagination*, and seems to have been first documented in [Paging Through Results](https://use-the-index-luke.com/sql/partial-results/fetch-next-page). This approach takes full advantage of database indexing. + +### Cursors + +Cursor-based pagination is common in GraphQL queries. 
As described in [Paginating Requests in APIs](https://ignaciochiazzo.medium.com/paginating-requests-in-apis-d4883d4c1c4c), the "cursor" is essentially a pointer to the next record that should be fetched. Thus it is conceptually similar to a keyset approach. However, it might be subject to problems when a new record appears in the sort order *before* the next cursor. + +## Meadowlark Database Engines + +In the current design, all documents are written to either MongoDB or PostgreSQL for basic transactional support. The data are also written to OpenSearch, and GET ALL and GET by QUERY type requests go against the OpenSearch database with its powerful indexing. However, MongoDB and PostgreSQL also have document indexing capabilities. + +### Pagination in OpenSearch + +OpenSearch supports all three patterns. The cursor-based pattern is poorly documented; in fact, the official documentation only mentions [limitations](https://opensearch.org/docs/latest/search-plugins/sql/limitation/) to cursor-based paging, without ever mentioning how to use it. The limitations mention that only [basic queries](https://opensearch.org/docs/latest/search-plugins/sql/basic/) are supported; this fits the potential Meadowlark usage pattern, which would not use sub-queries or joins ([complex queries](https://opensearch.org/docs/latest/search-plugins/sql/complex/)). However, there is a markdown document describing [OpenSearch SQL Cursor (Pagination) Support](https://github.com/opensearch-project/sql/blob/2.1/docs/dev/Pagination.md) in the source repository. + +### Pagination in MongoDB + +*An aside:* + +MongoDB [does support indexing](https://www.mongodb.com/docs/v4.0/indexes/) into a document, without which there would be no point to the pagination. In theory, we could use MongoDB alone for queries, instead of relying on OpenSearch. Trying to build the right indexes might be difficult, especially with multikey indexes to cover the possibility of querying on multiple fields. 
Multikey indexing might necessitate moving to a design of one collection ("table") per resource, instead of having a single collection that contains all resources. It also would likely benefit from using the MetaEd model introspection to find the queryable fields and auto-generate indexes. + +*Back to the topic:* + +As described in [MongoDB Pagination, Fast & Consistent](https://medium.com/swlh/mongodb-pagination-fast-consistent-ece2a97070f3), MongoDB supports both the LIMIT x OFFSET y and keyset pagination patterns, with similar limitations on OFFSET as other systems. In MongoDB, the first pattern uses the [skip()](https://www.mongodb.com/docs/v4.0/reference/method/cursor.skip/) function for the "offset". The keyset pattern is similar to other systems, using a limit combined with a greater-than query. MongoDB calls these Range Queries. + +### Pagination in PostgreSQL + +*Aside*: + +Similar to MongoDB, in theory we could switch to using indexes directly in PostgreSQL instead of utilizing OpenSearch. These are called [GIN indexes](https://pganalyze.com/blog/gin-index). Similar design concerns may apply. + +*Back to the topic:* + +> [!INFO] +> I have not been able to find any blogs or documentation that explicitly discuss pagination with GIN indexes on JSONB structures. PostgreSQL has LIMIT x OFFSET y support, and naturally, with "normal" tables, it can be used for keyset pagination. Update this section after doing more research. +> [https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/](https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/) shows cursors in the psql-language, but how would those be represented in API-based queries, as opposed to "shell" sessions or single synchronous scripts? 
+> [Paginating Large, Ordered Data Sets with Cursor-Based Pagination](https://brunoscheufler.com/blog/2022-01-01-paginating-large-ordered-datasets-with-cursor-based-pagination) describes a possible approach to creating GraphQL-style cursors that essentially employ the keyset pagination technique. +> +> Some information on GIN indexing in general, and ordering issues in particular: +> +> * Blog post overview of GIN indexing: [https://pganalyze.com/blog/gin-index](https://pganalyze.com/blog/gin-index) +> * PostgreSQL docs on only b-tree indexes supporting sort: [https://www.postgresql.org/docs/current/indexes-ordering.html](https://www.postgresql.org/docs/current/indexes-ordering.html) +> * PostgreSQL perf mailing list on GIN + ORDER BY issues: [https://www.postgresql.org/message-id/flat/56B332B6.1040109@promani.be](https://www.postgresql.org/message-id/flat/56B332B6.1040109%40promani.be) +> +> Other references I've found are on Stack Overflow with reports of performance tanking when GIN indexes are used with ORDER BY. diff --git a/docs/meadowlark-data-storage-design/meadowlark-mongodb.md b/docs/meadowlark-data-storage-design/meadowlark-mongodb.md new file mode 100644 index 00000000..f137f01c --- /dev/null +++ b/docs/meadowlark-data-storage-design/meadowlark-mongodb.md @@ -0,0 +1,217 @@ +# Meadowlark - MongoDB + +## Introduction + +In hindsight,  [DynamoDB](../meadowlark-data-storage-design/meadowlark-dynamodb.md) was a poor choice of data store for the first release of Meadowlark for two primary reasons: + +* Except for a little-known open source implementation, it is entirely restricted to Amazon Web Services. +* The design model is interesting, but idiosyncratic. + +MongoDB would have been a better starting point: + +* It is supported, directly and/or through emulation, on all major cloud platforms and on-premises. +* It is a mature product, with strong documentation and design patterns. 
+* The scalability features, such as replication and sharding, are very attractive for large implementations. + +There are other NoSQL databases with similar benefits and other attractive features, such as Couchbase. However, the support is less widespread, so it will not be investigated at this time. + +Although it is one of the traditional relational databases, PostgreSQL has powerful built-in support for NoSQL operations. Because of the Ed-Fi community's growing adoption of PostgreSQL, it will be explored as an alternative to MongoDB. *See [Meadowlark - PostgreSQL](../meadowlark-data-storage-design/meadowlark-postgresql.md).* + +*Also see: [Meadowlark - Durable Change Data Capture](../../project-meadowlark-exploring-next-generation-technologies/meadowlark-streaming-and-downstream-data-stores/meadowlark-durable-change-data-capture.md) for more information on streaming data out to OpenSearch.* + +## Design + +This proposal takes its cue from the team experience with DynamoDB. The basic principle remains the same: the API document is stored along with metadata to be used for existence/reference validation. However, instead of storing the metadata in columns it will be part of a single larger document. Fast document lookups continue to be done by id, constructed as before from API document project name, entity type, version and body. Transactions will again be used to check for existence/references before performing create/update/delete operations. The MongoDB version of reference validation for deletes is greatly simplified from the DynamoDB version by taking advantage of MongoDB's indexing features, in particular indexing of arrays. + +> [!TIP] +> To support potential deployment to Amazon DocumentDB or Azure CosmosDB, all code and design should match the **[MongoDB 4.0 API](https://www.mongodb.com/docs/v4.0/)**. + +### Entity Collection + +The MongoDB implementation will only need one collection, to be called Entity. 
The shape of the Entity document (all fields required): + +* Standard attributes (also see [Meadowlark - Document Shape](../meadowlark-data-storage-design/meadowlark-document-shape.md)) + * `id` - A string hash derived from the project name, resource name, resource version and identity of the API document. This field will be a unique index on the collection (it is stored as MongoDB's `_id` in the examples below). + * `documentIdentity` - The identity elements extracted from the API document. + * `projectName` - The MetaEd project name the API document resource is defined in, e.g. "EdFi" for a data standard entity. + * `resourceName` - The name of the resource. Typically, this is the same as the corresponding MetaEd entity name. However, there are exceptions, for example descriptors have a "Descriptor" suffix on their resource name. + * `resourceVersion` - The resource version as a string. This is the same as the MetaEd project version the entity is defined in, e.g. "3.3.1-b" for a 3.3b data standard entity. + * `isDescriptor` - Boolean indicator. + * `edfiDoc` - The Ed-Fi ODS/API document itself. + * `validated` - Boolean indicator. + * `createdBy` - Name/ID of the client who created the record, for authorization usage. +* MongoDB-specific attributes + * `outRefs` - An array of ids extracted from the ODS/API document for all externally referenced documents. + * `existenceIds` - An array of class and superclass identifiers applicable to this document. See [Meadowlark - Referential Integrity in Document Databases](../meadowlark-data-storage-design/meadowlark-referential-integrity-in-document-databases.md). 
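The attribute list above can be sketched as a small builder. This is a minimal illustration only: `hashId` and `buildEntityDocument` are hypothetical names, the `superclass` and `references` inputs are invented for the sketch, and the SHA-256 hex digest is an assumption that stands in for whatever hash Meadowlark actually uses.

```javascript
const { createHash } = require('crypto');

// Hypothetical id helper: hashes "project#resource#version#name=value...".
// Assumption: the real Meadowlark hash differs; only the inputs follow the text.
function hashId(projectName, resourceName, resourceVersion, documentIdentity) {
  const identity = documentIdentity.map(({ name, value }) => `${name}=${value}`).join('#');
  return createHash('sha256')
    .update([projectName, resourceName, resourceVersion, identity].join('#'))
    .digest('hex');
}

// Assemble an Entity document. `superclass` and each entry of `references`
// are illustrative inputs shaped as { resourceName, documentIdentity }.
function buildEntityDocument(input) {
  const {
    projectName, resourceName, resourceVersion, documentIdentity,
    isDescriptor = false, edfiDoc, validated, createdBy,
    superclass = null, references = [],
  } = input;

  const id = hashId(projectName, resourceName, resourceVersion, documentIdentity);

  // existenceIds always holds the document's own id, plus the id it would
  // have if it were addressed as its superclass.
  const existenceIds = [id];
  if (superclass !== null) {
    existenceIds.push(
      hashId(projectName, superclass.resourceName, resourceVersion, superclass.documentIdentity),
    );
  }

  // outRefs holds the id of every externally referenced document.
  const outRefs = references.map((ref) =>
    hashId(projectName, ref.resourceName, resourceVersion, ref.documentIdentity));

  return {
    _id: id, documentIdentity, projectName, resourceName, resourceVersion,
    isDescriptor, edfiDoc, existenceIds, outRefs, validated, createdBy,
  };
}

const lea = buildEntityDocument({
  projectName: 'Ed-Fi',
  resourceName: 'LocalEducationAgency',
  resourceVersion: '3.3.1-b',
  documentIdentity: [{ name: 'localEducationAgencyId', value: 2231 }],
  edfiDoc: { localEducationAgencyId: 2231, nameOfInstitution: 'Grand Bend School District' },
  validated: true,
  createdBy: 'super-great-SIS',
  superclass: {
    resourceName: 'EducationOrganization',
    documentIdentity: [{ name: 'educationOrganizationId', value: 2231 }],
  },
});

console.log(lea.existenceIds.length); // own id plus the EducationOrganization id
```

Keeping the superclass id in `existenceIds` is what lets a reference to an EducationOrganization be satisfied by a LocalEducationAgency document with a single array lookup.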
+ +#### Examples + +**Example: Descriptor** + +```json +{ + "_id" : "uPvxNzlTZfnIGtMKu9K-oxLPlippk7UNmoipow", + "documentIdentity" : [ + { + "name" : "descriptor", + "value" : "uri://ed-fi.org/EducationOrganizationCategoryDescriptor#Local Education Agency" + } + ], + "projectName" : "Ed-Fi", + "resourceName" : "EducationOrganizationCategory", + "resourceVersion" : "3.3.1-b", + "isDescriptor" : true, + "edfiDoc" : { + "codeValue" : "Local Education Agency", + "shortDescription" : "Local Education Agency", + "description" : "Local Education Agency", + "namespace" : "uri://ed-fi.org/EducationOrganizationCategoryDescriptor" + }, + "existenceIds" : [ + "uPvxNzlTZfnIGtMKu9K-oxLPlippk7UNmoipow" + ], + "outRefs" : [], + "validated" : true, + "createdBy" : "super-great-SIS" +} +``` + +In the following example, there are two `existenceIds` values. You'll recognize `ZAwidGBEGsnKxQ-V1ktoecnvJ8xceXjM1jMehQ` as the identifier of "this" document. The plain text [document identifier](../meadowlark-data-storage-design/meadowlark-document-shape.md) is "**Ed-Fi#LocalEducationAgency#localEducationAgencyId=2231**". + +The second value, `0bCeilWY_p33iM0Z3wOqdI058gvNTmphi_ZBJQ`, is the document id constructed as if this document were an EducationOrganization instead of a LocalEducationAgency.
Thus, the plain text document identifier is "**Ed-Fi#EducationOrganization#educationOrganizationId=2231**". + +**Example: Local Education Agency** + +```json +{ + "_id" : "ZAwidGBEGsnKxQ-V1ktoecnvJ8xceXjM1jMehQ", + "documentIdentity" : [ + { + "name" : "localEducationAgencyId", + "value" : 2231 + } + ], + "projectName" : "Ed-Fi", + "resourceName" : "LocalEducationAgency", + "resourceVersion" : "3.3.1-b", + "isDescriptor" : false, + "edfiDoc" : { + "localEducationAgencyId" : 2231, + "nameOfInstitution" : "Grand Bend School District", + "localEducationAgencyCategoryDescriptor" : "uri://ed-fi.org/LocalEducationAgencyCategoryDescriptor#Independent", + "categories" : [] + }, + "existenceIds" : [ + "ZAwidGBEGsnKxQ-V1ktoecnvJ8xceXjM1jMehQ", + "0bCeilWY_p33iM0Z3wOqdI058gvNTmphi_ZBJQ" + ], + "outRefs" : [], + "validated" : true, + "createdBy" : "super-great-SIS" +} +``` + +In the following example, note that the `outRefs` array has the identifiers for the two referenced documents: an intervention and a student. + +**Example: StudentInterventionAssociation, with References** + +```json +{ + "_id" : "LCEK0AxHRDUHK-5LVBlQKIarJHE83o1dVNgKWA", + "documentIdentity" : [ + { + "name" : "interventionReference.educationOrganizationId", + "value" : 123 + }, + { + "name" : "interventionReference.interventionIdentificationCode", + "value" : "111" + }, + { + "name" : "studentReference.studentUniqueId", + "value" : "s0zf6d1123d3e" + } + ], + "projectName" : "Ed-Fi", + "resourceName" : "StudentInterventionAssociation", + "resourceVersion" : "3.3.1-b", + "isDescriptor" : false, + "edfiDoc" : { + "studentReference" : { + "studentUniqueId" : "s0zf6d1123d3e" + }, + "interventionReference" : { + "interventionIdentificationCode" : "111", + "educationOrganizationId" : 123 + } + }, + "existenceIds" : [ + "LCEK0AxHRDUHK-5LVBlQKIarJHE83o1dVNgKWA" + ], + "outRefs" : [ + "M42GTJNsVAGX5EOOoa7U_EwZdbOhmSiAF9wehw", + "kKOLuEZJWjsDhpDiJOQlryLw_JBvzQ5KXTF2xg" + ], + "validated" : false, + "createdBy" : 
"super-great-SIS" +} +``` + +> [!TIP] +> If trying to query inside of an entity, or if trying to GET ALL by type in MongoDB, then separate collections would be better than a single collection. However, when using MongoDB we would still plan to have OpenSearch or ElasticSearch in the picture for those functions. Therefore a single "table" (collection) design is appropriate, and makes sharding easy. + +### Insert Transaction Steps + +Inserting a new Entity document into the collection will follow these steps: + +* Check that id does not exist (indexed query) +* Check that external reference ids for the document all exist (index query per reference) +* Perform **up**sert + +### Update Transaction Steps + +Updating an existing Entity document into the collection will follow these steps: + +* Check that id exists (indexed query) +* Check that external reference ids for the document all exist (index query per reference) +* Perform overwrite + +### Delete Transaction Steps + +Deleting an existing Entity document from the collection will follow these steps: + +* Check that id exists (indexed query) +* Check that there are no out\_refs for this id (indexed query) +* Perform delete + +### Queries + +Get all and get-by-key queries will continue to be serviced by OpenSearch. See [Meadowlark - Durable Change Data Capture](../../project-meadowlark-exploring-next-generation-technologies/meadowlark-streaming-and-downstream-data-stores/meadowlark-durable-change-data-capture.md) for more information on how data will flow out to OpenSearch. 
+ +## Future Considerations + +### Security + +* Investigate adding security annotations based on indexable API document attributes + * Examples: ownership field, extracted education organization field +* Investigate using CASL.js for attribute-based authorization + * [https://casl.js.org/v5/en](https://casl.js.org/v5/en) + * Slide deck intro: [CASL presentation by author](https://www.slideshare.net/SergiyStotskiy/casl-isomorphic-permission-managementpptx-207064469) + +### Improve version migration support + +Consider ways we might want to change the id design to make migrating to newer DS versions easier. In the current design, the id includes project name, entity type, version, and natural key. + +Let's say a new DS version comes out and a Meadowlark implementation wants to upgrade documents to the newer DS version. Assume School is unchanged between the two DS versions. From the API client perspective, it would be very nice if the School resource ids didn't change. However, in the current design they would have to, because version is part of the id hash. + +This may require changes in how DS versions are incorporated into resource URLs, and/or keeping each version in its own MongoDB collection so that the id only needs to be unique within a collection. + +## Alternative Design + +An alternative design would be to create separate collections for each resource, with [indexes](https://www.mongodb.com/docs/v4.0/indexes/) on each queryable field. This could mean that MongoDB could serve as a single engine for all API CRUD requests, without the need for OpenSearch. + +The development team has not explored this in detail at this time. + +> [!WARNING] +> This document is for discussion and general guidance. The implementation may vary as needed. The development team will endeavor to keep this document up-to-date, though working software remains a higher priority than comprehensive documentation. 
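The insert and delete transaction steps listed earlier in this document can be sketched with an in-memory stand-in for the Entity collection. This is an illustration under stated assumptions, not the Meadowlark implementation: `EntityStore` and its method names are hypothetical, and a real implementation would run each method inside a MongoDB transaction and rely on indexes over `existenceIds` and `outRefs` rather than scanning.

```javascript
// In-memory stand-in for the single Entity collection, keyed by _id.
// Assumption: hypothetical class, used only to show the order of the checks.
class EntityStore {
  constructor() {
    this.docs = new Map();
  }

  // True when any stored document can be identified by `id`,
  // either as itself or as its superclass.
  existsAs(id) {
    return [...this.docs.values()].some((doc) => doc.existenceIds.includes(id));
  }

  // Insert: the id must be new and every outgoing reference must resolve.
  insert(doc) {
    if (this.docs.has(doc._id)) return { ok: false, reason: 'exists' };
    const missing = doc.outRefs.filter((ref) => !this.existsAs(ref));
    if (missing.length > 0) return { ok: false, reason: 'missing-reference', missing };
    this.docs.set(doc._id, doc);
    return { ok: true };
  }

  // Delete: the id must exist and no other document may reference it
  // under any of its existence ids.
  remove(id) {
    const target = this.docs.get(id);
    if (target === undefined) return { ok: false, reason: 'not-found' };
    const referenced = [...this.docs.values()].some((doc) =>
      doc.outRefs.some((ref) => target.existenceIds.includes(ref)));
    if (referenced) return { ok: false, reason: 'referenced' };
    this.docs.delete(id);
    return { ok: true };
  }
}

const store = new EntityStore();
const school = { _id: 'school-123', existenceIds: ['school-123', 'edorg-123'], outRefs: [] };
const week = { _id: 'week-1', existenceIds: ['week-1'], outRefs: ['school-123'] };

const results = [
  store.insert(week).reason, // rejected: the School is not there yet
  store.insert(school).ok,
  store.insert(week).ok,
  store.remove('school-123').reason, // rejected: the AcademicWeek references it
  store.remove('week-1').ok,
  store.remove('school-123').ok,
];
console.log(results);
```

The sketch deliberately ignores concurrency; the locking needed to keep these checks valid when two clients act at once is the subject of the referential integrity document.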
diff --git a/docs/meadowlark-data-storage-design/meadowlark-postgresql.md b/docs/meadowlark-data-storage-design/meadowlark-postgresql.md new file mode 100644 index 00000000..6f29ba60 --- /dev/null +++ b/docs/meadowlark-data-storage-design/meadowlark-postgresql.md @@ -0,0 +1,132 @@ +# Meadowlark - PostgreSQL + +## Introduction + +PostgreSQL has [extensive support](https://www.postgresql.org/docs/current/datatype-json.html) for storing and querying JSON documents. In fact, it seems to fare very well compared to MongoDB for [features](https://community.sisense.com/t5/knowledge/postgres-vs-mongodb-for-storing-json-data-which-should-you/ta-p/111) and especially for [performance](https://www.enterprisedb.com/news/new-benchmarks-show-postgres-dominating-mongodb-varied-workloads). Each platform has its own advantages. For Meadowlark 0.2.0, the development team will implement CRUD operations using PostgreSQL in addition to [MongoDB](../meadowlark-data-storage-design/meadowlark-mongodb.md), thus enabling direct comparison of features and benefits, and demonstrating the flexibility inherent in the design of the Meadowlark code. + +*Also see: [Meadowlark - Durable Change Data Capture](../../project-meadowlark-exploring-next-generation-technologies/meadowlark-streaming-and-downstream-data-stores/meadowlark-durable-change-data-capture.md) for more information on streaming data out to OpenSearch.* + +## Design + +### Overview + +The PostgreSQL schema would be set up in a similar non-relational design to the other NoSQL designs, but take advantage of PostgreSQL's document store features. The basic principle remains that the API document is stored along with metadata to be used for existence/reference validation. Metadata would continue to be stored alongside the API document in columns. Fast document lookups continue to be done by id, constructed as before from API document project name, entity type, version and natural key.
Transactions will again be used to check for existence/references before performing create/update/delete operations. + +In order to simplify a PostgreSQL deployment, this design is flexible on the requirement of OpenSearch for queries. This also means change data capture streaming becomes optional. + +Instead of using OpenSearch, a "standalone deployment" will take advantage of PostgreSQL's JSON inverted-index support. Rather than split the entities into separate tables, an additional index on `project_name/entity_type/entity_version` will be required for query support. Once a deployment reaches the performance constraints of this design, all these indexes can be dropped and an Elastic/OpenSearch configuration introduced that will continue to use a single table design. + +### **Documents** Table + +This implementation will use a single table named Entity. + +#### Columns + +| Column Name | Data Type | Description | +| --- | --- | --- | +| `id` | bigserial | Synthetic primary key, analogous to MongoDB's `_id` | +| `document_id` | VARCHAR | A string hash derived from the project name, resource name, resource version and identity of the API document. This field will have a unique index on the table. | +| `document_identity` | JSONB | The identity elements extracted from the API document. | +| `project_name` | VARCHAR | The MetaEd project name the API document resource is defined in, e.g. "EdFi" for a data standard entity. | +| `resource_name` | VARCHAR | The name of the resource. Typically, this is the same as the corresponding MetaEd entity name. However, there are exceptions, for example descriptors have a "Descriptor" suffix on their resource name. | +| `resource_version` | VARCHAR | The resource version as a string. This is the same as the MetaEd project version the entity is defined in, e.g. "3.3.1-b" for a 3.3b data standard entity. 
| +| `is_descriptor` | Boolean | Indicator | +| `validated` | Boolean | Indicator | +| `edfi_doc` | JSONB | The Ed-Fi ODS/API document itself. | +| `createdBy` | VARCHAR(100) | Name/ID of the client who created the record, for authorization usage. | + +#### Indexes + +* On `edfi_doc` as a GIN jsonb\_path\_ops index - for query support in standalone deployment +* On project\_name & entity\_type & entity\_version - for query support in standalone deployment + * Maybe separate b-tree index, maybe multi-column GIN with api\_doc. See [https://pganalyze.com/blog/gin-index](https://pganalyze.com/blog/gin-index) + +### **References** Table + +This implementation will also use a reference table for reference validation. + +#### Columns + +| Column Name | Data Type | Description | +| --- | --- | --- | +| `id` | bigserial | Synthetic primary key | +| parent\_document\_id | varying | The parent document's `id` (~ *foreign key*) | +| reference\_document\_id | varying | The child document's `id` (~ *document's natural key*) | + +### **Existence** Table + +This implementation will also use an existence table for validation. The existence table provides a way for documents that have a superclass/subclass relationship (e.g. education organizations such as School) to have multiple related document ids. + +#### Columns + +| Column Name | Data Type | Description | +| --- | --- | --- | +| `id` | bigserial | Synthetic primary key | +| document\_id | varying | The child document's `id` (~ *document's natural key*) | +| existence\_id | varying | The id that the document can also be identified as | + +> [!TIP] +> +> Potential addition: +> +> | | | | +> | --- | --- | --- | +> | document\_location | varying | JSONPath expression to the external reference in the parent document | +> Might be useful in API response metadata? + +#### Indexes + +Need to be able to look up references in both directions: + +* `reference_to` - e.g.
when trying to delete a resource, determine if there are any external references to it +* `reference_from` - e.g. when trying to delete a resource, also delete its own references to a "parent" resource. + +### Data Processing + +#### Insert Transaction Steps + +Inserting a new Entity document into the table will follow these steps: + +* Check that id does not exist in Entity (indexed query) +* Check that external reference ids for the document all exist in Entity (index query per reference) +* Perform insert of document into Entity +* Perform insert of external references into Reference +* Perform insert of external references into Existence +* Perform insert of superclass references into Existence +* Note: PostgreSQL has upsert support, but we may need to know if the outcome was insert or update to return the correct API response. + +#### Update Transaction Steps + +Updating an existing Entity document in the table will follow these steps: + +* Check that id exists in Entity (indexed query) +* Check that external reference ids for the document all exist in Entity (index query per reference) +* Perform update of document in Entity +* Perform replacement of prior external references in Reference (delete all old + insert) +* Perform replacement of prior external references in Existence (delete all old + insert) +* Note: PostgreSQL has upsert support, but we may need to know if the outcome was insert or update to return the correct API response. + +#### Delete Transaction Steps + +Deleting an existing Entity document from the table will follow these steps: + +* Check that id exists in Entity (indexed query) +* Check that there are no external references in Existence for this id (indexed query) +* Perform delete + +#### Queries + +A PostgreSQL installation will operate in one of two modes. In standalone mode, get all and get-by-key queries will be done directly on PostgreSQL by project\_name/entity\_type/entity\_version plus the GIN-indexed api\_doc.
In "normal" mode, get all and get-by-key queries will be serviced by OpenSearch/Elasticsearch via CDC streaming. + +## Open Issues + +Need a partitioning / sharding paradigm for large databases. See [https://www.percona.com/blog/2019/05/24/an-overview-of-sharding-in-postgresql-and-how-it-relates-to-mongodbs/](https://www.percona.com/blog/2019/05/24/an-overview-of-sharding-in-postgresql-and-how-it-relates-to-mongodbs/) + +## Alternative Design + +An alternative design would be to create separate tables for each resource, with [indexes](https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING) on each queryable field. PostgreSQL could then serve as a single engine for all API CRUD requests, without the need for OpenSearch. + +The development team has not explored this in detail at this time. + +> [!WARNING] +> This document is for discussion and general guidance. The implementation may vary as needed. The development team will endeavor to keep this document up-to-date, though working software remains a higher priority than comprehensive documentation. diff --git a/docs/meadowlark-data-storage-design/meadowlark-referential-integrity-in-document-databases.md b/docs/meadowlark-data-storage-design/meadowlark-referential-integrity-in-document-databases.md new file mode 100644 index 00000000..51af1d33 --- /dev/null +++ b/docs/meadowlark-data-storage-design/meadowlark-referential-integrity-in-document-databases.md @@ -0,0 +1,437 @@ +# Meadowlark - Referential Integrity in Document Databases + +## Introduction + +Relational databases have robust support for relationship management. Naturally 😉. And document databases do not. This article reviews the steps taken in project Meadowlark to ensure a high level of referential integrity for data submitted to the API. 
+ +## Scenarios + +The following diagram is from the Ed-Fi Data Model 3.3 documentation on the [Survey Domain](https://edfi.atlassian.net/wiki/spaces/EFDS33/pages/26968421/Survey+Domain+-+Model+Diagrams), simplified to show a single relationship: a Survey has an associated Session. This will be our exemplar that stands in for many different situations in the Ed-Fi Data Model. + +![](./attachments/image2022-7-5_14-48-34.png) + +One of the key concepts behind the Ed-Fi system is to ensure a high degree of data validity. Aside from enforcing some basic type constraints (e.g. not submitting a SchoolYear "202eeee"), the primary validation rule is to ensure that related entities exist and to prevent removing one entity when another entity references it. Within Meadowlark, because of the NoSQL database design, these entities take the form of JSON *documents,* which are analogous to the *records* found in a traditional relational database. + +Relational databases enforce this reference validation through foreign keys. Document databases do not – at least, not traditionally, or as a general pattern. Thus Meadowlark must have custom code to account for the following situations: + +1. **Create a Survey**: does the session exist? No: respond with status 400 bad request and do not save the document. Yes: save the document. +2. **Update a Survey**: does the session exist? No: respond with status 400 bad request and do not save the document. Yes: replace the document. +3. **Delete a Session**: are there any Surveys that reference this session? Yes: respond with status 409 conflict. No: delete the document. + +Imagine the following sequence of actions being taken virtually simultaneously by two different API clients: + +| Client one | Client two | +| --- | --- | +| begin transaction | | +| delete document A | begin transaction | +| | if document A exists:<br>&nbsp;&nbsp;&nbsp;&nbsp;save document B<br>else:<br>&nbsp;&nbsp;&nbsp;&nbsp;error | +| commit | commit | + +How do we accomplish this safely? + +## Referential Integrity Pattern + +The basic pattern that Meadowlark employs for reference validation is to convert the identity and reference portions of a document into document ids. Reference validation then consists of simple document id lookups, which are possible even in key/value-like datastores. However, in order to maintain consistency during validation a datastore must support ACID transactions. It must also provide a way to lock documents on reads in a transaction such that Meadowlark upserts, updates and deletes can stay consistent with those referenced documents. + +## Solutions + +### DynamoDB + +DynamoDB provides ACID transactions. It also provides transaction "condition checks" that are like read locks, but have limitations on what can be checked. It turns out that condition checks are too limited to support consistent delete operations with the Meadowlark pattern. Instead, it would require a more generalized read locking behavior. + +There is no built-in solution for this in DynamoDB. There is a [Java client](https://aws.amazon.com/blogs/database/building-distributed-locks-with-the-dynamodb-lock-client/) that introduces pessimistic offline locking support - in other words, a client locks a record by updating a `lock` column with a unique value. Other clients can't access that record until the `lock` column is cleared. This *might* be sufficient to support the Meadowlark pattern. However, the development team is not going to develop the JavaScript code to investigate or support this. This may need to be a full replacement for native DynamoDB transactions, and the performance implications are unclear. + +> [!CAUTION] +> Therefore DynamoDB will be removed from Meadowlark release 0.2.0. + +### PostgreSQL (using non-relational pattern) + +PostgreSQL has built-in mechanisms for [explicitly locking a record](https://www.postgresql.org/docs/current/explicit-locking.html).
These can be used either to prevent a competing DELETE from occurring in the middle of an INSERT or UPDATE transaction, or to prevent an INSERT or UPDATE from occurring during a previously-started DELETE transaction. + +The article [Selecting for Share and Update in PostgreSQL](https://shiroyasha.io/selecting-for-share-and-update-in-postgresql.html) does a nice job of explaining two of the lock modes: `select for share`  and `select for update`. + +With a plain select statement by client two, the save will proceed... which is NOT what we want to happen. To resolve that, we can append `for share nowait` at the end of the select statement. This will have the effect of locking document A against updates momentarily, without locking out *reads* by any other client. If client one has already issued a delete statement, however, then the lock will fail. The `nowait`  keyword tells the database engine to fail immediately, rather than wait for client one's lock to be released. + +The following code demonstrates the desired behavior: + +```javascript +const pg = require('pg'); +const { exit } = require('process'); + +const dbConfiguration = { + host: process.env.POSTGRES_HOST ?? 'localhost', + port: Number(process.env.POSTGRES_PORT ?? 5432), + user: process.env.POSTGRES_USER, + password: process.env.POSTGRES_PASSWORD, + database: process.env.MEADOWLARK_DATABASE_NAME ?? 
'meadowlark', +}; + +const parentId = `parent${Math.random() * 100}`; +const referencingDocumentId = `reference${Math.random() * 1000}`; + +async function RunTest() { + const clientOne = new pg.Client(dbConfiguration); + await clientOne.connect(); + const clientTwo = new pg.Client(dbConfiguration); + await clientTwo.connect(); + + // Create sample records + const insertParent = ` + insert into meadowlark.documents (document_id, document_identity, project_name, resource_name, resource_version, is_descriptor, validated, edfi_doc) + values ('${parentId}', '{}', 'edfi', 'test', '3.3b', False, True, '{}'); + `; + await clientOne.query('begin'); + await clientOne.query(insertParent); + await clientOne.query('commit'); + + // Issue a delete statement _without_ committing the transaction + const deleteParent = `delete from meadowlark.documents where document_id = '${parentId}';`; + await clientOne.query('begin'); + await clientOne.query(deleteParent); + + // And now in a separate client, try to insert a doc that references the parent + const referenceCheck = ` + select id from meadowlark.documents where document_id = '${parentId}'; + `; + const insertReference = ` + insert into meadowlark.documents (document_id, document_identity, project_name, resource_name, resource_version, is_descriptor, validated, edfi_doc) + values ('${referencingDocumentId}', '{}', 'edfi', 'test', '3.3b', False, True, '{}'); + insert into meadowlark.references (parent_document_id, referenced_document_id) + values ('${parentId}','${referencingDocumentId}');`; + try { + await clientTwo.query('begin'); + const res = await clientTwo.query(referenceCheck); + if (res.rows.length === 0) { + console.info('no record found! an API would return 400 due to missing parent.'); + exit(); + } + + await clientTwo.query(insertReference); + await clientTwo.query('commit'); + } catch (error) { + console.info('unexpected failure on initial insert of reference'); + console.error(error); + } + + // That didn't fail! 
Try now using the SELECT ... FOR SHARE approach. "nowait" is essential here; if you remove it, then + // clientTwo will wait for clientOne to finish. + const selectForShare = ` + select id from meadowlark.documents where document_id = '${parentId}' for share nowait; + insert into meadowlark.documents (document_id, document_identity, project_name, resource_name, resource_version, is_descriptor, validated, edfi_doc) + values ('${referencingDocumentId}', '{}', 'edfi', 'test', '3.3b', False, True, '{}'); + insert into meadowlark.references (parent_document_id, referenced_document_id) + values ('${parentId}','${referencingDocumentId}');`; + + try { + await clientTwo.query('begin'); + await clientTwo.query(selectForShare); + } catch (error) { + console.info('EXPECTED failure on second insert reference'); + console.info(error); + } + + // End both open transactions; clientTwo's is already aborted, so its commit acts as a rollback + await clientOne.query('commit'); + await clientTwo.query('commit'); + + // Cleanup + await clientOne.end(); + await clientTwo.end(); +} + +RunTest().finally(() => console.log('Done')); +``` + +### MongoDB + +MongoDB supports ACID transactions but does not have native lock-on-read support. However, there is a common pattern used to simulate this by updating documents using a randomly-generated lock field. See the MongoDB blog post [How to SELECT...FOR UPDATE Inside MongoDB Transactions](https://www.mongodb.com/blog/post/how-to-select--for-update-inside-mongodb-transactions) for details. + +By implementing the lock field pattern, MongoDB transactions will fail with a `WriteConflict` if, for example, a document read in one transaction is deleted in another. These transactions can then be retried or reported back to the client as a conflict, as appropriate. + +The following unit test code demonstrates the desired behavior: + +```typescript +// SPDX-License-Identifier: Apache-2.0 +// Licensed to the Ed-Fi Alliance under one or more agreements. 
+// The Ed-Fi Alliance licenses this file to you under the Apache License, Version 2.0. +// See the LICENSE and NOTICES files in the project root for more information. + +import { + DocumentInfo, + NoDocumentInfo, + newDocumentInfo, + newSecurity, + documentIdForDocumentInfo, + DocumentReference, + UpsertRequest, + NoResourceInfo, + ResourceInfo, + newResourceInfo, +} from '@edfi/meadowlark-core'; +import { ClientSession, Collection, MongoClient, ObjectId } from 'mongodb'; +import { MeadowlarkDocument, meadowlarkDocumentFrom } from '../../../src/model/MeadowlarkDocument'; +import { getCollection, getNewClient } from '../../../src/repository/Db'; +import { + validateReferences, + asUpsert, + onlyReturnExistenceIds, + onlyDocumentsReferencing, + onlyReturnId, +} from '../../../src/repository/ReferenceValidation'; +import { upsertDocument } from '../../../src/repository/Upsert'; + +jest.setTimeout(10000); + +// A bunch of setup stuff +const newUpsertRequest = (): UpsertRequest => ({ + id: '', + resourceInfo: NoResourceInfo, + documentInfo: NoDocumentInfo, + edfiDoc: {}, + validate: false, + security: { ...newSecurity() }, + traceId: 'traceId', +}); + +const schoolResourceInfo: ResourceInfo = { + ...newResourceInfo(), + resourceName: 'School', +}; + +const schoolDocumentInfo: DocumentInfo = { + ...newDocumentInfo(), + documentIdentity: [{ name: 'schoolId', value: '123' }], +}; +const schoolDocumentId = documentIdForDocumentInfo(schoolResourceInfo, schoolDocumentInfo); + +const referenceToSchool: DocumentReference = { + projectName: schoolResourceInfo.projectName, + resourceName: schoolResourceInfo.resourceName, + documentIdentity: schoolDocumentInfo.documentIdentity, + isDescriptor: false, +}; + +const academicWeekResourceInfo: ResourceInfo = { + ...newResourceInfo(), + resourceName: 'AcademicWeek', +}; +const academicWeekDocumentInfo: DocumentInfo = { + ...newDocumentInfo(), + documentIdentity: [ + { name: 'schoolId', value: '123' }, + { name: 'weekIdentifier', value: 
'1' }, + ], + documentReferences: [referenceToSchool], +}; +const academicWeekDocumentId = documentIdForDocumentInfo(academicWeekResourceInfo, academicWeekDocumentInfo); + +const academicWeekDocument: MeadowlarkDocument = meadowlarkDocumentFrom( + academicWeekResourceInfo, + academicWeekDocumentInfo, + academicWeekDocumentId, + {}, + true, + '', +); + +describe('given a delete document transaction concurrent with an insert document referencing the delete - without a read for write lock ', () => { + let client: MongoClient; + + beforeAll(async () => { + client = (await getNewClient()) as MongoClient; + const mongoCollection: Collection = getCollection(client); + + // Insert a School document - it will be referenced by an AcademicWeek document while being deleted + await upsertDocument({ ...newUpsertRequest(), id: schoolDocumentId, documentInfo: schoolDocumentInfo }, client); + + // ---- + // Start transaction to insert an AcademicWeek - it references the School which will interfere with the School delete + // ---- + const upsertSession: ClientSession = client.startSession(); + upsertSession.startTransaction(); + + // Check for reference validation failures on AcademicWeek document - School is still there + const upsertFailures = await validateReferences( + academicWeekDocumentInfo.documentReferences, + [], + academicWeekDocument.outRefs, + mongoCollection, + upsertSession, + '', + ); + + // Should be no reference validation failures for AcademicWeek document + expect(upsertFailures).toHaveLength(0); + + // ---- + // Start transaction to delete the School document - it interferes with the AcademicWeek insert referencing the School + // ---- + const deleteSession: ClientSession = client.startSession(); + deleteSession.startTransaction(); + + // Get the existenceIds for the School document, used to check for references to it as School or as EducationOrganization + const deleteCandidate: any = await mongoCollection.findOne( + { _id: schoolDocumentId }, + 
onlyReturnExistenceIds(deleteSession), + ); + + // Check for any references to the School document + const anyReferences = await mongoCollection.findOne( + onlyDocumentsReferencing(deleteCandidate.existenceIds), + onlyReturnId(deleteSession), + ); + + expect(anyReferences).toBeNull(); + + // Delete the School document + const { deletedCount } = await mongoCollection.deleteOne({ _id: schoolDocumentId }, { session: deleteSession }); + + expect(deletedCount).toBe(1); + + // ---- + // End transaction to delete the School document + // ---- + await deleteSession.commitTransaction(); + + // Perform the insert of AcademicWeek document + const { upsertedCount } = await mongoCollection.replaceOne( + { _id: academicWeekDocumentId }, + academicWeekDocument, + asUpsert(upsertSession), + ); + + // **** The insert of AcademicWeek document should NOT have been successful - but was + expect(upsertedCount).toBe(1); + + // ---- + // End transaction to insert the AcademicWeek document + // ---- + await upsertSession.commitTransaction(); + }); + + afterAll(async () => { + await getCollection(client).deleteMany({}); + await client.close(); + }); + + it('deleted the School document in the db anyway, this is a failed reference validation implementation!', async () => { + const collection: Collection = getCollection(client); + const result: any = await collection.findOne({ _id: schoolDocumentId }); + expect(result).toBeNull(); + }); +}); + +describe('given a delete concurrent with an insert referencing the to-be-deleted document - using read lock scheme', () => { + let client: MongoClient; + + beforeAll(async () => { + client = (await getNewClient()) as MongoClient; + const mongoDocuments: Collection = getCollection(client); + + // Insert a School document - it will be referenced by an AcademicWeek document while being deleted + await upsertDocument({ ...newUpsertRequest(), id: schoolDocumentId, documentInfo: schoolDocumentInfo }, client); + + // ---- + // Start transaction to insert an AcademicWeek - 
it references the School which will interfere with the School delete + // ---- + const upsertSession: ClientSession = client.startSession(); + upsertSession.startTransaction(); + + // Check for reference validation failures on AcademicWeek document - School is still there + const upsertFailures = await validateReferences( + academicWeekDocumentInfo.documentReferences, + [], + academicWeekDocument.outRefs, + mongoDocuments, + upsertSession, + '', + ); + + // Should be no reference validation failures for AcademicWeek document + expect(upsertFailures).toHaveLength(0); + + // ***** Read-for-write lock the validated referenced documents in the insert + // see https://www.mongodb.com/blog/post/how-to-select--for-update-inside-mongodb-transactions + // Awaited so that the lock is written before the delete transaction begins + await mongoDocuments.updateMany( + { existenceIds: { $in: academicWeekDocument.outRefs } }, + { $set: { lock: new ObjectId() } }, + { session: upsertSession }, + ); + + // ---- + // Start transaction to delete the School document - interferes with the AcademicWeek insert referencing the School + // ---- + const deleteSession: ClientSession = client.startSession(); + deleteSession.startTransaction(); + + // Get the existenceIds for the School document, used to check for references to it as School or as EducationOrganization + const deleteCandidate: any = await mongoDocuments.findOne( + { _id: schoolDocumentId }, + onlyReturnExistenceIds(deleteSession), + ); + + // Check for any references to the School document + const anyReferences = await mongoDocuments.findOne( + onlyDocumentsReferencing(deleteCandidate.existenceIds), + onlyReturnId(deleteSession), + ); + + // Delete transaction sees no references yet, though we are about to add one + expect(anyReferences).toBeNull(); + + // Perform the insert of AcademicWeek document, adding a reference to the to-be-deleted document + const { upsertedCount } = await mongoDocuments.replaceOne( + { _id: academicWeekDocumentId }, + academicWeekDocument, + asUpsert(upsertSession), + ); + + // **** 
The insert of AcademicWeek document should have been successful + expect(upsertedCount).toBe(1); + + // ---- + // End transaction to insert the AcademicWeek document + // ---- + await upsertSession.commitTransaction(); + + // Try deleting the School document - should fail thanks to AcademicWeek's read-for-write lock + try { + await mongoDocuments.deleteOne({ _id: schoolDocumentId }, { session: deleteSession }); + } catch (e) { + expect(e).toMatchInlineSnapshot(`[MongoServerError: WriteConflict]`); + } + + // ---- + // End transaction to delete the School document + // ---- + await deleteSession.abortTransaction(); + }); + + afterAll(async () => { + await getCollection(client).deleteMany({}); + await client.close(); + }); + + it('should still have the School document in the db - a success', async () => { + const collection: Collection = getCollection(client); + const result: any = await collection.findOne({ _id: schoolDocumentId }); + expect(result.documentIdentity).toHaveLength(1); + expect(result.documentIdentity[0].name).toBe('schoolId'); + expect(result.documentIdentity[0].value).toBe('123'); + }); +}); +``` + +## Downstream Data Storage + +Downstream data stores, including OpenSearch and the filesystem ("data lake"), could theoretically become out of sync in this event-driven architecture because of: + +1. A network error +2. A bug in the event handler code +3. A faulty downstream service + +Further research is needed on patterns for detecting and correcting these situations. + +This is technically an eventual consistency problem, but it could have the same effect as a referential integrity error. 
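
One possible starting point for detection is a periodic reconciliation that compares document ids in the primary store with those in a downstream store. The following is only a minimal sketch of that idea, not something Meadowlark implements: `findDrift` and `DriftReport` are hypothetical names, and the sketch assumes both id lists fit in memory.

```typescript
// Hypothetical drift check between the primary store and one downstream store.
// None of these names exist in the Meadowlark code base.

type DriftReport = {
  // Ids present in the primary store but absent downstream (missed events)
  missingDownstream: string[];
  // Ids present downstream but absent from the primary store (missed deletes)
  orphanedDownstream: string[];
};

// Compares two id lists. The nested indexOf scan is quadratic, which is
// acceptable for a sketch but not for large data sets.
function findDrift(primaryIds: string[], downstreamIds: string[]): DriftReport {
  const missingDownstream = primaryIds.filter((id) => downstreamIds.indexOf(id) === -1);
  const orphanedDownstream = downstreamIds.filter((id) => primaryIds.indexOf(id) === -1);
  return { missingDownstream, orphanedDownstream };
}
```

A real implementation would more likely page through both stores or compare checksums per partition; correcting the drift (replaying events or re-indexing) is a separate question that this sketch does not address.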
diff --git a/docs/meadowlark-operations/meadowlark-api-docker-support.md b/docs/meadowlark-operations/meadowlark-api-docker-support.md new file mode 100644 index 00000000..74e2b3d1 --- /dev/null +++ b/docs/meadowlark-operations/meadowlark-api-docker-support.md @@ -0,0 +1,19 @@ +# Meadowlark API Docker Support + +## Docker Image + +With help from the [snyk blog](https://snyk.io/blog/choosing-the-best-node-js-docker-image/), the team has selected the Debian bullseye "slim" base image to optimize size and security of the image. From there, the Dockerfile installs minimal build tools, copies the source code into the image, and runs a build directly inside the image. Then it creates a final layer that is based on the original layer, and thus does not have the build tools. + +## Docker Compose for Local Testing + +For local testing, the source code repository contains a docker-compose file that will start up and connect all images required to run a complete MongoDB-based "deploy" of Meadowlark. This is not recommended for production use, as it is not properly secured with HTTPS and does not have a reverse-proxy to protect the Node.js-based API. + +For more information on this solution, see the DOCKER.md file in the repository. + +## Cloud-Based Hosting + +As of 02 Feb 2023, the development team has not yet done any work to try to run the Docker image on a cloud provider. Rough notes of the way forward: + +* The image is not at present being built on Docker Hub. We might add this in the future. Anyone wanting to use it is advised to create the Meadowlark image directly in the cloud provider's Container Registry. +* The Meadowlark API uses Node.js for serving HTTP content. In a production scenario - which means, ultimately, in *all* environments, since they should have the same topology - this Node.js service should be sitting behind a reverse proxy / gateway. Details depend on the installation. 
Many installations will choose to use the cloud provider's load balancing solution for external access to the API, and for HTTPS termination. In some cases it may also be appropriate to have a reverse proxy *inside the container network* so that the Node.js port is not directly exposed to the outside world. These are implementation and security details that the Alliance development team can discuss, but they do not have the expertise to provide the best advice. +* Any implementation will need to decide whether or not to use managed services for the database backends, or whether to host them directly in the container ecosystem. There will be cost and management implications either way, and again this is beyond the development team's expertise. diff --git a/docs/meadowlark-operations/meadowlark-loading-descriptors.md b/docs/meadowlark-operations/meadowlark-loading-descriptors.md new file mode 100644 index 00000000..9b536295 --- /dev/null +++ b/docs/meadowlark-operations/meadowlark-loading-descriptors.md @@ -0,0 +1,8 @@ +# Meadowlark - Loading Descriptors + +As of Meadowlark releases 0.2.0 and 0.3.0, there are two mechanisms for quickly loading the default descriptor sets into a running API: + +1. Make an HTTP call to `http://localhost:3000/local/loadDescriptors` ; this will load all descriptors through an internal operation, without any additional HTTP calls. +2. Open one of the "Invoke-Load?.ps1" PowerShell scripts in the `eng`  directory; comment out the last line of the script so that only descriptors run. This uses the ODS/API's dotnet-based client side bulk loader utility to open the descriptor XML files and load the resources one-by-one through the API. This is essentially how the ODS/API's minimal template is populated. + +Option 1 is not in the long-term plans. It was a short-term solution that saw us bundle all of the Data Standard 3.3.1-a descriptor XML files directly into the repository, as part of the `meadowlark-core`  library. 
That is not a scalable solution. Option 2 became available when a member of the Ed-Fi tech team manually created a NuGet package of those same XML files and published it in the Alliance's NuGet repository on MyGet. The long-term plan is to automate the process of bundling the descriptor XML and the rest of the Grand Bend sample XML files into NuGet packages. diff --git a/docs/meadowlark-operations/meadowlark-running-performance-tests.md b/docs/meadowlark-operations/meadowlark-running-performance-tests.md new file mode 100644 index 00000000..6892ec8f --- /dev/null +++ b/docs/meadowlark-operations/meadowlark-running-performance-tests.md @@ -0,0 +1,89 @@ +# Meadowlark - Running Performance Tests + +## Tips + +* Run on an isolated environment that does not have anti-virus or other processes running. +* Start with a fresh slate - no records in the databases (both MongoDB and OpenSearch). +* To isolate the performance with MongoDB, disable the OpenSearch listener. This will only be useful with the Bulk Upload tests, because the other tests will try to run "get all" queries that will fail without OpenSearch data. + +## Bulk Upload Data + +These utilities provide a repeatable, static test of data upload performance, which can be compared against the ODS/API. The tests can be executed using scripts in the `eng` directory in the Meadowlark repository. + +* `Invoke-LoadGrandBend.ps1` - load the entire Grand Bend dataset, aka "populated template" + + > [!TIP] + > To load only the descriptors, open that script file and comment out the last line, which writes all of the Grand Bend data. + +* `Invoke-LoadPartialGrandBend.ps1` - load a small portion of Grand Bend, including all descriptors and education organizations. + +In PowerShell, you can measure the time by wrapping the invocation with `Measure-Command`, as shown below. At the end of the execution, the total time taken will be displayed at the console. 
+ +```shell +cd eng/ +Measure-Command { ./Invoke-LoadGrandBend.ps1 } +``` + +## Suite 3 Performance Tests + +The [Suite 3 Performance Test](https://github.com/Ed-Fi-Exchange-OSS/Suite-3-Performance-Testing) kit includes several test suites that can be useful with Meadowlark. + +* Paging Tests run through download of all resources using different page sizes. To be meaningful, you should load the Grand Bend data set. Make sure that the OpenSearch listener is on when running the bulk upload, otherwise OpenSearch will not have any data to return. +* Pipeclean tests run POST, GET, and PUT operations on the API, across all resources. + * As of 02 Mar 2023 , use branch [PERF-286](https://github.com/Ed-Fi-Exchange-OSS/Suite-3-Performance-Testing/tree/PERF-286) because it has many corrections to allow the suite to work with Meadowlark. ![(warning)](https://edfi.atlassian.net/wiki/s/695013191/6452/be943731e17d7f4a2b01aa3e67b9f29c0529a211/_/images/icons/emoticons/warning.png) + + There are a handful of known errors at this time. +* Volume tests run a heavy load of POST, PUT, and DELETE operations (with a few GET operations), and they log timing information. ![(warning)](https://edfi.atlassian.net/wiki/s/695013191/6452/be943731e17d7f4a2b01aa3e67b9f29c0529a211/_/images/icons/emoticons/warning.png) + + Not yet functional for Meadowlark, likely needs some of the same corrections made in the pipeclean tests. + +### Preparing for Suite 3 Tests + +* Set environment begin / end years as 1991 to 2050 +* Create a Host type key and secret for easy access to all resources, and put those into the test project's `.env`  file. + +### Known Problems + +As of 02 Mar 2023. This is not a complete list. 
+ +```none +PUT     /v3.3b/ed-fi/classPeriods/{id}                          48      HTTPError('400 Client Error: Bad Request for url: /v3.3b/ed-fi/classPeriods/{id}') +PUT http://localhost:3000/local/v3.3b/ed-fi/classPeriods/byQwXeKVqZJuzZSfR8SkjeBN-LLUZekzLIEJQQ - RESPONSE CODE: 400 : {"error":"The identity of the +resource does not match the identity in the updated document."} +``` + + The PUT request tried to change the natural key and it was denied. However, \`classPeriod\` is supposed to allow natural key +changes. Should retest this after merging RND-442, which makes some changes in the way that resources are matched. + +Similar: gradeBookEntry + +* * * + +```none +POST    /v3.3b/ed-fi/educationContents                          58      HTTPError('400 Client Error: Bad Request for url: /v3.3b/ed-fi/educationContents') +POST http://localhost:3000/local/v3.3b/ed-fi/educationContents - RESPONSE CODE: 400 : {"error":[{"message":"{requestBody} must have required property 'learningResourceMetadataURI'","path":"{requestBody}","context":{"errorType":"required"}},{"message":"{requestBody} must have required property 'shortDescription'","path":"{requestBody}","context":{"errorType":"required"}},{"message":"{requestBody} must have required property 'contentClassDescriptor'","path":"{requestBody}","context":{"errorType":"required"}}]} +``` + +\`learningResourceMetadataURI\` *should* be required, according to the model. The ODS/API is not requiring it, and the MetaEd language definition makes it questionable what we should do with it. I have proposed changing MetaEd to reflect the "not required" state. Also requires \`contentClassDescriptor\` and  \`shortDescription\`. + +* * * + +Descriptor \`codeValue\` can be updated in ODS/API and it cascades. We need to standardize that. 
+ +* * * + +```none +POST    /v3.3b/ed-fi/reportCards                                34      HTTPError('400 Client Error: Bad Request for url: /v3.3b/ed-fi/reportCards') +POST http://localhost:3000/local/v3.3b/ed-fi/reportCards - RESPONSE CODE: 400 : {"error":[{"message":"'gpaGivenGradingPeriod' property is not expected to be here","suggestion":"Did you mean property 'gPAGivenGradingPeriod'?","path":"{requestBody}","context":{"errorType":"additionalProperties"}}]} +``` + +Same as the known iEP problem + +* * * + +```none +GET     /v3.3b/ed-fi/schoolYearTypes                            10      HTTPError('404 Client Error: Not Found for url: /v3.3b/ed-fi/schoolYearTypes') +GET http://localhost:3000/local/v3.3b/ed-fi/schoolYearTypes - RESPONSE CODE: 404 : {"error":"Invalid resource 'schoolYearTypes'. The most similar resource is 'schoolTypeDescriptors'."} +``` + +We didn't implement \`schoolYearTypes\` as an endpoint because it isn't in the Data Standard... but it \_is\_ in the API specification. 
So we \_should\_ implement it, reading from the environment variables diff --git a/docs/meadowlark-releases/attachments/image2022-5-9_12-36-34.png b/docs/meadowlark-releases/attachments/image2022-5-9_12-36-34.png new file mode 100644 index 00000000..25e0118e Binary files /dev/null and b/docs/meadowlark-releases/attachments/image2022-5-9_12-36-34.png differ diff --git a/docs/meadowlark-releases/attachments/image2022-7-1_16-28-0.png b/docs/meadowlark-releases/attachments/image2022-7-1_16-28-0.png new file mode 100644 index 00000000..13e2155a Binary files /dev/null and b/docs/meadowlark-releases/attachments/image2022-7-1_16-28-0.png differ diff --git a/docs/meadowlark-releases/attachments/image2022-7-21_16-23-32.png b/docs/meadowlark-releases/attachments/image2022-7-21_16-23-32.png new file mode 100644 index 00000000..3ddb7d7e Binary files /dev/null and b/docs/meadowlark-releases/attachments/image2022-7-21_16-23-32.png differ diff --git a/docs/meadowlark-releases/attachments/image2022-7-21_16-54-41.png b/docs/meadowlark-releases/attachments/image2022-7-21_16-54-41.png new file mode 100644 index 00000000..289f2ec5 Binary files /dev/null and b/docs/meadowlark-releases/attachments/image2022-7-21_16-54-41.png differ diff --git a/docs/meadowlark-releases/attachments/image2022-7-21_16-59-4.png b/docs/meadowlark-releases/attachments/image2022-7-21_16-59-4.png new file mode 100644 index 00000000..bae22f0c Binary files /dev/null and b/docs/meadowlark-releases/attachments/image2022-7-21_16-59-4.png differ diff --git a/docs/meadowlark-releases/attachments/image2022-7-6_13-24-27.png b/docs/meadowlark-releases/attachments/image2022-7-6_13-24-27.png new file mode 100644 index 00000000..00614dc7 Binary files /dev/null and b/docs/meadowlark-releases/attachments/image2022-7-6_13-24-27.png differ diff --git a/docs/meadowlark-releases/attachments/ml-arch-1.png b/docs/meadowlark-releases/attachments/ml-arch-1.png new file mode 100644 index 00000000..fd6c8798 Binary files /dev/null and 
b/docs/meadowlark-releases/attachments/ml-arch-1.png differ diff --git a/docs/meadowlark-releases/meadowlark-010.md b/docs/meadowlark-releases/meadowlark-010.md new file mode 100644 index 00000000..67fe68c0 --- /dev/null +++ b/docs/meadowlark-releases/meadowlark-010.md @@ -0,0 +1,135 @@ +# Meadowlark 0.1.0 + +## Goals and Design + +Overarching goals: basic feature parity with the ODS/API + +**Source code: [https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/tree/0.1.0](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/tree/0.1.0)** + +| ![(tick)](https://edfi.atlassian.net/wiki/s/695013191/6452/be943731e17d7f4a2b01aa3e67b9f29c0529a211/_/images/icons/emoticons/check.png) Implemented | ![(error)](https://edfi.atlassian.net/wiki/s/695013191/6452/be943731e17d7f4a2b01aa3e67b9f29c0529a211/_/images/icons/emoticons/error.png) Not Implemented | +| --- | --- | +| * Data Model 3.1 and 3.3b<br>* ... and easy to support others<br>* Ownership-based authorization<br>* Validates authentication token<br>* Authorizes client to access data that were written by the client ("ownership")<br>* Optional foreign key enforcement<br>* With HTTP header, can disable foreign key validation | * Authentication<br>* oauth endpoint is fake, returning hard-coded tokens for demo purposes<br>* Cascading deletes<br>* Extensions<br>* Composites<br>* Profiles | + +## Architecture + +![](./attachments/ml-arch-1.png) + +Some of the interesting challenges / learning opportunities with this architecture: + +* Building a NoSQL structure and code (specifically, with DynamoDB) that supports the foreign key concept +* Rethinking authorization: simplifying by only adopting ownership-based authorization +* Right database for the job, using both DynamoDB and OpenSearch together + +> [!INFO] +> The development team could just as easily have picked Azure or Google Cloud. There was no reason to favor one platform over the other - the team just picked a platform and ran with it. + +## Introduction + +The initial goal of Meadowlark 0.1.0 was to achieve "API Parity" with the ODS/API: expose an API whose functionality matches that of the ODS/API to the extent that a client system would not know the difference. + +That goal was not met. As the development team progressed through the project, it found that: + +1. some of those existing API features may not have a place in a hypothetical future revision of the API, and +2. there were more interesting questions and topics to explore than strict API parity. + +Consequently, the gaps in parity can be looked at from two perspectives: purposeful divergence, and leftovers (scope cut). + +## Purposeful Divergence + +### Authorization + +When an API client connects to the ODS/API, which documents (records) are they allowed to access? The ODS/API has several authorization models, and they can get quite complex. Those models might truly be necessary. With Meadowlark, the team asked itself: what happens if we simplify - if you create it, you "own" it, and therefore you can access it? We call this "ownership authorization". + +Does this mean that there can be two different documents for the same student? For better or worse, yes. Think of a SIS and an Assessment both creating Student documents, each with a distinct StudentUniqueId. 
Advantage: Assessment provider does not need to look up the SIS's StudentUniqueId. Disadvantage: what if they're already doing that? + +This question of what authorization model(s) to support is one that the Alliance will need to continue refining with stakeholders. In the meantime, the Ownership Authorization sufficiently proves the point that Meadowlark *can* provide at least basic record-level authorization security. + +### Composites + +Aside from the Enrollments Composite, there has been little uptake of this ODS/API feature. The coding to reproduce this would be complex (as is the existing code), and there are likely other ways to achieve a similar goal for Enrollments. + +### Overposting + +The ODS/API allows submitting JSON payloads that have extra fields, not defined in the Data Standard. This was probably an undetected accident for a long period of time. Some API clients may in fact "break" if this feature is absent, because they have been sending unnecessary fields for a long time. Overposting [poses a small security risk](https://www.hanselman.com/blog/aspnet-overpostingmass-assignment-model-binding-security) and it is best to avoid supporting it. + +### Links + +The [API Specification](https://edfi.atlassian.net/wiki/display/EFDSRFC/Ed-Fi+RFC+16+-+Open+API+Spec) includes the concept of a "link", as shown in the following partial API response: + +```json + { + "id": "229855ca5b28450592fffd886e232479", + "schoolReference": { + "schoolId": 255901107, + "link": { + "rel": "School", + "href": "/ed-fi/schools/aa600640d35a418ea5bb51d32b148013" + } + }, +... +} +``` + +The development team suspects that few, if any, clients are actually using these links to navigate between documents, and therefore deliberately chose to exclude them. + +> [!WARNING] +> If Meadowlark were ever to evolve into a production-ready system, either it would need to add support for links, *or* the core API specification would have to evolve to drop the item. 
+ +## Leftovers + +These *might* be addressed in a future where Meadowlark evolves toward a production-ready product. Or, they might move over to the Purposeful Divergence category. + +### Profiles + +Profiles help the API host to limit the data exposure to clients by restricting which *fields* are returned with a document. This feature is heavily used in some settings, and deserves another look - at the right time. The programming model would need to change substantially in order to replicate this functionality. + +### Cascading Deletes + +Deleting an object automatically deletes all references to it as well. This is actually an "add-on" in the ODS/API that must be enabled manually. In Meadowlark, the coding would not be terribly difficult - but the effort does not support any specific research objective, and therefore it was left in the backlog as a low priority. + +### Authentication + +Ensuring that a client is who they claim to be is absolutely necessary. Doing so in a proof-of-concept... not so critical. OAuth2 is a widely supported industry standard at this point. For Meadowlark 0.1.0, the ownership-based authorization relies on a JSON Web Token (JWT), which is "created" from the authentication process. That process, in 0.1.0, accepts either of two hard-coded sets of client credentials and creates a hard-coded JWT. In other words, it is completely fake and unreliable authentication, but it does allow the software to have real authorization. + +If/when real authentication is brought in, it will likely use a third-party OAuth2 provider, unlike the ODS/API Suite 3. + +### Open API Metadata + +Generating Open API Metadata is an interesting topic in itself. With the Suite 3 technology, it comes out of the .NET-based code generation process. The development team is confident that it can create a plugin to generate this instead. 
For the 0.1.0 release, the metadata document from Suite 3 was saved as a JSON file, uploaded to cloud storage, and then imported into the Open API metadata HTTP handler at runtime as needed. This has the effect of making it *slightly* harder to add another data model to the Meadowlark code base. Well worth the sacrifice of scope. + +### SchoolYear / SchoolYearTypes + +A strange vestige of ODS/API Suite 2 crept into Suite 3: while most "Type" concepts were discarded, `SchoolYearType` was kept in the API instead of `SchoolYear`. The development team had a good reason for this. Your humble article author has heard it but fails to remember the details. Thus it is that the [Core Student Data API specification](https://edfi.atlassian.net/wiki/display/EFDSRFC/Ed-Fi+RFC+16+-+Open+API+Spec) defines a SchoolYearType, even though all entities that refer to it have a `SchoolYear`. Interestingly, neither is part of the Data Standard as a stand-alone entity. + +### ETAGS + +An ETAG is a useful "extension" to the data model: a unique hash value that changes when a document changes. It is likely desirable in a production-ready system, and would be easy to implement in Meadowlark. Example scenario timeline: + +1. Client A gets a record from the API, with etag ABC +2. Client B also gets that record +3. Client A prepares an update to the record +4. Client B prepares an update to the record *and* issues a PUT request to the API, updating the record +5. Client A issues their PUT request, with the old etag. The API rejects it, as the etag changed at step 4. 
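
The timeline above can be sketched in a few lines of TypeScript. This is an illustration under stated assumptions, not Meadowlark code: `EtagStore`, `computeEtag`, and `PutResult` are hypothetical names, the store is in-memory, the hash is a simple stand-in, and the sketch assumes the client echoes its etag back on PUT (as with an HTTP `If-Match` header), with a stale etag mapped to status 412 (Precondition Failed).

```typescript
// Hypothetical in-memory sketch of ETag-based optimistic concurrency.
// None of these names come from the Meadowlark code base.

type StoredDocument = { body: string; etag: string };

type PutResult =
  | { status: 204; etag: string } // update accepted, new etag returned
  | { status: 404 } // no such document
  | { status: 412 }; // precondition failed: the caller's etag is stale

// Simple 32-bit string hash, a stand-in for whatever hash a real system would use.
function computeEtag(body: string): string {
  let hash = 0;
  for (let i = 0; i < body.length; i += 1) {
    hash = (hash * 31 + body.charCodeAt(i)) >>> 0;
  }
  return hash.toString(16);
}

class EtagStore {
  private documents: { [id: string]: StoredDocument } = {};

  // Stores a new document and returns its etag.
  create(id: string, body: string): string {
    const etag = computeEtag(body);
    this.documents[id] = { body, etag };
    return etag;
  }

  // Returns the stored document with its current etag, if any.
  get(id: string): StoredDocument | undefined {
    return this.documents[id];
  }

  // Applies the update only if the caller's etag matches the current one.
  put(id: string, body: string, ifMatch: string): PutResult {
    const existing: StoredDocument | undefined = this.documents[id];
    if (existing === undefined) return { status: 404 };
    if (existing.etag !== ifMatch) return { status: 412 };
    const etag = computeEtag(body);
    this.documents[id] = { body, etag };
    return { status: 204, etag };
  }
}
```

In the scenario, Clients A and B both hold the etag returned by their GET. Client B's `put` succeeds and changes the stored etag, so Client A's later `put` with the old value is rejected and A must re-read the record before retrying.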
+ +## Demonstration + +```shell +# Get code +git clone https://github.com/Ed-Fi-Exchange-OSS/Meadowlark +cd Meadowlark +git checkout 0.1.0 +npm run install +cd Meadowlark-js + +# Run locally +npm run init:local-dynamodb +pushd docker-Meadowlark +docker compose up -d +popd +npm run start:local + +# In another terminal +npm run load-descriptors:local + +# See test/http/local* for various requests to run in Visual Studio Code +``` diff --git a/docs/meadowlark-releases/meadowlark-010/meadowlark-010-security.md b/docs/meadowlark-releases/meadowlark-010/meadowlark-010-security.md new file mode 100644 index 00000000..62a96c1f --- /dev/null +++ b/docs/meadowlark-releases/meadowlark-010/meadowlark-010-security.md @@ -0,0 +1,49 @@ +# Meadowlark 0.1.0 - Security + +# Authentication + +Because OAuth2 is the desired authentication process, and it is a well-known and well-supported protocol, completing a full-blown authentication integration is a low-value task for this project. + +To support authorization, there *is* a token endpoint with two hard-coded sets of client credentials. These will return a JSON Web Token (JWT) that should be used on all HTTP requests when authorization is enabled. The JWT is signed with a signing key, which must be provided via the `SIGNING_KEY` environment variable. + +| Key | Secret | +| --- | --- | +| meadowlark\_key\_2 | meadowlark\_secret\_2 | +| meadowlark\_key\_1 | meadowlark\_secret\_1 | + +# Authorization + +## Implemented + +The authorization requirement can be turned off with an environment variable, `ACCESS_TOKEN_REQUIRED` (true or false). When true, "ownership" based authorization is enforced. This authorization enforces that the client who creates a resource is the only one who can access that resource. For the purpose of this project, this is seen as a minimum viable product showing that a resource-level restriction can be applied on GET, UPDATE, and DELETE requests. 
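
As a rough illustration of the ownership rule described above, the check reduces to comparing the requesting client with the client recorded when the document was created. The sketch below is hypothetical and not taken from the Meadowlark source: `OwnedDocument`, `createdBy`, and `authorizeOwnership` are assumed names, and it assumes the client is identified by a single string taken from the token.

```typescript
// Hypothetical sketch of ownership-based authorization.
// Names and shapes are illustrative assumptions, not Meadowlark's actual types.

type OwnedDocument = {
  id: string;
  // Identifier of the client that created the document (assumed to come from the JWT)
  createdBy: string;
};

type Decision = 'allow' | 'deny';

// accessTokenRequired mirrors the ACCESS_TOKEN_REQUIRED setting: when false,
// authorization is switched off and every request is allowed.
function authorizeOwnership(
  document: OwnedDocument,
  clientId: string | undefined,
  accessTokenRequired: boolean,
): Decision {
  if (!accessTokenRequired) return 'allow';
  // No identified client means no token was presented: deny
  if (clientId === undefined) return 'deny';
  // Only the creating client may read, update, or delete the document
  return document.createdBy === clientId ? 'allow' : 'deny';
}
```

The same function would apply to read, update, and delete requests alike, since ownership does not distinguish between operations.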
+ +## Commentary + +The ODS API's main authorization pattern is based on establishing relationships from resources to education organizations – subclasses of EducationOrganization, or EdOrg for short. API clients are assigned one or more EdOrgs and a strategy that specifies CRUD permissions over API classes for which specific resources can be traced to one of these EdOrgs. + +This strategy is powerful and logical but also complex to implement. On the implementation side, each new authorization scheme needs to be driven by relational database views that materialize how each API resource can be traced to an EdOrg. Such views are custom code. + +This strategy has also created complexity for API clients. As noted above, the relationships that drive authorizations are opaque and not easily presented to an API client. This strategy also results in strange interaction scenarios, such as the fact that a client cannot read a Student or Parent resource the client just wrote (because it has no relation to an EdOrg yet). + +As noted above, this is not to say that the ODS API approach is wrong, but only that for some cases the complexity may not be justified. For example, in the case of a SIS client providing data to an API where the scope is a single LEA, these permissions probably suffice: + +* *For this particular API instance, your client has the ability to Create API resources for any of the following API classes:* *(list classes here)* +* *For any resource you write, your client can also Read, Update or Delete that same resource.* + +Implementing these rules is considerably simpler and demands no customized SQL or other materialized means to connect each resource to an EdOrg. + +Clearly, in the context in which data is being read out of the API the ODS EdOrg authorization pattern becomes potentially much more useful.  
But in many cases of data out (particularly early on), the scope of that authorization in field work still tends to be "all district data across these API resources for school year X". + +In summary, the ODS API pattern of using EdOrg relationships to drive authorization is powerful and worth preserving, but the Meadowlark project suggests that a set of simpler patterns might eliminate complexity from many early field projects. As an implementation advances in complexity, an API host may choose to enable more powerful and complex designs. + +# Infrastructure + +Because this is a research and development project, only minimal effort was put into securing the infrastructure. When deployed to Amazon, the team did try to keep access permissions to the minimum necessary to achieve the purpose. There was no attempt to minimize the database access permissions; for example, the Lambda functions can create tables and access all records. This may or may not be a desirable pattern in a real production system. 
+ +**Table of Contents** + +* [Authentication](#authentication) +* [Authorization](#authorization) + * [Implemented](#implemented) + * [Commentary](#commentary) +* [Infrastructure](#infrastructure) \ No newline at end of file diff --git a/docs/meadowlark-releases/meadowlark-020.md b/docs/meadowlark-releases/meadowlark-020.md new file mode 100644 index 00000000..4d5cafd4 --- /dev/null +++ b/docs/meadowlark-releases/meadowlark-020.md @@ -0,0 +1,48 @@ +# Meadowlark 0.2.0 + +## Goals and Design + +Replace DynamoDB + +* [Meadowlark - MongoDB](../../project-meadowlark-exploring-next-generation-technologies/meadowlark-data-storage-design/meadowlark-mongodb.md) +* [Meadowlark - PostgreSQL](../../project-meadowlark-exploring-next-generation-technologies/meadowlark-data-storage-design/meadowlark-postgresql.md) +* [Meadowlark - Durable Change Data Capture](../../project-meadowlark-exploring-next-generation-technologies/meadowlark-streaming-and-downstream-data-stores/meadowlark-durable-change-data-capture.md) + +Provide a full-access mode for authorization + +* [Meadowlark 0.2.0 - Security](./meadowlark-020/meadowlark-020-security.md) + +## Architecture + +![](./attachments/image2022-7-21_16-23-32.png) + +1. Runs on localhost! +2. New features / functionality: + 1. MongoDB transactional storage + 2. PostgreSQL transactional storage + 3. Node.js front end + 4. Data out from MongoDB to Kafka +3. Removing DynamoDB - see [Meadowlark - Referential Integrity in Document Databases](../../project-meadowlark-exploring-next-generation-technologies/meadowlark-data-storage-design/meadowlark-referential-integrity-in-document-databases.md) for more information. +4. Temporarily, AWS support is broken: we have not orchestrated a replacement for DynamoDB. 
+ +## Demonstration + +Be sure to edit the `.env` file as mentioned near the end of the following script: + +```shell +# Get code +git clone https://github.com/Ed-Fi-Exchange-OSS/Meadowlark +cd Meadowlark +git checkout 0.2.0 +npm install && npm run build + +# Startup backend services +./eng/docker.ps1 + +# Run locally +pushd Meadowlark-js/services/meadowlark-fastify +cp .env.example .env # And edit it appropriately +npm run start:local + +# See test/http/local* for various requests to run in Visual Studio Code +``` diff --git a/docs/meadowlark-releases/meadowlark-020/meadowlark-020-cost-and-performance-analysis.md b/docs/meadowlark-releases/meadowlark-020/meadowlark-020-cost-and-performance-analysis.md new file mode 100644 index 00000000..d4e6ea7c --- /dev/null +++ b/docs/meadowlark-releases/meadowlark-020/meadowlark-020-cost-and-performance-analysis.md @@ -0,0 +1,14 @@ +# Meadowlark 0.2.0 - Cost and Performance Analysis + +> [!WARNING] +> Broad goals, with details to be filled in as the development team gets closer to being able to tackle this problem: +> * Upload to AWS +> * Consider any or all of the following tools: +> * Smoke test utility +> * Bulk upload of Grand Bend +> * API-to-API synchronization tool to copy data from an ODS/API 5.3 installation to Meadowlark +> * Must be able to time the process +> * Start with a small data set (Grand Bend) and look at the cost before moving forward +> * If cost is sufficiently low, run 5 times to get real statistics +> * Increase the data set size, using output from Sample Data Generator to tune the data set to the desired "size" +> * Go carefully so as not to rack up a huge bill! 
\ No newline at end of file diff --git a/docs/meadowlark-releases/meadowlark-020/meadowlark-020-security.md b/docs/meadowlark-releases/meadowlark-020/meadowlark-020-security.md new file mode 100644 index 00000000..29735282 --- /dev/null +++ b/docs/meadowlark-releases/meadowlark-020/meadowlark-020-security.md @@ -0,0 +1,65 @@ +# Meadowlark 0.2.0 - Security + +# Authentication + +Meadowlark 0.2.0 will not (yet) integrate real authentication: it will continue to have the hard-coded authentication token mechanism provided in [Meadowlark 0.1.0 - Security](../../meadowlark-releases/meadowlark-010/meadowlark-010-security.md). + +# Authorization + +On the authorization side, milestone 0.2.0 will: + +* remove some original prototypical education organization-based authorization, +* continue to support ownership-based authorization, and +* introduce a full-access claim and a third hard-coded JSON Web Token (JWT). + +The full-access claim would be used by the API provider for full API synchronization. It will also establish an initial pattern for other authorization schemes in the future, based on claims encoded in the JWT. + +## Current Token + +A signed JSON Web Token contains three different components. For detailed information, see for example [auth0: JSON Web Token Structure](https://auth0.com/docs/secure/tokens/json-web-tokens/json-web-token-structure). The long second portion, for one of the hard-coded Meadowlark tokens, decodes to this: + +``` +{ + "iss": "ed-fi-meadowlark", + "aud": "meadowlark", + "sub": "super-great-SIS", + "jti": "3d59b75f-a762-4baa-9116-19c82fdf8de3", + "iat": 1636562060, + "exp": 3845548881 +} +``` + +These are all standard "reserved" claims defined in the JWT specification.
+ +* iss = issuer +* aud = audience +* sub = subject (in this case, a vendor name) +* jti = unique identifier +* iat = issued at (datetime in Unix epoch format) +* exp = expires at (again in epoch format) + +## Future Token + +The token will include additional claims: + +* client\_id - the client\_id used to authenticate +* roles - with two values at this time: vendor and host.  + +The "host" role will grant full access to all resources. + +This design is subject to revision in a future milestone, once we get further into the work of integrating with OAuth2 providers. + +# Infrastructure + +No changes compared to [Meadowlark 0.1.0 - Security](../../meadowlark-releases/meadowlark-010/meadowlark-010-security.md) - still not trying to create production-ready product. + +> [!WARNING] +> This document is for discussion and general guidance. The implementation may vary as needed. The development team will endeavor to keep this document up-to-date, though working software remains a higher priority than comprehensive documentation. + +**Table of Contents** + +* [Authentication](#authentication) +* [Authorization](#authorization) + * [Current Token](#current-token) + * [Future Token](#future-token) +* [Infrastructure](#infrastructure) \ No newline at end of file diff --git a/docs/meadowlark-releases/meadowlark-030.md b/docs/meadowlark-releases/meadowlark-030.md new file mode 100644 index 00000000..79191412 --- /dev/null +++ b/docs/meadowlark-releases/meadowlark-030.md @@ -0,0 +1,37 @@ +# Meadowlark 0.3.0 + +## Goals and Design + +Low-frills installation for API support without breaking vendor integrations, suitable for pilot testing on Azure. 
+ +* [Multiple Data Standards](../meadowlark-api-design/meadowlark-multiple-data-standards.md) - 3.2.0-c for ODS/API 5.0 compatibility +* Flesh out the [Internal OAuth 2 Client Credential Provider](../meadowlark-security/meadowlark-authentication/meadowlark-internal-oauth-2-client-credential-provider.md) +* Script out Azure deployment - approach TBD: PowerShell / ARM, Kubernetes, Serverless? +* Standardize [Error Messages](../meadowlark-api-design/meadowlark-response-codes.md) +* [Log Monitoring](../meadowlark-operations/meadowlark-log-monitoring.md) with Azure Monitor? +* Performance testing + +## Architecture + +![](./attachments/image2022-7-21_16-59-4.png) + +## Release + +[https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/releases/tag/v0.3.0](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/releases/tag/v0.3.0) + +## Project Status + +While not yet production-ready, we think this release is strong enough for some pilot testing with SIS vendors in the field, with the goal of finding out + +* What is broken compared to the ODS/API release? There may be edge cases that we don't know about, or that we thought were unimportant. +* What do deployments look like? And what more should be built into the platform to support those deployments? 
+ +### What's Changed + +* Real [OAuth2 authentication](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/blob/v0.3.0/docs/OAUTH2.md) (client credentials flow), using a built-in OAuth2 provider +* [Docker support](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/blob/v0.3.0/docs/DOCKER.md): `docker image pull edfialliance/meadowlark-ed-fi-api:v0.3.0` +* MongoDB and OpenSearch optimizations with respect to transactions and error handling +* OpenSearch: some of the querystring functionality is broken, and will be restored in an upcoming release +* Numerous bug fixes on API routes and payload validation +* Fastify clustering for multi-threaded operation +* Thorough documentation of [configuration options](https://github.com/Ed-Fi-Exchange-OSS/Meadowlark/blob/v0.3.0/docs/CONFIGURATION.md). diff --git a/docs/meadowlark-security/meadowlark-authentication/meadowlark-internal-oauth-2-client-credential-provider.md b/docs/meadowlark-security/meadowlark-authentication/meadowlark-internal-oauth-2-client-credential-provider.md new file mode 100644 index 00000000..aad31791 --- /dev/null +++ b/docs/meadowlark-security/meadowlark-authentication/meadowlark-internal-oauth-2-client-credential-provider.md @@ -0,0 +1,226 @@ +# Meadowlark - Internal OAuth 2 Client Credential Provider + +## Overview + +The Ed-Fi API needs to be secured with the OAuth 2.0 Client Credentials flow. In many cases a third-party authentication provider will be most appropriate for managing authentication. However, some organizations may wish to continue using a built-in authentication provider, as with the ODS/API Suite 3. In addition to managing the authentication process itself, the built-in provider should handle provisioning of keys and secrets. In this way, we will be able to have a micro data store that is accessed by only a single application. 
+ +## Requirements + +### Authentication / Token Generation + +Support [Client Credentials flow](https://www.oauth.com/oauth2-servers/access-tokens/client-credentials/) + +1. Example route signature: `POST /oauth/token` +2. with support for the following message body formats: + 1. `grant_type` , `client_id` , and `client_secret`  in a JSON payload + 2. `grant_type` , `client_id` , and `client_secret`  in a form-urlencoded payload + 3. `grant_type` in a json payload, with `client_id` , and `client_secret`  encoded into a basic authentication header + 4. `grant_type` in a form-urlencoded payload, with `client_id` , and `client_secret`  encoded into a basic authentication header. +3. with responses: + 1. 200 when the request is valid, with a signed JSON Web Token (JWT) as an access code response (more detail on JWT below). Example: + + ```json + { + "access_token": "eyJ0eXAiOiJKV1QiLCJibGciOiJIUzI1NiJ9.eyJpc3MiOiJlZC1maS1tZWfkb3dsYXJrIiwiYXVkIjoibWVhZG93bGFyayIsInJvbGVzIjpbInZlbmRvciJdLCJzdWIiOiJzdXBlci1ncmVhdC1TSVMiLCJqdGkiOiIyODQxNTY3Yi0wNzRiLTRiMDktYmQwMS1jZGYyODVlY2NjMDEiLCJpYXQiOjE2NTkzNzA2MjgsImV4cCI6MTY1OTM3NDIyOH0.GKwl3Uactabl6emQy9Ta2R5emGL6IF_v8w85LoR2wAs", + "token_type": "bearer", + "expires_in": 1659374228, + "refresh_token": "not available" + } + ``` + + 2. 400 when the payload *structure* is invalid or the `grant_type`  is invalid. Example: + + ```json + { + "message": "The request is invalid.", + "modelState": { + "grant_type": [ + "The grant_type '???' is not supported." + ] + } + } + ``` + + > [!WARNING] + > This is an "ideal" example that provides some consistency with existing messaging. The actual solution can be different based on the package components used in the solution. + + 3. 401 with no message body when the `client_id`  or `client_secret`  is invalid. 
⚠️ + + Deliberately not revealing *why* the authentication attempt failed. + +#### JSON Web Token Response + +The access token provided by the /oauth/token endpoint should be in the format of a [signed JSON Web Token](https://jwt.io/introduction/) (JWT). The expected format of the JWT is described in some detail in [Meadowlark - Data Authorization](../../meadowlark-security/meadowlark-data-authorization.md). In summary, the token's payload is expected to match this structure: + +```json +{ + "iss": "ed-fi-meadowlark", + "aud": "ed-fi-meadowlark", + "sub": "client name", + "jti": "3d59b75f-a762-4baa-9116-19c82fdf8de3", + "iat": 1636562060, + "exp": 3845548881, + "client_id": "fbf739c4-fb86-4f03-a477-91af51cc46f2", + "roles": [ "vendor" ] +} +``` + +### Token Introspection + +An endpoint for [verifying a token](https://datatracker.ietf.org/doc/html/rfc7662) and accessing information about the token. + +1. Example route signature: POST /oauth/verify with the token to verify in a form-urlencoded body, as well as a valid token authorizing the request itself. Example: + + ```none + POST /oauth/verify + Authorization: bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI.... + Content-Type: application/x-www-form-urlencoded + + token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI.... + ``` + + 1. ⚠️ The specification notes an optional `token_type` hint. Meadowlark will only support bearer tokens, so this parameter is not necessary. However, the service should not reject a payload that includes this parameter. It can simply ignore the parameter, always validating as a bearer token. +2. responses: + 1. 200 if the token is still valid with a message body containing the token information in a JSON payload. + + 1.
Active token example: + + **Active Token** + + ```json + { + "active": true, + "client_id": "fbf739c4-fb86-4f03-a477-91af51cc46f2", + "sub": "client name", + "aud": "ed-fi-meadowlark", + "iss": "ed-fi-meadowlark", + "exp": 1659374285, + "iat": 1659374285, + "roles": [ + "vendor" + ] + } + ``` + + 2. Inactive token example: + + **Inactive Token** + + ```json + { + "active": false + } + ``` + + 3. `active` will mean that the token is valid: + 1. issued by this application + 2. not revoked + 3. not expired + 4. client has not been deactivated + 2. 401 if the Authorization header is missing or "corrupt". +3. Authorization + 1. A token with the "vendor" or "host" role can only verify its own token, no others. + 2. A token with the "admin" role can verify any token. + +> [!TIP] +> One implication of this design is that the Meadowlark API application needs to have client credentials with the "admin" role in order to verify incoming tokens. + +## Client Credential Management + +At this time there is no concept of vendors and applications - just keys and secrets. Therefore most of the Admin API Design is not relevant to this project. We simply need to have a route that supports creating keys and secrets. + +* Support standard HTTP verbs and status codes +* Requires a token with role claim of "admin"; any other valid token gets a 403 "forbidden" response, and an invalid or missing token gets a 401 response. +* URL endpoint: "oauth/client" to start with, adjust as needed. +* GET + * GET by id + * GET all +* POST + * body + + ```json + { + "clientName": "Hometown SIS", + "roles": [ + "vendor" + ] + } + ``` + + * 400 response if clientName is missing and/or there is not at least one role in the request. + * Three roles and one variant will be available.
+ * vendor + * READ access on all descriptors + * Full CRUD access on those resources created by this client credential + * OAuth Token introspection for own token + * host + * Full CRUD access on all Ed-Fi API resources + * OAuth Token introspection for own token + * admin + * Full CRUD on OAuth Client endpoint + * OAuth Token introspection for any token + * assessment + * Disables the reference checks on POST statements + * Generally speaking, one client would have *either* vendor *or* host role, not both. However, it is probably not worthwhile to force them to be mutually exclusive through validation. + * response + + ```none + location: /oauth/client/a-uuid-v4-value + + { + "active": true, + "client_id": "a-uuid-v4-value", + "client_secret": "a really good random secret", + "clientName": "Hometown SIS", + "roles": [ + "vendor" + ] + } + ``` + + * Note the presence of `active` in the output response. That will be an optional value that defaults to `true` on POST. +* PUT + * body + + ```json + { + "active": false, + "client_id": "a-uuid-v4-value", + "clientName": "Hometown SIS", + "roles": [ + "vendor" + ] + } + ``` + + * does not update client\_id or client\_secret +* Generate a new secret for an existing client id: + * Request: `POST /oauth/client/a-uuid-v4-value/reset` + * Response + + ```json + { + "client_id": "a-uuid-v4-value", + "client_secret": "a new really good random secret" + } + ``` + +## Implementation Notes + +### Microservice + +> [!TIP] +> As of Meadowlark 0.3.0, the code is easily separable, but for ease of use it is still integrated into the same Fastify application. + +Application could be separate from the Meadowlark Ed-Fi API, and can be written in either TypeScript or Python. + +As a microservice, it will have its own datastore. + +* Should support both PostgreSQL and MongoDB +* By default, should use the same database as the Meadowlark API code, but with complete independence from the tables / collections used by the API code.
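As a rough illustration of the client records described under Client Credential Management, the Python sketch below creates a client and resets its secret. The function names `create_client` and `reset_secret` are hypothetical, and the choice of `secrets.token_urlsafe(32)` for the secret is an assumption; the source only asks for a UUID v4 `client_id`, "a really good random secret", and a 400 response when the name or roles are missing.

```python
import secrets
import uuid
from typing import Dict, List


def create_client(client_name: str, roles: List[str]) -> Dict[str, object]:
    """Build a new client record; a ValueError stands in for the HTTP 400 response."""
    if not client_name or not roles:
        raise ValueError("clientName and at least one role are required")
    return {
        "active": True,  # optional on POST, defaults to true
        "client_id": str(uuid.uuid4()),
        "client_secret": secrets.token_urlsafe(32),  # assumed secret format
        "clientName": client_name,
        "roles": list(roles),
    }


def reset_secret(client: Dict[str, object]) -> Dict[str, object]:
    """Issue a new secret for an existing client id, leaving the id unchanged."""
    return {
        "client_id": client["client_id"],
        "client_secret": secrets.token_urlsafe(32),
    }


created = create_client("Hometown SIS", ["vendor"])
rotated = reset_secret(created)
```

In a real service the secret would normally be stored hashed and returned to the caller only once; that detail is outside what the design notes specify.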
+ +Can utilize open source\* third-party identity provider packages (so long as they do not carry a GPL, LGPL, Affero GPL, or other restrictive / "viral" license). + +### Bootstrapping Initial Admin Credentials + +If there are no admin accounts, relax security to allow new "admin" type client creation without a token. As soon as the first one is created, fully enforce the token authentication, and do not allow a new token to be created. diff --git a/docs/meadowlark-security/meadowlark-data-authorization.md b/docs/meadowlark-security/meadowlark-data-authorization.md new file mode 100644 index 00000000..2ea3eeb0 --- /dev/null +++ b/docs/meadowlark-security/meadowlark-data-authorization.md @@ -0,0 +1,115 @@ +# Meadowlark - Data Authorization + +## Overview + +The development team is exploring alternatives to the complex authorization schemes in the ODS/API. + +* Milestone 0.1.0: only "ownership-based" authorization was supported. +* Milestone 0.2.0: added "full access" for hosting providers, and full access to descriptors. + +## Meadowlark Authorization Modes + +As described below, we will use the `roles` claim in a JSON Web Token (JWT) to tell Meadowlark which authorization mode to use. There are four authorization modes, described in the following sections. + +### Ownership + +Assign "vendor" to the roles for this default authorization model [when creating](./meadowlark-authentication/meadowlark-internal-oauth-2-client-credential-provider.md) the API client. + +"Whoever creates a record has access to it". ClientA submits a POST request with a Student. ClientB guesses the document ID for that student and issues a GET request for it: access is denied. ClientA issues the same GET request: access granted, 200 response. + +This does not affect validation: if ClientB submits a StudentEducationOrganizationAssociation with "ClientA's student", then the validation passes despite the fact that ClientB "cannot directly see" the student.
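A minimal Python sketch of the ownership rule, and of why it leaves reference validation untouched. `StoredDocument`, `can_read`, and `reference_exists` are hypothetical names used only for illustration, not Meadowlark's actual code. The only behavior taken from the text above is that reads are limited to the creating client, while reference checks look only at whether the referenced document exists.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StoredDocument:
    """Hypothetical stored record: a document id plus the client that created it."""
    document_id: str
    created_by: str


def can_read(document: StoredDocument, client_id: str) -> bool:
    """Ownership rule: only the client that created the document may read it."""
    return document.created_by == client_id


def reference_exists(store: Dict[str, StoredDocument], referenced_id: str) -> bool:
    """Reference validation asks only whether the document exists, not who owns it."""
    return referenced_id in store


# ClientA creates a Student; ClientB later references it from another document.
student = StoredDocument(document_id="student-1", created_by="ClientA")
store = {student.document_id: student}

client_a_can_read = can_read(student, "ClientA")
client_b_can_read = can_read(student, "ClientB")
client_b_reference_ok = reference_exists(store, "student-1")
```

Keeping the two checks in separate functions mirrors the point made above: denying ClientB a GET does not stop ClientB's association document from passing validation.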
+ +Also see: [Meadowlark - Referential Integrity in Document Databases](../meadowlark-data-storage-design/meadowlark-referential-integrity-in-document-databases.md) + +### Host Full Access + +Assign "host" to the roles for this authorization model [when creating](./meadowlark-authentication/meadowlark-internal-oauth-2-client-credential-provider.md) the API client. + +This mode is intended for API hosting providers so that they can run synchronization processes, with access to read all documents unfiltered. + +### Admin + +Assign "admin" to the roles for this authorization model [when creating](./meadowlark-authentication/meadowlark-internal-oauth-2-client-credential-provider.md) the API client. + +This mode is specific to the internal OAuth2 provider and client management API, allowing the API client to create and manage other API clients. + +### Assessment + +Assign "assessment" to the roles for this authorization model [when creating](./meadowlark-authentication/meadowlark-internal-oauth-2-client-credential-provider.md) the API client. + +>[!WARNING] +> +> This is not a unique authorization model, and it should be used in addition to "vendor". This role allows the API client to bypass the usual referential integrity checks when issuing a POST or PUT request. + +## Descriptors + +Regardless of the mode, all API clients need to know about available descriptors. At this time, all authenticated clients will be able to query for all descriptors. + +> [!WARNING] +> Descriptors have a concept of namespace for identifying which descriptors are used by which vendor. In Meadowlark this is, for now, on the honor system: there is no restriction on which namespaced descriptors any given client can use. This may change in the future.
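The role-based modes described in the sections above can be summarized in a short Python sketch. The function name `permissions_for` and the keys it returns are hypothetical. Letting "host" take precedence when a client holds both "vendor" and "host" is an assumption made only for this illustration, since the design notes do not specify that case.

```python
from typing import Dict, Iterable


def permissions_for(roles: Iterable[str]) -> Dict[str, bool]:
    """Translate a JWT `roles` claim into coarse permissions (illustrative only)."""
    role_set = set(roles)
    full_access = "host" in role_set
    return {
        # host: unfiltered access to all Ed-Fi API documents
        "full_access": full_access,
        # vendor: limited to documents it created (assumption: host wins if both are present)
        "ownership_only": "vendor" in role_set and not full_access,
        # admin: may create and manage other API clients
        "manage_clients": "admin" in role_set,
        # assessment: bypasses referential integrity checks on POST or PUT
        "skip_reference_checks": "assessment" in role_set,
        # every authenticated client may query all descriptors
        "read_descriptors": True,
    }


vendor_permissions = permissions_for(["vendor"])
assessment_vendor_permissions = permissions_for(["vendor", "assessment"])
host_permissions = permissions_for(["host"])
```

Returning independent flags, rather than a single mode name, reflects the note that "assessment" is an add-on to "vendor" and not a mode of its own.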
+ +## How It Will Work: JSON Web Token + +Although we do not know the details of the Authentication software integration yet, we have already chosen to use [OAuth2](https://oauth.net/) as the protocol and [JSON Web Tokens](https://www.rfc-editor.org/rfc/rfc9068.html#name-roles) (JWT) as the format for access tokens. The type of authorization will be configured through the [`Roles`](https://www.rfc-editor.org/rfc/rfc9068.html#name-roles)  claim on the JWT; thus any third-party or integrated OAuth2 provider will need to support configuration of a "roles" claim. Initially, Meadowlark will support two mutually exclusive roles: Vendor and Host, subject to ownership-based authorization and full access, respectively. If other authorization modes are added in the future - for example, based on Person relationship or Local Education Agency ID - then additional claims may be needed to support those use cases. + +As of Meadowlark 0.2.0, where authentication is hard-coded to a couple of tokens, the JWT is signed using the HMAC with SHA256 symmetric key algorithm. This will likely change to the RSA with SHA256 algorithm once third party OAuth2 providers are supported, so the authentication provider and the Ed-Fi API provider do not need to have access to a shared key. + +Below is an example of a decoded (plain JSON) JWT from Meadowlark 0.2.0: + +**Header block** + +```json +{ + "typ": "JWT", + "alg": "HS256" +} +``` + +**Payload** + +```json +{ + "iss": "ed-fi-meadowlark", + "aud": "meadowlark", + "sub": "", + "jti": "3d59b75f-a762-4baa-9116-19c82fdf8de3", + "iat": 1636562060, + "exp": 3845548881, + "client_id": "fbf739c4-fb86-4f03-a477-91af51cc46f2", + "roles": [ "vendor" ] +} +``` + +Explanation of each claim... 
+ +| Claim | Full description | Meaning | +| --- | --- | --- | +| iss | Issuer | The OAuth2 provider | +| aud | Audience | The application for which the token was issued | +| sub | Subject | The client for which the token was issued | +| jti | JWT Id | A unique identifier for the JWT | +| iat | Issued At | The Unix-style timestamp when the JWT was created | +| exp | Expiration Time | The Unix-style timestamp when the JWT should no longer be accepted ("expired") | +| client\_id | Client ID | Unique identifier for the client application | +| roles | Roles | An array of roles assigned to the client credentials that were used to generate the JWT. | + +[https://datatracker.ietf.org/ipr/search/?rfc=9068&submit=rfc](https://datatracker.ietf.org/ipr/search/?rfc=9068&submit=rfc) + +## Background + +The ODS API's main authorization pattern is based on establishing relationships from resources to education organizations – subclasses of EducationOrganization, or EdOrg for short. API clients are assigned one or more EdOrgs and a strategy that specifies CRUD permissions over API classes for which specific resources can be traced to one of these EdOrgs. + +This strategy is powerful and logical but also complex to implement. On the implementation side, each new authorization scheme needs to be driven by relational database views that materialize how each API resource can be traced to an EdOrg. Such views are custom code. + +This strategy has also created complexity for API clients. As noted above, the relationships that drive authorizations are opaque and not easily presented to an API client. This strategy also results in strange interaction scenarios, such as the fact that a client cannot read a Student or Parent resource the client just wrote (because it has no relation to an EdOrg yet). + +As noted above, this is not to say that the ODS API approach is wrong, but only that for some cases the complexity may not be justified.
For example, in the case of a SIS client providing data to an API where the scope is a single LEA, these permissions probably suffice: + +* *For this particular API instance, your client has the ability to Create API resources for any of the following API classes:* *(list classes here)* +* *For any resource you write, your client can also Read, Update or Delete that same resource.* + +Implementing these rules is considerably simpler and demands no customized SQL or other materialized means to connect each resource to an EdOrg. + +Clearly, in the context in which data is being read out of the API, the ODS EdOrg authorization pattern becomes potentially much more useful. But in many cases of data out – particularly early on – the scope of that authorization in field work still tends to be "all district data across these API resources for school year X". + +In summary, the ODS API pattern of using EdOrg relationships to drive authorization is powerful and worth preserving, but the Meadowlark project suggests that a set of simpler patterns might eliminate complexity from many early field projects. As an implementation advances in complexity, an API host may choose to enable more powerful and complex designs. diff --git a/docs/meadowlark-streaming-and-downstream-data-stores/meadowlark-durable-change-data-capture.md b/docs/meadowlark-streaming-and-downstream-data-stores/meadowlark-durable-change-data-capture.md new file mode 100644 index 00000000..8f770c57 --- /dev/null +++ b/docs/meadowlark-streaming-and-downstream-data-stores/meadowlark-durable-change-data-capture.md @@ -0,0 +1,96 @@ +# Meadowlark - Durable Change Data Capture + +> [!WARNING] +> Not completed in Milestone 0.2.0 as previously desired. Some early work was done, but it was cut from scope in order to prioritize moving towards a pilot-testable 0.3.0 release.
+ +## Overview + +One outcome of Meadowlark 0.1.0 was the demonstration of value in using DynamoDB Streams to insert documents into OpenSearch and separately deliver to S3 for analytics. In Meadowlark 0.2.0, we expand this to publish Meadowlark documents as events to a durable [Kafka](https://kafka.apache.org/) event store to support a broader range of integration use cases. We plan for Meadowlark to emit change data capture messages for both the PostgreSQL and MongoDB datastores. This will also set the stage for [Meadowlark - Streaming to Filesystem](../meadowlark-streaming-and-downstream-data-stores/meadowlark-streaming-to-filesystem.md) and [Meadowlark - Materialized Views](../meadowlark-streaming-and-downstream-data-stores/meadowlark-materialized-views.md). + +> [!TIP] +> Why Kafka? +> +> 1. It is "durable" - that is, the messages stay around, like a transactional database log. This is useful for materialized views. If a Student document is posted to the API one day, and a StudentEducationOrganizationAssociation document is posted two weeks later, then we want the stream processor that creates the materialized view to be able to read both objects from the streams, without having to requery some other data store. +> 2. It is widely used by companies large and small. +> 3. It is available on all cloud providers and on-premises. + +## Streaming + +### Kafka Topic + +By default, Meadowlark will publish messages to the edfi.meadowlark.documents topic. + +### Message Definition + +There are two types of Meadowlark messages: upsert and delete. A Meadowlark upsert message is sent whenever a document is created or updated, and is defined as follows: + +#### Message Primary Key + +```json +{ "id": "" } +``` + +#### Message Body + +```json +{ + "id": "", + "documentIdentity": [ + { + "name": "", + "value": "" + }, + ...
+ ], + "projectName": "", + "resourceName": "", + "resourceVersion": "", + "edfiDoc": "" +} +``` + +An example Meadowlark message body would look like: + +```json +{ + "id": "t4JWTsagjhY4Ea-oIcXCeS7oqbNX9iWfPx6e-g", + "documentIdentity": [ + { + "name": "schoolReference.schoolId", + "value": 123 + }, + { + "name": "weekIdentifier", + "value": "1st" + } + ], + "projectName": "Ed-Fi", + "resourceName": "AcademicWeek", + "resourceVersion": "3.3.1-b", + "edfiDoc": { + "schoolReference": { + "schoolId": 123 + }, + "weekIdentifier": "1st", + "beginDate": "2022-12-01", + "endDate": "2022-12-31", + "totalInstructionalDays": 30 + } +} +``` + +Meadowlark delete events are Kafka "tombstone" events, which have a message primary key but no body. This design allows for Kafka to be used in a finer-grained per-record retention mode, rather than the coarser-grained time-based retention mode. This enables Kafka to act as a durable message store that contains the entire Meadowlark state. + +## Change Data Capture + +### Debezium + +Meadowlark 0.2.0 takes advantage of the popular [Debezium](https://debezium.io/) Kafka connector to enable Meadowlark message publication to Kafka. Debezium provides connectors to both MongoDB and PostgreSQL. + +### Debezium MongoDB Implementation + +Meadowlark 0.2.0 uses the Debezium [MongoDB connector](https://debezium.io/documentation/reference/stable/connectors/mongodb.html) to listen to the Meadowlark MongoDB change streams and emit Meadowlark messages.  Debezium connectors are very robust, and can use snapshotting to generate messages for datastore changes that happened even while the connector was inactive. + +The challenge of using the Debezium connector is that by default it works with MongoDB by converting the full document into stringified JSON. Debezium does supply an optional transformer that will parse the JSON into a Kafka message body, but it is not compatible with Meadowlark's schema-less design. 
Because Kafka messages often have a schema, the transform embeds a full JSON schema in each message that it derives from the JSON message. Unfortunately, the transformer will crash if there is variance in the schema between documents. + +As a result, we cannot get a perfectly shaped Meadowlark message with the built-in Debezium and Kafka Connect transforms. A custom Java transform will need to be created and deployed into the Kafka Connector container. This adds complexity that we will defer until later. For now, the messages emitted by Debezium have two differences from a well-formed Meadowlark message. The message body is stringified JSON rather than regular JSON, and because renaming stringified fields is not possible with built-in transforms,  the document id embedded in the document has the property “\_id” rather than “id”. diff --git a/images/course-dependencies.png b/images/course-dependencies.png new file mode 100644 index 00000000..17ea681c Binary files /dev/null and b/images/course-dependencies.png differ diff --git a/images/infrastructure.png b/images/infrastructure.png new file mode 100644 index 00000000..cc271bd7 Binary files /dev/null and b/images/infrastructure.png differ diff --git a/images/meadowlark-architecture.png b/images/meadowlark-architecture.png new file mode 100644 index 00000000..cc300d57 Binary files /dev/null and b/images/meadowlark-architecture.png differ