Rework messaging handling to directly use protocol headers, support multiple content-types #8

Open
2 of 4 tasks
Lance-Drane opened this issue Jul 25, 2024 · 1 comment

Comments

@Lance-Drane
Collaborator

Lance-Drane commented Jul 25, 2024

Currently, internal INTERSECT logic does not utilize protocol headers at all, but serializes metadata and the message "payload" (the actual scientific data) into a single JSON blob. So an actual "message payload" is just JSON of one of the following three types:

  • UserspaceMessage
  • EventMessage
  • LifecycleMessage

One immediate problem arises when we want to send binary data over the wire, since not everything serializes into JSON (quick example: raw PNG images). The common workaround is to encode the binary data as Base64, but this is not viable for larger data because Base64 inflates the data by roughly one third (every 3 bytes become 4 characters).
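
To put a number on the Base64 overhead, here is a quick sketch using only the standard library. The 3 MB zero-filled buffer is just a stand-in for a binary file, and `base64_overhead` is a hypothetical helper name, not something from the SDK.

```python
import base64


def base64_overhead(raw: bytes) -> float:
    """Return the Base64-encoded size divided by the raw size."""
    return len(base64.b64encode(raw)) / len(raw)


# Stand-in for a ~3 MB binary payload (for example, a PNG image).
payload = bytes(3_000_000)
encoded = base64.b64encode(payload)

# Every 3 raw bytes become 4 Base64 characters.
print(len(payload), len(encoded))  # 3000000 4000000
```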

This is where allowing users to specify the Content-Type of their events comes in handy. We can directly pass this as a protocol header in actual messages. We can also use it to auto-reject messages when their Content-Type header does not match the expected Content-Type of the request operation.
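
A minimal sketch of the auto-reject check, assuming the headers arrive as a plain dict. The function names are hypothetical, and falling back to application/json when the header is missing is my assumption, not settled behavior.

```python
def media_type(content_type: str) -> str:
    """Drop parameters such as '; charset=utf-8' and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


def accepts(expected: str, headers: dict[str, str]) -> bool:
    """Return True if the message's Content-Type matches the operation's.

    Assumption: a missing header is treated as application/json.
    """
    actual = headers.get("Content-Type", "application/json")
    return media_type(actual) == media_type(expected)


print(accepts("image/png", {"Content-Type": "image/png"}))         # True
print(accepts("image/png", {"Content-Type": "application/json"}))  # False
```

A real implementation would also need to look the header name up case-insensitively on protocols where header names are not case-sensitive.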

Note that if the user is sending back a complex object (lists, BaseModels, dataclasses...) which contains binary data, it will probably be necessary to encode the binary value as a Base64 string, and use the contentMediaType property in the schema (alongside contentEncoding) to represent the true MIME type. Official JSON Schema docs here. An alternative users could implement for events would be to first emit a custom "metadata" typed INTERSECT event, then emit the associated "data" event. The "metadata" event would need to be able to reference the "data" event. (I think the Base64-encoding approach will probably still be necessary for response messages, though.) There are some Pydantic classes which can help with this.
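
A rough illustration of the Base64-in-JSON approach, using only the standard library rather than the Pydantic classes mentioned above. The schema, the field names (`label`, `thumbnail`) and both helper functions are hypothetical. `contentEncoding` and `contentMediaType` are the JSON Schema keywords for describing string-encoded binary data.

```python
import base64
import json

# Hypothetical schema for a result object that carries a PNG thumbnail.
RESULT_SCHEMA = {
    "type": "object",
    "properties": {
        "label": {"type": "string"},
        "thumbnail": {
            "type": "string",
            "contentEncoding": "base64",
            "contentMediaType": "image/png",
        },
    },
    "required": ["label", "thumbnail"],
}


def encode_result(label: str, png_bytes: bytes) -> str:
    """Serialize the object to JSON, Base64-encoding the binary field."""
    thumbnail = base64.b64encode(png_bytes).decode("ascii")
    return json.dumps({"label": label, "thumbnail": thumbnail})


def decode_result(message: str) -> tuple[str, bytes]:
    """Parse the JSON and recover the original bytes."""
    obj = json.loads(message)
    return obj["label"], base64.b64decode(obj["thumbnail"])
```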

Protocols

We have a full list of protocols we want to support here. Here is a list of protocols officially supported by AsyncAPI; note that the protocols specification in AsyncAPI is extensible and not limited to their definitions.

Protocols which support protocol-level headers

Not a complete list, may be inaccurate.

With all of these, it makes sense to first try to use established headers - Content-Type is a common header. If we can't find a common header, we can use an X-Intersect-SDK- prefix value in the header.
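
As a sketch of that naming rule: map a metadata field to an established header when one exists, otherwise fall back to the prefix. The field names (`content_type`, `operation_id`) and the `header_name` helper are made up for illustration and do not reflect the SDK's actual message fields.

```python
# Fields with a well-known header equivalent (illustrative, not exhaustive).
ESTABLISHED_HEADERS = {"content_type": "Content-Type"}

CUSTOM_PREFIX = "X-Intersect-SDK-"


def header_name(field: str) -> str:
    """Prefer an established header name, else build a prefixed custom one."""
    if field in ESTABLISHED_HEADERS:
        return ESTABLISHED_HEADERS[field]
    suffix = "-".join(part.capitalize() for part in field.split("_"))
    return CUSTOM_PREFIX + suffix


print(header_name("content_type"))  # Content-Type
print(header_name("operation_id"))  # X-Intersect-SDK-Operation-Id
```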

Protocols which do NOT support protocol-level headers

Not a complete list, may be inaccurate.

Note that while it's not impossible to communicate across these protocols with non-JSON data, we would need to create our own header encoding logic. Since all metadata in headers can already be expressed as printable UTF-8 strings, the only binary data should be in the message payload itself. We can prohibit non-printable characters in header keys and values, and use specific control characters as separators for the various parts of the message.

  • MQTT 3.x - user-defined properties were only introduced in MQTT v5
  • Redis
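
One possible shape for the custom header encoding described above, for protocols like these. The choice of the ASCII separator characters (0x1F, 0x1E, 0x1D) and the function names are assumptions for illustration only. Because header keys and values must be printable, the separators can never appear inside the header block, so the first 0x1D byte always marks where the payload starts, even if the binary payload happens to contain the same bytes.

```python
UNIT_SEP = "\x1f"    # separates a header key from its value
RECORD_SEP = "\x1e"  # separates one header from the next
GROUP_SEP = b"\x1d"  # ends the header block; the raw payload follows


def encode_frame(headers: dict[str, str], payload: bytes) -> bytes:
    """Prefix the payload with a header block of printable UTF-8 text."""
    records = []
    for key, value in headers.items():
        if not key or not key.isprintable() or not value.isprintable():
            raise ValueError(f"non-printable or empty header: {key!r}")
        records.append(key + UNIT_SEP + value)
    return RECORD_SEP.join(records).encode("utf-8") + GROUP_SEP + payload


def decode_frame(frame: bytes) -> tuple[dict[str, str], bytes]:
    """Split a frame back into its headers and untouched payload bytes."""
    head, sep, payload = frame.partition(GROUP_SEP)
    if not sep:
        raise ValueError("missing header terminator")
    headers: dict[str, str] = {}
    if head:
        for record in head.decode("utf-8").split(RECORD_SEP):
            key, _, value = record.partition(UNIT_SEP)
            headers[key] = value
    return headers, payload
```

The payload is never inspected or escaped, so there is no size inflation like there is with Base64.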

Protocols with limited support for protocol-level headers

Not a complete list, may be inaccurate.

  • HTTP Server-Sent Events - the server would first need to send back a custom event type (e.g. event: metadata). The data associated with this event can look like whatever we want. If we want to send the metadata WITH the data, we must use custom encoding logic.
  • WebSockets - the server will send headers back with the initial handshake, but will not send headers per message.
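
For the Server-Sent Events case, the "metadata event first, then data event" idea could look roughly like this on the wire. The event names (`metadata`, `payload`) and the metadata fields are hypothetical; only the `event:` / `data:` line format comes from the SSE format itself.

```python
import json


def sse_event(event_type: str, data: str) -> str:
    """Format one SSE event; multi-line data gets one 'data:' line each."""
    lines = [f"event: {event_type}"]
    lines += [f"data: {line}" for line in data.split("\n")]
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


# First the metadata event, then the event carrying the actual data.
metadata = json.dumps({"content_type": "application/json"})
stream = sse_event("metadata", metadata) + sse_event("payload", '{"value": 42}')
print(stream)
```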

Proposed action items

  • Only use Pydantic for serializing and deserializing application/json Content-Types in our own library. Otherwise, we just verify that the output value is in bytes/bytearray format.
  • Rework how we use Pydantic message classes. These are still okay for validating and serializing protocol-level messages, but there needs to be custom logic for each protocol we support.
  • Either drop support for protocols which don't support protocol-level headers, or write our own encoder/decoder (do NOT use JSON to do this).
  • Add some custom validation logic for Content-Types - this is currently the only message header field where we need to allow complete flexibility. I would generally suggest that for any Content-Type other than application/json, we require the input/output fields to be either bytes or bytearray (note that str assumes a UTF-8 encoding, and valid UTF-8 objects should always be serializable as JSON already). Users will need to perform the appropriate conversions with their preferred library - I do not think it would be a good idea to include tons of different libraries in the INTERSECT-SDK for binary formats. This still allows us to generate a valid JSON schema. Do not allow users to specify non-printable characters in any Content-Type definition. (This is an interesting discussion regarding a media type regex, if we want to further restrict Content-Types.)
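
A rough sketch of how the first and last action items could fit together. The function names are hypothetical, and `json.dumps` stands in for the Pydantic serialization the SDK would actually use for application/json.

```python
import json


def validate_content_type(content_type: str) -> str:
    """Reject empty Content-Types and ones with non-printable characters."""
    if not content_type or not content_type.isprintable():
        raise ValueError(f"invalid Content-Type: {content_type!r}")
    return content_type


def serialize(content_type: str, value: object) -> bytes:
    """JSON-serialize application/json values; pass raw bytes through."""
    validate_content_type(content_type)
    base_type = content_type.split(";", 1)[0].strip().lower()
    if base_type == "application/json":
        # Stand-in for Pydantic-based serialization.
        return json.dumps(value).encode("utf-8")
    if not isinstance(value, (bytes, bytearray)):
        raise TypeError(
            f"{content_type} values must be bytes or bytearray, "
            f"got {type(value).__name__}"
        )
    return bytes(value)
```

Anything other than application/json is treated as opaque, so the SDK never needs to know about the user's binary format.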

Note that these changes should be considered breaking.

@Lance-Drane
Collaborator Author

Note that most of this discussion applies only to workflows where we send the data directly through the message instead of through MINIO or another data mechanism. However, we still need to address the "Only use Pydantic for serializing and deserializing application/json Content-Types in our own library" bullet point for MINIO or other data mechanisms to fully work.

Lance-Drane added a commit that referenced this issue Jul 30, 2024
…types

still need to add protocol-level and message-level handling

Signed-off-by: Lance Drane <[email protected]>