feat: crypt4gh #223

mmalenic · 2024-01-16T02:13:53Z

I'm creating this as a draft pull request because I don't think it should be merged into the main branch as it is. A large part of the code is inside the async-crypt4gh crate which doesn't belong in this repository. I wanted to show this as a demonstration of Crypt4GH and htsget-rs using a UrlStorage backend. The main explanation of the logic is inside the docs/crypt4gh folder which contains a diagram.

Recent additions include creating/reading edit lists in async-crypt4gh, as the edit list implementation in the crypt4gh-rust crate was incomplete.

I also want to mention that none of this is set in stone. See the alternative design considerations in the ARCHITECTURE.md docs.

pontus · 2024-05-20T11:09:50Z

Hi, I'm involved in a project where we're to a large part using the same software stack as in GDI and Federated EGA (and we're to use the same stack for more), and recently I've done some work adjacent to what's being worked on for htsget support.

I haven't looked at this PR, but I believe I have quite a good idea of what's being implemented for support in the archive software stack, and there are parts where I don't know how they are intended to work. That may be a failure of imagination on my part, or use cases that you don't care about, or current limits in the archive implementation that haven't been addressed yet, but there are at least some things I feel I need to ask about.

What's a good way to do so? Should I just raise them here or over e.g. e-mail or some chat? If needed I can certainly do meetings but suspect time difference might make that tricky to set up.

mmalenic · 2024-05-20T23:23:46Z

Hi @pontus, more than happy to answer any questions. Feel free to post them here, or open a separate issue for them.

pontus · 2024-05-21T12:02:09Z

So, to start with - with the design as I understand it (stitching together crypt4gh blocks into a single file), it's not possible to utilise different resources (files on the backend) as there's only one symmetric key per crypt4gh file/stream and picking blocks from different backend files would cause mismatch for the symmetric key.

Similarly, since crypt4gh doesn't have length information for data blocks, if the last block of the file isn't full sized (65536 data bytes), it can't be used with anything following it since the MAC and following data blocks would be misaligned (and it's difficult to fix by padding since one would also need to calculate the block MAC).

So those are some use-case restrictions that I wanted to check if there is anything I'm missing technically and also if those use case restrictions are considered unimportant.

mmalenic · 2024-05-22T01:39:34Z

Yes, that's right. If you are stitching together different files/resources, then care would need to be taken to edit/remove data for the header and EOF blocks to make sure that the files remain readable. This is somewhat a limitation of the specs, i.e. the htsget spec requires that non-header byte ranges returned compose an entire file, headers and EOF blocks included. This problem would arise when using non-Crypt4GH files too. For example, the BAM spec does allow "insignificant EOF marker" blocks that can be ignored if they are contained in the middle of a file, but I don't think CRAM provides the same option.

However, I'm not quite clear on the context behind this, do you have an example of where it would be necessary to stitch together files, vs just reading files individually?

I'd say that this kind of operation would require the client to properly edit the htsget responses to stitch together files, however it should be possible to do so by individually reading responses from htsget. It's also possible that the htsget server could help make this easier, for example, by labelling which URL tickets correspond to the header bytes/EOF block, or what data to pad/remove. This information is already partially supplied with the class field in the response, which determines header vs body bytes. A different field could be used to extend this and support labelling EOF data.
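As a rough illustration of the labelling idea, here is a minimal sketch of how a client might choose which URL tickets to keep when stitching several htsget responses. The `header` and `body` classes exist in the htsget spec; the `Eof` variant, the `UrlTicket` struct and the `stitch` function are hypothetical names invented for this example and are not part of htsget-rs. The sketch only selects tickets. It does not address the Crypt4GH key or short-block issues discussed in this thread.

```rust
/// Ticket classes. `Header` and `Body` mirror the htsget `class` field;
/// `Eof` is a hypothetical extension for labelling EOF data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Class {
    Header,
    Body,
    Eof, // hypothetical, not in the htsget spec
}

/// A simplified URL ticket (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
struct UrlTicket {
    url: String,
    class: Class,
}

/// Combine the tickets of several responses into one list:
/// keep every body ticket, the header of the first response only,
/// and the EOF data of the last response only.
fn stitch(responses: &[Vec<UrlTicket>]) -> Vec<UrlTicket> {
    let last = responses.len().saturating_sub(1);
    let mut out = Vec::new();
    for (i, tickets) in responses.iter().enumerate() {
        for ticket in tickets {
            let keep = match ticket.class {
                Class::Body => true,
                Class::Header => i == 0,
                Class::Eof => i == last,
            };
            if keep {
                out.push(ticket.clone());
            }
        }
    }
    out
}
```

Whether the headers of different resources are compatible enough to be merged this way depends on the underlying file format, which this sketch does not check.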

At this point though, wouldn't it be simpler to just leave the files unmerged? Or is there some issue with performance or storage that I'm not aware of?

pontus · 2024-05-22T13:31:37Z

Thanks for the response. While I'm not sure I completely understand what you mean, I wanted to start by checking that my understanding was correct and those use cases get broken by design, compared to (my understanding of) what the htsget protocol supports. (For clarity: this would be on the crypt4gh level, so in addition to whatever issues there might be with constructing the underlying data stream.)

It may be that the protocol design was overly ambitious but then again, I've been in on talks in my current area of interest (imaging) about using htsget for that, which would likely run into the "several resources" issue.

As to actual practical issues: while I don't think the multiple resources would be that common with current usage patterns, it feels like a range from the final block not at the very end is something that might not be that uncommon.

mmalenic · 2024-05-23T11:02:00Z

> I wanted to start by checking that my understanding was correct and those use cases get broken by design

Yes, that's correct. I don't think this is a problem that could be solved by the htsget server alone, and it would require cooperation with the client. Although, I'd say it's probably doable. It seems possible that the client could retrieve requests from the htsget server one at a time and make the edits/merges that it needs. For example, it could request one resource, decrypt it, then request another resource, decrypt it (potentially with another key), and then combine those resources.

> it's not possible to utilise different resources (files on the backend) as there's only one symmetric key per crypt4gh file/stream and picking blocks from different backend files would cause mismatch for the symmetric key.

I just want to clarify with this. It's possible to encrypt a single resource with multiple keys, and this is specifically supported by the Crypt4GH protocol. The htsget server can then return a Crypt4GH header with support for multiple keys. However, the URL tickets may include the last Crypt4GH block, which could be less than 64KiB. It's also possible to encrypt/decrypt different resources with different keys.

pontus · 2024-05-23T11:31:29Z

> I just want to clarify with this. It's possible to encrypt a single resource with multiple keys, and this is specifically supported by the Crypt4GH protocol. The htsget server can then return a Crypt4GH header with support for multiple keys. However, the URL tickets may include the last Crypt4GH block, which could be less than 64KiB. It's also possible to encrypt/decrypt different resources with different keys.

True, thanks, I'd actually forgotten (and while we have this implemented, I'm not sure if it's ever been tested with our implementation, I'll try to look over the automatic tests for that at least).

But this still leaves the problem with having a final block (not 65536 bytes long) in any place other than the last, right?

mmalenic · 2024-05-23T22:32:54Z

> But this still leaves the problem with having a final block (not 65536 bytes long) in any place other than the last, right?

Yes. The Crypt4GH spec seems to strongly imply that only the last block can be less than 64KiB. However, since the last block also contains the nonce and MAC components, I don't think there is anything preventing decryption of a smaller block that is present in the middle of a stream if its position is known (although current libraries may error).

pontus · 2024-05-24T08:49:02Z

Unfortunately not. With crypt4gh not having a packet length, the way to get a packet is either to read 65536 bytes plus the extras for the encryption used, or to read until EOF.

I assume it was done this way to guarantee that you can read data from anywhere in the file fairly cheaply (there are of course other ways it could have been done, but this is where we're at).

So, assuming the current crypt4gh standard, a packet in the stream can't be shorter, meaning that if one wants to use a final packet mid-stream, one would need to extend it somehow. As that would involve calculating a new MAC, I don't see how that could happen without being able to decode the packet.
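The misalignment can be shown with a small sketch of a reader that follows the rule above: read a full-sized encrypted segment, or stop at EOF. It assumes the Crypt4GH segment layout of a 12-byte nonce, up to 65536 bytes of ciphertext and a 16-byte MAC. The function name `read_segment` is made up for this example and is not taken from any crypt4gh library, and no decryption is performed.

```rust
use std::io::{self, Read};

const SEGMENT_PLAINTEXT: usize = 65_536; // data bytes per full segment
const NONCE_LEN: usize = 12; // ChaCha20-IETF nonce
const MAC_LEN: usize = 16; // Poly1305 tag
const SEGMENT_ENCRYPTED: usize = NONCE_LEN + SEGMENT_PLAINTEXT + MAC_LEN;

/// Read the next encrypted segment: a full-sized one if enough bytes are
/// available, otherwise whatever remains before EOF. Returns `None` at EOF.
/// There is no length field to consult, so a short segment followed by more
/// data is indistinguishable from the start of a full-sized segment.
fn read_segment<R: Read>(reader: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut buf = vec![0u8; SEGMENT_ENCRYPTED];
    let mut filled = 0;
    while filled < buf.len() {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            break; // EOF
        }
        filled += n;
    }
    if filled == 0 {
        return Ok(None);
    }
    buf.truncate(filled);
    Ok(Some(buf))
}
```

If a 100-byte short segment sits between two full-sized ones, the second call returns those 100 bytes plus the first 65464 bytes of the following segment, so the nonce and MAC are read from the wrong offsets and authentication would fail.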

It could possibly be worked around by having the htsget server be able to access it (so having a private key, which seems non-optimal) or having the archive do special magic to extend short blocks, but that of course adds a layer of complexity (and it's not obvious to me that it's better than e.g. a design composed of streams that are decrypted separately).

mmalenic · 2024-05-31T00:09:46Z

Yes, it would be possible for htsget/the archive to make these edits, but it would require decrypting the last block and having the associated key to do that. I think this would probably add more complexity than necessary for something like this, especially if the client was able to do it on their end.

If considering htsget with access to local data files (e.g. using LocalStorage), then this could be more suitable because the server would probably be able to decrypt those files anyway. However, Crypt4GH for LocalStorage or S3Storage is not implemented in this PR (although there are plans for it eventually).

pontus · 2024-06-14T08:05:03Z

Sorry for the delayed response here.

Do I understand correctly that you share my concern with the current design?

I think getting data via htsget is an important feature but am not terribly interested in exactly how it's done. As for serving bits of crypt4gh encrypted data, it makes a lot more sense to me to use coordinates in the resulting decrypted stream than any encrypted stream, because it frees the "client side" from dealing with the in-archive file's data-edit-list (or assuming there is none). Similarly, if the header is included in the coordinate system, there's the problem that the size of the header may well depend on what bits of the file in archive are requested (because of e.g. different needs for the data-edit-list, different symmetric keys being included and so on). And I do have an interest in the code for the sensitive-data-archive not having to jump through too many hoops to support special cases.

mmalenic · 2024-06-18T23:26:55Z

> Do I understand correctly that you share my concern with the current design?

I don't think I fully understand what the concern is, but I think you would like the client and backend to avoid doing extra work? I agree, and htsget-rs already attempts to do that as much as possible. At some point though, the system needs to interact with Crypt4GH if that's the goal. E.g. the client needs to understand how to decrypt data if it's receiving encrypted bytes.

> it makes a lot more sense to me to use coordinates in the resulting decrypted stream than any encrypted stream

I'm not sure I understand what you mean here. Are you saying that the client should receive URL tickets from htsget-rs which represent unencrypted data from the archive backend? If so, this is supported with the send_unencrypted_to_client config option.

> if the header is included in the coordinate system, there's the problem that the size of the header may well depend on what bits of the file in archive are requested

This can definitely be addressed by htsget-rs by requesting more data if it is not enough. Note that the archive backend doesn't need to be super smart here: it just needs to respond to HTTP range requests from htsget-rs, and there is no need for it to convert between encrypted/unencrypted positions.
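For readers following the coordinate discussion, here is a minimal sketch of the kind of conversion between unencrypted and encrypted positions that is being talked about. It is not the htsget-rs implementation. It assumes a known header length, ignores edit lists entirely, and uses the Crypt4GH segment layout of a 12-byte nonce and a 16-byte MAC around each block of up to 65536 data bytes. The function names and the header length used in the usage note are hypothetical.

```rust
const SEGMENT_PLAINTEXT: u64 = 65_536; // data bytes per full segment
const SEGMENT_OVERHEAD: u64 = 12 + 16; // nonce + MAC
const SEGMENT_ENCRYPTED: u64 = SEGMENT_PLAINTEXT + SEGMENT_OVERHEAD;

/// Total encrypted size for a given header and plaintext length.
/// Only the last segment may be shorter than a full segment.
fn encrypted_len(header_len: u64, plaintext_len: u64) -> u64 {
    let full = plaintext_len / SEGMENT_PLAINTEXT;
    let rem = plaintext_len % SEGMENT_PLAINTEXT;
    let last = if rem > 0 { rem + SEGMENT_OVERHEAD } else { 0 };
    header_len + full * SEGMENT_ENCRYPTED + last
}

/// Map a half-open plaintext range `[start, end)` to the half-open encrypted
/// byte range covering every whole segment that overlaps it.
/// Returns `None` for empty or out-of-bounds ranges.
fn encrypted_range(
    header_len: u64,
    plaintext_len: u64,
    start: u64,
    end: u64,
) -> Option<(u64, u64)> {
    if start >= end || end > plaintext_len {
        return None;
    }
    let first = start / SEGMENT_PLAINTEXT;
    // Round up so that a partially covered segment is included in full.
    let last_excl = (end + SEGMENT_PLAINTEXT - 1) / SEGMENT_PLAINTEXT;
    let enc_start = header_len + first * SEGMENT_ENCRYPTED;
    let enc_end = (header_len + last_excl * SEGMENT_ENCRYPTED)
        .min(encrypted_len(header_len, plaintext_len));
    Some((enc_start, enc_end))
}
```

With an assumed 124-byte header and 132072 bytes of plaintext, the plaintext range 70000 to 70100 maps to encrypted bytes 65688 to 131252, which is the whole of the second segment. The caller still has to decrypt that segment and discard the bytes outside the requested range.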

Is there a particular problem that you have encountered trying to set up htsget-rs with Crypt4GH? I think it would help me understand what kind of issues you are encountering with a concrete example. I'd be happy to help, and I'm pretty flexible with including features if it's possible.

brainstorm and others added 30 commits February 28, 2023 15:53

Add crypt4gh example keys and encrypted BAM file for test purposes only

d41d0e4

Add a bit more CLI tests on crypt4gh /cc @mmalenic

b3c22d4

Merge branch 'main' into crypt4gh

358a54d

Sketching Crypt4GH traits in the context of htsget-rs storage layer.

f20fc97

Co-authored-by: Marko Malenic <[email protected]>

Implementing AsyncRead Cryptor future to iterate over encrypted/decry…

ed9643a

…pted blocks. Co-authored-by: Marko Malenic <[email protected]>

Introduced (encrypted) block reader and implement the poll_fill_buf/p…

a39612f

…oll_next methods for the Async(Buf)Read/Stream. Co-authored-by: Marko Malenic <[email protected]>

fix: build errors and code tidy

2335ecc

refactor: add state for block decoding, and implement header info dec…

9d6a3a3

…oding

refactor: pass header packets count to the next state

d579e78

feat: add dedicated error type

cb58fa9

feat: implement header packet splitting in decoder state

9fe410f

refactor: move decode code into separate functions

c7618cc

test: add decode header info test

9b6c171

refactor: rename BlockType to DecodedBlock

d4b5a05

test: add test for header packets

38eac99

test: add test for data block

85c878f

refactor: create decrypt module

762be3e

refactor: move stream decryptor structs

58f32fd

feat: implement decryptor for header packets

ffd5bce

fix: use session keys and body decrypt in data block decryptor

45beda7

test: rearrange test functions and add data block decryptor test

b71af33

test: add header packet decryptor test

10b2bbc

refactor: move around some files and introduce polling functions

cb17e0d

refactor: decrypt all header packets at once in the header packet dec…

ff26397

…rypter

wip: header info future

4cc14ab

refactor: decode all header packets in one go

696b0ef

wip: last block of decoder

ef072aa

fix: decode end of file properly and add test for the last data block

540c01c

fix: working Crypt4GH Stream and tests

ff60068

feat(crypt4gh): add builder for reader

7905fb9

mmalenic added 3 commits March 12, 2024 08:07

merge from main

4f30ef9

feat: add user agent to url storage

77a011d

fix(search): tests and byte ranges, use main branch of crypt4gh

a5f891a

brainstorm approved these changes Mar 12, 2024

mmalenic added 13 commits March 12, 2024 15:51

fix(search): tests working, remove temp directory when reading/genera…

b734e93

…ting keys

feat(search): send_encrypted_to_client option, fix tests, and compile…

10f7624

… feature powerset

fix(search): fix full byte range search

bd456e7

search(url): base64 encode key and add header lines

40ecafd

test(search): fix test assumptions

194457b

fix(search): remove server-public-key, overwrite user-agent and clien…

c7ecefb

…t-public-key, update diagram

fix(search): add correct headers to head requests, including updated …

de7e9f8

…client key

Merge branch 'main' of https://github.com/umccr/htsget-rs into crypt4gh

31c72e2

# Conflicts: # deploy/Dockerfile # htsget-config/src/resolver/mod.rs

test(search): integration tests for crypt4gh and fix CRAM eof range

7e5ee23

fix(deploy): exclude async-crypt4gh from docker ignore

1fab8e7

fix(search): overwrite user-agent when requesting index file, replace…

df675cd

… hyper client with reqwest to support redirects and simpler certificate settings

feat(search): add override for validation certificates for testing pu…

a0aba09

…rposes

Merge branch 'main' of https://github.com/umccr/htsget-rs into crypt4gh

c06b550

brainstorm mentioned this pull request Sep 11, 2024

feat: Crypt4GH support using LocalStorage #262

Merged
