CLI: Support collections #819

Closed
josephjclark opened this issue Nov 10, 2024 · 3 comments · Fixed by #839
Comments

@josephjclark
Collaborator

It would surely be useful for the CLI to support the new Collections feature directly.

A particularly strong use-case would be to upload mapping data into a collection. I have 1000 mappings in my GH repo and want to upload them to Lightning for use in my workflow. I don't really want to have to create a workflow step and copy and paste the values in for this. I want to call the collections API directly.

openfn collections get/count/keys/set/delete <name> <pattern> --input  --output --token --limit --url
  • get would get values from a collection and return a JSON object of key/value pairs. Uses --output to write the result to disk (if unset, returns to stdout). Pass --json to parse values as JSON.
  • count would return the number of values which match a particular pattern. Returns to stdout only.
  • keys would return a list of matching keys without downloading any data (we don't even support this!)
  • set would accept data through a file, CSV or JSON. If JSON, it must expect an object of key/value pairs, and it should probably stringify values
  • delete would delete keys according to the pattern and return the count of deleted keys.
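The pattern handling shared by get, count, keys and delete could look something like the sketch below. This is purely illustrative (function names like matchKeys are not part of any real CLI), assuming `*` is the only wildcard supported in key patterns.

```javascript
// Escape regex metacharacters in the literal parts of a pattern
const escapeLiteral = (s) => s.replace(/[.+?^${}()|[\]\\]/g, '\\$&');

// Compile a glob-style pattern ('item-*') into an anchored RegExp
const patternToRegExp = (pattern) =>
  new RegExp('^' + pattern.split('*').map(escapeLiteral).join('.*') + '$');

// Filter a list of collection keys down to those matching the pattern
const matchKeys = (keys, pattern) => {
  const re = patternToRegExp(pattern);
  return keys.filter((k) => re.test(k));
};
```

With this in place, count is just `matchKeys(keys, pattern).length` and delete operates on the same filtered list.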

Auth

We need to think carefully about auth.

The request needs to include a personal access token.

But we probably don't want to take this as a flat env var because access tokens are scoped per project, so users will have several. Which makes it a bit fiddly to submit to the CLI.

Options:

  • accept an env var name in the command, ie, --token MY_OPENFN_TOKEN (but surely --token $MY_OPENFN_TOKEN also works?)
  • If a token isn't provided, prompt the user to paste it in (annoying)
  • Allow tokens to be saved by the CLI (which must encrypt and write them to the repo somewhere). Now we only need to take a token once and the CLI can remember for that collection name
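The first option above could be sketched as follows. This is a hypothetical helper (resolveToken is not a real CLI function): treat the --token argument as an env var name first, and fall back to using the value verbatim, so both `--token MY_OPENFN_TOKEN` and `--token $MY_OPENFN_TOKEN` work.

```javascript
// Resolve a --token argument: if it names a set environment variable,
// use that variable's value; otherwise assume it IS the token
const resolveToken = (tokenArg, env = process.env) => {
  if (!tokenArg) return undefined;
  if (env[tokenArg]) return env[tokenArg];
  return tokenArg;
};
```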

Implementation details

Should this use the collections adaptor under the hood? Probably not, because it behaves very differently. But we might copy some code across to handle streaming efficiently.

@github-project-automation github-project-automation bot moved this to New Issues in v2 Nov 10, 2024
@josephjclark
Copy link
Collaborator Author

josephjclark commented Dec 5, 2024

A data structure discussion

tl;dr:

  • Should the get and set structures be identical in the CLI? Probably yes, to support mappings
  • Should the structure be { [key]: value } or [{ key, value }] ?
  • Is collections.set confusing if you can set a single value or upload an object?

In the CLI I want to use a simplified data structure. So collections get my-collection * will return an object of key value pairs, like this:

{
  items: {
    'item-1': { id: 'a' },
    'item-2': { id: 'b' },
  },
  count: 2,
}

I find this quite a nice and intuitive data structure. Very Javascripty. I'll maybe add a meta key for created/updated metadata.

So when setting, it makes sense that I should do collections set my-collection path/to/values.json, where values.json looks like this:

{
  'item-1': { id: 'a' },
  'item-2': { id: 'b' },
}

Again it's very intuitive. But note there's no items and no count.

So, first problem: I can't get a bunch of items, save them to disk, manually update the values in-situ, and then set them back, because the shape of the data is wrong. Well, that's unintuitive, but I don't think I ever want to do this anyway?

Not for multiple items no. But for mappings, yes. I absolutely want to fetch, edit and upsert a mappings file. Probably usually the mappings file is checked in to source control but still, this seems like a thing I would want to do.

I could strip the wrapper from the get call, which aligns the data structures - but if I do this, I have no way to handle metadata. Does this matter? I don't know. I feel like closing off the option of metadata is a mistake.

We could change the format and download results to an array of { key, value, ...meta } objects. That aligns get and set and lets us save metadata later, and mirrors the server side data structure. But:

  • It's a very verbose serialisation, ugly and expensive in memory
  • If I download a single item, a mapping, does it still come down as a [{ key, value }] array? That's not nice at all for mappings

One benefit of that structure is that we can detect it, so that if you try and set, say, an array of keys onto a single key, we can raise a warning.
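The trade-off between the two shapes, and the detection benefit, can be sketched like this (illustrative helpers only, not proposed API):

```javascript
// Compact shape -> verbose shape: { [key]: value } to [{ key, value }]
const toArray = (obj) =>
  Object.entries(obj).map(([key, value]) => ({ key, value }));

// Verbose shape -> compact shape, dropping any metadata fields
const toObject = (items) =>
  Object.fromEntries(items.map(({ key, value }) => [key, value]));

// The verbose shape is detectable, so a set of an item array onto a
// single key can raise a warning
const looksLikeItemArray = (data) =>
  Array.isArray(data) &&
  data.every((d) => d && typeof d === 'object' && 'key' in d && 'value' in d);
```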

That leads into the second problem. I'm really worried that a CLI like this is ambiguous:

openfn collections set my-collection some-key path/to/values.json

If values.json contains a single value, no problem. We upload that value verbatim to the collection.

If values.json contains many key/value pairs, we'll be uploading a nested structure into the key. Ok, maybe that's stupid, but isn't it super easy to accidentally have a key name in the command, or mistype something? Miss a dash out and an argument like --limit will be interpreted as a key.

So I think this gets flaky and unintuitive.

Solution?

I really want to optimise this for uploading and downloading single items.

That means get and set should upload and download to file verbatim. No metadata (we could log it to stdout, I suppose).

If getting with a pattern, we can write a simplified structure with no wrapper. I think this is fairly intuitive.

That leaves bulk upsert - which is currently impossible in this API. We could add a command like:

openfn collections set my-collection --items path/to/values.json

Where values.json is a verbatim object of key/value pairs (the same object that would be downloaded)

I like this approach because --items is pretty declarative, explicitly plural, and non-default.
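The --items behaviour could be as simple as the sketch below, where upsert is a hypothetical stand-in for the real collections API call:

```javascript
// Expand a verbatim object of key/value pairs (the parsed contents of
// values.json) into one upsert per key
const setItems = (collection, itemsObject, upsert) =>
  Object.entries(itemsObject).forEach(([key, value]) =>
    upsert(collection, key, value)
  );
```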

@josephjclark
Collaborator Author

Should we use a standard env var for the PAT after all?

Although a user might have access to multiple projects, they'll likely only have one PAT right? And usually only use one lightning instance?

So if we defaulted to OPENFN_PAT and loaded that by default, that would actually be quite reasonable. We can log the last 8 characters or whatever (like the Lightning UI does) to give a hint as to which token is being used, just for the avoidance of doubt. Maybe at debug level.
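The hint could be a one-liner (tokenHint is illustrative; the "last 8 characters" convention is from the comment above, not a confirmed spec):

```javascript
// Mask a token for logging, showing only its last 8 characters
const tokenHint = (token, visible = 8) => {
  if (!token) return '(no token)';
  const tail = token.slice(-visible);
  return token.length > visible ? `...${tail}` : tail;
};
```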

@josephjclark
Collaborator Author

What about remove --dry-run, which would basically just GET a list of keys which match the pattern and report what would be removed.

So you can use remove --dry-run to test a key pattern
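A sketch of that behaviour, with fetchKeys and deleteKey as hypothetical stand-ins for the real API calls:

```javascript
// Dry-run-aware removal: always fetch the matching keys, but only
// delete them when dryRun is false; report either way
const removeKeys = (collection, pattern, { dryRun = false, fetchKeys, deleteKey }) => {
  const keys = fetchKeys(collection, pattern);
  if (!dryRun) keys.forEach((key) => deleteKey(collection, key));
  return { keys, deleted: dryRun ? 0 : keys.length };
};
```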

@github-project-automation github-project-automation bot moved this from New Issues to Done in v2 Dec 11, 2024