Replies: 9 comments
-
I think I would personally prefer the
Personally I think this could be helpful beyond streaming. It affords me the ability to package both the prompt and the LLM response for other downstream tasks. That way I have awareness of the impact of prompt changes when the Boundary dashboard doesn't suffice.
What does this mean? Will BAML somehow signal to me that the stream has completed with an extra token? In our setup, I wait for the OpenAI stream to complete, then send a special token to the client to signal that the stream has completed. I would need to know if BAML intends to emit its own stop-sequence token.
-
Interesting point. In that case I'll add this in regardless as part of this work. It should be very easy.
Effectively, in the following case:

```python
from typing import Callable

def OnStreamHandler(res: str) -> None:
    print(res)

async def bar(arg: str, handler: Callable[[str], None]) -> None:
    # res is of type Partial<ReturnType<GetMessage>>
    response = await b.GetMessage.stream(arg, __onstream__=handler)
    # or, targeting a specific impl:
    response = await b.GetMessage.get_impl('v1').stream(arg, __onstream__=handler)

await bar("Foo", OnStreamHandler)
```
-
Can you confirm my understanding? If I were to run the code you shared as is (ignoring the
Is that correct? If so, I think I'd prefer it if BAML provided a way to discern the delta from the complete response.
-
It would actually do:
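A hypothetical illustration (not part of the original comment), based on the rest of the thread, of the behavior being described: `__onstream__` is called with the accumulated partial string each time, not just the new characters.

```python
# Hypothetical illustration: the handler receives the accumulated partial
# string on every call, not just the newly generated characters.
def OnStreamHandler(res: str) -> None:
    print(res)

# Successive invocations might look like:
#   OnStreamHandler("Hel")
#   OnStreamHandler("Hello, wor")
#   OnStreamHandler("Hello, world!")
```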
The quotes were there only to show that it's a string type.
-
I see, ok, I understand. It would be cool if I could just pass some sort of option for this. I think in my ideal world, the stream handler would receive the delta, and the final response would receive the full completion.
-
May I ask the use case for getting just the delta? The issue with providing just the delta is that it prevents us from implementing the object version of stream, since then the caller would be responsible for implementing that. This also prevents adapters from implementing streams.
-
Our client side is configured to accept one new character at a time until the final special stop sequence is received. Without a delta option, I'd need to refactor our application code to manage the state of the string. If it's going to be a big lift for the BAML DSL, then I can add the logic to the handler.
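For reference, a minimal sketch of handler-side delta tracking, assuming `__onstream__` is called with the accumulated string and using a hypothetical `send_to_client` push function:

```python
# Sketch only: derive deltas inside the handler, assuming __onstream__ passes
# the accumulated partial string on each call.
class DeltaTracker:
    def __init__(self) -> None:
        self._seen = 0

    def __call__(self, accumulated: str) -> None:
        delta = accumulated[self._seen:]  # only the newly streamed characters
        self._seen = len(accumulated)
        if delta:
            send_to_client(delta)  # hypothetical client-push function

# usage: await b.GetMessage.stream(arg, __onstream__=DeltaTracker())
```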
-
I see, ok! We can support this via an additional param to `.stream`, which will return only the delta.
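A sketch of how such a delta-only parameter might be used, assuming the `__onstreamdelta__` name from the proposal below (per the last comment in this thread, this spec was later superseded by the documented streaming API):

```python
def on_delta(delta: str) -> None:
    # Forward only the newly generated characters to the client.
    websocket.send(delta)  # hypothetical client-push call

response = await b.GetMessage.stream(arg, __onstreamdelta__=on_delta)
```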
-
Just for posterity: this spec is outdated; the real streaming API we implemented is now in our docs for both TS and Python!
-
Original Proposal: https://gloochat.notion.site/Streaming-d4a538be3cd6494d8c7b1710e9b63252
Problem
With LLMs, due to model generation times, there can be a large delay between getting the first token and getting the final output. With zero signal in the middle, users can be stuck with no indication that the model is working.
Current Situation
Currently, streaming cannot be used through BAML or any BAML-generated code; you must call OpenAI, Anthropic, or another client directly.
Prior Art
stream = true
https://twitter.com/i/status/1747833924275859917
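For comparison, a minimal sketch of provider-level streaming with the raw OpenAI Python SDK (v1-style client; the model name and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Write me a haiku."}],
    stream=True,  # the stream = true option referenced above
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```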
Proposal
Constraints
Parameters starting and ending with `__` are reserved (see the Questions to consider section below to understand why).
Sample Code
Option 1
Add a new top-level method called `.stream` with a required `__onstream__` parameter. To support this, we would likely need to ban all parameters starting and ending with `__` from being defined in BAML.
Questions to consider
Streaming should only be done in runtime code. The purpose of BAML is to provide clean interfaces for inputs and outputs. Streaming is only done to produce partial outputs, hence there is no point in streaming within BAML code. In the future, we may offer early outs directly in BAML (i.e. cancel the LLM call if the partial result is bad).
There is no flag in BAML yet. All LLM providers will offer streaming via the interface; by default, the stream interface will patch directly to the run interface and do nothing on stream. For non-LLM providers, stream will also default to the run interface.
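A minimal sketch of that default, assuming a hypothetical provider class: `stream` simply delegates to `run` and never emits partial output.

```python
from typing import Callable, Optional

class NonStreamingProvider:
    """Hypothetical provider: streaming is not supported, so stream() just
    patches through to run() and the handler is never called."""

    async def run(self, prompt: str) -> str:
        raise NotImplementedError

    async def stream(
        self,
        prompt: str,
        on_stream: Optional[Callable[[str], None]] = None,
    ) -> str:
        # No partial tokens are emitted; the caller gets the full result,
        # exactly as if run() had been called directly.
        return await self.run(prompt)
```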
For now, streaming for objects will be unsupported; only strings will be streamed. We think we can support this trivially, however. Only functions with an output type of string will even get a `.stream` method. If we did want to implement streaming of objects, we would need the following: on each partial output, close any unterminated `{`, `[`, or `"`, then run the appropriate deserializer. On failed deserialization, we just skip that token and wait for more of the stream to complete.
Since an adapter may change the output type of the LLM, all functions with adapters won't end up streaming anything for now.
A user can define a function in a BAML file that takes the partial adapter output and converts it to the partial function output. Then streaming can work as expected.
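A rough sketch (not from the proposal) of the close-and-retry deserialization idea described above for streaming objects:

```python
import json

def try_parse_partial(buffer: str):
    """Best-effort parse of a partially streamed JSON object: close any
    unterminated strings, arrays, and objects, then attempt to deserialize.
    Returns None when the partial output still cannot be parsed, in which
    case the caller just waits for more of the stream."""
    closers = []
    in_string = False
    escaped = False
    for ch in buffer:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    candidate = buffer + ('"' if in_string else "") + "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```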
If there are impls which are heuristics, we may still want to pass the `__onstream__` handler and other state downstream. For now, we just need to keep this in mind when we implement them.
Some users may want to handle the delta themselves; for them we can offer a second parameter, `__onstreamdelta__`, in addition to `__onstream__`, which has the following signature: `Callable[[str], None]`.
There is no additional elegance gained from being handed a stream. No one would practically write something like:
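A hypothetical sketch (not part of the original proposal) of the iterator style being argued against, using an imaginary `stream_iter` method that BAML does not expose:

```python
# Hypothetical iterator-style API, shown only to illustrate the alternative
# the proposal argues against. `stream_iter` is an imaginary method.
async def bar(arg: str) -> None:
    partial = ""
    async for token in b.GetMessage.stream_iter(arg):
        partial += token
        render(partial)  # hypothetical UI update
```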
Most streaming-based code on the web is callback-driven anyway (even in FastAPI and Express).
We could have more telemetry, such as time to first token. Not P0.
Alternative Designs
Expose the generated prompt for an impl
We could alternatively offer a method that exposes the generated prompt via the client.
Cons:
Stream as a parameter on run
Make this an option directly on run, rather than a different method. This is not as clean, since how a provider would implement stream may differ dramatically from the run method.
Offer a stream type in BAML
Main cons are: