Replies: 9 comments
-
I think I would personally prefer the
Personally I think this could be helpful beyond streaming. It affords me the ability to package both the prompt and the LLM response for other downstream tasks. That way I have awareness of the impact of prompt changes when the Boundary dashboard doesn't suffice.
What does this mean? Will BAML somehow signal to me that the stream has completed with an extra token? In our setup, I wait for the OpenAI stream to complete, then send a special token to the client to signal that the stream has completed. I would need to know if BAML intends to emit its own stop-sequence token.
-
Interesting point. In that case I'll add this in regardless as part of this work. It should be very easy.
Effectively, in the following case:

```python
from typing import Callable

def OnStreamHandler(res: str) -> None:
    print(res)

async def bar(arg: str, handler: Callable[[str], None]) -> None:
    # res is of type Partial<ReturnType<GetMessage>>
    response = await b.GetMessage.stream(arg, __onstream__=handler)
    # or, targeting a specific impl:
    response = await b.GetMessage.get_impl('v1').stream(arg, __onstream__=handler)

await bar("Foo", OnStreamHandler)
```
-
Can you confirm my understanding? If I were to run the code you shared as is (ignoring the
Is that correct? If so, I think I'd prefer it if BAML provided a way to discern the delta from the complete response.
-
It would actually do:
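A hypothetical illustration (not part of the original comment), based on the rest of the thread, of the behavior being described: `__onstream__` is called with the accumulated partial string each time, not just the new characters.

```python
# Hypothetical illustration: the handler receives the accumulated partial
# string on every call, not just the newly generated characters.
def OnStreamHandler(res: str) -> None:
    print(res)

# Successive invocations might look like:
#   OnStreamHandler("Hel")
#   OnStreamHandler("Hello, wor")
#   OnStreamHandler("Hello, world!")
```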
The quotes were there only to show that it's a string type.
-
I see, ok, I understand. It would be cool if I could just pass some sort of option for this. I think in my ideal world, the stream handler would receive the delta, and the final response would receive the full completion.
-
May I ask the use case for getting just the delta? The issue with providing just the delta is that it prevents us from implementing the object version of stream, since then the caller would be responsible for implementing that. This also prevents adapters from implementing streams.
-
Our client side is configured to accept one new character at a time until the final special stop sequence is received. Without a delta option, I'd need to refactor our application code to manage the state of the string. If it's going to be a big lift for the BAML DSL, then I can add the logic to the handler.
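For reference, a minimal sketch of handler-side delta tracking, assuming `__onstream__` is called with the accumulated string and using a hypothetical `send_to_client` push function:

```python
# Sketch only: derive deltas inside the handler, assuming __onstream__ passes
# the accumulated partial string on each call.
class DeltaTracker:
    def __init__(self) -> None:
        self._seen = 0

    def __call__(self, accumulated: str) -> None:
        delta = accumulated[self._seen:]  # only the newly streamed characters
        self._seen = len(accumulated)
        if delta:
            send_to_client(delta)  # hypothetical client-push function

# usage: await b.GetMessage.stream(arg, __onstream__=DeltaTracker())
```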
-
I see, ok! We can support this via an additional param to `.stream`, which will return only the delta.
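A sketch of how such a delta-only parameter might be used, assuming the `__onstreamdelta__` name from the proposal below (per the last comment in this thread, this spec was later superseded by the documented streaming API):

```python
def on_delta(delta: str) -> None:
    # Forward only the newly generated characters to the client.
    websocket.send(delta)  # hypothetical client-push call

response = await b.GetMessage.stream(arg, __onstreamdelta__=on_delta)
```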
-
Just for posterity: this spec is outdated; the real streaming API we implemented is now in our docs for both TS and Python!
-
Original Proposal: https://gloochat.notion.site/Streaming-d4a538be3cd6494d8c7b1710e9b63252
Problem
With LLMs, due to model generation times, there can be a large delay between getting the first token and getting the final output. With zero signal in the middle, users can be stuck with no indication that the model is working.
Current Situation
Currently, streaming cannot be used through BAML or any BAML-generated code; you must call OpenAI, Anthropic, or another client directly.
Prior Art
stream = true
https://twitter.com/i/status/1747833924275859917
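For comparison, a minimal sketch of provider-level streaming with the raw OpenAI Python SDK (v1-style client; the model name and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Write me a haiku."}],
    stream=True,  # the stream = true option referenced above
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```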
Proposal
Constraints
Parameters starting and ending with `__` are reserved (see the Questions to consider section below to understand why).
Sample Code
Option 1
Add a new top-level method called `.stream` with a required `__onstream__` parameter. To support this, we would likely need to ban all parameters starting and ending with `__` from being defined in BAML.
Questions to consider
Streaming should only be done in runtime code. The purpose of BAML is to provide clean interfaces for inputs and outputs. Streaming is only done to produce partial outputs, hence there is no point in streaming within BAML code. In the future, we may offer early outs directly in BAML (i.e. cancel the LLM call if the partial result is bad).
There is no flag in BAML yet. All LLM providers will offer streaming via the interface; by default, the stream interface will patch directly to the run interface and do nothing on stream. For non-LLM providers, stream will also default to the run interface.
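A minimal sketch of that default, assuming a hypothetical provider class: `stream` simply delegates to `run` and never emits partial output.

```python
from typing import Callable, Optional

class NonStreamingProvider:
    """Hypothetical provider: streaming is not supported, so stream() just
    patches through to run() and the handler is never called."""

    async def run(self, prompt: str) -> str:
        raise NotImplementedError

    async def stream(
        self,
        prompt: str,
        on_stream: Optional[Callable[[str], None]] = None,
    ) -> str:
        # No partial tokens are emitted; the caller gets the full result,
        # exactly as if run() had been called directly.
        return await self.run(prompt)
```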
For now, streaming for objects will be unsupported; only strings will be streamed. We think we can support this trivially, however. Only functions with an output type of string will even get a `.stream` method. If we did want to implement streaming of objects, we would need the following: on each partial output, close any unterminated `{`, `[`, or `"`, then run the appropriate deserializer. On failed deserialization, we just skip that token and wait for more of the stream to complete.
Since an adapter may change the output type of the LLM, all functions with adapters won't end up streaming anything for now.
A user can define a function in a BAML file that takes the partial adapter output and converts it to the partial function output. Then streaming can work as expected.
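A rough sketch (not from the proposal) of the close-and-retry deserialization idea described above for streaming objects:

```python
import json

def try_parse_partial(buffer: str):
    """Best-effort parse of a partially streamed JSON object: close any
    unterminated strings, arrays, and objects, then attempt to deserialize.
    Returns None when the partial output still cannot be parsed, in which
    case the caller just waits for more of the stream."""
    closers = []
    in_string = False
    escaped = False
    for ch in buffer:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    candidate = buffer + ('"' if in_string else "") + "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```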
If there are impls which are heuristics, we may still want to pass the `__onstream__` handler and other state downstream. For now, we just need to keep this in mind when we implement them.
Some users may want to handle the delta themselves; for them we can offer a second parameter, `__onstreamdelta__`, in addition to `__onstream__`, which has the following signature: `Callable[[str], None]`.
There is no additional elegance gained from being handed a stream. No one would practically write something like:
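A hypothetical sketch (not part of the original proposal) of the iterator style being argued against, using an imaginary `stream_iter` method that BAML does not expose:

```python
# Hypothetical iterator-style API, shown only to illustrate the alternative
# the proposal argues against. `stream_iter` is an imaginary method.
async def bar(arg: str) -> None:
    partial = ""
    async for token in b.GetMessage.stream_iter(arg):
        partial += token
        render(partial)  # hypothetical UI update
```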
Most streaming-based code on the web is callback-driven anyway (even in FastAPI and Express).
We could have more telemetry, such as time to first token. Not P0.
Alternative Designs
Expose the generated prompt for an impl
We could alternatively offer a method that exposes the generated prompt via the client.
Cons:
Stream as a parameter on run
Make this an option directly on run, rather than a different method. This is not as clean, since how a provider would implement stream may differ dramatically from the run method.
Offer a stream type in BAML
Main cons are: