[SS] Discussion thread for "3.1 Structured Streaming: State Store Explained" #33
Comments
@lw-lin
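That depends on the size of the dataset. If the dataset is very small, e.g. the space of user ids is small, the state store is fine. If the space of user ids is large but the number of distinct user ids per day is small, the state store is also fine. But if the space of user ids is large and there are also many distinct user ids per day, the state store becomes a problem. In that case, consider other approaches such as HyperLogLog.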
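To make the HyperLogLog suggestion concrete, here is a minimal, illustrative sketch in Python. It is not Spark's code: the class name `HyperLogLog`, the precision parameter `p`, and the choice of SHA-1 as the hash function are all assumptions made for this example. The point is that memory stays fixed at `2**p` small registers no matter how many distinct user ids arrive, in exchange for an approximate count (standard error of roughly 1.04 / sqrt(2**p)).

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog counter (illustrative only, not Spark's implementation)."""

    def __init__(self, p: int = 12) -> None:
        # p index bits -> m = 2**p registers; the alpha formula below assumes m >= 128.
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item) -> None:
        # Derive a 64-bit hash from the item (SHA-1 is an arbitrary choice here).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The top p bits select a register; the remaining bits feed the rank.
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        # Raw estimate from the harmonic mean of 2**register.
        z = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / z
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog(p=12)
    for day_event in range(300_000):
        hll.add(f"user-{day_event % 100_000}")  # 100,000 distinct ids, each seen 3 times
    print(round(hll.estimate()))
```

Duplicates do not change the registers, so repeated events from the same user are absorbed for free. In Spark itself, the built-in `approx_count_distinct` aggregate offers approximate distinct counting without hand-rolling a sketch like this.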
Thanks.
Hello, I'd like to ask what exactly is stored in the stateStore. In statefulOperators I see some put operations on the state, as follows:
|
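@KevinZwx It is UnsafeRow; both the key and the value are UnsafeRow. UnsafeRow plays the same role in the SparkSQL module that Object plays in Java. An UnsafeRow holds the concrete data of various types (numeric values, strings, and so on).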
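As a rough mental model of "key and value are both rows", the following toy sketch in Python may help. It is not Spark's implementation: `ToyStateStore`, `Row`, and `update_counts` are hypothetical names, and plain tuples stand in for UnsafeRow.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical stand-in for UnsafeRow: a tuple of column values.
Row = Tuple


class ToyStateStore:
    """Toy key-value store where both key and value are rows (illustrative only)."""

    def __init__(self) -> None:
        self._data: Dict[Row, Row] = {}

    def get(self, key: Row) -> Optional[Row]:
        # Returns None when no state exists yet for this key.
        return self._data.get(key)

    def put(self, key: Row, value: Row) -> None:
        self._data[key] = value


def update_counts(store: ToyStateStore, batch_keys: Iterable[Row]) -> None:
    """Fold one batch into a running count per key, the way a stateful count would."""
    for key in batch_keys:
        old = store.get(key)
        count = old[0] if old is not None else 0
        store.put(key, (count + 1,))


store = ToyStateStore()
update_counts(store, [("user-1",), ("user-2",), ("user-1",)])
```

Here the key row holds the grouping columns and the value row holds the aggregation buffer; each batch reads the old value row, combines it with new input, and puts the result back.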
OK, thanks.
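Hello, I'd like to ask: when each batch of data updates the state, does it have to pull the corresponding stateStore from HDFS every time, and then write it back to HDFS once the update is done?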
A question that may not strictly be about state: in Structured Streaming, for a join between two streams,
If you need to paste code, please copy the following and modify it:
Thanks!