Flesh out Apache Spark Examples documentation #5160
Conversation
At first glance the examples are reasonable. The next step would be fleshing out the English explanations to highlight the relevant parts of each example and explain why they are important and necessary for that particular example.
Perhaps one more thing is required at the PR level: a convincing explanation in the PR description of why these are the examples that matter most for Spark developers, rather than the dozens of other possible examples we could come up with. I do not have a Spark background, so you'll need to tell me why this choice of examples is the most useful for Spark users in a way I can understand and be convinced by.
@lihaoyi Ready for review
Sorry for the delay, I just got back home after a month of traveling. The explanations in the example files are generally lacking; reading through them doesn't tell me anything about the code or the example: why the example is relevant, why the given Mill code is necessary, or what parts of the given Mill code are interesting. This needs to be fleshed out to be made useful.
This PR adds example documentation for key components of the Apache Spark ecosystem:
3-spark-streaming (with Kafka & docker-compose)
Spark Streaming lets Spark developers apply their existing Spark skills to real-time data streams, enabling powerful and scalable streaming applications. Kafka is widely considered the de facto data source for Spark Streaming. This example demonstrates a simple Spark Streaming service backed by Kafka, and uses custom Mill tasks to run administrative Kafka commands, as sketched below.
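For context, a custom Mill task wrapping a Kafka CLI call might look roughly like the following sketch. The module layout, version numbers, service name, and the `createTopic` command are illustrative assumptions, not the exact code in this PR:

```scala
// build.mill — illustrative sketch only; versions and names are assumptions.
package build
import mill._, scalalib._

object `package` extends RootModule with ScalaModule {
  def scalaVersion = "2.12.19"
  def ivyDeps = Agg(
    ivy"org.apache.spark::spark-sql:3.5.1",
    ivy"org.apache.spark::spark-sql-kafka-0-10:3.5.1"
  )

  // Custom Mill command wrapping the Kafka CLI. It assumes the broker from
  // the example's docker-compose file is running as the "kafka" service
  // and listening on localhost:9092.
  def createTopic() = Task.Command {
    os.proc(
      "docker", "compose", "exec", "kafka",
      "kafka-topics", "--create",
      "--topic", "events",
      "--bootstrap-server", "localhost:9092"
    ).call(stdout = os.Inherit)
    ()
  }
}
```

Running `./mill createTopic` then shells out to the broker container, so topic administration lives next to the build definition instead of in ad-hoc shell scripts.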
4-hello-delta
5-hello-iceberg
Delta Lake and Apache Iceberg are open-source storage layers that bring features like ACID transactions and schema evolution to data lakes used with Spark, making them more reliable and manageable for large-scale analytics. Delta Lake has very high adoption and is the default within the Databricks community, while Apache Iceberg's adoption is growing rapidly. These two are the most commonly used storage layers for Spark in real-world practice; the Delta side is sketched below.
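For readers unfamiliar with Delta Lake, the core of such an example is registering Delta's SQL extension and catalog on the `SparkSession` and then reading and writing with `format("delta")`. A minimal sketch, where the version pins, output path, and sample data are assumptions rather than code from this PR:

```scala
// Illustrative sketch of a "hello delta" program; not the PR's exact code.
import org.apache.spark.sql.SparkSession

object HelloDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hello-delta")
      .master("local[*]")
      // Register Delta's SQL extensions and catalog so Spark understands
      // the "delta" format and Delta-specific DDL.
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
              "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    import spark.implicits._

    // Writing with format("delta") is what gives the table its ACID semantics.
    Seq((1, "a"), (2, "b")).toDF("id", "value")
      .write.format("delta").mode("overwrite").save("/tmp/hello-delta")

    spark.read.format("delta").load("/tmp/hello-delta").show()
    spark.stop()
  }
}
```

An Iceberg example follows the same shape, swapping in Iceberg's catalog configuration and `format("iceberg")`.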
6-hello-mllib
7-hello-pyspark-mllib
Spark MLlib, Spark's scalable machine learning library, is designed to run efficiently on Spark's distributed computing framework and provides tools for common machine learning tasks such as regression and classification. Examples are provided in both Scala and Python; a rough Scala sketch follows.
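As a rough illustration of what these examples cover, a minimal MLlib pipeline in Scala looks like the sketch below; the training data and column names are made up for illustration:

```scala
// Illustrative MLlib sketch; data and names are invented, not from this PR.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object HelloMllib {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hello-mllib")
      .master("local[*]")
      .getOrCreate()

    // MLlib estimators expect a "features" vector column and a "label" column.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Fit a logistic regression model and inspect its predictions.
    val model = new LogisticRegression().setMaxIter(10).fit(training)
    model.transform(training).select("label", "prediction").show()
    spark.stop()
  }
}
```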
Resolves issue #4592