Flesh out Apache Spark Examples documentation #5160

Open
monyedavid wants to merge 9 commits into main

Conversation

@monyedavid (Contributor) commented May 20, 2025

This PR adds example documentation for key components of the Apache Spark ecosystem:

  • 3-spark-streaming (with Kafka & docker-compose)
    Spark Streaming lets Spark developers apply their existing Spark skills to real-time data streams, enabling powerful and scalable streaming applications. Kafka is widely considered the de-facto data source for Spark Streaming. This example demonstrates a simple Spark Streaming service backed by Kafka, and defines custom Mill tasks that run administrative Kafka commands (a rough sketch of such a task is shown below the list).

  • 4-hello-delta

  • 5-hello-iceberg
    Delta Lake and Apache Iceberg are open-source storage layers that bring features such as ACID transactions and schema evolution to data lakes used with Spark, making them more reliable and manageable for large-scale analytics. Delta Lake is very widely adopted and is the default within the Databricks ecosystem, while Apache Iceberg is seeing rapidly growing adoption; together they are the two most commonly used storage layers in real-world practice (a sketch of the Delta variant follows the list).

  • 6-hello-mllib

  • 7-hello-pyspark-mllib
    Spark MLlib, Spark's scalable machine learning library, is designed to run efficiently on Spark's distributed computing framework and provides tools for common machine learning tasks such as regression and classification. Examples are provided in both Python and Scala (a minimal Scala sketch follows the list).
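
To make the streaming example concrete, here is a rough sketch of the kind of custom Mill task used to drive Kafka administration. The module name, task name, Compose service name, topic name and version numbers are placeholders, and the exact Mill API surface (e.g. `Task.Command`) depends on the Mill version the example pins:

```scala
// Build file sketch -- names and versions below are illustrative only.
package build
import mill._, scalalib._

object `spark-streaming` extends ScalaModule {
  def scalaVersion = "2.13.15"
  def ivyDeps = Agg(
    ivy"org.apache.spark::spark-sql:3.5.4",
    ivy"org.apache.spark::spark-sql-kafka-0-10:3.5.4"
  )

  // Custom command that shells out to the Kafka broker started by
  // docker-compose and creates the topic the streaming job consumes.
  // The service name ("kafka"), topic ("events") and admin script name
  // depend on the image configured in docker-compose.yml.
  def createTopic() = Task.Command {
    os.proc(
      "docker", "compose", "exec", "kafka",
      "kafka-topics.sh", "--create", "--if-not-exists",
      "--topic", "events",
      "--bootstrap-server", "localhost:9092"
    ).call(stdout = os.Inherit)
  }
}
```

A task like this would be invoked with something like `./mill spark-streaming.createTopic` before starting the streaming job, which keeps the Kafka setup steps inside the build instead of in ad-hoc shell scripts.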
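For the Delta/Iceberg pair, the Spark code itself is ordinary DataFrame code; what the examples mainly need to show is the session wiring and the write format. A hedged sketch of the Delta variant (the app name, path and sample data are invented; Iceberg is configured analogously through its own extension and catalog settings):

```scala
import org.apache.spark.sql.SparkSession

object HelloDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hello-delta")
      .master("local[*]")
      // These two settings plug Delta Lake into the Spark session.
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    import spark.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Writing with format("delta") is what adds ACID transactions,
    // schema enforcement and time travel on top of plain Parquet files.
    df.write.format("delta").mode("overwrite").save("/tmp/hello-delta-table")
    spark.read.format("delta").load("/tmp/hello-delta-table").show()

    spark.stop()
  }
}
```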
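For MLlib, a minimal Scala pipeline of the kind the example covers; the toy data and column names are invented, and the PySpark example expresses the same idea through pyspark.ml:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object HelloMllib {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hello-mllib")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny invented dataset: a binary label plus two numeric features.
    val training = Seq(
      (0.0, 1.0, 0.1),
      (1.0, 2.0, 1.1),
      (0.0, 1.5, 0.2),
      (1.0, 3.0, 1.3)
    ).toDF("label", "f1", "f2")

    // Assemble the feature columns into a vector, then fit a classifier.
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

    model.transform(training).select("label", "prediction").show()
    spark.stop()
  }
}
```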

Resolves issue #4592

@lihaoyi (Member) commented May 28, 2025

At first glance the examples are reasonable. The next step would be fleshing out the English explanations to highlight the relevant parts of each example and explain why they are important and necessary for each particular example.

@lihaoyi (Member) commented May 28, 2025

Perhaps one more thing is required at the PR level: a convincing explanation in the PR description of why these are the examples that matter most to Spark developers, and not the dozens of other possible examples we could come up with. I do not have a Spark background, so you'll need to tell me why this choice of examples is the most useful for Spark users, in a way I can understand and be convinced by.

@monyedavid (Contributor, Author) commented:

@lihaoyi Ready for review

@lihaoyi (Member) commented Jun 8, 2025

Sorry for the delay; I just got back home after a month of traveling.

The explanations in the example files are generally lacking; reading through them doesn't tell me anything about the code or the example: why the example is relevant, why the given Mill code is necessary, or what parts of the given Mill code are interesting. This needs to be fleshed out to be made useful.
