Flesh out Apache Spark Examples documentation #5160
Conversation
At first glance the examples are reasonable. The next step would be fleshing out the English explanations to highlight the relevant parts of each example and explain why they are important and necessary for that particular example.
Perhaps one more thing is required at the PR level: a convincing explanation in the PR description of why these are the examples that matter most for Spark developers, rather than the dozens of other possible examples we could come up with. I do not have a Spark background, so you'll need to tell me why this choice of examples is the most useful for Spark users in a way I can understand and be convinced by.
@lihaoyi Ready for review
Sorry for the delay, I just got back home after a month of traveling. The explanations in the example files are generally lacking; reading through them doesn't tell me anything about the code or the example: why the example is relevant, why the given Mill code is necessary, or what parts of the given Mill code are interesting. This needs to be fleshed out to be made useful.
This PR adds example documentation for key components of the Apache Spark ecosystem:
3-spark-streaming (with Kafka & docker-compose)
Spark Streaming lets Spark developers apply their existing Spark skills to real-time data streams, enabling powerful and scalable streaming applications. Kafka is widely considered the de facto data source for Spark Streaming. This example demonstrates a simple Spark Streaming service backed by Kafka, and uses custom Mill tasks to run administrative Kafka commands, as sketched below.
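For context, a custom Mill task wrapping a Kafka CLI call might look roughly like the following sketch. The module layout, version numbers, service name, and the `createTopic` command are illustrative assumptions, not the exact code in this PR:

```scala
// build.mill — illustrative sketch only; versions and names are assumptions.
package build
import mill._, scalalib._

object `package` extends RootModule with ScalaModule {
  def scalaVersion = "2.12.19"
  def ivyDeps = Agg(
    ivy"org.apache.spark::spark-sql:3.5.1",
    ivy"org.apache.spark::spark-sql-kafka-0-10:3.5.1"
  )

  // Custom Mill command wrapping the Kafka CLI. It assumes the broker from
  // the example's docker-compose file is running as the "kafka" service
  // and listening on localhost:9092.
  def createTopic() = Task.Command {
    os.proc(
      "docker", "compose", "exec", "kafka",
      "kafka-topics", "--create",
      "--topic", "events",
      "--bootstrap-server", "localhost:9092"
    ).call(stdout = os.Inherit)
    ()
  }
}
```

Running `./mill createTopic` then shells out to the broker container, so topic administration lives next to the build definition instead of in ad-hoc shell scripts.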
4-hello-delta
5-hello-iceberg
Delta Lake and Apache Iceberg are open-source storage layers that bring features like ACID transactions and schema evolution to data lakes used with Spark, making them more reliable and manageable for large-scale analytics. Delta Lake has very high adoption and is the default within the Databricks community, while Apache Iceberg's adoption is growing rapidly. These two are the most commonly used storage layers for Spark in real-world practice; the Delta side is sketched below.
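For readers unfamiliar with Delta Lake, the core of such an example is registering Delta's SQL extension and catalog on the `SparkSession` and then reading and writing with `format("delta")`. A minimal sketch, where the version pins, output path, and sample data are assumptions rather than code from this PR:

```scala
// Illustrative sketch of a "hello delta" program; not the PR's exact code.
import org.apache.spark.sql.SparkSession

object HelloDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hello-delta")
      .master("local[*]")
      // Register Delta's SQL extensions and catalog so Spark understands
      // the "delta" format and Delta-specific DDL.
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
              "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    import spark.implicits._

    // Writing with format("delta") is what gives the table its ACID semantics.
    Seq((1, "a"), (2, "b")).toDF("id", "value")
      .write.format("delta").mode("overwrite").save("/tmp/hello-delta")

    spark.read.format("delta").load("/tmp/hello-delta").show()
    spark.stop()
  }
}
```

An Iceberg example follows the same shape, swapping in Iceberg's catalog configuration and `format("iceberg")`.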
6-hello-mllib
7-hello-pyspark-mllib
Spark MLlib, Spark's scalable machine learning library, is designed to run efficiently on Spark's distributed computing framework and provides tools for common machine learning tasks such as regression and classification. Examples are provided in both Scala and Python; a rough Scala sketch follows.
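As a rough illustration of what these examples cover, a minimal MLlib pipeline in Scala looks like the sketch below; the training data and column names are made up for illustration:

```scala
// Illustrative MLlib sketch; data and names are invented, not from this PR.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object HelloMllib {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hello-mllib")
      .master("local[*]")
      .getOrCreate()

    // MLlib estimators expect a "features" vector column and a "label" column.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Fit a logistic regression model and inspect its predictions.
    val model = new LogisticRegression().setMaxIter(10).fit(training)
    model.transform(training).select("label", "prediction").show()
    spark.stop()
  }
}
```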
Resolves issue #4592