Changelog

All notable changes to this project will be documented in this file.

[Unreleased]

[26.3.0] - 2026-03-16

[26.3.0-rc1] - 2026-03-16

Added

  • Add conversion webhook (#656).
  • Support objectOverrides using .spec.objectOverrides on the SparkConnectServer and SparkHistoryServer. See objectOverrides concepts page for details (#640).
  • Support for Spark 4.1.1 (#642).
  • Add SparkApplication.spec.job.retryOnFailureCount field with a default of 0. This has the effect that applications where the spark-submit Pod fails are not resubmitted. Previously, Jobs were retried at most 6 times by default (#647).
  • Support for Spark 3.5.8 (#650).
  • First class support for S3 on Spark connect clusters (#652).
  • Spark applications can now have templates that are merged into the application manifest before reconciliation. This allows users with many applications to factor out common configuration into a central place and reduce duplication (#660).

Fixed

  • Spark applications now correctly handle the case where both the History Server and the S3 connection use the same TLS secret class (#655). Previously, the Spark application pods contained the same TLS volume twice, which could not be applied to the API server.
  • The spark-submit job now sets the correct -Djavax.net.ssl.trustStore properties (#655).
  • Spark application jobs can now have pod/node affinities. This was an omission, as the application driver and executors have had this field for a long time (#664).
  • Fix "404 page not found" error for the initial object list (#666).

Changed

  • Bump stackable-operator to 0.108.0, snafu to 0.9, strum to 0.28 (#663, #666).
  • Gracefully shut down all concurrent tasks by forwarding the SIGTERM signal (#651).
  • Remove the Spark application owner reference from the executor pods. This allows Kubernetes to garbage collect them early when the driver or the submit job fail (#648).
  • Clean up driver pods when the spark application is finished. Previously, driver pods created by the submit job would be left hanging even after the job has been deleted (#649).

Removed

  • Support for Spark 3.5.6 (#642).
  • Deprecated support for Spark 3.5.7 (#650).

[25.11.0] - 2025-11-07

[25.11.0-rc1] - 2025-11-06

Added

  • Add end-of-support checker which can be controlled with environment variables and CLI arguments (#615).
    • EOS_CHECK_MODE (--eos-check-mode) to set the EoS check mode. Currently, only "offline" is supported.
    • EOS_INTERVAL (--eos-interval) to set the interval in which the operator checks if it is EoS.
    • EOS_DISABLED (--eos-disabled) to disable the EoS checker completely.
  • Add experimental support for Spark 4 (#589)
  • Helm: Allow Pod priorityClassName to be configured (#608).
  • Support for Spark 3.5.7 (#610).
  • Add metrics service with prometheus.io/path|port|scheme annotations for spark history server (#619).
  • Add metrics service with prometheus.io/path|port|scheme annotations for spark connect (#619).

Fixed

  • SparkConnectServer: The imagePullSecret is now correctly passed to Spark executor pods (#603).

  • Previously we had a bug that could lead to missing certificates (#611).

    This could be the case when you specified multiple CAs in your SecretClass. We now correctly handle multiple certificates in these cases. See this GitHub issue for details.

  • The service account of spark applications can now be overridden with pod overrides (#617).

    Previously, the application service account was passed as a command line argument to spark-submit and therefore could not be overridden with pod overrides for the driver and executors. This CLI argument has now been moved to the pod templates of the individual roles.

Removed

  • Support for Spark version 3.5.5 has been dropped (#610).

Changed

  • Bump stackable-operator to 0.100.1 and product-config to 0.8.0 (#622).
  • Bump testing-tools to 0.3.0-stackable0.0.0-dev (#638).

[25.7.0] - 2025-07-23

[25.7.0-rc1] - 2025-07-18

Added

  • Experimental support for Spark Connect (#539).
  • Adds new telemetry CLI arguments and environment variables (#560).
    • Use --file-log-max-files (or FILE_LOG_MAX_FILES) to limit the number of log files kept.
    • Use --file-log-rotation-period (or FILE_LOG_ROTATION_PERIOD) to configure the frequency of rotation.
    • Use --console-log-format (or CONSOLE_LOG_FORMAT) to set the format to plain (default) or json.
  • Expose history and connect services via listener classes (#562).
  • Support for Spark 3.5.6 (#580).
  • Add RBAC rule to helm template for automatic cluster domain detection (#592).
  • Add sparkhistory and shs shortnames for SparkHistoryServer (#592).

Changed

  • BREAKING: Replace stackable-operator initialize_logging with stackable-telemetry Tracing (#547, #554, #560).
    • The console log level was set by SPARK_K8S_OPERATOR_LOG, and is now set by CONSOLE_LOG_LEVEL.
    • The file log level was set by SPARK_K8S_OPERATOR_LOG, and is now set by FILE_LOG_LEVEL.
    • The file log directory was set by SPARK_K8S_OPERATOR_LOG_DIRECTORY, and is now set by FILE_LOG_DIRECTORY (or via --file-log-directory <DIRECTORY>).
    • Replace stackable-operator print_startup_string with tracing::info! with fields.
  • BREAKING: Inject the vector aggregator address into the vector config using the env var VECTOR_AGGREGATOR_ADDRESS instead of having the operator write it to the vector config (#551).
  • Document that Spark Connect doesn't integrate with the history server (#559)
  • test: Bump to Vector 0.46.1 (#565).
  • Use versioned common structs (#572).
  • BREAKING: Change the label app.kubernetes.io/name for Spark history and connect objects to use spark-history and spark-connect instead of spark-k8s (#573).
  • BREAKING: The history Pods now have their own ClusterRole named spark-history-clusterrole (#573).
  • BREAKING: Previously, this operator hardcoded the UID and GID of the Pods being created to 1000/0. This is no longer the case (#575)
    • The runAsUser and runAsGroup fields will not be set anymore by the operator
    • The defaults from the docker images themselves will now apply, which will be different from 1000/0 going forward
    • This is marked as breaking because tools and policies might exist, which require these fields to be set
  • Enable the built-in Prometheus servlet. The jmx exporter was removed in (#584) but added back in (#585).
  • BREAKING: Bump stackable-operator to 0.94.0 and update other dependencies (#592).
    • The default Kubernetes cluster domain name is now fetched from the kubelet API unless explicitly configured.
    • This requires operators to have the RBAC permission to get nodes/proxy in the apiGroup "". The helm-chart takes care of this.
    • The CLI argument --kubernetes-node-name or env variable KUBERNETES_NODE_NAME needs to be set. The helm-chart takes care of this.
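
For deployments that still export the old logging variables, the renames listed in the stackable-telemetry entry above can be sketched as a small migration helper. This is an illustrative Python sketch, not part of the operator: the function name `migrate_logging_env` is hypothetical, and it assumes that the old values can be carried over unchanged to the new variables, which the entries above do not confirm.

```python
# Old variable -> new variable(s), taken from the 25.7.0 "Changed" entries.
LOGGING_ENV_RENAMES = {
    "SPARK_K8S_OPERATOR_LOG": ["CONSOLE_LOG_LEVEL", "FILE_LOG_LEVEL"],
    "SPARK_K8S_OPERATOR_LOG_DIRECTORY": ["FILE_LOG_DIRECTORY"],
}


def migrate_logging_env(env):
    """Return a copy of env with the legacy logging variables renamed.

    Assumption: values carry over unchanged. New variables that are
    already set explicitly take precedence over migrated values.
    """
    # Keep everything except the legacy names.
    migrated = {k: v for k, v in env.items() if k not in LOGGING_ENV_RENAMES}
    for old_name, new_names in LOGGING_ENV_RENAMES.items():
        if old_name in env:
            for new_name in new_names:
                migrated.setdefault(new_name, env[old_name])
    return migrated


old_env = {
    "SPARK_K8S_OPERATOR_LOG": "debug",
    "SPARK_K8S_OPERATOR_LOG_DIRECTORY": "/stackable/log",  # example path only
}
new_env = migrate_logging_env(old_env)
```

The single old log level variable fans out into separate console and file levels, which is why the mapping is one-to-many.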

Fixed

  • Use json file extension for log files (#553).
  • The Spark connect controller now watches StatefulSets instead of Deployments (again) (#573).
  • BREAKING: Move listenerClass to roleConfig for Spark History Server and Spark Connect. Service names changed. (#588).
  • Allow uppercase characters in domain names (#592).

Removed

  • Support for Spark version 3.5.2 has been dropped (#570).
  • Integration test spark-pi-public-s3 because the AWS SDK >2.24 doesn't support anonymous S3 access anymore (#574).
  • Remove the lastUpdateTime field from the stacklet status (#592).
  • Remove role binding to legacy service accounts (#592).

[25.3.0] - 2025-03-21

Added

  • The lifetime of auto generated TLS certificates is now configurable with the role and roleGroup config property requestedSecretLifetime. This helps reduce frequent Pod restarts (#501).
  • Run a containerdebug process in the background of each Spark container to collect debugging information (#508).
  • Aggregate emitted Kubernetes events on the CustomResources (#515).
  • Support configuring JVM arguments (#532).
  • Support for S3 region (#528).

Changed

  • Default to OCI for image metadata and product image selection (#514).
  • Update tests and docs to Spark version 3.5.5 (#534)

[24.11.1] - 2025-01-10

[24.11.0] - 2024-11-18

Added

  • Make spark-env.sh configurable via configOverrides (#473).
  • The Spark history server can now serve logs from HDFS compatible systems (#479).
  • The operator can now run on Kubernetes clusters using a non-default cluster domain. Use the env var KUBERNETES_CLUSTER_DOMAIN or the operator Helm chart property kubernetesClusterDomain to set a non-default cluster domain (#480).

Changed

  • Reduce CRD size from 1.2MB to 103KB by accepting arbitrary YAML input instead of the underlying schema for the following fields (#450):
    • podOverrides
    • affinity
    • volumes
    • volumeMounts
  • Update tests and docs to Spark version 3.5.2 (#459)

Fixed

  • BREAKING: The fields connection and host on S3Connection as well as bucketName on S3Bucket are now mandatory (#472).
  • Fix envOverrides for SparkApplication and SparkHistoryServer (#451).
  • Ensure SparkApplications can only create a single submit Job. Fix for #457 (#460).
  • Invalid SparkApplication/SparkHistoryServer objects no longer cause the operator to stop functioning (#482).

Removed

  • Support for Spark versions 3.4.2 and 3.4.3 has been dropped (#459).

[24.7.0] - 2024-07-24

Changed

  • Bump stackable-operator to 0.70.0, product-config to 0.7.0, and other dependencies (#401, #425).

Fixed

  • BREAKING (behaviour): Specified CPU resources are now applied correctly (instead of being rounded up to the next whole number). This might affect your jobs, as they now, for example, only request 200m CPU instead of the 1000m they requested so far, meaning they might slow down significantly (#408).
  • Fix processing of corrupted log events. If errors occur, the error messages are added to the log event (#412).
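
The effect of the CPU fix above can be illustrated with a little arithmetic. The following Python sketch is not operator code: the helper names `parse_cpu` and `to_millicores` are hypothetical, and it assumes the earlier behaviour was a plain round-up to whole cores, as the entry describes.

```python
import math
from fractions import Fraction


def parse_cpu(quantity: str) -> Fraction:
    """Parse a Kubernetes CPU quantity such as "200m" or "1.5" into cores."""
    if quantity.endswith("m"):
        return Fraction(int(quantity[:-1]), 1000)
    return Fraction(quantity)


def to_millicores(cores: Fraction) -> str:
    """Format a number of cores as a millicore quantity."""
    return f"{int(cores * 1000)}m"


requested = parse_cpu("200m")

# Assumed old behaviour: round up to the next whole core.
old_behaviour = to_millicores(Fraction(math.ceil(requested)))

# New behaviour: apply the value as specified.
new_behaviour = to_millicores(requested)
```

Here `old_behaviour` is `"1000m"` and `new_behaviour` is `"200m"`, matching the figures in the entry. Jobs that relied on the rounded-up value may need their CPU requests raised explicitly.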

[24.3.0] - 2024-03-20

Added

  • Helm: support labels in values.yaml (#344).
  • Support version 3.5.1 (#373).
  • Support version 3.4.2 (#357).
  • spec.job.config.volumeMounts property to easily mount volumes on the job pod (#359)

Changed

  • Various documentation of the CRD (#319).
  • [BREAKING] Removed version field. Several attributes have been changed to mandatory. While this change is technically breaking, existing Spark jobs would not have worked before as these attributes were necessary (#319).
  • [BREAKING] Remove userClassPathFirst properties from spark-submit. This is an experimental feature that was introduced to support logging in XML format. The side effect of this removal is that the vector agent cannot aggregate output from the spark-submit containers. On the other hand, it enables dynamic provisioning of java packages (such as Delta Lake) with Stackable stock images, which is much more important (#355).

Fixed

  • Add missing deletecollection RBAC permission for Spark drivers. Previously this caused confusing error messages in the spark driver log (User "system:serviceaccount:default:my-spark-app" cannot deletecollection resource "configmaps" in API group "" in the namespace "default".) (#313).

[23.11.0] - 2023-11-24

Added

  • Default stackableVersion to operator version. It is recommended to remove spec.image.stackableVersion from your custom resources (#267, #268).
  • Configuration overrides for the JVM security properties, such as DNS caching (#272).
  • Support PodDisruptionBudgets for HistoryServer (#288).
  • Support for versions 3.4.1, 3.5.0 (#291).
  • History server now exports metrics via jmx exporter (port 18081) (#291).
  • Document graceful shutdown (#306).

Changed

  • vector 0.26.0 -> 0.33.0 (#269, #291).
  • operator-rs 0.44.0 -> 0.55.0 (#267, #275, #288, #291).
  • Removed usages of SPARK_DAEMON_JAVA_OPTS since it's not a reliable way to pass extra JVM options (#272).
  • [BREAKING] use product image selection instead of version (#275).
  • [BREAKING] refactored application roles to use CommonConfiguration structures from the operator framework (#277).
  • Let secret-operator handle certificate conversion (#286).
  • Extended resource-usage documentation (#297).

Fixed

  • Dynamic loading of Maven packages (#281).
  • Re-instated driver/executor cores setting (#302).

Removed

  • Removed support for versions 3.2.1, 3.3.0 (#291).

[23.7.0] - 2023-07-14

Added

  • Generate OLM bundle for Release 23.4.0 (#238).
  • Add support for Spark 3.4.0 (#243).
  • Add support for using custom certificates when accessing S3 with TLS (#247).
  • Use bitnami charts for testing S3 access with TLS (#247).
  • Set explicit resources on all containers (#249).
  • Support pod overrides (#256).

Changed

  • operator-rs 0.38.0 -> 0.44.0 (#235, #259).
  • Use 0.0.0-dev product images for testing (#236).
  • Use testing-tools 0.2.0 (#236).
  • Run as root group (#241).
  • Added kuttl test suites (#252).

Fixed

  • Fix quoting issues when spark config values contain spaces (#243).
  • Increase the size limit of log volumes (#259).
  • Typo in executor cpu limit property (#263).

[23.4.0] - 2023-04-17

Added

  • Deploy default and support custom affinities (#217)
  • Log aggregation added (#226).

Changed

  • [BREAKING] Support specifying Service type for HistoryServer. This enables us to later switch non-breaking to using ListenerClasses for the exposure of Services. This change is breaking, because - for security reasons - we default to the cluster-internal ListenerClass. If you need your cluster to be accessible from outside of Kubernetes you need to set clusterConfig.listenerClass to external-unstable or external-stable (#228).
  • [BREAKING]: Dropped support for old spec.{driver,executor}.nodeSelector field. Use spec.{driver,executor}.affinity.nodeSelector instead (#217)
  • Revert openshift settings (#207)
  • BUGFIX: assign service account to history pods (#207)
  • Merging and validation of the configuration refactored (#223)
  • operator-rs 0.36.0 -> 0.38.0 (#223)

[23.1.0] - 2023-01-23

Added

  • Create and manage history servers (#187)

Changed

  • Updated stackable image versions (#176)
  • operator-rs 0.22.0 -> 0.27.1 (#178)
  • operator-rs 0.27.1 -> 0.30.2 (#187)
  • Don't run init container as root and avoid chmod and chowning (#183)
  • [BREAKING] Implement fix for S3 reference inconsistency as described in the issue #162 (#187)

[0.6.0] - 2022-11-07

Changed

  • Bumped image to 3.3.0-stackable0.2.0 in tests and docs (#145)
  • BREAKING: use resource limit struct instead of passing spark configuration arguments (#147)
  • Fixed resources test (#151)
  • Fixed inconsistencies with resources usage (#166)

[0.5.0] - 2022-09-06

Added

  • Add Getting Started documentation (#114).

Fixed

  • Add missing role to read S3Connection and S3Bucket objects (#112).
  • Update annotation due to update to rust version (#114).
  • Update RBAC properties for OpenShift compatibility (#126).

[0.4.0] - 2022-08-03

Changed

  • Include chart name when installing with a custom release name (#97)
  • Pinned MinIO version for tests (#100)
  • operator-rs 0.21.0 -> 0.22.0 (#102).
  • Added owner-reference to pod templates (#104)
  • Added kuttl test for the case when pyspark jobs are provisioned using the image property of the SparkApplication definition (#107)

[0.3.0] - 2022-06-30

Added

Changed

  • BREAKING: Use current S3 connection/bucket structs (#86)
  • Add node selector to top-level job and specify node selection in PVC-relevant tests (#90)
  • Update kuttl tests to use Spark 3.3.0 (#91)
  • Bugfix for duplicate volume mounts in PySpark jobs (#92)

[0.2.0] - 2022-06-21

Added

  • Added new fields to govern image pull policy (#75)
  • New nodeSelector fields for both the driver and the executors (#76)
  • Mirror driver pod status to the corresponding spark application (#77)

Changed

  • Updated examples (#71)

[0.1.0] - 2022-05-05

Added

  • Initial commit
  • ServiceAccount, ClusterRole and RoleBinding for Spark driver (#39)
  • S3 credentials can be provided via a Secret (#42)
  • Job information can be passed via a configuration map (#50)
  • Update S3 bucket specification to conform to the corresponding ADR (#55)