All notable changes to this project will be documented in this file.
- Add conversion webhook (#656).
- Support objectOverrides using `.spec.objectOverrides` on the `SparkConnectServer` and `SparkHistoryServer`. See the objectOverrides concepts page for details (#640).
- Support for Spark `4.1.1` (#642).
- Add `SparkApplication.spec.job.retryOnFailureCount` field with a default of `0`. This has the effect that applications where the `spark-submit` Pod fails are not resubmitted. Previously, Jobs were retried at most 6 times by default (#647).
- Support for Spark `3.5.8` (#650).
- First-class support for S3 on Spark Connect clusters (#652).
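As a sketch of the new retry field: only the `spec.job.retryOnFailureCount` path and its default of `0` come from the entry above; the surrounding manifest names and the value are illustrative assumptions.

```yaml
# Illustrative SparkApplication fragment for retryOnFailureCount (#647).
# metadata and the value 3 are example assumptions, not from the changelog.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-app
spec:
  job:
    retryOnFailureCount: 3  # default 0: failed spark-submit Pods are not resubmitted
```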
- Spark applications can now have templates that are merged into the application manifest before reconciliation. This allows users with many applications to factor common configuration out into a central place and reduce duplication (#660).
- Spark applications now correctly handle the case where both the History Server and the S3 connection use the same TLS secret class (#655). Previously, the Spark application pods contained the same TLS volume twice, which could not be applied to the API server.
- The spark-submit job now sets the correct `-Djavax.net.ssl.trustStore` properties (#655).
- Spark application jobs can now have pod/node affinities. This was an omission, as the application driver and executors have had this field for a long time (#664).
- Fix "404 page not found" error for the initial object list (#666).
- Bump stackable-operator to 0.108.0, snafu to 0.9, strum to 0.28 (#663, #666).
- Gracefully shut down all concurrent tasks by forwarding the SIGTERM signal (#651).
- Remove the Spark application owner reference from the executor pods. This allows Kubernetes to garbage collect them early when the driver or the submit job fail (#648).
- Clean up driver pods when the spark application is finished. Previously, driver pods created by the submit job would be left hanging even after the job has been deleted (#649).
- Add end-of-support checker which can be controlled with environment variables and CLI arguments (#615).
  - `EOS_CHECK_MODE` (`--eos-check-mode`) to set the EoS check mode. Currently, only "offline" is supported.
  - `EOS_INTERVAL` (`--eos-interval`) to set the interval in which the operator checks if it is EoS.
  - `EOS_DISABLED` (`--eos-disabled`) to disable the EoS checker completely.
- Add experimental support for Spark 4 (#589).
- Helm: Allow Pod `priorityClassName` to be configured (#608).
- Support for Spark 3.5.7 (#610).
- Add metrics service with `prometheus.io/path|port|scheme` annotations for the Spark history server (#619).
- Add metrics service with `prometheus.io/path|port|scheme` annotations for Spark Connect (#619).
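A minimal sketch of such a metrics Service; only the three `prometheus.io/*` annotation keys come from the entries above, while the name, selector, and all values are assumptions for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: spark-history-metrics       # assumed name
  annotations:
    prometheus.io/path: /metrics    # assumed value
    prometheus.io/port: "18081"     # assumed value
    prometheus.io/scheme: http      # assumed value
spec:
  selector:
    app.kubernetes.io/name: spark-history  # assumed selector
  ports:
    - port: 18081
```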
- `SparkConnectServer`: The `imagePullSecret` is now correctly passed to Spark executor pods (#603).
- Previously, a bug could lead to missing certificates when you specified multiple CAs in your SecretClass (#611). We now correctly handle multiple certificates in this case. See the GitHub issue for details.
- The service account of Spark applications can now be overridden with pod overrides (#617). Previously, the application service account was passed as a command line argument to spark-submit and thus could not be overwritten with pod overrides for the driver and executors. This CLI argument has now been moved to the pod templates of the individual roles.
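A possible shape of such an override, assuming `podOverrides` sits at the role level as in other Stackable CRDs; the account name is a hypothetical example.

```yaml
# Sketch of overriding the service account via pod overrides (#617).
# "my-custom-sa" and the exact nesting are assumptions for illustration.
spec:
  driver:
    podOverrides:
      spec:
        serviceAccountName: my-custom-sa
  executor:
    podOverrides:
      spec:
        serviceAccountName: my-custom-sa
```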
- Support for Spark version 3.5.5 has been dropped (#610).
- Bump stackable-operator to `0.100.1` and product-config to `0.8.0` (#622).
- Bump testing-tools to `0.3.0-stackable0.0.0-dev` (#638).
- Experimental support for Spark Connect (#539).
- Adds new telemetry CLI arguments and environment variables (#560).
  - Use `--file-log-max-files` (or `FILE_LOG_MAX_FILES`) to limit the number of log files kept.
  - Use `--file-log-rotation-period` (or `FILE_LOG_ROTATION_PERIOD`) to configure the frequency of rotation.
  - Use `--console-log-format` (or `CONSOLE_LOG_FORMAT`) to set the format to `plain` (default) or `json`.
- Expose history and connect services via listener classes (#562).
- Support for Spark 3.5.6 (#580).
- Add RBAC rule to helm template for automatic cluster domain detection (#592).
- Add `sparkhistory` and `shs` short names for SparkHistoryServer (#592).
- BREAKING: Replace stackable-operator `initialize_logging` with stackable-telemetry `Tracing` (#547, #554, #560).
  - The console log level was set by `SPARK_K8S_OPERATOR_LOG`, and is now set by `CONSOLE_LOG_LEVEL`.
  - The file log level was set by `SPARK_K8S_OPERATOR_LOG`, and is now set by `FILE_LOG_LEVEL`.
  - The file log directory was set by `SPARK_K8S_OPERATOR_LOG_DIRECTORY`, and is now set by `FILE_LOG_DIRECTORY` (or via `--file-log-directory <DIRECTORY>`).
  - Replace stackable-operator `print_startup_string` with `tracing::info!` with fields.
- BREAKING: Inject the vector aggregator address into the vector config using the env var `VECTOR_AGGREGATOR_ADDRESS` instead of having the operator write it to the vector config (#551).
- Document that Spark Connect doesn't integrate with the history server (#559).
- test: Bump to Vector `0.46.1` (#565).
- Use versioned common structs (#572).
- BREAKING: Change the label `app.kubernetes.io/name` for Spark history and connect objects to use `spark-history` and `spark-connect` instead of `spark-k8s` (#573).
- BREAKING: The history Pods now have their own ClusterRole named `spark-history-clusterrole` (#573).
- BREAKING: Previously this operator would hardcode the UID and GID of the Pods being created to 1000/0; this has changed now (#575).
  - The `runAsUser` and `runAsGroup` fields will not be set anymore by the operator.
  - The defaults from the docker images themselves will now apply, which will be different from 1000/0 going forward.
  - This is marked as breaking because tools and policies might exist which require these fields to be set.
- Enable the built-in Prometheus servlet. The jmx exporter was removed in (#584) but added back in (#585).
- BREAKING: Bump stackable-operator to 0.94.0 and update other dependencies (#592).
  - The default Kubernetes cluster domain name is now fetched from the kubelet API unless explicitly configured.
  - This requires operators to have the RBAC permission to get nodes/proxy in the apiGroup "". The helm-chart takes care of this.
  - The CLI argument `--kubernetes-node-name` or env variable `KUBERNETES_NODE_NAME` needs to be set. The helm-chart takes care of this.
- Use `json` file extension for log files (#553).
- The Spark Connect controller now watches StatefulSets instead of Deployments (again) (#573).
- BREAKING: Move `listenerClass` to `roleConfig` for Spark History Server and Spark Connect. Service names changed (#588).
- Allow uppercase characters in domain names (#592).
- Support for Spark version 3.5.2 has been dropped (#570).
- The integration test spark-pi-public-s3 was removed because the AWS SDK >2.24 doesn't support anonymous S3 access anymore (#574).
- Remove the `lastUpdateTime` field from the stacklet status (#592).
- Remove role binding to legacy service accounts (#592).
- The lifetime of auto-generated TLS certificates is now configurable with the role and roleGroup config property `requestedSecretLifetime`. This helps reduce frequent Pod restarts (#501).
- Run a `containerdebug` process in the background of each Spark container to collect debugging information (#508).
- Aggregate emitted Kubernetes events on the CustomResources (#515).
- Support configuring JVM arguments (#532).
- Support for S3 region (#528).
- Default to OCI for image metadata and product image selection (#514).
- Update tests and docs to Spark version 3.5.5 (#534).
- Make spark-env.sh configurable via `configOverrides` (#473).
- The Spark history server can now service logs from HDFS-compatible systems (#479).
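A hedged sketch of what a `configOverrides` entry targeting spark-env.sh could look like, assuming the usual Stackable file-name-to-key/value layout; the variable and its value are illustrative assumptions.

```yaml
# Hypothetical configOverrides fragment for spark-env.sh (#473).
# HADOOP_CONF_DIR and its path are example assumptions.
configOverrides:
  spark-env.sh:
    HADOOP_CONF_DIR: /stackable/hadoop/conf
```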
- The operator can now run on Kubernetes clusters using a non-default cluster domain. Use the env var `KUBERNETES_CLUSTER_DOMAIN` or the operator Helm chart property `kubernetesClusterDomain` to set a non-default cluster domain (#480).
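For example, in the operator's Helm values; the property name comes from the entry above, while the domain itself is an example.

```yaml
# values.yaml fragment for a non-default cluster domain (#480).
kubernetesClusterDomain: my-cluster.local  # example domain
```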
- Reduce CRD size from `1.2MB` to `103KB` by accepting arbitrary YAML input instead of the underlying schema for the following fields (#450):
  - `podOverrides`
  - `affinity`
  - `volumes`
  - `volumeMounts`
- Update tests and docs to Spark version 3.5.2 (#459).
- BREAKING: The fields `connection` and `host` on `S3Connection` as well as `bucketName` on `S3Bucket` are now mandatory (#472).
- Fix `envOverrides` for SparkApplication and SparkHistoryServer (#451).
- Ensure SparkApplications can only create a single submit Job. Fix for #457 (#460).
- Invalid `SparkApplication`/`SparkHistoryServer` objects don't cause the operator to stop functioning (#482).
- Support for Spark versions 3.4.2 and 3.4.3 has been dropped (#459).
- BREAKING (behaviour): Specified CPU resources are now applied correctly instead of being rounded up to the next whole number. This might affect your jobs: they may now, for example, only have 200m CPU resources requested instead of the 1000m they had so far, meaning they might slow down significantly (#408).
- Fixed processing of corrupted log events; if errors occur, the error messages are added to the log event (#412).
- Helm: support labels in values.yaml (#344).
- Support version `3.5.1` (#373).
- Support version `3.4.2` (#357).
- `spec.job.config.volumeMounts` property to easily mount volumes on the job pod (#359).
- Various documentation improvements to the CRD (#319).
- [BREAKING] Removed the version field. Several attributes have been changed to mandatory. While this change is technically breaking, existing Spark jobs would not have worked before, as these attributes were necessary (#319).
- [BREAKING] Remove `userClassPathFirst` properties from `spark-submit`. This is an experimental feature that was introduced to support logging in XML format. The side effect of this removal is that the vector agent cannot aggregate output from the `spark-submit` containers. On the other hand, it enables dynamic provisioning of Java packages (such as Delta Lake) with Stackable stock images, which is much more important (#355).
- Add missing `deletecollection` RBAC permission for Spark drivers. Previously this caused confusing error messages in the Spark driver log (`User "system:serviceaccount:default:my-spark-app" cannot deletecollection resource "configmaps" in API group "" in the namespace "default".`) (#313).
- Default stackableVersion to operator version. It is recommended to remove `spec.image.stackableVersion` from your custom resources (#267, #268).
- Configuration overrides for the JVM security properties, such as DNS caching (#272).
- Support PodDisruptionBudgets for HistoryServer (#288).
- Support for versions 3.4.1, 3.5.0 (#291).
- History server now exports metrics via jmx exporter (port 18081) (#291).
- Document graceful shutdown (#306).
- `vector` `0.26.0` -> `0.33.0` (#269, #291).
- `operator-rs` `0.44.0` -> `0.55.0` (#267, #275, #288, #291).
- Removed usages of `SPARK_DAEMON_JAVA_OPTS` since it's not a reliable way to pass extra JVM options (#272).
- [BREAKING] Use product image selection instead of version (#275).
- [BREAKING] Refactored application roles to use `CommonConfiguration` structures from the operator framework (#277).
- Let secret-operator handle certificate conversion (#286).
- Extended resource-usage documentation (#297).
- Removed support for versions 3.2.1, 3.3.0 (#291).
- Generate OLM bundle for Release 23.4.0 (#238).
- Add support for Spark 3.4.0 (#243).
- Add support for using custom certificates when accessing S3 with TLS (#247).
- Use bitnami charts for testing S3 access with TLS (#247).
- Set explicit resources on all containers (#249).
- Support pod overrides (#256).
- `operator-rs` `0.38.0` -> `0.44.0` (#235, #259).
- Use 0.0.0-dev product images for testing (#236).
- Use testing-tools 0.2.0 (#236).
- Run as root group (#241).
- Added kuttl test suites (#252).
- Fix quoting issues when spark config values contain spaces (#243).
- Increase the size limit of log volumes (#259).
- Typo in executor cpu limit property (#263).
- [BREAKING] Support specifying Service type for HistoryServer. This enables us to later switch, in a non-breaking way, to using `ListenerClasses` for the exposure of Services. This change is breaking because, for security reasons, we default to the `cluster-internal` ListenerClass. If you need your cluster to be accessible from outside of Kubernetes, you need to set `clusterConfig.listenerClass` to `external-unstable` or `external-stable` (#228).
- [BREAKING] Dropped support for the old `spec.{driver,executor}.nodeSelector` field. Use `spec.{driver,executor}.affinity.nodeSelector` instead (#217)
- Revert openshift settings (#207)
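A sketch of the listener-class setting described in #228; only the `clusterConfig.listenerClass` path and the class names come from the entry, the rest of the manifest is assumed context.

```yaml
# Exposing the HistoryServer outside Kubernetes (#228).
spec:
  clusterConfig:
    listenerClass: external-unstable  # default: cluster-internal
```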
- BUGFIX: assign service account to history pods (#207)
- Merging and validation of the configuration refactored (#223)
- `operator-rs` `0.36.0` → `0.38.0` (#223)
- Create and manage history servers (#187)
- Updated stackable image versions (#176)
- `operator-rs` `0.22.0` → `0.27.1` (#178)
- `operator-rs` `0.27.1` -> `0.30.2` (#187)
- Don't run init container as root and avoid chmod and chowning (#183)
- [BREAKING] Implement fix for S3 reference inconsistency as described in the issue #162 (#187)
- Bumped image to `3.3.0-stackable0.2.0` in tests and docs (#145)
- BREAKING: use resource limit struct instead of passing spark configuration arguments (#147)
- Fixed resources test (#151)
- Fixed inconsistencies with resources usage (#166)
- Add Getting Started documentation (#114).
- Add missing role to read S3Connection and S3Bucket objects (#112).
- Update annotation due to update to rust version (#114).
- Update RBAC properties for OpenShift compatibility (#126).
- Include chart name when installing with a custom release name (#97)
- Pinned MinIO version for tests (#100)
- `operator-rs` `0.21.0` → `0.22.0` (#102).
- Added owner-reference to pod templates (#104)
- Added kuttl test for the case when pyspark jobs are provisioned using the `image` property of the `SparkApplication` definition (#107)
- BREAKING: Use current S3 connection/bucket structs (#86)
- Add node selector to top-level job and specify node selection in PVC-relevant tests (#90)
- Update kuttl tests to use Spark 3.3.0 (#91)
- Bugfix for duplicate volume mounts in PySpark jobs (#92)
- Added new fields to govern image pull policy (#75)
- New `nodeSelector` fields for both the driver and the executors (#76)
- Mirror driver pod status to the corresponding spark application (#77)
- Updated examples (#71)