The Spark overlay is a Gentoo ebuild repository that offers Gentoo packages supporting big data infrastructures based on the Java platform. It was originally created for distributing Apache Spark to Gentoo users (hence the name "Spark overlay"), but it has since been expanded to include packages for H2O and for Kotlin core libraries that can be built from source.
`dev-java/spark-core`
: The core package for Spark

`dev-java/spark-demo`
: A demo program for the Spark packages in this overlay, which can be run with the command `spark-demo-2.12`
Note: Some dependencies of Spark require Kotlin compiler 1.4. When pulled in as a dependency, Kotlin compiler 1.4 might not be fully automatically installable by `emerge`. If `emerge` reports any error pertaining to Kotlin packages, please install Kotlin compiler 1.4 separately before retrying:

    # emerge --ask --oneshot dev-lang/kotlin-bin:1.4
`dev-java/h2o`
: The meta package for H2O
  - Most H2O sub-packages are slotted based on the major version of H2O (e.g. 3.32, 3.34), so it is possible to install multiple major versions of H2O in parallel (see the example below).
  - The name of H2O's executable for a slot is `h2o-${SLOT}`; for example, the executable for H2O 3.32.x.y is `h2o-3.32`.

`dev-python/h2o-py`
: The Python module for H2O

`dev-java/h2o-flow`
: The web-based interactive computational environment for H2O
  - To use H2O Flow, please enable the `flow` USE flag of `dev-java/h2o`.

For information regarding the Kotlin packages, please consult the relevant page on the Gentoo Wiki.
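As a concrete illustration of the package list above, here is a hedged example session, assuming the overlay has already been added to the system (as described in the steps below). The slot `3.32` and the file name under `package.use` are assumptions chosen for illustration; adjust them to what is actually available in the overlay:

```bash
# Install the Spark core package and the demo program (as root):
emerge --ask dev-java/spark-core dev-java/spark-demo

# Run the demo program:
spark-demo-2.12

# Enable the flow USE flag so dev-java/h2o provides H2O Flow
# (the file name under package.use is just a conventional choice):
echo "dev-java/h2o flow" >> /etc/portage/package.use/h2o

# Install one H2O slot; slots follow H2O's major version:
emerge --ask dev-java/h2o:3.32

# Launch that slot's executable, named h2o-${SLOT}:
h2o-3.32
```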
The Spark overlay needs to be added to the system before the packages in it can be installed. This can be done with the following steps:
- Because the Spark overlay is offered as a Git repository, Git must be installed before the repository's contents can be downloaded to the system. The following command ensures that Git is installed on the system:

      # emerge --ask --noreplace dev-vcs/git

- Add the Spark overlay to the system with either of the following methods:

  - Enable the Spark overlay with `eselect-repository`:

        # emerge --ask --noreplace app-eselect/eselect-repository
        # eselect repository enable spark-overlay

  - Manually add the repository definition to `/etc/portage/repos.conf`:

        [spark-overlay]
        location = /var/db/repos/spark-overlay
        sync-type = git
        sync-uri = https://github.com/6-6-6/spark-overlay.git

- Download the contents of the Spark overlay to the system:

      # emerge --sync spark-overlay
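At this point, it may be worth confirming that the overlay is visible to Portage before installing anything from it. A short sketch, assuming `eselect-repository` was used above; both commands are standard tools, but the exact output will vary by system:

```bash
# List installed repositories; spark-overlay should appear as enabled:
eselect repository list -i

# Ask Portage where the overlay's files ended up (should print
# /var/db/repos/spark-overlay with the configuration shown above):
portageq get_repo_path / spark-overlay
```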
The `tests` directory under the Spark overlay's Git tree contains scripts and test cases for installation tests of the ebuilds in the Spark overlay. These tests are run automatically by a GitHub Actions workflow once a day to capture issues that prevent ebuilds in the Spark overlay from being installed, whether caused by problems in those ebuilds themselves or by changes in the official Gentoo ebuild repository (like removal of packages needed by ebuilds in the Spark overlay).
The tests can be run in a local environment to replicate the jobs executed by GitHub Actions:
- Install ebuild-commander -- the tool used to run the installation test cases -- in the local environment. Please make sure every dependency required by ebuild-commander is installed too: some of the dependencies are required only at runtime, so a missing runtime dependency will not cause any error during the installation process.
- Change the working directory to the top directory of the Spark overlay, then:
  - Run `tests/run.sh` to run all test cases stored under `tests/test-cases`.
  - Run `tests/run.sh TESTCASE...` to run one or more specific `TESTCASE`s. For example:
    - `tests/run.sh tests/test-cases/h2o.sh` runs the test case for H2O packages.
    - `tests/run.sh tests/test-cases/kotlin-latest.sh tests/test-cases/h2o.sh` runs the test case for Kotlin packages for the latest feature release and then the test case for H2O packages.
- Some test cases can take hours to run.
- The `tests/run.sh` script may indirectly invoke Docker via ebuild-commander, which might require `tests/run.sh` to be run as `root` and ebuild-commander to be installed globally for all users.
Test cases are stored in the `tests/test-cases` directory under the Spark overlay's Git tree. A test case's format is similar to an ebuild:
- A test case is written in Bash syntax, which allows `tests/run.sh` to `source` it.
- A test case needs to have a `run_test` Bash function whose body contains the list of commands to run in the Docker container created by ebuild-commander for the test case.
  - Please note that ebuild-commander executes each command in the `run_test` function in a separate shell, even if the commands are separated by a semicolon instead of a newline. This means that in `run_test`, variable declarations, `if` statements, `for` loops and so on do not work as expected. There are some workarounds (see the sketch after this list):
    - Put any parts of the test case involving variable declarations and control flow into standalone scripts, and call those scripts in `run_test`. The Spark overlay's Git tree will be available at `/var/db/repos/spark-overlay` in the Docker container created for the test.
    - Write variable values to files, and read those files to retrieve the values later.
- A test case may define additional Bash variables recognized by `tests/run.sh` to control command-line options of ebuild-commander.
  - To find out which variables are supported, please refer to the content of `tests/run.sh`.
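To make the format concrete, below is a minimal sketch of what a test case might look like. It is illustrative only: the helper script `my-setup.sh`, the temporary file path, and the choice of package atom are hypothetical, not files that actually exist in the overlay.

```bash
# Hypothetical test case, e.g. tests/test-cases/example.sh
# (an illustrative sketch, not an actual test case from the overlay)

# A test case may also define Bash variables that tests/run.sh
# recognizes to adjust ebuild-commander's command-line options;
# see tests/run.sh itself for the supported names.

run_test() {
    # ebuild-commander runs each command below in its own shell, so
    # variables and control flow would not survive between commands.

    # Workaround 1: keep logic that needs variables or control flow
    # in a standalone script (tests/resources is a good place for it)
    # and call it via its path inside the Docker container.
    /var/db/repos/spark-overlay/tests/resources/my-setup.sh

    # Workaround 2: pass values between commands through files.
    emerge --info > /tmp/emerge-info.txt
    grep '^ACCEPT_KEYWORDS' /tmp/emerge-info.txt

    # The installation test itself: try to install a package from
    # the overlay (hypothetical choice of atom).
    emerge --verbose dev-java/spark-core
}
```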
Besides `run.sh` and `test-cases`, the `tests` directory contains some other sub-directories for files supporting test cases:

`portage-configs`
: Portage configurations for test cases

  `default`
  : The default configuration, used by all test cases

  `unstable`
  : A configuration that accepts the `~arch` keyword for all ebuilds, including those in `::gentoo`
    - This configuration is enabled automatically by `tests/run.sh` when the `UNSTABLE` environment variable's value is not empty (see the example after this list).

  `features-test`
  : A configuration that sets `FEATURES="test"` for ebuilds in the Spark overlay

  `binary`
  : A configuration that sets `USE="binary"` for ebuilds in the Spark overlay

`resources`
: Scripts, programs and other miscellaneous files used by test cases
  - This is the best location to place any scripts that need to be called in the `run_test` function for a test case. The path to this directory in the Docker container is `/var/db/repos/spark-overlay/tests/resources`.
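For example, building on the description of the unstable configuration above, a single run against `~arch` ebuilds could look like this (the test case path is reused from the examples earlier in this section):

```bash
# tests/run.sh enables the "unstable" Portage configuration whenever
# UNSTABLE has a non-empty value:
UNSTABLE=1 tests/run.sh tests/test-cases/h2o.sh
```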
The Spark overlay could still use some improvements. Any contributions that help resolve the following issues are welcome!
- Update Apache Spark packages to the latest upstream releases
- Update `dev-java/hadoop-*` and `dev-java/netty-*` packages to the latest upstream releases that do not have known security vulnerabilities
- Expand the family of H2O packages by adding packages for the H2O XGBoost and AutoML extensions
- Improve Kotlin packages and eclasses
- Fix other documented issues in the issue tracker