adam-stasiak/dataproc-initialization-actions

Cloud Dataproc Initialization Actions

When creating a Google Cloud Dataproc cluster, you can specify initialization actions: executables or scripts that Cloud Dataproc will run on all nodes of your cluster immediately after the cluster is set up.

How initialization actions are used

Initialization actions are stored in a Google Cloud Storage bucket and can be passed as a parameter to the gcloud command or the clusters.create API when creating a Cloud Dataproc cluster. For example, to specify an initialization action when creating a cluster with the gcloud command, you can run:

gcloud dataproc clusters create CLUSTER-NAME \
  [--initialization-actions [GCS_URI,...]] \
  [--initialization-action-timeout TIMEOUT]

For convenience, copies of initialization actions in this repository are stored in the following Cloud Storage bucket, which is publicly accessible:

gs://dataproc-initialization-actions

The folder structure of this Cloud Storage bucket mirrors this repository. You should be able to use this Cloud Storage bucket (and the initialization scripts within it) for your clusters.
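As an illustrative sketch, the snippet below assembles a concrete invocation of the command shown above. The cluster name is a placeholder, and the Zeppelin script path is an assumption based on this repository's folder layout being mirrored in the public bucket:

```shell
# Build a concrete example of the "clusters create" command shown above.
# "my-cluster" is a placeholder name; the Zeppelin path assumes the public
# bucket mirrors this repository's folder structure.
CLUSTER_NAME="my-cluster"
INIT_ACTION="gs://dataproc-initialization-actions/zeppelin/zeppelin.sh"
CMD="gcloud dataproc clusters create ${CLUSTER_NAME} \
  --initialization-actions ${INIT_ACTION} \
  --initialization-action-timeout 10m"
echo "${CMD}"
```

Multiple initialization actions can be passed as a comma-separated list to `--initialization-actions`; they run in the order given.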

Why these samples are provided

These samples show how various packages and components can be installed on Cloud Dataproc clusters. You should understand how they work before running them on your clusters. The initialization actions in this repository are provided without support; you use them at your own risk.

Actions provided

This repository presently offers the following actions for use with Cloud Dataproc clusters.

Initialization actions on single node clusters

Single Node clusters have dataproc-role set to Master and dataproc-worker-count set to 0. Most of the initialization actions in this repository should work out of the box, as they run only on the master. Examples include notebooks (such as Apache Zeppelin) and libraries (such as Apache Tez). Actions that run on all nodes of the cluster (such as cloud-sql-proxy) similarly work out of the box.

Some initialization actions are known not to work on Single Node clusters, because all of them expect daemons to run on multiple nodes:

  • Apache Drill
  • Apache Flink
  • Apache Kafka
  • Presto
  • Apache Zookeeper

Feel free to send pull requests or file issues if you have a good use case for running one of these actions on a Single Node cluster.
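On a real cluster, initialization actions typically read the role from VM instance metadata and branch on it. The following is a minimal sketch of that per-role pattern, with the role passed in as a parameter (rather than fetched from the metadata server) so the branching logic can be seen outside a cluster:

```shell
# Sketch of the per-role branching used by many initialization actions.
# On a Dataproc VM the role would come from the metadata server, e.g.:
#   ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
# Here it is a function parameter so the logic stands alone.
run_for_role() {
  local role="$1"
  if [ "${role}" = "Master" ]; then
    echo "master setup (the only node on a Single Node cluster)"
  else
    echo "worker setup"
  fi
}

run_for_role Master
run_for_role Worker
```

On a Single Node cluster only the `Master` branch ever runs, which is why master-only actions generally work out of the box while actions expecting worker daemons do not.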

For more information

For more information, review the Cloud Dataproc documentation. You can also pose questions to the Stack Overflow community with the tag google-cloud-dataproc. See our other Google Cloud Platform GitHub repos for sample applications and scaffolding for other frameworks and use cases.

Contributing changes

Licensing
