dwsmith1983/hadoop_spark_docker

Motivation

With so many Hadoop and Spark builds out there on Docker, why would we create another one? This was for two reasons:

  1. We had a lot of trouble getting many of these builds to run, and
  2. many of the builds were using older versions of Hadoop and Spark.

Therefore, we decided to comb through many of the builds looking for pieces we could stitch together that would work with the latest versions and actually deploy seamlessly on a distributed cluster. We ended up borrowing ideas and parts of builds from a handful of developers, as well as adding our own workarounds.

We tried to remember and list everyone we used as a reference under Resources; however, if you notice a developer we forgot to attribute, send us the link to their (or your) GitHub and Docker repos. Seeing the name will help us recall whether we went through that code.

Resources

We had to adapt our setup from many different configurations. Here is the short list of those we can remember.

Before you start using these images or Dockerfiles, make sure to go through the scripts, configuration files, and Dockerfiles looking for network-specific or user-specific settings to change. Most will be marked <name>.

Required Files: Hadoop/Spark, Python, Docker Swarm

  • DNSmasq directory
    • dnsmasq.conf
    • dnsmasq-run.sh
  • Hadoop directory
    • Dockerfile
    • entrypoint.sh
    • datanode directory
      • Dockerfile
      • run.sh
    • namenode directory
      • Dockerfile
      • run.sh
  • Spark directory
    • Dockerfile
  • Anaconda Python directory
    • Dockerfile
  • Data Science directory
    • Dockerfile
    • requirements.txt
    • conda_requirement_install.sh
  • Grafana directory
    • docker-compose-monitor.yml
    • Docker Swarm Dashboard-1537760402324.json
    • swarm-dashboard_rev1.json
  • build.sh
  • docker-compose.yml
  • hadoop.env
  • pyspark_example.py

Build

Run the build script from the directory containing the cluster directories. The build script takes six arguments, one per image. In our case, we would name them:

  1. hadoop
  2. namenode
  3. datanode
  4. spark
  5. anaconda
  6. datascience

The usage would be

sh ./build.sh hadoop namenode datanode spark anaconda datascience

where the argument names are optional; the default values are shown above. This will build, tag, and push our images to our Docker registry at <dns_name>:5000.
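
build.sh itself determines what each argument maps to. As a rough sketch of what it likely does (the registry host and the one-directory-per-image layout are assumptions, so check the actual script):

#!/bin/sh
# Hypothetical sketch of build.sh: build, tag, and push each image passed in.
REGISTRY=<dns_name>:5000
for image in "$@"; do
  docker build -t "$image" "./$image"      # assumes a directory per image
  docker tag "$image" "$REGISTRY/$image"
  docker push "$REGISTRY/$image"
done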

Setup

DNSmasq

  1. Move the .conf and .sh files to the machine that will act as the DNS server.
  2. Set up the desired configuration in dnsmasq.conf (see the sketch after this list).
  3. Run sh ./dnsmasq-run.sh.
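
As a rough illustration, a minimal dnsmasq.conf might look like the sketch below; every value is a placeholder, so substitute your own interface, domain, and host addresses:

# Hypothetical dnsmasq.conf sketch; all values are placeholders.
interface=eth0                             # listen on the cluster-facing interface
domain-needed                              # do not forward unqualified names
bogus-priv                                 # do not forward private reverse lookups
address=/namenode.<dns_name>/192.168.1.10
address=/datanode1.<dns_name>/192.168.1.11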

Launching Hadoop Spark Cluster

  1. Update the docker-compose.yml to the desired settings, specify the image names, and set the replica number for the datanodes. The replica count must be less than or equal to the number of available datanodes.
  2. Update the hadoop.env file to configure Hadoop.
  3. On the machine that will be used as the manager node, deploy the stack from the compose file with
    docker stack deploy -c <file.yml> <stack name>
    docker stack deploy -c docker-compose.yml cluster
  4. For the first run on the system, the join-token will be needed for the workers. On the manager node, run docker swarm join-token worker to see the token.
  5. ssh into each worker node and paste the printed join command into its command line (see the example after this list).
  6. Enter the master container with docker exec -it cluster_master... bash; note that you can tab-complete the container name.
  7. Test that Hadoop is running correctly with $HADOOP_HOME/bin/hdfs dfs -cat /user/test.md, which should print Success to the command line.
  8. To drop a worker, ssh into the desired machine and enter docker swarm leave.
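
For reference, the join command printed by docker swarm join-token worker has this shape (the token and manager address are placeholders):

docker swarm join --token SWMTKN-1-<token> <manager_ip>:2377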

Hadoop and Spark on Local Cluster

Will add later

Shutting Down the Cluster

  1. Get a list of the running services with docker service ls.
  2. Then run docker service rm <name> [<name> ...] to remove the services.
  3. Then run docker stack rm <stack-name> (for example, docker stack rm cluster) to remove the stack.

Enabling Portainer

Once the cluster is running, we access the manager node and enter the following commands:

  1.    docker service create \
       --name portainer \
       --network hadoop-spark-swarm-network \
       --publish 9000:9000 \
       --mount src=portainer_data,dst=/data \
       --replicas=1 \
       --constraint 'node.role == manager' \
       portainer/portainer -H "tcp://tasks.portainer_agent:9001" --tlsskipverify
  2.    docker service create \
       --name portainer_agent \
       --network hadoop-spark-swarm-network \
       -e AGENT_CLUSTER_ADDR=tasks.portainer_agent \
       --mode global \
       --constraint 'node.platform.os == linux' \
       --mount type=bind,src=//var/run/docker.sock,dst=/var/run/docker.sock \
       --mount type=bind,src=//var/lib/docker/volumes,dst=/var/lib/docker/volumes \
       portainer/agent
  3. Check the Portainer UI at <dns_name>:9000 for the join status of the workers.
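
Before opening the UI, it can help to confirm that both services have running tasks:

docker service ps portainer portainer_agent    # every task should report Running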

Enabling Grafana

Once the cluster is running, we access the manager node and enter the following commands:

  1. Deploy Grafana, InfluxDB, and cAdvisor with docker stack deploy -c docker-compose-monitor.yml monitor
  2. Create the cadvisor database inside the InfluxDB container (docker exec -it <influxDB container ID> bash) with influx -execute 'CREATE DATABASE cadvisor'
  3. In your browser, access the Grafana dashboard at <ip>:80
  4. Log in with user admin and password admin, then change the password
  5. Add a source: set Name to influx, Type to influxDB, URL to http://influx:8086, and Database to cadvisor. Then click the Save & Test button
  6. Import the dashboard with the Manage button (in the left panel) using Docker Swarm Dashboard-<gen_id>.json

Chronograf (InfluxDB dashboard)

<ip>:9001

Reference

https://botleg.com/stories/monitoring-docker-swarm-with-cadvisor-influxdb-and-grafana/

Enable NFS

In the docker-compose-spark.yml, we needed to add

volumes:
  nfsshare:
    driver: local
    driver_opts:
      type: "nfs"
      o: "addr=<ip_add>,nfsvers=4"
      device: ":/"

Next, log into the manager node, <dns_name> and run

docker run -d --privileged -p 2049:2049 --name glcalcs \
  -e NFS_EXPORT_0='/nfsshare *(rw,fsid=0,async,no_subtree_check,no_auth_nlm,insecure,no_root_squash)' \
  erichough/nfs-server
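
Once the server is up, the export can be sanity-checked from any node; the mount point and stack name below are placeholders:

mount -t nfs4 <ip_add>:/ /mnt             # manual test mount of the NFSv4 export
umount /mnt                               # clean up the test mount
docker volume inspect <stack>_nfsshare    # after deploy, confirms the driver options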

Scaling

Once the cluster is running, if a new worker needs to be added, we can run

docker service scale <service_name>=<num>

where <service_name> is the name of the service you would like to scale (Hadoop or Spark) and <num> is an integer greater than the replica count set in the docker-compose.yml but less than or equal to the total number of workers available. For example, if we have a cluster of 5 workers and would like to scale Spark to 7 workers, we would run

docker service scale spark_swarm_worker=7
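
To confirm the new replica count took effect:

docker service ls    # the REPLICAS column for spark_swarm_worker should read 7/7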

Executing Python

Replace the Notebook token with a Password

By replacing the token with a password, we can bookmark http://<dns_name>:8888 in our browser and simply navigate to this address, as opposed to copying the address and token each time Jupyter Lab or Notebook launches.

This is optional; if you don't want to set this up, simply comment out the relevant lines at the bottom of the Dockerfile in the anaconda folder.

  1. Run jupyter notebook --generate-config to generate the config file.
  2. Run jupyter notebook password and enter your password twice.
  3. Copy jupyter_notebook_config.py to the anaconda folder containing the Dockerfile (the default paths are summarized after this list).
  4. Build the Docker image.
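
For reference, steps 1 through 3 rely on the default Jupyter paths (the destination directory is whatever folder holds the anaconda Dockerfile):

jupyter notebook --generate-config    # writes ~/.jupyter/jupyter_notebook_config.py
jupyter notebook password             # prompts twice and stores the hashed password
cp ~/.jupyter/jupyter_notebook_config.py anaconda/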

Jupyter Lab

We can run Jupyter Notebook or Lab from a bash shell inside the anaconda or datascience containers. From the shell, run the command

jupyter-lab --allow-root --ip=0.0.0.0

Additionally, the environment variable JUPYTER_LAB has been created to hold the options that need to be passed (see the sketch below). Then copy the printed URL into your local web browser and change the IP address to the host machine's IP or DNS name.
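
Assuming the variable simply holds the flag string (check the Dockerfile for its actual definition), the invocation collapses to something like:

jupyter-lab $JUPYTER_LAB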

Python Script

Will add later
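
Until then, here is a hedged sketch of submitting the bundled example against a standalone Spark master; the master URL is an assumption (7077 is the standalone default port):

spark-submit --master spark://<dns_name>:7077 pyspark_example.py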

Collaborators

Much of this couldn't have been done without the help of Yok.

Docker Hub

dwsmith1983