
When dealing with huge datasets, it is often impossible to run the code successfully on a personal desktop. You either need a locally installed clustered environment (e.g. Hadoop MapReduce) or a cloud such as AWS. Here's an example of running such a job on the AWS cloud.




Running a Spark Job on AWS Clustered Environment

AWS (Amazon Web Services) is a cloud platform that provides a variety of services on a pay-as-you-go basis.

I am going to run a Spark job, written in Python, on a dataset with 1 million records. This is a heavy computation and may take a while, depending on the number of servers in the cluster rented from AWS. A minimal sketch of such a job is shown below.
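The repository's actual script is not reproduced on this page, so the following is only a minimal PySpark sketch of what such a job looks like. The S3 path and the column names ("category", "amount") are hypothetical placeholders; the real dataset and aggregation may differ.

```python
# Minimal PySpark sketch -- a stand-in for the actual script in this repo.
# The S3 path and the column names ("category", "amount") are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On the AWS cluster the master/deploy settings come from spark-submit and
# the cluster configuration, so nothing is hard-coded here.
spark = SparkSession.builder.appName("million-record-job").getOrCreate()

# Load the ~1 million record dataset (hypothetical S3 location).
df = spark.read.csv("s3://your-bucket/dataset-1m.csv",
                    header=True, inferSchema=True)

# Example computation: a record count and an average per group, spread
# across the executor nodes.
result = (
    df.groupBy("category")
      .agg(F.count("*").alias("row_count"),
           F.avg("amount").alias("avg_amount"))
      .orderBy(F.desc("row_count"))
)

result.show(10)
spark.stop()
```

On the cluster, a script like this would typically be launched with spark-submit from the master node, so the work is distributed across the executors.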

AWS charges per hour for each server in the cluster. Once the job is over, those servers must be terminated manually, otherwise the cost keeps accruing with every passing hour and can add up to a large bill. The cost is around $0.22 per server per hour, and I am going to rent about 5 servers, divided into one master node and 4 executor nodes. A sketch of provisioning and tearing down such a cluster follows below.
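One common way to provision a Spark cluster of this shape on AWS is Amazon EMR. The sketch below uses boto3 to start a 5-node cluster (one master, four core/executor nodes) and to terminate it when the job is done. Whether EMR was used in this repository, as well as the region, EMR release label, and instance type, are assumptions; adjust them to your own account.

```python
# Sketch: start a 5-node EMR Spark cluster with boto3, then terminate it.
# Region, release label, and instance type below are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="spark-1m-records",
    ReleaseLabel="emr-5.30.0",              # assumed EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",  # assumed instance type
        "SlaveInstanceType": "m4.xlarge",   # the 4 executor (core) nodes
        "InstanceCount": 5,                 # 1 master + 4 executors
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
cluster_id = response["JobFlowId"]
print("Started cluster:", cluster_id)

# Once the job has finished, terminate the cluster so the hourly
# per-server charges stop accruing.
emr.terminate_job_flows(JobFlowIds=[cluster_id])
```

Setting KeepJobFlowAliveWhenNoSteps to True keeps the cluster running so jobs can be submitted to it manually; submitting the job as an EMR step with that flag set to False would instead let the cluster terminate itself automatically when the step completes.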

It is obvious that if this script is run on a personal computer without any distribution applied to it, the job will almost certainly fail.

So, for this specific piece of code, a clustered environment is necessary. In data science we routinely encounter huge datasets, and if you don't have a clustered environment set up locally, AWS is a good option.

Below are the details of how I did it.
