This project uses Spark with Scala in IntelliJ IDEA. I participated in the Kaggle knowledge competition House Prices: Advanced Regression Techniques and used its datasets (placed inside a data/ folder). The goal of the competition is to predict the sale price of each house. Submissions are evaluated using Root-Mean-Squared-Error (RMSE), which Kaggle computes on the logarithm of the predicted and actual prices; the lower the value, the closer the predictions are to the actual house prices.
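To make the metric concrete, here is a minimal sketch of RMSE in plain Scala (no Spark needed). The object and method names are made up for illustration and are not part of this project. The second method applies the same formula to log prices, on the assumption that this mirrors how the competition's evaluation page describes its scoring.

```scala
// Hypothetical helper for illustration; not part of the project code.
object RmseSketch {

  /** Root-mean-squared error between two equally sized sequences. */
  def rmse(predicted: Seq[Double], actual: Seq[Double]): Double = {
    require(predicted.nonEmpty && predicted.size == actual.size,
      "inputs must be non-empty and of equal length")
    val squaredErrors = predicted.zip(actual).map { case (p, a) => (p - a) * (p - a) }
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }

  /** The same metric on log prices (assumed to match the Kaggle leaderboard score). */
  def rmseOfLogs(predicted: Seq[Double], actual: Seq[Double]): Double =
    rmse(predicted.map(p => math.log(p)), actual.map(a => math.log(a)))
}
```

For example, `RmseSketch.rmse(Seq(200000.0, 150000.0), Seq(210000.0, 140000.0))` gives 10000.0, because both predictions are off by exactly 10,000.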
- IntelliJ IDEA 2020.1.3
- scala = 2.11.12
- spark = 2.4.5
This assumes that you are using IntelliJ IDEA. This step is equivalent to running the sbt assembly command if you use the sbt-assembly plugin.
- File -> Project Structure -> Artifacts -> + -> JAR -> "From modules with dependencies...".
- Fill in the information and make sure to tick the "Include in project build" box, as shown in the picture below:
- Click OK, then "Build Project". This generates the JAR file inside out/artifacts/.
Again, this assumes that you are using IntelliJ IDEA. This step is equivalent to running the spark-submit command if you have Spark installed on your machine.
- Run -> Edit Configurations... -> + -> Application
- Name it "spark-submit", just for ease of understanding.
- Main class: org.apache.spark.deploy.SparkSubmit
- VM options: -Dspark.master=local[2]
- Program arguments: --class <name_of_class> <location_of_jar_file> args, for example:
  --class kaggle.houseprice.HPRegression /<location_path>/out/artifacts/kaggle_house_price_jar/kaggle_house_price.jar data/train.csv data/test.csv data/sample_submission.csv tmp/submission
It looks like this:
- Click OK. Run "spark-submit". It will generate the CSV file inside tmp/submission/, which you can submit to Kaggle.
- Check the RMSE of sample_submission.csv.
- Trial submission using Linear Regression with one feature.
- Linear Regression with more features: combination of numerics and categories.
- Linear Regression with more features: combination of numerics and categories + hyperparameter tuning via cross-validation.
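The last step in the list above could look roughly like the following Spark ML sketch (Spark 2.4.x API). It is not the project's HPRegression class. The object name, the choice of feature columns (LotArea, OverallQual and GrLivArea as numerics, Neighborhood as a category) and the grid values are assumptions picked for illustration. The argument order mirrors the run configuration shown earlier, and the sample submission path is accepted but unused here. Note that the cross-validated RMSE is computed on raw prices, not on their logarithms.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical object name; the feature columns below are an illustrative choice.
object HousePriceCvSketch {
  val numericCols: Array[String] = Array("LotArea", "OverallQual", "GrLivArea")

  // Categorical column -> index -> one-hot vector; unseen categories are kept, not rejected.
  val indexer = new StringIndexer()
    .setInputCol("Neighborhood").setOutputCol("NeighborhoodIdx").setHandleInvalid("keep")
  val encoder = new OneHotEncoderEstimator()
    .setInputCols(Array("NeighborhoodIdx")).setOutputCols(Array("NeighborhoodVec"))
  val assembler = new VectorAssembler()
    .setInputCols(numericCols :+ "NeighborhoodVec").setOutputCol("features")
  val lr = new LinearRegression().setLabelCol("label").setFeaturesCol("features")

  val pipeline: Pipeline =
    new Pipeline().setStages(Array[PipelineStage](indexer, encoder, assembler, lr))

  // 3 x 3 = 9 hyperparameter combinations (values are arbitrary starting points).
  val paramGrid = new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
    .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
    .build()

  def main(args: Array[String]): Unit = {
    require(args.length == 4, "usage: <train.csv> <test.csv> <sample_submission.csv> <output_dir>")
    val (trainPath, testPath, outputPath) = (args(0), args(1), args(3))

    // The master URL is expected from outside, e.g. -Dspark.master=local[2].
    val spark = SparkSession.builder().appName("house-price-cv-sketch").getOrCreate()

    def readCsv(path: String) =
      spark.read.option("header", "true").option("inferSchema", "true").csv(path)

    val train = readCsv(trainPath).withColumn("label", col("SalePrice").cast("double"))
    val test = readCsv(testPath)

    val evaluator = new RegressionEvaluator()
      .setLabelCol("label").setPredictionCol("prediction").setMetricName("rmse")

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(5)

    // Fits every grid point on every fold, then refits the best setting on all of train.
    val model = cv.fit(train)
    println(s"Best average cross-validated RMSE: ${model.avgMetrics.min}")

    // Write Id and predicted SalePrice as a single CSV part file.
    model.transform(test)
      .select(col("Id"), col("prediction").as("SalePrice"))
      .coalesce(1)
      .write.mode("overwrite").option("header", "true").csv(outputPath)

    spark.stop()
  }
}
```

Keeping the indexer, encoder and assembler inside the pipeline means they are refit on each training fold, so the held-out fold does not leak category information into the model being scored.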