Running a mini ETL in PySpark
For some reason I couldn't export the data to CSV, so I used the persist() method while creating the DataFrame; there are similar issues reported on Stack Overflow.
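Below is a minimal sketch of that workaround. The input path, schema options, and column handling are placeholders, not the actual ones used by main.py; only the persist-before-write pattern and the output/test_transformed directory come from this README.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mini-etl").getOrCreate()

# Placeholder input path; the real pipeline reads its own source data.
df = spark.read.csv("input/source.csv", header=True, inferSchema=True)

# Persisting the DataFrame when it is created was the workaround that made
# the CSV export succeed in this setup.
df = df.persist()

# output/test_transformed is the directory that docker cp later copies
# out of the container.
df.write.mode("overwrite").csv("output/test_transformed", header=True)

spark.stop()
```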
- Install Docker
- Run the commands below (a sketch of the Dockerfile the build step assumes is shown after this list):
docker build . -t sparkhome
- If you only want to run the tests:
docker run --name spark_container sparkhome /bin/bash -c "cd /opt/spark/ && pytest etl_test.py"
- If you want to run the main script:
docker run --name spark_container sparkhome /bin/bash -c "cd /opt/spark/ && python main.py"
- This command copies the exported file from the container to the local directory:
docker cp spark_container:opt/spark/output/test_transformed test_transformed
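For reference, `docker build . -t sparkhome` expects a Dockerfile in the repository root. Its contents are not shown in this README; a minimal sketch of what it might look like, assuming the official `apache/spark-py` base image, is:

```dockerfile
# Hypothetical sketch only; the repository's real Dockerfile may differ.
FROM apache/spark-py:latest

USER root
# pytest is needed for the etl_test.py run shown above.
RUN python3 -m pip install --no-cache-dir pytest

# Place the ETL script and its test where the docker run commands expect them.
WORKDIR /opt/spark
COPY main.py etl_test.py /opt/spark/
```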