
dwh_eng_test_task

A mini ETL pipeline in PySpark.

For some reason I couldn't export the data to CSV directly, so I call the persist() method while creating the DataFrame as a workaround; similar issues are reported on Stack Overflow:

https://stackoverflow.com/questions/45963507/spark-dataframes-are-getting-created-successfully-but-not-able-to-write-into-the
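
As a rough illustration, the workaround looks something like the sketch below. The sample data, schema, and output path are placeholders, not the real pipeline; the actual DataFrame is built in main.py.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mini_etl").getOrCreate()

    # Placeholder data; the real DataFrame is produced by the ETL in main.py.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # persist() materialises the DataFrame before the write, which sidesteps
    # the CSV export failure described in the linked thread.
    df.persist()
    df.count()  # force evaluation so the data is actually cached

    df.write.mode("overwrite").csv("output/test_transformed", header=True)
    df.unpersist()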

Running

  1. Install Docker.

  2. Build the image:

    docker build . -t sparkhome

  3. To run only the tests:

    docker run --name spark_container sparkhome /bin/bash -c "cd /opt/spark/ && pytest etl_test.py"

  4. To run the main script:

    docker run --name spark_container sparkhome /bin/bash -c "cd /opt/spark/ && python main.py"

  5. Copy the exported file from the container to the local directory:

    docker cp spark_container:opt/spark/output/test_transformed test_transformed
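
Spark writes CSV output as a directory of part files, so one quick way to inspect the copied result is a small script like the following (a hypothetical check, assuming the output is plain-text CSV part files):

    import glob

    # Print every part file Spark wrote into the copied output directory.
    for path in sorted(glob.glob("test_transformed/part-*")):
        with open(path) as f:
            print(f.read())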
