ETL & Data Enrichment with Spark.NET and ML.NET Automated (Auto) ML

lqdev/RestaurantInspectionsSparkMLNET

Restaurant Inspections ETL & Data Enrichment in Spark.NET and ML.NET Automated (Auto) ML

This sample takes a restaurant violation dataset from the NYC Open Data portal and processes it using Spark.NET. The processed data is then used to train a machine learning model that attempts to predict the grade an establishment will receive after an inspection. The model is trained using ML.NET, an open-source, cross-platform machine learning framework. Finally, data for which no grade currently exists is enriched using the trained model to assign an expected grade.

For a detailed write-up, check out the Restaurant Inspections ETL & Data Enrichment with Spark.NET and ML.NET Automated (Auto) ML blog post.

Pre-requisites

This project was built on Ubuntu 18.04 but should also work on Windows and macOS.

  • .NET Core 2.1
  • Java 8
  • Apache Spark 2.4.1 with Hadoop 2.7
  • .NET Spark Worker 0.4.0
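
Before building, it can help to confirm that the command-line tools listed above are on the PATH. The following is a minimal sketch, not part of the repository: `check_tool` is a hypothetical helper name, and the script only checks that `dotnet`, `java`, and `spark-submit` can be found. It does not verify the versions listed above, and it does not check for the .NET Spark Worker.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
# Prints "ok: <name>" or "missing: <name>" and returns 0 or 1 accordingly.
# (check_tool is a hypothetical helper, not something shipped with the repo.)
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
        return 0
    fi
    echo "missing: $1"
    return 1
}

# Check each tool the README relies on; keep going after a miss so that
# every missing tool is reported in a single pass.
missing=0
for tool in dotnet java spark-submit; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing prerequisite(s) missing"
```

If a tool is reported as missing, install it or add its `bin` directory to the PATH before continuing.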

Solution description

Understand the data

The dataset used in this solution is the DOHMH New York City Restaurant Inspection Results dataset from the NYC Open Data portal. It is updated daily and contains assigned and pending inspection results and violation citations for restaurants and college cafeterias. The dataset excludes establishments that have gone out of business. Although the dataset contains several columns, only a subset of them is used in this solution. For a detailed description of the dataset, visit its page on the NYC Open Data portal.

Understand the solution

This solution is made up of four .NET Core projects:

  • RestaurantInspectionsETL: .NET Core Console application that takes raw data and uses Spark.NET to clean and transform the data into a format that is easier to use as input for training and making predictions with a machine learning model built with ML.NET.
  • RestaurantInspectionsML: .NET Core Class Library that defines the input and output schema of the ML.NET machine learning model. The trained model is also saved in this project.
  • RestaurantInspectionsTraining: .NET Core Console application that uses the graded data generated by the RestaurantInspectionsETL application to train a multiclass classification machine learning model using ML.NET's AutoML.
  • RestaurantInspectionsEnrichment: .NET Core Console application that uses the ungraded data generated by the RestaurantInspectionsETL application as input for the trained ML.NET machine learning model to predict what grade an establishment is most likely to receive based on the violations found during inspection.

Get the code

git clone https://github.com/lqdev/RestaurantInspectionsSparkMLNET.git

Update solution locations

Before building the code, update the location of the solution in the RestaurantInspectionsTraining and RestaurantInspectionsEnrichment projects.

In each project, replace the value of solutionDirectory with the path where your solution is saved.

Original:

string solutionDirectory = "/home/lqdev/Development/RestaurantInspectionsSparkMLNET";

New:

string solutionDirectory = "<YOUR-SOLUTION-PATH>/RestaurantInspectionsSparkMLNET";
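
The replacement can also be scripted. The sketch below is an illustration under stated assumptions: `set_solution_dir` is a hypothetical helper, the file names in the usage comments are guesses (this README does not say which source file holds the assignment), and the new path is assumed to contain no `|`, `&`, or backslash characters, which are special to `sed`.

```shell
#!/bin/sh
# Rewrite the solutionDirectory assignment in one C# source file.
#   $1: path to the source file
#   $2: directory that contains the cloned RestaurantInspectionsSparkMLNET folder
# Assumes $2 contains no '|', '&', or backslash characters (special to sed).
set_solution_dir() {
    file=$1
    parent=$2
    # Write to a temporary file and move it into place, which avoids the
    # differing in-place (-i) syntaxes of GNU and BSD sed.
    sed "s|string solutionDirectory = \".*\";|string solutionDirectory = \"$parent/RestaurantInspectionsSparkMLNET\";|" \
        "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example usage (hypothetical file locations; adjust to where the
# assignment actually lives in each project):
# set_solution_dir RestaurantInspectionsTraining/Program.cs "$HOME/src"
# set_solution_dir RestaurantInspectionsEnrichment/Program.cs "$HOME/src"
```

Editing the two files by hand works just as well; the script only saves a step when the repository is cloned repeatedly.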

Build

RestaurantInspectionsETL

dotnet publish -f netcoreapp2.1 -r ubuntu.18.04-x64

RestaurantInspectionsTraining

dotnet build

RestaurantInspectionsEnrichment

dotnet publish -f netcoreapp2.1 -r ubuntu.18.04-x64
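
Note that the two projects later launched with spark-submit are published, while RestaurantInspectionsTraining is only built. As a rough sketch of that mapping, the hypothetical `build_cmd` helper below prints the command this README lists for each project; the loop is a dry run that only echoes what would be executed from the solution root. The runtime identifier `ubuntu.18.04-x64` is the README's value and would need to change on other platforms.

```shell
#!/bin/sh
# Print the build command this README lists for a given project.
# (build_cmd is a hypothetical helper used for illustration.)
build_cmd() {
    case "$1" in
        RestaurantInspectionsETL|RestaurantInspectionsEnrichment)
            # Run through spark-submit later, so they are published.
            echo "dotnet publish -f netcoreapp2.1 -r ubuntu.18.04-x64" ;;
        RestaurantInspectionsTraining)
            # Run with `dotnet run`, so a plain build is enough.
            echo "dotnet build" ;;
        *)
            echo "unknown project: $1" >&2
            return 1 ;;
    esac
}

# Dry run: list what would be executed in each project directory.
for project in RestaurantInspectionsETL RestaurantInspectionsTraining RestaurantInspectionsEnrichment; do
    echo "(cd $project && $(build_cmd "$project"))"
done
```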

Run

RestaurantInspectionsETL

From the project directory run the application with spark-submit.

spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish/microsoft-spark-2.4.x-0.4.0.jar dotnet bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish/RestaurantInspectionsETL.dll

RestaurantInspectionsTraining

dotnet run

RestaurantInspectionsEnrichment

Navigate to the publish directory. In this case, it's bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish.

From the publish directory, run the application with spark-submit.

spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2.4.x-0.4.0.jar dotnet RestaurantInspectionsEnrichment.dll
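
Both spark-submit invocations above share the same shape: the DotnetRunner class, a local master, the microsoft-spark jar from the publish directory, and then `dotnet` followed by the application DLL. The sketch below makes that pattern explicit. `spark_submit_cmd` is a hypothetical helper that only prints the command; it does not launch Spark.

```shell
#!/bin/sh
# Print the spark-submit command for a published Spark.NET application.
#   $1: publish directory (relative to the current directory)
#   $2: application DLL name
# (spark_submit_cmd is a hypothetical helper; the jar name is the one
# used in this README.)
spark_submit_cmd() {
    publish_dir=$1
    dll=$2
    echo "spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local $publish_dir/microsoft-spark-2.4.x-0.4.0.jar dotnet $publish_dir/$dll"
}

# From the RestaurantInspectionsETL project directory:
PUBLISH_DIR=bin/Debug/netcoreapp2.1/ubuntu.18.04-x64/publish
spark_submit_cmd "$PUBLISH_DIR" RestaurantInspectionsETL.dll
```

Calling `spark_submit_cmd . RestaurantInspectionsEnrichment.dll` from inside the publish directory prints a command equivalent to the Enrichment one above, with `./` prefixed to the jar and DLL paths.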
