Azure Cosmos DB Connector for Apache Spark

Guides: User Guide | Configuration Reference

azure-cosmosdb-spark is the official connector for Azure Cosmos DB and Apache Spark. The connector allows you to read from and write to Azure Cosmos DB via Apache Spark DataFrames in Python and Scala. It also lets you build a lambda architecture with batch-processing, stream-processing, and serving layers on a globally replicated database while minimizing the latency involved in working with big data.

Table of Contents

Latest
Guides
Jump Start
- Reading from Cosmos DB
- Writing to Cosmos DB
Requirements
Working with the connector
Working with our samples
More Information
Contributing & Feedback

Latest

Lambda Architecture Re-architected - Speed Layer (Databricks notebook HTML view)
Lambda Architecture Re-architected (Documentation and Samples)
Using the Bulk API with the connector (Guidance)

Guides

Guides	Description
User Guide	An end-to-end `azure-cosmosdb-spark` user guide
Configuration Reference Guide	Reference guide of the various read, change feed, write, and bulk API write parameters

Jump Start

Reading from Cosmos DB

Below are excerpts in Python and Scala that show how to create a Spark DataFrame from a Cosmos DB collection.

# Read Configuration
readConfig = {
  "Endpoint" : "https://doctorwho.documents.azure.com:443/",
  "Masterkey" : "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" : "DepartureDelays",
  "preferredRegions" : "Central US;East US2",
  "Collection" : "flights_pcoll", 
  "SamplingRatio" : "1.0",
  "schema_samplesize" : "1000",
  "query_pagesize" : "2147483647",
  "query_custom" : "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'"
}

# Connect via azure-cosmosdb-spark to create Spark DataFrame
flights = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**readConfig).load()
flights.count()
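
The Python excerpt above embeds the account key directly in the configuration. As a minimal sketch, the same options could be assembled by a small helper that reads the endpoint and key from environment variables. The helper names `build_read_config` and `load_dataframe`, and the variable names `COSMOS_ENDPOINT` and `COSMOS_MASTERKEY`, are assumptions made for illustration and are not part of the connector; the option keys and the format string are the ones used in the excerpt.

```python
import os


def build_read_config(database, collection, query=None, regions=None, env=os.environ):
    """Assemble connector read options as a plain dict of strings.

    COSMOS_ENDPOINT and COSMOS_MASTERKEY are hypothetical variable names
    chosen for this sketch; the option keys match the excerpt above.
    """
    config = {
        "Endpoint": env["COSMOS_ENDPOINT"],
        "Masterkey": env["COSMOS_MASTERKEY"],
        "Database": database,
        "Collection": collection,
    }
    if regions:
        # Regions are passed as one semicolon-separated string.
        config["preferredRegions"] = ";".join(regions)
    if query:
        config["query_custom"] = query
    return config


def load_dataframe(spark, config):
    """Create a Spark DataFrame from Cosmos DB using an existing SparkSession."""
    return (
        spark.read.format("com.microsoft.azure.cosmosdb.spark")
        .options(**config)
        .load()
    )
```

In a session where the connector is available, `load_dataframe(spark, build_read_config("DepartureDelays", "flights_pcoll"))` would play the role of the `spark.read` call in the excerpt.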

Scala excerpt:

// Import Necessary Libraries
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.config.Config

// Configure connection to your collection
val readConfig = Config(Map(
  "Endpoint" -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey" -> "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" -> "DepartureDelays",
  "PreferredRegions" -> "Central US;East US2;",
  "Collection" -> "flights_pcoll", 
  "SamplingRatio" -> "1.0",
  "query_custom" -> "SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'"
))

// Connect via azure-cosmosdb-spark to create Spark DataFrame
val flights = spark.read.cosmosDB(readConfig)
flights.count()

Writing to Cosmos DB

Below are excerpts in Python and Scala that show how to write a Spark DataFrame to Cosmos DB.

# Write configuration
writeConfig = {
 "Endpoint" : "https://doctorwho.documents.azure.com:443/",
 "Masterkey" : "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
 "Database" : "DepartureDelays",
 "Collection" : "flights_fromsea",
 "Upsert" : "true"
}

# Write to Cosmos DB from the flights DataFrame
flights.write.format("com.microsoft.azure.cosmosdb.spark").options(**writeConfig).save()
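
The read and write configurations above share the same endpoint, key, and database and differ mainly in the target collection. As a rough sketch under that assumption, the write options could be derived from the read options. The helper names `build_write_config` and `write_dataframe` are hypothetical and not part of the connector; the option keys are the ones shown in the Python write excerpt.

```python
def build_write_config(read_config, sink_collection, upsert=True):
    """Derive write options from read options, targeting another collection."""
    # Carry over only the connection-level keys; read-only options such as
    # query_custom are left behind.
    connection_keys = ("Endpoint", "Masterkey", "Database")
    config = {key: read_config[key] for key in connection_keys}
    config["Collection"] = sink_collection
    # Option values are passed as strings, as in the excerpt above.
    config["Upsert"] = "true" if upsert else "false"
    return config


def write_dataframe(df, config):
    """Write a Spark DataFrame to Cosmos DB with the given options."""
    df.write.format("com.microsoft.azure.cosmosdb.spark").options(**config).save()
```

With these helpers, `write_dataframe(flights, build_write_config(readConfig, "flights_fromsea"))` would correspond to the write call in the excerpt.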

Scala excerpt:

// Configure connection to the sink collection
val writeConfig = Config(Map(
  "Endpoint" -> "https://doctorwho.documents.azure.com:443/",
  "Masterkey" -> "SPSVkSfA7f6vMgMvnYdzc1MaWb65v4VQNcI2Tp1WfSP2vtgmAwGXEPcxoYra5QBHHyjDGYuHKSkguHIz1vvmWQ==",
  "Database" -> "DepartureDelays",
  "PreferredRegions" -> "Central US;East US2;",
  "Collection" -> "flights_fromsea",
  "WritingBatchSize" -> "100"
))

// Upsert the dataframe to Cosmos DB
import org.apache.spark.sql.SaveMode
flights.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)

See other sample Jupyter and Databricks notebooks as well as PySpark and Spark scripts.

Requirements

azure-cosmosdb-spark has been regularly tested using HDInsight 3.6 (Spark 2.1), 3.7 (Spark 2.2) and Azure Databricks Runtime 3.5 (Spark 2.2.1), 4.0 (Spark 2.3.0).

Review supported component versions

Component	Versions Supported
Apache Spark	2.2.1, 2.3
Scala	2.10, 2.11
Python	2.7, 3.6
Azure Cosmos DB Java SDK	1.16.1, 1.16.2

Working with the connector

You can build the connector from source or use its Maven coordinates to work with azure-cosmosdb-spark.

Review the connector's Maven versions

Spark	Scala	Latest version
2.2.0	2.11	azure-cosmosdb-spark_1.0.0-2.2.0_2.11
2.2.0	2.10	azure-cosmosdb-spark_1.0.0-2.2.0_2.10
2.1.0	2.11	azure-cosmosdb-spark_1.0.0-2.1.0_2.11
2.1.0	2.10	azure-cosmosdb-spark_1.0.0-2.1.0_2.10
2.0.2	2.11	azure-cosmosdb-spark_0.0.3-2.0.2_2.11
2.0.2	2.10	azure-cosmosdb-spark_0.0.3-2.0.2_2.10

Using spark-cli

To work with the connector using the spark-cli (i.e. spark-shell, pyspark, spark-submit), you can use the --packages parameter with the connector's Maven coordinates.

spark-shell --master yarn --packages "com.microsoft.azure:azure-cosmosdb-spark_2.2.0_2.11:1.0.0"
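
The coordinate in the command above combines the Spark version, the Scala version, and the connector version. The sketch below formats a coordinate and a command line from those three values. The naming pattern is inferred from the single example above and may not hold for every published release, and the helper names `maven_coordinate` and `spark_cli_args` are hypothetical.

```python
def maven_coordinate(spark_version, scala_version, connector_version):
    """Format a --packages coordinate following the pattern of the example above."""
    artifact = "azure-cosmosdb-spark_{}_{}".format(spark_version, scala_version)
    return "com.microsoft.azure:{}:{}".format(artifact, connector_version)


def spark_cli_args(tool, coordinate, master="yarn"):
    """Return the argument list for spark-shell, pyspark, or spark-submit."""
    if tool not in ("spark-shell", "pyspark", "spark-submit"):
        raise ValueError("unsupported tool: {}".format(tool))
    return [tool, "--master", master, "--packages", coordinate]


if __name__ == "__main__":
    coordinate = maven_coordinate("2.2.0", "2.11", "1.0.0")
    # Print the assembled command line for inspection.
    print(" ".join(spark_cli_args("spark-shell", coordinate)))
```

Returning a list rather than a single string keeps the arguments ready for `subprocess.run` without shell quoting.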

Using Jupyter notebooks

If you're using Jupyter notebooks within HDInsight, you can use the sparkmagic %%configure cell to specify the connector's Maven coordinates.

{ "name":"Spark-to-Cosmos_DB_Connector",
  "conf": {
    "spark.jars.packages": "com.microsoft.azure:azure-cosmosdb-spark_2.2.0_2.11:1.0.0",
    "spark.jars.excludes": "org.scala-lang:scala-reflect"
   }
   ...
}

Note: spark.jars.excludes is included specifically to remove potential conflicts between the connector, Apache Spark, and Livy.
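
If the connector version changes often, the JSON body for the %%configure cell can be generated rather than edited by hand. The following is a minimal sketch that reproduces only the `name` and `conf` entries shown above; the fields elided by `...` in the snippet are not reconstructed, and the helper name `configure_payload` is hypothetical.

```python
import json


def configure_payload(coordinate, session_name="Spark-to-Cosmos_DB_Connector"):
    """Build the JSON body for a %%configure cell as a string."""
    payload = {
        "name": session_name,
        "conf": {
            "spark.jars.packages": coordinate,
            # Exclusion taken from the snippet above.
            "spark.jars.excludes": "org.scala-lang:scala-reflect",
        },
    }
    return json.dumps(payload, indent=2)
```

The returned string can be pasted below the %%configure line in a notebook cell.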

Using Databricks notebooks

Create a library within your Databricks workspace by following the guidance in the Azure Databricks Guide > Use the Azure Cosmos DB Spark connector.

Build the connector

This connector project currently uses Maven. To build without dependencies, you can run:

mvn clean package

Working with our samples

Included in this GitHub repository are a number of sample notebooks and scripts that you can utilize:

On-Time Flight Performance with Spark and Cosmos DB (Seattle) ipynb | html: This notebook uses azure-cosmosdb-spark to connect Spark to Cosmos DB through the HDInsight Jupyter notebook service, showcasing Spark SQL, GraphFrames, and flight delay prediction with ML pipelines.
Connecting Spark with Cosmos DB Change feed: A quick showcase on how to connect Spark to Cosmos DB Change Feed.
Twitter Source with Apache Spark and Azure Cosmos DB Change Feed: ipynb | html
Using Apache Spark to query Cosmos DB Graphs: ipynb | html
Connecting Azure Databricks to Azure Cosmos DB using azure-cosmosdb-spark. Linked here is also an Azure Databricks version of the On-Time Flight Performance notebook.
Lambda Architecture with Azure Cosmos DB and HDInsight (Apache Spark): Combining Azure Cosmos DB and HDInsight not only allows you to accelerate real-time big data analytics, but also allows you to benefit from a Lambda Architecture while simplifying its operations.

More Information

We have more information in the azure-cosmosdb-spark wiki including:

Azure Cosmos DB Spark Connector User Guide
Aggregations Examples

Configuration and Setup

Spark Connector Configuration
Spark to Cosmos DB Connector Setup (In progress)
Configuring Power BI Direct Query to Azure Cosmos DB via Apache Spark (HDI)

Troubleshooting

Using Cosmos DB Aggregates
Known Issues

Performance

Performance Tips
Query Test Runs
Writing Test Runs

Change Feed

Stream Processing Changes using Azure Cosmos DB Change Feed and Apache Spark
Change Feed Demos
Structured Stream Demos

Contributing & Feedback

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

See CONTRIBUTING.md for contribution guidelines.

To give feedback and/or report an issue, open a GitHub Issue.

Apache®, Apache Spark, and Spark® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
