Skip to content

A playground for experimenting ideas that may apply to Spark SQL/Catalyst

Notifications You must be signed in to change notification settings

liancheng/spear

Folders and files

NameName
Last commit message
Last commit date

Latest commit

66216aa · Jul 5, 2018
Nov 8, 2015
May 23, 2017
Aug 31, 2017
Jul 5, 2018
Dec 11, 2017
May 23, 2017
Sep 17, 2017
Jul 5, 2018
Sep 6, 2017
Jun 25, 2017
Mar 13, 2016
Sep 13, 2016
Jun 1, 2018
Sep 8, 2017
Sep 13, 2016
Apr 5, 2017
Jun 19, 2017

Repository files navigation

Overview

Build Status codecov.io

Codecov.io

This project is a sandbox and playground of mine for experimenting ideas and potential improvements to Spark SQL. It consists of:

  • A parser that parses a small SQL dialect into unresolved logical plans
  • A semantic analyzer that resolves unresolved logical plans into resolved ones
  • A query optimizer that optimizes resolved query plans into equivalent but more performant ones
  • A query planner that turns (optimized) logical plans into executable physical plans

Currently Spear only works with local Scala collections.

Build

Building Spear is as easy as:

$ ./build/sbt package

Run the REPL

Spear has an Ammonite-based REPL for interactive experiments. To start it:

$ ./build/sbt spear-repl/run

Let's create a simple DataFrame of numbers:

@ context range 10 show ()
╒══╕
│id│
├──┤
│ 0│
│ 1│
│ 2│
│ 3│
│ 4│
│ 5│
│ 6│
│ 7│
│ 8│
│ 9│
╘══╛

A sample query using the DataFrame API:

@ context.
    range(10).
    select('id as 'key, (rand(42) * 100) cast IntType as 'value).
    where('value % 2 === 0).
    orderBy('value.desc).
    show()
╒═══╤═════╕
│key│value│
├───┼─────┤
│  5│   90│
│  9│   78│
│  0│   72│
│  1│   68│
│  4│   66│
│  8│   46│
│  6│   36│
│  2│   30│
╘═══╧═════╛

Equivalent sample query using SQL:

@ context range 10 asTable 't // Registers a temporary table first

@ context.sql(
    """SELECT * FROM (
      |  SELECT id AS key, CAST(RAND(42) * 100 AS INT) AS value FROM t
      |) s
      |WHERE value % 2 = 0
      |ORDER BY value DESC
      |""".stripMargin
  ).show()
╒═══╤═════╕
│key│value│
├───┼─────┤
│  5│   90│
│  9│   78│
│  0│   72│
│  1│   68│
│  4│   66│
│  8│   46│
│  6│   36│
│  2│   30│
╘═══╧═════╛

We can also check the query plan using explain():

@ context.
    range(10).
    select('id as 'key, (rand(42) * 100) cast IntType as 'value).
    where('value % 2 === 0).
    orderBy('value.desc).
    explain(true)
# Logical plan
Sort: order=[$0] ⇒ [?output?]
│ ╰╴$0: `value` DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [?output?]
  │ ╰╴$0: ((`value` % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [?output?]
    │ ├╴$0: (`id` AS `key`#11)
    │ ╰╴$1: (CAST((RAND(42:INT) * 100:INT) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]

# Analyzed plan
Sort: order=[$0] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
│ ╰╴$0: `value`#12:INT! DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
  │ ╰╴$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
    │ ├╴$0: (`id`#10:BIGINT! AS `key`#11)
    │ ╰╴$1: (CAST((RAND(CAST(42:INT AS BIGINT)) * CAST(100:INT AS DOUBLE)) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]

# Optimized plan
Sort: order=[$0] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
│ ╰╴$0: `value`#12:INT! DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
  │ ╰╴$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
    │ ├╴$0: (`id`#10:BIGINT! AS `key`#11)
    │ ╰╴$1: (CAST((RAND(42:BIGINT) * 100.0:DOUBLE) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]

# Physical plan
Sort: order=[$0] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
│ ╰╴$0: `value`#12:INT! DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
  │ ╰╴$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
    │ ├╴$0: (`id`#10:BIGINT! AS `key`#11)
    │ ╰╴$1: (CAST((RAND(42:BIGINT) * 100.0:DOUBLE) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]