play.tex

\clearpage

This is a tutorial document that may one day grow into a paper.
The project behind this tutorial is to use relational algebra techniques to optimize linear algebra programs, such as those in machine learning.
The goal is to achieve performance speedups by finding better plans through relational techniques.

\section{Motivation}


\begin{figure}[tbh]
\includegraphics[trim={3cm 2cm 2cm 0},clip,width=1\linewidth]{la_to_ra_and_back}
\caption{For a given query MPlan, it is difficult to find the lowest-cost semantically equivalent MPlan due an incomplete basis: MPlan operators are compound and can be decomposed into more primitive parts. Rewriting to a canonical RPlan grants the ability to find otherwise unreachable equivalent MPlans.}
\label{fOverallPlan}
\end{figure}
\begin{figure}[tbh]
\includegraphics[trim={0 0 0 0},clip,width=1\linewidth]{la_to_ra_and_back_flowchart}
\caption{Approach to finding the lowest-cost semantically equivalent MPlan by searching through a space generated by canonical RPlans.}
\label{fOverallPlanFlowchart}
\end{figure}


\section{$ABC+BCD+ABD$}

\begin{figure*}[tbh]
\subcaptionbox{Cost in flops for a few plans computing $ABC+BCD+ABD$\label{tExCost}}{
\begin{tabular}{c|ccc}
plan & cost (dense) 
& \begin{tabular}{@{}c@{}}$n_1=2$\\$n_2=10$\end{tabular} 
& \begin{tabular}{@{}c@{}}$n_1=10$\\$n_2=2$\end{tabular} \\
\hline
$AB(C+D) + (BC)D$       & $2n_1^2 n_2 + 6n_1 n_2^2 + n_2^2 + n_1 n_2$ & 1400 & \textbf{664} \\
$(BC+AB)D + (AB)C$      & $2n_1^2 n_2 + 6n_1 n_2^2 + 2n_1 n_2$ & \textbf{1320} & 680 \\
$(AB)C + (BC)D + (AB)D$ & $2n_1^2 n_2 + 8n_1 n_2^2 + 3n_1 n_2$ & 1740 & 780 \\
$A(BC) + (BC)D + (AB)D$ & $4n_1^2 n_2 + 6n_1 n_2^2 + 3n_1 n_2$ & 1420 & 1100 \\
\end{tabular}
}
\subcaptionbox{Plot of $n_2:n_1$ ratio vs. plan cost\label{fExPlot}}{
\includegraphics[width=0.32\linewidth]{ex_compare}
}
\end{figure*}

Consider the following DML program, in which $n_1$ and $n_2$ are parameters that determine the dimensions of the matrices $A$, $B$, $C$, $D$.
The \texttt{matrix} function creates dense matrices with the given dimensions, filled with the value 1.

\begin{lstlisting}[language=R,escapeinside={(*}{*)},frame=single,caption={DML Program for $ABC+BCD+ABD$},label={lExDML}]
A = matrix(1, rows=(*$n_1$*), cols=(*$n_1$*))
B = matrix(1, rows=(*$n_1$*), cols=(*$n_2$*))
C = matrix(1, rows=(*$n_2$*), cols=(*$n_2$*))
D = matrix(1, rows=(*$n_2$*), cols=(*$n_2$*))

R = A%*%B%*%C + B%*%C%*%D + A%*%B%*%D
write(R, "result.csv")
\end{lstlisting}

\subsection{Manual analysis of plans}

What is the optimal plan to compute $R$?
The answer depends on $n_1$ and $n_2$.
Table~\ref{tExCost} presents a few alternative plans that compute $R$.
The cost, in number of additions and multiplications, is given symbolically in terms of $n_1$ and $n_2$, as well as numerically with different ratios of $n_1$ and $n_2$.

Let's manually analyze the cost of a few alternative plans that compute $R$ for practice.
Recall that a dense matrix multiplication of an $a \times b$ matrix with an $b \times c$ matrix requires $ac$ inner products, each requiring $b$ multiplications and $b-1$ additions (really, $b$ additions when added to a 0 in the result matrix), for a total cost of $2abc$.

Table~\ref{tExCost} hints that the optimal plan for $R$ depends on the ratio of $n_2$ to $n_1$.
When $n_1 \ll n_2$, then the first plan appears to be cheapest. 
When $n_1 \gg n_2$, then the second plan appears to be cheapest. 
The third and fourth plans, while correct, do not appear to be cheapest for any ratio, as they do not take advantage of + factorizations.
Figure~\ref{fExPlot} on the ratio $r$ confirms this trend.

Note that the author's analysis missed a lower cost plan for inputs where $n_1 \gg n_2$.  This plan is $AB(C+D) + B(CD)$. It is easy to miss such plans during manual analysis.

A real ML program contains more complicated expressions. Manual analysis is not feasible; an automated method for finding a (near-)optimal plan is called for. Let's walk through how we could optimize this expression in a more general fashion.


\subsection{MPlans}
Different ways of representing plans can make it easier to optimize them.
One of the simplest representations is the Directed Acyclic Graph (DAG).
A DAG representation presents inputs to an expression as leaves at the bottom, outputs as roots at the top, and operators as interior nodes.

Parsers parse DML expressions into DAGs.
In our example, a parser would parse the program from Listing~\ref{lExDML} into a DAG consisting of matrix operators, shown in Figure~\ref{fExMatrixDAG}.
This parsing assumes parentheses on the left of every multiplication and addition, as in $((AB)C + (BC)D) + (AB)D$.
The order of children is significant in this representation, which we call an MPlan.

The DAG shows the common subexpression $AB$ with two parents: the node computing $ABC$ and the node computing $ABD$.
Common subexpressions are important to recognize, because they can be computed once and reused for all parents.
The process of identifying common subexpressions inside a DAG is called \textit{common subexpression elimination} (i.e., eliminating duplicate subplans by reconnecting the parents of the duplicates to a single copy of the subplan).

It is possible to optimize MPlans directly, by writing rules over the operators MxM (matrix multiplication), $*$ (element-wise multiplication), $+$ (element-wise addition), $^\tr$ (transpose), $\sum_{row}$ (row-wise aggregation), $\sum_{col}$ (column-wise aggregation), and $\sum$ (full aggregation).
However, reasoning at the matrix DAG level is prone to missing possible rewrites,
such as interchanging the $\sum$ over an MxM or $*$.
We therefore consider a finer-grained representation.

\begin{figure}[tbh]
    \centering
\begin{tikzpicture}[level distance=3em, node distance=3em,
    level 1/.style={sibling distance=8em},
    level 2/.style={sibling distance=8em},
    level 3/.style={sibling distance=4em},
]
\node {$+$}
    child {node {$+$}
        child {node (ABC) {MxM}
            child {node (AB) {MxM}
                % child {node (Aab) {A}} %{$B_{ae}$}
                % child {node (Bbc) {B}} %{$C_{ef}$}
            }
            child[missing]% {node (Ccd) {C}} %{$D_{fd}$}
        }
        child {node (BCD) {MxM}
            child {node (BC) {MxM}
                % child {node (Bae) {B}} %{$B_{ae}$}
                % child {node (Cef) {C}} %{$C_{ef}$}
            }
            child[missing]% {node (Dfd) {D}} %{$D_{fd}$}
        }
    }
    child {node (ABD) {MxM}
        child[missing]
        % {node {MxM}
        %     child {node (Aag) {A}} %{$A_{ag}$}
        %     child {node (Bgh) {B}} %{$B_{gh}$}
        % }
        child[missing] %{node (Dhd) {D}} %{$D_{hd}$}
    }
;
\node (A) [below=of AB,xshift=2em] {$A$};
\node (B) [right=of A] {$B$};
\node (C) [right=of B] {$C$};
\node (D) [right=of C] {$D$};

\draw (A.north) -- (AB);
% \draw (A.north) -- (Aag.south);
\draw (B.north) -- (AB);
\draw (B.north) -- (BC);
% \draw (B.north) -- (Bgh.south);
\draw (AB) -- (ABD);
\draw (C.north) -- (ABC);
\draw (C.north) -- (BC);
\draw (D.north) -- (BCD);
\draw (D.north) -- (ABD);
\end{tikzpicture}
    \caption{MPlan, parsed from Listing~\ref{lExDML}}
    \label{fExMatrixDAG}
\end{figure}


As a brief note, we would like to mention that DML programs contain compound operators in addition to the ones listed above.
For example, $-$ (element-wise subtraction) and \^{} (element-wise power) are DML operators that can be written in terms of more basic ones, by applying the rules
\begin{align*}
A - B &\to A + (-1)*B \\
\forall k \in \textsc{Z}^+,\, A^k &\to A * A^{k-1} \\
A^0 &\to 1
\end{align*}

Other operators in DML, such as min and max, cbind (horizontal concatenation), reshape (permute the entries of a matrix into a different shape), and table (construct a matrix from a vector of row indices, a vector of column indices, and a vector of values) are deferred to future work. Including these operators is important because it may be possible to optimize through them by, for example, pushing down a matrix multiply through a cbind($A$,$B$) operator in order to optimize it with the plans that compute $A$ and $B$. In this work, we simply model these operators as uninterpreted black boxes.


\subsection{RPlan DAG}
The elementary operators of linear algebra are patterns of $+$ and $*$.
In this section we introduce two patterns that have sufficient expressiveness to represent all the matrix operators in the MPlans.  We call the new representation RPlans, short for relation plans.

The first pattern is the join.  
Element-wise addition or multiplication join their arguments on matching row and column indices, adding or multiplying matching values.
The first stage of matrix multiply joins the column indices of the first matrix to equal row indices of the second matrix, multiplying matching values.
The first stage of the inner product of two vectors joins on their matching indices.
The outer product of two vectors joins without any join conditions, i.e., taking the Cartesian product of indices.
Thus, joins in linear algebra may involve 0, 1, or 2 join conditions of indices.

In the spirit of relational algebra, from which we borrow the join operation, we will refer to indices as \textit{attributes}.
The objects in the RPlan algebra representation can be considered \textit{relations},
whose attribute names are known as its \textit{schema}.
The schema of a join has attributes equal to the union of its inputs' schema.

We use the symbols $*$ and $+$ to denote the $n$-ary join that multiplies or adds matching values.
The two operators are similar, but there is an important difference between them.  
If we imagine the value 0 as a ``null value'' in the relational sense,
then $*$ acts like an inner join, because multiplication has the annihilator property that for any number $x$, $0 * x = 0$,
whereas $+$ acts like an outer join, because addition does not have the annihilator property.
Nuances like these are important when writing a cost model.
%k-relation

In order to represent join conditions, we adopt the convention of natural join semantics, i.e., that attributes whose names match are meant to be joined together.
We use two operators to manage attributes:
the \textit{bind} operator to assign attribute names, and the \textit{unbind} operator to un-assign attributes.
We write $\bind{ab}$ to bind the indices $a$ and $b$ to the rows and columns of a matrix.
We write $\unbind{ab}$ to un-assign the attribute $a$ as the row and $b$ as the column in a matrix.
Vectors require only one attribute name, because they have a single dimension.

The semantics of bind are to open a stream of scalars from the input that are labelled with the attribute names they are bound to.  Bound names may be thought of as iterating over the input values.
The semantics of unbind are to collect a stream of scalars into a matrix object, using the positions of the unbound attributes to determine which index is the row and which is the column.

The transpose operator is therefore represented in RPlans as an unbind that reverses the order of unbound names.
For example, the plan $\unbind{ba}( \bind{ab}A )$ transposes the input $A$ by unbinding index names opposite the order in which the indices were bound.

Applying this to matrix multiply, the first stage of the plan $A \text{ MxM } B$ can be represented as $\bind{ab} A * \bind{bc} B$.
The name $b$ joins the rows of $A$ to the columns of $B$.  
The result is a 3-tensor, i.e. a tensor of 3 dimensions, because it has three attributes in its schema: $a, b, c$.

The final part of a matrix multiply can be represented with aggregation over $b$.
Aggregation is an operator that is annotated with the attributes to be aggregated, in this case $\sum_b$.
The schema of an aggregation is the schema of its input minus the aggregated attributes.
Thus, a complete expression for the plan $A \text{ MxM } B$ is $\unbind{ac} \sum_b( \bind{ab} A * \bind{bc} B )$.


\subsection{RPlan DAG in Canonical Form}
A common way to write the example in math, assuming parentheses on the left of every multiplication, is
\begin{align*}
R_{ad} &= 
\sum_c \lrparen{\sum_b A_{ab} B_{bc}} C_{cd} + 
\sum_f \lrparen{\sum_e B_{ae} C_{ef}} D_{fd} \\&\qquad\qquad +
\sum_h \lrparen{\sum_g A_{ag} B_{gh}} D_{hd} \\
&= \sum_{bc} A_{ab} B_{bc} C_{cd} + \sum_{ef} B_{ae} C_{ef} D_{fd} + \sum_{gh} A_{ag} B_{gh} D_{hd}
\end{align*}
The above math includes indices (i.e., the letters $a, b, c$, etc.) that bind to the rows and columns of the input matrices, in a way that matches the semantics of matrix multiplication:
the column of the first matrix joins to (has the same attribute name as) the row of the second matrix, which is aggregated away after multiplication.
The rewriting uses the distributive law of $*$ over $\sum$ and the commutativity of $\sum$ to pull the aggregation operators to the top of the expression and push the multiplication operators to the bottom.  

% Talk about relational interpretation over k-relations? 

``Top'' and ``bottom'' refer to the presentation of the expression in the form of a DAG.  
The DAG has one root corresponding to the output of the expression, and one leaf for each input. 
Internal nodes take the form of $+$, $*$, and $\sum$.
Figure \ref{fExDag} presents a DAG of the expressions both before and after rewriting into canonical form.

In general, the canonical form of an RPlan is one that, from bottom to top, includes the inputs, a bind layer, a multiplication layer, an aggregation layer, an addition layer, and an unbind layer.

\begin{figure}[tbh]
    \centering
\subcaptionbox{Plan to compute $((AB)C+(BC)D)+(AB)D$\label{fExDagOrig}}{
\begin{tikzpicture}[level distance=3em, node distance=3em,
    level 2/.style={sibling distance=12em},
    level 3/.style={sibling distance=5em},
    level 4/.style={sibling distance=3em},
]
\node {\unbind{ae}}
child {node {$+$}
    child {node {$+$}
        child {node {$\sum_{c}$} child {node {$*$}
            child {node {$\sum_b$} child {node {$*$}
                child {node (Aab) {\bind{ab}}} %{$B_{ae}$}
                child {node (Bbc) {\bind{bc}}} %{$C_{ef}$}
            }}
            child {node (Ccd) {\bind{cd}}} %{$D_{fd}$}
        }}
        child {node {$\sum_{f}$} child {node {$*$}
            child {node {$\sum_e$} child {node {$*$}
                child {node (Bae) {\bind{ae}}} %{$B_{ae}$}
                child {node (Cef) {\bind{ef}}} %{$C_{ef}$}
            }}
            child {node (Dfd) {\bind{fd}}} %{$D_{fd}$}
        }}
    }
    child {node {$\sum_{h}$} child {node {$*$}
        child {node {$\sum_g$} child {node {$*$}
            child {node (Aag) {\bind{ag}}} %{$A_{ag}$}
            child {node (Bgh) {\bind{gh}}} %{$B_{gh}$}
        }}
        child {node (Dhd) {\bind{hd}}} %{$D_{hd}$}
    }}
};
\node (A) [below=of Bbc,xshift=2em] {$A$};
\node (B) [right=of A] {$B$};
\node (C) [right=of B] {$C$};
\node (D) [right=of C] {$D$};

\draw (A.north) -- (Aab.south);
\draw (A.north) -- (Aag.south);
\draw (B.north) -- (Bbc.south);
\draw (B.north) -- (Bae.south);
\draw (B.north) -- (Bgh.south);
\draw (C.north) -- (Ccd.south);
\draw (C.north) -- (Cef.south);
\draw (D.north) -- (Dfd.south);
\draw (D.north) -- (Dhd.south);
\end{tikzpicture}
}

\subcaptionbox{Plan rewritten into canonical form\label{fExDagCanonical}}{
\begin{tikzpicture}[level distance=3em, node distance=3em,
    level 2/.style={sibling distance=7em},
    level 3/.style={sibling distance=3.5em},
    level 4/.style={sibling distance=2.2em},
]
\centering
\node {\unbind{ae}}
child {node {$+$}
    child {node {$\sum_{bc}$}
        child {node {$*$}
            child {node (Aab) {$\bind{ab}$}} %{$A_{ab}$}
            child {node (Bbc) {\bind{bc}}} %{$B_{bc}$}
            child {node (Ccd) {\bind{cd}}} %{$C_{cd}$}
        }
    }
    child {node {$\sum_{ef}$}
        child {node {$*$}
            child {node (Bae) {\bind{ae}}} %{$B_{ae}$}
            child {node (Cef) {\bind{ef}}} %{$C_{ef}$}
            child {node (Dfd) {\bind{fd}}} %{$D_{fd}$}
        }
    }
    child {node {$\sum_{gh}$}
        child {node {$*$}
            child {node (Aag) {\bind{ag}}} %{$A_{ag}$}
            child {node (Bgh) {\bind{gh}}} %{$B_{gh}$}
            child {node (Dhd) {\bind{hd}}} %{$D_{hd}$}
        }
    }
};
\node (A) [below=of Bbc] {$A$};
\node (B) [below=of Bae,xshift=-.33em] {$B$};
\node (C) [below=of Dfd,xshift=.33em] {$C$};
\node (D) [below=of Bgh] {$D$};

\draw (A.north) -- (Aab.south);
\draw (A.north) -- (Aag.south);
\draw (B.north) -- (Bbc.south);
\draw (B.north) -- (Bae.south);
\draw (B.north) -- (Bgh.south);
\draw (C.north) -- (Ccd.south);
\draw (C.north) -- (Cef.south);
\draw (D.north) -- (Dfd.south);
\draw (D.north) -- (Dhd.south);
\end{tikzpicture}
}
    \caption{DAG representation of plans to compute $ABC+BCD+ABD$}
    \label{fExDag}
\end{figure}


\subsection{Mockup bottom-up plan search}


\section{Formal Definitions}

\subsection{MPlans, RPlans, Mixed Plans}

\begin{table}[tbh]
\begin{tabular}{llll}
&name & type & syntax \\
\hline
\multirow{7}{*}{\rotatebox[origin=c]{90}{MPlan}} 
&mmult & $M_{M,L} \times M_{L,N} \rightarrow M_{M,N}$ & $AB$ or MxM \\
&elemmult & $M_{M,N} \times M_{M,N} \rightarrow M_{M,N}$ & $A*B$ \\
&elemplus & $M_{M,N} \times M_{M,N} \rightarrow M_{M,N}$ & $A+B$ \\
&rowagg & $M_{M,N} \rightarrow M_{M,1}$ & $\sum_{row} A$ \\
&colagg & $M_{M,N} \rightarrow M_{1,N}$  & $\sum_{col} A$ \\
&agg & $M_{M,N} \rightarrow M_{1,1}$ & $\sum A$ \\
%&power & $M  \times Int \rightarrow M$ & $A^k$ \\
&transpose & $M_{M,N} \rightarrow M_{N,M}$ & $A^\tr$ \\
\hline
\multirow{2}{*}{\rotatebox[origin=c]{90}{conv.}}
&bind & $M_{M,N} \times [i,j] \rightarrow R_{i:M,j:N}$ & $\bind{i,j} A$ \\
&unbind & $R_{i:M,j:N} \times [i,j] \rightarrow M_{M,N}$ & $\unbind{i,j} A$ \\
\hline
\multirow{2}{*}{\rotatebox[origin=c]{90}{RPlan}}
&join & $R_{S_1} \times R_{S_2} \rightarrow R_{S_1 \cup S_2}$ & $A*B$ or $A+B$ \\
&agg & $R_S \times U \rightarrow R_{S \setminus U}$ & $\sum_{U} A$ \\
\end{tabular}
\caption{Operators of MPlans and RPlans. Uninterpreted black-box functions are also allowed.
The type $M_{M,N}$ is a matrix of size $M \times N$; $[i,j]$ is a list of attribute names; $R_{i:M,j:N}$ is a relation with attributes $i$ of size $M$ and $j$ of size $N$; $S_1, S_2, S,$ and $U$ are sets of attributes.}
\label{tPlanOps}
\end{table}

The Matrix Plan Algebra is an algebra over real number matrices with the operators listed in the top third of Table~\ref{tPlanOps}. 
The type of an matrix with dimensions $M \times N$ is written $M_{M,N}$.
Vectors have orientation and are written as either $M_{1,N}$ or $M_{M,1}$; scalars have type $M_{1,1}$.
An \textit{MPlan}---short for Matrix Plan---is an expression in the Matrix Plan Algebra. 

The Relation Plan Algebra is an algebra over relations (whose attributes' domain is a prefix of the positive integers) with the operators listed in the bottom third of Table~\ref{tPlanOps}.
The domain of each attribute is a prefix of the positive integers; we say an attribute has size $s$ if its domain is the integers $1, 2, \dots, s$.
The notation $R_{i:M,j:N,k:L}$ indicates a relation $R$ with attributes $i$ of size $M$, $j$ of size $N$, and $k$ of size $L$.
Each tuple of a relation is annotated with a real number.
The set of a relation's attributes is called its schema.
We write $i \in A$ if $i$ is an attribute name present in $A$'s schema.
An \textit{RPlan}---short for Relation Plan---is an expression in the Relation Plan Algebra.

The middle third of Table~\ref{tPlanOps} presents the bind and unbind conversion operators between matrices and relations. Both operators are annotated with a pair of attribute names, either of which may be the `\_' symbol to match the dimension of the input type. % whose length matches the input's dimension 

\begin{definition}
An \textbf{\emph{MPlan}} is an expression in the Matrix Plan Algebra, except that relation objects (but not relation operators) and conversion operators are also allowed. An MPlan is \emph{pure} if it does not contain conversion operators.

An \textbf{\emph{RPlan}} is an expression in the Relation Plan Algebra, except that matrix objects (but not matrix operators) and conversion operators are also allowed. An RPlan is \emph{pure} if it does not contain conversion operators.

A \textbf{\emph{Mixed Plan}} is an expression in the product of the Matrix Plan Algebra and the Relation Plan Algebra, extended with the conversion operators.
\end{definition}

MPlans, RPlans, and Mixed Plans are depicted as DAGs, to emphasize that they may contain common sub-expressions (CSEs) or multiple outputs (multiple roots).
They may also contain uninterpreted (black-box) functions.


\subsection{Converting MPlans to RPlans}
An MPlan may be converted into a semantically equivalent RPlan via the rewrite rules $R_{MR}$ in Figure~\ref{RMR}.
This list includes additional rules for a couple common operators in the DML language: element-wise subtraction and positive integer exponentiation.

\begin{figure}[htb]
\begin{enumerate}
  \item $A - B \rightarrow A + (-1) * B$
  \item For small integers $k > 1$, $A^k \rightarrow A * A^{k-1}$ and $A^1 \to A$.
  \item $A*B \rightarrow \unbind{i,j}(\bind{i,j}A * \bind{i,j}B)$.  Same pattern for $+$.
  \item $AB \rightarrow \unbind{i,k} \sum_j (\bind{i,j}A * \bind{j,k}B)$. 
  \item $A^\tr \rightarrow \unbind{j,i} \bind{i,j}A$.
  \item $\sum_{row} A \rightarrow \unbind{i,\_} \sum_j \bind{i,j} A$.  Similar pattern for $\sum_{col}$, $\sum$.
\end{enumerate}
\caption{MPlan-to-RPlan Ruleset $R_{MR}$}
\label{RMR}
\end{figure}

Applying the $R_{MR}$ rules replaces matrix operators with relation operators, with conversion operators in between to convert matrices to relations and back. The result is an RPlan.

It follows from the $R_{MR}$ rules that the space of MPlans is a subset of the space of RPlans. We prove that it is a strict subset in the following:
\begin{lemma}
Let \MPLAN{} and \RPLAN{} be the set of all MPlans and RPlans.
Then $R_{MR}(\MPLAN{}) \subsetneq \RPLAN{}$.
\end{lemma}

% No confluence needed (confluence means that no matter what order we apply RMR rules, the result is the same).

\begin{proof}
First we need to show that the ruleset $R_{MR}$ is terminating on any MPlan, or else the lemma is improper. It is terminating because each rule (except (1)) removes a matrix operator, and an MPlan has finitely many matrix operators. For rule (2), we use the fact that $k$ decreases and is bounded below to prove termination.

Suppose $p \in \MPLAN{}$. 
We prove that $R_{MR}(p) \in \RPLAN{}$ by structural induction on $p$. 
The base case, that $p$ is a lone matrix, is trivially an RPlan.
The rules $R_{MR}$ compose the inductive case; for each matrix operator, there is a rule in $R_{MR}$ that converts it to a composition of relation operators and conversion operators.  Rule (2) requires a second induction, but otherwise the proof is straightforward. 

To show that $R_{MR}(\MPLAN{}) \neq \RPLAN{}$, consider the RPlan $\bind{i,j}A * \bind{k,l}B$.
This RPlan computes a tensor product and has schema $\{i,j,k,l\}$. As its dimension is 4, no representation as an MPlan is possible because the maximum dimension of an MPlan is 2.
\end{proof}


The unbind and bind operators can be eliminated or moved around to create ``pipelines'' of RPlan operators.
For example, the rule $\bind{i,j} \unbind{i,j} A \to A$ removes a bind-unbind sequence, assuming that care is taken to properly handle CSEs.
More generally the rule $\bind{x,y} \unbind{i,j} A \to A[i \to x, j \to y]$ indicates that the attributes $i$ and $j$ in $A$'s schema should be renamed to $x$ and $y$, by propagating the rename downward into $A$.
%This pattern will be discussed in more detail.
The result of eliminating/shifting unbind and bind operators is an RPlan with a bind operator above every (nonscalar) input and an unbind operator below every (nonscalar) output.


\subsection{RPlan Canonical Form}
We now introduce a canonical form for RPlans.
The purpose of the form is to (1) identify common sub-expressions (i.e., expressions that may differ in syntax but compute the same function) and (2) act as a neutral starting point from which we may systematically enumerate alternative plans.  Fulfilling these roles requires the canonical property, that two RPlans have the same canonical form if and only if they are semantically equivalent (i.e., they compute the same result on every input).%\footnote{Our canonical form technically is only canonical up to uninterpreted constants. Constant folding, such as replacing $2 + 3$ with 5, occurs before and after our optimization process and affects the estimated cost of plans little.}

Our form is similar in spirit to the disjunctive normal form (DNF) in propositional logic, also known as a truth table.  DNF formulae are rarely the most efficient form to execute, but they do act as a canonical form and also serve as a starting point for finding more efficient forms.  %((Reference K-V maps?))
Our form is also similar to the canonical form of polynomials as a sum of monomials, except that the monomial terms also include aggregations and constants are treated as black-box symbols, rather than real numbers.

The analogy to polynomials extends further: the canonical form of polynomials combines ``like terms,'' i.e., two terms that have the same variables.  Instead of writing $3x^2y + 5x^2y$, for example, we write its canonical form as $8x^2y$.
For RPlans, we determine whether two terms are like by whether there exists an isomorphism mapping one to the other by renaming attributes (up to a constant scalar).

The canonical form for MPlans is the sum-of-aggregations-of-products with like terms combined, as follows:

\begin{definition}
An RPlan is \textbf{canonical} if it can be written as the sum of $n$ terms, each of which is an aggregation over a (possibly empty) attribute set $A_i$ (for $1 \leq i \leq n$), each of whose inputs is the multiplication of $m_i$ base input matrices $b_{ij}$ (for $1 \leq i \leq n$, $1 \leq j \leq m_i$):

\[\sum_{A_1} \lrparen{b_{11} * \dots * b_{1m_1}} %+ \sum_{A_2} (b_{21} * \dots * b_{2m_2})
+ \dots + \sum_{A_n} \lrparen{b_{n1} * \dots * b_{nm_n}}\]

Further, there must not exist, for any $1 \leq i < j \leq n$, an isomorphism $\phi$ renaming attributes in the schema of $b_{i1} * \dots * b_{im_i}$ to attributes in the schema of $b_{j1} * \dots * b_{jm_j}$ such that $m_i=m_j$, $\phi(A_i) = A_j$, and $\phi\{b_{i1} * \dots * b_{im_i}\} = \{b_{j1} * \dots * b_{jm_j}\}$.

The order of summands and multiplicands is insignificant, due to the commutative and associative properties of $*$ and $+$. A canonical order within these can be established when necessary by, e.g., hashing.

% A canonical form may be abbreviated as \\
% $[(A_1; b_{11}, \dots, b_{1m_1}), \dots, (A_n; b_{n1}, \dots, b_{nm_n})]$.
\end{definition}

Figure~\ref{RRC} shows the rewrite rules $R_{RC}$ that canonize RPlans by pulling additions to the top of an expression, followed by aggregations and last multiplications. 
Let $\mathcal{C}$ be the transitive application of ruleset $R_{RC}$.

\begin{figure}[htb]
\begin{enumerate}
  \item\label{RRC_mp} $A * (B + C) \rightarrow A*B + A*C$
  \item\label{RRC_ap} $\sum_i (A + B) \rightarrow \sum_i A + \sum_i B$
  \item\label{RRC_ma} If $i \not\in A$,  $A * \sum_i B \rightarrow \sum_i (A * B) \quad$ (else rename indices)
  \item\label{RRC_aa} $\sum_i \sum_j A \rightarrow \sum_{i,j} A$
%   \item $\sum_i A \rightarrow \sum_{i,j} A$
  \item\label{RRC_ac} If $i \not\in A$, then $\sum_i A \rightarrow A * |i|$
  \item\label{RRC_pp} $A + (B + C) \rightarrow +(A, B, C) \quad$ (binary to n-ary addition)
  \item\label{RRC_mm} $A * (B * C) \rightarrow *(A, B, C) \quad$ (binary to n-ary multiply)
  \item\label{RRC_c} Constant fold constant terms $c_1 + c_2$, $c_1*c_2$
  \item\label{RRC_cc} 
  If $c_1, c_2$ are constant and $T_1, T_2$ are isomorphic up to renaming,
  then $c_1 * T_1 + c_2 * T_2 \to (c_1 + c_2) * T_1$ and apply rule (\ref{RRC_c})
\end{enumerate}
\caption{RPlan-to-Canonical-RPlan Ruleset $R_{RC}$}
\label{RRC}
\end{figure}

We now prove a few properties of $\mathcal{C}$ in order to show that it truly is canonizing.
% Each rewrite preserves the expression's semantics, as they stem from the mathematics of semirings and aggregates.

\begin{lemma}
$\forall p \in \RPLAN{},\, \mathcal{C}(p)$ is in canonical form. 
\end{lemma}
\begin{proof}
$R_{RC}$ rules 8 and 9 combine isomorphic-up-to-constant terms.

Proceed by structural induction over an RPlan $p$.
The base case, that $p$ is a lone relation, is canonical.
The inductive cases are as follows.

1) Suppose $p = p^1 + p^2$. 
By the inductive hypothesis, $p^1$ and $p^2$ are canonical with forms, for $i \in \{1,2\}$, \\ 
$\sum_{A^i_{1}} \lrparen{b^i_{11} * \dots * b^i_{1m^i_{1}}} + \dots + \sum_{A^i_{n^i}} \lrparen{b^i_{n^i1} * \dots * b^i_{n^im^i_{n^i}}}$.

By rule~\ref{RRC_pp}, $p$ takes the canonical form \\
$\sum_{A^1_{1}} \lrparen{b^1_{11} * \dots * b^1_{1m^1_{1}}} + \dots + \sum{A^1_{n^1}} \lrparen{b^1_{n^11} * \dots * b^1_{n^1m^1_{n^1}}} +
\sum_{A^2_{1}} \lrparen{b^2_{11} * \dots * b^2_{1m^2_{1}}} + \dots + \sum{A^2_{n^2}} \lrparen{b^2_{n^21} * \dots * b^2_{n^2m^2_{n^2}}}$.

2) Suppose $p = \sum_A p'$.
By the inductive hypothesis, $p'$ is canonical with form \\
$\sum_{A_1} \lrparen{b_{11} * \dots * b_{1m_1}} + \dots + \sum_{A_n} \lrparen{b_{n1} * \dots * b_{nm_n}}$.

Rule \ref{RRC_ap} distributes the aggregate into plus;
rule \ref{RRC_ac} handles constant aggregations, as after the expression $\sum_{ij} (X + 7)$;
rule \ref{RRC_aa} combines aggregates.
The result has canonical form \\
$\sum_{A_1 \cup A} \lrparen{b_{11} * \dots * b_{1m_1}} + \dots + \sum_{A_n \cup A} \lrparen{b_{n1} * \dots * b_{nm_n}}$.

3) Suppose $p = p^1 * p^2$.
By the inductive hypothesis, $p^1$ and $p^2$ are canonical with forms, for $i \in \{1,2\}$, \\ 
$\sum_{A^i_{1}} \lrparen{b^i_{11} * \dots * b^i_{1m^i_{1}}} + \dots + \sum_{A^i_{n^i}} \lrparen{b^i_{n^i1} * \dots * b^i_{n^im^i_{n^i}}}$.
% $\sum_{A_{i1}} \lrparen{b_{i11} * \dots * b_{i1m_1}} + \dots + \sum{A_{in}} \lrparen{b_{in1} * \dots * b_{inm_n}}$.

Rule \ref{RRC_mp} multiplies across the Cartesian product of summands of $p_1$ and $p_2$.
Rules \ref{RRC_ma}, \ref{RRC_mm}, and \ref{RRC_aa} push the multiplications further down into aggregations and other multiplications.
In particular, rule \ref{RRC_ma} may rename attributes; we use the `$\sqcup$' symbol to denote disjoint union after renaming.
The result is the canonical form \\
$\sum_{A^1_{1} \sqcup A^2_{1}} \lrparen{b^1_{11} * \dots * b^1_{1m^1_1} * b^2_{11} * \dots * b^2_{1m^2_1}} + \dots + \\
\sum_{A^1_{n^1} \sqcup A^2_{n^2}} \lrparen{b^1_{n^11} * \dots * b^1_{n^1m^1_{n^1}} * b^2_{n^21} * \dots * b^2_{n^2m^2_{n^2}}}$.
\end{proof}


\begin{lemma}\label{lCanonPreservesSemantics}
Canonization preserves semantics: \\
$\forall p \in \RPLAN{}, \forall \text{inputs } d,\, p(d) = \mathcal{C}(p)(d)$.
\end{lemma}
\begin{proof}
Every rule composing $\mathcal{C}$ in $R_{RC}$ stems from a mathematical property of $+$, $*$, and $\sum$.
\end{proof}

In the following proof, we write syntactic equality with the symbol $\equiv$, in order to distinguish it from value equality.

\begin{theorem}
Two RPlans are semantically equivalent iff they have identical canonical forms:
$\forall p_1, p_2 \in \RPLAN{},\, \mathcal{C}(p_1) \equiv \mathcal{C}(p_2) \iff \forall \text{inputs }d,\, p_1(d) = p_2(d)$.
\end{theorem}
\begin{proof}
For the $\Rightarrow$ direction, two plans that have syntactically identical canonical forms must compute the same result on every input because, per Lemma \ref{lCanonPreservesSemantics}, canonization preserves semantics.

The $\Leftarrow$ direction is more interesting. Proceed by proving its contrapositive:
$\mathcal{C}(p_1) \not\equiv \mathcal{C}(p_2) \Rightarrow \exists d,\, p_1(d) \neq p_2(d)$.

It is sufficient to prove 
$\mathcal{C}(p_1) \not\equiv \mathcal{C}(p_2) \Rightarrow \exists d,\, \mathcal{C}(p_1)(d) \neq \mathcal{C}(p_2)(d)$
because, by Lemma \ref{lCanonPreservesSemantics}, $\mathcal{C}(p)(d) = p(d)$.

To find a $d$ that the canonical forms disagree on, given that the canonical forms syntactically differ,
we supply a partial input (i.e., input relations with holes) that reduces each $\mathcal{C}(p)$ to syntactically different polynomials in canonical form.
Then, by Lemma XX, we use the fact that syntactically different polynomials in canonical form differ on a concrete input, 
and we construct a full input database $d$ using this concrete input to finish the proof.

First we construct a partial input.
Pick a term $t_0 \in p_1$ such that $t_0 \not\in p_2$. Such a term must exist because $p_1 \not\equiv p_2$. [Todo: introduce graphs]
Next pick the term $t \in p_1 \text{ or } p_2$ such that $t$ is the maximal term among all terms in $p_1$ and $p_2$ that have a surjective homomorphism to $t$.

We shall design a partial input database using $t$.
Suppose there are $n$ attributes (before aggregating) in $t$.
Assign each attribute a unique positive integer between $1$ and $n$ and let $g$ be the bijection from $t$'s attributes to $[n]$.

For each multiplicand $A_{ij}$, introduce a fresh scalar variable $a_{ij}$ at entry $(g(i), g(j))$ of the input matrix $A$.
Let other entries in each input matrix be 0.

Expand the aggregations in $p_1$ and $p_2$ with respect to the new symbolic matrices.
The result is a polynomial in canonical form.
It is in canonical form because the sums-of-aggregations-of-products in the original RPlans expands to a polynomial in sum-of-products form,
and as no two terms in the original RPlan were isomorphic, neither are any two terms in the polynomial. Todo: say more about the surjective homomorphism.

Call the above reduction $d(t)$, where $t$ is the term chosen by the above process. We write $t_1[d(t)]$ for the result of applying the partial input generated by term $t$ to $t_1$. Two properties of the transformation are relevant:
\begin{enumerate}
    \item For any two terms $t_1$ and $t_2$ that are isomorphic, $t_1[d(t)] \equiv t_2[d(t)]$.
    To see this, observe that each multiplicand in $t_1[d(t)]$ arose from a multiplicand in $t_1$, and that there is an equivalent multiplicand in $t_2[d(t)]$ multiplied in the same way due to the isomorphism between $t_1$ and $t_2$.
    \item For any terms $t_1$ and $t_2$ that are not isomorphic, if there does not exist a surjective homomorphism from $t_2$ to $t_1$, then $t_1[d(t)] \not\equiv t_2[d(t)]$. Proof todo.
\end{enumerate}

Because $\mathcal{C}(p_1)$ syntactically differs from $\mathcal{C}(p_2)$, then by the second property above it follows that $\mathcal{C}(p_1)[d(t)]$ is a polynomial that syntactically differs from $\mathcal{C}(p_2)[d(t)]$.
Applying Lemma XX on polynomial canonical forms proves that a particular input on the polynomial variables $d$ exists on which the two polynomials disagree.  Plugging $d$ into the symbolic matrices designed by $d(t)$ results in a database input to the original RPlans disagree, showing that $\exists d,\, \mathcal{C}(p_1)(d) \neq \mathcal{C}(p_2)(d)$.
\end{proof}


\subsection{RPlan Graphbag}
% An RPlan \textit{without uninterpreted functions}\footnote{Uninterpreted functions split an RPlan in two: the RPlan below and the RPlan above. Each may be represented as a separate RPlan Graphbag.} in canonical form may be represented as a bag of hypergraphs, defined as follows.
% Each child of $+$ is represented as a separate graph, the collection of which forms a bag of graphs.
% A child of $+$ consists of a set of aggregated attributes, followed by the product of inputs.
% The graph is a directed and labeled hypergraph.
% The vertices $V$ are the union of the schemas of the inputs.
% Each vertex is either \textit{aggregated}, if it is in the set of aggregated attributes, or \textit{output}, if it is not.
% There is one hyperedge for each input, labeled with the input itself, on the vertices in its schema *** todo need to include the input bind and output bind *** redefine to include conv. operators -- RPlan+


\nocite{CohenDDHW09}


% OLD VERSION
% For the $\Rightarrow$ direction, two plans that have syntactically identical canonical forms must compute the same result on every input because, per Lemma \ref{lCanonPreservesSemantics}, canonization preserves semantics.

% The $\Leftarrow$ direction is more interesting. Proceed by proving its contrapositive:
% $\mathcal{C}(p_1) \neq_{syn} \mathcal{C}(p_2) \Rightarrow \exists d,\, p_1(d) \neq p_2(d)$.

% It is sufficient to prove 
% $\mathcal{C}(p_1) \neq_{syn} \mathcal{C}(p_2) \Rightarrow \exists d,\, \mathcal{C}(p_1)(d) \neq \mathcal{C}(p_2)(d)$
% because, by Lemma \ref{lCanonPreservesSemantics}, $\mathcal{C}(p)(d) = p(d)$.

% To find a $d$ that the canonical forms disagree on, 
% given that the canonical forms syntactically differ, 
% we use derivatives.  
% It is well known that for any differentiable functions $f$ and $g$, $\forall d,\, f(d) = g(d) \Rightarrow \forall d,\, f'(d) = g'(d)$; the contrapositive of this fact gives us the proof technique we need.

% In order to take derivatives, we intentionally blur the line between RPlans and (multi)linear algebra, because an RPlan relation with $n$ attributes is equivalent to a tensor with $n$ named dimensions.

% Let $t$ be one of the summed terms in $\mathcal{C}(p_1)$ that does not appear in $\mathcal{C}(p_2)$, which must exist since they syntactically differ.
% By definition of the canonical form, no other term in $\mathcal{C}(p_1)$ is isomorphic-up-to-constants to $t$.

% Derivate $\mathcal{C}(p_1)$ and $\mathcal{C}(p_2)$ for each input in $t$ to obtain $\mathcal{C}(p_1)'$ and $\mathcal{C}(p_2)'$.
% Consider $d=0$ (zero for all inputs).
% We have that $t'(0) \neq 0$ because $t$ is a constant term (since we derivated $t$ by each input in $t$).
% Other terms in $\mathcal{C}(p_1)'$ evaluate to 0 because they either (a) began with a subset of the inputs in $t$, which results in their derivative being 0, or (b) have inputs remaining after taking the derivatives, which evaluate to 0 on input 0. 
% Therefore $\mathcal{C}(p_1)'(0) = t'(0) \neq 0$.

% For every term in $\mathcal{C}(p_2)$ that is not isomorphic-up-to-constants to $t$, its derived counterpart in $\mathcal{C}(p_2)'$ evaluates to 0. There is at most one term in $\mathcal{C}(p_2)$ that is isomorphic-up-to-constants. If there is such a term, then it must have a different constant and so $t'(0) \neq \mathcal{C}(p_2)'(0)$, or else it would be identical to $t$ and $t$ would appear in $\mathcal{C}(p_2)$, contradicting our choice of $t$.
% Therefore $\mathcal{C}(p_1)'(0) \neq \mathcal{C}(p_2)'(0)$.

% As the derivatives differ on input 0, it must hold that $\exists d,\, \mathcal{C}(p_1)(d) \neq \mathcal{C}(p_2)(d)$.


\section{Related Work}
Relational join optimization: \cite{moerkotte2006analysis}.
Looks at the join graph, whose vertices are relations and edges are between two relations that have a join condition (overlap in their attributes).  No aggregation or projection. Enumerates join tree possibilities without duplicates. Enumerates connected-subgraph complement pairs \emph{csg-cmp-pairs}, but this is automatically given by the query graph where attributes are vertices and relations are hyperedges, isn't it?

The closest sum-product optimization frameworks in the literature (FAQ \cite{KhamisNR16}, AJAR \cite{Joglekar2016AJARAA}, PANDA) cover LA expressions involving $*$ and $\sum$.
These frameworks have an excellent coverage of theory and implementation,
but they fall short of LA workloads that occur in practice.
We step past these frameworks by incorporating common sub-expressions and incorporating addition ($+$).


The lineage of optimizer work from Starburst <- System R -> Exodus -> Volcano -> Cascades ->  Orca; Columbia -> Agrios/Bonneyville, as well as some UDF-like ones stemming from Nephele/PACTs