This project covers Run-Length Encoding (RLE), writing simple optimization rules, and column-at-a-time execution.
In this task, you will implement Run-Length Encoding (RLE).
RLE is used for lossless compression, while still allowing relational operators to operate on the compressed data. RLE replaces runs of the same element with a pair of the element and the length of the run. For example, the sequence AAAAABBBAABAC would be replaced by (A, 5), (B, 3), (A, 2), (B, 1), (A, 1), (C, 1). RLE can be used for compressing both columnar and row data, but the diversity across full rows usually reduces the compression ratio. When compressing columns, the (value, length) pairs are often accompanied by a virtual ID, to allow tuple reconstruction after applying columnar filters.
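As a toy illustration of the encoding itself (a self-contained sketch, not the project's API):

```scala
// Minimal sketch: compress a sequence into (value, run length) pairs.
def encode[A](xs: Seq[A]): List[(A, Int)] =
  xs.foldRight(List.empty[(A, Int)]) {
    case (x, (y, n) :: rest) if x == y => (y, n + 1) :: rest // extend the current run
    case (x, acc)                      => (x, 1) :: acc      // start a new run
  }

// encode("AAAAABBBAABAC".toList)
//   == List(('A',5), ('B',3), ('A',2), ('B',1), ('A',1), ('C',1))
```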
In the first subtask you are asked to implement a scan (in [ch.epfl.dias.cs422.rel.early.volcano.Scan]) that reads columns compressed with RLE and reconstructs the (full, uncompressed) original tuples. For the rest of the volcano operators you may build on top of your submission for project 0.
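Conceptually, and ignoring the skeleton's actual interfaces, reconstructing the rows amounts to decoding every column and transposing the result; a minimal sketch:

```scala
// Sketch only: each column arrives as its own list of (value, run length)
// runs; decoding every column and transposing yields the original tuples.
def reconstructRows(columns: Seq[Seq[(Any, Int)]]): Seq[Seq[Any]] =
  columns
    .map(_.flatMap { case (v, n) => Seq.fill(n)(v) }) // decode one column
    .transpose                                        // column-major -> row-major
```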
In the second subtask you should implement the Decode and Reconstruct operators:
- Decode ([ch.epfl.dias.cs422.rel.early.volcano.rle.Decode]) translates RLETuples to Tuples by repeating values as necessary.
- Reconstruct ([ch.epfl.dias.cs422.rel.early.volcano.rle.Reconstruct]) is a binary operator (it has two input operators) that synchronizes the two streams of RLETuples produced by its inputs into a new stream of RLETuples. More specifically, the two inputs produce RLETuples for the same table but for different groups of columns. Either stream may be missing tuples due to pre-filtering. Reconstruct should find the virtual IDs that exist on both input streams and generate the output as RLE-compressed tuples.
Example: if the left input produces:
RLEentry(2, 2, Tuples("a")), RLEentry(20, 1, Tuples("c", 7))
and the right input produces:
RLEentry(0, 11, Tuples(3.0)), RLEentry(18, 1000, Tuples(6.0))
then Reconstruct should produce:
RLEentry(2, 2, Tuples("a", 3.0)), RLEentry(20, 1, Tuples("c", 7, 6.0))
since the only virtual IDs that both input streams share are 2, 3, and 20. More details & examples at: [ch.epfl.dias.cs422.helpers.builder.skeleton.logical.LogicalReconstruct]
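The core of Reconstruct is intersecting virtual-ID ranges. A minimal sketch, assuming the RLEentry shape from the example above (start virtual ID, run length, values) and ignoring the streaming/synchronization aspect:

```scala
case class RLEentry(startVID: Long, length: Long, value: IndexedSeq[Any])

// If two runs overlap on virtual IDs, emit one run covering the shared IDs,
// concatenating the value tuples from both sides; otherwise emit nothing.
def intersect(l: RLEentry, r: RLEentry): Option[RLEentry] = {
  val start = math.max(l.startVID, r.startVID)
  val end   = math.min(l.startVID + l.length, r.startVID + r.length) // exclusive
  if (start < end) Some(RLEentry(start, end - start, l.value ++ r.value))
  else None
}

// intersect(RLEentry(2, 2, Vector("a")), RLEentry(0, 11, Vector(3.0)))
//   == Some(RLEentry(2, 2, Vector("a", 3.0)))
```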
In the third subtask you should implement a filter, a join, a project and an aggregate ([ch.epfl.dias.cs422.rel.early.volcano.rle.RLEFilter], [ch.epfl.dias.cs422.rel.early.volcano.rle.RLEJoin], [ch.epfl.dias.cs422.rel.early.volcano.rle.RLEProject], [ch.epfl.dias.cs422.rel.early.volcano.rle.RLEAggregate]) that operate on RLE data without decompressing it.
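For intuition only (a sketch reusing the RLEentry case class from the Reconstruct example above, not the skeleton's interfaces): a filter can evaluate its predicate once per run and keep or drop whole runs, and an aggregate such as SUM can weight each run's value by its length:

```scala
// Filter: one predicate evaluation per run, never per repeated tuple.
def rleFilter(runs: Iterator[RLEentry],
              predicate: IndexedSeq[Any] => Boolean): Iterator[RLEentry] =
  runs.filter(run => predicate(run.value))

// SUM over column `col`: each run contributes value * run length.
def rleSum(runs: Iterator[RLEentry], col: Int): Double =
  runs.map(run => run.value(col).asInstanceOf[Double] * run.length).sum
```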
In this task you will implement new optimization rules to "teach" the query optimizer possible plan transformations. Then you are going to use these optimization rules to push decodes and reconstructions as high as possible in the query plans.
We use Apache Calcite as the query parser and optimizer. Apache Calcite is an open-source, easy-to-extend Apache project used by many commercial and non-commercial systems. In this task you are going to implement a few new optimization rules that allow the parsed query to be transformed into a new query plan.
All optimization rules in Calcite inherit from RelOptRule, and in [ch.epfl.dias.cs422.rel.early.volcano.rle.qo] you can find the boilerplate for the rules you are asked to implement. Note that these rules operate on top of the Logical operators, not the operators you implemented; the Logical operators are translated into your operators in a later step.
In this subtask you are asked to implement three rules (a sketch of the general rule shape follows the list):
- [ch.epfl.dias.cs422.rel.early.volcano.rle.qo.FilterDecodeTransposeRule] to pull a Decode above a Filter,
- [ch.epfl.dias.cs422.rel.early.volcano.rle.qo.AggregateDecodeTransposeRule] to pull a Decode above an Aggregation, and
- [ch.epfl.dias.cs422.rel.early.volcano.rle.qo.JoinDecodeTransposeRule] to pull a Decode above a Join.
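A hedged sketch of the general shape such a transpose rule might take: the Calcite machinery (RelOptRule, operand, onMatch, transformTo) is the real, standard API, but LogicalDecode, its import path, and its copy() contract are assumptions made by analogy with LogicalReconstruct; the actual rules in the boilerplate will differ.

```scala
import java.util.Collections

import org.apache.calcite.plan.{RelOptRule, RelOptRuleCall}
import org.apache.calcite.rel.RelNode
import org.apache.calcite.rel.logical.LogicalFilter

// Assumption: a logical Decode operator exists next to LogicalReconstruct;
// its name and copy() contract are guesses, not the skeleton's actual API.
import ch.epfl.dias.cs422.helpers.builder.skeleton.logical.LogicalDecode

class FilterDecodeTransposeSketch
    extends RelOptRule(
      // Match a LogicalFilter whose single input is a LogicalDecode.
      RelOptRule.operand(classOf[LogicalFilter],
        RelOptRule.operand(classOf[LogicalDecode], RelOptRule.any()))) {

  override def onMatch(call: RelOptRuleCall): Unit = {
    val filter = call.rel[LogicalFilter](0)
    val decode = call.rel[LogicalDecode](1)
    // Decode only repeats values, so the same condition is valid on the
    // compressed input: it is evaluated once per run instead of per tuple.
    val pushed = filter.copy(filter.getTraitSet, decode.getInput(0), filter.getCondition)
    // Re-create the Decode on top of the pushed-down filter.
    call.transformTo(
      decode.copy(decode.getTraitSet, Collections.singletonList[RelNode](pushed)))
  }
}
```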
In this task you should implement [ch.epfl.dias.cs422.rel.early.volcano.rle.qo.FilterReconstructTransposeRule] which, given a filter on top of a Reconstruct, tries to push the filter into the inputs of the Reconstruct.
Hint: we suggest you start by looking at the documentation and the classes that appear in the boilerplate code.
This task focuses on the column-at-a-time execution model, building up gradually from operator-at-a-time execution over columnar data.
A fundamental building block for implementing vector-at-a-time execution is the selection vector. In this task you should implement the Filter ([ch.epfl.dias.cs422.rel.early.operatoratatime.Filter]), Project ([ch.epfl.dias.cs422.rel.early.operatoratatime.Project]), Join ([ch.epfl.dias.cs422.rel.early.operatoratatime.Join]), Scan ([ch.epfl.dias.cs422.rel.early.operatoratatime.Scan]), Sort ([ch.epfl.dias.cs422.rel.early.operatoratatime.Sort]) and Aggregate ([ch.epfl.dias.cs422.rel.early.operatoratatime.Aggregate]) operators for operator-at-a-time execution over columnar inputs. Your implementation should be based on selection vectors and (Tuple => Tuple) evaluators (similar to the evaluators of project 0). That is, all operators receive one extra column of Booleans (the last column) that signifies which of the input tuples are active. The Filter, Scan and Project should not prune tuples, but only set the selection vector. For the Join and Aggregate you are free to choose whether they generate only active tuples or also produce inactive tuples, as long as you conform to the operator interface (the extra Boolean column).
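To illustrate the selection-vector convention, here is a sketch with stand-in types; the skeleton's actual Tuple, column, and (Tuple => Tuple) evaluator types differ (a Boolean-returning predicate stands in for the evaluator here):

```scala
// Stand-in types: a column is a sequence of values; the last column of the
// input is the Boolean selection vector.
type Column = IndexedSeq[Any]

def filterOperatorAtATime(columns: IndexedSeq[Column],
                          predicate: IndexedSeq[Any] => Boolean): IndexedSeq[Column] = {
  val sel = columns.last.asInstanceOf[IndexedSeq[Boolean]]
  val newSel: Column = sel.indices.map { i =>
    // Reconstruct row i, evaluate the predicate once per tuple, and AND it
    // with the old flag: tuples are never pruned, only marked inactive.
    sel(i) && predicate(columns.dropRight(1).map(_(i)))
  }
  columns.dropRight(1) :+ newSel
}
```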
In this task you should implement Filter ([ch.epfl.dias.cs422.rel.early.columnatatime.Filter]), Project ([ch.epfl.dias.cs422.rel.early.columnatatime.Project]), Join ([ch.epfl.dias.cs422.rel.early.columnatatime.Join]), Scan ([ch.epfl.dias.cs422.rel.early.columnatatime.Scan]), Sort ([ch.epfl.dias.cs422.rel.early.columnatatime.Sort]) and Aggregate ([ch.epfl.dias.cs422.rel.early.columnatatime.Aggregate]) for column-at-a-time execution over columnar inputs with selection vectors. This time, instead of using the evaluators that work on tuples (Tuple => Tuple), you should use the provided map-based functions that evaluate one expression over the full input (Indexed[HomogeneousColumn] => HomogeneousColumn).
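The difference from the previous task, sketched with stand-in types (HomogeneousColumn and the evaluator type come from the provided library; here they are modeled as plain sequences):

```scala
// Stand-in: a HomogeneousColumn is modeled here as a plain IndexedSeq.
type HomogeneousColumn = IndexedSeq[Any]

def filterColumnAtATime(
    columns: IndexedSeq[HomogeneousColumn],
    evaluator: IndexedSeq[HomogeneousColumn] => HomogeneousColumn
): IndexedSeq[HomogeneousColumn] = {
  val oldSel = columns.last.asInstanceOf[IndexedSeq[Boolean]]
  // One call evaluates the expression for the whole input and returns a
  // whole column of Booleans, instead of one call per tuple.
  val predCol = evaluator(columns.dropRight(1)).asInstanceOf[IndexedSeq[Boolean]]
  val newSel  = oldSel.zip(predCol).map { case (a, b) => a && b }
  columns.dropRight(1) :+ newSel
}
```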
The skeleton codebase is pre-configured for development in the latest IntelliJ version (2020.3+), which is the only supported IDE. You are free to use any other IDE and/or IntelliJ version, but it will be your sole responsibility to fix any configuration issues you encounter, including the fact that other IDEs may not display the provided documentation.
After you install IntelliJ on your machine, from the File menu select New->Project from Version Control. Then, on the left-hand side panel, pick Repository URL. On the right-hand side pick:
- Version control: Git
- URL: https://gitlab.epfl.ch/DIAS/COURSES/CS-422/2021/students/Project-1-<username> or git@gitlab.epfl.ch:DIAS/COURSES/CS-422/2021/students/Project-1-<username>, depending on whether you have set up SSH keys (where <username> is your GitLab username).
- Directory: anything you prefer, but in the past we have seen issues with non-ASCII paths (such as French punctuation), spaces and symlinks.
IntelliJ will clone your repository and set up the project environment. If you are prompted to import or auto-import the project, please accept. If the JDK is not found, please use IntelliJ's option to Download JDK, so that IntelliJ installs the JDK in a location that will not change your system settings and the IDE automatically configures the project paths to point to this JDK.
The provided repository (https://gitlab.epfl.ch/DIAS/COURSES/CS-422/2021/students/Project-1-<username>) is personal and you are free to push code/branches as you wish. The grader will run on all the branches, but for the final submission only the master branch will be taken into consideration.
The skeleton code depends on a library we provide to integrate the project with a state-of-the-art query optimizer, Apache Calcite. Additional information on Calcite can be found on its official site https://calcite.apache.org and its documentation site https://calcite.apache.org/javadocAggregate/ .
Documentation for the integration functions and helpers we provide as part of the project-to-Calcite integration code can be found either by browsing the javadoc of the dependency jar (External Libraries/ch.epfl.dias.cs422:base), or by browsing http://diascld24.iccluster.epfl.ch:8080/ch/epfl/dias/cs422/helpers/index.html while on the VPN.
If, while browsing the code, IntelliJ shows a block:
/**
* @inheritdoc
*/
next to it, look near the column with the line numbers: the latest versions of IntelliJ display a paragraph symbol to Toggle Render View (Reader Mode), which gets IntelliJ to show the inherited documentation properly rendered.
In addition to the documentation inherited through inheritdoc, you may want to browse the documentation of the parent classes (including the skeleton operators and the parent Operator and [ch.epfl.dias.cs422.helpers.rel.RelOperator] classes). Documentation of constructors' input arguments and examples is not copied by IntelliJ's inheritdoc command, so please visit the parent classes for such details.
Submit your code and a short report by pushing them to your personal GitLab project before the deadline. The repositories will be frozen after the deadline and we are going to run the automated grader on the final tests.
We will grade the last commit on the master branch of your GitLab repository. In the context of this project you only need to modify the `ch.epfl.dias.cs422.rel` package. Apart from the credentials required to get access to the repository, there is nothing else to submit on Moodle for this project. Your repository must contain a `Report.pdf` or `report.md`: a short report that gives a small overview of the peculiarities of your implementation and any additional details/notes you want to submit. If you submit the report in markdown format, you are responsible for making sure it renders properly in gitlab.epfl.ch's preview.
To evaluate your solution, run your code with the provided tests ([ch.epfl.dias.cs422.QueryTest] class). The grader will run similar checks.
Keep in mind that we will test your code automatically. We will harshly penalize implementations that change the original skeleton code and implementations which are specific to the given datasets. More specifically, you should not change the function and constructor signatures provided in the skeleton code, or make any other change that will break interoperability with the base library.
You are allowed to add new classes, files and packages, but only under the current package. Any code outside the current package will be ignored and not graded. You are free to edit the Main.scala file and/or create new tests, but we are going to ignore such changes during grading.
For the actual grading we will slightly modify the tests/datasets in the grader. Hence, your reported failures/successes in the existing tests only approximately determine your final grade. However, the new tests will be very similar.
Tests that time out will lose all the points for the timed-out test cases, as if they had returned wrong results.
Additionally, a small portion of the grade will be reserved for matching or exceeding the performance of the "baseline" solution. We will maintain a score-board that reports the percentage of passed tests and the execution time for each submitted commit (no names will be displayed). To access the score-board, while on the VPN, visit: http://diascld24.iccluster.epfl.ch:28423/public/dashboard/c7200a68-3b8b-4abe-b07e-78e3b6a2d3e7 . Note that there may be some delay between updates.