Cantor is a persistent data abstraction layer; it provides functionality to query and retrieve data stored as key/value pairs, sorted sets, maps of key/values, or multi-dimensional time-series data points.
Cantor can help simplify and reduce the size of the data access layer implementation in an application.
The majority of applications require some form of persistence, and the data access object layer usually accounts for a considerable portion of the code in these applications. This layer usually contains code to initialize and connect to some storage system; a mapping between the layout of the data in storage and its corresponding representation in the application; and code for composing and executing queries against the storage, handling edge cases, handling exceptions, and so on. This is where Cantor can help reduce the code and its complexity.
Some of the commonly used patterns to access data are:

- Store and retrieve single objects; for example, storing user information. The data stored is usually relatively small; it can be a JSON object, a serialized protobuf object, or any other form of data that can be converted to a byte array. Key/value stores such as BerkeleyDB or Redis are usually used for this purpose.
- Store and retrieve collections of data; for example, the list of users in a certain group. Relational databases or indexing engines are usually used for this purpose.
- Store and retrieve temporal data points; for example, metric values or IoT events. Time-series databases such as Elasticsearch, InfluxDB, OpenTSDB, or Prometheus are used for this purpose.
Cantor tries to provide a set of simple yet powerful abstractions that address the essential needs of the above-mentioned use cases. The implementation focuses more on simplicity and usability than on completeness, performance, or scale. It is not a suitable solution for large-scale applications (data sets larger than a few terabytes), nor is it recommended for high-throughput applications (more than a hundred thousand operations per second).
The library allows users to persist and query data stored in one of the following forms:

- Key/value pairs - the key is a unique string and the value is an arbitrary byte array. This is consistent with other key/value storage solutions: there are methods to create and drop namespaces, as well as methods to persist and retrieve objects.
- Sorted sets - each set is identified by a unique string as the set name and contains a number of entries, each associated with a numerical value as the weight of the entry. Functions on sets allow users to create and drop namespaces, as well as to slice and paginate sets based on the weight of the entries.
- Fat Events - multi-dimensional time-series data points, where each data point has a timestamp along with an arbitrary list of metadata (key/value strings), a number of dimensions (double values), and a payload (arbitrary byte array).
These data structures can be used to solve a variety of use cases for applications, and they are straightforward to implement simply and efficiently on top of relational databases. Cantor provides this implementation. It also tries to eliminate some of the complexities around relational databases, such as joins, constraints, and stored procedures. The aim of the library is to provide a simple yet powerful set of abstractions so that users can spend more time on the application's business logic rather than on data storage and retrieval.
There are three main interfaces exposed for users to interact with: the `Objects` interface for key/value pairs; the `Sets` interface for persisted sorted sets; and the `Events` interface for time-series data.

All methods expect a `namespace` parameter which can be used to slice data into multiple physically separate databases.
A namespace must first be created by calling the `create(namespace)` method. It is also possible to drop the whole namespace by calling `drop(namespace)`, after which any call against that namespace will result in an `IOException`.
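As a rough sketch of this namespace lifecycle (assuming the embedded `cantor-h2` module described below, and that `CantorOnH2` is constructed with a path to its database files; see the javadocs for the exact constructors):

```java
import com.salesforce.cantor.Cantor;
import com.salesforce.cantor.h2.CantorOnH2;

public class NamespaceLifecycle {
    public static void main(String[] args) throws Exception {
        // assumption: CantorOnH2 is constructed with a path for the embedded H2 database files
        final Cantor cantor = new CantorOnH2("/tmp/cantor-example");

        // a namespace has to be created before any other call against it
        cantor.objects().create("my-namespace");

        // ... store, retrieve, and delete objects here ...

        // dropping the namespace removes it along with all of its data;
        // further calls against the dropped namespace result in an IOException
        cantor.objects().drop("my-namespace");
    }
}
```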
Internally, all objects are stored in a table with two columns: a string column for the key, and a blob column for the value; similar to the table below:

key | value |
---|---|
object-key-1 | (byte array) |
object-key-2 | (byte array) |
object-key-3 | (byte array) |
Operations on key/value pairs are defined in the `Objects` interface:

- `create`/`drop` to create a new namespace (i.e., database) or to delete a namespace.
- `store`/`get`/`delete` to store, retrieve, and delete a single key/value pair or a batch of them.
- `keys` to get the list of all keys in a namespace; for example, calling `keys` against the above table would return a list of strings containing `{'object-key-1', 'object-key-2', 'object-key-3'}`.
- `size` to get the count of all objects stored in the namespace.
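A minimal sketch of these operations, assuming the embedded H2 implementation; the pagination parameters on `keys` are an assumption, so check the `Objects` javadocs for the exact signatures:

```java
import com.salesforce.cantor.Cantor;
import com.salesforce.cantor.Objects;
import com.salesforce.cantor.h2.CantorOnH2;

import java.nio.charset.StandardCharsets;
import java.util.Collection;

public class ObjectsExample {
    public static void main(String[] args) throws Exception {
        final Cantor cantor = new CantorOnH2("/tmp/cantor-objects-example");  // constructor argument is an assumption
        final Objects objects = cantor.objects();

        objects.create("users");

        // store a key/value pair; the value is an arbitrary byte array
        objects.store("users", "object-key-1", "{\"name\":\"jane\"}".getBytes(StandardCharsets.UTF_8));

        // retrieve it back
        final byte[] value = objects.get("users", "object-key-1");
        System.out.println(new String(value, StandardCharsets.UTF_8));

        // list keys (assuming start/count pagination parameters) and get the total count
        final Collection<String> keys = objects.keys("users", 0, 100);
        System.out.println(keys + " total=" + objects.size("users"));

        // remove the object, then the whole namespace
        objects.delete("users", "object-key-1");
        objects.drop("users");
    }
}
```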
Sorted sets are stored in a table with three columns: a string column for the set name, a string column for the entry, and a long column for the weight associated with the entry; similar to the table below:

set | entry | weight |
---|---|---|
sample-set-1 | entry-1 | 0 |
sample-set-1 | entry-2 | 1 |
sample-set-1 | entry-3 | 2 |
Operations on sorted sets are defined in the `Sets` interface:

- `create`/`drop` to create a new namespace (i.e., database) or to delete a namespace.
- `add`/`delete` to add an entry with a weight to a set, or to remove an entry from a set.
- `entries`/`get` to retrieve entries or pairs of entry/weight from a set, where the weight is between two given values, ordered either ascending or descending. It is also possible to paginate through the results. For example, calling `entries` for `sample-set-1` against the above table returns a list of strings containing `{'entry-1', 'entry-2', 'entry-3'}`; and calling `get` on the same returns `{'entry-1' => 0, 'entry-2' => 1, 'entry-3' => 2}`.
- `union`/`intersect` to get the union or intersection of two sets.
- `pop` to atomically retrieve and remove entries from a set; this is particularly useful for using sorted sets as a producer/consumer buffer or similar use cases.
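A minimal sketch of working with sorted sets, mirroring the range parameters described below (min, max, start, count, ascending); the exact signatures are assumptions, so check the `Sets` javadocs:

```java
import com.salesforce.cantor.Cantor;
import com.salesforce.cantor.Sets;
import com.salesforce.cantor.h2.CantorOnH2;

import java.util.Collection;
import java.util.Map;

public class SetsExample {
    public static void main(String[] args) throws Exception {
        final Cantor cantor = new CantorOnH2("/tmp/cantor-sets-example");  // constructor argument is an assumption
        final Sets sets = cantor.sets();

        sets.create("sample");

        // add entries with weights to a set
        sets.add("sample", "sample-set-1", "entry-1", 0);
        sets.add("sample", "sample-set-1", "entry-2", 1);
        sets.add("sample", "sample-set-1", "entry-3", 2);

        // entries returns only the entry names; get returns entry/weight pairs
        // (parameters assumed to be: min, max, start, count, ascending)
        final Collection<String> entries =
                sets.entries("sample", "sample-set-1", 0, Long.MAX_VALUE, 0, 3, true);
        final Map<String, Long> weighted =
                sets.get("sample", "sample-set-1", 0, Long.MAX_VALUE, 0, 3, true);

        System.out.println(entries);   // [entry-1, entry-2, entry-3]
        System.out.println(weighted);  // {entry-1=0, entry-2=1, entry-3=2}

        sets.drop("sample");
    }
}
```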
Most operations support ranges. A range is described as `count` entries with weight between a `min` and a `max` value, starting from the `start` index; for example, making a `get` call similar to this:

get(namespace, 'sample-set-1', 1, Long.MAX_VALUE, 0, 3, true)

against the above dataset returns a maximum of 3 entries from `sample-set-1` where the weight is between 1 and `Long.MAX_VALUE`, starting from index 0, ordered ascending.
Events represent multi-dimensional time-series data points, with arbitrary metadata key/value pairs, and optionally a `byte[]` payload attached to an event.

Operations on events are defined in the `Events` interface:
- `create`/`drop` to create a new namespace (i.e., database) or to delete a namespace.
- `store` to store a new event in the namespace.
- `get` to fetch events in the namespace matching a query; queries are defined as a map of key/value pairs, where the key is either a dimension key name or a metadata key name (e.g., `Host` or `CPU`), and values are either literal values to match exactly (e.g., `Tenant => 'tenant-1'`) or an operator along with a value (e.g., `Tenant => '~tenant-*'` or `CPU => '>0.30'`). Please see the javadoc for more details on queries.
- `expire` to expire all events in the namespace with a timestamp before some value.
- `count` to count the number of events in the namespace matching a query.
- `aggregate` to retrieve aggregated values of a dimension matching a query. Please see the javadoc for more details on aggregation functions.
- `metadata` to retrieve metadata values for a given metadata key for events matching a query.
- `payloads` to retrieve `byte[]` payloads for all events in the namespace matching a query.
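A rough sketch of storing and querying events, assuming the embedded H2 implementation; the exact shapes of the `store` and `get` calls shown here (millisecond timestamps, separate metadata and dimension query maps) are assumptions, so check the `Events` javadocs before relying on them:

```java
import com.salesforce.cantor.Cantor;
import com.salesforce.cantor.Events;
import com.salesforce.cantor.h2.CantorOnH2;

import java.util.HashMap;
import java.util.Map;

public class EventsExample {
    public static void main(String[] args) throws Exception {
        final Cantor cantor = new CantorOnH2("/tmp/cantor-events-example");  // constructor argument is an assumption
        final Events events = cantor.events();

        events.create("metrics");

        // metadata are string key/values; dimensions are double values
        final Map<String, String> metadata = new HashMap<>();
        metadata.put("Host", "host-1");
        metadata.put("Tenant", "tenant-1");
        final Map<String, Double> dimensions = new HashMap<>();
        dimensions.put("CPU", 0.42);

        final long now = System.currentTimeMillis();
        events.store("metrics", now, metadata, dimensions);

        // query the last hour for events where Tenant matches 'tenant-*' and CPU is above 0.30
        final Map<String, String> metadataQuery = new HashMap<>();
        metadataQuery.put("Tenant", "~tenant-*");
        final Map<String, String> dimensionsQuery = new HashMap<>();
        dimensionsQuery.put("CPU", ">0.30");

        System.out.println(events.get("metrics", now - 3_600_000L, now, metadataQuery, dimensionsQuery));

        events.drop("metrics");
    }
}
```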
Events are internally bucketed into 1-hour intervals and stored in separate tables based on the different dimension and metadata keys associated with an event. For example, an event with dimension `d1` and metadata `m1` is stored in a separate table from one with dimension `d2` and metadata `m2`.

Note that an event cannot contain more than 100 metadata and 400 dimension keys.
The project is divided into a number of sub-modules to allow users to pull in only the dependencies that are necessary.
To use the embedded Cantor library, which is implemented on top of H2, include the following dependency:
<dependency>
<groupId>com.salesforce.cantor</groupId>
<artifactId>cantor-h2</artifactId>
<version>${cantor.version}</version>
</dependency>
To use Cantor on top of MySQL, include the following dependency:
<dependency>
<groupId>com.salesforce.cantor</groupId>
<artifactId>cantor-mysql</artifactId>
<version>${cantor.version}</version>
</dependency>
MySQL (or MariaDB) is a stable, performant, and scalable open source relational database, with a very active community and a variety of tools and services around it.
To use Cantor in client/server mode, execute the `cantor-server.jar` similar to this:
$ java -jar cantor-server.jar <path-to-cantor-server.conf>
And include the following dependency on the client:
<dependency>
<groupId>com.salesforce.cantor</groupId>
<artifactId>cantor-grpc-client</artifactId>
<version>${cantor.version}</version>
</dependency>
More details on how to instantiate instances can be found in the javadocs.
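As an illustrative sketch of instantiation (the constructor arguments shown here, a file path for `CantorOnH2` and host/port/credentials for `CantorOnMysql`, are assumptions; the javadocs are the source of truth):

```java
import com.salesforce.cantor.Cantor;
import com.salesforce.cantor.h2.CantorOnH2;
import com.salesforce.cantor.mysql.CantorOnMysql;

public class InstantiationExample {
    public static void main(String[] args) throws Exception {
        // embedded H2: data files are kept under the given path (assumed constructor)
        final Cantor embedded = new CantorOnH2("/tmp/cantor");

        // MySQL-backed: hostname, port, username, and password (assumed constructor)
        final Cantor onMysql = new CantorOnMysql("localhost", 3306, "root", "");

        // both expose the same Objects/Sets/Events interfaces
        embedded.objects().create("example");
        onMysql.objects().create("example");
    }
}
```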
Clone the repository:
$ git clone https://github.com/salesforce/cantor.git
Starting with docker is the preferred approach:
$ cd /path/to/cantor/
$ ./bang.sh install_skip_tests build_docker run_docker
This will create a Cantor gRPC Server running in a docker container using the configuration file here. Use this configuration file to alter your setup as appropriate (see Cantor Server Configuration Properties).
Running a local instance is the simplest and quickest way to start a Cantor gRPC Server:
$ cd /path/to/cantor/
$ ./bang.sh
or
$ ./bang.sh install_skip_tests run_jar
This will start a local Cantor gRPC Server using the configuration file here.
Running from the IDE is similar, though you'll need to specify the argument `cantor-server/src/main/resources/cantor-server.conf` so the server knows how to start up. If you want to make changes to the server configuration, you need to edit the aforementioned configuration file (see Cantor Server Configuration Properties).
You can kill the application with `Ctrl-C`, or, if you are running in a docker container, with `./bang.sh kill_docker`.
This is a Typesafe config file in which all properties are under the prefix `cantor.`.
Property | Value | Default | Description |
---|---|---|---|
grpc.port | Port Number | 7443 | Server Port |
storage.type | { h2, mysql, s3 } | h2 | Backing storage type |
H2 takes a list of instances across which data will be sharded. Each instance has these properties:
Property | Value | Description |
---|---|---|
path | File Path | The location of the H2 database file |
in-memory | Boolean | Set whether to embed H2 in memory |
compressed | Boolean | Set whether to use compression on the stored data |
username | String | Root Username for the database |
password | String | Root Password for the database |
MySQL takes a list of instances across which data will be sharded. Each instance has these properties:
Property | Value | Description |
---|---|---|
hostname | String | The domain name or IP address where the mysql host is contactable |
port | Port Number | The port where mysql is contactable |
username | String | Username for the database |
password | String | Password for the database |
Unlike H2 and MySQL, only a single instance of S3 can be passed in. No sharding logic is used. The properties are:
Property | Value | Description |
---|---|---|
bucket | String | S3 Bucket Name |
region | String | S3 Region |
sets.type | { h2, mysql } | S3 does not support Sets, so you must specify what will be used when Sets is referenced. |
proxy.host | String | If you need a proxy to connect to S3 specify this property |
proxy.port | Port Number | Set the port number for the proxy if proxy.host is set |
Note: If you enable S3, you must have the credentials for the specified bucket in the proper directory: `~/.aws/credentials`. You can also point to their location using the AWS environment variable `AWS_PROFILE=/path/to/credentials`.
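As an illustrative sketch of what a `cantor-server.conf` could look like, based on the property tables above (the exact nesting and the way instance lists are expressed are assumptions; the sample configuration file shipped with the server is the authoritative reference):

```
# hypothetical cantor-server.conf sketch; the key layout is assumed from the tables above
cantor {
  grpc.port: 7443
  storage.type: h2

  # list of H2 instances to shard across (structure assumed)
  storage.h2: [
    {
      path: "/tmp/cantor-data"
      in-memory: false
      compressed: true
      username: "root"
      password: ""
    }
  ]
}
```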
In both the local and docker cases, connecting to the server is the same. The port `7443` will be exposed by default. Simply use the Java client `CantorOnGrpc` to connect:
final CantorOnGrpc cantor = new CantorOnGrpc("localhost:7443");
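From there the client exposes the same interfaces as the embedded implementations; for example (a sketch, with the import path assumed from the `cantor-grpc-client` module name):

```java
import com.salesforce.cantor.Cantor;
import com.salesforce.cantor.grpc.CantorOnGrpc;

import java.nio.charset.StandardCharsets;

public class ClientExample {
    public static void main(String[] args) throws Exception {
        // the target is the host:port of a running Cantor gRPC server
        final Cantor cantor = new CantorOnGrpc("localhost:7443");

        // same Objects/Sets/Events calls as shown earlier, now going over gRPC
        cantor.objects().create("my-namespace");
        cantor.objects().store("my-namespace", "greeting", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(cantor.objects().get("my-namespace", "greeting"), StandardCharsets.UTF_8));
    }
}
```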