This document describes how to set up and use ModelarDB. Installation instructions are provided for Linux, macOS, FreeBSD, and Windows. To support running ModelarDB in a containerized environment, instructions for setting up a Docker environment are also provided. Once installed, using ModelarDB is consistent across all platforms.
Builds for aarch64 macOS, x86_64 Windows, and x86_64 Linux are created for each commit to the main branch using
GitHub Actions. As these builds are created for each commit, they are not considered stable release builds. Also, since
they are built using GitHub Actions, they are only available for 90 days, as this is GitHub's maximum artifact retention
period for public repositories. The latest builds can be found in the Artifacts window for the latest Workflow that
completed
successfully.
The following commands are for Ubuntu Server. However, equivalent commands should work for other Linux distributions.
- Install build-essential:
sudo apt install build-essential
- Install the Xcode Command Line Developer Tools:
xcode-select --install
- Install cURL as the root user:
pkg install curl
- Install the latest versions of the Microsoft Visual C++ Prerequisites for Rust:
- Microsoft Visual C++ Prerequisites for Rust: see The rustup book.
- Install the latest stable Rust Toolchain.
- Install the Protocol Buffer Compiler.
- Clone the repository:
git clone https://github.com/ModelarData/ModelarDB-RS - Build, test, and run the system using Cargo:
- Debug Build:
cargo build - Release Build:
cargo build --release - Run Tests:
cargo test - Run DBMS Server:
cargo run --bin modelardbd path_to_local_data_folder - Run Client:
cargo run --bin modelardb [server_address] [query_file]
- Debug Build:
- Move
modelardbd,modelardb, andmodelardbbfrom thetargetdirectory to any directory. - Install and test the Python bindings for
modelardb_embeddedusing Python:- Install:
python3 -m pip install . - Run Tests:
python3 -m unittest
- Install:
ModelarDB consists of three binaries and a library with bindings: modelardbd is a DBMS server that manages data and
executes SQL queries, modelardb is a command-line client for connecting to a modelardbd instance and executing
commands and SQL queries, modelardbb is a command-line bulk loader that operates without modelardbd as it reads
from and writes to modelardbd's data folder directly, and modelardb_embedded is an embeddable library for executing
queries against and writing to modelardbd or its data folder directly. modelardbd uses local storage on the edge
and an Amazon S3-compatible or Azure Blob Storage object store in the cloud. modelardbd can be deployed alone or
together with other instances of modelardbd in a cluster architecture, depending on the use case.
modelardbd can be deployed on a single node to manage data in a local folder. modelardbd can also be deployed in a
distributed configuration across edge and cloud. In this configuration, the shared remote object store is responsible
for keeping the database schema consistent across all modelardbd instances in the cluster. While the modelardbd
instances on the edge and in cloud provide the same functionality, the primary purpose of the instances on the edge is
to collect data and transfer it to an object store in the cloud while the primary purpose of the instances in the cloud
is to execute queries on the data in the object store.
Thus, when modelardbd is deployed in edge mode, it executes queries against local storage for low latency queries on
the latest data and when it is deployed in cloud mode, it executes queries against the object store for efficient
analytics on all the data transferred to the cloud. modelardbd implements the Apache Arrow Flight
protocol for looking up the modelardbd instance in
the cloud to use for executing each query, thus providing a workload-balanced interface for querying the data
in the object store using the modelardbd instances in the cloud.
There are three options available when starting modelardbd depending on the desired deployment use case. Each option
has different requirements and supports different features.
- Start
modelardbdin edge mode using only local storage - This is a simple and easy-to-use single-node deployment for use cases where a redundant object store is not required, e.g., when testing or experimenting. In edge mode without an object store, data is ingested to and queried from local storage. Thus,modelardbdinstances in this configuration do not support transferring ingested data to an object store and do not support querying data from the object store. To runmodelardbdin edge mode using only local storage, simply pass the path to the local foldermodelardbdshould use as its data folder:
modelardbd path_to_local_data_folder
# or
modelardbd edge path_to_local_data_folder- Start
modelardbdin edge mode in a cluster - An Amazon S3-compatible or Azure Blob Storage object store is required to startmodelardbdin a cluster. This configuration supports ingesting data to a local folder on the edge, querying the data in the local folder on the edge, and transferring data in the local data folder to the object store. - Start
modelardbdin cloud mode in a cluster - As above, an Amazon S3-compatible or Azure Blob Storage object store is required. This configuration supports ingesting data to a local folder in the cloud, transferring data in the local data folder to the object store, and querying the data in the object store in the cloud.
The following environment variables must be set to appropriate values if an Amazon S3-compatible object store is used so
modelardbd know how to connect to it:
AWS_ENDPOINT
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEYFor example, to use a local instance of MinIO, assuming the access key id KURo1eQeMhDeVsrz and the
secret access key sp7TDyD2ZruJ0VdFHmkacT1Y90PVPF7p have been created through MinIO's command line tool or web
interface, set the environment variables as follows:
AWS_ENDPOINT="http://127.0.0.1:9000"
AWS_ACCESS_KEY_ID="KURo1eQeMhDeVsrz"
AWS_SECRET_ACCESS_KEY="sp7TDyD2ZruJ0VdFHmkacT1Y90PVPF7p"Then, assuming a bucket named wind-turbine has been created through MinIO's command line tool or web interface,
modelardbd can be started in edge mode with transfer of the ingested data to the object store using the
following command:
modelardbd edge path_to_local_data_folder s3://wind-turbinemodelardbd also supports using Azure Blob Storage
for the remote object store. To use Azure Blob Storage, set the following environment variables:
AZURE_STORAGE_ACCOUNT_NAME
AZURE_STORAGE_ACCESS_KEYThen, assuming a container named wind-turbine has been created, modelardbd can be started with the following
command:
modelardbd edge path_to_local_data_folder azureblobstorage://wind-turbineTo run modelardbd in cloud mode simply replace edge with cloud as shown below. Be aware that when running in cloud
mode the modelardbd instance will execute queries against the object store and not against local storage.
modelardbd cloud path_to_local_data_folder s3://wind-turbineNote that modelardbd uses 9999 as the default port. The port can be changed by specifying a different port with
the following environment variable:
MODELARDBD_PORT=8889 # By default modelardbd uses port 9999.Before data can be ingested into modelardbd, tables must be created. modelardbd supports two types of tables,
standard relational tables created with CREATE TABLE statements and time series tables created with CREATE TIME SERIES TABLE
statements. From a user's perspective, a time series table functions like a standard relational table and can be queried
using SQL. However, the implementation of a time series table is highly optimized for time series and a time series table
must contain a single column with timestamps, one or more columns with fields (measurements as floating-point values),
and zero or more columns with tags (metadata as strings). As stated, time series tables can be created using
CREATE TIME SERIES TABLE statements with the column types TIMESTAMP, FIELD, and TAG. For FIELD, an error bound
can optionally be specified in parentheses to enable lossy compression with a per-value error bound. The error bound can
be absolute or relative, e.g.,FIELD(1.0) creates a column with an absolute per-value error bound that allows each value
to deviate by at most 1.0 while FIELD(1.0%) creates a column with a relative per-value error bound that allows each
value to deviate by at most 1.0%. FIELD columns default to an error bound of zero when none is specified. Thus, by
default lossless compression is used and lossy compression is only used if explicitly requested. If the values in a
FIELD column can be computed from other columns they need not be stored. Instead, if a FIELD column is defined using
the syntax FIELD AS (expression), e.g., FIELD AS (column_one + column_two), the values of the FIELD column will be
the result of the expression. As these generated FIELD columns do not store any data, an error bound cannot be defined.
As both CREATE TIME SERIES TABLE and CREATE TABLE are just SQL statements, both types of tables can be created using
modelardb or programmatically using Apache Arrow Flight. For example, a time series table storing a simple multivariate
time series with weather data collected at different wind turbines can be created as follows:
CREATE TIME SERIES TABLE wind_turbine(timestamp TIMESTAMP, wind_turbine TAG, wind_direction FIELD, wind_speed FIELD(1.0%))The following example shows how to create the same time series table in Python using Apache Arrow Flight:
from pyarrow import flight
flight_client = flight.FlightClient("grpc://127.0.0.1:9999")
sql = "CREATE TIME SERIES TABLE wind_turbine(timestamp TIMESTAMP, wind_turbine TAG, wind_direction FIELD, wind_speed FIELD(1.0%))"
ticket = flight.Ticket(sql)
result = flight_client.do_get(ticket)
print(list(result))When running a cluster of modelardbd instances, any instance in the cluster can be used to create tables. The
process for creating a table in a cluster is the same as when creating the table directly on a single modelardbd
instance, as shown above. When the table is created through modelardbd in a cluster, the table is automatically
created in all peer modelardbd instances.
After creating a table, data can be ingested into modelardbd with INSERT in modelardb. Be aware that INSERT
statements currently must contain values for all columns but that the values for generated columns will be dropped by
modelardbd. As parsing INSERT statements add significant overhead, binary data can also be ingested programmatically
using Apache Arrow Flight. For example, this Python example ingests three data points into the time series table
wind_turbine:
import pyarrow
from pyarrow import flight
# Read the data into a PyArrow table.
timestamp = pyarrow.array([100, 200, 300])
wind_turbine = pyarrow.array(["1026", "1026", "1026"])
wind_direction = pyarrow.array([300.0, 300.0, 300.0])
wind_speed = pyarrow.array([4.0, 4.0, 4.0])
names = ["timestamp", "wind_turbine", "wind_direction", "wind_speed"]
table = pyarrow.Table.from_arrays([timestamp, wind_turbine, wind_direction, wind_speed], names=names)
# Push the table to modelardbd's Apache Arrow Flight do_put() endpoint.
flight_client = flight.FlightClient("grpc://127.0.0.1:9999")
upload_descriptor = flight.FlightDescriptor.for_path("wind_turbine")
writer, _ = flight_client.do_put(upload_descriptor, table.schema)
writer.write(table)
writer.close()While this example simply ingests three data points from memory, it is simple to extend such that it reads from other
data sources. For example, this Python
script makes it simple to bulk load
time series from Apache Parquet files with the same schema by reading the Apache Parquet files, creating a time series
table that matches their schema if it does not exist, and transferring the data in the Apache Parquet files to modelardbd
using Apache Arrow Flight.
Time series can also be ingested into modelardbd using
Telegraf with the Apache Arrow Flight output
plugin. By using Telegraf, data points can be
efficiently streamed into modelardbd from a large collection of data
sources such as
MQTT and OPC-UA.
It should be noted that the ingested data is only transferred to the remote object store when modelardbd is deployed
in edge mode in a cluster or in cloud mode in a cluster. When modelardbd is deployed in edge mode without a cluster,
the ingested data is only stored in local storage.
ModelarDB includes a command-line client in the form of modelardb. To interactively execute SQL statements against a
local instance of modelardbd through a REPL, simply run modelardb:
modelardbBe aware that the REPL currently does not support splitting SQL statements over multiple lines, thus SQL statements do
not need to end with a ;. In addition to CREATE TIME SERIES TABLE, ModelarDB also extends SQL with an INCLUDE clause
with the format INCLUDE address[, address]*. When this clause is prepended to a SELECT statement, a modelardbd
instance executes the SELECT statement on the data it manages and forwards the statement to modelardbd instances at
the provided addresses. Afterwards, the modelardbd instance that initially received the query, unions the result from
all modelardbd instances and returns it to the client. As an example, this can be used to execute a SELECT statement
on both the data in the cloud and a specific edge node. Although, be aware that this does not make the edge node
transfer the data it manages to the cloud, only the result of the query. In addition to SQL statements, the REPL also
supports listing all tables using \dt, printing the schema of a table using \d table_name, flushing data in memory
to disk using \f, flushing data in memory and on disk to an object store using \F, and printing operations supported
by the client using \h.
ModelarDB> \dt
wind_turbine
ModelarDB> \d wind_turbine
timestamp: Timestamp(Microsecond, None)
wind_turbine: Utf8
wind_direction: Float32
wind_speed: Float32, Error Bound 1%
ModelarDB> SELECT * FROM wind_turbine LIMIT 5
+-------------------------+--------------+------------+----------------+
| timestamp | wind_turbine | wind_speed | wind_direction |
+-------------------------+--------------+------------+----------------+
| 2020-01-01T00:00:00.000 | KL3773 | 4.7987617 | 12.195049 |
| 2020-01-01T00:00:00.100 | KL3773 | 4.7562086 | 10.183235 |
| 2020-01-01T00:00:00.200 | KL3773 | 4.7384476 | 9.4519857 |
| 2020-01-01T00:00:00.300 | KL3773 | 4.7108905 | 9.8607325 |
| 2020-01-01T00:00:00.400 | KL3773 | 4.6879335 | 10.083183 |
+-------------------------+--------------+------------+----------------+
ModelarDB>If modelardbd is not running on the same host, the host modelardb should connect to must be specified.
modelardbd's Apache Arrow Flight interface accepts requests on port 9999 by default, so it is generally not
necessary to specify a port:
modelardb 10.0.0.37Also, if modelardbd has been configured to use another port using the environment variable MODELARDBD_PORT, the same
port must also be passed to the client:
modelardb 10.0.0.37:9998modelardb can also execute SQL statements from a file passed as a command-line argument:
modelardb 10.0.0.37 path_to_file_with_sql_statements.sqlmodelardbd can also be queried programmatically from many different programming
languages using Apache Arrow Flight. The following Python example shows how
to execute a simple SQL query against modelardbd and process the resulting stream of data points using
pyarrow and pandas. A PEP 249 compatible
connector is also available for Python in the form of PyModelarDB.
from pyarrow import flight
flight_client = flight.FlightClient("grpc://127.0.0.1:9999")
ticket = flight.Ticket("SELECT * FROM wind_turbine LIMIT 10")
flight_stream_reader = flight_client.do_get(ticket)
for flight_stream_chunk in flight_stream_reader:
record_batch = flight_stream_chunk.data
pandas_data_frame = record_batch.to_pandas()
print(pandas_data_frame)When running a cluster of modelardbd instances, it is recommended to use the
Apache Arrow Flight protocol to query the data in
the remote object store. This means that a request is made to the get_flight_info() endpoint first to determine which
modelardbd cloud instance in the cluster should be queried. The workload of the queries sent to this endpoint is
balanced across all modelardbd cloud instances and at least one modelardbd cloud instance must be running for the
endpoint to accept queries. The following Python example shows how to execute a simple workload-balanced SQL query
using a modelardbd instance and how to process the resulting stream of data points using
pyarrow and pandas. It should be noted that
the endpoint is only responsible for workload balancing and that the query is sent directly to the modelardbd cloud
instance chosen by the endpoint which then sends the result set directly back to the client.
from pyarrow import flight
modelardbd_client = flight.FlightClient("grpc://127.0.0.1:9999")
query_descriptor = flight.FlightDescriptor.for_command("SELECT * FROM wind_turbine LIMIT 10")
flight_info = modelardbd_client.get_flight_info(query_descriptor)
endpoint = flight_info.endpoints[0]
cloud_node_url = endpoint.locations[0]
cloud_client = flight.FlightClient(cloud_node_url)
flight_stream_reader = cloud_client.do_get(endpoint.ticket)
for flight_stream_chunk in flight_stream_reader:
record_batch = flight_stream_chunk.data
pandas_data_frame = record_batch.to_pandas()
print(pandas_data_frame)ModelarDB includes an embeddable library in the form of modelardb_embedded. It allows programming languages to execute
queries against or write to modelardbd or a data folder directly. A C-API allows other programming
languages than Rust to also use modelardb_embedded. The location where queries and writes are executed is specified
using the set of open_*() methods and the connect() method. The other methods in modelardb_embedded work for both
local disk, object store, and modelardbd with a few exceptions. Thus, a program can be developed against
a small data set in a local data folder and then scaled by switching to modelardbd.
import modelardb
# Execute queries and write to a data folder on a local disk.
local_data_folder = modelardb.open_local(data_folder_path)
# Execute queries and write to a data folder in a S3-compatible object store.
remote_data_folder = modelardb.open_s3(endpoint, bucket_name, access_key_id, secret_access_key)
# Execute queries and write to a data folder in Microsoft Azure Blob Storage.
remote_data_folder = modelardb.open_azure(account_name, access_key, container_name)
# Execute queries and write to a `modelardbd` instance.
modelardb_node = modelardb.connect(url)When the server is started for the first time, a configuration file is created in the root of the data folder named
modelardbd.toml. This file is used to persist updates to the configuration made using the UpdateConfiguration
action. If the file is changed manually, the changes are only applied when the server is restarted.
ModelarDB can be configured before the server is started using environment variables. A full list of the environment
variables is provided here. If an environment variable is not set, the configuration file will be used. If neither the
environment variable nor the configuration file contains the variable, the specified default value will be used.
| Variable | Default | Description |
|---|---|---|
| MODELARDBD_PORT | 9999 | The port of the server Apache Arrow Flight Server. |
| MODELARDBD_IP_ADDRESS | 127.0.0.1 | The IP address of the Apache Arrow Flight Server. |
| MODELARDBD_MULTIVARIATE_RESERVED_MEMORY_IN_BYTES | 512 MB | The amount of memory to reserve for storing multivariate time series. |
| MODELARDBD_UNCOMPRESSED_RESERVED_MEMORY_IN_BYTES | 512 MB | The amount of memory to reserve for storing uncompressed data buffers. |
| MODELARDBD_COMPRESSED_RESERVED_MEMORY_IN_BYTES | 512 MB | The amount of memory to reserve for storing compressed data buffers. |
| MODELARDBD_TRANSFER_BATCH_SIZE_IN_BYTES | 64 MB | The amount of data that must be collected before transferring a batch of data to the remote object store. |
| MODELARDBD_TRANSFER_TIME_IN_SECONDS | Disabled | The number of seconds between each transfer of data to the remote object store. |
| MODELARDBD_UNCOMPRESSED_DATA_BUFFER_CAPACITY | 65536 | The capacity of each uncompressed data buffer as the number of elements in the buffer where each element is a timestamp and a value. Note that the resulting size of the buffer has to be a multiple of 64 bytes to avoid the actual capacity being larger than the requested due to internal alignment. |
Two different Docker environments are included to make it easy to experiment with the
different use cases of ModelarDB. The first environment sets up a single instance of modelardbd that only uses local
storage. Data can be ingested into this instance, compressed, and saved to local storage. The compressed data in local
storage can then be queried. The second environment covers the more complex use case of ModelarDB where multiple
instances of modelardbd are deployed in a cluster. Two edge nodes and two cloud nodes are set up in the cluster. Data
can be ingested into the edge or cloud nodes, compressed, and transferred to a remote object store. The compressed data
in the remote object store can then be queried through the cloud nodes.
Note that since Rust is a compiled language and a more dynamic ModelarDB configuration might be needed, it is not recommended to use the Docker environments during active development of ModelarDB. They are, however, ideal to use for experimenting with ModelarDB or when developing software that utilizes ModelarDB. Downloading Docker Desktop is recommended to make maintenance of the created containers easier.
Once Docker is set up, the single edge deployment can be started by running the following command from the root of the ModelarDB repository:
docker-compose -p modelardb-single -f docker-compose-single.yml upNote that -p modelardb-single is only used to name the project to make it easier to manage in Docker
Desktop. Once created, the container can be started and stopped using Docker Desktop
or by using the corresponding commands:
docker-compose -p modelardb-single start
docker-compose -p modelardb-single stopThe single edge is running locally on port 9999 and can be accessed using the modelardb client or through Apache
Arrow Flight as described above. Tables can be created and data can be ingested, compressed, and saved to local disk.
The compressed data on local disk can then be queried.
Once Docker is set up, the cluster deployment can be started by running the following command from the root of the ModelarDB repository:
docker-compose -p modelardb-cluster -f docker-compose-cluster.yml upNote that -p modelardb-cluster is only used to name the project to make it easier to manage in Docker Desktop. Once
created, the containers can be started and stopped using Docker Desktop or by using the corresponding commands:
docker-compose -p modelardb-cluster start
docker-compose -p modelardb-cluster stopThe cluster deployment sets up a MinIO object store and a MinIO client is used to initialize the bucket modelardb in
the object store. MinIO can be administered through its web interface. The default username and
password, minioadmin, can be used to log in.
The cluster itself consists of two edge nodes and two cloud nodes. The edge nodes can be accessed using
the URLs grpc://127.0.0.1:9999 and grpc://127.0.0.1:9998, and the cloud nodes using the URLs grpc://127.0.0.1:9997
and grpc://127.0.0.1:9996. Tables can be created and data can be ingested, compressed, and transferred to the object
store through all four nodes. The compressed data in the MinIO object store can then be queried through the cloud nodes.