ArcticDB_demo_querybuilder
View in Github | Open in Google ColabArcticDB Query Builder demo
In this demo, we will explore the different functionalities of the QueryBuilder for ArcticDB. We will cover various possibilities of this API, including:¶
- Filtering
- Projections
- Groupbys and Aggregations
- Combinations of the above features
Why use QueryBuilder?¶
- Performance boost via efficient C++ implementation that uses multi-threading
- Efficient data access - only reads the data needed
- For very large data sets some queries are possible that would not fit into memory
Demo setup¶
Necessary packages installation
!pip install arcticdb
Necessary libraries imports
import os
import numpy as np
import pandas as pd
import random
from arcticdb import Arctic, QueryBuilder
from arcticdb.util.test import random_strings_of_length
For this demo we will configure the LMDB file based backend. ArcticDB achieves its high performance and scale when configured with an object store backend (e.g. S3).
arctic = Arctic("lmdb://arcticdb_demo")
You can have an unlimited number of libraries, but we will just create one to start with.
if 'sample' not in arctic.list_libraries():
# library does not already exist
arctic.create_library('sample')
lib = arctic.get_library('sample')
Run the cell to set up preliminary variables. 100,000 unique strings is a pathological case for us, as with the default row-slicing policy there are 100,000 rows per data segment, and so each unique strings will appear around once per data segment in this column.
ten_grouping_values = random_strings_of_length(10, 10, True)
one_hundred_thousand_grouping_values = random_strings_of_length(100_000, 10, True)
rng = np.random.RandomState()
sym_10M = "demo_10M"
sym_100M = "demo_100M"
sym_1B = "demo_1B"
Choose which symbol you want to work with¶
- sym_10M: symbol with 10 million rows
- sym_100M: symbol with 100 million rows
- sym_1B: symbol with 1 billion rows
assign the symbol you want to work with to the sym variable
- example: sym = sym_10M
sym = sym_10M
Run this cell to set up the dataframe according to the symbol name¶
if sym==sym_10M:
num_rows = 10_000_000
elif sym==sym_100M:
num_rows = 100_000_000
elif sym==sym_1B:
num_rows = 1_000_000_000
df = pd.DataFrame(
{
"grouping_column_10": list(random.choices(ten_grouping_values, k=num_rows)),
"grouping_column_100_000": list(random.choices(one_hundred_thousand_grouping_values, k=num_rows)),
"numeric_column": rng.rand((num_rows))
}
)
DEMO STARTS¶
lib.write(sym, df)
Show how the data has been sliced and written to disk.
lib._nvs.read_index(sym)
Show the first 100 rows of data as a sample.
lib.head(sym, n=100).data
Reading¶
Read the symbol without any filtering.
%%time
lib.read(sym)
Most of the time is spent allocating Python strings in the column with 100,000 unique strings, so omitting this column is much faster.
%%time
lib.read(sym, columns=["grouping_column_10", "numeric_column"])
Filtering¶
Note that all of the values in the numeric column are between 0 and 1. Thisquery therefore does not filter out any data. This demonstrates that doing a full table scan does not significantly impact the performance.
q = QueryBuilder()
q = q[q["numeric_column"] < 2.0]
%%time
lib.read(sym, query_builder=q)
Now we are filtering down to approximately 10% of the rows in the symbol. This is faster than reading, as there are now fewer Python strings to allocate.
q = QueryBuilder()
q = q[q["numeric_column"] < 0.1]
%%time
lib.read(sym, query_builder=q).data
lib.read(sym, query_builder=q).data
Projections¶
Creating a new column as a funtion of existing columns and constants is approximately the same speed as a filter that doesn't reduce the amount of data displayed.
q = QueryBuilder()
q = q.apply("new_column", q["numeric_column"] * 2.0)
%%time
lib.read(sym, query_builder=q)
lib.read(sym, query_builder=q).data
Groupbys and Aggregations¶
Grouping is again faster than just reading due to the reduced number of Python string allocations, even with the extra computation performed.
q = QueryBuilder()
q = q.groupby("grouping_column_10").agg({"numeric_column": "mean"})
%%time
lib.read(sym, query_builder=q)
lib.read(sym, query_builder=q).data
Even grouping on a pathologically large number of unique values does not significantly reduce the performance.
q = QueryBuilder()
q = q.groupby("grouping_column_100_000").agg({"numeric_column": "mean"})
%%time
lib.read(sym, query_builder=q)
lib.read(sym, query_builder=q).data
Combinations¶
These operations can be arbitrarily combined in a seqential pipeline.
q = QueryBuilder()
q = q[q["numeric_column"] < 0.1].apply("new_column", q["numeric_column"] * 2.0).groupby("grouping_column_10").agg({"numeric_column": "mean", "new_column": "max"})
%%time
lib.read(sym, query_builder=q)
lib.read(sym, query_builder=q).data