ArcticDB LazyDataFrame demo
In this demo, we will explore the DataFrame processing options available in ArcticDB using the LazyDataFrame class. We will cover the main capabilities of this API, including:
- Filtering
- Projections
- Groupbys and Aggregations
- Combinations of the above features
Why perform the processing in ArcticDB?
- Performance boost via efficient C++ implementation that uses multi-threading
- Efficient data access - only reads the data needed
- Some queries on very large datasets become possible even when the data would not fit into memory
Note that all of the operations described here are also available using the legacy QueryBuilder
class, but we think this API is more intuitive!
Demo setup¶
Necessary packages installation
!pip install arcticdb
Necessary libraries imports
import os
import numpy as np
import pandas as pd
import random
import arcticdb as adb
from arcticdb.util.test import random_strings_of_length
For this demo we will configure the LMDB file-based backend. ArcticDB achieves its high performance and scale when configured with an object store backend (e.g. S3).
arctic = adb.Arctic("lmdb://arcticdb_demo")
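For reference, connecting to an object store backend uses the same Arctic constructor with a different URI. A minimal sketch, assuming a hypothetical S3 endpoint, bucket and credentials (all placeholders, not real values); the LMDB connection above is all this demo needs.
# Illustrative only: S3 URI format with placeholder endpoint, bucket and credentials
s3_uri = "s3://s3.REGION.amazonaws.com:BUCKET_NAME?region=REGION&access=ACCESS_KEY&secret=SECRET_KEY"
# arctic = adb.Arctic(s3_uri)  # would connect to the object store instead of LMDB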
You can have an unlimited number of libraries, but we will just create one to start with.
if 'sample' not in arctic.list_libraries():
    # library does not already exist
    arctic.create_library('sample')
lib = arctic.get_library('sample')
Run the cell to set up preliminary variables. 100,000 unique strings is a pathological case for us: with the default row-slicing policy there are 100,000 rows per data segment, so each unique string will appear around once per data segment in this column (a sketch of how this policy can be adjusted follows the next cell).
ten_grouping_values = random_strings_of_length(10, 10, True)
one_hundred_thousand_grouping_values = random_strings_of_length(100_000, 10, True)
rng = np.random.RandomState()
sym_10M = "demo_10M"
sym_100M = "demo_100M"
sym_1B = "demo_1B"
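The row-slicing policy mentioned above is configurable per library. A minimal sketch, assuming a hypothetical library name and using adb.LibraryOptions to halve the default segment size of 100,000 rows; the 'sample' library used in this demo keeps the defaults.
# Sketch only: a library with a smaller row-slicing policy ('sample_small_segments' is illustrative and not used below)
if 'sample_small_segments' not in arctic.list_libraries():
    arctic.create_library('sample_small_segments', library_options=adb.LibraryOptions(rows_per_segment=50_000))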
Choose which symbol you want to work with
- sym_10M: symbol with 10 million rows
- sym_100M: symbol with 100 million rows
- sym_1B: symbol with 1 billion rows
Assign the symbol you want to work with to the sym variable
- example: sym = sym_10M
sym = sym_10M
Run this cell to set up the DataFrame according to the symbol name
if sym == sym_10M:
    num_rows = 10_000_000
elif sym == sym_100M:
    num_rows = 100_000_000
elif sym == sym_1B:
    num_rows = 1_000_000_000

input_df = pd.DataFrame(
    {
        "grouping_column_10": list(random.choices(ten_grouping_values, k=num_rows)),
        "grouping_column_100_000": list(random.choices(one_hundred_thousand_grouping_values, k=num_rows)),
        "numeric_column": rng.rand(num_rows),
    }
)
Demo Start¶
lib.write(sym, input_df)
Show how the data has been sliced and written to disk.
lib._nvs.read_index(sym)
Show the first 100 rows of data as a sample.
lib.head(sym, n=100).data
Reading¶
Read the symbol without any filtering.
%%time
lib.read(sym)
Most of the time is spent allocating Python strings in the column with 100,000 unique strings, so omitting this column is much faster.
%%time
lib.read(sym, columns=["grouping_column_10", "numeric_column"])
Filtering¶
Note that all of the values in the numeric column are between 0 and 1, so this query does not filter out any data. This demonstrates that doing a full table scan does not significantly impact the performance. Also note that the read call itself is now practically instant, as no data is read until collect is called on the LazyDataFrame.
%%time
lazy_df = lib.read(sym, lazy=True)
lazy_df = lazy_df[lazy_df["numeric_column"] < 2.0]
%%time
lazy_df.collect()
Now we are filtering down to approximately 10% of the rows in the symbol. This is faster than reading, as there are now fewer Python strings to allocate.
%%time
lazy_df = lib.read(sym, lazy=True)
lazy_df = lazy_df[lazy_df["numeric_column"] < 0.1]
df = lazy_df.collect().data
df
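For comparison, the same filter expressed with the legacy QueryBuilder class mentioned in the introduction looks roughly like this; a sketch of the older style, with the lazy API above being the recommended approach.
# Legacy equivalent of the lazy filter above
q = adb.QueryBuilder()
q = q[q["numeric_column"] < 0.1]
lib.read(sym, query_builder=q).data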
Projections¶
Creating a new column as a function of existing columns and constants is approximately the same speed as a filter that doesn't reduce the amount of data displayed.
%%time
lazy_df = lib.read(sym, lazy=True)
lazy_df["new_column"] = lazy_df["numeric_column"] * 2.0
df = lazy_df.collect().data
df
Equivalently, use the apply method to achieve the same results.
lazy_df = lib.read(sym, lazy=True)
lazy_df.apply("new_column", lazy_df["numeric_column"] * 2.0)
lazy_df.collect().data
If using apply before the LazyDataFrame object has been created, the col function can be used as a placeholder for column names.
lazy_df = lib.read(sym, lazy=True).apply("new_column", adb.col("numeric_column") * 2.0)
lazy_df.collect().data
Groupbys and Aggregations¶
Grouping is again faster than just reading due to the reduced number of Python string allocations, even with the extra computation performed.
%%time
lazy_df = lib.read(sym, lazy=True)
lazy_df.groupby("grouping_column_10").agg({"numeric_column": "mean"})
df = lazy_df.collect().data
df
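For reference, a roughly equivalent in-memory computation with pandas on input_df would be the following; unlike the lazy version, it requires the whole DataFrame to be held in RAM.
# In-memory pandas equivalent of the grouping above (input_df must fit in memory)
input_df.groupby("grouping_column_10")["numeric_column"].mean()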
Even grouping on a pathologically large number of unique values does not significantly reduce the performance.
%%time
lazy_df = lib.read(sym, lazy=True)
lazy_df.groupby("grouping_column_100_000").agg({"numeric_column": "mean"})
df = lazy_df.collect().data
df
Combinations¶
These operations can be arbitrarily combined in a sequential pipeline.
%%time
lazy_df = lib.read(sym, lazy=True)
lazy_df = (
    lazy_df[lazy_df["numeric_column"] < 0.1]
    .apply("new_column", lazy_df["numeric_column"] * 2.0)
    .groupby("grouping_column_10")
    .agg({"numeric_column": "mean", "new_column": "max"})
)
df = lazy_df.collect().data
df
Batch Operations¶
# Setup two symbols
batch_sym_1 = f'{sym}_1'
batch_sym_2 = f'{sym}_2'
syms = [batch_sym_1, batch_sym_2]
lib.write(batch_sym_1, input_df)
lib.write(batch_sym_2, input_df)
read_batch also has a lazy argument, which returns a LazyDataFrameCollection.
lazy_dfs = lib.read_batch(syms, lazy=True)
lazy_dfs
The same processing operations can be applied to all of the symbols being read in the batch. Note in the cell output that the pipe | is outside the list of LazyDataFrames, so the WHERE clause is applied to all of the symbols.
lazy_dfs = lazy_dfs[lazy_dfs["numeric_column"] < 0.1]
lazy_dfs
Calling collect() on a LazyDataFrameCollection uses read_batch under the hood, and so is generally more performant than serialised read calls.
dfs = lazy_dfs.collect()
dfs
dfs[0].data.head()
dfs[1].data.head()
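For comparison, the serialised approach issues one read per symbol. A sketch that produces the same filtered results one at a time (the serial_dfs list is purely illustrative):
# Serialised alternative: one round trip per symbol instead of a single batched read
serial_dfs = []
for s in syms:
    lazy = lib.read(s, lazy=True)
    lazy = lazy[lazy["numeric_column"] < 0.1]
    serial_dfs.append(lazy.collect().data)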
Separate processing operations can be applied to the individual symbols in the batch if desired.
lazy_dfs = lib.read_batch(syms, lazy=True)
lazy_dfs = lazy_dfs.split()
lazy_dfs
Note in the cell output that the pipes | are now inside the list of LazyDataFrames, so the PROJECT clauses are applied to individual symbols.
lazy_dfs[0].apply("new_column_1", 2 * adb.col("numeric_column"))
lazy_dfs[1].apply("new_column_1", 4 * adb.col("numeric_column"))
lazy_dfs = adb.LazyDataFrameCollection(lazy_dfs)
lazy_dfs
dfs = lazy_dfs.collect()
dfs
dfs[0].data
dfs[1].data
If desired, these two modes of operation can be combined in an intuitive manner.
lazy_dfs = lib.read_batch(syms, lazy=True)
lazy_dfs = lazy_dfs[lazy_dfs["numeric_column"] < 0.1]
lazy_dfs = lazy_dfs.split()
lazy_dfs[0].apply("new_column_1", 2 * adb.col("numeric_column"))
lazy_dfs[1].apply("new_column_1", 4 * adb.col("numeric_column"))
lazy_dfs = adb.LazyDataFrameCollection(lazy_dfs)
lazy_dfs = lazy_dfs[lazy_dfs["new_column_1"] < 0.1]
lazy_dfs
dfs = lazy_dfs.collect()
dfs[0].data
dfs[1].data