ArcticDB_demo_compact_data

ArcticDB Data Compaction Demo¶

This notebook demonstrates the compact_data method, which reslices data on disk to optimise read performance, and the accompanying compact_data_explain_plan method.

Key points:

- After running compact_data against a symbol, all chunks on disk are guaranteed to have a row count within 33% of a target value.
  - This target value defaults to the rows_per_segment library config setting if not specified.
- The compaction operates on the latest live version of the symbol, and creates a new version unless the data is already suitably compacted.
- The compact_data method preserves the metadata from the latest live version.
- compact_data_explain_plan is a dry run of the compaction: it shows what the effect of running compact_data would be, without reading any data from or writing any data to disk.
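
As a rough illustration of the 33% guarantee, the sketch below computes the range of row counts that would count as suitably sized for a given target. The helper name `row_count_band` is hypothetical, and the reading of "within 33%" as an inclusive band measured relative to the target is an assumption, not something confirmed by the library.

```python
def row_count_band(target: int, tolerance_pct: int = 33) -> tuple[int, int]:
    """Return the (lowest, highest) row count within tolerance_pct of target.

    Assumption: the band is inclusive and measured relative to the target.
    """
    # Integer arithmetic avoids floating-point rounding at the band edges.
    lowest = -(-target * (100 - tolerance_pct) // 100)  # ceiling division
    highest = target * (100 + tolerance_pct) // 100  # floor division
    return lowest, highest


for target in (10_000, 100_000, 1_000_000):
    print(target, row_count_band(target))
# 10000 (6700, 13300)
# 100000 (67000, 133000)
# 1000000 (670000, 1330000)
```

Under this reading, with the default target of 100,000 any chunk holding between 67,000 and 133,000 rows would be left alone.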

For more information about the on-disk data format, see this section of the documentation, as well as the documentation for the rows_per_segment library configuration option.

Setup¶

In [ ]:

!pip install arcticdb

In [2]:

# To ensure reproducible performance measurements
import os
os.environ["ARCTICDB_VersionStore_NumCPUThreads_int"] = "4"

In [3]:

import numpy as np
import pandas as pd
import arcticdb as adb

In [4]:

rng = np.random.default_rng()

In [5]:

# object store
arctic = adb.Arctic("lmdb://arcticdb_compact_data")

In [6]:

# So that each run of the notebook starts from an empty library
arctic.delete_library("compact_data")
# library - rows_per_segment not specified, so will default to 100,000
lib = arctic.get_library("compact_data", create_if_missing=True)
# symbol
sym = "OHLCV_minutely"

Create Some Data¶

- Simulate open-high-low-close-volume (OHLCV) minutely data since the turn of the millennium.
- The writing pattern is to append one day's worth of data at a time.
- Each call to append creates one data key.
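
The sizes involved can be worked out ahead of time with the standard library alone. The sketch below assumes one row per minute from 08:00 to 16:00 inclusive and one append per weekday (Monday to Friday, no holiday calendar) between 2000-01-01 and 2026-01-01 inclusive; the variable names are illustrative.

```python
from datetime import date, datetime, timedelta

# One row per minute from 08:00 to 16:00, with both endpoints included.
session_start = datetime(2000, 1, 3, 8, 0)
session_end = datetime(2000, 1, 3, 16, 0)
rows_per_day = (session_end - session_start) // timedelta(minutes=1) + 1

# Count weekdays (Mon-Fri) in the date range, endpoints included.
first, last = date(2000, 1, 1), date(2026, 1, 1)
num_days = sum(
    1
    for offset in range((last - first).days + 1)
    if (first + timedelta(days=offset)).weekday() < 5
)

print(rows_per_day)  # 481
print(num_days)  # 6784
print(rows_per_day * num_days)  # 3263104
```

So the append loop produces 6,784 small data keys of 481 rows each, for 3,263,104 rows in total.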

In [7]:

days = pd.date_range("2000-01-01", "2026-01-01", freq="B")
len(days)

Out[7]:

6784

In [8]:

def generate_day_data(day):
    minutely_ts = pd.date_range(day + pd.Timedelta(hours=8), day + pd.Timedelta(hours=16), freq="min")
    num_rows = len(minutely_ts)
    return pd.DataFrame({"open": rng.random(num_rows), "high": rng.random(num_rows), "low": rng.random(num_rows), "close": rng.random(num_rows), "volume": rng.random(num_rows)}, index=minutely_ts)

In [9]:

for day in days:
    df = generate_day_data(day)
    lib.append(sym, df, metadata=f"some metadata {day}")

Reading the symbol now requires 6,784 data keys:¶

In [10]:

%%timeit
lib.read(sym)

45.7 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]:

vit_before_compaction = lib.read(sym)
vit_before_compaction.data

Out[11]:

	open	high	low	close	volume
2000-01-03 08:00:00	0.652239	0.559158	0.377035	0.262168	0.988915
2000-01-03 08:01:00	0.519549	0.814606	0.754541	0.258416	0.147980
2000-01-03 08:02:00	0.261307	0.525803	0.486785	0.956356	0.771978
2000-01-03 08:03:00	0.186911	0.061975	0.462259	0.481657	0.419798
2000-01-03 08:04:00	0.931898	0.025461	0.715260	0.205572	0.600472
...	...	...	...	...	...
2026-01-01 15:56:00	0.017824	0.529770	0.020662	0.079490	0.902616
2026-01-01 15:57:00	0.986598	0.380662	0.423765	0.822334	0.373985
2026-01-01 15:58:00	0.083624	0.424631	0.749354	0.270700	0.977522
2026-01-01 15:59:00	0.642927	0.110095	0.376601	0.399092	0.289307
2026-01-01 16:00:00	0.323974	0.906545	0.268960	0.921583	0.735306

3263104 rows × 5 columns

In [12]:

vit_before_compaction.version

Out[12]:

6783

In [13]:

vit_before_compaction.metadata

Out[13]:

'some metadata 2026-01-01 00:00:00'

Find out what impact calling `compact_data` would have¶

In [14]:

compact_data_info = lib.compact_data_explain_plan(sym)
compact_data_info

Out[14]:

CompactDataInfo(will_do_work=true, version_id_before=6783, version_id_after=6784, num_row_slices_before=6784, num_row_slices_after=33)

- will_do_work: whether calling compact_data would perform any compaction (false if the data is already suitably compacted)
- version_id_before: the version number of the latest live version right now
- version_id_after: the version number of the latest version after calling compact_data
- num_row_slices_before: how many row-slices the data comprises right now
- num_row_slices_after: how many row-slices the data will have after calling compact_data
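
These fields can be turned into a one-line summary, for example for logging before deciding whether to compact. The sketch below is illustrative only: `describe_plan` is a hypothetical helper, and a `SimpleNamespace` stands in for the object returned by compact_data_explain_plan, populated with the values shown in the output above.

```python
from types import SimpleNamespace


def describe_plan(plan) -> str:
    """Summarise an explain-plan result using only the fields listed above."""
    if not plan.will_do_work:
        return f"version {plan.version_id_before}: already compacted, nothing to do"
    return (
        f"version {plan.version_id_before} -> {plan.version_id_after}: "
        f"{plan.num_row_slices_before} row-slices -> {plan.num_row_slices_after}"
    )


# Stand-in for the real result object, using the values printed above.
plan = SimpleNamespace(
    will_do_work=True,
    version_id_before=6783,
    version_id_after=6784,
    num_row_slices_before=6784,
    num_row_slices_after=33,
)
print(describe_plan(plan))  # version 6783 -> 6784: 6784 row-slices -> 33
```

With a real library, the same function would presumably be called as `describe_plan(lib.compact_data_explain_plan(sym))`.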

In addition to the fields shown in the string representation above, there are also lists that show the row-slice boundaries before and after compaction.

Before compaction, every row-slice has 481 rows:

In [15]:

compact_data_info.row_slices_before[:5]

Out[15]:

[0, 481, 962, 1443, 1924]

In [16]:

compact_data_info.row_slices_before[-5:]

Out[16]:

[3261180, 3261661, 3262142, 3262623, 3263104]
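
The boundary lists appear to be cumulative row offsets, so the size of each row-slice is the difference between neighbouring entries. A minimal sketch under that assumption, using the values printed above (`slice_sizes` is a hypothetical helper name):

```python
def slice_sizes(boundaries):
    """Convert cumulative row offsets into per-slice row counts."""
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


first_boundaries = [0, 481, 962, 1443, 1924]
last_boundaries = [3261180, 3261661, 3262142, 3262623, 3263104]

print(slice_sizes(first_boundaries))  # [481, 481, 481, 481]
print(slice_sizes(last_boundaries))  # [481, 481, 481, 481]
```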

After compaction, every row-slice will have 100,000 ±33% rows. This is because we did not specify the `rows_per_segment` argument, and so it used the library default of 100,000:¶

In [17]:

compact_data_info.row_slices_after

Out[17]:

Print the number of rows in each slice after compaction:¶

In [18]:

for idx in range(len(compact_data_info.row_slices_after) - 1):
    print(compact_data_info.row_slices_after[idx + 1] - compact_data_info.row_slices_after[idx])

Note that the row-slices do not all have the same number of rows post-compaction. The reason is explained below.¶

Compact the data!¶

In [19]:

lib.compact_data(sym)

Out[19]:

VersionedItem(symbol='OHLCV_minutely', library='compact_data', data=n/a, version=6784, metadata=None, host='LMDB(path=/users/is/aowens/source/man.arcticdb/arcticdb_link/docs/mkdocs/docs/notebooks/arcticdb_compact_data)', timestamp=1779456252894183043)

Like other modification operations, this returns a `VersionedItem` containing information about the version that was just written¶

The data is unchanged, but now only 33 (larger) data keys need to be read. This is of particular benefit on higher-latency storage where bandwidth isn't an issue, but each IO round trip adds tens or hundreds of milliseconds¶
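
To get a feel for why the number of keys matters on high-latency storage, here is a deliberately crude model. It is not how ArcticDB schedules reads: the function name, the 50 ms round trip, and the concurrency of 16 are all assumptions chosen for illustration, and bandwidth and decoding costs are ignored.

```python
import math


def estimated_fetch_ms(num_keys: int, round_trip_ms: int, concurrency: int) -> int:
    """Toy model: keys are fetched in waves of `concurrency` parallel requests,
    and each wave costs one round trip. Bandwidth and decode time are ignored."""
    waves = math.ceil(num_keys / concurrency)
    return waves * round_trip_ms


# Hypothetical storage: 50 ms per round trip, 16 requests in flight at once.
for num_keys in (6_784, 33):
    print(num_keys, estimated_fetch_ms(num_keys, round_trip_ms=50, concurrency=16))
# 6784 21200
# 33 150
```

Under these assumed numbers, latency alone accounts for roughly 21 seconds before compaction and 0.15 seconds after it.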

In [20]:

%%timeit
lib.read(sym)

13.8 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Even on LMDB storage, which has no network latency, the read runs ~3.3x faster (45.7 ms vs 13.8 ms)!¶

In [21]:

vit_after_compaction = lib.read(sym)
vit_after_compaction.data

Out[21]:

	open	high	low	close	volume
2000-01-03 08:00:00	0.652239	0.559158	0.377035	0.262168	0.988915
2000-01-03 08:01:00	0.519549	0.814606	0.754541	0.258416	0.147980
2000-01-03 08:02:00	0.261307	0.525803	0.486785	0.956356	0.771978
2000-01-03 08:03:00	0.186911	0.061975	0.462259	0.481657	0.419798
2000-01-03 08:04:00	0.931898	0.025461	0.715260	0.205572	0.600472
...	...	...	...	...	...
2026-01-01 15:56:00	0.017824	0.529770	0.020662	0.079490	0.902616
2026-01-01 15:57:00	0.986598	0.380662	0.423765	0.822334	0.373985
2026-01-01 15:58:00	0.083624	0.424631	0.749354	0.270700	0.977522
2026-01-01 15:59:00	0.642927	0.110095	0.376601	0.399092	0.289307
2026-01-01 16:00:00	0.323974	0.906545	0.268960	0.921583	0.735306

3263104 rows × 5 columns

In [22]:

vit_before_compaction.data.equals(vit_after_compaction.data)

Out[22]:

True

A new version has been created:¶

In [23]:

vit_after_compaction.version

Out[23]:

6784

The metadata is the same as the version that got compacted:¶

In [24]:

vit_after_compaction.metadata

Out[24]:

'some metadata 2026-01-01 00:00:00'

Now suppose a data vendor restates a single day's worth of data:¶

In [25]:

day = days[1_000]
df = generate_day_data(day)
lib.update(sym, df)
df

Out[25]:

	open	high	low	close	volume
2003-11-03 08:00:00	0.513891	0.731586	0.751961	0.947970	0.239459
2003-11-03 08:01:00	0.103845	0.013911	0.083932	0.264079	0.696063
2003-11-03 08:02:00	0.584501	0.905620	0.505111	0.143124	0.299208
2003-11-03 08:03:00	0.186388	0.188366	0.507724	0.299007	0.311037
2003-11-03 08:04:00	0.343084	0.009887	0.172661	0.187550	0.848564
...	...	...	...	...	...
2003-11-03 15:56:00	0.809524	0.905820	0.006144	0.731492	0.103905
2003-11-03 15:57:00	0.503292	0.884031	0.742494	0.414768	0.739807
2003-11-03 15:58:00	0.018958	0.534203	0.526111	0.062234	0.926631
2003-11-03 15:59:00	0.533881	0.098154	0.498576	0.255710	0.667218
2003-11-03 16:00:00	0.524923	0.628789	0.003717	0.209919	0.449783

481 rows × 5 columns

In [26]:

compact_data_info = lib.compact_data_explain_plan(sym)
compact_data_info

Out[26]:

CompactDataInfo(will_do_work=true, version_id_before=6785, version_id_after=6786, num_row_slices_before=35, num_row_slices_after=33)

In [27]:

compact_data_info.row_slices_before[4:8]

Out[27]:

[400192, 481000, 481481, 500240]

In [28]:

compact_data_info.row_slices_after[4:6]

Out[28]:

[400192, 500240]

The 3 row-slices spanning rows 400,192 to 500,240 will be combined into 1¶

These cover the timestamps from 2003-03-12 to 2003-12-26 i.e. the date that was restated, plus some surrounding dates.
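
The date range can be recovered from the row offsets with a little arithmetic. The sketch below assumes every weekday from 2000-01-03 onwards contributes exactly 481 rows and that the end offset of a row-slice is exclusive; `business_day` is a hypothetical helper written with the standard library only.

```python
from datetime import date, timedelta

ROWS_PER_DAY = 481
FIRST_BUSINESS_DAY = date(2000, 1, 3)  # a Monday


def business_day(index: int) -> date:
    """Return the index-th weekday (0-based), counting from FIRST_BUSINESS_DAY."""
    weeks, extra_days = divmod(index, 5)
    return FIRST_BUSINESS_DAY + timedelta(weeks=weeks, days=extra_days)


first_row, end_row = 400_192, 500_240  # end_row is treated as exclusive
first_day = business_day(first_row // ROWS_PER_DAY)
last_day = business_day((end_row - 1) // ROWS_PER_DAY)

print(first_day, last_day)  # 2003-03-12 2003-12-26
print(business_day(1_000))  # 2003-11-03
```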

No other row-slices in the data will be compacted, as they are already suitably sized, and therefore do not need to be read, processed, and rewritten. This is why compaction does not necessarily produce a uniform number of rows in each slice: enforcing uniform sizes could cause a small append or update to trigger a rewrite of all of the data during compaction.
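
A quick check of the three affected row-slices shows why they are merged. This sketch takes the boundaries printed in Out[27] and assumes "within 33%" means an inclusive band around the default target of 100,000; `within_band` is a hypothetical helper, not part of the library.

```python
def within_band(row_count: int, target: int, tolerance_pct: int = 33) -> bool:
    """True if row_count lies within tolerance_pct percent of target (inclusive)."""
    # Compare in integer hundredths to avoid floating-point edge cases.
    return target * (100 - tolerance_pct) <= row_count * 100 <= target * (100 + tolerance_pct)


boundaries = [400_192, 481_000, 481_481, 500_240]
target = 100_000

sizes = [end - start for start, end in zip(boundaries, boundaries[1:])]
print(sizes)  # [80808, 481, 18759]
print([within_band(size, target) for size in sizes])  # [True, False, False]

merged = boundaries[-1] - boundaries[0]
print(merged, within_band(merged, target))  # 100048 True
```

The update left two undersized slices (481 and 18,759 rows). Merging them with their neighbour gives a single slice of 100,048 rows, which is back inside the band.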

In [29]:

lib.compact_data(sym)

Out[29]:

VersionedItem(symbol='OHLCV_minutely', library='compact_data', data=n/a, version=6786, metadata=None, host='LMDB(path=/users/is/aowens/source/man.arcticdb/arcticdb_link/docs/mkdocs/docs/notebooks/arcticdb_compact_data)', timestamp=1779456264058020828)

The data has just been compacted, and so compacting again immediately will have no effect:¶

In [30]:

lib.compact_data_explain_plan(sym)

Out[30]:

CompactDataInfo(will_do_work=false, version_id_before=6786, version_id_after=6786, num_row_slices_before=33, num_row_slices_after=33)

If the library default `rows_per_segment` is not appropriate for a symbol, a different value can be passed explicitly to both methods¶

In [31]:

lib.compact_data_explain_plan(sym, rows_per_segment=1_000_000)

Out[31]:

CompactDataInfo(will_do_work=true, version_id_before=6786, version_id_after=6787, num_row_slices_before=33, num_row_slices_after=3)

In [32]:

lib.compact_data(sym, rows_per_segment=1_000_000)

Out[32]:

VersionedItem(symbol='OHLCV_minutely', library='compact_data', data=n/a, version=6787, metadata=None, host='LMDB(path=/users/is/aowens/source/man.arcticdb/arcticdb_link/docs/mkdocs/docs/notebooks/arcticdb_compact_data)', timestamp=1779456264292791517)

This can also be used to slice into smaller segments if this aids read performance¶

In [33]:

lib.compact_data_explain_plan(sym, rows_per_segment=10_000)

Out[33]:

CompactDataInfo(will_do_work=true, version_id_before=6787, version_id_after=6788, num_row_slices_before=3, num_row_slices_after=247)

In [34]:

lib.compact_data(sym, rows_per_segment=10_000)

Out[34]:

VersionedItem(symbol='OHLCV_minutely', library='compact_data', data=n/a, version=6788, metadata=None, host='LMDB(path=/users/is/aowens/source/man.arcticdb/arcticdb_link/docs/mkdocs/docs/notebooks/arcticdb_compact_data)', timestamp=1779456264576206308)
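
As a sanity check on the row-slice counts reported by the explain plans in this notebook, the sketch below derives the fewest and most slices that could satisfy the 33% guarantee for 3,263,104 rows. It assumes an inclusive band around the target, and `slice_count_bounds` is a hypothetical helper rather than anything provided by the library.

```python
def slice_count_bounds(total_rows: int, target: int, tolerance_pct: int = 33):
    """Return (fewest, most) slices possible if every slice is within the band."""
    smallest_slice = -(-target * (100 - tolerance_pct) // 100)  # ceiling division
    largest_slice = target * (100 + tolerance_pct) // 100  # floor division
    fewest = -(-total_rows // largest_slice)  # all slices as large as allowed
    most = total_rows // smallest_slice  # all slices as small as allowed
    return fewest, most


total_rows = 3_263_104
# rows_per_segment -> num_row_slices_after, as reported in this notebook
reported = {100_000: 33, 1_000_000: 3, 10_000: 247}

for target, num_slices in reported.items():
    fewest, most = slice_count_bounds(total_rows, target)
    print(target, (fewest, most), fewest <= num_slices <= most)
# 100000 (25, 48) True
# 1000000 (3, 4) True
# 10000 (246, 487) True
```

All three reported counts fall inside the range implied by the guarantee. The exact count within that range depends on which existing slices were already suitably sized and left untouched.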