ArcticDB_demo_lmdb
!pip install arcticdb
import time
import numpy as np
import pandas as pd
from datetime import datetime
import arcticdb as adb
ArcticDB Concepts and Terminology
- Namespace – Collections of libraries. Used to separate logical environments from each other. Analogous to database server.
- Library – Contains multiple symbols which are grouped in a certain way (different users, markets etc). Analogous to database.
- Symbol – Atomic unit of data storage. Identified by a string name. Data stored under a symbol strongly resembles a Pandas DataFrame. Analogous to tables.
- Version – Every modifying action (write, append, update) performed on a symbol creates a new version of that object.
- Snapshot – Data associated with all or some symbols at a particular point-in-time can be snapshotted and later retrieved via the read method.
ArcticDB is designed for Time Series data
Let's create a few small dataframes of daily data so we can see what's happening.
daily1 = pd.DataFrame(np.ones((4, 3))*1, index=pd.date_range('1/1/2023', periods=4, freq="D"), columns=list('ABC'))
daily1
| | A | B | C |
---|---|---|---|
2023-01-01 | 1.0 | 1.0 | 1.0 |
2023-01-02 | 1.0 | 1.0 | 1.0 |
2023-01-03 | 1.0 | 1.0 | 1.0 |
2023-01-04 | 1.0 | 1.0 | 1.0 |
daily2 = pd.DataFrame(np.ones((4, 3))*2, index=pd.date_range('1/5/2023', periods=4, freq="D"), columns=list('ABC'))
daily2
| | A | B | C |
---|---|---|---|
2023-01-05 | 2.0 | 2.0 | 2.0 |
2023-01-06 | 2.0 | 2.0 | 2.0 |
2023-01-07 | 2.0 | 2.0 | 2.0 |
2023-01-08 | 2.0 | 2.0 | 2.0 |
daily3 = pd.DataFrame(np.ones((4, 3))*3, index=pd.date_range('1/3/2023', periods=4, freq="D"), columns=list('ABC'))
daily3
| | A | B | C |
---|---|---|---|
2023-01-03 | 3.0 | 3.0 | 3.0 |
2023-01-04 | 3.0 | 3.0 | 3.0 |
2023-01-05 | 3.0 | 3.0 | 3.0 |
2023-01-06 | 3.0 | 3.0 | 3.0 |
Library Management
For this demo we will configure the LMDB file-based backend. ArcticDB achieves its high performance and scale when configured with an object store backend (e.g. S3).
arctic = adb.Arctic("lmdb://arcticdb_demo")
You can have an unlimited number of libraries, but we will just create one to start with.
lib = arctic.get_library('sample', create_if_missing=True)
Reading & writing data
Read a pandas dataframe from source, and write it to the target
ArcticDB generally adheres to a philosophy of Pandas In, Pandas Out: read and write both work with Pandas DataFrames.
Note - within a library it is common to have many thousands of symbols.
write_record = lib.write("DAILY", daily1)
write_record
VersionedItem(symbol='DAILY', library='sample', data=n/a, version=0, metadata=None, host='LMDB(path=/content/arcticdb_demo)')
read_record = lib.read("DAILY")
read_record
VersionedItem(symbol='DAILY', library='sample', data=<class 'pandas.core.frame.DataFrame'>, version=0, metadata=None, host='LMDB(path=/content/arcticdb_demo)')
NB: You can version multiple symbols/tables together with a library-level snapshot!
read_record.data
| | A | B | C |
---|---|---|---|
2023-01-01 | 1.0 | 1.0 | 1.0 |
2023-01-02 | 1.0 | 1.0 | 1.0 |
2023-01-03 | 1.0 | 1.0 | 1.0 |
2023-01-04 | 1.0 | 1.0 | 1.0 |
Modifying Data
ArcticDB supports data modifications such as update and append.
lib.append("DAILY", daily2)
lib.read("DAILY").data
| | A | B | C |
---|---|---|---|
2023-01-01 | 1.0 | 1.0 | 1.0 |
2023-01-02 | 1.0 | 1.0 | 1.0 |
2023-01-03 | 1.0 | 1.0 | 1.0 |
2023-01-04 | 1.0 | 1.0 | 1.0 |
2023-01-05 | 2.0 | 2.0 | 2.0 |
2023-01-06 | 2.0 | 2.0 | 2.0 |
2023-01-07 | 2.0 | 2.0 | 2.0 |
2023-01-08 | 2.0 | 2.0 | 2.0 |
lib.update("DAILY", daily3)
lib.read("DAILY").data
| | A | B | C |
---|---|---|---|
2023-01-01 | 1.0 | 1.0 | 1.0 |
2023-01-02 | 1.0 | 1.0 | 1.0 |
2023-01-03 | 3.0 | 3.0 | 3.0 |
2023-01-04 | 3.0 | 3.0 | 3.0 |
2023-01-05 | 3.0 | 3.0 | 3.0 |
2023-01-06 | 3.0 | 3.0 | 3.0 |
2023-01-07 | 2.0 | 2.0 | 2.0 |
2023-01-08 | 2.0 | 2.0 | 2.0 |
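The result above is consistent with `update` replacing every stored row that falls inside the date range spanned by the incoming frame, while leaving rows outside that range alone. The pandas-only sketch below emulates that effect on frames shaped like `daily1`, `daily2` and `daily3`. The helpers `make_daily` and `emulate_update` are hypothetical names written for this illustration; they are not part of ArcticDB and say nothing about how ArcticDB implements the operation internally.

```python
import numpy as np
import pandas as pd


def make_daily(value, start):
    """Build a 4-day frame filled with `value`, shaped like daily1..daily3."""
    index = pd.date_range(start, periods=4, freq="D")
    return pd.DataFrame(np.ones((4, 3)) * value, index=index, columns=list("ABC"))


def emulate_update(existing, new):
    """Hypothetical pandas-only stand-in for the visible effect of lib.update."""
    first, last = new.index[0], new.index[-1]
    before = existing[existing.index < first]   # rows earlier than the new range
    after = existing[existing.index > last]     # rows later than the new range
    return pd.concat([before, new, after])


daily1 = make_daily(1, "2023-01-01")
daily2 = make_daily(2, "2023-01-05")
daily3 = make_daily(3, "2023-01-03")

appended = pd.concat([daily1, daily2])          # visible effect of lib.append
updated = emulate_update(appended, daily3)      # visible effect of lib.update
print(updated["A"].tolist())
```

Column `A` of `updated` should match the table above: two rows of 1.0, four rows of 3.0, then two rows of 2.0.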
ArcticDB is bitemporal
All ArcticDB operations are versioned - rewind through time to understand historical revisions and enable point-in-time analysis of data!
# Rewind to version...
lib.read("DAILY", as_of=write_record.version).data
| | A | B | C |
---|---|---|---|
2023-01-01 | 1.0 | 1.0 | 1.0 |
2023-01-02 | 1.0 | 1.0 | 1.0 |
2023-01-03 | 1.0 | 1.0 | 1.0 |
2023-01-04 | 1.0 | 1.0 | 1.0 |
ArcticDB supports extremely large DataFrames
One typical use case is to store the history of >100k measures in one dataframe for easy timeseries and cross-sectional analysis.
For this demo notebook we'll just do 10,000 rows of hourly data by 10,000 columns of measures.
n = 10_000
large = pd.DataFrame(np.linspace(1, n, n)*np.linspace(1, n, n)[:,np.newaxis], columns=[f'c{i}' for i in range(n)], index=pd.date_range('1/1/2020', periods=n, freq="h"))
large.tail()
| | c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | ... | c9990 | c9991 | c9992 | c9993 | c9994 | c9995 | c9996 | c9997 | c9998 | c9999 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2021-02-20 11:00:00 | 9996.0 | 19992.0 | 29988.0 | 39984.0 | 49980.0 | 59976.0 | 69972.0 | 79968.0 | 89964.0 | 99960.0 | ... | 99870036.0 | 99880032.0 | 99890028.0 | 99900024.0 | 99910020.0 | 99920016.0 | 99930012.0 | 99940008.0 | 99950004.0 | 99960000.0 |
2021-02-20 12:00:00 | 9997.0 | 19994.0 | 29991.0 | 39988.0 | 49985.0 | 59982.0 | 69979.0 | 79976.0 | 89973.0 | 99970.0 | ... | 99880027.0 | 99890024.0 | 99900021.0 | 99910018.0 | 99920015.0 | 99930012.0 | 99940009.0 | 99950006.0 | 99960003.0 | 99970000.0 |
2021-02-20 13:00:00 | 9998.0 | 19996.0 | 29994.0 | 39992.0 | 49990.0 | 59988.0 | 69986.0 | 79984.0 | 89982.0 | 99980.0 | ... | 99890018.0 | 99900016.0 | 99910014.0 | 99920012.0 | 99930010.0 | 99940008.0 | 99950006.0 | 99960004.0 | 99970002.0 | 99980000.0 |
2021-02-20 14:00:00 | 9999.0 | 19998.0 | 29997.0 | 39996.0 | 49995.0 | 59994.0 | 69993.0 | 79992.0 | 89991.0 | 99990.0 | ... | 99900009.0 | 99910008.0 | 99920007.0 | 99930006.0 | 99940005.0 | 99950004.0 | 99960003.0 | 99970002.0 | 99980001.0 | 99990000.0 |
2021-02-20 15:00:00 | 10000.0 | 20000.0 | 30000.0 | 40000.0 | 50000.0 | 60000.0 | 70000.0 | 80000.0 | 90000.0 | 100000.0 | ... | 99910000.0 | 99920000.0 | 99930000.0 | 99940000.0 | 99950000.0 | 99960000.0 | 99970000.0 | 99980000.0 | 99990000.0 | 100000000.0 |
5 rows × 10000 columns
t1 = time.time()
lib.write('large', large)
t2 = time.time()
print(f'Wrote {n*n/(t2-t1)/1e6:.2f} million floats per second.')
Wrote 13.30 million floats per second.
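To put the reported rate in more familiar units, the arithmetic below converts it into elapsed time and in-memory megabytes per second. It is a back-of-the-envelope sketch: the 13.30 figure is taken from the output above and will differ between machines, and the 8 bytes per value refers to the uncompressed float64 data in memory, not to the size stored on disk.

```python
n = 10_000
floats_written = n * n        # 10,000 rows x 10,000 columns
rate = 13.30e6                # floats per second, copied from the output above
bytes_per_float = 8           # float64, as produced by np.linspace

elapsed_s = floats_written / rate                     # seconds spent in lib.write
data_mb = floats_written * bytes_per_float / 1e6      # uncompressed in-memory size
throughput_mb_s = rate * bytes_per_float / 1e6        # in-memory MB written per second

print(f"{elapsed_s:.1f} s, {data_mb:.0f} MB, {throughput_mb_s:.1f} MB/s")
```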
You can read a subset of rows and columns efficiently, which is essential when the full DataFrame does not fit into RAM.
subframe = lib.read(
"large",
columns=["c0", "c1", "c5000", "c5001", "c9998", "c9999"],
date_range=(datetime(2020, 6, 13, 8), datetime(2020, 6, 13, 13))
).data
subframe
| | c0 | c1 | c5000 | c5001 | c9998 | c9999 |
---|---|---|---|---|---|---|
2020-06-13 08:00:00 | 3945.0 | 7890.0 | 19728945.0 | 19732890.0 | 39446055.0 | 39450000.0 |
2020-06-13 09:00:00 | 3946.0 | 7892.0 | 19733946.0 | 19737892.0 | 39456054.0 | 39460000.0 |
2020-06-13 10:00:00 | 3947.0 | 7894.0 | 19738947.0 | 19742894.0 | 39466053.0 | 39470000.0 |
2020-06-13 11:00:00 | 3948.0 | 7896.0 | 19743948.0 | 19747896.0 | 39476052.0 | 39480000.0 |
2020-06-13 12:00:00 | 3949.0 | 7898.0 | 19748949.0 | 19752898.0 | 39486051.0 | 39490000.0 |
2020-06-13 13:00:00 | 3950.0 | 7900.0 | 19753950.0 | 19757900.0 | 39496050.0 | 39500000.0 |
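The values in this subframe can be checked by hand. Because `large` is built as an outer product, the cell in the i-th row (counting from 1) under column `c{j}` holds i × (j + 1). The output also shows that both ends of `date_range` are included, giving six hourly rows. The standard-library sketch below recomputes the expected rows; `expected_row` is a hypothetical helper used only for this check.

```python
from datetime import datetime, timedelta

start = datetime(2020, 1, 1)                  # first index entry of `large`
lo, hi = datetime(2020, 6, 13, 8), datetime(2020, 6, 13, 13)
columns = [0, 1, 5000, 5001, 9998, 9999]      # numeric part of c0, c1, ...
one_hour = timedelta(hours=1)


def expected_row(ts):
    """Values at timestamp `ts`: (1-based row number) * (1-based column number)."""
    row_number = (ts - start) // one_hour + 1
    return [float(row_number * (c + 1)) for c in columns]


# Both ends of the range are included, so six hourly timestamps are generated.
timestamps = [lo + k * one_hour for k in range((hi - lo) // one_hour + 1)]
for ts in timestamps:
    print(ts, expected_row(ts))
```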
ArcticDB supports extremely long DataFrames
Another typical use case is high frequency data with billions of rows.
For this demo notebook we will just try a modest 100 million rows of second frequency data.
n = 100_000_000
long = pd.DataFrame(np.linspace(1, n, n), columns=['Price'], index=pd.date_range('1/1/2020', periods=n, freq="s"))
long.tail()
| | Price |
---|---|
2023-03-03 09:46:35 | 99999996.0 |
2023-03-03 09:46:36 | 99999997.0 |
2023-03-03 09:46:37 | 99999998.0 |
2023-03-03 09:46:38 | 99999999.0 |
2023-03-03 09:46:39 | 100000000.0 |
t1 = time.time()
lib.write('long', long)
t2 = time.time()
print(f'Wrote {n/(t2-t1)/1e6:.2f} million floats per second.')
Wrote 12.20 million floats per second.
You can query the data with the familiarity of Pandas and the efficiency of C++
For more information please check out our LazyDataFrame and QueryBuilder docs.
%%time
lazy_df = lib.read("long", lazy=True)
lazy_df = lazy_df[(lazy_df["Price"] > 49e6) & (lazy_df["Price"] < 51e6)]
filtered = lazy_df.collect().data
CPU times: user 3.07 s, sys: 525 ms, total: 3.6 s Wall time: 2.3 s
len(filtered)
1999999
filtered.tail()
| | Price |
---|---|
2021-08-13 06:39:54 | 50999995.0 |
2021-08-13 06:39:55 | 50999996.0 |
2021-08-13 06:39:56 | 50999997.0 |
2021-08-13 06:39:57 | 50999998.0 |
2021-08-13 06:39:58 | 50999999.0 |
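The filter expression uses the same syntax as a pandas boolean mask, and the row count follows from the strict inequalities: both bounds are excluded, so the integers from 49,000,001 to 50,999,999 remain, which is 1,999,999 rows. The sketch below runs the equivalent plain-pandas filter on a frame scaled down by a factor of 10,000. The reduced size and the bounds 4,900 and 5,100 are assumptions chosen to keep it fast, and it does not touch ArcticDB at all.

```python
import numpy as np
import pandas as pd

n = 10_000                    # scaled down from 100_000_000
lo, hi = 4_900, 5_100         # scaled-down counterparts of 49e6 and 51e6
small = pd.DataFrame(
    np.linspace(1, n, n),
    columns=["Price"],
    index=pd.date_range("2020-01-01", periods=n, freq="s"),
)

# Same shape of expression as the LazyDataFrame filter, evaluated by pandas.
mask = (small["Price"] > lo) & (small["Price"] < hi)
filtered_small = small[mask]

print(len(filtered_small), filtered_small["Price"].min(), filtered_small["Price"].max())
```

The scaled-down count is 199, that is (5,100 − 4,900) − 1, mirroring (51e6 − 49e6) − 1 at full size.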
Where to go from here?
- Read the docs
- Sign up for Slack via our website
- Check out the code on GitHub