ArcticDB Fundamentals
This tutorial will walk through the fundamentals of ArcticDB:
- Accessing libraries
- Writing data
- Reading data
- Modifying data
To start, let's import Arctic:
from arcticdb import Arctic
Accessing libraries
Connect to your storage:
# Connect using defined keys
ac = Arctic('s3s://s3.eu-west-2.amazonaws.com:arctic-test-aws?access=<access key>&secret=<secret key>')
# Leave the AWS SDK to work out auth details
ac = Arctic('s3s://s3.eu-west-2.amazonaws.com:arctic-test-aws?aws_auth=true')
For more information on how the AWS SDK configures authentication without utilising defined keys, please see the AWS documentation.
Access a library using either the [library_name] notation or the get_library method:
lib = ac['library']
# ...equivalent to...
lib = ac.get_library('library')
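If the library doesn't exist yet, it can be created up front. A minimal sketch, reusing the same library name as above:
# Create the library on first access; returns the existing library otherwise
lib = ac.get_library('library', create_if_missing=True)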
Let's see what data is already present:
>>> lib.list_symbols()
['sym_2', 'sym_1', 'symbol']
ArcticDB API
ArcticDB's API is built around four main primitives that each operate over a single symbol.
- write: Creates a new version consisting solely of the item passed in.
- append: Creates a new version consisting of the item appended to the previously-written data.
- update: Creates a new version consisting of the previous data patched with the provided item.
- read: Retrieves the given version (if no version is provided the latest version is used).
These primitives aren't exhaustive, but they cover most use cases; we'll walk through each of them below.
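As a quick sketch of how those calls look (df, df_later and df_patch stand in for DataFrames; each call is demonstrated in detail in the sections that follow):
lib.write("symbol", df)        # new version containing only df
lib.append("symbol", df_later) # new version = previous data with df_later appended
lib.update("symbol", df_patch) # new version = previous data with df_patch's index range replaced
item = lib.read("symbol")      # latest version; pass as_of=... for an earlier one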
Writing data
Let's start by generating some data. The below snippet generates some random data with a datetime index:
import numpy as np
import pandas as pd
NUM_COLUMNS=10
NUM_ROWS=100_000
df = pd.DataFrame(np.random.randint(0,100,size=(NUM_ROWS, NUM_COLUMNS)), columns=[f"COL_{i}" for i in range(NUM_COLUMNS)], index=pd.date_range('2000', periods=NUM_ROWS, freq='h'))
Let's take this data and write it to ArcticDB:
>>> lib.write("my_data", df)
VersionedItem(symbol=my_data,library=test_fundamentals_1,data=<class 'NoneType'>,version=0,metadata=None,host=local)
Reading data
To read the data, simply use the read primitive:
>>> data = lib.read("my_data")
>>> data
VersionedItem(symbol=my_data,library=test_fundamentals_1,data=<class 'pandas.core.frame.DataFrame'>,version=0,metadata=None,host=local)
Note that you get back a VersionedItem - it allows us to retrieve the version of the written data:
>>> data.version
0
As this is the first write to this symbol, the version is 0. To retrieve the data:
>>> data.data
Slicing and filtering
See the getting started guide for more information.
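As a taste, read accepts date_range and columns arguments to pull back just a slice of the stored data. A minimal sketch - the columns and one-day window below are only examples:
import datetime
# Read two columns over a single day of the stored index
subset = lib.read(
    "my_data",
    date_range=(datetime.datetime(2000, 1, 1), datetime.datetime(2000, 1, 2)),
    columns=["COL_0", "COL_1"],
).data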
Modifying data
Let's append some data. First, note that the data we've written ends in 2011:
>>> data.data.tail()
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 COL_8 COL_9
2011-05-29 11:00:00 44 94 70 32 91 4 35 19 74 53
2011-05-29 12:00:00 79 51 67 48 8 83 46 54 86 38
2011-05-29 13:00:00 60 98 74 4 81 86 64 78 13 32
2011-05-29 14:00:00 27 24 16 6 84 99 11 94 29 4
2011-05-29 15:00:00 81 76 52 93 31 91 64 2 26 78
That's simply because the data we generated started on January 1st, 2000 at 00:00 and consists of 100,000 rows, incrementing one hour at a time. When appending data, the data you are appending must begin after the existing data ends. As a result, let's generate some data that begins in 2012:
df_to_append = pd.DataFrame(np.random.randint(0,100,size=(NUM_ROWS, NUM_COLUMNS)), columns=[f"COL_{i}" for i in range(NUM_COLUMNS)], index=pd.date_range('2012', periods=NUM_ROWS, freq='h'))
Now let's append!
>>> lib.append("my_data", df_to_append)
VersionedItem(symbol=my_data,library=test_fundamentals_1,data=<class 'NoneType'>,version=1,metadata=None,host=local)
append has created a new version of the data. When reading version 0, the data will end in 2011. When reading version 1, the data will end in 2023.
Update
If append can only add data that begins after the existing data ends, that raises the question - how do we mutate existing data?
The answer is that we use the update primitive. update overwrites (creating a new version - nothing is lost!) existing symbol data with the data that is passed in.
Note that the entire range between the first and last index entry of the data passed in is replaced in its entirety in the existing data, adding additional index entries if required. This means update only ever results in contiguous data - see the documentation of update for more information.
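A minimal sketch of what that looks like - the two-hour window on 2011-05-29 is only an example; any existing rows between the first and last index entry of df_update are replaced in the new version:
# Two rows covering 10:00-11:00 on 2011-05-29
df_update = pd.DataFrame(
    np.random.randint(0, 100, size=(2, NUM_COLUMNS)),
    columns=[f"COL_{i}" for i in range(NUM_COLUMNS)],
    index=pd.date_range("2011-05-29 10:00", periods=2, freq="h"),
)
# Creates another new version; stored rows between 10:00 and 11:00 on 2011-05-29 are overwritten
lib.update("my_data", df_update)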
Time travel!
ArcticDB is bitemporal - all new versions are timestamped! Let's pull in the first version of the data, prior to the append:
>>> lib.read("my_date", as_of=0).data.tail()
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 COL_8 COL_9
2011-05-29 11:00:00 44 94 70 32 91 4 35 19 74 53
2011-05-29 12:00:00 79 51 67 48 8 83 46 54 86 38
2011-05-29 13:00:00 60 98 74 4 81 86 64 78 13 32
2011-05-29 14:00:00 27 24 16 6 84 99 11 94 29 4
2011-05-29 15:00:00 81 76 52 93 31 91 64 2 26 78
Note that it ends in 2011 - it's like the append never happened. as_of can take a timestamp (datetime.datetime or pd.Timestamp) as well.
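For example, to read the symbol as it existed at a given wall-clock time (the timestamp below is purely illustrative - it must be on or after the time the version you want was written):
>>> lib.read("my_data", as_of=pd.Timestamp("2024-01-01"))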