Skip to content

Metadata

ArcticDB enables you to store arbitrary binary-blobs alongside symbols and versions.

The below example shows a basic example of writing and reading metadata (in this case a Python dictionary):

import arcticdb as adb
# This example assumes the below variables (host, bucket, access, secret) are validly set
ac = adb.Arctic(f"s3://{HOST}:{BUCKET}?access={ACCESS}&secret={SECRET})
library = "my_library"

if library not in ac.list_libraries():
    ac.create_library(library)

library = ac[library]

metadata = {
    "This": "is",
    "a": "Python",
    "Dictionary": "!"
}

lib.write("meta", data=pd.DataFrame(), metadata=metadata)  # or write_metadata can be used - will still create a new version, but doesn't require `data` to be passed in

assert lib.read("meta").metadata == metadata
assert lib.read_metadata("meta").metadata == metadata  # Same as read, but doesn't return data from storage

New versions of symbols do not "inherit" the metadata of a previous version. Metadata needs to be specified explicitly each time that you create a new version of the symbol:

lib.write("new_sym", data=pd.DataFrame(), metadata=metadata)
lib.write("new_sym", data=pd.DataFrame())

assert lib.read("new_sym").metadata is None
assert lib.read("new_sym", as_of=0).metadata == metadata

Serialization Format

We use msgpack serialization for metadata when possible. We support the built-in msgpack types and also:

  • Pandas timestamps pd.Timestamp
  • Python datetime datetime.datetime
  • Python timedelta datetime.timedelta

Documentation of supported msgpack structures is available here. Arrays and maps correspond to Python lists and dicts.

When this msgpack serialization of the metadata fails due to unsupported types we fall back to pickling the metadata. Pickling can have serious downsides as it may not be possible to unpickle data written with one set of library versions from a client with a different set of library versions.

Because of this, we log a warning when metadata gets pickled. You can disable the warning by setting an environment variable ARCTICDB_PickledMetadata_loglevel_str to DEBUG. The log message looks like:

Pickling metadata - may not be readable by other clients

The metadata may be up to 4GB in size.

Practical example - using metadata to track vendor timelines

One common example for metadata is to store the vendor-provided date alongside the version. For example, let's say we are processing three files - data-2004-01-01.csv, data-2004-01-02.csv and data-2004-01-03.csv. Each file name contains a date which we'd like to be able to store along side the version information in ArcticDB.

We can do this using the following code:

import glob
import datetime
import arcticdb as adb

# This example assumes the below variables (host, bucket, access, secret) are validly set
ac = adb.Arctic(f"s3://{HOST}:{BUCKET}?access={ACCESS}&secret={SECRET})
library = "my_library"

if library not in ac.list_libraries():
    ac.create_library(library)

library = ac[library]

file_names = glob.glob('*.csv')  # returns ['data-2004-01-01.csv', 'data-2004-01-02.csv', 'data-2004-01-03.csv']

for i, name in enumerate(file_names):
    data = pd.read_csv('name')
    date = datetime.datetime.strptime(name[5:][:-4], '%Y-%m-%d')

    if i == 0:
        lib.write("symbol", data, metadata={"vendor_date": date})
    else:
        lib.append("symbol", data, metadata={"vendor_date": date})

We'll now use this to read the data along the vendor-provided timeline - that is, we'll retrieve the data as if we had written each file on the day it was generated. We'll read all metadata entries for all versions of the symbol and select the date that is most recent with respect to a given date (in this case 2004-01-02):

# Continuing from the previous code snippet
metadata = [
    (v["version"], lib.read_metadata("symbol", as_of=v["version"]).metadata.get("vendor_date"))
    for v in lib.list_versions("symbol")
]
sorted_metadata = sorted(metadata, key=lambda x: x[1])

version_to_read_index = bisect_right([x[1] for x in sorted_metadata], datetime.datetime(2004, 1, 2))
lib.read("symbol", as_of=sorted_metadata[version_to_read_index - 1][0])

Note that if the data is written across multiple symbols, then ArcticDB Snapshots can be used to achieve the same result.