Frequently Asked Questions¶

Note

This FAQ document covers multiple topic areas - please see the contents table on the right for more information.

Product¶

What is ArcticDB?¶

ArcticDB is a high performance DataFrame database built for the modern Python Data Science ecosystem. ArcticDB is an embedded database engine - which means that installing ArcticDB is as simple as installing a Python package. This also means that ArcticDB does not require any server infrastructure to function.

ArcticDB is optimised for numerical datasets spanning millions of rows and columns, enabling you to store and retrieve massive datasets within a Pythonic, DataFrame-like API that researchers, data scientists and software engineers will find immediately familiar.

How does ArcticDB differ from the version of Arctic on GitHub?¶

Please see the history page.

How does ArcticDB differ from Apache Parquet?¶

Both ArcticDB and Parquet enable the storage of columnar data without requiring additional infrastructure.

ArcticDB however uses a custom storage format that means it offers the following functionality over Parquet:

Versioned modifications ("time travel") - ArcticDB is bitemporal.
Timeseries indexes. ArcticDB is a timeseries database and as such is optimised for slicing and dicing timeseries data containing billions of rows.
Data discovery - ArcticDB is built for teams. Data is structured into libraries and symbols rather than raw filepaths.
Support for streaming data. ArcticDB is a fully functional streaming/tick database, enabling the storage of both batch and streaming data.
Support for "dynamic schemas" - ArcticDB supports datasets with changing schemas (column sets) over time.
Support for automatic data deduplication.

What sort of data is ArcticDB best suited to?¶

ArcticDB is an OLA(nalytical)P DBMS, rather than an OLT(ransactional)P DBMS.

In practice, this means that ArcticDB is optimised for large numerical datasets and for queries that operate over many rows at a time.

Does ArcticDB require a server?¶

No. ArcticDB is a fully fledged embedded analytical database system, designed for modern cloud and on-premises object storage that does not require a server for any of the core features.

What languages can I use ArcticDB with?¶

Bindings are currently only available for Python.

What is the best practice for saving Data to ArcticDB?¶

Users should consider how they store data based on their specific use cases. See the guide here.

What are the Limitations of ArcticDB being client-side?¶

The serverless nature of ArcticDB provides excellent performance, making it ideal for data science applications where speed and efficiency are key. It ensures atomicity, consistency, and durability but does not isolate transactions. Changes to symbols are performed on a last-writer-wins principle, and without isolation, write-after-read transactions are not supported, which is not suitable for use cases requiring strong transactional guarantees.

What storage options does ArcticDB support?¶

ArcticDB offers compatibility with a wide array of storage choices, both on-premises and in the cloud. It is verified to work with multiple storage systems such as AWS S3, Azure Blob Storage, LMDB, In-memory, Ceph, MinIO (Linux), Pure Flashblade S3, Scality S3, and VAST Data S3, with plans to support additional options soon.

What are the trade offs with ArcticDB Versioning?¶

ArcticDB versions data by default, allowing for point-in-time analysis and efficient data updates, including daily appends and historical corrections, making it ideal for research datasets. The database is capable of de-duplicating data that has not changed between versions, using storage space efficiently. The ArcticDB enterprise tools including data pruning and compaction to help manage storage and data-fragmentation as new versions are created. Storing large numbers of versions of data does require more storage.

More information can be found here!

What granularity of authorization does ArcticDB support?¶

Authentication is at storage account level, and authorization can be done at ArcticDB Library level for most S3 backends (with directory/path permissions), otherwise also at storage account level. There are many third-party authentication and authorization integrations available for the backends.

How is ArcticDB data catalogued and discoverable by consumers?¶

ArcticDB offers capabilities to list libraries and symbols, complete with metadata. You can use these functions to discover and browse data stored in ArcticDB.

How can I get started using ArcticDB?¶

Please see our getting started guide!

Technical¶

Does ArcticDB use SQL?¶

No. ArcticDB enables data access and modifications with a Python API that speaks in terms of Pandas DataFrames. See the reference documentation for more details.

Does ArcticDB de-duplicate data?¶

Yes.

On each write, ArcticDB will check the previous version of the symbol that you are writing (and only this version - other symbols will not be scanned!) and skip the write of identical segments. Please keep in mind however that this is most effective when version n is equal to version n-1 plus additional data at the end - and only at the end! If there is additional data inserted into the in the middle, then all segments occuring after that modification will almost certainly differ. ArcticDB segments data at fixed intervals and data is only de-duplicated if the hashes of the data segments are identical - as a result, a one row offset will prevent effective de-duplication.

Note that this is a library configuration option that is off by default, see help(LibraryOptions) for details of how to enable it.

How does ArcticDB enable advanced analytics?¶

ArcticDB is primarily focused on filtering and transfering data from storage through to memory - at which point Pandas, NumPy, or other standard analytical packages can be utilised for analytics.

That said, ArcticDB does offer a limited set of analytical functions that are executed inside the C++ storage engine offering significant performance benefits over Pandas. For more information, see the documentation for the LazyDataFrame, LazyDataFrameCollection, and QueryBuilder classes.

What does Pickling mean?¶

ArcticDB has two means for storing data:

ArcticDB can store your data using the Arctic On-Disk Storage Format.
ArcticDB can Pickle your data, storing it as a giant binary blob.

(1) is vastly more performant (i.e. reads and writes are faster), space efficient and unlocks data slicing as described in the getting started guide. There are no practical advantages to storing your data as a Pickled binary-blob - other than certain data types must be Pickled for ArcticDB to be able to store them at all!

ArcticDB is only able to store the following data types natively:

Pandas DataFrames
NumPy arrays
Integers (including timestamps - though timezone information in timestamps is removed)
Floats
Bools
Strings (written as part of a DataFrame/NumPy array)

Note that ArcticDB cannot efficiently store custom Python objects, even if inserted into a Pandas DataFrames/NumPy array. Pickled data cannot be index or column-sliced, and neither update nor append primitives will function on pickled data.

How does indexing work in ArcticDB?¶

See the Getting Started page for details of supported index types.

Can I `append` with additional columns / What is Dynamic Schema?¶

You can append (or update) with differing column sets to symbols for which the containing library has Dynamic Schema enabled. See the documentation for the create_library method for more information.

You can also change the type of numerical columns - for example, integers will be promoted to floats on read.

How does ArcticDB segment data?¶

See On Disk Storage Format and the documentation for the rows_per_segment and columns_per_segment library configuration options for more details.

How does ArcticDB handle streaming data?¶

ArcticDB support for streaming data is on our roadmap for the coming months

How does ArcticDB handle concurrent writers?¶

Without a centralised server, ArcticDB does not support transactions. Instead, ArcticDB supports concurrent writers across symbols - but not to a single symbol (unless "staging the writes"). It is up to the writer to ensure that clients do not concurrently modify a single symbol.

In the case of concurrent writers to a single symbol, the behaviour will be last-writer-wins. Data is not lost per se, but only the version of the last-writer will be accessible through the version chain.

To reiterate, ArcticDB supports concurrent writers to multiple symbols, even within a single library.

Note

ArcticDB does support staging multiple single-symbol concurrent writes. See the documentation for staged.

Does ArcticDB cache any data locally?¶

Yes, please see the Runtime Configuration page for details.

How can I enable detailed logging?¶

Please see the Runtime Configuration page for details.

How can I tune the performance of ArcticDB?¶

Please see the Runtime Configuration page for details.

Does ArcticDB support categorical data?¶

ArcticDB currently offers extremely limited support for categorical data. Series and DataFrames with categorical columns can be provided to the write and write_batch methods, and will then behave as expected on read. However, append and update are not yet supported with categorical data, and will raise an exception if attempted. Analytics such as filtering using the LazyDataFrame or QueryBuilder classes is also not supported with categorical data, and will either raise an exception, or give incorrect results, depending on the exact operations requested.

Why do I get a `NormalizationException` about timezones?¶

ArcticDB stores all timezone-aware timestamps as UTC internally. On read, the original timezone is restored by the output layer, which can fail if the timezone cannot be resolved on the host system.

Arrow or Polars output (Windows only)

PyArrow requires a separately installed IANA timezone database on Windows. If it is missing, you will see an error like:

arcticdb.exceptions.NormalizationException: Cannot locate timezone 'UTC':
Unable to get Timezone database version from C:\Users\<user>\Downloads\tzdata

ArcticDB Arrow/Polars output uses PyArrow for timezone conversion, which
requires a timezone database on Windows.
To install, run:
  python -c "from pyarrow.util import download_tzdata_on_windows; download_tzdata_on_windows()"
Details: https://arrow.apache.org/docs/python/install.html#tzdata-on-windows

Running the one-liner above will download the database. This is only needed once per environment.

Pandas output

Timezone handling is delegated to Pandas. If a timezone is not recognized, you may see a NormalizationException wrapping a Pandas timezone lookup failure. Older versions of Pandas use pytz for timezone resolution, while newer versions (2.0+) use the standard library zoneinfo module, which supports a broader set of timezone names. Upgrading Pandas can resolve such errors.

What are the naming rules for symbols, snapshots, and libraries?¶

ArcticDB validates names for symbols, snapshots, and libraries. The rules are enforced when creating new entities. Existing entities with non-conforming names remain accessible and writeable.

Symbol and snapshot names:

Must not be empty.
Maximum length of 254 characters.
Only printable ASCII characters (codes 32 through 126 inclusive) are allowed.
The characters *, <, and > are forbidden.
Numeric symbol keys (integer identifiers) are not subject to these checks.

Library names follow the same rules as above, with additional constraints. Library names may be dot-separated (e.g. research.equities), and each dot-separated part:

Must not be empty (i.e. the name must not contain .. or end with .).
Must not start with / (this causes failures with LMDB and MongoDB backends).

Some storage backends impose additional restrictions:

Azure forbids \ in symbol, snapshot, and library names. The Azure Blob service silently converts \ to /, so the stored path would no longer match the recorded name.
MongoDB forbids / anywhere in the library name.
LMDB on Windows forbids the characters <>:"|?*, and names ending with . or whitespace.

Bypassing symbol/snapshot validation:

If you need to write symbols or snapshots that do not conform to these rules, you can disable strict checking by setting the environment variable ARCTICDB_VersionStore_NoStrictSymbolCheck_int=1. This is not recommended for new data as our testing only covers names that conform to these rules.

See also error codes 7002 and 7003 for the error messages raised on validation failure.

How does ArcticDB handle `NaN`?¶

The handling of NaN in ArcticDB depends on the type of the column under consideration:

For string columns, NaN, as well as Python None, are fully supported.
For floating-point numeric columns, NaN is also fully supported.
For integer numeric columns NaN is not supported. A column that otherwise contains only integers will be treated as a floating point column if a NaN is encountered by ArcticDB, at which point the usual rules around type promotion for libraries configured with or without dynamic schema all apply as usual.