Arctic API

The primary API used to access and manage ArcticDB libraries. Use this to get a handle to a Library instance, which can then be used for subsequent operations as documented in the Library API section.

Class Description
Arctic Top-level library management class.
LibraryOptions Configuration options that can be applied when libraries are created.

arcticdb.Arctic

Top-level library management class. Arctic instances can be configured against an S3, Azure, LMDB, or in-memory backing store and enable the creation, deletion and retrieval of Arctic libraries.

__init__

__init__(
    uri: str,
    encoding_version: EncodingVersion = DEFAULT_ENCODING_VERSION,
)

Initializes a top-level Arctic library management instance.

For more information on how to use Arctic Library instances please see the documentation on Library.

PARAMETER DESCRIPTION
uri

URI specifying the backing store used to access, configure, and create Arctic libraries.

S3

The S3 URI connection scheme has the form ``s3(s)://<s3 end point>:<s3 bucket>[?options]``.

Use s3s as the protocol if communicating with a secure endpoint.

Options is a query string that specifies connection specific options as ``<name>=<value>`` pairs joined with
``&``.

Available options for S3:

+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                    | Description                                                                                                                                                     |
+===========================+=================================================================================================================================================================+
| port                      | port to use for S3 connection                                                                                                                                   |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| region                    | S3 region                                                                                                                                                       |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| use_virtual_addressing    | Whether to use virtual addressing to access the S3 bucket                                                                                                       |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| access                    | S3 access key                                                                                                                                                   |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| secret                    | S3 secret access key                                                                                                                                            |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| path_prefix               | Path within S3 bucket to use for data storage                                                                                                                   |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| aws_auth                  | If true, authentication to endpoint will be computed via AWS environment vars/config files. If no options are provided ``aws_auth`` will be assumed to be true. |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+

Note: When connecting to AWS, ``region`` can be automatically deduced from the endpoint if the given endpoint
specifies the region and ``region`` is not set.
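As a minimal sketch of the S3 URI scheme described above, the helper below assembles a URI from an endpoint, a bucket, and option pairs. `build_s3_uri` and its `secure` flag are hypothetical names introduced for illustration and are not part of ArcticDB. Option values are inserted verbatim; whether special characters need escaping is not covered here.

```python
def build_s3_uri(endpoint, bucket, secure=False, **options):
    """Assemble an S3 URI of the form s3(s)://<endpoint>:<bucket>[?options].

    Hypothetical helper for illustration only; not part of ArcticDB.
    """
    # Use s3s when communicating with a secure endpoint.
    scheme = "s3s" if secure else "s3"
    uri = f"{scheme}://{endpoint}:{bucket}"
    if options:
        # Options are <name>=<value> pairs joined with '&'.
        # Values are inserted as given (no escaping is attempted).
        uri += "?" + "&".join(f"{name}={value}" for name, value in options.items())
    return uri


print(build_s3_uri("MY_ENDPOINT", "MY_BUCKET"))
print(build_s3_uri("MY_ENDPOINT", "MY_BUCKET",
                   region="YOUR_REGION", access="ABCD", secret="DCBA"))
```

The resulting string would then be passed as the ``uri`` argument when constructing an ``Arctic`` instance.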

Azure

The Azure URI connection scheme has the form ``azure://[options]``.
It is based on the Azure Connection String, with additional options for configuring ArcticDB.
Please refer to https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string for more details.

``options`` is a string that specifies connection specific options as ``<name>=<value>`` pairs joined with ``;`` (the final key value pair should not include a trailing ``;``).

Additional options specific for ArcticDB:

+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                    | Description                                                                                                                                                   |
+===========================+===============================================================================================================================================================+
| Container                 | Azure container for blobs                                                                                                                                     |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Path_prefix               | Path within Azure container to use for data storage                                                                                                           |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| CA_cert_path              | (Non-Windows platform only) Azure CA certificate path. If not set, default path will be used.                                                                 |
|                           | Note: For Linux distribution, default path is set to ``/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem``.                                                   |
|                           | If the certificate cannot be found in the provided path, an Azure exception with no meaningful error code will be thrown.                                     |
|                           | For more details, please see https://github.com/Azure/azure-sdk-for-cpp/issues/4738.                                                                          |
|                           | For example, ``Failed to iterate azure blobs 'C' 0:``.                                                                                                        |
|                           |                                                                                                                                                               |
|                           | Default certificate path in various Linux distributions:                                                                                                      |
|                           | "/etc/ssl/certs/ca-certificates.crt"                  Debian/Ubuntu/Gentoo etc.                                                                               |
|                           | "/etc/pki/tls/certs/ca-bundle.crt"                    Fedora/RHEL 6                                                                                           |
|                           | "/etc/ssl/ca-bundle.pem"                              OpenSUSE                                                                                                |
|                           | "/etc/pki/tls/cacert.pem"                             OpenELEC                                                                                                |
|                           | "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"   CentOS/RHEL 7                                                                                           |
|                           | "/etc/ssl/cert.pem"                                   Alpine Linux                                                                                            |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
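The joining rule for Azure options (``<name>=<value>`` pairs separated by ``;`` with no trailing ``;``) can be sketched as follows. `build_azure_uri` is a hypothetical helper, not an ArcticDB function; the option values are the placeholder values used in the Examples section of this page.

```python
def build_azure_uri(options):
    """Join <name>=<value> pairs with ';' and prefix the azure:// scheme.

    Hypothetical helper for illustration only; not part of ArcticDB.
    """
    # str.join places ';' only between pairs, so there is no trailing ';'.
    return "azure://" + ";".join(f"{name}={value}" for name, value in options.items())


uri = build_azure_uri({
    "CA_cert_path": "/etc/ssl/certs/ca-certificates.crt",    # ArcticDB-specific option
    "BlobEndpoint": "https://arctic.blob.core.windows.net",  # Azure connection string key
    "Container": "acblob",                                   # ArcticDB-specific option
    "SharedAccessSignature": "sp=sig",                       # placeholder value
})
print(uri)
```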

On Windows, `CA_cert_path` cannot be set. Configure CA certificates through the Windows certificate settings instead.
For details, see https://learn.microsoft.com/en-us/skype-sdk/sdn/articles/installing-the-trusted-root-certificate

Exceptions: Azure exception messages always end with ``{AZURE_SDK_HTTP_STATUS_CODE}:{AZURE_SDK_REASON_PHRASE}``.

Please refer to https://github.com/Azure/azure-sdk-for-cpp/blob/24ed290815d8f9dbcd758a60fdc5b6b9205f74e0/sdk/core/azure-core/inc/azure/core/http/http_status_code.hpp for
more details of provided status codes.

Note that due to a bug in the Azure C++ SDK (https://github.com/Azure/azure-sdk-for-cpp/issues/4738), Azure may not give meaningful status codes and
reason phrases in the exception. To debug such cases, set the environment variable ``AZURE_LOG_LEVEL`` to ``1`` (for example, ``export AZURE_LOG_LEVEL=1``) to turn on the SDK debug logging.
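When triaging Azure failures, it can help to split the trailing ``{AZURE_SDK_HTTP_STATUS_CODE}:{AZURE_SDK_REASON_PHRASE}`` part off the exception message. The sketch below assumes the status code is rendered as digits, as in the ``Failed to iterate azure blobs 'C' 0:`` example, and that the reason phrase contains no colon. `split_azure_error_tail` is a hypothetical helper, and the second message is a made-up illustration rather than real ArcticDB output.

```python
import re

# Assumption: the message ends with "<digits>:<reason phrase without ':'>".
_TAIL = re.compile(r"(\d+):([^:]*)$")


def split_azure_error_tail(message):
    """Return (status_code, reason_phrase) parsed from the end of a message.

    Returns None when the message does not end in the assumed format.
    Hypothetical helper for illustration only; not part of ArcticDB.
    """
    match = _TAIL.search(message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


# Message taken from the CA_cert_path description: code 0 and an empty reason
# phrase are the symptom of the SDK bug mentioned above.
print(split_azure_error_tail("Failed to iterate azure blobs 'C' 0:"))

# Made-up message, for illustration only.
print(split_azure_error_tail("Example failure 404:The specified container does not exist."))
```

A status code of 0 with an empty reason phrase suggests the SDK did not report a meaningful error, which is the case where enabling the SDK debug logging is most useful.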

LMDB

The LMDB connection scheme has the form ``lmdb:///<path to store LMDB files>[?options]``.

Options is a query string that specifies connection specific options as ``<name>=<value>`` pairs joined with
``&``.

+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                    | Description                                                                                                                                                   |
+===========================+===============================================================================================================================================================+
| map_size                  | LMDB map size (see http://www.lmdb.tech/doc/group__mdb.html#gaa2506ec8dab3d969b0e609cd82e619e5). String. Supported formats are:                               |
|                           |                                                                                                                                                               |
|                           | "150MB" / "20GB" / "3TB"                                                                                                                                      |
|                           |                                                                                                                                                               |
|                           | The only supported units are MB / GB / TB.                                                                                                                    |
|                           |                                                                                                                                                               |
|                           | On Windows and MacOS, LMDB will materialize a file of this size, so you need to set it to a reasonable value that your system has                             |
|                           | room for, and it has a small default (order of 1GB). On Linux, this is an upper bound on the space used by LMDB and the default is large                      |
|                           | (order of 100GB).                                                                                                                                             |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

Example connection strings are ``lmdb:///home/user/my_lmdb`` or ``lmdb:///home/user/my_lmdb?map_size=2GB``.
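The following sketch builds an LMDB URI and checks ``map_size`` against the documented formats before use. `build_lmdb_uri` is a hypothetical helper, not an ArcticDB function. It assumes an absolute POSIX path and is deliberately strict: it accepts only a whole number followed by MB, GB, or TB, which may be narrower than what ArcticDB itself accepts.

```python
import re

# A whole number followed by one of the documented units.
_MAP_SIZE = re.compile(r"\d+(MB|GB|TB)")


def build_lmdb_uri(path, map_size=None):
    """Build lmdb:///<path>[?map_size=<size>] from an absolute POSIX path.

    Hypothetical helper for illustration only; not part of ArcticDB.
    """
    if not path.startswith("/"):
        raise ValueError("this sketch expects an absolute path such as /home/user/my_lmdb")
    uri = "lmdb://" + path
    if map_size is not None:
        if not _MAP_SIZE.fullmatch(map_size):
            raise ValueError(f"unsupported map_size {map_size!r}; use e.g. '150MB', '20GB', '3TB'")
        uri += "?map_size=" + map_size
    return uri


print(build_lmdb_uri("/home/user/my_lmdb"))
print(build_lmdb_uri("/home/user/my_lmdb", map_size="2GB"))
```

Validating the size up front is mainly useful on Windows and MacOS, where a file of the requested size is materialized.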

In-Memory

The in-memory connection scheme has the form ``mem://``.

The storage is local to the ``Arctic`` instance.

TYPE: str

encoding_version

The default encoding version to use when creating new libraries with this Arctic instance. Can be overridden by specifying the encoding version in the LibraryOptions argument to create_library.

TYPE: EncodingVersion DEFAULT: DEFAULT_ENCODING_VERSION

Examples:

>>> from arcticdb import Arctic
>>> ac = Arctic('s3://MY_ENDPOINT:MY_BUCKET')  # Leave AWS to derive credential information
>>> ac = Arctic('s3://MY_ENDPOINT:MY_BUCKET?region=YOUR_REGION&access=ABCD&secret=DCBA') # Manually specify creds
>>> ac = Arctic('azure://CA_cert_path=/etc/ssl/certs/ca-certificates.crt;BlobEndpoint=https://arctic.blob.core.windows.net;Container=acblob;SharedAccessSignature=sp=sig')
>>> ac.create_library('travel_data')
>>> ac.list_libraries()
['travel_data']
>>> travel_library = ac['travel_data']
>>> ac.delete_library('travel_data')

create_library

create_library(
    name: str,
    library_options: Optional[LibraryOptions] = None,
) -> Library

Creates the library named name.

Arctic libraries contain named symbols which are the atomic unit of data storage within Arctic. Symbols contain data that in most cases strongly resembles a DataFrame and are versioned such that all modifying operations can be tracked and reverted.

Arctic libraries support concurrent writes and reads to multiple symbols as well as concurrent reads to a single symbol. However, concurrent writers to a single symbol are not supported other than for primitives that explicitly state support for single-symbol concurrent writes.

PARAMETER DESCRIPTION
name

The name of the library that you wish to create.

TYPE: str

library_options

Options to use in configuring the library. If not provided, the defaults documented in LibraryOptions are used.

TYPE: Optional[LibraryOptions] DEFAULT: None

Examples:

>>> arctic = Arctic('s3://MY_ENDPOINT:MY_BUCKET')
>>> arctic.create_library('test.library')
>>> my_library = arctic['test.library']
RETURNS DESCRIPTION
Library that was just created

delete_library

delete_library(name: str) -> None

Removes the library called name. This will remove the underlying data contained within the library and as such will take as much time as the underlying delete operations take.

If no library with name exists then this is a no-op. In particular this method does not raise in this case.

PARAMETER DESCRIPTION
name

Name of the library to delete.

TYPE: str

get_library

get_library(
    name: str,
    create_if_missing: Optional[bool] = False,
    library_options: Optional[LibraryOptions] = None,
) -> Library

Returns the library named name.

This method can also be invoked through subscripting. Arctic('bucket').get_library("test") is equivalent to Arctic('bucket')["test"].

PARAMETER DESCRIPTION
name

The name of the library that you wish to retrieve.

TYPE: str

create_if_missing

If True, and the library does not exist, then create it.

TYPE: Optional[bool] DEFAULT: False

library_options

If create_if_missing is True, and the library does not already exist, then it will be created with these options, or the defaults if not provided. If create_if_missing is True, and the library already exists, ensures that the existing library options match these. Unused if create_if_missing is False.

TYPE: Optional[LibraryOptions] DEFAULT: None

Examples:

>>> arctic = Arctic('s3://MY_ENDPOINT:MY_BUCKET')
>>> arctic.create_library('test.library')
>>> my_library = arctic.get_library('test.library')
>>> my_library = arctic['test.library']
RETURNS DESCRIPTION
Library

get_uri

get_uri() -> str

Returns the URI that was used to create the Arctic instance.

Examples:

>>> arctic = Arctic('s3://MY_ENDPOINT:MY_BUCKET')
>>> arctic.get_uri()
RETURNS DESCRIPTION
The URI that was used to create the Arctic instance (s3://MY_ENDPOINT:MY_BUCKET in the example above).

TYPE: str

has_library

has_library(name: str) -> bool

Query if the given library exists.

PARAMETER DESCRIPTION
name

Name of the library to check the existence of.

TYPE: str

RETURNS DESCRIPTION
True if the library exists, False otherwise.

list_libraries

list_libraries() -> List[str]

Lists all libraries available.

Examples:

>>> arctic = Arctic('s3://MY_ENDPOINT:MY_BUCKET')
>>> arctic.list_libraries()
['test.library']
RETURNS DESCRIPTION
A list of all library names that exist in this Arctic instance.

arcticdb.LibraryOptions

Configuration options that can be applied when libraries are created.

ATTRIBUTE DESCRIPTION
dynamic_schema

See __init__ for details.

TYPE: bool

dedup

See __init__ for details.

TYPE: bool

rows_per_segment

See __init__ for details.

TYPE: int

columns_per_segment

See __init__ for details.

TYPE: int

__init__

__init__(
    *,
    dynamic_schema: bool = False,
    dedup: bool = False,
    rows_per_segment: int = 100000,
    columns_per_segment: int = 127,
    encoding_version: Optional[EncodingVersion] = None
)
PARAMETER DESCRIPTION
dynamic_schema

Controls whether the library supports dynamically changing symbol schemas.

The schema of a symbol refers to the order of the columns and the type of the columns.

If False, then the schema for a symbol is set on each write call, and cannot then be modified by successive updates or appends. Each successive update or append must contain the same column set in the same order with the same types as the initial write.

When disabled, ArcticDB will tile stored data across both the rows and columns. This enables highly efficient retrieval of specific columns regardless of the total number of columns stored in the symbol.

If True, then updates and appends can contain columns not originally seen in the most recent write call. The data will be dynamically backfilled on read when required for the new columns. Furthermore, Arctic will support numeric type promotions should the type of a column change - for example, should column A be of type int32 on write, and of type float on the next append, the column will be returned as a float to Pandas on read. Supported promotions include (narrow) integer to (wider) integer, and integer to float.

When enabled, ArcticDB will only tile across the rows of the data. This will result in slower column subsetting when storing a large number of columns (>1,000).

TYPE: bool DEFAULT: False

dedup

Controls whether calls to write and write_batch will attempt to deduplicate data segments against the previous live version of the specified symbol.

If False, new data segments will always be written for the new version of the symbol being created.

If True, the content hash, start index, and end index of data segments associated with the previous live version of this symbol will be compared with those about to be written, and will not be duplicated in the storage device if they match.

Keep in mind that this is most effective when version n is equal to version n-1 plus additional data at the end - and only at the end! If there is additional data inserted at the start or into the middle, then all segments occurring after that modification will almost certainly differ. ArcticDB creates new segments at fixed intervals and data is only de-duplicated if the hashes of the data segments are identical. A one row offset will therefore prevent this de-duplication.
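The effect of a one row offset can be seen in a toy model of fixed-interval segmentation. This is a simplified sketch, not ArcticDB's implementation: it hashes only the segment contents with SHA-256 (the start and end index comparison is omitted), and `segment_hashes` and `shared_segments` are hypothetical names.

```python
import hashlib


def segment_hashes(rows, rows_per_segment):
    """Cut rows into fixed-size segments and return one content hash per segment."""
    hashes = []
    for start in range(0, len(rows), rows_per_segment):
        chunk = rows[start:start + rows_per_segment]
        hashes.append(hashlib.sha256(repr(chunk).encode()).hexdigest())
    return hashes


def shared_segments(old_rows, new_rows, rows_per_segment):
    """Count segments of the new version whose hash already exists in the old one."""
    old = set(segment_hashes(old_rows, rows_per_segment))
    return sum(h in old for h in segment_hashes(new_rows, rows_per_segment))


previous = list(range(10))       # version n-1: rows 0..9
appended = previous + [10, 11]   # version n: extra rows at the end only
prepended = [-1] + previous      # version n: one extra row at the start

print(shared_segments(previous, appended, 4))   # 2: the two full leading segments match
print(shared_segments(previous, prepended, 4))  # 0: every segment boundary has shifted
```

In this model, appending leaves all complete leading segments reusable, while a single row inserted at the start changes the contents of every segment.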

Note that these conditions will also be checked with write_pickle and write_pickle_batch. However, pickled objects are always written as a single data segment, and so dedup will only occur if the written object is identical to the previous version.

TYPE: bool DEFAULT: False

rows_per_segment

Together with columns_per_segment, controls how data being written, appended, or updated is sliced into separate data segment objects before being written to storage.

By splitting data across multiple objects in storage, calls to read and read_batch that include the date_range and/or columns parameters can reduce the amount of data read from storage by only reading those data segments that contain data requested by the reader.

For example, if writing a dataframe with 250,000 rows and 200 columns, by default, this will be sliced into 6 data segments:

1 - rows 1-100,000 and columns 1-127
2 - rows 100,001-200,000 and columns 1-127
3 - rows 200,001-250,000 and columns 1-127
4 - rows 1-100,000 and columns 128-200
5 - rows 100,001-200,000 and columns 128-200
6 - rows 200,001-250,000 and columns 128-200

Data segments that cover the same range of rows are said to belong to the same row-slice (e.g. segments 2 and 5 in the example above). Data segments that cover the same range of columns are said to belong to the same column-slice (e.g. segments 2 and 3 in the example above).
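The six-segment example above can be reproduced with a short calculation. This is a sketch of the arithmetic only, using 1-based inclusive ranges as in the example; `slice_bounds` and `plan_segments` are hypothetical names, not ArcticDB functions, and the code does not model dynamic_schema libraries.

```python
def slice_bounds(total, per_segment):
    """Return 1-based inclusive (first, last) ranges covering `total` items."""
    return [(start + 1, min(start + per_segment, total))
            for start in range(0, total, per_segment)]


def plan_segments(n_rows, n_cols, rows_per_segment=100_000, columns_per_segment=127):
    """List (row_range, column_range) pairs, one per data segment.

    Segments are ordered column-slice by column-slice, matching the numbering
    used in the example above.
    """
    row_slices = slice_bounds(n_rows, rows_per_segment)
    col_slices = slice_bounds(n_cols, columns_per_segment)
    return [(rows, cols) for cols in col_slices for rows in row_slices]


segments = plan_segments(250_000, 200)
for number, (rows, cols) in enumerate(segments, start=1):
    print(f"{number} - rows {rows[0]:,}-{rows[1]:,} and columns {cols[0]}-{cols[1]}")
```

Segments that share the first element of the pair belong to the same row-slice, and segments that share the second belong to the same column-slice.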

Note that this slicing is only applied to the new data being written; existing data segments from previous versions that can remain the same will not be modified. For example, if a 50,000 row dataframe with a single column is written, and then another dataframe also with 50,000 rows and one column is appended to it, there will still be two data segments each with 50,000 rows.

Note that for libraries with dynamic_schema enabled, columns_per_segment does not apply, and there is always a single column-slice. However, rows_per_segment is used, and there will be multiple row-slices.

TYPE: int DEFAULT: 100000

columns_per_segment

See rows_per_segment

TYPE: int DEFAULT: 127

encoding_version

The encoding version to use when writing data to storage. v2 is faster, but still experimental, so use with caution.

TYPE: Optional[EncodingVersion] DEFAULT: None