DataFrame Processing Operations API¶
arcticdb.LazyDataFrame ¶
Bases: QueryBuilder
Lazy dataframe implementation, allowing chains of queries to be added before the read is actually executed.
Returned by Library.read
, Library.head
, and Library.tail
calls when lazy=True
.
See Also
QueryBuilder for supported querying operations.
Examples:
>>>
# Specify that we want version 0 of "test" symbol, and to only return the "new_column" column in the output
>>> lazy_df = lib.read("test", as_of=0, columns=["new_column"], lazy=True)
# Perform a filtering operation
>>> lazy_df = lazy_df[lazy_df["col1"].isin(0, 3, 6, 9)]
# Create a new column through a projection operation
>>> lazy_df["new_col"] = lazy_df["col1"] + lazy_df["col2"]
# Actual read and processing happens here
>>> df = lazy_df.collect().data
METHOD | DESCRIPTION |
---|---|
collect |
Read the data and execute any queries applied to this object since the read call. |
collect ¶
collect() -> VersionedItem
Read the data and execute any queries applied to this object since the read call.
RETURNS | DESCRIPTION |
---|---|
VersionedItem
|
Object that contains a .data and .metadata element. |
arcticdb.LazyDataFrameCollection ¶
Bases: QueryBuilder
Lazy dataframe implementation for batch operations. Allows the application of chains of queries to be added before
the actual reads are performed. Queries applied to this object will be applied to all the symbols being read.
If per-symbol queries are required, split can be used to break this class into a list of LazyDataFrame
objects.
Returned by Library.read_batch
calls when lazy=True
.
See Also
QueryBuilder for supported querying operations.
Examples:
>>>
# Specify that we want the latest version of "test_0" symbol, and version 0 of "test_1" symbol
>>> lazy_dfs = lib.read_batch(["test_0", ReadRequest("test_1", as_of=0)], lazy=True)
# Perform a filtering operation on both the "test_0" and "test_1" symbols
>>> lazy_dfs = lazy_dfs[lazy_dfs["col1"].isin(0, 3, 6, 9)]
# Perform a different projection operation on each symbol
>>> lazy_dfs = lazy_dfs.split()
>>> lazy_dfs[0].apply("new_col", lazy_dfs[0]["col1"] + 1)
>>> lazy_dfs[1].apply("new_col", lazy_dfs[1]["col1"] + 2)
# Bring together again and perform the same filter on both symbols
>>> lazy_dfs = LazyDataFrameCollection(lazy_dfs)
>>> lazy_dfs = lazy_dfs[lazy_dfs["new_col"] > 0]
# Actual read and processing happens here
>>> res = lazy_dfs.collect()
METHOD | DESCRIPTION |
---|---|
__init__ |
Gather a list of |
collect |
Read the data and execute any queries applied to this object since the read_batch call. |
split |
Separate the collection into a list of LazyDataFrames, including any queries already applied to this object. |
__init__ ¶
__init__(lazy_dataframes: List[LazyDataFrame])
Gather a list of LazyDataFrame
s into a single object that can be collected together.
PARAMETER | DESCRIPTION |
---|---|
lazy_dataframes
|
Collection of
TYPE:
|
collect ¶
collect() -> List[Union[VersionedItem, DataError]]
Read the data and execute any queries applied to this object since the read_batch call.
RETURNS | DESCRIPTION |
---|---|
List[Union[VersionedItem, DataError]]
|
See documentation on |
split ¶
split() -> List[LazyDataFrame]
Separate the collection into a list of LazyDataFrames, including any queries already applied to this object.
RETURNS | DESCRIPTION |
---|---|
List[LazyDataFrame]
|
|
arcticdb.QueryBuilder ¶
Build a query to process read results with. Syntax is designed to be similar to Pandas:
q = adb.QueryBuilder()
q = q[q["a"] < 5] (equivalent to q = q[q.a < 5] provided the column name is also a valid Python variable name)
dataframe = lib.read(symbol, query_builder=q).data
For Group By and Aggregation functionality please see the documentation for the groupby
. For projection
functionality, see the documentation for the apply
method.
Supported arithmetic operations when projection or filtering:
- Binary arithmetic: +, -, *, /
- Unary arithmetic: -, abs
Supported filtering operations:
-
isna, isnull, notna, and notnull - return all rows where a specified column is/is not NaN or None. isna is equivalent to isnull, and notna is equivalent to notnull, i.e. no distinction is made between NaN and None values in column types that support both (e.g. strings). For example:
q = q[q["col"].isna()]
-
Binary comparisons: <, <=, >, >=, ==, !=
- Unary NOT: ~
- Binary combinators: &, |, ^
- List membership: isin, isnotin (also accessible with == and !=)
isin/isnotin accept lists, sets, frozensets, 1D ndarrays, or *args unpacking. For example:
l = [1, 2, 3]
q.isin(l)
is equivalent to...
q.isin(1, 2, 3)
Boolean columns can be filtered on directly:
q = adb.QueryBuilder()
q = q[q["boolean_column"]]
and combined with other operations intuitively:
q = adb.QueryBuilder()
q = q[(q["boolean_column_1"] & ~q["boolean_column_2"]) & (q["numeric_column"] > 0)]
Arbitrary combinations of these expressions is possible, for example:
q = q[(((q["a"] * q["b"]) / 5) < (0.7 * q["c"])) & (q["b"] != 12)]
See tests/unit/arcticdb/version_store/test_filtering.py for more example uses.
Timestamp filtering¶
pandas.Timestamp, datetime.datetime, pandas.Timedelta, and datetime.timedelta objects are supported. Note that internally all of these types are converted to nanoseconds (since epoch in the Timestamp/datetime cases). This means that nonsensical operations such as multiplying two times together are permitted (but not encouraged).
Restrictions¶
String equality/inequality (and isin/isnotin) is supported for printable ASCII characters only. Although not prohibited, it is not recommended to use ==, !=, isin, or isnotin with floating point values.
Exceptions¶
inf or -inf values are provided for comparison Column involved in query is a Categorical Symbol is pickled Column involved in query is not present in symbol Query involves comparing strings using <, <=, >, or >= operators Query involves comparing a string to one or more numeric values, or vice versa Query involves arithmetic with a column containing strings
METHOD | DESCRIPTION |
---|---|
apply |
Apply enables new columns to be created using supported QueryBuilder numeric operations. See the documentation for the |
date_range |
DateRange to read data for. Applicable only for Pandas data with a DateTime index. Returns only the part |
groupby |
Group symbol by column name. GroupBy operations must be followed by an aggregation operator. Currently the following five aggregation |
head |
Filter out all but the first n rows of data. If n is negative, return all rows except the last n rows. |
optimise_for_memory |
Reduce peak memory usage during the query, at the expense of some performance. |
optimise_for_speed |
Process query as fast as possible (the default behaviour) |
prepend |
Applies processing specified in other before any processing already defined for this QueryBuilder. |
resample |
Resample a symbol on the index. The symbol must be datetime indexed. Resample operations must be followed by |
row_range |
Row range to read data for. Inclusive of the lower bound, exclusive of the upper bound. |
tail |
Filter out all but the last n rows of data. If n is negative, return all rows except the first n rows. |
then |
Applies processing specified in other after any processing already defined for this QueryBuilder. |
apply ¶
apply(name, expr)
Apply enables new columns to be created using supported QueryBuilder numeric operations. See the documentation for the
QueryBuilder class for more information on supported expressions - any expression valid in a filter is valid when using
apply
.
PARAMETER | DESCRIPTION |
---|---|
name
|
Name of the column to be created
|
expr
|
Expression
|
Examples:
>>> df = pd.DataFrame(
{
"VWAP": np.arange(0, 10, dtype=np.float64),
"ASK": np.arange(10, 20, dtype=np.uint16),
"VOL_ACC": np.arange(20, 30, dtype=np.int32),
},
index=np.arange(10),
)
>>> lib.write("expression", df)
>>> q = adb.QueryBuilder()
>>> q = q.apply("ADJUSTED", q["ASK"] * q["VOL_ACC"] + 7)
>>> lib.read("expression", query_builder=q).data
VOL_ACC ASK VWAP ADJUSTED
0 20 10 0.0 207
1 21 11 1.0 238
2 22 12 2.0 271
3 23 13 3.0 306
4 24 14 4.0 343
5 25 15 5.0 382
6 26 16 6.0 423
7 27 17 7.0 466
8 28 18 8.0 511
9 29 19 9.0 558
RETURNS | DESCRIPTION |
---|---|
QueryBuilder
|
Modified QueryBuilder object. |
date_range ¶
date_range(date_range: DateRangeInput)
DateRange to read data for. Applicable only for Pandas data with a DateTime index. Returns only the part of the data that falls within the given range. If this is the only processing clause being applied, then the returned data object will use less memory than passing date_range directly as an argument to the read method, at the cost of possibly being slightly slower.
PARAMETER | DESCRIPTION |
---|---|
date_range
|
A date range in the same format as accepted by the read method.
TYPE:
|
Examples:
>>> q = adb.QueryBuilder()
>>> q = q.date_range((pd.Timestamp("2000-01-01"), pd.Timestamp("2001-01-01")))
RETURNS | DESCRIPTION |
---|---|
QueryBuilder
|
Modified QueryBuilder object. |
groupby ¶
groupby(name: str)
Group symbol by column name. GroupBy operations must be followed by an aggregation operator. Currently the following five aggregation operators are supported:
- "mean" - compute the mean of the group
- "sum" - compute the sum of the group
- "min" - compute the min of the group
- "max" - compute the max of the group
- "count" - compute the count of group
For usage examples, see below.
PARAMETER | DESCRIPTION |
---|---|
name
|
Name of the column to group on. Note that currently GroupBy only supports single-column groupings.
TYPE:
|
Examples:
Average (mean) over two groups:
>>> df = pd.DataFrame(
{
"grouping_column": ["group_1", "group_1", "group_1", "group_2", "group_2"],
"to_mean": [1.1, 1.4, 2.5, np.nan, 2.2],
},
index=np.arange(5),
)
>>> q = adb.QueryBuilder()
>>> q = q.groupby("grouping_column").agg({"to_mean": "mean"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
to_mean
group_1 1.666667
group_2 2.2
Max over one group:
>>> df = pd.DataFrame(
{
"grouping_column": ["group_1", "group_1", "group_1"],
"to_max": [1, 5, 4],
},
index=np.arange(3),
)
>>> q = adb.QueryBuilder()
>>> q = q.groupby("grouping_column").agg({"to_max": "max"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
to_max
group_1 5
Max and Mean:
>>> df = pd.DataFrame(
{
"grouping_column": ["group_1", "group_1", "group_1"],
"to_mean": [1.1, 1.4, 2.5],
"to_max": [1.1, 1.4, 2.5]
},
index=np.arange(3),
)
>>> q = adb.QueryBuilder()
>>> q = q.groupby("grouping_column").agg({"to_max": "max", "to_mean": "mean"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
to_max to_mean
group_1 2.5 1.666667
Min and max over one column, mean over another:
>>> df = pd.DataFrame(
{
"grouping_column": ["group_1", "group_1", "group_1", "group_2", "group_2"],
"agg_1": [1, 2, 3, 4, 5],
"agg_2": [1.1, 1.4, 2.5, np.nan, 2.2],
},
index=np.arange(5),
)
>>> q = adb.QueryBuilder()
>>> q = q.groupby("grouping_column")
>>> q = q.agg({"agg_1_min": ("agg_1", "min"), "agg_1_max": ("agg_1", "max"), "agg_2": "mean"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
agg_1_min agg_1_max agg_2
group_1 1 3 1.666667
group_2 4 5 2.2
RETURNS | DESCRIPTION |
---|---|
QueryBuilder
|
Modified QueryBuilder object. |
head ¶
head(n: int = 5)
Filter out all but the first n rows of data. If n is negative, return all rows except the last n rows.
PARAMETER | DESCRIPTION |
---|---|
n
|
Number of rows to select if non-negative, otherwise number of rows to exclude.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
QueryBuilder
|
Modified QueryBuilder object. |
optimise_for_memory ¶
optimise_for_memory()
Reduce peak memory usage during the query, at the expense of some performance.
Optimisations applied:
- Memory used by strings that are present in segments read from storage, but are not required in the final dataframe that will be presented back to the user, is reclaimed earlier in the processing pipeline.
prepend ¶
prepend(other)
Applies processing specified in other before any processing already defined for this QueryBuilder.
PARAMETER | DESCRIPTION |
---|---|
other
|
QueryBuilder to apply before this one in the processing pipeline.
|
RETURNS | DESCRIPTION |
---|---|
QueryBuilder
|
Modified QueryBuilder object. |
resample ¶
resample(
rule: Union[str, DateOffset],
closed: Optional[str] = None,
label: Optional[str] = None,
offset: Optional[Union[str, Timedelta]] = None,
origin: Union[str, Timestamp] = "epoch",
)
Resample a symbol on the index. The symbol must be datetime indexed. Resample operations must be followed by an aggregation operator. Currently, the following 7 aggregation operators are supported:
- "mean" - compute the mean of the group
- "sum" - compute the sum of the group
- "min" - compute the min of the group
- "max" - compute the max of the group
- "count" - compute the count of group
- "first" - compute the first value in the group
- "last" - compute the last value in the group
Note that not all aggregators are supported with all column types:
- Numeric columns - support all aggregators
- Bool columns - support all aggregators
- String columns - support count, first, and last aggregators
- Datetime columns - support all aggregators EXCEPT sum
Note that time-buckets which contain no index values in the symbol will NOT be included in the returned DataFrame. This is not the same as Pandas default behaviour. Resampling is currently not supported with:
- Dynamic schema where an aggregation column is missing from one or more of the row-slices.
- Sparse data.
The resample results match pandas resample with origin="epoch"
. We plan to add an 'origin' argument in
a future release and will then change the default value to '"start_day"' to match the Pandas default. This
will change the results in cases where the rule is not a multiple of 24 hours.
PARAMETER | DESCRIPTION |
---|---|
rule
|
The frequency at which to resample the data. Supported rule strings are ns, us, ms, s, min, h, and D, and multiples/combinations of these, such as 1h30min. pd.DataOffset objects representing frequencies from this set are also accepted.
TYPE:
|
closed
|
Which boundary of each time-bucket is closed. Must be one of 'left' or 'right'. If not provided, the default is left for all currently supported frequencies.
TYPE:
|
label
|
Which boundary of each time-bucket is used as the index value in the returned DataFrame. Must be one of 'left' or 'right'. If not provided, the default is left for all currently supported frequencies.
TYPE:
|
offset
|
Offset the start of each bucket. Supported strings are the same as in
TYPE:
|
origin
|
The timestamp on which to adjust the grouping. Supported string are:
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
QueryBuilder
|
Modified QueryBuilder object. |
RAISES | DESCRIPTION |
---|---|
ArcticDbNotYetImplemented
|
A frequency string or Pandas DateOffset object are provided to the rule argument outside the supported frequencies listed above. |
ArcticNativeException
|
The closed or label arguments are not one of "left" or "right" |
SchemaException
|
Raised on call to read if:
|
UserInputException
|
|
Examples:
Resample two hours worth of minutely data down to hourly data, summing the column 'to_sum':
>>> df = pd.DataFrame(
{
"to_sum": np.arange(120),
},
index=pd.date_range("2024-01-01", freq="min", periods=120),
)
>>> q = adb.QueryBuilder()
>>> q = q.resample("h").agg({"to_sum": "sum"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
to_sum
2024-01-01 00:00:00 1770
2024-01-01 01:00:00 5370
As above, but specifying that the closed boundary of each time-bucket is the right hand side, and also to label the output by the right boundary:
>>> q = adb.QueryBuilder()
>>> q = q.resample("h", closed="right", label="right").agg({"to_sum": "sum"})
>>> lib.read("symbol", query_builder=q).data
to_sum
2024-01-01 00:00:00 0
2024-01-01 01:00:00 1830
2024-01-01 02:00:00 5310
Nones, NaNs, and NaTs are omitted from aggregations:
>>> df = pd.DataFrame(
{
"to_mean": [1.0, np.nan, 2.0],
},
index=pd.date_range("2024-01-01", freq="min", periods=3),
)
>>> q = adb.QueryBuilder()
>>> q = q.resample("h").agg({"to_mean": "mean"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
to_mean
2024-01-01 00:00:00 1.5
Output column names can be controlled through the format of the dict passed to agg:
>>> df = pd.DataFrame(
{
"agg_1": [1, 2, 3, 4, 5],
"agg_2": [1.0, 2.0, 3.0, np.nan, 5.0],
},
index=pd.date_range("2024-01-01", freq="min", periods=5),
)
>>> q = adb.QueryBuilder()
>>> q = q.resample("h")
>>> q = q.agg({"agg_1_min": ("agg_1", "min"), "agg_1_max": ("agg_1", "max"), "agg_2": "mean"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
agg_1_min agg_1_max agg_2
2024-01-01 00:00:00 1 5 2.75
row_range ¶
row_range(row_range: Tuple[int, int])
Row range to read data for. Inclusive of the lower bound, exclusive of the upper bound. Should behave the same as df.iloc[start:end], including in the handling of negative start/end values.
PARAMETER | DESCRIPTION |
---|---|
row_range
|
Row range to read data for. Inclusive of the lower bound, exclusive of the upper bound.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
QueryBuilder
|
Modified QueryBuilder object. |
tail ¶
tail(n: int = 5)
Filter out all but the last n rows of data. If n is negative, return all rows except the first n rows.
PARAMETER | DESCRIPTION |
---|---|
n
|
Number of rows to select if non-negative, otherwise number of rows to exclude.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
QueryBuilder
|
Modified QueryBuilder object. |
then ¶
then(other)
Applies processing specified in other after any processing already defined for this QueryBuilder.
PARAMETER | DESCRIPTION |
---|---|
other
|
QueryBuilder to apply after this one in the processing pipeline.
|
RETURNS | DESCRIPTION |
---|---|
QueryBuilder
|
Modified QueryBuilder object. |