Query Builder API¶

arcticdb.QueryBuilder ¶

Build a query to process read results with. Syntax is designed to be similar to Pandas: >>> q = adb.QueryBuilder() >>> q = q[q["a"] < 5] (equivalent to q = q[q.a < 5] provided the column name is also a valid Python variable name) >>> dataframe = lib.read(symbol, query_builder=q).data

For Group By and Aggregation functionality please see the documentation for the groupby. For projection functionality, see the documentation for the apply method.

Supported arithmetic operations when projection or filtering:

Binary arithmetic: +, -, *, /
Unary arithmetic: -, abs

Supported filtering operations:

Binary comparisons: <, <=, >, >=, ==, !=
Unary NOT: ~
Binary combinators: &, |, ^
List membership: isin, isnotin (also accessible with == and !=)

isin/isnotin accept lists, sets, frozensets, 1D ndarrays, or *args unpacking. For example:

>>> l = [1, 2, 3]
>>> q.isin(l)

is equivalent to...

>>> q.isin(1, 2, 3)

Boolean columns can be filtered on directly:

>>> q = adb.QueryBuilder()
>>> q = q[q["boolean_column"]]

and combined with other operations intuitively:

>>> q = adb.QueryBuilder()
>>> q = q[(q["boolean_column_1"] & ~q["boolean_column_2"]) & (q["numeric_column"] > 0)]

Arbitrary combinations of these expressions is possible, for example:

>>> q = q[(((q["a"] * q["b"]) / 5) < (0.7 * q["c"])) & (q["b"] != 12)]

See tests/unit/arcticdb/version_store/test_filtering.py for more example uses.

Timestamp filtering¶

pandas.Timestamp, datetime.datetime, pandas.Timedelta, and datetime.timedelta objects are supported. Note that internally all of these types are converted to nanoseconds (since epoch in the Timestamp/datetime cases). This means that nonsensical operations such as multiplying two times together are permitted (but not encouraged).

Restrictions¶

String equality/inequality (and isin/isnotin) is supported for printable ASCII characters only. Although not prohibited, it is not recommended to use ==, !=, isin, or isnotin with floating point values.

Exceptions¶

inf or -inf values are provided for comparison Column involved in query is a Categorical Symbol is pickled Column involved in query is not present in symbol Query involves comparing strings using <, <=, >, or >= operators Query involves comparing a string to one or more numeric values, or vice versa Query involves arithmetic with a column containing strings

apply ¶

apply(name, expr)

Apply enables new columns to be created using supported QueryBuilder numeric operations. See the documentation for the QueryBuilder class for more information on supported expressions - any expression valid in a filter is valid when using apply.

PARAMETER	DESCRIPTION
`name`	Name of the column to be created
`expr`	Expression

Examples:

>>> df = pd.DataFrame(
    {
        "VWAP": np.arange(0, 10, dtype=np.float64),
        "ASK": np.arange(10, 20, dtype=np.uint16),
        "VOL_ACC": np.arange(20, 30, dtype=np.int32),
    },
    index=np.arange(10),
)
>>> lib.write("expression", df)
>>> q = adb.QueryBuilder()
>>> q = q.apply("ADJUSTED", q["ASK"] * q["VOL_ACC"] + 7)
>>> lib.read("expression", query_builder=q).data
VOL_ACC  ASK  VWAP  ADJUSTED
0     20   10   0.0       207
1     21   11   1.0       238
2     22   12   2.0       271
3     23   13   3.0       306
4     24   14   4.0       343
5     25   15   5.0       382
6     26   16   6.0       423
7     27   17   7.0       466
8     28   18   8.0       511
9     29   19   9.0       558

RETURNS	DESCRIPTION
`QueryBuilder`	Modified QueryBuilder object.

date_range ¶

date_range(date_range: DateRangeInput)

DateRange to read data for. Applicable only for Pandas data with a DateTime index. Returns only the part of the data that falls within the given range. The returned data object will use less memory than passing date_range directly as an argument to the read method, at the cost of being slightly slower. Must be the first clause in the QueryBuilder object.

PARAMETER	DESCRIPTION
`date_range`	A date range in the same format as accepted by the read method. TYPE: `DateRangeInput`

Examples:

>>> q = adb.QueryBuilder()
>>> q = q.date_range((pd.Timestamp("2000-01-01"), pd.Timestamp("2001-01-01")))

RETURNS	DESCRIPTION
`QueryBuilder`	Modified QueryBuilder object.

groupby ¶

groupby(name: str)

Group symbol by column name. GroupBy operations must be followed by an aggregation operator. Currently the following five aggregation operators are supported: * "mean" - compute the mean of the group * "sum" - compute the sum of the group * "min" - compute the min of the group * "max" - compute the max of the group * "count" - compute the count of group

For usage examples, see below.

PARAMETER	DESCRIPTION
`name`	Name of the column to group on. Note that currently GroupBy only supports single-column groupings. TYPE: `str`

Examples:

Average (mean) over two groups:

>>> df = pd.DataFrame(
    {
        "grouping_column": ["group_1", "group_1", "group_1", "group_2", "group_2"],
        "to_mean": [1.1, 1.4, 2.5, np.nan, 2.2],
    },
    index=np.arange(5),
)
>>> q = adb.QueryBuilder()
>>> q = q.groupby("grouping_column").agg({"to_mean": "mean"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
             to_mean
    group_1  1.666667
    group_2       NaN

Max over one group:

>>> df = pd.DataFrame(
    {
        "grouping_column": ["group_1", "group_1", "group_1"],
        "to_max": [1, 5, 4],
    },
    index=np.arange(3),
)
>>> q = adb.QueryBuilder()
>>> q = q.groupby("grouping_column").agg({"to_max": "max"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
             to_max
    group_1  5

Max and Mean:

>>> df = pd.DataFrame(
    {
        "grouping_column": ["group_1", "group_1", "group_1"],
        "to_mean": [1.1, 1.4, 2.5],
        "to_max": [1.1, 1.4, 2.5]
    },
    index=np.arange(3),
)
>>> q = adb.QueryBuilder()
>>> q = q.groupby("grouping_column").agg({"to_max": "max", "to_mean": "mean"})
>>> lib.write("symbol", df)
>>> lib.read("symbol", query_builder=q).data
                to_max   to_mean
    group_1     2.5  1.666667

RETURNS	DESCRIPTION
`QueryBuilder`	Modified QueryBuilder object.

optimise_for_memory ¶

optimise_for_memory()

Reduce peak memory usage during the query, at the expense of some performance.

Optimisations applied:

Memory used by strings that are present in segments read from storage, but are not required in the final dataframe that will be presented back to the user, is reclaimed earlier in the processing pipeline.

optimise_for_speed ¶

optimise_for_speed()

Process query as fast as possible (the default behaviour)

then ¶

then(other)

Applies processing specified in other after any processing already defined for this QueryBuilder.

PARAMETER	DESCRIPTION
`other`	QueryBuilder to apply after this one in the processing pipeline.

RETURNS	DESCRIPTION
`QueryBuilder`	Modified QueryBuilder object.