ArcticDB_demo_resample
In [ ]:
!pip install arcticdb
In [2]:
import numpy as np
import pandas as pd
import arcticdb as adb
ArcticDB Resample Demo
This demo notebook showcases the high-performance resample capability of ArcticDB.
This is what you need to know about it:
- It runs on-the-fly as part of the read
- This makes it much more efficient than Pandas on large datasets
- The usage is similar to the Pandas resample function (see the sketch after this list)
- You can apply multiple aggregators to each column
- It can be used for downsampling high-frequency data and generating "bar" data (see example 4)
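Before the worked examples, here is the general shape of a resample call as a minimal sketch. The symbol and column names ('my_symbol', 'my_col') are placeholders, and lib is the library object created in the Setup section below:
# general pattern: the resample runs inside the read, before data reaches Python
resampled = (
    lib.read('my_symbol', lazy=True)  # lazy read: no data is fetched yet
    .resample('1min')                 # 1-minute buckets on the index
    .agg({'my_col': 'mean'})          # one aggregator per output column
    .collect()                        # execute read + resample together
    .data                             # the resulting pandas DataFrame
)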
Setup
In [3]:
# object store
arctic = adb.Arctic("lmdb://arcticdb_resample")
In [4]:
# library
lib = arctic.get_library('resample', create_if_missing=True)
Create Some Data
- timeseries with 12,000,000 rows and a 1-second index
- int, float, string columns
- write the data into ArcticDB
In [5]:
# data for resampling
index = pd.date_range("1990-01-01", periods=12_000_000, freq="s")
int_data = np.arange(len(index), dtype=np.uint64)
float_data = np.round(np.random.uniform(95., 105., len(index)), 3)
letters = ['a','b','c','d','e','f','g']
mkt_data = pd.DataFrame(
    index=index,
    data={
        "id": int_data,
        "price": float_data,
        "category": (letters*(len(index)//len(letters) + 1))[:len(index)]
    }
)
In [6]:
# view the first 10 rows of the data
mkt_data.head(10)
Out[6]:
|                     | id | price   | category |
|---------------------|----|---------|----------|
| 1990-01-01 00:00:00 | 0  | 95.176  | a |
| 1990-01-01 00:00:01 | 1  | 97.872  | b |
| 1990-01-01 00:00:02 | 2  | 104.930 | c |
| 1990-01-01 00:00:03 | 3  | 103.573 | d |
| 1990-01-01 00:00:04 | 4  | 97.052  | e |
| 1990-01-01 00:00:05 | 5  | 103.435 | f |
| 1990-01-01 00:00:06 | 6  | 99.339  | g |
| 1990-01-01 00:00:07 | 7  | 103.358 | a |
| 1990-01-01 00:00:08 | 8  | 104.301 | b |
| 1990-01-01 00:00:09 | 9  | 104.651 | c |
In [7]:
# write the data into ArcticDB
sym = 'market_data'
lib.write(sym, mkt_data)
Out[7]:
VersionedItem(symbol='market_data', library='resample', data=n/a, version=0, metadata=None, host='LMDB(path=~/arcticdb_resample)', timestamp=1718958796318913629)
1. Simple Resample
- Resample to 1-minute
- Use different aggregators
- Resample can be thought of as a time-based groupby
- The groups are all the rows within a time interval
- Also run in Pandas to compare performance and results
In [8]:
# frequency and aggregator params
freq1 = '1min'
aggs1 = {'id': 'max', 'price': 'last', 'category': 'count'}
In [9]:
%%time
# create the resample query and apply it on the read
market_data_1min_df = lib.read(sym, lazy=True).resample(freq1).agg(aggs1).collect().data
print(len(market_data_1min_df))
market_data_1min_df.tail()
200000
CPU times: user 684 ms, sys: 251 ms, total: 935 ms
Wall time: 171 ms
Out[9]:
|                     | id       | price   | category |
|---------------------|----------|---------|----------|
| 1990-05-19 21:15:00 | 11999759 | 104.106 | 60 |
| 1990-05-19 21:16:00 | 11999819 | 104.456 | 60 |
| 1990-05-19 21:17:00 | 11999879 | 95.570  | 60 |
| 1990-05-19 21:18:00 | 11999939 | 103.967 | 60 |
| 1990-05-19 21:19:00 | 11999999 | 97.899  | 60 |
In [10]:
%%time
# read the full data set and resample in Pandas
full_df = lib.read(sym).data
market_data_1min_pd_df = full_df.resample(freq1).agg(aggs1)
print(len(market_data_1min_pd_df))
market_data_1min_pd_df.tail()
200000
CPU times: user 1.6 s, sys: 401 ms, total: 2 s
Wall time: 1.15 s
Out[10]:
|                     | id       | price   | category |
|---------------------|----------|---------|----------|
| 1990-05-19 21:15:00 | 11999759 | 104.106 | 60 |
| 1990-05-19 21:16:00 | 11999819 | 104.456 | 60 |
| 1990-05-19 21:17:00 | 11999879 | 95.570  | 60 |
| 1990-05-19 21:18:00 | 11999939 | 103.967 | 60 |
| 1990-05-19 21:19:00 | 11999999 | 97.899  | 60 |
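The two results can also be compared directly. A minimal sketch of an equality check, assuming column order and aggregated dtypes may differ between the two paths:
# sanity check: the on-the-fly resample should match the Pandas result
pd.testing.assert_frame_equal(
    market_data_1min_df[market_data_1min_pd_df.columns],  # align column order
    market_data_1min_pd_df,
    check_dtype=False,  # aggregated dtypes may differ between the two paths
)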
2. Multiple Aggregators per Column
- Similar to NamedAgg in Pandas
- Downsample to 5-minute frequency
- Apply both max and last aggregators to the price column
- For multiple aggregators, the syntax is
output_column_name: (input_column_name, aggregator)
In [11]:
freq2 = '5min'
aggs2 = {'id': 'max', 'price_last': ('price', 'last'), 'price_count': ('price', 'count'), 'category': 'first'}
In [12]:
%%time
lib.read(sym, lazy=True).resample(freq2).agg(aggs2).collect().data
CPU times: user 1.07 s, sys: 415 ms, total: 1.49 s
Wall time: 151 ms
Out[12]:
|                     | id       | category | price_count | price_last |
|---------------------|----------|----------|-------------|------------|
| 1990-01-01 00:00:00 | 299      | a   | 300 | 102.172 |
| 1990-01-01 00:05:00 | 599      | g   | 300 | 101.450 |
| 1990-01-01 00:10:00 | 899      | f   | 300 | 96.718  |
| 1990-01-01 00:15:00 | 1199     | e   | 300 | 96.345  |
| 1990-01-01 00:20:00 | 1499     | d   | 300 | 98.955  |
| ...                 | ...      | ... | ... | ...     |
| 1990-05-19 20:55:00 | 11998799 | d   | 300 | 100.277 |
| 1990-05-19 21:00:00 | 11999099 | c   | 300 | 103.596 |
| 1990-05-19 21:05:00 | 11999399 | b   | 300 | 96.182  |
| 1990-05-19 21:10:00 | 11999699 | a   | 300 | 99.911  |
| 1990-05-19 21:15:00 | 11999999 | g   | 300 | 97.899  |
40000 rows × 4 columns
3. Processing Pipeline: Chaining Operations
- Downsample to 2.5-minute frequency
- Group the resampled data by the category column
- Aggregate the category groups using mean
In [14]:
%%time
lib.read(sym, lazy=True).resample('2min30s').agg({'id': 'min', 'category': 'first'}).groupby('category').agg({'id': 'mean'}).collect().data
CPU times: user 1.12 s, sys: 309 ms, total: 1.43 s
Wall time: 183 ms
Out[14]:
| category | id        |
|----------|-----------|
| a        | 5999700.0 |
| b        | 5999925.0 |
| d        | 5999850.0 |
| e        | 6000075.0 |
| f        | 5999775.0 |
| g        | 6000000.0 |
| c        | 6000150.0 |
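Note that the groups come back unordered (c appears last). If a sorted result is needed, sorting can be done client-side after collection; a minimal sketch:
# sort the grouped result by category after the pipeline has run
grouped = lib.read(sym, lazy=True).resample('2min30s').agg({'id': 'min', 'category': 'first'}).groupby('category').agg({'id': 'mean'}).collect().data
grouped.sort_index()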
4. Example: OHLC (Open High Low Close) Bars
- Downsample to 5-minute frequency
- Use multiple aggregators on the price column
- This is a simple example of how to convert tick data to OHLC bar data
In [15]:
freq_ohlc = '5min'
agg_ohlc = {
    'open': ('price', 'first'),
    'high': ('price', 'max'),
    'low': ('price', 'min'),
    'close': ('price', 'last')
}
In [16]:
%%time
ohlc_5min_bars = lib.read(sym, lazy=True).resample(freq_ohlc).agg(agg_ohlc).collect().data
CPU times: user 1.26 s, sys: 492 ms, total: 1.75 s
Wall time: 118 ms
In [17]:
ohlc_5min_bars.head()
Out[17]:
|                     | close   | low    | high    | open    |
|---------------------|---------|--------|---------|---------|
| 1990-01-01 00:00:00 | 102.172 | 95.076 | 104.992 | 95.176  |
| 1990-01-01 00:05:00 | 101.450 | 95.008 | 104.999 | 98.520  |
| 1990-01-01 00:10:00 | 96.718  | 95.053 | 104.990 | 103.959 |
| 1990-01-01 00:15:00 | 96.345  | 95.070 | 104.969 | 95.878  |
| 1990-01-01 00:20:00 | 98.955  | 95.011 | 104.983 | 103.538 |
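As the head() output shows, the bar columns do not necessarily come back in the conventional open/high/low/close order; a minimal sketch of reordering them after collection:
# put the bar columns into the conventional OHLC order
ohlc_5min_bars = ohlc_5min_bars[['open', 'high', 'low', 'close']]
ohlc_5min_bars.head()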
Conclusion
We have demonstrated the following about the ArcticDB resample feature:
- Easy to use, especially if you already resample in Pandas
- Very high performance: in particular, much faster than reading all the data and then resampling in Pandas
- Can be combined with other query functions to build processing pipelines
- Can be used to generate timeseries bars