ArcticDB Merge¶
Motivation¶
The merge API offers a straightforward way for users to modify or insert specific rows in their data. Data providers often issue corrections for particular entries, and the merge functionality can efficiently apply these updates.
Example¶
Let's start with a quick example before exploring the details.
In this example, we have an ArcticDB library called prices that stores price data, and a symbol named daily_prices for daily prices. At some point, the data provider issued corrections for 2023-01-05, 2023-01-07, and 2023-01-19. We want to apply these corrections, modifying only the affected rows.
Begin by importing the necessary modules.
import pandas as pd
import arcticdb
import numpy as np
from IPython.display import display
from arcticdb.version_store.library import MergeStrategy, MergeAction
Next, create an ArcticDB instance and a library.
ac = arcticdb.Arctic("lmdb://merge_example")
lib = ac.get_library("prices", create_if_missing=True)
Create an example prices DataFrame and store it in ArcticDB.
# Create example data
daily_prices = pd.DataFrame(
data={
"High": [102.5, 103.6, 101.7, 103.2, 104.8, 106.0, 108.9, 107.6, 109.3, 108.7,
111.1, 113.5, 113.2, 115.0, 112.3, 115.9, 117.3, 118.6, 119.8, 119.1],
"Low": [100.5, 101.8, 99.7, 100.9, 101.7, 103.4, 106.2, 104.8, 107.5, 106.3,
109.1, 110.8, 110.2, 111.7, 109.6, 112.7, 114.0, 115.4, 116.5, 116.2],
"Volume": [1200, 1500, 900, 1400, 1600, 1900, 2500, 2300, 1800, 2100,
2000, 2700, 2600, 3000, 2200, 3100, 2800, 3200, 3500, 3400],
},
index=pd.date_range(start="2023-01-01", periods=20, freq="D")
)
daily_prices
| High | Low | Volume | |
|---|---|---|---|
| 2023-01-01 | 102.5 | 100.5 | 1200 |
| 2023-01-02 | 103.6 | 101.8 | 1500 |
| 2023-01-03 | 101.7 | 99.7 | 900 |
| 2023-01-04 | 103.2 | 100.9 | 1400 |
| 2023-01-05 | 104.8 | 101.7 | 1600 |
| 2023-01-06 | 106.0 | 103.4 | 1900 |
| 2023-01-07 | 108.9 | 106.2 | 2500 |
| 2023-01-08 | 107.6 | 104.8 | 2300 |
| 2023-01-09 | 109.3 | 107.5 | 1800 |
| 2023-01-10 | 108.7 | 106.3 | 2100 |
| 2023-01-11 | 111.1 | 109.1 | 2000 |
| 2023-01-12 | 113.5 | 110.8 | 2700 |
| 2023-01-13 | 113.2 | 110.2 | 2600 |
| 2023-01-14 | 115.0 | 111.7 | 3000 |
| 2023-01-15 | 112.3 | 109.6 | 2200 |
| 2023-01-16 | 115.9 | 112.7 | 3100 |
| 2023-01-17 | 117.3 | 114.0 | 2800 |
| 2023-01-18 | 118.6 | 115.4 | 3200 |
| 2023-01-19 | 119.8 | 116.5 | 3500 |
| 2023-01-20 | 119.1 | 116.2 | 3400 |
lib.write("daily_prices", daily_prices)
VersionedItem(symbol='daily_prices', library='prices', data=n/a, version=2, metadata=None, host='LMDB(path=/home/vasil/Documents/source/ArcticDB/docs/mkdocs/docs/notebooks/merge_example)', timestamp=1771501596070866899)
Create the correction DataFrame.
daily_prices_correction = pd.DataFrame(
data={
"High": [3000.0, 4000.0, 5000.0],
"Low": [1000.0, 2000.0, 3000.0],
"Volume":[10000, 20000, 30000]
},
index=pd.DatetimeIndex([pd.Timestamp("2023-01-05"), pd.Timestamp("2023-01-07"), pd.Timestamp("2023-01-19")])
)
daily_prices_correction
| High | Low | Volume | |
|---|---|---|---|
| 2023-01-05 | 3000.0 | 1000.0 | 10000 |
| 2023-01-07 | 4000.0 | 2000.0 | 20000 |
| 2023-01-19 | 5000.0 | 3000.0 | 30000 |
Perform the merge update using the correction data. Ensure that both daily_prices_correction and the stored data are sorted. By default, two rows in DataFrames with a datetime index are considered matching if their indexes are equal.
lib.merge_experimental("daily_prices", daily_prices_correction, strategy=MergeStrategy(matched="update", not_matched_by_target="do_nothing"))
print("Merged result")
display(lib.read("daily_prices").data)
print("Diff between merged and original data")
daily_prices.compare(lib.read("daily_prices").data, keep_equal=True)
Merged result
| High | Low | Volume | |
|---|---|---|---|
| 2023-01-01 | 102.5 | 100.5 | 1200 |
| 2023-01-02 | 103.6 | 101.8 | 1500 |
| 2023-01-03 | 101.7 | 99.7 | 900 |
| 2023-01-04 | 103.2 | 100.9 | 1400 |
| 2023-01-05 | 3000.0 | 1000.0 | 10000 |
| 2023-01-06 | 106.0 | 103.4 | 1900 |
| 2023-01-07 | 4000.0 | 2000.0 | 20000 |
| 2023-01-08 | 107.6 | 104.8 | 2300 |
| 2023-01-09 | 109.3 | 107.5 | 1800 |
| 2023-01-10 | 108.7 | 106.3 | 2100 |
| 2023-01-11 | 111.1 | 109.1 | 2000 |
| 2023-01-12 | 113.5 | 110.8 | 2700 |
| 2023-01-13 | 113.2 | 110.2 | 2600 |
| 2023-01-14 | 115.0 | 111.7 | 3000 |
| 2023-01-15 | 112.3 | 109.6 | 2200 |
| 2023-01-16 | 115.9 | 112.7 | 3100 |
| 2023-01-17 | 117.3 | 114.0 | 2800 |
| 2023-01-18 | 118.6 | 115.4 | 3200 |
| 2023-01-19 | 5000.0 | 3000.0 | 30000 |
| 2023-01-20 | 119.1 | 116.2 | 3400 |
Diff between merged and original data
|  | High (self) | High (other) | Low (self) | Low (other) | Volume (self) | Volume (other) |
|---|---|---|---|---|---|---|
| 2023-01-05 | 104.8 | 3000.0 | 101.7 | 1000.0 | 1600 | 10000 |
| 2023-01-07 | 108.9 | 4000.0 | 106.2 | 2000.0 | 2500 | 20000 |
| 2023-01-19 | 119.8 | 5000.0 | 116.5 | 3000.0 | 3500 | 30000 |
Merge vs. Update¶
ArcticDB also provides an update method, which requires sorted input. The key difference is that update overwrites all data between the start and end of the input, potentially removing any rows that fall within those bounds and are not in the input.
lib.write("prices_update_example", daily_prices)
lib.update("prices_update_example", daily_prices_correction)
lib.read("prices_update_example").data
| High | Low | Volume | |
|---|---|---|---|
| 2023-01-01 | 102.5 | 100.5 | 1200 |
| 2023-01-02 | 103.6 | 101.8 | 1500 |
| 2023-01-03 | 101.7 | 99.7 | 900 |
| 2023-01-04 | 103.2 | 100.9 | 1400 |
| 2023-01-05 | 3000.0 | 1000.0 | 10000 |
| 2023-01-07 | 4000.0 | 2000.0 | 20000 |
| 2023-01-19 | 5000.0 | 3000.0 | 30000 |
| 2023-01-20 | 119.1 | 116.2 | 3400 |
In the example above, the input to update included only rows for 2023-01-05, 2023-01-07, and 2023-01-19. As a result, any data between 2023-01-05 and 2023-01-19 that was not present in the update input is omitted from the final result.
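The difference can be sketched with a small stdlib-only model (purely illustrative, not ArcticDB's implementation), treating a symbol as a mapping from timestamp to row:

```python
# Illustrative sketch of the two semantics on a {timestamp: row} mapping.
# Not ArcticDB's implementation.

def update_like(stored, new):
    # update(): drop every stored row inside the input's index range,
    # then insert the input rows.
    lo, hi = min(new), max(new)
    kept = {k: v for k, v in stored.items() if k < lo or k > hi}
    kept.update(new)
    return dict(sorted(kept.items()))

def merge_update_like(stored, new):
    # merge with matched="update", not_matched_by_target="do_nothing":
    # overwrite only rows whose key already exists; keep everything else.
    return {k: new.get(k, v) for k, v in stored.items()}

stored = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
corrections = {2: "B", 4: "D"}
print(update_like(stored, corrections))        # {1: 'a', 2: 'B', 4: 'D', 5: 'e'} - row 3 is gone
print(merge_update_like(stored, corrections))  # {1: 'a', 2: 'B', 3: 'c', 4: 'D', 5: 'e'}
```

With update, the stored row at key 3 falls between the input's bounds and is removed; with merge, only the matched keys change.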
Merge: Semantics and Behavior¶
In this context, the data stored in ArcticDB is referred to as the target, while the input to merge is called the source.
Merge works by performing a join between target and source on a subset of columns, then updating target according to the strategy parameter. If target is a time series, the index is always included among the join columns. This ensures that an ordered date-time index stays ordered after a merge; otherwise, the following would be possible:
lib.write("sym", pd.DataFrame({"a": [1, 2, 3]}, index=pd.DatetimeIndex([pd.Timestamp(1), pd.Timestamp(2), pd.Timestamp(3)])))
lib.merge_experimental("sym", pd.DataFrame({"a": [2]}, index=pd.DatetimeIndex([pd.Timestamp(10)])), on=["a"])
print(lib.read("sym").data)
a
1970-01-01 00:00:00.000000001 1
1970-01-01 00:00:00.000000010 2
1970-01-01 00:00:00.000000003 3
Strategies¶
The strategy is a named tuple that defines how target will be modified. Its members can be either case-insensitive strings (update, insert, or do_nothing) or members of the MergeAction enum.
- matched specifies what to do when a row in source matches a row in target. Acceptable values are update and do_nothing.
- not_matched_by_target specifies what to do when a row in source does not match any row in target. Acceptable values are insert and do_nothing.
Acceptable combinations¶
- matched="update", not_matched_by_target="do_nothing"
- matched="update", not_matched_by_target="insert"
- matched="do_nothing", not_matched_by_target="insert"
Unacceptable combinations¶
- matched="do_nothing", not_matched_by_target="do_nothing": nothing would happen
- matched="do_nothing"|"insert"|"update", not_matched_by_target="update": cannot update a row that does not exist in the target
- matched="insert", not_matched_by_target="do_nothing"|"insert"|"update": while technically possible, inserting duplicates does not make sense
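These rules can be condensed into a small validation helper. This is purely illustrative and not part of the ArcticDB API:

```python
# Hypothetical helper encoding the acceptability rules above.
# Not part of the ArcticDB API.
def is_acceptable(matched: str, not_matched_by_target: str) -> bool:
    matched = matched.lower()
    not_matched_by_target = not_matched_by_target.lower()
    if matched == "insert":
        return False  # would insert duplicates of matched rows
    if not_matched_by_target == "update":
        return False  # cannot update a row missing from the target
    # at least one of the two actions must do something
    return not (matched == "do_nothing" and not_matched_by_target == "do_nothing")

print(is_acceptable("update", "do_nothing"))      # True
print(is_acceptable("update", "insert"))          # True
print(is_acceptable("do_nothing", "insert"))      # True
print(is_acceptable("do_nothing", "do_nothing"))  # False
```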
Examples¶
MergeStrategy(matched="update", not_matched_by_target="do_nothing") updates only the rows in target that match on the selected columns. Rows that do not match remain unchanged, and new rows are not inserted. If a row in source matches multiple rows in target, all matching rows in target will be updated.
data_with_duplicates = pd.DataFrame(
{"Bid": [100, 101, 102], "Ask": [101.1, 101, 102.5]},
index=pd.DatetimeIndex([pd.Timestamp("2025-01-01 08:00:00"), pd.Timestamp("2025-01-01 08:00:00"), pd.Timestamp("2025-01-01 09:00:00")])
)
lib.write("merge_update_with_duplicates", data_with_duplicates)
print("Original data")
display(lib.read("merge_update_with_duplicates").data)
lib.merge_experimental(
"merge_update_with_duplicates",
pd.DataFrame(
{"Bid": [105, 102], "Ask": [105.3, 102]},
index=pd.DatetimeIndex([pd.Timestamp("2025-01-01 08:00:00"), pd.Timestamp("2025-02-01 08:00:00")])
),
MergeStrategy(matched="update", not_matched_by_target="do_nothing")
)
print("Data after merge")
display(lib.read("merge_update_with_duplicates").data)
Original data
| Bid | Ask | |
|---|---|---|
| 2025-01-01 08:00:00 | 100 | 101.1 |
| 2025-01-01 08:00:00 | 101 | 101.0 |
| 2025-01-01 09:00:00 | 102 | 102.5 |
Data after merge
| Bid | Ask | |
|---|---|---|
| 2025-01-01 08:00:00 | 105 | 105.3 |
| 2025-01-01 08:00:00 | 105 | 105.3 |
| 2025-01-01 09:00:00 | 102 | 102.5 |
In the example above, there is a row in source with the index value 2025-01-01 08:00:00 that matches two rows in target. Both matching rows in target are updated. A row in source that does not match any row in target is not inserted. Rows in target that do not match any row in source remain unchanged.
The elements of MergeStrategy can also be values of the MergeAction enum.
data_with_duplicates = pd.DataFrame(
{"Bid": [100, 101, 102], "Ask": [101.1, 101, 102.5]},
index=pd.DatetimeIndex([pd.Timestamp("2025-01-01 08:00:00"), pd.Timestamp("2025-01-01 08:00:00"), pd.Timestamp("2025-01-01 09:00:00")])
)
lib.write("merge_update_with_duplicates", data_with_duplicates)
print("Original data")
display(lib.read("merge_update_with_duplicates").data)
lib.merge_experimental(
"merge_update_with_duplicates",
pd.DataFrame(
{"Bid": [105, 102], "Ask": [105.3, 102]},
index=pd.DatetimeIndex([pd.Timestamp("2025-01-01 08:00:00"), pd.Timestamp("2025-02-01 08:00:00")])
),
MergeStrategy(MergeAction.UPDATE, not_matched_by_target=MergeAction.DO_NOTHING)
)
print("Data after merge")
display(lib.read("merge_update_with_duplicates").data)
Original data
| Bid | Ask | |
|---|---|---|
| 2025-01-01 08:00:00 | 100 | 101.1 |
| 2025-01-01 08:00:00 | 101 | 101.0 |
| 2025-01-01 09:00:00 | 102 | 102.5 |
Data after merge
| Bid | Ask | |
|---|---|---|
| 2025-01-01 08:00:00 | 105 | 105.3 |
| 2025-01-01 08:00:00 | 105 | 105.3 |
| 2025-01-01 09:00:00 | 102 | 102.5 |
MergeStrategy(matched="do_nothing", not_matched_by_target="insert") inserts the rows from source that do not match on the selected columns. MergeStrategy(matched="update", not_matched_by_target="insert") updates target by modifying matching rows and inserting non-matching rows from source.
When the strategy involves updating on match, a row in target must not be matched by more than one row in source, as this would create ambiguity about which values to use.
The on Parameter¶
The on parameter allows specifying additional columns to use for matching. When on is provided, a row in source matches a row in target only when the index and all columns listed in on have equal values. The parameter can contain any column that appears in both source and target; there are no restrictions on the number, dtype, or order of the columns.
Important: For date-time indexed data, the index is always included in matching and cannot be excluded.
Example: Matching on Additional Columns¶
Consider a scenario where we have intraday price data with multiple exchanges. Corrections arrive for specific exchange/timestamp combinations. Without on, all rows sharing the same timestamp would match. With on=["Exchange"], only the row with the correct timestamp and exchange will be updated.
# Target: price data from multiple exchanges at the same timestamps
multi_exchange_data = pd.DataFrame(
{
"Exchange": ["NYSE", "LSE", "NYSE", "LSE"],
"Price": [100.0, 200.0, 110.0, 210.0],
},
index=pd.DatetimeIndex([
pd.Timestamp("2025-01-01 09:00"),
pd.Timestamp("2025-01-01 09:00"),
pd.Timestamp("2025-01-01 10:00"),
pd.Timestamp("2025-01-01 10:00"),
])
)
lib.write("multi_exchange", multi_exchange_data)
print("Original data")
display(lib.read("multi_exchange").data)
# Source: correction only for NYSE at 09:00
correction = pd.DataFrame(
{
"Exchange": ["NYSE"],
"Price": [999.0],
},
index=pd.DatetimeIndex([pd.Timestamp("2025-01-01 09:00")])
)
print("Correction data")
display(correction)
# Merge using on=["Exchange"] so only the NYSE row at 09:00 is updated
lib.merge_experimental(
"multi_exchange",
correction,
strategy=MergeStrategy(matched="update", not_matched_by_target="do_nothing"),
on=["Exchange"]
)
print("Data after merge with on=['Exchange']")
display(lib.read("multi_exchange").data)
Original data
| Exchange | Price | |
|---|---|---|
| 2025-01-01 09:00:00 | NYSE | 100.0 |
| 2025-01-01 09:00:00 | LSE | 200.0 |
| 2025-01-01 10:00:00 | NYSE | 110.0 |
| 2025-01-01 10:00:00 | LSE | 210.0 |
Correction data
| Exchange | Price | |
|---|---|---|
| 2025-01-01 09:00:00 | NYSE | 999.0 |
Data after merge with on=['Exchange']
| Exchange | Price | |
|---|---|---|
| 2025-01-01 09:00:00 | NYSE | 999.0 |
| 2025-01-01 09:00:00 | LSE | 200.0 |
| 2025-01-01 10:00:00 | NYSE | 110.0 |
| 2025-01-01 10:00:00 | LSE | 210.0 |
Only the NYSE row at 2025-01-01 09:00 was updated. The LSE row at the same timestamp and all rows at 10:00 remain unchanged.
Without on=["Exchange"], both rows at 09:00 (NYSE and LSE) would have been updated, because both match on the index alone.
Equality Semantics for NaN and None in on Columns¶
- Float columns: NaN is considered equal to NaN. This differs from the IEEE 754 standard, where NaN != NaN.
- String columns: None and NaN are indistinguishable. All of the following are treated as matches: NaN == NaN, NaN == None, None == NaN, None == None.
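For contrast, ordinary float comparison in Python follows IEEE 754, so NaN never compares equal to itself and has to be detected explicitly:

```python
import math

nan = float("nan")
print(nan == nan)       # False: IEEE 754 says NaN != NaN
print(math.isnan(nan))  # True: the explicit NaN check
# merge's on-column matching deliberately deviates from IEEE 754,
# treating NaN as equal to NaN so that missing values can be matched.
```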
Example: NaN Matching in Float Columns¶
# Target: data with NaN in a float column
float_nan_data = pd.DataFrame(
{
"Sector": [np.nan, 1.0, np.nan, 2.0],
"Price": [100.0, 200.0, 300.0, 400.0],
},
index=pd.DatetimeIndex([
pd.Timestamp("2025-01-01"),
pd.Timestamp("2025-01-01"),
pd.Timestamp("2025-01-02"),
pd.Timestamp("2025-01-02"),
])
)
lib.write("float_nan_example", float_nan_data)
print("Original data")
display(lib.read("float_nan_example").data)
# Source: correction for the NaN sector row at 2025-01-01
correction_nan = pd.DataFrame(
{
"Sector": [np.nan],
"Price": [999.0],
},
index=pd.DatetimeIndex([pd.Timestamp("2025-01-01")])
)
print("Correction data (NaN in Sector)")
display(correction_nan)
# NaN == NaN in float columns, so the NaN row at 2025-01-01 will match
lib.merge_experimental(
"float_nan_example",
correction_nan,
strategy=MergeStrategy(matched="update", not_matched_by_target="do_nothing"),
on=["Sector"]
)
print("Data after merge - NaN matched NaN in the float Sector column")
display(lib.read("float_nan_example").data)
Original data
| Sector | Price | |
|---|---|---|
| 2025-01-01 | NaN | 100.0 |
| 2025-01-01 | 1.0 | 200.0 |
| 2025-01-02 | NaN | 300.0 |
| 2025-01-02 | 2.0 | 400.0 |
Correction data (NaN in Sector)
| Sector | Price | |
|---|---|---|
| 2025-01-01 | NaN | 999.0 |
Data after merge - NaN matched NaN in the float Sector column
| Sector | Price | |
|---|---|---|
| 2025-01-01 | NaN | 999.0 |
| 2025-01-01 | 1.0 | 200.0 |
| 2025-01-02 | NaN | 300.0 |
| 2025-01-02 | 2.0 | 400.0 |
The NaN value in the Sector column matched between source and target at 2025-01-01, so only that row was updated. The row with Sector=1.0 at the same timestamp was not affected.
Example: None and NaN Are Indistinguishable in String Columns¶
In string columns, None and NaN are treated as the same value. A source row with None in a string on column will match a target row with NaN in the same column (and vice versa).
# Target: data with NaN in a string column
string_none_data = pd.DataFrame(
{
"Exchange": [np.nan, "LSE", None, "LSE"],
"Price": [100.0, 200.0, 300.0, 400.0],
},
index=pd.DatetimeIndex([
pd.Timestamp("2025-01-01"),
pd.Timestamp("2025-01-01"),
pd.Timestamp("2025-01-02"),
pd.Timestamp("2025-01-02"),
])
)
lib.write("string_none_example", string_none_data)
print("Original data (NaN at 2025-01-01, None at 2025-01-02 in Exchange)")
display(lib.read("string_none_example").data)
# Source: correction with None in the string column - will match both NaN and None in target
correction_none = pd.DataFrame(
{
"Exchange": [None],
"Price": [999.0],
},
index=pd.DatetimeIndex([pd.Timestamp("2025-01-01")])
)
print("Correction data (None in Exchange)")
display(correction_none)
# None in source matches NaN in target for string columns
lib.merge_experimental(
"string_none_example",
correction_none,
strategy=MergeStrategy(matched="update", not_matched_by_target="do_nothing"),
on=["Exchange"]
)
print("Data after merge - None in source matched NaN in target's string column")
display(lib.read("string_none_example").data)
Original data (NaN at 2025-01-01, None at 2025-01-02 in Exchange)
| Exchange | Price | |
|---|---|---|
| 2025-01-01 | NaN | 100.0 |
| 2025-01-01 | LSE | 200.0 |
| 2025-01-02 | None | 300.0 |
| 2025-01-02 | LSE | 400.0 |
Correction data (None in Exchange)
| Exchange | Price | |
|---|---|---|
| 2025-01-01 | None | 999.0 |
Data after merge - None in source matched NaN in target's string column
20260219 13:46:36.494286 1173784 I arcticdb | Column Exchange does not have non null elements.
| Exchange | Price | |
|---|---|---|
| 2025-01-01 | None | 999.0 |
| 2025-01-01 | LSE | 200.0 |
| 2025-01-02 | None | 300.0 |
| 2025-01-02 | LSE | 400.0 |
The target had NaN in the Exchange column at 2025-01-01 and None at 2025-01-02. The source contained None in Exchange at 2025-01-01. Because None and NaN are indistinguishable in string columns, the source row matched the target row at 2025-01-01 (where the target had NaN). The row at 2025-01-02 was not affected because the index did not match.
MultiIndex Support¶
Merge also works with pd.MultiIndex DataFrames. The behavior depends on the first level of the MultiIndex:
- Datetime first level: the DataFrame is treated like a time series, meaning the first index level is automatically included in the on parameter and used for matching. The remaining index levels are treated as regular columns: they can be updated by the merge or included in on for additional matching.
- Non-datetime first level: the DataFrame is treated like a row-range-indexed DataFrame. The on parameter must be a non-empty list of columns, and matching is performed on those columns.
Example: MultiIndex with Datetime First Level¶
When the first level of the MultiIndex is a datetime, merge matches on the datetime values. All other index levels and data columns are eligible for update.
# Target: daily prices with a MultiIndex of (date, exchange)
dates = pd.to_datetime(["2025-01-01", "2025-01-01", "2025-01-02", "2025-01-02", "2025-01-03", "2025-01-03"])
exchanges = ["NYSE", "LSE", "NYSE", "LSE", "NYSE", "LSE"]
idx = pd.MultiIndex.from_arrays([dates, exchanges], names=["Date", "Exchange"])
multi_target = pd.DataFrame(
{"Price": [100.0, 200.0, 110.0, 210.0, 120.0, 220.0]},
index=idx
)
lib.write("multi_datetime", multi_target)
print("Original data (MultiIndex with datetime first level)")
display(lib.read("multi_datetime").data)
# Source: corrections for 2025-01-02 — updates both NYSE and LSE rows at that date
source_dates = pd.to_datetime(["2025-01-02", "2025-01-02"])
source_idx = pd.MultiIndex.from_arrays(
[source_dates, ["CORRECTED_NYSE", "CORRECTED_LSE"]],
names=["Date", "Exchange"]
)
source_correction = pd.DataFrame({"Price": [999.0, 888.0]}, index=source_idx)
print("Correction data")
display(source_correction)
# Merge: matches on datetime (first level), updates Exchange and Price
lib.merge_experimental(
"multi_datetime",
source_correction,
strategy=MergeStrategy(matched="update", not_matched_by_target="do_nothing")
)
print("Data after merge — only 2025-01-02 rows updated")
display(lib.read("multi_datetime").data)
Both rows at 2025-01-02 were updated because they share the same datetime. The Exchange level — which is the second index level — was also updated to the new values from source. Rows at other dates remain unchanged.
To update only a specific exchange at a given date, use the on parameter with the second index level name. Note that since Exchange is a non-first index level, it is treated as a regular column internally and can be used in on.
# Reset to original data
lib.write("multi_datetime_on", multi_target)
print("Original data")
display(lib.read("multi_datetime_on").data)
# Source: correction only for NYSE at 2025-01-02
source_dates = pd.to_datetime(["2025-01-02"])
source_idx = pd.MultiIndex.from_arrays([source_dates, ["NYSE"]], names=["Date", "Exchange"])
source_nyse_only = pd.DataFrame({"Price": [999.0]}, index=source_idx)
print("Correction data (NYSE only)")
display(source_nyse_only)
# Merge with on=["Exchange"] — matches on datetime AND Exchange
lib.merge_experimental(
"multi_datetime_on",
source_nyse_only,
strategy=MergeStrategy(matched="update", not_matched_by_target="do_nothing"),
on=["Exchange"]
)
print("Data after merge with on=['Exchange'] — only NYSE at 2025-01-02 updated")
display(lib.read("multi_datetime_on").data)
With on=["Exchange"], matching requires both the datetime (first level) and the Exchange value to be equal. Only the NYSE row at 2025-01-02 was updated; the LSE row at the same date was left unchanged.
Example: MultiIndex with Non-Datetime First Level¶
When the first level of the MultiIndex is not a datetime, the DataFrame is treated like a row-range-indexed DataFrame. The on parameter is required and all index levels become updatable columns.
# Target: product catalog with a MultiIndex of (Category, ProductID)
idx = pd.MultiIndex.from_arrays(
[["Electronics", "Electronics", "Clothing", "Clothing"],
[101, 102, 201, 202]],
names=["Category", "ProductID"]
)
catalog = pd.DataFrame(
{"Name": ["Laptop", "Phone", "Jacket", "Shoes"], "Price": [999.0, 499.0, 150.0, 89.0]},
index=idx
)
lib.write("catalog", catalog)
print("Original data (MultiIndex with non-datetime first level)")
display(lib.read("catalog").data)
# Source: update the price of product with Name="Phone"
source_idx = pd.MultiIndex.from_arrays([["UPDATED"], [999]], names=["Category", "ProductID"])
source_update = pd.DataFrame({"Name": ["Phone"], "Price": [399.0]}, index=source_idx)
print("Correction data")
display(source_update)
# Merge with on=["Name"] — matches on the Name column
lib.merge_experimental(
"catalog",
source_update,
strategy=MergeStrategy(matched="update", not_matched_by_target="do_nothing"),
on=["Name"]
)
print("Data after merge with on=['Name'] — Phone row updated")
display(lib.read("catalog").data)
The row with Name="Phone" was matched and updated. Because the MultiIndex has a non-datetime first level, matching is done purely on the on columns — the index levels (Category, ProductID) are treated as regular columns and were also updated from the source values.
Limitations¶
The current implementation has a few limitations:
- Both target and source must have a sorted index when using pd.DatetimeIndex or a pd.MultiIndex with a datetime first level.
- The library must use a static schema, and both source and target must share the same schema.
- The only supported strategy is MergeStrategy(matched="update", not_matched_by_target="do_nothing").
- For pd.MultiIndex, the behavior is determined by the first level: a datetime first level uses datetime matching, while a non-datetime first level uses row-range matching and requires the on parameter.