给定一个主left数据集,如何在指定列上right与具有最小欧几里得距离 ( ) 的数据集合并?d = sqrt(a^2 + b^2)
细节:
- 如果 of
key1和key2fromleft都存在于 中,则与匹配and且具有最小值 fromright的行合并key1key2sqrt((aux1r - aux1l)^2 + (aux2r - aux2l)^2) - 如果 of
key1和key2are notNaN且 ofkey1和key2fromleft都不存在于 中right,则与具有最小值 from 的行合并sqrt((aux1r - aux1l)^2 + (aux2r - aux2l)^2) - 如果左侧
key1或其中之一是 ,则与右侧与非或匹配的行合并,并且具有最小值key2NaNNaNkey1key2sqrt((aux1r - aux1l)^2 + (aux2r - aux2l)^2) - 如果 of
key1和key2from left 都是NaN,则与具有最小值 from 的右侧行合并sqrt((aux1r - aux1l)^2 + (aux2r - aux2l)^2)
输入数据框的示例,以及合并后想要的数据框:
import pandas as pd
import numpy as np
left = pd.DataFrame([
# "key1" "key2" "aux1l" "aux2l" "left1" "left2"
[np.nan, np.nan, 1.00, 1.00, "unimportant", "unimportant"], # left[0]
["key1", np.nan, 1.00, 1.00, "unimportant", "unimportant"], # left[1]
[np.nan, "key2", 1.00, 1.00, "unimportant", "unimportant"], # left[2]
["key1", "key2", 1.00, 1.00, "unimportant", "unimportant"], # left[3]
["key1unique", "key2unique", 1.00, 1.00, "unimportant", "unimportant"], # left[4]
["key1repeat", "key2repeat", 1.00, 1.00, "unimportant", "unimportant"], # left[5]
["key1repeat", "key2repeat", 1.00, 1.00, "unimportant", "unimportant"], # left[6]
], columns=["key1", "key2", "aux1l", "aux2l", "left1", "left2"])
right = pd.DataFrame([
# "key1" "key2" "aux1r" "aux2r" "right1" "right2"
[np.nan, "key2", 0.99, 0.97, "unimportant", "unimportant"],
["key1", "key2", 0.99, 0.96, "unimportant", "unimportant"],
["key1repeat", "key2repeat", 1.85, 1.56, "unimportant", "unimportant"],
["key1repeat", "key2repeat", 0.99, 0.99, "unimportant", "unimportant"],
], columns=["key1", "key2", "aux1r", "aux2r", "right1", "right2"])
# what to do here?
# left.merge(right) discards left with no matches (left[4] discarded, but want to fill with closest match with aux1l/aux2l with aux1r/aux2r)
# it does not matter if aux1r and aux2r is included
wanted = pd.DataFrame([
# "key1" "key2" "auxl1" "aux2l" "left1" "left2" "right1" "right2"
[np.nan, np.nan, 1.00, 1.00, "unimportant", "unimportant", "unimportant", "unimportant"],
["key1", np.nan, 1.00, 1.00, "unimportant", "unimportant", "unimportant", "unimportant"],
[np.nan, "key2", 1.00, 1.00, "unimportant", "unimportant", "unimportant", "unimportant"],
["key1", "key2", 1.00, 1.00, "unimportant", "unimportant", "unimportant", "unimportant"],
["key1unique", "key2unique", 1.00, 1.00, "unimportant", "unimportant", "unimportant", "unimportant"],
["key1repeat", "key2repeat", 1.00, 1.00, "unimportant", "unimportant", "unimportant", "unimportant"],
["key1repeat", "key2repeat", 1.00, 1.00, "unimportant", "unimportant", "unimportant", "unimportant"],
], columns=["key1", "key2", "aux1l", "aux2l", "left1", "left2", "right1", "right2"])
