如何与最小欧几里得距离合并?

数据挖掘 Python 熊猫 数据
2022-02-17 18:01:02

给定一个主left数据集,如何在指定列上right与具有最小欧几里得距离 ( ) 的数据集合并?d = sqrt(a^2 + b^2)

细节:

  • 如果 ofkey1key2fromleft都存在于 中,则与匹配and且具有最小值 fromright的行合并key1key2sqrt((aux1r - aux1l)^2 + (aux2r - aux2l)^2)
  • 如果 ofkey1key2are notNaN且 ofkey1key2fromleft都不存在于 中right,则与具有最小值 from 的行合并sqrt((aux1r - aux1l)^2 + (aux2r - aux2l)^2)
  • 如果左侧key1其中之一是 ,则与右侧与非匹配的行合并,并且具有最小值key2NaNNaN key1key2sqrt((aux1r - aux1l)^2 + (aux2r - aux2l)^2)
  • 如果 ofkey1key2from left 都是NaN,则与具有最小值 from 的右侧行合并sqrt((aux1r - aux1l)^2 + (aux2r - aux2l)^2)

输入数据框的示例,以及合并后想要的数据框:

import pandas as pd
import numpy as np

left = pd.DataFrame([
    #  "key1"      "key2"        "aux1l" "aux2l" "left1"        "left2"
    [np.nan,       np.nan,       1.00,   1.00,   "unimportant", "unimportant"],  # left[0]
    ["key1",       np.nan,       1.00,   1.00,   "unimportant", "unimportant"],  # left[1]
    [np.nan,       "key2",       1.00,   1.00,   "unimportant", "unimportant"],  # left[2]
    ["key1",       "key2",       1.00,   1.00,   "unimportant", "unimportant"],  # left[3]
    ["key1unique", "key2unique", 1.00,   1.00,   "unimportant", "unimportant"],  # left[4]
    ["key1repeat", "key2repeat", 1.00,   1.00,   "unimportant", "unimportant"],  # left[5]
    ["key1repeat", "key2repeat", 1.00,   1.00,   "unimportant", "unimportant"],  # left[6]
], columns=["key1", "key2", "aux1l", "aux2l", "left1", "left2"])

right = pd.DataFrame([
    # "key1"       "key2"        "aux1r"  "aux2r"  "right1"       "right2"
    [np.nan,       "key2",       0.99,    0.97,    "unimportant", "unimportant"],
    ["key1",       "key2",       0.99,    0.96,    "unimportant", "unimportant"],
    ["key1repeat", "key2repeat", 1.85,    1.56,    "unimportant", "unimportant"],
    ["key1repeat", "key2repeat", 0.99,    0.99,    "unimportant", "unimportant"],
], columns=["key1", "key2", "aux1r", "aux2r", "right1", "right2"])

# what to do here?
# left.merge(right) discards left with no matches (left[4] discarded, but want to fill with closest match with aux1l/aux2l with aux1r/aux2r)

# it does not matter if aux1r and aux2r is included
wanted = pd.DataFrame([
    # "key1"       "key2"        "auxl1"  "aux2l" "left1"       "left2"        "right1"       "right2"
    [np.nan,       np.nan,       1.00,   1.00,   "unimportant", "unimportant", "unimportant", "unimportant"],
    ["key1",       np.nan,       1.00,   1.00,   "unimportant", "unimportant", "unimportant", "unimportant"],
    [np.nan,       "key2",       1.00,   1.00,   "unimportant", "unimportant", "unimportant", "unimportant"],
    ["key1",       "key2",       1.00,   1.00,   "unimportant", "unimportant", "unimportant", "unimportant"],
    ["key1unique", "key2unique", 1.00,   1.00,   "unimportant", "unimportant", "unimportant", "unimportant"],
    ["key1repeat", "key2repeat", 1.00,   1.00,   "unimportant", "unimportant", "unimportant", "unimportant"],
    ["key1repeat", "key2repeat", 1.00,   1.00,   "unimportant", "unimportant", "unimportant", "unimportant"],
], columns=["key1", "key2", "aux1l", "aux2l", "left1", "left2", "right1", "right2"])
1个回答

让我们简化要求

右侧数据框中的所需行是

  1. 具有最少数量的不匹配键
  2. 具有最小欧几里得距离
def merge_left_and_right(left, right):
    # save original input from modification
    left = left.copy()
    right = right.copy()
    
    # numerate rows in the right dataframe
    right['row_num'] = range(len(right))

    # find the best matchings for left dataframe
    left['row_num'] = [
        find_the_best_matching(right,
                               key1=row['key1'],
                               key2=row['key2'],
                               aux1l=row['aux1l'],
                               aux2l=row['aux2l'])
        for index, row in left.iterrows()
    ]
    
    # merge dataframe by row number
    right = right.drop(['key1', 'key2'], axis=1)
    merged = left.merge(right, on='row_num', how='left')
    merged = merged.drop('row_num', axis=1)
    
    return merged

def find_the_best_matching(right, key1, key2, aux1l, aux2l):
    right = right.copy()

    # keys match only when they aren't NaN and equal ("np.nan != np.nan" is True)
    right['unmatched_key_count'] = 0
    right['unmatched_key_count'] += (right['key1'] != key1).astype(int)
    right['unmatched_key_count'] += (right['key2'] != key2).astype(int)

    right['euclidean_distance'] = np.sqrt((right['aux1r'] - aux1l) ** 2 + (right['aux2r'] - aux2l) ** 2)
    
    # Sort by unmatched amount, then by distance. The first row will be best
    return right.sort_values(['unmatched_key_count', 'euclidean_distance']).iloc[0]['row_num']

在此处输入图像描述