Remove rows that are too similar to not be duplicates

data-mining python data-cleaning
2022-02-23 17:53:54

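I have a dataset of real-estate ads. Several rows describe the same property, so it is full of duplicates that are not exactly identical. What is the best way to remove rows that are too similar to not be duplicates?

It looks like this: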

    ID  URL CRAWL_SOURCE    PROPERTY_TYPE   NEW_BUILD   DESCRIPTION IMAGES  SURFACE LAND_SURFACE    BALCONY_SURFACE ... DEALER_NAME DEALER_TYPE CITY_ID CITY    ZIP_CODE    DEPT_CODE   PUBLICATION_START_DATE  PUBLICATION_END_DATE    LAST_CRAWL_DATE LAST_PRICE_DECREASE_DATE
0   22c05930-0eb5-11e7-b53d-bbead8ba43fe    http://www.avendrealouer.fr/location/levallois...   A_VENDRE_A_LOUER    APARTMENT   False   Au rez de chaussée d'un bel immeuble récent,...   ["https://cf-medias.avendrealouer.fr/image/_87...   72.0    NaN NaN ... Lamirand Et Associes    AGENCY  54178039    Levallois-Perret    92300.0 92  2017-03-22T04:07:56.095 NaN 2017-04-21T18:52:35.733 NaN
1   8d092fa0-bb99-11e8-a7c9-852783b5a69d    https://www.bienici.com/annonce/ag440414-16547...   BIEN_ICI    APARTMENT   False   Je vous propose un appartement dans la rue Col...   ["http://photos.ubiflow.net/440414/165474561/p...   48.0    NaN NaN ... Proprietes Privees  MANDATARY   54178039    Levallois-Perret    92300.0 92  2018-09-18T11:04:44.461 NaN 2019-06-06T10:08:10.89  2018-09-25

So far, I have tried comparing the descriptions (this only flags descriptions that match exactly):

df['is_duplicated'] = df.duplicated(['DESCRIPTION'])

and comparing the arrays of photos:

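import ast
from io import BytesIO

import imagehash
import pandas as pd
import requests
from PIL import Image

def image_similarity(imageAurls, imageBurls, cutoff=5):
    # The first row has no previous row to compare against.
    if pd.isna(imageAurls) or pd.isna(imageBurls):
        return 'not similar'
    # IMAGES holds the string representation of a list of URLs.
    imageAurls = ast.literal_eval(imageAurls)
    imageBurls = ast.literal_eval(imageBurls)
    for urlA in imageAurls:
        responseA = requests.get(urlA)
        hashA = imagehash.average_hash(Image.open(BytesIO(responseA.content)))
        for urlB in imageBurls:
            responseB = requests.get(urlB)
            hashB = imagehash.average_hash(Image.open(BytesIO(responseB.content)))
            # Subtracting two hashes gives their Hamming distance.
            if hashA - hashB < cutoff:
                print(urlA)
                print(urlB)
                return 'similar'
    # Report 'not similar' only after every pair of images has been compared.
    return 'not similar'

# Compare each row's images with those of the previous row.
df['PrevImage'] = df['IMAGES'].shift(1)
df['IsSimilar'] = df.apply(lambda x: image_similarity(x['IMAGES'], x['PrevImage']), axis=1)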

1 Answer

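Have a look at record linkage methods. They are usually used to find the same entities across two datasets, but they can also be used to link a dataset to itself and find duplicate records (called deduplication in the literature).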
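
To make the idea concrete, here is a minimal sketch of deduplication as record linkage of a dataset against itself, using only the Python standard library. It follows the usual three steps: blocking (only compare ads in the same city), comparison (fuzzy description match plus nearly equal surface), and grouping linked records so that one ad per group is kept. The column names come from the question; the toy rows, the helper names (`candidate_pairs`, `is_match`, `deduplicate`) and the thresholds (0.85 text similarity, 1 m² surface tolerance) are assumptions for illustration and would need tuning on the real data. A dedicated library such as the Python Record Linkage Toolkit (`recordlinkage`) is probably a better fit for a large table.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Toy records standing in for rows of the ads table (values are made up).
ads = [
    {"ID": "a1", "CITY_ID": 54178039, "SURFACE": 72.0,
     "DESCRIPTION": "Au rez de chaussee d'un bel immeuble recent, appartement 3 pieces avec terrasse"},
    {"ID": "a2", "CITY_ID": 54178039, "SURFACE": 72.0,
     "DESCRIPTION": "Au rez-de-chaussee d'un bel immeuble recent, appartement de 3 pieces avec terrasse."},
    {"ID": "a3", "CITY_ID": 54178039, "SURFACE": 48.0,
     "DESCRIPTION": "Je vous propose un appartement lumineux de deux pieces proche du metro"},
    {"ID": "a4", "CITY_ID": 75056000, "SURFACE": 30.0,
     "DESCRIPTION": "Studio meuble au dernier etage avec ascenseur"},
]

def candidate_pairs(records, block_key):
    """Blocking: only pair up records that share the same block_key value."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[rec[block_key]].append(i)
    for members in blocks.values():
        yield from combinations(members, 2)

def is_match(a, b, text_threshold=0.85, surface_tolerance=1.0):
    """Comparison: similar description and nearly equal surface (assumed rule)."""
    text_sim = SequenceMatcher(
        None, a["DESCRIPTION"].lower(), b["DESCRIPTION"].lower()
    ).ratio()
    same_surface = abs(a["SURFACE"] - b["SURFACE"]) <= surface_tolerance
    return text_sim >= text_threshold and same_surface

def deduplicate(records, block_key="CITY_ID"):
    """Group linked records with union-find and keep the first of each group."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in candidate_pairs(records, block_key):
        if is_match(records[i], records[j]):
            ri, rj = find(i), find(j)
            parent[max(ri, rj)] = min(ri, rj)  # smallest index becomes the root

    return [rec for i, rec in enumerate(records) if find(i) == i]

unique_ads = deduplicate(ads)
print([ad["ID"] for ad in unique_ads])  # ['a1', 'a3', 'a4']
```

Blocking keeps the number of comparisons manageable: without it, every ad would be compared with every other ad. The price is that two ads for the same property are never linked if they disagree on the blocking key, so the key should be a field that is reliably filled in.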