I have a dataset of real-estate ads. Several rows describe the same property, so the data is full of near-duplicates that are not exactly identical. What is the best way to detect and remove rows that are so similar they must be duplicates?
It looks like this:
ID URL CRAWL_SOURCE PROPERTY_TYPE NEW_BUILD DESCRIPTION IMAGES SURFACE LAND_SURFACE BALCONY_SURFACE ... DEALER_NAME DEALER_TYPE CITY_ID CITY ZIP_CODE DEPT_CODE PUBLICATION_START_DATE PUBLICATION_END_DATE LAST_CRAWL_DATE LAST_PRICE_DECREASE_DATE
0 22c05930-0eb5-11e7-b53d-bbead8ba43fe http://www.avendrealouer.fr/location/levallois... A_VENDRE_A_LOUER APARTMENT False Au rez de chaussée d'un bel immeuble récent,... ["https://cf-medias.avendrealouer.fr/image/_87... 72.0 NaN NaN ... Lamirand Et Associes AGENCY 54178039 Levallois-Perret 92300.0 92 2017-03-22T04:07:56.095 NaN 2017-04-21T18:52:35.733 NaN
1 8d092fa0-bb99-11e8-a7c9-852783b5a69d https://www.bienici.com/annonce/ag440414-16547... BIEN_ICI APARTMENT False Je vous propose un appartement dans la rue Col... ["http://photos.ubiflow.net/440414/165474561/p... 48.0 NaN NaN ... Proprietes Privees MANDATARY 54178039 Levallois-Perret 92300.0 92 2018-09-18T11:04:44.461 NaN 2019-06-06T10:08:10.89 2018-09-25
So far I have tried comparing the descriptions:
df['is_duplicated'] = df.duplicated(['DESCRIPTION'])
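Exact matching on DESCRIPTION only catches identical text, though. A minimal sketch of the kind of fuzzy comparison I have in mind, using difflib.SequenceMatcher on consecutive rows (the 0.9 threshold and the NextDescription / desc_similar column names are just placeholders I made up):

import difflib
import pandas as pd

def description_similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the two description strings are identical.
    if pd.isna(a) or pd.isna(b):
        return 0.0
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Compare each description with the one on the next row.
df['NextDescription'] = df['DESCRIPTION'].shift(-1)
df['desc_similar'] = df.apply(
    lambda x: description_similarity(x['DESCRIPTION'], x['NextDescription']) > 0.9,
    axis=1,
)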
and comparing the arrays of photos:
import ast
from io import BytesIO

import imagehash
import requests
from PIL import Image

def image_similarity(imageAurls, imageBurls):
    # Two listings count as 'similar' if any pair of their photos has a close average hash.
    imageAurls = ast.literal_eval(imageAurls)
    imageBurls = ast.literal_eval(imageBurls)
    cutoff = 5  # maximum Hamming distance between hashes to treat two photos as the same
    for urlA in imageAurls:
        responseA = requests.get(urlA)
        imgA = Image.open(BytesIO(responseA.content))
        hashA = imagehash.average_hash(imgA)
        for urlB in imageBurls:
            responseB = requests.get(urlB)
            imgB = Image.open(BytesIO(responseB.content))
            hashB = imagehash.average_hash(imgB)
            if hashA - hashB < cutoff:
                return 'similar'
    return 'not similar'
# Pair each listing's photos with the photos of the next row (the last row has no pair).
df['NextImage'] = df['IMAGES'].shift(-1)
df['IsSimilar'] = df[:-1].apply(lambda x: image_similarity(x['IMAGES'], x['NextImage']), axis=1)
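Since this downloads every photo pair repeatedly and is very slow, I also considered hashing only the first photo of each listing once and then comparing the hashes of consecutive rows. A rough sketch under that assumption (first_image_hash and the ImageHash / NextImageHash columns are names I made up; the cutoff of 5 is the same one as above):

import ast
from io import BytesIO

import imagehash
import requests
from PIL import Image

def first_image_hash(urls_field):
    # Hash only the first photo of a listing as a cheap proxy for the whole set.
    try:
        urls = ast.literal_eval(urls_field)
        response = requests.get(urls[0], timeout=10)
        return imagehash.average_hash(Image.open(BytesIO(response.content)))
    except Exception:
        return None  # missing/broken URL field or download failure

df['ImageHash'] = df['IMAGES'].apply(first_image_hash)
df['NextImageHash'] = df['ImageHash'].shift(-1)

cutoff = 5
df['IsSimilar'] = [
    isinstance(a, imagehash.ImageHash)
    and isinstance(b, imagehash.ImageHash)
    and (a - b) < cutoff
    for a, b in zip(df['ImageHash'], df['NextImageHash'])
]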