How to compare occurrence counts in a column of two DataFrames and extract the matches

data-mining pandas
2022-02-24 09:23:03

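I am trying to find the ids that occur the same number of times in two DataFrames. This is a follow-up to my previous question.
I have 2 DataFrames: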

import pandas as pd

df1=pd.DataFrame([[1,None],[1,None],[1,None],[1,'item_a'],[2,'item_a'],[2,'item_b'],[2,'item_f'],[3,'item_e'],[3,'item_e'],[3,'item_g'],[3,'item_h']],columns=['id','A'])
df2=pd.DataFrame([[1,'item_a'],[1,'item_b'],[1,'item_c'],[1,'item_d'],[2,'item_a'],[2,'item_b'],[2,'item_c'],[2,'item_d'],[3,'item_e'],[3,'item_f'],[3,'item_g'],[3,'item_h']],columns=['id','A'])

df1
    id  A
0   1   None
1   1   None
2   1   None
3   1   item_a # id 1 has 1 non-null occurrence in total in df1
4   2   item_a
5   2   item_b
6   2   item_f # id 2 has 3 occurrences in total in df1
7   3   item_e
8   3   item_e
9   3   item_g
10  3   item_h # id 3 has 4 occurrences in total in df1



df2
    id  A
0   1   item_a
1   1   item_b
2   1   item_c
3   1   item_d
4   2   item_a
5   2   item_b
6   2   item_c
7   2   item_d
8   3   item_e
9   3   item_f
10  3   item_g
11  3   item_h


I got an answer showing how to find the similarities by using an inner merge:

previous result:
d=pd.merge(df1,df2,how='inner')
    id  A
3   1   item_a # id 1 has 1 occurrence in total in d
4   2   item_a
5   2   item_b # id 2 has 2 occurrences in d, which does not match its 3 occurrences in df1
7   3   item_e
8   3   item_e
9   3   item_g
10  3   item_h # id 3 has 4 occurrences in total in d

I tried to find the ids with the same number of occurrences in both DataFrames:
d[d['id'].value_counts()==df1['id'].value_counts()]
This raised an error: Can only compare identically-labeled Series objects
I also tried a different approach, using rename to give the value_counts results column names and then merging them, but that failed too.

Matching: an id matches when its count of occurrences in df1 equals its count of occurrences in the resulting DataFrame d.

           cnt_in_df1 | cnt_in_d
for id 1:      1      |    1      # match    => id 1 should be in the desired output
for id 2:      3      |    2      # mismatch => id 2 should not be in the desired output
for id 3:      4      |    4      # match    => id 3 should be in the desired output

My desired output for this question:

        id  count 
    0   1    1
    1   3    4
1 Answer

Edit: Thanks for clarifying the question. So the problem now is to check whether the count for each id is the same in both DataFrames.

You can do it like this:

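# Count the non-null values of A per id in df1 and in the merged frame d
d1 = df1[df1['A'].notnull()].groupby("id").size().rename("cnt_in_df1")
d2 = d[d['A'].notnull()].groupby("id").size().rename("cnt_in_d")

# Align the two counts on id (both Series are indexed by id)
counts = pd.merge(d1, d2, left_index=True, right_index=True)

# Keep the ids whose counts agree
ids_ = counts[counts["cnt_in_df1"] == counts["cnt_in_d"]].index.values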

RETURN:
array([1, 3])

This gives an array of the ids whose counts are the same in df1 and d.
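If the id/count table from the question is wanted rather than a bare array, the same idea can be taken one step further. The following is only a sketch under the question's sample data; the names cnt_df1, cnt_d and result are illustrative and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[1, None], [1, None], [1, None], [1, "item_a"],
     [2, "item_a"], [2, "item_b"], [2, "item_f"],
     [3, "item_e"], [3, "item_e"], [3, "item_g"], [3, "item_h"]],
    columns=["id", "A"],
)
df2 = pd.DataFrame(
    [[1, "item_a"], [1, "item_b"], [1, "item_c"], [1, "item_d"],
     [2, "item_a"], [2, "item_b"], [2, "item_c"], [2, "item_d"],
     [3, "item_e"], [3, "item_f"], [3, "item_g"], [3, "item_h"]],
    columns=["id", "A"],
)

# Rows of df1 that also appear in df2 (same id and same A)
d = pd.merge(df1, df2, how="inner")

# count() skips nulls, so the None rows of df1 are not counted
cnt_df1 = df1.groupby("id")["A"].count()
cnt_d = d.groupby("id")["A"].count()

# Put the merged counts on df1's ids so both Series share the same labels;
# ids that are absent from d get a count of 0
cnt_d = cnt_d.reindex(cnt_df1.index, fill_value=0)

# Keep the ids whose counts agree and shape them like the desired output
result = cnt_df1[cnt_df1 == cnt_d].rename("count").reset_index()
print(result)
```

With the sample data this keeps ids 1 and 3, with counts 1 and 4. As for the error in the question: value_counts() orders its result by frequency, so the two Series carry the ids in a different order, and == between two Series requires identical labels. It also counts the rows where A is None, which is why the nulls are excluded here before comparing.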