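I have a dataset in which the score column must be between 0 and 100, and the subject column must be one of ['Math','Science','English']. However, some rows in my dataset contain other values.
How should I handle these rows?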
subject score ...
1 Math 90 ...
2 Science 85 ...
3 English 105 ...
4 Comp 95 ...
5 Math 80 ...
6 Science 70 ...
If you need to clean the data, you can either drop the rows that contain invalid values or try to correct them.
Here are two examples:
# If you want to correct the scores, so that values below 0
# become 0 and values above 100 become 100, you can use clip:
df['score'] = df['score'].clip(0, 100)
# Alternatively (in case you have more complicated operations),
# you can use where, which keeps the values for which the
# condition is True and replaces all others. For the correction
# of the scores, this would look like the two lines below.
# Note that, unlike clip, this also replaces missing scores (NaN),
# because comparisons with NaN evaluate to False.
df['score'] = df['score'].where(df['score'] <= 100, 100)
df['score'] = df['score'].where(df['score'] >= 0, 0)
# If you want to drop the rows that contain invalid
# subjects, you can do that as follows:
valid_subjects = ['Math', 'Science', 'English']
# boolean Series that is True for every row with an invalid subject
invalid_subj_indexer = ~df['subject'].isin(valid_subjects)
# select the index labels of those rows and drop them in place
df.drop(invalid_subj_indexer.index[invalid_subj_indexer], inplace=True)
The result looks like this:
subject score ...
1 Math 90 ...
2 Science 85 ...
3 English 100 ...
5 Math 80 ...
6 Science 70 ...
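Before changing or dropping anything, it can help to look at which rows actually violate the rules. The following is a minimal sketch of that idea, not part of the original answer: the sample frame is rebuilt by hand from the question's table, and the names `bad_score`, `bad_subject` and `invalid_rows` are made up for illustration.

```python
import pandas as pd

# Hand-built copy of the table from the question (without the "..." column)
df = pd.DataFrame(
    {
        "subject": ["Math", "Science", "English", "Comp", "Math", "Science"],
        "score": [90, 85, 105, 95, 80, 70],
    },
    index=[1, 2, 3, 4, 5, 6],
)
valid_subjects = ["Math", "Science", "English"]

# True where the score lies outside [0, 100]; between() includes both
# bounds by default. A missing score (NaN) would also be flagged here.
bad_score = ~df["score"].between(0, 100)

# True where the subject is not one of the allowed values
bad_subject = ~df["subject"].isin(valid_subjects)

# Rows that break at least one of the two rules
invalid_rows = df[bad_score | bad_subject]
print(invalid_rows)
```

Keeping the two masks separate makes it easy to treat the cases differently, for example clipping the rows in `bad_score` while dropping the rows in `bad_subject`.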
You can test the lines above by first running the following lines to create a test DataFrame:
import io
import pandas as pd
raw=\
""" subject score ...
1 Math 90 ...
2 Science 85 ...
3 English 105 ...
4 Comp 95 ...
5 Math 80 ...
6 Science 70 ..."""
df = pd.read_csv(io.StringIO(raw), sep=r'\s+')
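If you would rather not modify the DataFrame in place, both steps can be wrapped in a small function that returns a cleaned copy. This is only a sketch under that assumption: the function name `clean_scores` and its parameters are hypothetical, and the sample frame is again built by hand from the question's table.

```python
import pandas as pd


def clean_scores(df, valid_subjects, low=0, high=100):
    """Return a cleaned copy of df; the input frame is left unchanged."""
    # Keep only the rows whose subject is allowed (boolean-mask filtering)
    kept = df[df["subject"].isin(valid_subjects)].copy()
    # Force the scores into the range [low, high]
    kept["score"] = kept["score"].clip(low, high)
    return kept


sample = pd.DataFrame(
    {
        "subject": ["Math", "Science", "English", "Comp", "Math", "Science"],
        "score": [90, 85, 105, 95, 80, 70],
    },
    index=[1, 2, 3, 4, 5, 6],
)

cleaned = clean_scores(sample, ["Math", "Science", "English"])
print(cleaned)
```

Filtering with a boolean mask does the same job as `drop(..., inplace=True)` above, but leaves the original data available for comparison.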