How to deal with out-of-range values?

data-mining python pandas data-cleaning
2022-02-17 17:30:00

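I have a dataset in which the score column must be between 0 and 100, and the subject column must be one of ['Math', 'Science', 'English']. However, some rows in my dataset contain other values.

How should I handle those rows?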

  subject score ...
1 Math    90    ...
2 Science 85    ...
3 English 105   ...
4 Comp    95    ...
5 Math    80    ...
6 Science 70    ...
1 Answer

If you need to clean the data, you can either drop the rows that contain invalid values or try to correct them.

Here are two examples:

# To limit the scores so that values below 0 become 0 and
# values above 100 become 100, you can use clip:

df['score'] = df['score'].clip(0, 100)

# Alternatively (in case you need more complicated operations),
# you can use where, which keeps the values where the condition
# is True and replaces the rest. For the scores, this would look like:

df['score'] = df['score'].where(df['score'] <= 100, 100)
df['score'] = df['score'].where(df['score'] >= 0, 0)

# To drop the rows that contain undefined subjects,
# you can do the following:

valid_subjects = ['Math', 'Science', 'English']
# boolean indexer that is True for all rows with an invalid subject
invalid_subj_indexer = ~df['subject'].isin(valid_subjects)
# now drop them
df.drop(invalid_subj_indexer.index[invalid_subj_indexer], inplace=True)

The result looks like this:

   subject  score  ...
1     Math     90  ...
2  Science     85  ...
3  English    100  ...
5     Math     80  ...
6  Science     70  ...
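
Before dropping or correcting anything, it can help to see which rows actually break the rules. The following is a minimal sketch of that check, assuming the same column names as in the question; the frame is rebuilt here by hand (without the '...' column) so the snippet runs on its own, and the names score_ok, subject_ok and invalid_rows are only illustrative.

```python
import pandas as pd

# Small frame mirroring the question's sample data (the '...' column is omitted).
df = pd.DataFrame(
    {
        "subject": ["Math", "Science", "English", "Comp", "Math", "Science"],
        "score": [90, 85, 105, 95, 80, 70],
    },
    index=[1, 2, 3, 4, 5, 6],
)

valid_subjects = ["Math", "Science", "English"]

# True where the score lies in [0, 100]; between() includes both ends by default.
score_ok = df["score"].between(0, 100)
# True where the subject is one of the allowed values.
subject_ok = df["subject"].isin(valid_subjects)

# Rows that violate at least one of the two rules.
invalid_rows = df[~(score_ok & subject_ok)]
print(invalid_rows)
```

Looking at invalid_rows first makes it easier to decide, per rule, whether dropping or correcting is the more sensible choice for your data.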

You can test the lines above by first running the following lines to create a test data frame:

import io
import pandas as pd

raw=\
"""  subject score ...
1 Math    90    ...
2 Science 85    ...
3 English 105   ...
4 Comp    95    ...
5 Math    80    ...
6 Science 70    ..."""

df = pd.read_csv(io.StringIO(raw), sep=r'\s+')
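
If you would rather not modify the data frame in place, the two steps from the answer (dropping invalid subjects, then clipping the scores) could be wrapped in a small function that returns a cleaned copy. This is only a sketch: clean_scores and its low/high parameters are hypothetical names introduced here, and the sample frame is built by hand without the '...' column.

```python
import pandas as pd


def clean_scores(df, valid_subjects, low=0, high=100):
    """Hypothetical helper: return a cleaned copy and leave the input unchanged."""
    # Keep only the rows whose subject is allowed (boolean indexing instead of drop).
    cleaned = df[df["subject"].isin(valid_subjects)].copy()
    # Limit the remaining scores to the range [low, high].
    cleaned["score"] = cleaned["score"].clip(low, high)
    return cleaned


df = pd.DataFrame(
    {
        "subject": ["Math", "Science", "English", "Comp", "Math", "Science"],
        "score": [90, 85, 105, 95, 80, 70],
    },
    index=[1, 2, 3, 4, 5, 6],
)

result = clean_scores(df, ["Math", "Science", "English"])
print(result)
```

Returning a copy keeps the original rows available, which is useful if you later want to compare the raw and cleaned data.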