Data mining - How to handle out-of-range values? - 吾爱随笔录

How to handle out-of-range values?

Data mining, Python, pandas, data cleaning

2022-02-17 17:30:00

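I have a dataset in which the score column must be between 0 and 100 and the subject column must be one of ['Math','Science','English']. However, some rows in my dataset contain other values.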

How should I handle these rows?

  subject score ...
1 Math    90    ...
2 Science 85    ...
3 English 105   ...
4 Comp    95    ...
5 Math    80    ...
6 Science 70    ...

1 Answer

If you need to clean the data, you can either drop the rows that contain invalid values or try to correct them.

Here are two examples:

# If you want to correct the scores so that values below 0
# become 0 and values above 100 become 100, you can
# clip the column to that range:

df['score'] = df['score'].clip(0, 100)

# Alternatively (for example, if you need more complicated
# conditions), you can use where. where keeps each value for
# which the condition is True and replaces the others:

df['score'] = df['score'].where(df['score'] <= 100, 100)
df['score'] = df['score'].where(df['score'] >= 0, 0)

# If you want to drop the rows that contain undefined
# subjects, you can do that as follows:

valid_subjects = ['Math', 'Science', 'English']
# boolean mask that is True for every row with an invalid subject
invalid_subj_indexer = ~df['subject'].isin(valid_subjects)
# drop those rows by their index labels
df.drop(df.index[invalid_subj_indexer], inplace=True)

The result looks like this:

   subject  score  ...
1     Math     90  ...
2  Science     85  ...
3  English    100  ...
5     Math     80  ...
6  Science     70  ...

To try the lines above, first run the following to create a test DataFrame:

import io
import pandas as pd

raw=\
"""  subject score ...
1 Math    90    ...
2 Science 85    ...
3 English 105   ...
4 Comp    95    ...
5 Math    80    ...
6 Science 70    ..."""

# raw string avoids an invalid-escape warning for the regex separator
df = pd.read_csv(io.StringIO(raw), sep=r'\s+')
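
If you would rather inspect the invalid rows before deciding whether to correct or drop them, one possible variation is to build a boolean mask per rule and combine them. This is only a sketch: the names score_ok, subject_ok, invalid_rows and clean are made up for illustration, and the sample frame rebuilds the question's table without the "..." column. Unlike the clip example above, this version drops the row with score 105 instead of correcting it.

```python
import pandas as pd

# Sample data mirroring the table in the question (the "..." column is omitted)
df = pd.DataFrame(
    {
        "subject": ["Math", "Science", "English", "Comp", "Math", "Science"],
        "score": [90, 85, 105, 95, 80, 70],
    },
    index=[1, 2, 3, 4, 5, 6],
)

valid_subjects = ["Math", "Science", "English"]

# One boolean mask per rule; True means the row satisfies the rule
score_ok = df["score"].between(0, 100)  # inclusive on both ends by default
subject_ok = df["subject"].isin(valid_subjects)

# Rows that violate at least one rule, kept aside for inspection
invalid_rows = df[~(score_ok & subject_ok)]

# Copy containing only the rows that satisfy both rules
clean = df[score_ok & subject_ok].copy()

print(invalid_rows)
print(clean)
```

Keeping the masks separate makes it easy to report which rule a row breaks, or to correct one column while dropping on the other.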
