Last edited: 2024-11-28
Tip 5: How do you group infrequent category values into "Others"?
This task comes up often in data cleaning and feature engineering. Consider the following DataFrame:
import pandas as pd

d = {"name": ['Jone', 'Alica', 'Emily', 'Robert', 'Tomas',
              'Zhang', 'Liu', 'Wang', 'Jack', 'Wsx', 'Guo'],
     "categories": ["A", "C", "A", "D", "A",
                    "B", "B", "C", "A", "E", "F"]}
df = pd.DataFrame(d)
df
Output:
      name categories
0     Jone          A
1    Alica          C
2    Emily          A
3   Robert          D
4    Tomas          A
5    Zhang          B
6      Liu          B
7     Wang          C
8     Jack          A
9      Wsx          E
10     Guo          F
D, E, and F each appear only once in the categories column, while A appears four times.
Step 1: Count the frequencies and normalize them
frequencies = df["categories"].value_counts(normalize = True)
frequencies
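Output (in pandas 2.0 and later the Series name is shown as proportion, and the order of tied values may differ):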
A 0.363636
B 0.181818
C 0.181818
F 0.090909
E 0.090909
D 0.090909
Name: categories, dtype: float64
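As a quick check on what normalize=True does, the sketch below recomputes the proportions by hand as count divided by number of rows. It rebuilds the categories column as a standalone Series and assumes the column has no missing values; the variable names `counts`, `manual`, and `normalized` are illustrative only.

```python
import pandas as pd

categories = pd.Series(["A", "C", "A", "D", "A",
                        "B", "B", "C", "A", "E", "F"], name="categories")

# Raw number of occurrences per category.
counts = categories.value_counts()

# Dividing by the number of rows reproduces the normalized frequencies
# (assumption: no missing values in the column).
manual = counts / len(categories)
normalized = categories.value_counts(normalize=True)

print(counts["A"])            # 4
print(round(manual["A"], 6))  # 0.363636
```

So A's frequency of 0.363636 is simply 4/11, and each category that appears once gets 1/11, roughly 0.090909.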
Step 2: Set a threshold and select the infrequent values
threshold = 0.1
small_categories = frequencies[frequencies < threshold].index
small_categories
Output:
Index(['F', 'E', 'D'], dtype='object')
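If an absolute cutoff is easier to reason about than a proportion, the same selection can be done on raw counts. This is a variation on the step above, not the approach used in this tip; `min_count` is a hypothetical parameter chosen for illustration.

```python
import pandas as pd

categories = pd.Series(["A", "C", "A", "D", "A",
                        "B", "B", "C", "A", "E", "F"], name="categories")

# Hypothetical cutoff: values seen fewer than min_count times count as rare.
min_count = 2

counts = categories.value_counts()
rare = counts[counts < min_count].index

print(sorted(rare))  # ['D', 'E', 'F']
```

With 11 rows, a count cutoff of 2 selects the same values as the proportional threshold of 0.1, because 1/11 is below 0.1 and 2/11 is above it.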
Step 3: Replace the values
df["categories"] = df["categories"].replace(small_categories, "Others")
The DataFrame after replacement:
      name categories
0     Jone          A
1    Alica          C
2    Emily          A
3   Robert     Others
4    Tomas          A
5    Zhang          B
6      Liu          B
7     Wang          C
8     Jack          A
9      Wsx     Others
10     Guo     Others
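The three steps can be folded into one reusable function. The sketch below is one possible way to do that. The name `lump_rare_categories` and its parameters are hypothetical, and it uses `Series.isin` with `Series.where` in place of `replace`, which should give the same result on this data.

```python
import pandas as pd


def lump_rare_categories(series, threshold=0.1, other_label="Others"):
    """Return a copy of `series` with infrequent values replaced by `other_label`.

    Hypothetical helper: a value is treated as rare when its share of the
    non-missing rows is below `threshold`.
    """
    frequencies = series.value_counts(normalize=True)
    rare = frequencies[frequencies < threshold].index
    # where() keeps entries for which the condition is True and
    # substitutes other_label everywhere else.
    return series.where(~series.isin(rare), other_label)


df = pd.DataFrame({"categories": ["A", "C", "A", "D", "A",
                                  "B", "B", "C", "A", "E", "F"]})
df["categories"] = lump_rare_categories(df["categories"])
print(df["categories"].tolist())
# ['A', 'C', 'A', 'Others', 'A', 'B', 'B', 'C', 'A', 'Others', 'Others']
```

Wrapping the logic this way makes it easy to apply the same rule to several columns, or to try different thresholds without repeating the three steps.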