
郭震AI


Last edited: 2024-11-28

Tip 5: How do you group infrequent category values into "Others"?

This task comes up regularly in data cleaning and feature engineering. Take the following DataFrame:

import pandas as pd

# Sample data: 11 people, each assigned a category
d = {"name": ['Jone', 'Alica', 'Emily', 'Robert', 'Tomas',
              'Zhang', 'Liu', 'Wang', 'Jack', 'Wsx', 'Guo'],
     "categories": ["A", "C", "A", "D", "A",
                    "B", "B", "C", "A", "E", "F"]}
df = pd.DataFrame(d)
df

Result:

    name    categories
0   Jone    A
1   Alica   C
2   Emily   A
3   Robert  D
4   Tomas   A
5   Zhang   B
6   Liu     B
7   Wang    C
8   Jack    A
9   Wsx     E
10  Guo     F

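D, E, and F each appear only once in the categories column, while A appears most often (4 of the 11 rows).

Step 1: Count the frequencies and normalize them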

# normalize=True returns each value's share of all rows instead of a raw count
frequencies = df["categories"].value_counts(normalize=True)
frequencies

Result:

A    0.363636
B    0.181818
C    0.181818
F    0.090909
E    0.090909
D    0.090909
Name: categories, dtype: float64
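
To see where these proportions come from, the sketch below recomputes them by hand from the raw counts. It rebuilds only the `categories` column from the example; the names `counts` and `manual` are illustrative and not part of the original.

```python
import pandas as pd

# Same categories column as in the example above
categories = pd.Series(["A", "C", "A", "D", "A",
                        "B", "B", "C", "A", "E", "F"], name="categories")

counts = categories.value_counts()       # raw occurrences per value
manual = counts / len(categories)        # divide by the number of rows
normalized = categories.value_counts(normalize=True)

# A occurs 4 times in 11 rows, so its share is 4/11
print(counts["A"], round(manual["A"], 6))
```

Dividing the counts by the row count gives the same shares as `normalize=True`, so the shares add up to 1.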

Step 2: Set a threshold and select the values that fall below it

threshold = 0.1  # values making up less than 10% of rows count as rare
small_categories = frequencies[frequencies < threshold].index
small_categories

Result:

Index(['F', 'E', 'D'], dtype='object')
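
A proportion threshold depends on the size of the dataset. As an alternative, one could filter on raw counts instead. The following is a minimal sketch of that variant, not the method used in this tip; `min_count` and `rare_by_count` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Same categories column as in the example above
categories = pd.Series(["A", "C", "A", "D", "A",
                        "B", "B", "C", "A", "E", "F"])

min_count = 2  # hypothetical cutoff: keep values seen at least twice
counts = categories.value_counts()
rare_by_count = counts[counts < min_count].index

print(sorted(rare_by_count))  # ['D', 'E', 'F']
```

On this data both approaches select the same three values, because each of D, E, and F appears once (1/11 ≈ 0.09, just under the 0.1 threshold).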

Step 3: Replace the values

df["categories"] = df["categories"].replace(small_categories, "Others")

The DataFrame after replacement:

    name    categories
0   Jone    A
1   Alica   C
2   Emily   A
3   Robert  Others
4   Tomas   A
5   Zhang   B
6   Liu     B
7   Wang    C
8   Jack    A
9   Wsx     Others
10  Guo     Others