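Source: Datawhale
Categorical data is data that reflects the type of a thing, obtained by classifying or grouping phenomena according to some attribute; it is also called nominal data. Put simply, a categorical variable takes its values from a limited, fixed set of possible values. Examples include gender and blood type.
Today, let's look at how Pandas handles categorical data. The discussion covers the following topics: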
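Contents
1. Creating a category and its properties
1.1. Creating categorical variables
1.2. Structure of a categorical variable
1.3. Modifying categories
2. Ordering categorical variables
2.1. Establishing an order
2.2. Sorting
3. Comparison operations on categorical variables
3.1. Comparison with a scalar or an equal-length sequence
3.2. Comparison with another categorical variable
4. Questions and exercises
4.1. Questions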
First, read in the data:
import pandas as pd
import numpy as np
df = pd.read_csv('data/table.csv')
df.head()
1. Creating a category and its properties
pd.Series(["a", "b", "c", "a"], dtype="category")
temp_df = pd.DataFrame({'A':pd.Series(["a", "b", "c", "a"], dtype="category"),'B':list('abcd')})
temp_df.dtypes
cat = pd.Categorical(["a", "b", "c", "a"], categories=['a','b','c'])
pd.Series(cat)
pd.cut(np.random.randint(0,60,5), [0,10,30,60])
pd.cut(np.random.randint(0,60,5), [0,10,30,60], right=False, labels=['0-10','10-30','30-60'])
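The two pd.cut calls above differ in which side of each bin is closed. The following is a minimal sketch of that difference, using fixed sample values (an assumption made here so the result is reproducible, in place of the random integers above):

```python
import pandas as pd

values = [0, 10, 29, 59]   # hypothetical sample values
bins = [0, 10, 30, 60]

# Default right=True: the bins are (0, 10], (10, 30], (30, 60],
# so the value 0 lies outside every bin and becomes NaN.
default_cut = pd.cut(values, bins)

# right=False: the bins are [0, 10), [10, 30), [30, 60),
# so 0 lands in the first bin and 10 in the second.
left_closed = pd.cut(values, bins, right=False,
                     labels=['0-10', '10-30', '30-60'])

print(default_cut.isna().tolist())  # [True, False, False, False]
print(list(left_closed))            # ['0-10', '10-30', '10-30', '30-60']
```

Because np.random.randint(0, 60, 5) can return 0, the first call above may produce a missing value, while the right=False version never does for this range.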
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.describe()
s.cat.categories
Index(['a', 'b', 'c', 'd'], dtype='object')
s.cat.ordered
False
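Besides categories and ordered, a categorical Series also stores one integer code per element. As a small sketch built on the same Series as above, the codes can be inspected through the cat.codes accessor:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=['a', 'b', 'c', 'd']))

# Each value is stored as the position of its category;
# a missing value is stored as -1.
codes = s.cat.codes
print(codes.tolist())  # [0, 1, 2, 0, -1]
```

The category 'd' has position 3 but never appears in the codes, because no element of the Series uses it.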
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.set_categories(['new_a','c'])
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.rename_categories(['new_%s'%i for i in s.cat.categories])
s.cat.rename_categories({'a':'new_a','b':'new_b'})
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.add_categories(['e'])
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.remove_categories(['d'])
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.remove_unused_categories()
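To make the effect of the category-modifying methods above easier to compare, here is a short sketch on the same Series. The variable names replaced, renamed, and trimmed are introduced here only for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=['a', 'b', 'c', 'd']))

# set_categories replaces the category set; values whose category
# is no longer present become NaN (only 'c' survives here).
replaced = s.cat.set_categories(['new_a', 'c'])

# rename_categories relabels categories and keeps the values in place.
renamed = s.cat.rename_categories({'a': 'new_a'})

# remove_unused_categories drops categories that no value uses ('d').
trimmed = s.cat.remove_unused_categories()

print(replaced.isna().sum())         # 4
print(renamed.tolist()[:4])          # ['new_a', 'b', 'c', 'new_a']
print(list(trimmed.cat.categories))  # ['a', 'b', 'c']
print(list(s.cat.categories))        # ['a', 'b', 'c', 'd']
```

Each method returns a new Series and leaves s unchanged, which is why the examples above can keep reusing the same definition of s.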
s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.as_ordered()
s
To revert to an unordered variable, simply use as_unordered:
s.cat.as_unordered()
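As a quick check of the round trip between ordered and unordered, the sketch below inspects the ordered flag at each step (the names ordered and back are illustrative):

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype('category')

ordered = s.cat.as_ordered()        # keeps the categories, marks them as ordered
back = ordered.cat.as_unordered()   # drops the ordering again

print(s.cat.ordered, ordered.cat.ordered, back.cat.ordered)  # False True False
print(list(ordered.cat.categories))                          # ['a', 'c', 'd']
```

as_ordered uses the existing category order, which after astype('category') on these strings is the sorted order of the distinct values.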
pd.Series(["a", "d", "c", "a"]).astype('category').cat.set_categories(['a','c','d'],ordered=True)
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s.cat.reorder_categories(['a','c','d'],ordered=True)
#s.cat.reorder_categories(['a','c'],ordered=True) # raises ValueError: a category is missing
#s.cat.reorder_categories(['a','c','d','e'],ordered=True) # raises ValueError: 'e' is not an existing category
s = pd.Series(np.random.choice(['perfect','good','fair','bad','awful'],50)).astype('category')
s.cat.set_categories(['perfect','good','fair','bad','awful'][::-1],ordered=True).head()
s.sort_values(ascending=False).head()
df_sort = pd.DataFrame({'cat':s.values,'value':np.random.randn(50)}).set_index('cat')
df_sort.head()
df_sort.sort_index().head()
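In the quality example above, the chosen category order happens to coincide with alphabetical order, so the effect of the ordering is hard to see. The sketch below uses hypothetical size labels, where the two orders differ, and assigns the result of set_categories because the method returns a new Series:

```python
import pandas as pd

s = pd.Series(['medium', 'small', 'large', 'small']).astype('category')

# Without an explicit order, sorting follows the category order,
# which defaults to the sorted labels: large, medium, small.
lexical = s.sort_values().tolist()

# With an explicit order, sorting follows small < medium < large.
ordered = s.cat.set_categories(['small', 'medium', 'large'], ordered=True)
by_order = ordered.sort_values().tolist()

print(lexical)   # ['large', 'medium', 'small', 'small']
print(by_order)  # ['small', 'small', 'medium', 'large']
```

The same rule applies to sort_index when the categorical values are used as the index.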
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == 'a'
s == list('abcd')
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == s
s != s
s_new = s.cat.set_categories(['a','d','e'])
#s == s_new # raises TypeError: the two variables have different categories
s = pd.Series(["a", "d", "c", "a"]).astype('category')
#s >= s # raises TypeError: the categories are unordered
s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.reorder_categories(['a','c','d'],ordered=True)
s >= s
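The commented-out comparison can be explored safely by catching the exception. This sketch assumes the error type is TypeError, which is what pandas raises for inequality comparisons on unordered categoricals, and then shows an ordered comparison against a scalar:

```python
import pandas as pd

unordered = pd.Series(["a", "d", "c", "a"]).astype('category')

# Inequality comparisons are only defined for ordered categoricals.
try:
    unordered >= unordered
    raised = False
except TypeError:
    raised = True

# Once an order a < c < d is established, comparisons follow that order.
ordered = unordered.cat.reorder_categories(['a', 'c', 'd'], ordered=True)
greater_than_c = (ordered > 'c').tolist()

print(raised)          # True
print(greater_than_c)  # [False, True, False, False]
```

The scalar on the right-hand side has to be one of the categories; comparing against a label that is not a category raises an error as well.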
from pandas.api.types import union_categoricals
a = pd.Categorical(['b','c'])
b = pd.Categorical(['a','b'])
union_categoricals([a,b])
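By default, the resulting categories are ordered as they appear in the data. To sort the categories instead, pass sort_categories=True.
union_categoricals also works for combining two categoricals that have the same categories and ordering information.
union_categoricals may recode the integer codes of the categories when merging categoricals.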
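The points about category order and recoding can be seen on the same a and b as above. This is a minimal sketch; the names as_seen and sorted_union are introduced here for illustration:

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(['b', 'c'])
b = pd.Categorical(['a', 'b'])

# Default: categories are kept in order of appearance.
as_seen = union_categoricals([a, b])

# sort_categories=True: categories are sorted after the union.
sorted_union = union_categoricals([a, b], sort_categories=True)

print(list(as_seen.categories))       # ['b', 'c', 'a']
print(list(sorted_union.categories))  # ['a', 'b', 'c']

# In b alone, 'a' has code 0 and 'b' has code 1;
# in the union the same values receive new codes.
print(b.codes.tolist())               # [0, 1]
print(as_seen.codes.tolist())         # [0, 1, 2, 0]
```

The values of the result are the same in both cases (b, c, a, b); only the category order and therefore the integer codes differ.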
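[Question 3] When using the groupby or value_counts method, how do the results for a categorical variable differ from those for an ordinary variable?
For a categorical variable, groupby and value_counts aggregate over the categories, so a category that never occurs in the data can still appear in the result (for example, with a count of 0 in value_counts).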
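A small sketch of this difference, using hypothetical data in which the category 'c' is declared but never observed. The observed argument of groupby is passed explicitly here, since its default has changed across pandas versions:

```python
import pandas as pd

cat_s = pd.Series(pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']))
obj_s = pd.Series(['a', 'b', 'a'])

# value_counts on a categorical reports every category, including unused ones.
cat_counts = cat_s.value_counts()
obj_counts = obj_s.value_counts()
print(len(cat_counts), len(obj_counts))  # 3 2
print(cat_counts['c'])                   # 0

# groupby on a categorical key: observed=False keeps unused categories,
# observed=True keeps only categories present in the data.
df = pd.DataFrame({'key': cat_s, 'value': [1, 2, 3]})
all_groups = df.groupby('key', observed=False).size()
seen_groups = df.groupby('key', observed=True).size()
print(len(all_groups), len(seen_groups))  # 3 2
```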
cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
s = pd.Series(cat, name="cat")
cat
s.iloc[0:2] = 10
cat
4.2. Exercises
df = pd.read_csv('data/Earthquake.csv')
df_result = df.copy()
df_result['深度'] = pd.cut(df['深度'],[0,5,10,15,20,30,50,np.inf], right=False, labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ','Ⅴ','Ⅵ','Ⅶ'])
df_result = df_result.set_index('深度').sort_index()
df_result.head()
This is very similar to (a): use the cut method to bin both depth ('深度') and intensity ('烈度'), set the index to ['深度', '烈度'], and then sort by the index.
df['烈度'] = pd.cut(df['烈度'],[0,3,4,5,np.inf], right=False, labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ'])
df['深度'] = pd.cut(df['深度'],[0,5,10,15,20,30,50,np.inf], right=False, labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ','Ⅴ','Ⅵ','Ⅶ'])
df_ds = df.set_index(['深度','烈度'])
df_ds.sort_index()
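Since data/Earthquake.csv is not included here, the following sketch reproduces the same steps on a tiny made-up table. The column names depth, intensity, and id and all values are assumptions chosen for illustration; only the bin edges and labels come from the code above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the earthquake table.
df = pd.DataFrame({'id': [0, 1, 2, 3],
                   'depth': [3, 12, 60, 7],
                   'intensity': [2.5, 4.2, 5.0, 3.0]})

# Bin both columns with left-closed intervals and labelled levels.
df['depth'] = pd.cut(df['depth'], [0, 5, 10, 15, 20, 30, 50, np.inf],
                     right=False, labels=['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ'])
df['intensity'] = pd.cut(df['intensity'], [0, 3, 4, 5, np.inf],
                         right=False, labels=['Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ'])

# Use both binned columns as the index and sort by level order.
result = df.set_index(['depth', 'intensity']).sort_index()
print(result)
```

Rows come out ordered by depth level first and intensity level second, following the order of the labels passed to cut.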
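Because the categories are guaranteed to include every value that appears, build a DataFrame that uses the categories of the first argument as the index and the categories of the second argument as the columns, then pair up the observed values and add 1 at the corresponding position.
foo = pd.Categorical(['b','a'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
def my_crosstab(a, b):
    # Rows come from the categories of a, columns from the categories of b
    rows = pd.Index(a.categories, name='row')
    cols = list(b.categories)
    df = pd.DataFrame(np.zeros((len(rows), len(cols)), int), index=rows, columns=cols)
    # Use the arguments a and b (not the globals foo and bar)
    # and tally each observed (row, column) pair
    for loc in zip(a, b):
        df.loc[loc] += 1
    return df
my_crosstab(foo, bar)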
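For comparison, pandas ships a built-in pd.crosstab. According to the pandas documentation, categories without data are dropped by default and dropna=False keeps them; the sketch below relies on that documented behavior and prints the table without asserting its exact layout:

```python
import pandas as pd

foo = pd.Categorical(['b', 'a'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

# dropna=False asks crosstab to keep categories that have no data.
builtin = pd.crosstab(foo, bar, dropna=False)
print(builtin)

# The two observed pairs are each counted once.
print(builtin.loc['b', 'd'], builtin.loc['a', 'e'])
```

This gives a quick way to check a hand-written crosstab against the library result on the same inputs.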