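1. Dropping entries from an axis
Dropping one or more entries from an axis is simple: pass the label, or an array or list of labels, to drop. Because the operation involves some data munging and set logic, drop returns a new object with the specified values removed from the given axis:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [4]: new_obj = obj.drop('c')

In [5]: new_obj
Out[5]:
a    0
b    1
d    3
e    4
dtype: float64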
With DataFrame, index values can be deleted from either axis:

In [6]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   ...:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   ...:                     columns=['one', 'two', 'three', 'four'])

In [7]: data.drop(['Colorado', 'Ohio'])
Out[7]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

[2 rows x 4 columns]

In [8]: data.drop('two', axis=1)
Out[8]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

[4 rows x 3 columns]

In [9]: data.drop(['two', 'four'], axis=1)
Out[9]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

[4 rows x 2 columns]
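The point that drop returns a new object can be checked directly. The following is a minimal sketch that reuses the frame from the session above; the variable names rows_dropped and cols_dropped are only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# drop returns a new object; `data` itself is not modified.
rows_dropped = data.drop(["Colorado", "Ohio"])       # default axis=0: row labels
cols_dropped = data.drop(["two", "four"], axis=1)    # axis=1: column labels

print(rows_dropped.index.tolist())    # ['Utah', 'New York']
print(cols_dropped.columns.tolist())  # ['one', 'three']
print(data.shape)                     # (4, 4)
```

Because the original is left untouched, the result has to be assigned to a name (or back to data) if it is needed later.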
2. Indexing, selection, and filtering
Series indexing (obj[...]) works analogously to NumPy array indexing, except that you can use the Series's index values instead of only integers. Here are some examples:

In [10]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [11]: obj['b']
Out[11]: 1.0

In [12]: obj[1]
Out[12]: 1.0

In [13]: obj[2:4]
Out[13]:
c    2
d    3
dtype: float64

In [14]: obj[['b', 'a', 'd']]
Out[14]:
b    1
a    0
d    3
dtype: float64

In [15]: obj[[1, 3]]
Out[15]:
b    1
d    3
dtype: float64

In [16]: obj[obj < 2]
Out[16]:
a    0
b    1
dtype: float64
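Slicing with labels behaves differently from normal Python slicing in that the endpoint is inclusive:

In [17]: obj['b':'c']
Out[17]:
b    1
c    2
dtype: float64

Setting values through such a slice is just as simple:

In [18]: obj['b':'c'] = 5

In [19]: obj
Out[19]:
a    0
b    5
c    5
d    3
dtype: float64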
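The difference between an inclusive label slice and an exclusive positional slice is easy to miss, so here is a small sketch that puts the two side by side. It uses the explicit loc and iloc accessors rather than bare brackets; as far as I know, recent pandas releases deprecate plain-integer positional access such as obj[1] on a label-indexed Series, which is why iloc is used for positions here.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

# Label-based slice: the endpoint 'c' is included.
by_label = obj.loc["b":"c"].copy()

# Position-based slice: the endpoint 2 is excluded, as in plain Python.
by_position = obj.iloc[1:2].copy()

print(by_label.index.tolist())     # ['b', 'c']
print(by_position.index.tolist())  # ['b']

# Assignment through a label slice also covers both endpoints.
obj.loc["b":"c"] = 5
print(obj.tolist())                # [0.0, 5.0, 5.0, 3.0]
```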
As you have seen, indexing into a DataFrame retrieves one or more columns:

In [20]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   ....:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   ....:                     columns=['one', 'two', 'three', 'four'])

In [21]: data
Out[21]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

[4 rows x 4 columns]

In [22]: data['two']
Out[22]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [23]: data[['three', 'one']]
Out[23]:
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

[4 rows x 2 columns]
Indexing like this has a few special cases. The first is selecting rows by slicing or with a boolean array:

In [24]: data[:2]
Out[24]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

[2 rows x 4 columns]

In [25]: data[data['three'] > 5]
Out[25]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

[3 rows x 4 columns]
Some readers may find this inconsistent, but the syntax arose out of practicality. Another use case is indexing with a boolean DataFrame, such as one produced by a scalar comparison:

In [26]: data < 5
Out[26]:
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

[4 rows x 4 columns]

In [27]: data[data < 5] = 0

In [28]: data
Out[28]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

[4 rows x 4 columns]
Note:
Indexing with a boolean DataFrame in this way is intended to make DataFrame syntactically more like an ndarray.
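For label indexing on the rows of a DataFrame, I introduce the special indexing field ix. It lets you select a subset of the rows and columns of a DataFrame using NumPy-like notation plus axis labels. As mentioned earlier, it is also a less verbose way to do reindexing. (ix was removed in pandas 1.0; in current versions, loc and iloc provide label-based and position-based selection respectively.)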
In [29]: data.ix['Colorado', ['two', 'three']]
Out[29]:
two      5
three    6
Name: Colorado, dtype: int32

In [30]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
Out[30]:
          four  one  two
Colorado     7    0    5
Utah        11    8    9

[2 rows x 3 columns]

In [31]: data.ix[2]
Out[31]:
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [32]: data.ix[:'Utah', 'two']
Out[32]:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [33]: data.ix[data.three > 5, :3]
Out[33]:
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

[3 rows x 3 columns]
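Note:
When designing pandas, I felt that having to type frame[:, col] to select a column was too cumbersome, since column selection is one of the most common operations. So I made the trade-off of putting all of the label-indexing functionality into ix.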
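Since ix is no longer available in pandas 1.0 and later, the selections from the ix session can be written with loc (labels) and iloc (positions). The sketch below is one possible translation, not the only one; where ix mixed labels with integer positions, it converts the positions to labels through data.columns, which is my own choice.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)
data[data < 5] = 0  # same state as in the session above

# data.ix['Colorado', ['two', 'three']]
row_subset = data.loc["Colorado", ["two", "three"]]

# data.ix[['Colorado', 'Utah'], [3, 0, 1]]: positions turned into labels
mixed = data.loc[["Colorado", "Utah"], data.columns[[3, 0, 1]]]

# data.ix[2]
third_row = data.iloc[2]

# data.ix[:'Utah', 'two']: the label slice includes 'Utah'
upto_utah = data.loc[:"Utah", "two"]

# data.ix[data.three > 5, :3]
filtered = data.loc[data["three"] > 5, data.columns[:3]]

print(mixed)
print(filtered)
```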
3. Arithmetic and data alignment
One of the most important pandas features is arithmetic between objects with different indexes. When objects are added together, if any index pairs are not the same, the index of the result is the union of the index pairs.

In [34]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [35]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [36]: s1
Out[36]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [37]: s2
Out[37]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
Adding these together yields:

In [39]: s1 + s2
Out[39]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
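Note:
The automatic data alignment introduces NA values at the labels that do not overlap. Missing values then propagate through arithmetic computations.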
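To make the alignment rule concrete, the following sketch rebuilds the two Series and inspects which labels end up as NA. The variable name total is only illustrative.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

total = s1 + s2

# The result index is the union of both indexes.
print(total.index.tolist())  # ['a', 'c', 'd', 'e', 'f', 'g']

# Labels present in only one operand come out as NaN.
print(total[total.isna()].index.tolist())  # ['d', 'f', 'g']

# NaN keeps propagating through further arithmetic.
print((total * 2).isna().sum())  # 3
```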
For DataFrame, alignment is performed on both the rows and the columns:

In [40]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
   ....:                    index=['Ohio', 'Texas', 'Colorado'])

In [41]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   ....:                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [42]: df1
Out[42]:
          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8

[3 rows x 3 columns]

In [43]: df2
Out[43]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

[4 rows x 3 columns]
Adding these together returns a new DataFrame whose index and columns are the unions of those of the two original DataFrames:

In [44]: df1 + df2
Out[44]:
            b   c   d   e
Colorado  NaN NaN NaN NaN
Ohio        3 NaN   6 NaN
Oregon    NaN NaN NaN NaN
Texas       9 NaN  12 NaN
Utah      NaN NaN NaN NaN

[5 rows x 4 columns]
4. Arithmetic methods with fill values
In arithmetic between differently indexed objects, you may want to fill in a special value, such as 0, when an axis label is found in one object but not the other:

In [45]: df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

In [46]: df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [47]: df1
Out[47]:
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

[3 rows x 4 columns]

In [48]: df2
Out[48]:
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

[4 rows x 5 columns]
Adding these together produces NA values in the locations that do not overlap:

In [49]: df1 + df2
Out[49]:
    a   b   c   d   e
0   0   2   4   6 NaN
1   9  11  13  15 NaN
2  18  20  22  24 NaN
3 NaN NaN NaN NaN NaN

[4 rows x 5 columns]

Using the add method on df1, pass df2 and a fill_value argument:

In [50]: df1.add(df2, fill_value=0)
Out[50]:
    a   b   c   d   e
0   0   2   4   6   4
1   9  11  13  15   9
2  18  20  22  24  14
3  15  16  17  18  19

[4 rows x 5 columns]
Relatedly, when reindexing a Series or DataFrame, you can also specify a fill value:

In [51]: df1.reindex(columns=df2.columns, fill_value=0)
Out[51]:
   a  b   c   d  e
0  0  1   2   3  0
1  4  5   6   7  0
2  8  9  10  11  0

[3 rows x 5 columns]
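The same fill_value argument is accepted by the other arithmetic methods (sub, mul, div), not only add. The sketch below compares the plain operator with add and sub on the frames from this section; the names plain, filled, and diff are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))

plain = df1 + df2                    # missing cells become NaN
filled = df1.add(df2, fill_value=0)  # a missing operand is treated as 0
diff = df1.sub(df2, fill_value=0)    # the same idea for subtraction

print(int(plain.isna().sum().sum()))   # 8
print(int(filled.isna().sum().sum()))  # 0
print(filled.loc[3, "a"])              # 15.0
print(diff.loc[3, "a"])                # -15.0
```

Note that fill_value only substitutes for a value missing from one side; a cell missing from both operands would still be NA, although that case does not occur with these two frames.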
5. Operations between DataFrame and Series
As with NumPy arrays, arithmetic between DataFrame and Series is well defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows:

In [52]: arr = np.arange(12.).reshape((3, 4))

In [53]: arr
Out[53]:
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [54]: arr[0]
Out[54]: array([ 0.,  1.,  2.,  3.])

In [55]: arr - arr[0]
Out[55]:
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])
This is referred to as broadcasting. Operations between a DataFrame and a Series are similar:

In [56]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   ....:                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [57]: series = frame.ix[0]

In [58]: frame
Out[58]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

[4 rows x 3 columns]

In [59]: series
Out[59]:
b    0
d    1
e    2
Name: Utah, dtype: float64
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the DataFrame's columns and broadcasts down the rows:

In [60]: frame - series
Out[60]:
        b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9

[4 rows x 3 columns]
If an index value is not found in either the DataFrame's columns or the Series's index, the objects are reindexed to form the union:

In [61]: series2 = pd.Series(range(3), index=['b', 'e', 'f'])

In [62]: frame + series
Out[62]:
        b   d   e
Utah    0   2   4
Ohio    3   5   7
Texas   6   8  10
Oregon  9  11  13

[4 rows x 3 columns]

In [63]: frame + series2
Out[63]:
        b   d   e   f
Utah    0 NaN   3 NaN
Ohio    3 NaN   6 NaN
Texas   6 NaN   9 NaN
Oregon  9 NaN  12 NaN

[4 rows x 4 columns]
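If you want to match on the rows instead and broadcast over the columns, you have to use one of the arithmetic methods. For example:

In [64]: series3 = frame['d']

In [65]: frame
Out[65]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

[4 rows x 3 columns]

In [66]: series3
Out[66]:
Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: float64

In [67]: frame.sub(series3, axis=0)
Out[67]:
        b  d  e
Utah   -1  0  1
Ohio   -1  0  1
Texas  -1  0  1
Oregon -1  0  1

[4 rows x 3 columns]

The axis number that you pass is the axis to match on. In this case, the goal is to match on the DataFrame's row index and broadcast across the columns.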
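It can help to see what happens when the axis argument is left out. In the sketch below, the plain operator tries to match the row labels of series3 against the column labels of frame, finds nothing in common, and returns only NA values, whereas sub with axis=0 matches on the row index. The names default_result and by_rows are only illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)
series3 = frame["d"]

# Default: Series index is matched against the columns, so nothing lines up.
default_result = frame - series3
print(default_result.shape)               # (4, 7)
print(default_result.isna().all().all())  # True

# axis=0: Series index is matched against the row index.
by_rows = frame.sub(series3, axis=0)
print(by_rows.values.tolist()[0])         # [-1.0, 0.0, 1.0]
```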
6. Function application and mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [68]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
   ....:                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [69]: frame
Out[69]:
               b         d         e
Utah   -1.477719 -1.530953 -0.913435
Ohio    0.285921  0.337583 -0.114854
Texas   0.977180  0.803043  1.179746
Oregon  1.121824  1.111941 -1.532408

[4 rows x 3 columns]

In [70]: np.abs(frame)
Out[70]:
               b         d         e
Utah    1.477719  1.530953  0.913435
Ohio    0.285921  0.337583  0.114854
Texas   0.977180  0.803043  1.179746
Oregon  1.121824  1.111941  1.532408

[4 rows x 3 columns]
Another frequent operation is applying a function to the one-dimensional arrays formed by each column or row. DataFrame's apply method does exactly this:

In [71]: f = lambda x: x.max() - x.min()

In [72]: frame.apply(f)
Out[72]:
b    2.599543
d    2.642894
e    2.712154
dtype: float64

In [73]: frame.apply(f, axis=1)
Out[73]:
Utah      0.617518
Ohio      0.452436
Texas     0.376704
Oregon    2.654232
dtype: float64
Many of the most common array statistics (such as sum and mean) are implemented as DataFrame methods, so using apply for them is not necessary. Besides a scalar value, the function passed to apply can also return a Series with multiple values:

In [75]: def f(x):
   ....:     return pd.Series([x.min(), x.max()], index=['min', 'max'])
   ....:

In [76]: frame.apply(f)
Out[76]:
            b         d         e
min -1.477719 -1.530953 -1.532408
max  1.121824  1.111941  1.179746

[2 rows x 3 columns]
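Because the frame in the session above is random, its numbers cannot be reproduced. The sketch below repeats the same three apply calls on a deterministic frame so that the results can be checked by hand; the helper names spread and min_max are my own.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

def spread(x):
    """Range of a one-dimensional Series: max minus min."""
    return x.max() - x.min()

def min_max(x):
    """Return two statistics at once as a labeled Series."""
    return pd.Series([x.min(), x.max()], index=["min", "max"])

per_column = frame.apply(spread)       # one value per column
per_row = frame.apply(spread, axis=1)  # one value per row
summary = frame.apply(min_max)         # rows 'min' and 'max', one column each

print(per_column.tolist())  # [9.0, 9.0, 9.0]
print(per_row.tolist())     # [2.0, 2.0, 2.0, 2.0]
print(summary)
```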
Element-wise Python functions can be used, too. Suppose you want a formatted string for each floating-point value in frame. You can do this with applymap:

In [77]: format = lambda x: '%.2f' % x

In [78]: frame.applymap(format)
Out[78]:
            b      d      e
Utah    -1.48  -1.53  -0.91
Ohio     0.29   0.34  -0.11
Texas    0.98   0.80   1.18
Oregon   1.12   1.11  -1.53

[4 rows x 3 columns]
The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [79]: frame['e'].map(format)
Out[79]:
Utah      -0.91
Ohio      -0.11
Texas      1.18
Oregon    -1.53
Name: e, dtype: object
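The relationship between the Series-level map and element-wise application over a whole DataFrame can be sketched as follows. This version applies Series.map column by column through apply, which I expect to give the same result as applymap and which avoids depending on that method name (to my knowledge, recent pandas releases rename applymap to DataFrame.map). The helper is called fmt rather than format so that it does not shadow the Python builtin; the two-row frame reuses values from the session above.

```python
import pandas as pd

frame = pd.DataFrame(
    {"b": [-1.477719, 0.285921], "e": [-0.913435, -0.114854]},
    index=["Utah", "Ohio"],
)

def fmt(x):
    """Format one float with two decimal places."""
    return "%.2f" % x

# Element-wise on a single column.
col_formatted = frame["e"].map(fmt)

# Element-wise on every column: apply passes each column (a Series) to the lambda.
all_formatted = frame.apply(lambda col: col.map(fmt))

print(col_formatted.tolist())           # ['-0.91', '-0.11']
print(all_formatted.loc["Utah", "b"])   # -1.48
```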