NumPy Tutorial Part 2: Vital Functions for Data Analysis - Zhuanzhi

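[Introduction] NumPy is the core Python package for data analysis and scientific computing. Part 1 covered the basics, such as how to create an array, extract its elements, reshape it, and generate random numbers. In this part, Zhuanzhi member Fan walks through NumPy's more advanced features, which are essential for data analysis and manipulation.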

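Part 1 of this NumPy tutorial is available on the Zhuanzhi WeChat official account:

NumPy Tutorial Part 1: Introduction to Arrays (a summary of common basic operations)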

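▌1. How do you use np.where to get the index positions that satisfy a given condition?

1. Sometimes we need to know not only which elements of an array satisfy a condition, but also where those elements sit in the array: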

import numpy as np
arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])
print("Array: ", arr_rand)
# Positions where value > 5
index_gt5 = np.where(arr_rand > 5)
print("Positions where value > 5: ", index_gt5)
#> Array:  [8 8 3 7 7 0 4 2 5 2]
#> Positions where value > 5:  (array([0, 1, 3, 4]),)

2. With the indices in hand, the array's take method pulls out the matching elements:

arr_rand.take(index_gt5)
#> array([[8, 8, 7, 7]])
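
Because np.where returns a tuple, take() adds an extra dimension to its result, as the output above shows. A minimal sketch of two alternatives that return a flat array, reusing arr_rand and index_gt5 from above (the names picked_by_index and picked_by_mask are illustrative):

```python
import numpy as np

arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])
index_gt5 = np.where(arr_rand > 5)

# Indexing with the tuple returned by np.where gives a 1D result
picked_by_index = arr_rand[index_gt5]

# A boolean mask selects the same elements without computing positions first
picked_by_mask = arr_rand[arr_rand > 5]

print(picked_by_index)  # [8 8 7 7]
print(picked_by_mask)   # [8 8 7 7]
```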

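3. np.where also accepts two optional arguments, a and b. It then returns a new array that holds a wherever the condition is true and b wherever it is false.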

np.where(arr_rand > 5, 'gt5', 'le5')
#> array(['gt5', 'gt5', 'le5', 'gt5', 'gt5', 'le5', 'le5', 'le5', 'le5', 'le5'],
#>       dtype='<U3')
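
The two replacement arguments do not have to be constants; either one can be an array of the same shape. As a small sketch, the following caps every value above 5 at 5 and leaves the rest untouched (the name capped is illustrative):

```python
import numpy as np

arr_rand = np.array([8, 8, 3, 7, 7, 0, 4, 2, 5, 2])

# Where the condition holds use 5, elsewhere keep the original value
capped = np.where(arr_rand > 5, 5, arr_rand)

print(capped)    # [5 5 3 5 5 0 4 2 5 2]
print(arr_rand)  # the input array is not modified
```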

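4. Find the index of the maximum (argmax) and minimum (argmin) value in the array

print('Position of max value: ', np.argmax(arr_rand))
print('Position of min value: ', np.argmin(arr_rand))
#> Position of max value:  0
#> Position of min value:  5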
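
On a 2D array, argmax without an axis returns a position in the flattened array. The sketch below, which uses a made-up array named grid, shows how np.unravel_index turns that position back into a (row, column) pair:

```python
import numpy as np

# Illustrative 2D array, not taken from the tutorial data
grid = np.array([[3, 9, 1],
                 [4, 2, 7]])

flat_pos = np.argmax(grid)                         # position in the flattened array
row, col = np.unravel_index(flat_pos, grid.shape)  # convert back to (row, column)

print(flat_pos)                 # 1
print(row, col)                 # 0 1
print(np.argmax(grid, axis=1))  # per-row positions: [1 2]
```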

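▌2. How do you import and export .csv data?

1. np.genfromtxt can import data from web URLs, and it can cope with missing values, multiple delimiters, and rows with an irregular number of columns.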

# Turn off scientific notation
np.set_printoptions(suppress=True)  
# Import data from csv file url
path = 'https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv'
data = np.genfromtxt(path, delimiter=',', skip_header=1, filling_values=-999, dtype='float')
data[:3]  # see first 3 rows
#> array([[18., 8., 307., 130., 3504., 12., 70., 1., -999.],
#>        [15., 8., 350., 165., 3693., 11.5, 70., 1., -999.],
#>        [18., 8., 318., 150., 3436., 11., 70., 1., -999.]])

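Every value in the last column is -999. An array needs a single data type for all of its elements, and genfromtxt cannot convert the text in the last column to float, so it substitutes the filling_values value instead.

2. So how do you handle a dataset that contains both numeric and text columns?

As mentioned in Part 1, you can set dtype to 'object'. Here you can also set it to None, which lets genfromtxt infer a type for each column.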

# data2 = np.genfromtxt(path, delimiter=',', skip_header=1, dtype='object')
data2 = np.genfromtxt(path, delimiter=',', skip_header=1, dtype=None)
data2[:3]  # see first 3 rows
#> array([( 18., 8,  307., 130, 3504,  12. , 70, 1, b'"chevrolet chevelle malibu"'),
#>        ( 15., 8,  350., 165, 3693,  11.5, 70, 1, b'"buick skylark 320"'),
#>        ( 18., 8,  318., 150, 3436,  11. , 70, 1, b'"plymouth satellite"')],
#>       dtype=[('f0', '<f8'), ('f1', '<i8'), ('f2', '<f8'),
#>              ('f3', '<i8'), ('f4', '<i8'),
#>              ('f5', '<f8'), ('f6', '<i8'), ('f7', '<i8'), ('f8', 'S38')])
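
With dtype=None the result is a one-dimensional structured array, so columns are selected by field name rather than by position. A small offline sketch, assuming a made-up three-column CSV held in a string (the column names and values are illustrative and are not taken from Auto.csv; names=True and encoding='utf-8' are choices made for this example):

```python
import io
import numpy as np

# Hypothetical miniature dataset standing in for a real file
csv_text = "mpg,cylinders,name\n18.0,8,chevrolet\n15.0,8,buick\n"

# names=True reads the field names from the first line;
# encoding='utf-8' makes text columns come back as str instead of bytes
cars = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                     names=True, dtype=None, encoding='utf-8')

print(cars.dtype.names)    # ('mpg', 'cylinders', 'name')
print(cars['mpg'].mean())  # 16.5
print(cars['name'][0])     # chevrolet
```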

3. How do you export an array to a .csv file?

np.savetxt("out.csv", data, delimiter=",")
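
np.savetxt writes every value in scientific notation unless told otherwise. The sketch below, which uses a made-up array and a temporary directory so that it does not overwrite anything, shows the fmt and header arguments and then reads the file back to check the round trip:

```python
import os
import tempfile
import numpy as np

# Illustrative data and a throwaway output location
table = np.array([[1.5, 2.0], [3.25, 4.0]])
out_path = os.path.join(tempfile.mkdtemp(), "out.csv")

# fmt controls the number format; comments='' keeps the header free of a leading '#'
np.savetxt(out_path, table, delimiter=",", fmt="%.2f", header="a,b", comments="")

with open(out_path) as f:
    first_line = f.readline().strip()
print(first_line)  # a,b

# Read the file back, skipping the header row
restored = np.loadtxt(out_path, delimiter=",", skiprows=1)
print(restored)
```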

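▌3. How do you save and load NumPy objects?

At some point we will want to save a large, already-transformed NumPy array to disk and load it straight back into the console without rerunning the data transformation code. NumPy provides the .npy and .npz file types for this.

1. To store a single ndarray object, save it as .npy with np.save, then load it with np.load.

2. To store more than one ndarray object in a single file, save them as .npz with np.savez.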

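# arr2d, arr2d_f and arr2d_b were created in Part 1; the definitions below are assumed so that this snippet runs on its own
arr2d = np.arange(9).reshape(3, 3)
arr2d_f = arr2d.astype('float')
arr2d_b = arr2d.astype('bool')
# Save a single numpy array object as a .npy file
np.save('myarray.npy', arr2d)
# Save multiple numpy arrays as a .npz file
np.savez('array.npz', arr2d_f, arr2d_b)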

3. Load the data back

# Load a .npy file
a = np.load('myarray.npy')
print(a)
#> [[0 1 2]
#>  [3 4 5]
#>  [6 7 8]]
# Load a .npz file
b = np.load('array.npz')
print(b.files)
b['arr_0']
#> ['arr_0', 'arr_1']
#> array([[ 0.,  1.,  2.],
#>        [ 3.,  4.,  5.],
#>        [ 6.,  7.,  8.]])
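
The automatic names arr_0 and arr_1 are easy to mix up. np.savez also accepts keyword arguments, which become the keys inside the archive. A minimal sketch, with illustrative array names and a temporary file location:

```python
import os
import tempfile
import numpy as np

ints = np.arange(9).reshape(3, 3)
floats = ints.astype(float)
path = os.path.join(tempfile.mkdtemp(), "named.npz")

# Keyword arguments become the keys inside the .npz archive
np.savez(path, ints=ints, floats=floats)

# Using the archive as a context manager closes the file afterwards
with np.load(path) as archive:
    keys = sorted(archive.files)
    restored = archive['floats']

print(keys)            # ['floats', 'ints']
print(restored.dtype)  # float64
```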

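▌4. How do you concatenate two NumPy arrays column-wise and row-wise?

Method 1: np.concatenate
Method 2: np.vstack and np.hstack
Method 3: np.r_ and np.c_

For the 2D arrays used here, all three methods give the same output. One difference is that np.r_ and np.c_ take the arrays to stack in square brackets rather than parentheses. But first, let me create the arrays to concatenate.

1. Let's stack the arrays vertically: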

a = np.zeros([4, 4])
b = np.ones([4, 4])
# Vertical Stack Equivalents (Row wise)
np.concatenate([a, b], axis=0)

np.vstack([a,b])  
np.r_[a,b]  
#> array([[ 0.,  0.,  0.,  0.],
#>        [ 0.,  0.,  0.,  0.],
#>        [ 0.,  0.,  0.,  0.],
#>        [ 0.,  0.,  0.,  0.],
#>        [ 1.,  1.,  1.,  1.],
#>        [ 1.,  1.,  1.,  1.],
#>        [ 1.,  1.,  1.,  1.],
#>        [ 1.,  1.,  1.,  1.]])

2. Let's stack the arrays horizontally:

# Horizontal Stack Equivalents (Column wise)
np.concatenate([a, b], axis=1)
np.hstack([a,b])  
np.c_[a,b]
#> array([[ 0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.],
#>        [ 0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.],
#>        [ 0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.],
#>        [ 0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.]])
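
The equivalence shown above holds for 2D inputs. With 1D inputs the helpers behave differently, which the following sketch illustrates with two made-up vectors u and v:

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

# hstack and r_ join 1D arrays end to end
print(np.hstack([u, v]).shape)        # (6,)
print(np.r_[u, v].shape)              # (6,)

# vstack treats each 1D array as a row
print(np.vstack([u, v]).shape)        # (2, 3)

# c_ and column_stack treat each 1D array as a column
print(np.c_[u, v].shape)              # (3, 2)
print(np.column_stack([u, v]).shape)  # (3, 2)
```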

3. Use np.r_ to build a 1D array from a more complex sequence of values

np.r_[[1,2,3], 0, 0, [4,5,6]]
#> array([1, 2, 3, 0, 0, 4, 5, 6])
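
np.r_ also understands slice notation inside the brackets, which saves a call to np.arange or np.linspace. A short sketch with illustrative values:

```python
import numpy as np

# start:stop slices expand like np.arange and can be mixed with scalars
seq = np.r_[0:5, 10, 20:23]
print(seq)  # [ 0  1  2  3  4 10 20 21 22]

# An imaginary step means "this many points, endpoint included", like np.linspace
grid = np.r_[0:1:5j]
print(grid)  # [0.   0.25 0.5  0.75 1.  ]
```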

▌5. How do you sort a NumPy array based on one or more columns?

arr = np.random.randint(1,6, size=[8, 4])
arr
#> array([[3, 3, 2, 1],
#>        [1, 5, 4, 5],
#>        [3, 1, 4, 2],
#>        [3, 4, 5, 5],
#>        [2, 4, 5, 5],
#>        [4, 4, 4, 2],
#>        [2, 4, 1, 3],
#>        [2, 2, 4, 3]])

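Sort along axis 0. Note that this sorts each column independently of the others, so values that originally shared a row no longer stay together: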

np.sort(arr, axis=0)
#> array([[1, 1, 1, 1],
#>        [2, 2, 2, 2],
#>        [2, 3, 4, 2],
#>        [2, 4, 4, 3],
#>        [3, 4, 4, 3],
#>        [3, 4, 4, 5],
#>        [3, 4, 5, 5],
#>        [4, 5, 5, 5]])

1. How do you use argsort to sort a 1D NumPy array?

np.argsort returns the index positions that would sort a 1D array.

# Get the index positions that would sort the array
x = np.array([1, 10, 5, 2, 8, 9])
sort_index = np.argsort(x)
print(sort_index)
#> [0 3 2 4 5 1]

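Here the 3 means that the second-smallest value sits at index 3 of the original array, that is, in the fourth position (indices start at 0).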

x[sort_index]
#> array([ 1,  2,  5,  8,  9, 10])
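
The same index array can be reversed to sort in descending order or sliced to pick the largest few values. A minimal sketch that reuses x from above (the names desc_index and top3 are illustrative):

```python
import numpy as np

x = np.array([1, 10, 5, 2, 8, 9])

# Reverse the ascending order to get a descending one
desc_index = np.argsort(x)[::-1]
print(desc_index)     # [1 5 4 2 3 0]
print(x[desc_index])  # [10  9  8  5  2  1]

# Positions of the three largest values
top3 = desc_index[:3]
print(x[top3])        # [10  9  8]
```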

2. Sort by the first column while keeping every row intact

arr = np.random.randint(1,6, size=[8, 4])
arr
#> array([[3, 3, 2, 1],
#>        [1, 5, 4, 5],
#>        [3, 1, 4, 2],
#>        [3, 4, 5, 5],
#>        [2, 4, 5, 5],
#>        [4, 4, 4, 2],
#>        [2, 4, 1, 3],
#>        [2, 2, 4, 3]])
sorted_index_1stcol = arr[:, 0].argsort()

arr[sorted_index_1stcol]
#> array([[1, 5, 4, 5],
#>        [2, 4, 5, 5],
#>        [2, 4, 1, 3],
#>        [2, 2, 4, 3],
#>        [3, 3, 2, 1],
#>        [3, 1, 4, 2],
#>        [3, 4, 5, 5],
#>        [4, 4, 4, 2]])
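
To sort on more than one column, one option is np.lexsort, which is not used in the example above. The sketch below sorts the same array by the first column and breaks ties with the second column:

```python
import numpy as np

arr = np.array([[3, 3, 2, 1],
                [1, 5, 4, 5],
                [3, 1, 4, 2],
                [3, 4, 5, 5],
                [2, 4, 5, 5],
                [4, 4, 4, 2],
                [2, 4, 1, 3],
                [2, 2, 4, 3]])

# np.lexsort takes keys from least to most significant:
# the LAST key (column 0) is the primary one, column 1 breaks ties
order = np.lexsort((arr[:, 1], arr[:, 0]))

print(order)  # [1 7 4 6 2 0 3 5]
print(arr[order])
```

Rows that tie on both columns keep their original relative order.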

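▌6. Working with dates

1. NumPy implements dates through the np.datetime64 object, which supports precision down to nanoseconds. You can create one from a date string in the standard YYYY-MM-DD format, optionally followed by a time.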

date64 = np.datetime64('2018-02-04 23:10:10')
date64
#> numpy.datetime64('2018-02-04T23:10:10')
# Keep only the date part
dt64 = np.datetime64(date64, 'D')
dt64
#> numpy.datetime64('2018-02-04')

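2. Adding to a date

Adding a plain integer increases the date by that many of its own units, which for dt64 is days. To add any other unit, use np.timedelta64.

print('Add 10 days: ', dt64 + 10)
tenminutes = np.timedelta64(10, 'm')  # 10 minutes
print('Add 10 minutes: ', dt64 + tenminutes)
#> Add 10 days:  2018-02-14
#> Add 10 minutes:  2018-02-04T00:10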
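
Subtracting two dates works the other way round and yields a np.timedelta64. A brief sketch with illustrative dates, showing how to turn the difference into a plain number or another unit:

```python
import numpy as np

start = np.datetime64('2018-02-04')
end = np.datetime64('2018-02-14')

gap = end - start
print(gap)                            # 10 days

# Dividing by a one-unit timedelta gives a plain float
gap_in_days = gap / np.timedelta64(1, 'D')
print(gap_in_days)                    # 10.0

# astype converts the difference to another unit
gap_in_hours = gap.astype('timedelta64[h]')
print(gap_in_hours)                   # 240 hours
```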

3. Convert dt64 back to a string

np.datetime_as_string(dt64)
#> '2018-02-04'

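4. Use np.is_busday() to check whether a given day is a business day

print('Date: ', dt64)
print("Is it a business day?: ", np.is_busday(dt64))
# Add 2 business days; a non-business start date is first rolled forward to the next business day
print("Add 2 business days, rolling forward to nearest biz day: ",
      np.busday_offset(dt64, 2, roll='forward'))
# Add 2 business days; a non-business start date is first rolled backward to the previous business day
print("Add 2 business days, rolling backward to nearest biz day: ",
      np.busday_offset(dt64, 2, roll='backward'))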
#> Date:  2018-02-04
#> Is it a business day?:  False
#> Add 2 business days, rolling forward to nearest biz day:  2018-02-07
#> Add 2 business days, rolling backward to nearest biz day:  2018-02-06
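
A related helper is np.busday_count, which counts the business days in a half-open date range (the end date is excluded). The sketch below also passes a holidays list; the holiday on 2018-02-05 is hypothetical and only there to show the effect:

```python
import numpy as np

begin = np.datetime64('2018-02-01')
end = np.datetime64('2018-02-10')

# Business days from Feb 1 up to, but not including, Feb 10
n_days = np.busday_count(begin, end)
print(n_days)  # 7

# Treat Monday 2018-02-05 as a (hypothetical) holiday
n_days_holiday = np.busday_count(begin, end, holidays=['2018-02-05'])
print(n_days_holiday)  # 6
```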

▌7. How do you create a sequence of dates?

dates = np.arange(np.datetime64('2018-02-01'),
                  np.datetime64('2018-02-10'))
print(dates)
# Check whether each date is a business day
np.is_busday(dates)
#> ['2018-02-01' '2018-02-02' '2018-02-03' '2018-02-04' '2018-02-05'
#>  '2018-02-06' '2018-02-07' '2018-02-08' '2018-02-09']
#> array([ True,  True, False, False,  True,  True,  True,  True,  True],
#>       dtype=bool)
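
The boolean result of np.is_busday can be used directly as a mask, and np.arange accepts a step or a coarser unit. A small sketch along those lines (the names workdays, every_other and months are illustrative):

```python
import numpy as np

dates = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-02-10'))

# Keep only the business days
workdays = dates[np.is_busday(dates)]
print(workdays)

# Every second day, using an explicit timedelta step
every_other = np.arange(np.datetime64('2018-02-01'),
                        np.datetime64('2018-02-10'),
                        np.timedelta64(2, 'D'))
print(every_other)

# Month-level dates produce a monthly sequence
months = np.arange(np.datetime64('2018-02'), np.datetime64('2018-06'))
print(months)  # ['2018-02' '2018-03' '2018-04' '2018-05']
```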

▌8. How do you convert numpy.datetime64 to a datetime.datetime object?

import datetime
# dt64 has day precision, so tolist() returns a datetime.date
dt = dt64.tolist()
dt
#> datetime.date(2018, 2, 4)
print('Year: ', dt.year)
print('Day of month: ', dt.day)
print('Month of year: ', dt.month)
print('Day of Week: ', dt.weekday())  # Sunday
#> Year:  2018
#> Day of month:  4
#> Month of year:  2
#> Day of Week:  6
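
What the conversion returns depends on the unit of the datetime64 value. The sketch below, using the same timestamp as earlier in this section, shows the three cases you are most likely to meet (the variable names are illustrative):

```python
import datetime
import numpy as np

as_date = np.datetime64('2018-02-04').item()                # unit: days
as_datetime = np.datetime64('2018-02-04T23:10:10').item()   # unit: seconds
as_int = np.datetime64('2018-02-04T23:10:10', 'ns').item()  # unit: nanoseconds

print(type(as_date).__name__)      # date
print(type(as_datetime).__name__)  # datetime
print(type(as_int).__name__)       # int

# Nanosecond values come back as integers, so convert to a coarser unit first
from_ns = np.datetime64('2018-02-04T23:10:10', 'ns').astype('datetime64[us]').item()
print(from_ns)  # 2018-02-04 23:10:10
```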

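▌9. Vectorization: making a scalar function work on vectors

1. The function foo (defined below) accepts a number and squares it if it is odd; otherwise it divides it by 2. It works fine when applied to a scalar (a single number) but fails when applied to an array. After wrapping it with np.vectorize(), it works on arrays as well.

# Define a scalar function
def foo(x):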
    if x % 2 == 1:
        return x**2
    else:
        return x/2
print('x = 10 returns ', foo(10))
print('x = 11 returns ', foo(11))
#> x = 10 returns  5.0
#> x = 11 returns  121
# Vectorize foo(). Make it work on vectors.
foo_v = np.vectorize(foo, otypes=[float])
print('x = [10, 11, 12] returns ', foo_v([10, 11, 12]))
print('x = [[10, 11, 12], [1, 2, 3]] returns ', foo_v([[10, 11, 12], [1, 2, 3]]))
#> x = [10, 11, 12] returns  [   5.  121.    6.]
#> x = [[10, 11, 12], [1, 2, 3]] returns  [[   5.  121.    6.]
#>  [   1.    1.    9.]]
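
np.vectorize is a convenience wrapper that still calls the Python function once per element. When the branches can be written as array expressions, np.where is a possible alternative. A sketch that checks both approaches agree on the example input (the names values and with_where are illustrative):

```python
import numpy as np

def foo(x):
    if x % 2 == 1:
        return x**2
    else:
        return x/2

foo_v = np.vectorize(foo, otypes=[float])
values = np.array([[10, 11, 12], [1, 2, 3]])

# Both branches are computed for the whole array, then np.where picks per element
with_where = np.where(values % 2 == 1, values**2, values / 2)

print(with_where)
print(np.array_equal(with_where, foo_v(values)))  # True
```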

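2. Find the difference between the maximum and minimum of each row and each column

np.random.seed(100)
arr_x = np.random.randint(1,10,size=[4,10])
arr_x
#> array([[9, 9, 4, 8, 8, 1, 5, 3, 6, 3],
#>        [3, 3, 2, 1, 9, 5, 1, 7, 3, 5],
#>        [2, 6, 4, 5, 5, 4, 8, 2, 2, 8],
#>        [8, 1, 3, 4, 3, 6, 9, 2, 1, 8]])

The code above generates a random 4x10 array; the function below is then applied to each of its rows and columns.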

def max_minus_min(x):
    return np.max(x) - np.min(x)
# Apply along the rows
print('Row wise: ', np.apply_along_axis(max_minus_min, 1, arr=arr_x))
# Apply along the columns
print('Column wise: ', np.apply_along_axis(max_minus_min, 0, arr=arr_x))
#> Row wise:  [8 8 6 8]
#> Column wise:  [7 8 2 7 6 5 8 5 5 5]
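
For reductions that NumPy already provides, passing axis directly gives the same numbers without apply_along_axis. The sketch below types the array out literally so that it does not depend on the random seed:

```python
import numpy as np

arr_x = np.array([[9, 9, 4, 8, 8, 1, 5, 3, 6, 3],
                  [3, 3, 2, 1, 9, 5, 1, 7, 3, 5],
                  [2, 6, 4, 5, 5, 4, 8, 2, 2, 8],
                  [8, 1, 3, 4, 3, 6, 9, 2, 1, 8]])

# axis=1 reduces across the columns of each row
row_range = arr_x.max(axis=1) - arr_x.min(axis=1)

# np.ptp ("peak to peak") computes max minus min in one call; axis=0 works per column
col_range = np.ptp(arr_x, axis=0)

print(row_range)  # [8 8 6 8]
print(col_range)  # [7 8 2 7 6 5 8 5 5 5]
```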

3. searchsorted: find the position at which to insert a value so that the array stays sorted

x = np.arange(10)
print('Where should 5 be inserted?: ', np.searchsorted(x, 5))
print('Where should 5 be inserted (right)?: ',
      np.searchsorted(x, 5, side='right'))
#> Where should 5 be inserted?:  5
#> Where should 5 be inserted (right)?:  6
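
The returned position can be passed straight to np.insert. A minimal sketch with a made-up sorted array:

```python
import numpy as np

sorted_vals = np.array([1, 3, 5, 7])
new_val = 4

# Find the slot, then insert there so the result stays sorted
pos = np.searchsorted(sorted_vals, new_val)
updated = np.insert(sorted_vals, pos, new_val)

print(pos)      # 2
print(updated)  # [1 3 4 5 7]

# searchsorted also accepts several values at once
many = np.searchsorted(sorted_vals, [0, 5, 8])
print(many)     # [0 2 4]
```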

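▌10. How do you add a dimension to an array?

Sometimes you may want to turn a 1D array into a 2D array (like a spreadsheet) without adding any extra data.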

# Create a 1D array
x = np.arange(5)
print('Original array: ', x)
# Introduce a new column axis
x_col = x[:, np.newaxis]
print('x_col shape: ', x_col.shape)
print(x_col)
# Introduce a new row axis
x_row = x[np.newaxis, :]
print('x_row shape: ', x_row.shape)
print(x_row)
#> Original array:  [0 1 2 3 4]
#> x_col shape:  (5, 1)
#> [[0]
#>  [1]
#>  [2]
#>  [3]
#>  [4]]
#> x_row shape:  (1, 5)
#> [[0 1 2 3 4]]
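
The same shapes can be produced with reshape or np.expand_dims, and a column combined with a row broadcasts into a full table. A short sketch (the names same_col, same_row and table are illustrative):

```python
import numpy as np

x = np.arange(5)
x_col = x[:, np.newaxis]

# Two other ways to add an axis
same_col = x.reshape(-1, 1)
same_row = np.expand_dims(x, axis=0)
print(same_col.shape, same_row.shape)  # (5, 1) (1, 5)

# A (5, 1) column plus a (1, 5) row broadcasts to a (5, 5) addition table
table = x_col + x[np.newaxis, :]
print(table.shape)  # (5, 5)
print(table[2, 3])  # 5
```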

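▌11. Some other useful functions

1. Digitize

np.digitize returns the index of the bin that each element of x falls into. In the example below the bin edges 0, 3, 6 and 9 define three ranges, 0-3, 3-6 and 6-9, which are numbered 1, 2 and 3. Each range includes its left edge and excludes its right edge, so 0-3 contains 0, 1 and 2. A value at or beyond the last edge, such as 9, gets the next index, 4.

x = np.arange(10)
x
#> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])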
bins = np.array([0, 3, 6, 9])
# Get bin allotments
np.digitize(x, bins)
#> array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
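
One common use of these indices is to attach a label to each value. The sketch below assumes two cut points, 3 and 6, and three made-up labels; with only interior edges, the indices start at 0 and can index the label array directly:

```python
import numpy as np

x = np.arange(10)
edges = np.array([3, 6])                   # assumed cut points
labels = np.array(['low', 'mid', 'high'])  # hypothetical labels

# Values below 3 get 0, values in [3, 6) get 1, values from 6 upward get 2
idx = np.digitize(x, edges)
print(idx)          # [0 0 0 1 1 1 2 2 2 2]
print(labels[idx])
```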

2. Clip

Limits the values to a range: anything below the lower bound or above the upper bound is replaced by that bound.

np.clip(x, 3, 8)
#> array([3, 3, 3, 3, 4, 5, 6, 7, 8, 8])

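3. Histogram and Bincount

Both histogram() and bincount() report how often values occur, but they differ. histogram() counts the elements that fall into each bin you specify, whereas bincount() counts every integer from 0 up to the largest value in the array, including integers that never occur (their count is 0). See the example below: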

x = np.array([1,1,2,2,2,4,4,5,6,6,6]) # doesn't need to be sorted
np.bincount(x) # 0 occurs 0 times, 1 occurs 2 times, 2 occurs thrice.......
#> array([0, 2, 3, 0, 2, 1, 3], dtype=int64)

counts, bins = np.histogram(x, [0, 2, 4, 6, 8])
print('Counts: ', counts)
print('Bins: ', bins)
#> Counts:  [2 3 3 3]
#> Bins:  [0 2 4 6 8]
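
If you only want counts for the values that actually occur, np.unique with return_counts=True is one option; bincount's minlength argument goes the other way and pads the result to a fixed length. A brief sketch on the same x:

```python
import numpy as np

x = np.array([1, 1, 2, 2, 2, 4, 4, 5, 6, 6, 6])

# Only the values present in x, with their counts
values, counts = np.unique(x, return_counts=True)
print(values)  # [1 2 4 5 6]
print(counts)  # [2 3 2 1 3]

# Pad the bincount result so it always covers 0..9
padded = np.bincount(x, minlength=10)
print(padded)  # [0 2 3 0 2 1 3 0 0 0]
```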