NumPy, short for Numerical Python, is the fundamental package required for high-performance scientific computing and data analysis. Here are some of the things it provides:
- ndarray, a fast and space-efficient multidimensional array providing vectorized arithmetic operations and sophisticated broadcasting capabilities
- Standard mathematical functions for fast operations on entire arrays of data without having to write loops
- Linear algebra, random number generation, and Fourier transform capabilities
- Tools for integrating code written in C, C++, and Fortran
The main areas of functionality of this package are:
- Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations
- Common array algorithms like sorting, unique, and set operations
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining together heterogeneous data sets
- Expressing conditional logic as array expressions instead of loops with if-elif-else branches
- Group-wise data manipulations (aggregation, transformation, function application).
import numpy as np
Creating Arrays
Create a list and convert it to a numpy array
mylist = [1, 2, 3]
x = np.array(mylist)
x
array([1, 2, 3])
Or just pass in a list directly
y = np.array([4, 5, 6])
y
array([4, 5, 6])
type(y)
numpy.ndarray
Pass in a list of lists to create a multidimensional array.
m = np.array([[7, 8, 9], [10, 11, 12]])
m
array([[ 7, 8, 9],
[10, 11, 12]])
Use the np.shape function (or the array's .shape attribute) to find the dimensions of the array as (rows, columns).
np.shape(m)
(2, 3)
In ndarrays, all elements must have the same data type; NumPy converts mixed inputs to a common type automatically.
l = [1, 2.5, "Dog", True]  # lists can store different data types
for i in l:
    print(type(i))
a = np.array(l)
print(a)
for i in a:
    print(type(i))
<class 'int'>
<class 'float'>
<class 'str'>
<class 'bool'>
['1' '2.5' 'Dog' 'True']
<class 'numpy.str_'>
<class 'numpy.str_'>
<class 'numpy.str_'>
<class 'numpy.str_'>
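As a small sketch of the conversion described above: when the inputs are only numbers, NumPy picks a numeric common type rather than strings, and passing dtype=object keeps the original Python objects. The variable names here are illustrative and not part of the original notebook.

```python
import numpy as np

# Mixed ints and floats are upcast to a common floating-point dtype.
nums = np.array([1, 2.5, 3])
print(nums.dtype)  # float64

# dtype=object stores references to the original Python objects
# instead of converting them to one common type.
mixed = np.array([1, 2.5, "Dog", True], dtype=object)
print([type(v).__name__ for v in mixed])  # ['int', 'float', 'str', 'bool']
```

Object arrays give up most of the speed benefits of ndarrays, so they are usually a last resort.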
arange returns evenly spaced values within a given interval.
n = np.arange(0, 30, 2) # start at 0 count up by 2, stop before 30
n
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])
reshape returns an array with the same data with a new shape.
n = n.reshape(3, 5) # reshape array to be 3x5
n
array([[ 0, 2, 4, 6, 8],
[10, 12, 14, 16, 18],
[20, 22, 24, 26, 28]])
linspace returns evenly spaced numbers over a specified interval.
o = np.linspace(0, 4, 9) # return 9 evenly spaced values from 0 to 4
o
array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])
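The difference between arange and linspace can be sketched as follows: arange takes a step and excludes the stop value, while linspace takes a count and includes the stop value unless endpoint=False is passed. The variable names are illustrative.

```python
import numpy as np

# arange: fixed step, stop value excluded.
by_step = np.arange(0, 4, 0.5)
# linspace: fixed number of points, stop value included by default.
by_count = np.linspace(0, 4, 9)
# endpoint=False drops the stop value, which matches arange here.
half_open = np.linspace(0, 4, 8, endpoint=False)

print(by_step.size, by_count.size)      # 8 9
print(np.allclose(by_step, half_open))  # True
```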
resize changes the shape and size of an array in place.
o.resize(3, 3)
o
array([[0. , 0.5, 1. ],
[1.5, 2. , 2.5],
[3. , 3.5, 4. ]])
ones returns a new array of given shape and type, filled with ones.
np.ones((3, 2))
array([[1., 1.],
[1., 1.],
[1., 1.]])
zeros returns a new array of given shape and type, filled with zeros.
np.zeros((2, 3))
array([[0., 0., 0.],
[0., 0., 0.]])
eye returns a 2-D array with ones on the diagonal and zeros elsewhere.
np.eye(3)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
diag extracts a diagonal or constructs a diagonal array.
print(y)
np.diag(y)
[4 5 6]
array([[4, 0, 0],
[0, 5, 0],
[0, 0, 6]])
print(o)
np.diag(o)
[[0. 0.5 1. ]
[1.5 2. 2.5]
[3. 3.5 4. ]]
array([0., 2., 4.])
Create an array using a repeating list (or use np.tile)
np.array([1, 2, 3] * 3)
array([1, 2, 3, 1, 2, 3, 1, 2, 3])
a = np.array([1, 2, 3])
np.tile(a, 3)
array([1, 2, 3, 1, 2, 3, 1, 2, 3])
Repeat elements of an array using repeat.
np.repeat([1, 2, 3], 3)
array([1, 1, 1, 2, 2, 2, 3, 3, 3])
Combining Arrays
p = np.ones([2, 3], int)
p
array([[1, 1, 1],
[1, 1, 1]])
Use vstack to stack arrays in sequence vertically (row wise).
np.vstack([p, 2*p])
array([[1, 1, 1],
[1, 1, 1],
[2, 2, 2],
[2, 2, 2]])
Use hstack to stack arrays in sequence horizontally (column wise).
np.hstack([p, 2*p])
array([[1, 1, 1, 2, 2, 2],
[1, 1, 1, 2, 2, 2]])
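For 2-D arrays like p, vstack and hstack behave like np.concatenate along axis 0 and axis 1 respectively. The sketch below checks that equivalence; the names rows and cols are illustrative.

```python
import numpy as np

p = np.ones((2, 3), dtype=int)

# Concatenating along axis 0 adds rows; along axis 1 adds columns.
rows = np.concatenate([p, 2 * p], axis=0)
cols = np.concatenate([p, 2 * p], axis=1)

print(rows.shape)  # (4, 3)
print(cols.shape)  # (2, 6)
print(np.array_equal(rows, np.vstack([p, 2 * p])))  # True
print(np.array_equal(cols, np.hstack([p, 2 * p])))  # True
```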
Operations
Use +, -, *, / and ** to perform element-wise addition, subtraction, multiplication, division, and power.
print(x + y) # elementwise addition [1 2 3] + [4 5 6] = [5 7 9]
print(x - y) # elementwise subtraction [1 2 3] - [4 5 6] = [-3 -3 -3]
[5 7 9]
[-3 -3 -3]
print(x * y) # elementwise multiplication [1 2 3] * [4 5 6] = [4 10 18]
print(x / y) # elementwise division [1 2 3] / [4 5 6] = [0.25 0.4 0.5]
[ 4 10 18]
[0.25 0.4 0.5 ]
print(x**2) # elementwise power [1 2 3] ^2 = [1 4 9]
[1 4 9]
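The same operators also work when the shapes differ but are compatible, which is the broadcasting mentioned in the introduction. A minimal sketch, reusing the values of m and x from earlier cells:

```python
import numpy as np

m = np.array([[7, 8, 9], [10, 11, 12]])  # shape (2, 3)
x = np.array([1, 2, 3])                  # shape (3,)

# A scalar is broadcast to every element.
doubled = m * 2
# x is broadcast across both rows of m.
shifted = m + x

print(doubled)
print(shifted)
# [[ 8 10 12]
#  [11 13 15]]
```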
np.sqrt(x)
array([1. , 1.41421356, 1.73205081])
np.exp(x)
array([ 2.71828183, 7.3890561 , 20.08553692])
np.log(x)
array([0. , 0.69314718, 1.09861229])
np.ceil(np.log(x))
array([0., 1., 2.])
np.floor(np.log(x))
array([0., 0., 1.])
np.abs(x)
array([1, 2, 3])
np.around([-3.23, -0.76, 1.44, 2.65], decimals=0) # round each element to the given number of decimals (halves round to the nearest even value)
array([-3., -1., 1., 3.])
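np.around rounds values that lie exactly halfway between two candidates to the nearest even one, which can be surprising if you expect halves to always round up. A short sketch with illustrative values:

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, 3.5])
rounded = np.around(halves)
print(rounded)  # [0. 2. 2. 4.]

# The decimals argument controls how many decimal places are kept.
two_places = np.around(3.14159, decimals=2)
print(two_places)  # 3.14
```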
Dot Product
print(x, y)
x.dot(y) # dot product 1*4 + 2*5 + 3*6
[1 2 3] [4 5 6]
32
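The @ operator is an alternative spelling of the same product and also covers matrix-vector multiplication. The sketch below assumes a small 2x3 matrix named m purely for illustration.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
m = np.array([[4, 5, 6], [16, 25, 36]])  # illustrative 2x3 matrix

inner = x @ y        # same as x.dot(y): 1*4 + 2*5 + 3*6
mat_vec = m @ x      # one dot product per row of m

print(inner)    # 32
print(mat_vec)  # [ 32 174]
```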
z = np.array([y, y**2])
print(z)
print(np.shape(z))
print(len(z)) # number of rows of array
[[ 4 5 6]
[16 25 36]]
(2, 3)
2
Let’s look at transposing arrays. Transposing permutes the dimensions of the array.
z.T
array([[ 4, 16],
[ 5, 25],
[ 6, 36]])
z.T.shape
(3, 2)
Use .dtype to see the data type of the elements in the array.
z.dtype
dtype('int32')
Use .astype to cast to a specific type.
z = z.astype('f')
z.dtype
dtype('float32')
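Casting in the other direction, from float to integer, truncates toward zero instead of rounding. A minimal sketch with illustrative values; np.rint is one way to round first if that is what you want.

```python
import numpy as np

f = np.array([1.9, -1.9, 2.5])

truncated = f.astype(int)             # drops the fractional part
rounded = np.rint(f).astype(int)      # rounds to nearest (halves to even), then casts

print(truncated)  # [ 1 -1  2]
print(rounded)    # [ 2 -2  2]
print(f)          # astype returns a new array; f is unchanged
```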
Math Functions
a = np.array([-4, -2, 1, 3, 5])
a.sum()
3
a.max()
5
a.min()
-4
a.mean()
0.6
a.std()
3.2619012860600183
a.var()
10.64
np.percentile(a, 3)
-3.76
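On multidimensional arrays, these aggregations accept an axis argument to reduce along rows or columns instead of over the whole array. A short sketch using the values of m from earlier:

```python
import numpy as np

m = np.array([[7, 8, 9], [10, 11, 12]])

total = m.sum()             # over all elements
col_sums = m.sum(axis=0)    # collapse the rows: one value per column
row_sums = m.sum(axis=1)    # collapse the columns: one value per row
row_means = m.mean(axis=1)

print(total)      # 57
print(col_sums)   # [17 19 21]
print(row_sums)   # [24 33]
print(row_means)  # [ 8. 11.]
```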
Covariance
Covariance measures how two random variables vary together: a positive value means they tend to move in the same direction, a negative value in opposite directions. Its magnitude depends on the units of the variables, so changes in scale affect covariance.
aa = np.random.random((3, 3))
aa
array([[0.38109917, 0.50598335, 0.31684724],
[0.19499768, 0.46588364, 0.10965197],
[0.82519221, 0.92736672, 0.90446043]])
np.cov(aa) #covariance matrix
array([[0.00924947, 0.01778154, 0.00199988],
[0.01778154, 0.03459402, 0.0048454 ],
[0.00199988, 0.0048454 , 0.00287463]])
Correlation
Correlation is a statistical measure that indicates how strongly two variables are linearly related. Changes in scale do not affect correlation.
np.corrcoef(aa) #correlation matrix
array([[1. , 0.99405522, 0.38784076],
[0.99405522, 1. , 0.48589001],
[0.38784076, 0.48589001, 1. ]])
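The effect of scale on the two measures can be checked with a small, deterministic example. Note that np.cov and np.corrcoef treat each row (or each input vector) as one variable by default. The data below is made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

cov_xy = np.cov(x, y)[0, 1]
cov_scaled = np.cov(10 * x, y)[0, 1]        # x measured in units ten times smaller
corr_xy = np.corrcoef(x, y)[0, 1]
corr_scaled = np.corrcoef(10 * x, y)[0, 1]

print(cov_xy, cov_scaled)    # the covariance grows by the scale factor of 10
print(corr_xy, corr_scaled)  # the correlation stays the same
```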
argmax and argmin return the index of the maximum and minimum values in the array.
a.argmax()
4
a.argmin()
0
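On a 2-D array, argmax without an axis returns an index into the flattened array. A sketch of how np.unravel_index turns that into a (row, column) position; the matrix is illustrative.

```python
import numpy as np

m = np.array([[1, 9, 3], [4, 5, 6]])

flat_index = m.argmax()                           # index into the flattened array
row, col = np.unravel_index(flat_index, m.shape)  # convert to 2-D coordinates
per_column = m.argmax(axis=0)                     # row index of the maximum in each column

print(flat_index, (int(row), int(col)))  # 1 (0, 1)
print(per_column)                        # [1 0 1]
```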
Indexing / Slicing
s = np.arange(0, 13, 1) ** 2
s
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144],
dtype=int32)
Use bracket notation to get the value at a specific index. Remember that indexing starts at 0.
s[0], s[4], s[-1]
(0, 16, 144)
Use : to indicate a range. array[start:stop]
Leaving start or stop empty will default to the beginning/end of the array.
s[1:5]
array([ 1, 4, 9, 16], dtype=int32)
Use negatives to count from the back.
s[-4:]
array([ 81, 100, 121, 144], dtype=int32)
A second : can be used to indicate step-size. array[start:stop:stepsize]
Here we are starting at the 5th element from the end and counting backwards by 3 until the beginning of the array is reached.
s[-5::-3]
array([64, 25, 4], dtype=int32)
Let’s look at a multidimensional array.
r = np.arange(36)
r.resize((6, 6))
r
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
Use bracket notation to slice: array[row, column]
r[2, 2]
14
And use : to select a range of rows or columns. A stop index past the end of the array (7 below) is clipped to the array's length.
r[3, 3:7]
array([21, 22, 23])
Here we are selecting all the rows up to (and not including) row 2, and all the columns up to (and not including) the last column.
r[:2, :-1]
array([[ 0, 1, 2, 3, 4],
[ 6, 7, 8, 9, 10]])
This is a slice of the last row, and only every other element.
r[-1, 0:-1:2]
array([30, 32, 34])
We can also perform conditional indexing. Here we are selecting values from the array that are greater than 30. (Also see np.where)
r[r > 30]
array([31, 32, 33, 34, 35])
Here we set every value in the array that is greater than 30 to 30.
r[r > 30] = 30
r
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 30, 30, 30, 30, 30]])
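np.where, mentioned above, offers a way to do the same capping without modifying the original array. This is a sketch that rebuilds r from scratch so it does not depend on the in-place assignment; the name capped is illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

# Three-argument form: take 30 where the condition holds, otherwise the value from r.
capped = np.where(r > 30, 30, r)
print(capped[-1])  # [30 30 30 30 30 30]
print(r[-1])       # [30 31 32 33 34 35]  (r itself is unchanged)

# One-argument form: the row and column indices where the condition is True.
rows, cols = np.where(r > 30)
print(rows, cols)  # [5 5 5 5 5] [1 2 3 4 5]
```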
mask1 = (r > 5) & (r < 8) # element-wise check: greater than 5 and smaller than 8 (logical and)
mask2 = (r > 5) | (r < 8) # element-wise check: greater than 5 or smaller than 8 (logical or)
mask3 = ~((r > 5) & (r < 8)) #the opposite of mask1
r[mask1]
array([6, 7])
r[mask2]
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 30, 30, 30,
30, 30])
r[mask3]
array([ 0, 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 30, 30, 30, 30, 30])
Copying Data
Be careful with copying and modifying arrays in NumPy!
r2 is a slice of r
r2 = r[:3,:3]
r2
array([[ 0, 1, 2],
[ 6, 7, 8],
[12, 13, 14]])
Set this slice’s values to zero ([:] selects the entire array)
r2[:] = 0
r2
array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])
r has also been changed!
r
array([[ 0, 0, 0, 3, 4, 5],
[ 0, 0, 0, 9, 10, 11],
[ 0, 0, 0, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 30, 30, 30, 30, 30]])
To avoid this, use r.copy() to create a copy that will not affect the original array.
r_copy = r.copy()
r_copy
array([[ 0, 0, 0, 3, 4, 5],
[ 0, 0, 0, 9, 10, 11],
[ 0, 0, 0, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 30, 30, 30, 30, 30]])
Now when r_copy is modified, r will not be changed.
r_copy[:] = 10
print(r_copy, '\n')
print(r)
[[10 10 10 10 10 10]
[10 10 10 10 10 10]
[10 10 10 10 10 10]
[10 10 10 10 10 10]
[10 10 10 10 10 10]
[10 10 10 10 10 10]]
[[ 0 0 0 3 4 5]
[ 0 0 0 9 10 11]
[ 0 0 0 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]
[30 30 30 30 30 30]]
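Whether an indexing operation returns a view or a copy can be checked with np.shares_memory. The sketch below suggests the usual pattern: basic slices are views, while integer-array and boolean indexing produce copies. Names are illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

view = r[:3, :3]       # basic slicing returns a view
fancy = r[[0, 1, 2]]   # integer-array indexing returns a copy
masked = r[r > 30]     # boolean indexing returns a copy

print(np.shares_memory(r, view))    # True
print(np.shares_memory(r, fancy))   # False
print(np.shares_memory(r, masked))  # False
```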
Iterating Over Arrays
Let’s create a new 4 by 3 array of random numbers 0-9.
test = np.random.randint(0,10,(4,3))
test
array([[2, 9, 7],
[6, 9, 0],
[8, 4, 0],
[5, 3, 4]])
Iterate by row:
for row in test:
    print(row)
[2 9 7]
[6 9 0]
[8 4 0]
[5 3 4]
Iterate by index:
for i in range(len(test)):
    print(test[i])
[2 9 7]
[6 9 0]
[8 4 0]
[5 3 4]
Iterate by row and index:
for i, row in enumerate(test):
    print('row', i, 'is', row)
row 0 is [2 9 7]
row 1 is [6 9 0]
row 2 is [8 4 0]
row 3 is [5 3 4]
Use zip to iterate over multiple iterables.
test2 = test**2
test2
array([[ 4, 81, 49],
[36, 81, 0],
[64, 16, 0],
[25, 9, 16]], dtype=int32)
for i, j in zip(test, test2):
    print(i,'+',j,'=',i+j)
[2 9 7] + [ 4 81 49] = [ 6 90 56]
[6 9 0] + [36 81 0] = [42 90 0]
[8 4 0] + [64 16 0] = [72 20 0]
[5 3 4] + [25 9 16] = [30 12 20]
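Explicit loops like these are useful for printing, but the arithmetic itself rarely needs them. A small sketch showing that the row-by-row sum matches a single vectorized expression, using the same values as test above:

```python
import numpy as np

test = np.array([[2, 9, 7], [6, 9, 0], [8, 4, 0], [5, 3, 4]])
test2 = test**2

looped = np.array([i + j for i, j in zip(test, test2)])  # row-by-row in Python
vectorized = test + test2                                # one array expression

print(np.array_equal(looped, vectorized))  # True
```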
Random Numbers
a = np.random.randint(1,101,10) #creating 10 random integers between 1 (incl.) and 101 (excl.)
a
array([57, 33, 4, 14, 69, 92, 33, 27, 51, 85])
np.random.seed(123) #setting a seed enables reproducibility
a = np.random.randint(1,101,10)
a
array([67, 93, 99, 18, 84, 58, 87, 98, 97, 48])
np.random.normal(5, 2, 10) # creating 10 normally distributed numbers with mean 5 and std 2
array([1.76139987, 2.77207117, 4.10511856, 8.33680322, 4.71325505,
3.7616182 , 3.46113306, 6.15349204, 5.25305184, 2.39702205])
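Newer NumPy versions (1.17 and later) also provide a Generator interface via np.random.default_rng, which the NumPy documentation recommends for new code. The sketch below shows seeded reproducibility with it; note that its values will differ from those produced by np.random.seed(123), since it uses a different underlying generator.

```python
import numpy as np

# Two generators created with the same seed produce the same stream.
rng_a = np.random.default_rng(123)
rng_b = np.random.default_rng(123)

ints_a = rng_a.integers(1, 101, size=10)  # 1 (incl.) to 101 (excl.)
ints_b = rng_b.integers(1, 101, size=10)
print(np.array_equal(ints_a, ints_b))     # True

normals = rng_a.normal(loc=5, scale=2, size=10)  # mean 5, std 2
print(normals.shape)                             # (10,)
```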
b = np.arange(1,101) #creating array b from 1 to 100
b
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78,
79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
92, 93, 94, 95, 96, 97, 98, 99, 100])
np.random.shuffle(b) #randomly shuffle ndarray b
b
array([ 5, 3, 96, 64, 73, 34, 37, 25, 67, 89, 17, 39, 30,
9, 6, 1, 14, 80, 98, 87, 63, 38, 82, 33, 58, 29,
66, 60, 32, 68, 20, 36, 74, 24, 10, 72, 100, 43, 46,
47, 84, 75, 40, 95, 22, 12, 99, 88, 81, 61, 90, 97,
42, 54, 45, 52, 69, 18, 79, 13, 50, 57, 21, 51, 26,
56, 83, 44, 2, 55, 15, 7, 27, 71, 94, 31, 92, 16,
19, 78, 23, 11, 59, 91, 76, 65, 70, 4, 41, 77, 35,
28, 86, 53, 93, 8, 49, 62, 48, 85])
b.sort() # sort ndarray b in place (ascending)
b[::-1] # reversed view of the sorted array, i.e. descending order
array([100, 99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88,
87, 86, 85, 84, 83, 82, 81, 80, 79, 78, 77, 76, 75,
74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 62,
61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49,
48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36,
35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23,
22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10,
9, 8, 7, 6, 5, 4, 3, 2, 1])
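If the original order needs to be preserved, np.sort returns a sorted copy instead of sorting in place, and np.argsort returns the indices that would sort the array. A sketch with a short illustrative array:

```python
import numpy as np

b = np.array([5, 3, 96, 64, 73])

ascending = np.sort(b)         # sorted copy; b is left untouched
descending = ascending[::-1]   # reversed view of the sorted copy
order = np.argsort(b)          # indices that would sort b

print(b)           # [ 5  3 96 64 73]
print(descending)  # [96 73 64  5  3]
print(order)       # [1 0 3 4 2]
```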
np.random.seed(123)
b1 = np.random.choice(b, 100, replace = True) # draw a 100-element sample from ndarray b with replacement (replace = False would sample without replacement)
b1
array([ 67, 93, 99, 18, 84, 58, 87, 98, 97, 48, 74, 33, 47,
97, 26, 84, 79, 37, 97, 81, 69, 50, 56, 68, 3, 85,
40, 67, 85, 48, 62, 49, 8, 100, 93, 53, 98, 86, 95,
28, 35, 98, 77, 41, 4, 70, 65, 76, 35, 59, 11, 23,
78, 19, 16, 28, 31, 53, 71, 27, 81, 7, 15, 76, 55,
72, 2, 44, 59, 56, 26, 51, 85, 57, 50, 13, 19, 82,
2, 52, 45, 49, 57, 92, 50, 87, 4, 68, 12, 22, 90,
99, 4, 12, 4, 95, 7, 10, 88, 15])
b1.sort() #sorting b1
b1
array([ 2, 2, 3, 4, 4, 4, 4, 7, 7, 8, 10, 11, 12,
12, 13, 15, 15, 16, 18, 19, 19, 22, 23, 26, 26, 27,
28, 28, 31, 33, 35, 35, 37, 40, 41, 44, 45, 47, 48,
48, 49, 49, 50, 50, 50, 51, 52, 53, 53, 55, 56, 56,
57, 57, 58, 59, 59, 62, 65, 67, 67, 68, 68, 69, 70,
71, 72, 74, 76, 76, 77, 78, 79, 81, 81, 82, 84, 84,
85, 85, 85, 86, 87, 87, 88, 90, 92, 93, 93, 95, 95,
97, 97, 97, 98, 98, 98, 99, 99, 100])
np.unique(b1) #unique elements of b1
array([ 2, 3, 4, 7, 8, 10, 11, 12, 13, 15, 16, 18, 19,
22, 23, 26, 27, 28, 31, 33, 35, 37, 40, 41, 44, 45,
47, 48, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59, 62,
65, 67, 68, 69, 70, 71, 72, 74, 76, 77, 78, 79, 81,
82, 84, 85, 86, 87, 88, 90, 92, 93, 95, 97, 98, 99,
100])
np.array(list(set(b1))) # same elements, but a set does not guarantee sorted order
array([ 2, 3, 4, 7, 8, 10, 11, 12, 13, 15, 16, 18, 19,
22, 23, 26, 27, 28, 31, 33, 35, 37, 40, 41, 44, 45,
47, 48, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59, 62,
65, 67, 68, 69, 70, 71, 72, 74, 76, 77, 78, 79, 81,
82, 84, 85, 86, 87, 88, 90, 92, 93, 95, 97, 98, 99,
100])
np.unique(b1).size #how many unique elements?
66
np.unique(b1, return_index=True, return_counts=True) # np.unique can also return first-occurrence indices and counts
(array([ 2, 3, 4, 7, 8, 10, 11, 12, 13, 15, 16, 18, 19,
22, 23, 26, 27, 28, 31, 33, 35, 37, 40, 41, 44, 45,
47, 48, 49, 50, 51, 52, 53, 55, 56, 57, 58, 59, 62,
65, 67, 68, 69, 70, 71, 72, 74, 76, 77, 78, 79, 81,
82, 84, 85, 86, 87, 88, 90, 92, 93, 95, 97, 98, 99,
100]),
array([ 0, 2, 3, 7, 9, 10, 11, 12, 14, 15, 17, 18, 19, 21, 22, 23, 25,
26, 28, 29, 30, 32, 33, 34, 35, 36, 37, 38, 40, 42, 45, 46, 47, 49,
50, 52, 54, 55, 57, 58, 59, 61, 63, 64, 65, 66, 67, 68, 70, 71, 72,
73, 75, 76, 78, 81, 82, 84, 85, 86, 87, 89, 91, 94, 97, 99],
dtype=int64),
array([2, 1, 4, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1,
1, 1, 1, 1, 1, 2, 2, 3, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1,
1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 2, 3, 3, 2, 1],
dtype=int64))
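The three arrays returned above are easier to read on a small input. This sketch uses a made-up six-element array and then picks the most frequent value from the counts.

```python
import numpy as np

sample = np.array([4, 2, 4, 7, 2, 4])

values, first_index, counts = np.unique(sample, return_index=True, return_counts=True)
print(values)       # [2 4 7]   sorted unique values
print(first_index)  # [1 0 3]   index of each value's first occurrence in sample
print(counts)       # [2 3 1]   how often each value occurs

most_frequent = values[counts.argmax()]
print(most_frequent)  # 4
```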
Performance
size = 1000000 #number of elements
a = np.arange(size) #ndarray
l = list(range(size)) #list
%timeit a+2 #ndarray: measuring time for element-wise addition
1.47 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [i+2 for i in l] #list: measuring time for element-wise addition
64.7 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit a*2 #multiplication
1.68 ms ± 63.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [i*2 for i in l] #multiplication
66.9 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit a**2 #square
1.67 ms ± 76.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [i**2 for i in l] #square
251 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.sqrt(a) #square root
3.37 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit [i**0.5 for i in l] #square root
198 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
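%timeit is an IPython magic and only works in IPython or Jupyter. In a plain Python script, a rough equivalent can be sketched with the standard library's timeit module. The array size and repetition count below are arbitrary choices to keep the run short, and the timings will vary by machine.

```python
import timeit
import numpy as np

size = 100_000            # smaller than the notebook's 1,000,000 to keep this quick
a = np.arange(size)       # ndarray
l = list(range(size))     # list

# Total time in seconds for 20 executions of each statement.
t_array = timeit.timeit(lambda: a + 2, number=20)
t_list = timeit.timeit(lambda: [i + 2 for i in l], number=20)

print(f"ndarray: {t_array:.4f} s, list: {t_list:.4f} s for 20 runs")
```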
Case Study: NumPy vs. Python Standard Library
%timeit (np.random.randint(1,11,100*10000).reshape(10000,100) == 1).sum(axis = 1).mean()
16.1 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
import random
def simulation(): # using nested loops, if statements and lists
    results = []
    for _ in range(10000):
        l = []
        for _ in range(100):
            if random.randint(1,10) == 1:
                l.append(True)
            else:
                l.append(False)
        results.append(sum(l))
    return (sum(results) / len(results))
%timeit simulation()
757 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
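The NumPy one-liner in this case study can be unpacked into named steps. The sketch below does so with a seeded Generator from np.random.default_rng, which is an assumption made here for reproducibility; the original cell uses np.random.randint and a reshape instead.

```python
import numpy as np

rng = np.random.default_rng(123)  # seeded Generator (assumption; not in the original cell)

# 10,000 experiments, each with 100 draws from the integers 1..10.
draws = rng.integers(1, 11, size=(10000, 100))
hits = draws == 1                        # True wherever a 1 was drawn
hits_per_experiment = hits.sum(axis=1)   # number of ones in each experiment
average_hits = hits_per_experiment.mean()

print(average_hits)  # close to 10, since each draw is a 1 with probability 1/10
```

Each step operates on the whole array at once, which is where the speed-up over the nested Python loops comes from.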