BIO-210: Applied software engineering for life sciences
Python Introduction II - Data Types 2 and Numpy
Data types - Containers II
The container datatypes are extremely useful in Python. Last time you learned about lists and dictionaries. Today, we will have a look at tuples and sets.
Tuples
Tuples, like lists, are ordered collections of values. The main difference is that lists are mutable, while tuples are immutable: once a tuple is created, its elements cannot be reassigned, and elements cannot be added or removed. A tuple of integers, for example, cannot be modified at all. A tuple of lists, however, still allows the contained lists to be modified, because the tuple only stores references to the lists, which remain mutable. In short, tuples are useful containers when neither the content of the collection nor its length is expected to change. Unlike with lists, "adding" or "removing" an element is an expensive operation, as it requires creating an entirely new tuple. Tuples can be created by direct definition or from an iterable object.
x = (1, 3, 5, 6)
print("x = (1, 3, 5, 6) , type of x: ", type(x))
x_2 = x[2]
print("The element of the tuple in position 2 is", x_2)
x = tuple(range(10))
print("Tuple including the 10 digits: ", x)
x[2] = 20
x = (1, 3, 5, 6) , type of x: <class 'tuple'>
The element of the tuple in position 2 is 5
Tuple including the 10 digits: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 10
7 x = tuple(range(10))
8 print("Tuple including the 10 digits: ", x)
---> 10 x[2] = 20
TypeError: 'tuple' object does not support item assignment
The last statement raises a TypeError, as we were trying to change an element of the tuple, which is not allowed.
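The difference between the immutable tuple and the mutable objects it may refer to can be illustrated with a minimal sketch. The variable names t, x and y are only illustrative.

```python
# A tuple stores references: the tuple itself cannot be changed,
# but a mutable object it refers to can.
t = ([1, 2], [3, 4])
t[0].append(99)      # allowed: we modify the list, not the tuple
print(t)             # ([1, 2, 99], [3, 4])

# "Adding" an element builds a new tuple; the original is left untouched.
x = (1, 3, 5)
y = x + (6,)         # the trailing comma makes (6,) a one-element tuple
print(x, y)          # (1, 3, 5) (1, 3, 5, 6)
print(x is y)        # False: y is a different object
```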
Tuples can be conveniently looped through with a for statement. Unlike lists, tuples do not offer any “tuple comprehension” syntax: a comprehension written between parentheses creates a generator, not a tuple. The constructor tuple() is therefore necessary to define a tuple from a generator.
x = (1, 3, 5, 6)
x_squared = tuple(el ** 2 for el in x)
print(x_squared)
(1, 9, 25, 36)
Sets
Sets, similarly to their mathematical counterpart, are unordered collections of unique objects. Like lists, a set can be created by directly defining its elements or by passing an iterable object to the function set().
x = {1, 3, 5, 6, 3}
print("x = {1, 3, 5, 6, 3} , type of x: ", type(x))
print("Resulting set: ", x, " NB: all elements are unique!")
x = set(range(10))
print("Set including the 10 digits: ", x)
x = {1, 3, 5, 6, 3} , type of x: <class 'set'>
Resulting set: {1, 3, 5, 6} NB: all elements are unique!
Set including the 10 digits: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Sets have many useful operations, such as add, to add a single element, or update, to add all the elements in an iterable. With the methods remove and discard you can remove a single element of a set. remove will raise an error if the element is not in the set, while discard will not.
x = {1, 2, 3}
print("Initial set: ", x)
x.add(6)
print("After adding 6: ", x)
x.update(range(5, 8))
print("After adding the numbers between 5 and 7: ", x)
x.remove(3)
print("After removing 3: ", x)
Initial set: {1, 2, 3}
After adding 6: {1, 2, 3, 6}
After adding the numbers between 5 and 7: {1, 2, 3, 5, 6, 7}
After removing 3: {1, 2, 5, 6, 7}
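The difference between remove and discard on a missing element can be checked with a short sketch (the value 10 is an arbitrary element that is not in the set):

```python
x = {1, 2, 3}

x.discard(10)        # 10 is not in the set: nothing happens
print(x)             # {1, 2, 3}

try:
    x.remove(10)     # 10 is not in the set: a KeyError is raised
except KeyError:
    print("remove raised a KeyError")
```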
As no ordering is defined, it is not possible to access a specific element of a set by index. You can instead iterate through the elements of a set with a for loop or with the set comprehension syntax. Note that the definition ordering is not preserved when printing!
x = {1, 3, 5, 6}
x_squared = set()
for el in x:
    x_squared.add(el**2)
print(x_squared)
{1, 36, 9, 25}
x = {1, 3, 5, 6}
x_squared = {el ** 2 for el in x}
print(x_squared)
{1, 36, 9, 25}
Sets are especially useful because they support the operations difference, intersection and union. These behave like the corresponding mathematical operations and return set objects as well:
a = {2, 5, -1}
b = {4, 5, 9}
print("A = ", a)
print("B = ", b)
print("A minus B: ", a.difference(b))
print("A intersection B: ", a.intersection(b))
print("A union B: ", a.union(b))
A = {2, 5, -1}
B = {9, 4, 5}
A minus B: {2, -1}
A intersection B: {5}
A union B: {2, 4, 5, 9, -1}
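As a side note, Python also provides operators that are equivalent to these methods when both operands are sets. The sketch below reuses the sets a and b from the example above and compares the two notations.

```python
a = {2, 5, -1}
b = {4, 5, 9}

# Operator forms of difference, intersection and union
print((a - b) == a.difference(b))      # True
print((a & b) == a.intersection(b))    # True
print((a | b) == a.union(b))           # True

# Membership test, often used together with set operations
print(5 in a)                          # True
print(4 in a)                          # False
```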
Numpy
Numpy is a widely used Python library for scientific computing. Its long list of functionalities and great performance have made it a fundamental tool for virtually any scientist using Python. It is commonly imported with the alias np.
import numpy as np
Numpy arrays
The basic data type of numpy is the multidimensional array. The main way to create one is starting from a (nested) collection (e.g. a list). The array will have as many dimensions as the depth of the list (a list of lists has depth 2, a list of lists of lists 3, etc.).
a = np.array([3, 4, 1])
b = np.array([[1, 2], [4, -1], [3, 3]])
print("a =", a, ", shape of a: ", a.shape)
print("b =\n", b, ", shape of b: ", b.shape)
a = [3 4 1] , shape of a: (3,)
b =
[[ 1 2]
[ 4 -1]
[ 3 3]] , shape of b: (3, 2)
In the previous example, numpy automatically infers the dimensions of the input data and organizes them accordingly (one- and two-dimensional arrays). Other common ways of initializing arrays are with constant or random values. Numpy offers the handy functions zeros and ones, and the module random. For example, random.randn samples the elements of the array from a standard normal distribution.
a = np.ones((3, 4))
print("ones((3, 4)) =")
print(a)
b = np.zeros((2, 5))
print("\nzeros((2, 5)) =")
print(b)
c = np.random.randn(3, 3)
print("\nrandom.randn(3, 3) =")
print(c)
ones((3, 4)) =
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
zeros((2, 5)) =
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
random.randn(3, 3) =
[[ 1.70596578 1.39319054 1.62815468]
[-1.0738511 0.10633754 -0.17790967]
[-0.65043917 1.39673578 -0.7873843 ]]
Other useful array creation functions include arange and linspace. The first one behaves like range, but returns an array. The second generates an array of a given number of equally spaced values between a minimum and a maximum (both included).
a = np.arange(3, 10)
print("arange(3, 10) =", a)
b = np.linspace(3, 4, 11)
print("linspace(3, 4, 11) =", b)
arange(3, 10) = [3 4 5 6 7 8 9]
linspace(3, 4, 11) = [3. 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4. ]
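The two functions differ in how the spacing is specified: arange takes a step, while linspace takes the number of values. A minimal sketch with arbitrary values:

```python
import numpy as np

# arange: start, stop (excluded), step
a = np.arange(0, 10, 2)
print(a)            # [0 2 4 6 8]

# linspace: start, stop (included), number of values
b = np.linspace(0, 1, 5)
print(b)            # [0.   0.25 0.5  0.75 1.  ]
```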
Operations and functions
Multidimensional arrays support all the basic mathematical operations. The default operators (+, -, *, /) perform element-wise additions, subtractions, multiplications and divisions.
a = np.array([1, 3, -2])
b = np.array([4, -1, 2])
s = a + b
d = a - b
p = a * b
q = a / b
print("a =", a, "\tb =", b)
print("a + b = ", s)
print("a - b = ", d)
print("a * b = ", p)
print("a / b = ", q)
a = [ 1 3 -2] b = [ 4 -1 2]
a + b = [5 2 0]
a - b = [-3 4 -4]
a * b = [ 4 -3 -4]
a / b = [ 0.25 -3. -1. ]
Many common analytic functions are also implemented in Numpy, e.g. log, exp, sin, sqrt and many others. They are also applied element-wise to multidimensional arrays.
Exercise 1. Print the sine of 100 equally spaced values in the interval [-5, 5]
# Your code here
x_vals = np.linspace(-5, 5, 100)
print(np.sin(x_vals))
[ 0.95892427 0.98264051 0.99633934 0.99988113 0.99322975 0.97645303
0.94972199 0.91330913 0.86758566 0.8130177 0.75016154 0.67965796
0.60222569 0.51865411 0.42979519 0.33655477 0.23988339 0.14076655
0.04021468 -0.06074715 -0.1610897 -0.25979004 -0.35584199 -0.44826636
-0.53612093 -0.61851008 -0.69459392 -0.76359681 -0.82481532 -0.87762535
-0.92148855 -0.95595775 -0.98068157 -0.99540796 -0.99998679 -0.99437139
-0.978619 -0.95289021 -0.91744731 -0.87265161 -0.81895978 -0.75691917
-0.68716224 -0.61040014 -0.52741539 -0.43905397 -0.34621667 -0.24984992
-0.1509361 -0.05048358 0.05048358 0.1509361 0.24984992 0.34621667
0.43905397 0.52741539 0.61040014 0.68716224 0.75691917 0.81895978
0.87265161 0.91744731 0.95289021 0.978619 0.99437139 0.99998679
0.99540796 0.98068157 0.95595775 0.92148855 0.87762535 0.82481532
0.76359681 0.69459392 0.61851008 0.53612093 0.44826636 0.35584199
0.25979004 0.1610897 0.06074715 -0.04021468 -0.14076655 -0.23988339
-0.33655477 -0.42979519 -0.51865411 -0.60222569 -0.67965796 -0.75016154
-0.8130177 -0.86758566 -0.91330913 -0.94972199 -0.97645303 -0.99322975
-0.99988113 -0.99633934 -0.98264051 -0.95892427]
Exercise 2. Compute the square root of the first 10 integers using Numpy functions
# Your code here
vals = np.arange(10)
print(np.sqrt(vals))
[0. 1. 1.41421356 1.73205081 2. 2.23606798
2.44948974 2.64575131 2.82842712 3. ]
In Numpy we also find functions for vector and matrix operations. For example, the function inner implements the scalar product between two vectors. The function dot implements matrix multiplication in the mathematical sense (scalar products between all the rows of the first matrix and all the columns of the second), and can also be used to compute the matrix-vector product. These implementations of linear algebra operations are highly optimized and are much faster than an equivalent implementation with Python for loops. It is therefore important to use numpy functions as much as possible when working with arrays to get the maximum efficiency.
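To make the correspondence concrete, here is a sketch that computes a matrix-vector product once with explicit loops and once with dot, and checks that the results agree. The seeded generator np.random.default_rng(0) is used only to make the example reproducible; it is an assumption of this sketch and is not used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)     # seeded generator, for reproducibility only
m = rng.standard_normal((3, 4))
v = rng.standard_normal(4)

# Matrix-vector product written with explicit Python loops
loop_result = np.zeros(3)
for i in range(3):
    for j in range(4):
        loop_result[i] += m[i, j] * v[j]

# The same product computed by numpy
dot_result = np.dot(m, v)
print(np.allclose(loop_result, dot_result))   # True

# Scalar product of two vectors: 1*4 + 2*5 + 3*6
print(np.inner(np.array([1, 2, 3]), np.array([4, 5, 6])))   # 32
```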
Exercise 3. Define two random matrices of size (3x4) and (4x2) and multiply them. Is the resulting shape what you expected?
# Your code here
m1 = np.random.randn(3,4)
m2 = np.random.randn(4,2)
m_prod = np.dot(m1, m2)
print(f"The shape, as expected, is {m_prod.shape}")
The shape, as expected, is (3, 2)
Note that, for vectors and matrices, the @ operator computes the same product as np.dot (np.dot(a, b) is equivalent to a @ b).
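A small check of this equivalence on two hand-written (2x2) matrices, with the element-wise product shown for comparison (the matrices are arbitrary):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[0, 1], [1, 0]])

print(a @ b)
# [[2 1]
#  [4 3]]
print(np.array_equal(a @ b, np.dot(a, b)))   # True

# The * operator is element-wise, so it gives a different result
print(a * b)
# [[0 2]
#  [3 0]]
```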
Accessing the array’s elements
Numpy arrays are suitable for the storage of large amounts of data. It is therefore convenient to know some smart ways to access their elements. As arrays are ordered structures, elements can be accessed by their index.
a = np.array([[1, 2], [4, -1], [3, 3]])
el = a[1, 0]
print(a)
print("\nThe element in position (1, 0) is ", el)
[[ 1 2]
[ 4 -1]
[ 3 3]]
The element in position (1, 0) is 4
If you need to access larger portions of contiguous or regularly spaced elements of a numpy array, you can use slicing. The simplest form of slicing works like access by index, but replaces the index in one or more dimensions with two indices separated by “:”. For instance, x[2:4, 3:9] returns the values with index 2 to 4 (4 excluded) along the first axis and 3 to 9 (9 excluded) along the second axis. You can optionally define a step: x[2:9:2] selects every second element starting at index 2 and stopping before index 9 (i.e. indices 2, 4, 6 and 8). It is often useful to leave one or more values empty: x[:4] means “from the start up to 4 (excluded)”, x[3:] means “from 3 up to the end”, and x[:, 3] returns the full column 3. Tip: indices also work backwards, meaning that the last element can also be retrieved with the index -1, the second to last with -2, etc.
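The slicing rules described above can be tried on a small array whose values make the selected positions easy to read. This is only a sketch; the array x below is an arbitrary example.

```python
import numpy as np

x = np.arange(20).reshape(4, 5)
print(x)
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]
#  [15 16 17 18 19]]

print(x[1:3, 2:4])   # rows 1 and 2, columns 2 and 3
# [[ 7  8]
#  [12 13]]
print(x[0, ::2])     # every second element of row 0: [0 2 4]
print(x[:2, 0])      # first two elements of column 0: [0 5]
print(x[:, 3])       # the full column 3: [ 3  8 13 18]
print(x[-1, -2])     # last row, second to last column: 18
```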
Exercise 4. Define a random array (values from the normal distribution) of shape (3, 4, 4). Slice the 2x2x2 cube at the beginning of each axis and print it.
# Your code here
arr = np.random.randn(3, 4, 4)
cube = arr[:2,:2,:2]
print(cube)
[[[-1.08328854 1.28221963]
[-1.27201834 -0.37782782]]
[[-1.08413894 -1.09632892]
[ 1.30228114 0.22380735]]]
Exercise 5. Define a matrix of size (8x8). Undersample it into a (4x4) matrix by selecting every second element both along the rows and the columns.
# Your code here
arr = np.random.randn(8,8)
arr_undersampled = arr[::2,::2]
print(arr_undersampled.shape)
(4, 4)
Exercise 6. The slicing operation returns a view of the sliced part of the array, not a copy. This means that changing the values of the slice also changes the values of the original array. Define a (5x5) random matrix, slice the row with index 3 (i.e. the fourth row) and assign the value 1 to its first 3 elements. Print the original matrix.
# Your code here
arr = np.random.randn(5,5)
row = arr[3]
row[:3] = 1
print(arr)
[[ 2.67459331 -0.82852207 -1.10716602 0.3684717 -0.88834742]
[-0.96288291 0.4976672 -1.6695397 1.29641602 -0.94383472]
[-2.33519636 0.46634051 -0.29988549 -1.23414653 0.46697357]
[ 1. 1. 1. 0.43479929 0.27697386]
[ 1.21550755 -0.63349573 -2.98844918 -1.48884266 0.52650454]]
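When the original array must stay untouched, an explicit copy can be taken instead. The following sketch contrasts the two behaviors; np.shares_memory is used here only as a convenient way to check whether two arrays refer to the same data.

```python
import numpy as np

arr = np.zeros((3, 3))

view = arr[0]              # slicing: refers to the data of arr
view[:] = 1                # the change is visible in arr
print(arr[0])              # [1. 1. 1.]

independent = arr[1].copy()   # explicit copy: separate data
independent[:] = 5            # arr is not affected
print(arr[1])                 # [0. 0. 0.]

print(np.shares_memory(arr, view))          # True
print(np.shares_memory(arr, independent))   # False
```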
Array manipulation
Arrays often need to be manipulated to bring them into the correct format for a computation. For example, a dataset of pictures might be stored as flat vectors, but we might need them in the form of rectangles. Numpy offers a long list of functions to handle arrays. Here we are going to focus on the functions reshape, transpose and concatenate.
reshape is used to change the shape of an array without changing the values of its elements. It receives the list of sizes of the resulting array in each dimension and rearranges the elements accordingly. It is possible to leave one of the dimensions unspecified (by passing -1), as it can be inferred from the sizes of the other dimensions and the total number of elements.
x = np.random.randn(100)
print("The original shape is", x.shape)
x_square = np.reshape(x, [10, 10])
print("The new shape is", x_square.shape)
The original shape is (100,)
The new shape is (10, 10)
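With a small array it is easy to see how the elements are arranged after reshaping. A minimal sketch (by default, the rows are filled one after the other):

```python
import numpy as np

x = np.arange(6)
m = x.reshape(2, 3)
print(m)
# [[0 1 2]
#  [3 4 5]]

# With -1 the missing size is inferred: 6 elements / 2 columns = 3 rows
n = x.reshape(-1, 2)
print(n.shape)     # (3, 2)
```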
Exercise 7. Define a random matrix of size (100x100) and reshape it into an array of size (100x10x10). Try not specifying the last dimension and verify that it still has the expected shape
# Your code here
x = np.random.randn(100,100)
x_reshaped = x.reshape(100,10,-1)
print("The reshaped shape is", x_reshaped.shape)
The reshaped shape is (100, 10, 10)
transpose swaps the axes of an array. For a matrix, this means exchanging rows and columns, so that the element in position (i, j) moves to position (j, i).
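A short sketch of the effect on a (2x3) matrix and on a three-dimensional array. The attribute .T is a shorthand for reversing the axes, and the tuple passed to np.transpose gives the new order of the axes; the arrays are arbitrary examples.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])
print(m.T)
# [[1 4]
#  [2 5]
#  [3 6]]
print(m[0, 2] == m.T[2, 0])    # True: indices are swapped

# For arrays with more dimensions, the new order of the axes can be given
t = np.zeros((2, 3, 4))
print(np.transpose(t, (2, 0, 1)).shape)   # (4, 2, 3)
```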
Exercise 8. Create a random (3x5) matrix m and compute its transpose m_t. Verify that both products, m_t times m and m times m_t, result in a symmetric matrix.
# Your code here
m = np.random.randn(3,5)
m_t = m.transpose(1,0)
m_t_m = np.dot(m_t, m)
m_m_t = np.dot(m, m_t)
print("M M^T is symmetric:", np.allclose(m_m_t, m_m_t.transpose()))
print("M^T M is symmetric:", np.allclose(m_t_m, m_t_m.transpose()))
M M^T is symmetric: True
M^T M is symmetric: True
concatenate is the function to merge multiple arrays into a single one. Through the keyword axis one can specify along which dimension the arrays are joined.
Exercise 9. Define two random matrices of size (2x5). Use the function concatenate to merge them into a new matrix. First try passing axis = 0, then axis = 1. How does the shape of the result change?
# Your code here
m1 = np.random.randn(2,5)
m2 = np.random.randn(2,5)
concat_axis0 = np.concatenate((m1,m2), axis=0)
concat_axis1 = np.concatenate((m1,m2), axis=1)
print("Shape after concatenation along axis 0:", concat_axis0.shape)
print("Shape after concatenation along axis 1:", concat_axis1.shape)
Shape after concatenation along axis 0: (4, 5)
Shape after concatenation along axis 1: (2, 10)
Broadcasting
Broadcasting is a useful tool to write compact and efficient code with Numpy. The idea is that Numpy will sometimes accept vectors and matrices of different shapes when executing operations such as a sum or an element-wise product. For example:
x = np.array([[2, 4], [3, 1], [0, -1]])
y = np.ones((3, 1))
result = x + y
print(result)
[[3. 5.]
[4. 2.]
[1. 0.]]
In the previous code we have summed a (3x2) matrix and a (3x1) column vector. Numpy succeeds in the task because it interprets the operation as “add the vector y to each column of x”. In fact, broadcasting follows these 2 rules:
1 - If the two arrays have a different number of dimensions, prepend dimensions of size 1 to the shape of the array with fewer dimensions until the numbers match.
2 - In all the dimensions in which one array has size 1 and the other \(n > 1\), the array with size 1 behaves as if its values were repeated n times.
If the sizes still differ in some dimension after applying these rules, the operation raises an error. When applicable, broadcasting is an extremely useful tool, as it avoids writing explicit loops and creating repeated copies of the data.
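The two rules can be followed step by step on a small example. The sketch below adds a vector of shape (3,) to a matrix of shape (2, 3), then shows a case in which the shapes are not compatible (the values are arbitrary).

```python
import numpy as np

x = np.ones((2, 3))
v = np.array([10, 20, 30])     # shape (3,)

# Rule 1: v is treated as if it had shape (1, 3)
# Rule 2: its single row is repeated along the first axis
result = x + v
print(result.shape)            # (2, 3)
print(result)
# [[11. 21. 31.]
#  [11. 21. 31.]]

# Shapes (2, 3) and (2,) are not compatible:
# (2,) is treated as (1, 2), and the last sizes (3 and 2) differ
try:
    x + np.array([1, 2])
except ValueError as err:
    print("Broadcasting failed:", err)
```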
Exercise 10. Create a (10x10) matrix in which all rows contain the numbers from 0 to 9, plus some random noise (the random noise is different for each row). Take advantage of broadcasting.
# Your code here
row = np.reshape(np.arange(10), (1, -1))  # shape (1, 10): the numbers 0..9 along a row
noise = np.random.randn(10, 1)  # shape (10, 1): one noise value per row
m = row + noise  # broadcast to shape (10, 10)
print(m)
[[ 1.08186659 2.08186659 3.08186659 4.08186659 5.08186659 6.08186659
7.08186659 8.08186659 9.08186659 10.08186659]
[ 0.74109193 1.74109193 2.74109193 3.74109193 4.74109193 5.74109193
6.74109193 7.74109193 8.74109193 9.74109193]
[ 0.65680219 1.65680219 2.65680219 3.65680219 4.65680219 5.65680219
6.65680219 7.65680219 8.65680219 9.65680219]
[-0.30619884 0.69380116 1.69380116 2.69380116 3.69380116 4.69380116
5.69380116 6.69380116 7.69380116 8.69380116]
[-0.6734278 0.3265722 1.3265722 2.3265722 3.3265722 4.3265722
5.3265722 6.3265722 7.3265722 8.3265722 ]
[ 1.19796811 2.19796811 3.19796811 4.19796811 5.19796811 6.19796811
7.19796811 8.19796811 9.19796811 10.19796811]
[-0.41357663 0.58642337 1.58642337 2.58642337 3.58642337 4.58642337
5.58642337 6.58642337 7.58642337 8.58642337]
[-0.08581642 0.91418358 1.91418358 2.91418358 3.91418358 4.91418358
5.91418358 6.91418358 7.91418358 8.91418358]
[ 0.54504751 1.54504751 2.54504751 3.54504751 4.54504751 5.54504751
6.54504751 7.54504751 8.54504751 9.54504751]
[-0.94848207 0.05151793 1.05151793 2.05151793 3.05151793 4.05151793
5.05151793 6.05151793 7.05151793 8.05151793]]