Python notes

Displaying results

The print function is the main way of displaying output in Python:

print("hello world")

displays the text "hello world" (without the quotes). Text enclosed by matching pairs of single or double quotes is called a string (we will see other ways of making a string later). Strings (and only strings) are treated as exact pieces of text and might be used to store, for example, a patient's name or a gene sequence.

The placeholder {} can be used with the string format method to denote the location to insert other values into a string. For example,

print("Today is my first day of {} class.".format("informatics"))

displays "Today is my first day of informatics class."

We can use as many placeholders as we want; their values are, by default, inserted in order:

print("Did you know that {} founded {}?".format("Clara Barton", "the American Red Cross"))

displays "Did you know that Clara Barton founded the American Red Cross?"

To make text with placeholders easier to read, you can give them names and refer to those names in the format arguments. For example:

print("Our next patient is {person}. Unfortunately, {person} has recently been diagnosed with {condition}."
      .format(person="Bob", condition="hangnails"))

displays "Our next patient is Bob. Unfortunately, Bob has recently been diagnosed with hangnails." Note that because we are inside the parentheses of the call to print, we can move the .format to the next line to make the code easier to read. We'll use a variant of this when we're using NoSQL databases.

To learn more about the format method, including ways to change formatting, see the relevant section of the Python documentation.
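
As a small taste of what that documentation covers, the sketch below shows two options: a format specification that rounds a number for display, and numbered placeholders that reuse one argument. The variable names and values are made up for illustration.

```python
# Hypothetical measurement, used only for illustration.
temperature = 37.23456

# ":.1f" asks for one digit after the decimal point.
rounded = "Temperature: {:.1f} C".format(temperature)
print(rounded)  # Temperature: 37.2 C

# Numbered placeholders let the same argument appear more than once.
message = "{0} is here. Please call {0} when {1} is ready.".format("Bob", "Dr. Lee")
print(message)
```

Numbered placeholders are an alternative to the named placeholders shown above; named ones are usually easier to read.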

Creating variables

A variable is a name that is assigned to an object or value; this assignment is typically done through the assignment operator = as in:

body_temperature = 37.2
heart_count = 1
note = 'Patient reported stomachache.'
body_parts = {'head', 'shoulders', 'knees', 'toes'}
hospital_visits = ['June 3, 2019', 'June 5, 2019', 'December 3, 2027', 'March 5, 2039']

Here the variables body_temperature, heart_count, note, body_parts, and hospital_visits are, respectively, a float, an int, a string, a set, and a list. The set and list here contain strings, but they could just as easily contain numbers or other data types. (For reasons beyond the scope of this course, a set cannot contain lists, but a list can contain lists and sets.)
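
One way to check these claims is with the built-in type function. The sketch below also uses try and except, which these notes have not covered, only to show that placing a list inside a set fails; the names float_name, mixed, and so on are chosen just for this example.

```python
body_temperature = 37.2
heart_count = 1
note = 'Patient reported stomachache.'
body_parts = {'head', 'shoulders', 'knees', 'toes'}
hospital_visits = ['June 3, 2019', 'June 5, 2019']

# type(...) reports what kind of object a variable refers to.
float_name = type(body_temperature).__name__
int_name = type(heart_count).__name__
str_name = type(note).__name__
set_name = type(body_parts).__name__
list_name = type(hospital_visits).__name__
print(float_name, int_name, str_name, set_name, list_name)

# A list may hold lists and sets...
mixed = [[1, 2], {3, 4}]

# ...but putting a list inside a set raises a TypeError.
try:
    bad_set = {[1, 2]}
    set_of_lists_allowed = True
except TypeError:
    set_of_lists_allowed = False
print(set_of_lists_allowed)  # False
```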

We use variables to store results needed for later calculations (see Arithmetic, below); e.g.

right_hand_phalanges = 14
left_hand_phalanges = 14
right_foot_phalanges = 14
left_foot_phalanges = 14
total_phalanges = right_hand_phalanges + left_hand_phalanges + right_foot_phalanges + left_foot_phalanges

(Notice that to make our code readable, we always give meaningful variable names.)

A floating point number is a number offering a certain number of bits of precision (enough for about 15 digits) regardless of where the decimal point falls. This is the typical way "decimals" are represented on computers; for more, see the Python tutorial's discussion on floating point numbers.
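
The limited precision occasionally shows up in surprising ways. The following sketch uses the classic example of adding 0.1 and 0.2, and shows one possible way to compare floats safely using math.isclose from the standard library.

```python
import math

total = 0.1 + 0.2
print(total)         # 0.30000000000000004
print(total == 0.3)  # False

# math.isclose compares within a small tolerance rather than exactly.
close_enough = math.isclose(total, 0.3)
print(close_enough)  # True
```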

An int represents an integer, that is, a number with no fractional component. For the most part, ints can be used interchangeably with floating point numbers that happen to be whole numbers, but occasionally the difference matters (e.g. when working with non-Python code, or when generating a range).
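
A short sketch of where the distinction does and does not matter; the variable names are invented for this example, and try and except are used only to show that range rejects a float.

```python
whole_float = 14.0
whole_int = 14

# The two values compare as equal...
same_value = (whole_float == whole_int)
print(same_value)  # True

# ...but range insists on an int, even for a whole-number float.
try:
    list(range(whole_float))
    range_accepts_float = True
except TypeError:
    range_accepts_float = False
print(range_accepts_float)  # False

# Converting with int(...) makes the float usable in a range.
first_numbers = list(range(int(3.0)))
print(first_numbers)  # [0, 1, 2]
```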

A string is a sequence of characters stored exactly, such as the free-text doctor's note shown here. These are indicated using matching pairs of single or double quotes. Alternatively matching sets of three quotes can be used to indicate the beginning and ending of a multi-line string, such as:

extended_note = """Patient reported a stomachache.

Tests for abdominal muscle injury negative. Recommend monitoring and antacid.
"""

Placeholders can be used with strings; these are indicated using curly brackets. The brackets are replaced with values passed into the format method as of the time of the format call. For example,

note = "The patient presented with {} heart(s).".format(heart_count)

note = "The patient presented with {} heart(s) and a body temperature of {}.".format(heart_count, body_temperature)

You might have noticed that we saw the format method above when we talked about the print function; passing named arguments works in general, not just when printing:

note = "The patient presented with {heart} heart(s) and a body temperature of {temp}.".format(heart=2, temp=37)

Using named arguments makes it easier to tell what data the placeholder is supposed to represent and removes the need to worry about argument order. Using placeholders and format is extremely useful when dynamically creating text.

A list represents an ordered collection of data. Here for example, hospital_visits is a list of visit dates. We can use square brackets to indicate an item in a given position in the list (starting from 0). That is, the first item in the list is

hospital_visits[0]

The third visit date was thus:

hospital_visits[2]

(Remember, we count the visits as: 0, 1, 2.)

Negative numbers can be used to indicate position in the list as measured from the end; i.e. the last hospital visit was in

hospital_visits[-1]

We can use the assignment operator to update the value of an item in a list; for example if the next-to-last (i.e. index -2) hospital visit actually happened on December 3, 2028, we can change the existing value via:

hospital_visits[-2] = "December 3, 2028"

If our patient has a new hospital visit, we can append it to the list, with, e.g.

hospital_visits.append("December 27, 2042")

Lists have several other methods of potential interest: the insert method inserts an element at an arbitrary position in the list; the count method counts the occurrences of a given value; the index method finds the first position at which a given value occurs; the sort method sorts the list (and the sorting rule may optionally be specified).
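
These methods might be used as in the sketch below, which works on small made-up lists (visits and scores) so that it does not disturb hospital_visits.

```python
visits = ['June 3, 2019', 'June 5, 2019', 'June 5, 2019']

# insert(position, value) places value before the item currently at position.
visits.insert(1, 'June 4, 2019')
print(visits)

# count and index both look for a given value.
repeat_count = visits.count('June 5, 2019')
first_position = visits.index('June 5, 2019')
print(repeat_count, first_position)  # 2 2

# sort rearranges the list itself; reverse=True is one optional sorting rule.
scores = [7, 2, 9, 4]
scores.sort(reverse=True)
print(scores)  # [9, 7, 4, 2]
```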

The length of a list (in our case, the total number of visits) can be determined using len; e.g.

len(hospital_visits)

It is occasionally useful to have a list-like object representing all the numbers (integers) between two values. This is done using range. To get a list-like object of all integers between 7 (included) and 42 (not included), we use

ages = range(7, 42)

Note: the end value is never included.

In most ways, a range behaves like a list; e.g. to get the fourth (index 3) age, we use

ages[3]

The difference is that this value cannot be changed; ages is the specified number range so the index 3 entry must always be 10. To allow assigning values, we could have instead created a list from the range:

ages = list(range(7, 42))
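
The difference can be seen directly in the following sketch, where try and except are used only to show that assigning into a range fails; age_list is a name chosen for this example.

```python
ages = range(7, 42)
print(ages[3])  # 10

# Assigning into a range raises a TypeError.
try:
    ages[3] = 11
    range_changed = True
except TypeError:
    range_changed = False
print(range_changed)  # False

# A list built from the same range can be modified freely.
age_list = list(range(7, 42))
age_list[3] = 11
print(age_list[:5])  # [7, 8, 9, 11, 11]
```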

Python provides a number of other built-in data types, including dictionaries (described below), complex numbers (e.g. 3+2j, useful for algorithms, but unlikely to show up in raw health data), booleans (True, False), byte arrays, tuples, and more. (Python 2 also had a separate long integer type; in Python 3, int handles integers of any size.)

Arithmetic

Python supports the basic arithmetic operators: addition (+), subtraction (-), multiplication (*), division (/), exponentiation (**).

number_of_fingers = 1 + 1 + 1 + 1 + 1
hours_in_a_year = 24 * 365

Order of operations and parentheses follow the normal rules of mathematics:

principal = 1000 * 1.05 ** 30
something_else_entirely = (1000 * 1.05) ** 30
volume_of_tumor = 4. / 3 * 3.14 * 2 ** 3

(Warning: Prior to Python 3, Python used what was known as integer division where if two integers e.g. 4 and 3 were divided, it would return an integer; i.e. it would drop the fractional part, and 4 / 3 would return 1 instead of 1.333. The 4. in the above line forces the 4 to be a float and therefore even in older versions of Python the division would return 1.333.)

There are additional arithmetic operators, including: matrix multiplication (@), integer division (//) and modulus (%).
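
Integer division and modulus are often used together. As an illustration with made-up numbers, the sketch below converts a duration in minutes into hours and leftover minutes.

```python
total_minutes = 135

hours = total_minutes // 60    # integer division drops the fractional part
minutes = total_minutes % 60   # modulus gives the remainder
print(hours, minutes)  # 2 15

# Compare ordinary division with integer division.
print(7 / 2)   # 3.5
print(7 // 2)  # 3
```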

Making choices

Using an if statement, code can do different things depending on different conditions. For example, the code:

temperature = 39
if temperature > 38:
    print('Fever detected. NSAIDs indicated.')

prints out the message, because the temperature is above the specified threshold.

A colon (:) begins a block of code that only happens if the condition is satisfied. The entire block of code must be indented:

if 2 > 3:
    print('this will never display')
    print('nor will this')
print('this on the other hand always prints out')

Python's comparison operators include:

    >     greater than
    >=    greater than or equal
    <     less than
    <=    less than or equal
    ==    equal
    !=    not equal

In particular, notice that equality testing uses two equal signs (==).

if diagnosis == 'diabetes':
    print('Consider Metformin or insulin.')
else:
    print('No treatment recommendations at this time.')

The else block gets executed when the condition is not true.

More complicated comparisons can be formed by combining comparisons using and, or, and not.

if diagnosis == 'diabetes' and not metformin_tried:
    print('Try Metformin.')

We can test if an item is in (or not in) a list using the in and not in operators, respectively:

'June 5, 2019' in hospital_visits

displays True

'5 June 2019' in hospital_visits

displays False

Why the difference? Because for now, we're treating dates as strings of characters, nothing more. So since the date was written in the first but not the second way, only the first returns True. Tomorrow, we'll handle dates in a more sophisticated way using the dateutil module.

Loops

Computers are great for doing similar calculations repeatedly. If we know in advance the set of things that we want to use for a calculation, we can use a for loop. For example, the following prints a list of today's patients:

patients = ['Blackwell, E', 'Lister, J', 'Vesalius, A', 'Freud, S', 'Salk, J']
for patient in patients:
    print('Patient {name} is scheduled for a consult today.'.format(name=patient))

As with if statements, the block of code that goes with the for is indented.

(The choice of the variable patient here was completely arbitrary; we could just as easily have written "for x in patients:" The computer wouldn't care, but then the name wouldn't mean anything to human readers, making the code that much harder to interpret. Other equally good choices for the variable name here include: person, name.)

Using loops like this means we do not have to write the same code twice. This is a general programming principle called Don't Repeat Yourself. Functions and methods (described below) offer another way to avoid repeating ourselves. Avoiding repetition reduces our effort and helps avoid introducing copy-paste bugs.

In practice, we know more than just a single piece of data; we may have a list of lists of data grouping related information. For example, suppose that instead of the above we paired patients with their birth years:

patients = [['Blackwell, E', 1821],
            ['Lister, J', 1827],
            ['Vesalius, A', 1514],
            ['Freud, S', 1856],
            ['Salk, J', 1914]]

We can then loop over each patient getting the name and birth year by giving each of these variable names in the for statement, separated by commas:

for name, year in patients:
    print('Meeting with {} today, who was born in {}.'.format(name, year))

If the lists of patient data were longer, we would simply add more comma-separated variable names in the for statement.

(You may have noticed that there is no inherent reason why name should come first and birth year second, other than that the order must be consistent. Dictionaries, described below, remove the need to keep data in a fixed order.)

As we loop over our data, we may want to do different things based on the data. For example, we can print the list of all patients over 150 years old:

for name, year in patients:
    if 2019 - year > 150:
        print('Patient {name} is over 150 years old.'.format(name=name))

If we know a loop should stop when a certain condition has been reached, we can check for that condition with an if statement and leave the loop early by using break; for example, the following prints the squares of the numbers 1, 3, 5, 7, and 9 except it stops when the square would be bigger than 40 and does not print it or anything after it:

numbers = [1, 3, 5, 7, 9]
for number in numbers:
    if number * number > 40:
        break
    print(number * number)

Note: order matters! If the print call came before the if statement, the output would be different. How would it be different and why?

On rare occasions, it is useful to know the index that goes with the value being looped over. We do this using enumerate as in:

for i, word in enumerate(['the', 'quick', 'brown', 'fox', 'jumped']):
    print('The {}th word is {}.'.format(i, word))

Running this prints each position along with the associated word. enumerate automatically pairs each item with its index. Recall that the first position in a list is position 0, so the first line printed is "The 0th word is the."

The while statement offers another way of defining loops in Python. For both for and while, loops can be exited early (typically based on the choice made in an if statement) using the break keyword.
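
A while loop keeps repeating its block for as long as its condition remains true. The sketch below uses invented numbers (a starting amount of 100 and a threshold of 10) to count how many halvings it takes to fall to or below the threshold.

```python
amount = 100.0
halvings = 0

# Repeat until the amount is no longer above the threshold.
while amount > 10:
    amount = amount / 2
    halvings = halvings + 1

print(halvings, amount)  # 4 6.25
```

Unlike a for loop, the number of repetitions is not known in advance; it depends on when the condition stops being true.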

Dictionaries

Dictionaries provide list-like syntax for data that has no natural order, and is best identified by named keys instead of consecutively numbered indices. For example, the dictionary:

person = {
    'name': 'Barton, C', 
    'birthyear': 1821
}

has two keys: 'name' and 'birthyear'. As with lists, we read and write specific fields using [] notation; e.g.

person['founderOf'] = 'American Red Cross'
print('{} founded the {}.'.format(person['name'], person['founderOf']))

There is no natural 0th, 1st, etc. item, as the three types of data here (name, birthyear, founderOf) have no inherent order. While we cannot access an item by position, the dictionary still holds a definite number of entries; this is returned by the len function, e.g. len(person), which here would return 3, as there are 3 keys (name, birthyear, founderOf) with associated values.

We can get an iterable (an object that can be looped over) of all the keys (fields) in the dictionary data using person.keys() and of all the values using person.values(). To turn these iterables into lists, we use the list function; e.g. all_keys = list(person.keys()).

We can loop over all keys and values together using the items method:

for fact_type, fact_value in person.items():
    print('{}: {}'.format(fact_type, fact_value))

More complicated data structures can be built by combining lists and dictionaries. For example, we could restructure the patients data from before as follows:

patients = [{
                 'name': 'Blackwell, E', 
                 'birthyear': 1821
            },
            {
                 'name': 'Lister, J',
                 'birthyear': 1827
            },
            {
                'name': 'Vesalius, A', 
                'birthyear': 1514
            },
            {
                'birthyear': 1856,
                'name': 'Freud, S' 
            },
            {
                'name': 'Salk, J', 
                'invented': 'Polio vaccine',
                'birthyear': 1914
            }]

Note that for Freud, we have listed his birthyear before his name; this has no effect on any analysis code because we only identify the type of data based on the key not on the order. We have also included additional information about Jonas Salk. Analyses that do not require this type of information simply ignore the extra information in a dictionary.

The code printing out the name and birth years becomes:

for patient in patients:
    print('Meeting with {} today, who was born in {}.'.format(patient['name'], patient['birthyear']))

The in operator evaluates to True if a certain field is in a dictionary; otherwise it is False. Accessing a field that is not present is an error (but the get method provides an alternative). We can use in to find and print a list of our patients' inventions:

for data in patients:
    if 'invented' in data:
        print('{} invented {}'.format(data['name'], data['invented']))
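
The get method mentioned above might be used as in the following sketch. The fallback text 'nothing on record' is an arbitrary choice, and try and except appear only to show the error raised by square brackets.

```python
person = {'name': 'Freud, S', 'birthyear': 1856}

# get returns None when the key is absent instead of raising an error...
invention = person.get('invented')
print(invention)  # None

# ...or a fallback value of our choosing, given as the second argument.
invention_text = person.get('invented', 'nothing on record')
print(invention_text)  # nothing on record

# By contrast, square brackets raise a KeyError for a missing key.
try:
    person['invented']
    missing_key_is_error = False
except KeyError:
    missing_key_is_error = True
print(missing_key_is_error)  # True
```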

You may wonder why in our example patients remains a list instead of itself being a dictionary with the names as keys. The answer is simple: names are not unique, but dictionary keys must be. For a time, social security numbers were often used to disambiguate people. This is a problem for many reasons, two major ones being: (1) using SSNs increases the risk of identity theft, and (2) not everyone has an SSN (e.g. most non-Americans who have never earned money in the United States).

In practice, a unique identifier is assigned to every patient.
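
With such identifiers, a dictionary of patients becomes workable. In the sketch below, the ID numbers and patient records are invented purely for illustration; note that two patients can share a name without any conflict.

```python
# Hypothetical IDs and records, for illustration only.
patients_by_id = {
    1001: {'name': 'Smith, J', 'birthyear': 1950},
    1002: {'name': 'Smith, J', 'birthyear': 1987},
}

# Look up one patient directly by identifier.
print(patients_by_id[1002]['birthyear'])  # 1987

# Loop over every identifier together with its record.
for patient_id, patient in patients_by_id.items():
    print('{}: {} (born {})'.format(patient_id, patient['name'], patient['birthyear']))
```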

Functions

Functions offer another way of avoiding repetition. They can compute values or perform actions that are done multiple times. They are also used as a way of self-documenting code so that the purpose can be determined via the function name.

A value or values may be returned from a function using the return statement:

def age(birthyear):
    return 2019 - birthyear

Note that, as with if, for, while, etc., the body of a function definition must be indented and preceded by a colon.

Here, birthyear is an argument. Multiple arguments may be specified:

def volume_of_rectangular_tumor(length, width, height):
    return length * width * height
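
Defining a function does not run it; the function runs only when it is called. A minimal sketch of calling the two functions above, with argument values chosen arbitrarily:

```python
def age(birthyear):
    return 2019 - birthyear

def volume_of_rectangular_tumor(length, width, height):
    return length * width * height

# Call each function by name, passing values in parentheses.
salk_age = age(1914)
tumor_volume = volume_of_rectangular_tumor(2, 3, 4)

print(salk_age)      # 105
print(tumor_volume)  # 24
```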

Variables assigned inside of a function are generally only available within that function; a function may read variables defined in a higher scope.
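
The following sketch illustrates both halves of that statement. The names current_year and years are chosen for this example, and try and except are used only to show that years is not visible outside the function.

```python
current_year = 2019  # defined outside any function

def age(birthyear):
    # Reads current_year from the enclosing scope; years exists only in here.
    years = current_year - birthyear
    return years

computed_age = age(1914)
print(computed_age)  # 105

# years was assigned inside age, so it does not exist out here.
try:
    years
    years_visible = True
except NameError:
    years_visible = False
print(years_visible)  # False
```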

Some useful built-in functions include: len, max, sum.

Optional values may be specified as keyword arguments:

def advance_value(value, increment=1):
    return value + increment
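
A brief sketch of the ways such a function can be called; the argument values are arbitrary.

```python
def advance_value(value, increment=1):
    return value + increment

# Leaving out increment uses its default of 1.
default_step = advance_value(10)

# The optional value can be given by name or by position.
named_step = advance_value(10, increment=5)
positional_step = advance_value(10, 5)

print(default_step, named_step, positional_step)  # 11 15 15
```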

Methods

A method is like a function, but it operates on a given object. Syntactically, a method is invoked with the object, a dot, the method name, parentheses, and any arguments. (As with functions, methods can take positional arguments, keyword arguments, or a combination of the two.)

For example, we have already seen the format method of a string. That method returns a new string with the placeholders replaced:

note = "His temperature was {temp} degrees.".format(temp=37)

Here the string was defined on the same line the method was invoked, but often these are separated, especially for complex objects:

template = "His temperature was {temp} degrees."
completed_note = template.format(temp=37)

The format method leaves the original template unchanged and returns a new string. Strings also have lower and upper methods that return lowercase and uppercase versions of the string, respectively, without changing the original.
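
For instance, in this small sketch the original string is untouched after both calls:

```python
name = 'Barton, C'

upper_name = name.upper()
lower_name = name.lower()

print(upper_name)  # BARTON, C
print(lower_name)  # barton, c
print(name)        # Barton, C (the original is unchanged)
```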

Other methods, like a list's sort method (or like append, shown above) modify the object itself:

names = ['Jones', 'Smith', 'Flintstone']
names.sort()
print(names)  # displays ['Flintstone', 'Jones', 'Smith']

The # here indicates a comment; that is, text the computer ignores that is used to make the code easier to read.

(Every type of data can be sorted given an appropriate sorting rule; for more, see the Python tutorial on sorting.)
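
One way to supply a sorting rule is the key argument of sort, which takes a function that returns the value to sort by. The sketch below sorts a shortened version of the earlier patient list by birth year; birth_year is a helper function written just for this example.

```python
patients = [['Blackwell, E', 1821],
            ['Lister, J', 1827],
            ['Vesalius, A', 1514]]

def birth_year(patient):
    # Each patient is a [name, year] list; sort by the year.
    return patient[1]

patients.sort(key=birth_year)

names_oldest_first = []
for name, year in patients:
    names_oldest_first.append(name)
print(names_oldest_first)  # ['Vesalius, A', 'Blackwell, E', 'Lister, J']
```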

Libraries (modules)

Python provides many libraries (modules) to facilitate common calculations. Over 200 come with every Python installation, and many distributions (e.g. Anaconda) include even more. Additional modules may be found and automatically installed by using pip (or conda), although you will not need to do that for this course.

Libraries are loaded using the import statement; specific functions, etc. may be loaded using from together with import. Dot notation is used to access items within the library.
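
Both styles are sketched below using the standard library's math module, picked here only because it comes with every Python installation.

```python
# Load the whole module, then use dot notation to reach its contents.
import math
root = math.sqrt(16)
print(root)  # 4.0

# Load one specific name, which can then be used without the module prefix.
from math import pi
circle_area = pi * 2 ** 2
print(circle_area)
```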

In this course, we will use:

dateutil

For working with dates.

from dateutil.parser import parse as parse_date
date1 = parse_date('1 Jan 1970')
date2 = parse_date('July 23, 2019')
print(date2.year)
print(date2.month)
print((date2 - date1).days)
print((date2 - date1).total_seconds())
major_events = {parse_date('Mar 10, 1876'): 'First telephone call',
                parse_date('1969-10-29'): 'ARPANET created',
                parse_date('3 December 1967'): 'First human heart transplant'}
print(major_events[parse_date('Oct 29, 1969')])

pandas

import pandas as pd
data = pd.read_csv('patients.csv')
data.head()  # run this interactively
data[['subject_id', 'row_id', 'gender']]  # run this interactively to display the data frame
data[['subject_id', 'row_id', 'gender']][:4] # same, but grabs the first four rows
data['dob'] = pd.to_datetime(data['dob'])  # process all the dates at one time, can compare with parse_date values

for row in data.itertuples():
    print(row.dob, row.gender)

x = pd.Series(range(10))
y = x * x
print(y)

A data frame's to_csv method can be used to store it as a csv file.

Data frames may be filtered to include only rows matching a given condition. For example, to get a frame of only the rows where gender is specified, you can use the notna method:

data[data['gender'].notna()]

(The opposite of notna is isna.)

Conditions can be combined with | (or) and & (and), provided each condition is wrapped in parentheses. To select rows with a specified value, use ==; other comparison operators (>, >=, etc.) may be used as well:

data[data['gender'] == 'M']

For a comparison of pandas with SQL, see e.g. https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html

requests

For loading data from the web.

import requests
data = requests.get('https://senselab.med.yale.edu/_site/webapi/object.json/{id_}?woatts=23,24'.format(id_=279)).json()
print(data['object_name'])

numpy

For linear algebra and vectorized mathematics.

import numpy as np
a = np.array([1, 5, 165])
print(a * 2)

Matrices may be created as two dimensional arrays:

a = np.array([[1, 2], [3, 4]])

The * operator behaves elementwise while @ performs matrix multiplication; i.e. compare a * a with a @ a.

numpy.random allows generating random numbers from a variety of distributions; for example:

np.random.poisson(5)  # numpy was imported as np above

You can also generate many values at one time by specifying a size argument:

np.random.normal(0, 2, size=10)  # mean 0; std: 2

pymongo

For connecting to MongoDB.

from pymongo import MongoClient
mongodb = MongoClient()
test_db = mongodb.test_db
test_collection = test_db.collection
test_collection.insert_one({'systolic': 120, 'temperature': 37})

MySQLdb

For connecting to MySQL; can put query results into pandas.

import MySQLdb
import pandas as pd
db = MySQLdb.connect(host="localhost", user="username", passwd="password", db="dbname")
cur = db.cursor()
data = pd.read_sql("select insurance, diagnosis from Admissions where hospital_expire_flag=1", db)
data.head()

plotnine (aka ggplot)

import pandas as pd
from plotnine import ggplot, aes, geom_point, geom_line, geom_bar, geom_text
my_data = pd.DataFrame({'time': [1, 2, 3, 4, 5], 'rabbits': [1, 4, 9, 16, 25]})
ggplot(my_data, aes(x='time', y='rabbits')) + geom_point() + geom_line()

(ggplot(my_data, aes(x='time', y='rabbits'))
    + geom_bar(stat='identity', position='dodge', show_legend=False)
    + geom_text(aes(label='rabbits'), va='bottom', format_string='{} rabbits'))

Other modules commonly used in data science that we won't have time to talk about in this course include: math, scipy, nltk.

Basic file access

Reading all of a file into a string

with open('filename.txt') as my_file:
    file_contents = my_file.read()

Reading a file line-by-line

with open('filename.txt') as my_file:
    for line in my_file:
        # do something with the line (includes the newline character)
        print(line)
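
Because each line keeps its trailing newline character, printing it directly leaves a blank line between lines. One common remedy is to strip the newline with the string method rstrip. The sketch below first creates a small throwaway file (the file name and contents are made up) so that it can run on its own.

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'notes.txt')
with open(path, 'w') as my_file:
    my_file.write('first line\nsecond line\n')

cleaned_lines = []
with open(path) as my_file:
    for line in my_file:
        # rstrip('\n') removes the newline character at the end of the line.
        cleaned_lines.append(line.rstrip('\n'))

print(cleaned_lines)  # ['first line', 'second line']
```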

Encodings

Strings are represented in a computer's memory as a series of numbers. Given the wide diversity of human languages, the question arises of how to represent all possible characters; this led to the emergence of the Unicode standard. For various reasons, however, Unicode code points are generally not written directly to disk; instead, one of a number of possible encodings is used. The most common is UTF-8. Python 3 assumes source code is UTF-8 by default, but the default encoding used by open depends on the platform and Python version, so it is safest to specify the encoding explicitly using the encoding keyword argument to open; e.g.:

with open('filename.txt', encoding='iso-8859-1') as my_file:
    for line in my_file:
        # do something with the line (includes the newline character)
        print(line)

The Python documentation contains a list of supported encodings and known aliases.
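
To see why the choice of encoding matters, the following sketch writes a short made-up string to a temporary file as ISO-8859-1 and then tries reading it back two ways. try and except are used only to show that reading with the wrong encoding can fail.

```python
import os
import tempfile

text = 'caf\u00e9'  # "café"; \u00e9 is the accented e
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

with open(path, 'w', encoding='iso-8859-1') as my_file:
    my_file.write(text)

# Reading with the same encoding recovers the original text.
with open(path, encoding='iso-8859-1') as my_file:
    same_encoding = my_file.read()

# Reading the same bytes as UTF-8 fails for this particular file.
try:
    with open(path, encoding='utf-8') as my_file:
        my_file.read()
    utf8_read_ok = True
except UnicodeDecodeError:
    utf8_read_ok = False

# The two encodings store the accented letter using different numbers of bytes.
latin1_length = len(text.encode('iso-8859-1'))
utf8_length = len(text.encode('utf-8'))
print(same_encoding == text, utf8_read_ok, latin1_length, utf8_length)
```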

JSON

Writing to a file

Suppose data is a dictionary or list, possibly with nested dictionaries and lists, e.g. here a sampling of pubmed:

data = {
  2: {"journal": "Biochemical and biophysical research communications",
      "mesh": ["Fourier Analysis", "Magnetic Resonance Spectroscopy"]},
  8: {"journal": "Biochemical pharmacology",
      "mesh": ["Amidohydrolases", "Animals", "Esterases"]}
}

then we can write this to a file using Python's json module:

import json
with open('data.json', 'w') as f:
    f.write(json.dumps(data))

The file will look very similar to the way you would describe the data in Python, with the most notable differences being that strings are always enclosed in double quotes, null replaces None, and dictionary keys are always strings (so the keys 2 and 8 above are written as "2" and "8"). By default, the content will all be written on a single line, which could be hard to read, but we can use the indent keyword to have Python automatically add indentation and line breaks to make it more readable, e.g.

import json
with open('data.json', 'w') as f:
    f.write(json.dumps(data, indent=4))

Reading from a file

To read data from a JSON file we use the json.load function:

with open('data.json', 'r') as f:
    data = json.load(f)

Similarly, json.loads loads JSON from a string instead of from a file.
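
A round trip through a string can be sketched as follows. The record here is invented for illustration (including the missing "doi" field); the example also shows that an integer key comes back as a string.

```python
import json

# Hypothetical record, for illustration only.
record = {1: {"journal": "Example Journal", "mesh": ["Animals"], "doi": None}}

text = json.dumps(record)
print(text)

restored = json.loads(text)

# JSON keys are always strings, so the key 1 comes back as "1".
print(list(restored.keys()))    # ['1']
print(restored["1"]["mesh"])    # ['Animals']
print(restored["1"]["doi"])     # None (written as null in the JSON text)
```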