Mid-term Review Topics
- Python/Programming
- REPL, Notebook, Module
- Assignment
- Data types (
int
,str
,float
,tuple
,list
,set
,dict
) - Constants (
True
,False
,None
) - Control Structures
(
if
/elif
/else
,for
,while
, comprehensions) - Built-in Functions (
print()
,len()
,range()
,open()
,enumerate()
,any()
,all()
) - Functions (arguments, default arguments,
return
) - Raw data (
open()
,str.encode()
,bytes.decode()
,urllib.request.urlopen().read()
) - Pandas (
read_csv()
,read_excel
,to_csv()
,to_excel
)
- NLP/CL
- Basic tokenization
- Text normalization (purpose, kinds of normalization)
- Types vs Tokens
- Frequency Distributions, Conditional Frequency Distributions
- Text corpora (Gutenberg, Brown, annotated corpora)
- Lexical resources (word lists, stopwords, pronouncing dictionaries, WordNet)
- Software Engineering
- Paths and folders
- Git (purpose, usage)
- Efficiency (e.g., which function is faster)
- Mutability of data structures
- Side-effects of functions
- Unicode (purpose, what is it)
- Files and Streams (purpose, difference from strings/bytestrings)
- Virtual Environments (purpose, troubleshooting)
Learning Objectives
Python/Programming
This category is for programming concepts, techniques, and structures as used in Python.
Modes of Programming
Python is a very flexible language that can be used in multiple ways.
REPL
REPL (Read-Evaluate-Print-Loop) describes the way the computer
operates an interactive interpreter such as Python’s IDLE
interpreter. The computer reads what the user entered, evaluates it,
prints the output, and loops (starts this process again). When you see
code written with >>>
on the left side, it is
showing what you would see in an interactive interpreter:
>>> print("hello")
hello
You can start an interactive interpreter by running the Python command in a terminal without any arguments:
$ python
3.8.2 (default, Jul 16 2020, 14:00:26)
Python 9.3.0] on linux
[GCC "help", "copyright", "credits" or "license" for more information.
Type >>>
The REPL mode of Python is useful for quick experiments and debugging, but it is not easy to revise what has been entered so it is not convenient for larger programs.
Module
Perhaps the most common way to write Python code is in a
module. A module is just a .py
file containing
Python code. The file is saved on the computer and it is not run until
it is loaded and processed by a Python process. Modules meant to be
imported by other modules are generally called libraries, while
modules meant to be run directly are called scripts. Imagine
you had a file hello.py
with the following contents:
print('hello')
You can then run this script at the command line and see its output
as follows (depending on your system and environment, the
python3
command may be python
,
py
, or something else):
$ python3 hello.py
hello
Using modules to write Python code is the most versatile mode and is recommended for most uses.
Notebook
A mode of using Python popular for demonstration or pedagogical purposes is the notebook. It mixes human language text with Python code and sometimes provides a more friendly view of the output (e.g., rendered charts and tables). This style of programming is called literate programming. It looks something like this:
Expressions, Statements, Definitions
Assignment
>>> x = 3 # assign the value 3 to the variable named 'x'
>>> x = x + 1 # evaluate the expression 'x + 1' then assign its result to x
>>> x
4
>>> x += 1 # shorthand for the above
>>> x
5
>>> x = 'foo' # variables can be reassigned to values of a different type
>>> x.upper() # an expression involving the variable does not change its value
'FOO'
>>> x
'foo'
>>> x = [1] # ...
>>> x.append(2) # unless the value is mutable
>>> x
1, 2]
[>>> x = x.append(3) # be careful not to reassign to function that return None (don't do this)
>>> x is None
True
Functions
def myfunction(x, y, z=0): # define a function named 'myfunction' with 3 parameters;
# x and y are required, z has a default value so it is optional
return x * y * z # return a value so the caller can make use of it
# if there is no return statement, the function returns None
Once a function is defined, call it by its name:
>>> myfunction(1, 2)
0
>>> myfunction(1, 2, 3)
6
Data Types
int
Integer types represent positive or negative whole numbers.
>>> 1024 # literal integers are base-10
1024
>>> int(3.14159) # int() casts floats to integers
3
>>> int('101') # int() casts strings to integers
101
>>> int('101', 2) # an explicit base can be given
5
>>> 20 // 3 # floor division drops any remainder
6
float
Floating-point numbers represent numbers with a fractional component.
>>> 3.14 # literal floats include a decimal
3.14
>>> 6.02e23 # and/or an exponent
6.02e+23
>>> float(3) # float() casts from integers
3.0
>>> float('3.14') # and strings (but only base-10)
3.14
>>> 20 / 3 # regular division results in a float
6.666666666666667
>>> 20 / 5 # even when there's no remainder
4.0
str
Strings are sequences of characters.
>>> 'a string' # strings are surrounded by single quotes
'a string'
>>> "another string" # or double quotes
'another string'
>>> '''yet
... another''' # or three single/double quotes (newlines allowed)
>>> 'もう一つ' # unicode is ok
'もう一つ'
Useful string methods:
- str.split() – return the list of tokens in the string
- str.lower() – return the string with all cased characters downcased
list
Lists are mutable sequences of values.
>>> x = [] # make an empty list
>>> x = list() # this works, too
>>> x.append(1) # add a single value
>>> x.extend([2,3]) # add multiple values
>>> x
1, 2, 3]
[>>> x[1:] # slice from the 2nd element
2, 3]
[>>> x[:-1] # slice until 1 before the last element
1, 2]
[>>> len(x) # find how many elements are on the list
3
>>> x.count(3) # count how many times a particular element occurs
1
>>> 4 in x # use 'in' to check for membership
False
tuple
Tuples are immutable sequences of values.
>>> () # make an empty tuple
()>>> tuple() # or use the function
()>>> (1) # but this doesn't work (it looks like an expression)
1
>>> 1, 2 # use commas to create tuples in general
1, 2)
(>>> 1, # even for single items
1,)
(>>> (1, 2) # but often we need parentheses around them
1, 2)
(>>> x = tuple([1, 1, 2, 3]) # create tuples from an iterable
>>> x[1:-1] # slicing works as expected
1, 2)
(>>> x.count(1) # many list operations work on tuples (but not mutating ones)
2
set
Sets are containers of unique values.
>>> set() # use the function to create an empty set ('{}' creates a dictionary)
set()
>>> {1, 2} # Use braces with comma-separated values to create non-empty sets
1, 2}
{>>> {1, 1, 2, 3} # repeated values are dropped
1, 2, 3}
{>>> x = {1, 2}
>>> x | {3} # | is the operator for set-union
1, 2, 3}
{>>> x & {2} # & is the operator for set-intersection
2}
{>>> x.add(3) # use add() to add a single value
>>> x.update([3, 4]) # use update() to add multiple values (duplicates still dropped)
>>> x
1, 2, 3, 4}
{>>> 2 in x # check for membership
True
dict
Dictionaries are much like sets but they associate keys to values.
>>> {} # create an empty dict with {}
{}>>> dict() # the function works, too
{}>>> {'a': 1, 'b': 2} # keys ('a' and 'b') map to values (1, 2)
'a': 1, 'b': 2}
{>>> d = {'a': 1}
>>> d['b'] = 2 # assign a single key to a value
>>> d.update({'c': 3, 'd': 4}) # update() to assign multiple keys to values
>>> d['c'] # retrieve the value associated with a key
3
>>> d['b'] = 1 # keys are unique; here 'b' is reassigned
>>> d # but values need not be unique
'a': 1, 'b': 1, 'c': 3, 'd': 4}
{>>> len(d) # len() gets the number of keys
4
>>> list(d) # list(d) returns the list of keys
'a', 'b', 'c', 'd']
[>>> 'b' in d # check for membership of keys
True
>>> 1 in d # cannot check for membership of values this way
False
Constants
A constant is a special type of variable whose value cannot be changed. However, we don’t really use constants in Python since any variable can be reassigned. It is good to know, however, which variable names (like ‘list’, ‘dict’, etc..) should be avoided so you don’t accidentally reassign key functions. Naming constant variables in all capital letters is a convention to separate them from ordinary variables, however, it does not actually prevent reassignment.
True, False
Boolean operators that identify values.
None
An operator that identifies presence/absence.
Control Structures
if
An if
statement is used for a conditional
block—code that is executed only if a condition is true. See the
Python Tutorial’s section on if
Statements.
Only one thing below will be printed:
= 'Dog'
word if word in vocabulary:
print(word, 'exists in the vocabulary')
elif word.lower() in vocabulary:
print(word, 'exists in vocabulary when case-normalized')
else:
print(word, 'does not exist in vocabulary')
Note that if you use multiple if
statements instead of
elif
or else
, they will be attempted
regardless of whether the previous conditions were true. In the
following, both will print if both 'Dog'
and
'dog'
are in the vocabulary.
= 'Dog'
word if word in vocabulary:
print(word, 'exists in the vocabulary')
if word.lower() in vocabulary:
print(word, 'exists in vocabulary when case-normalized')
for
A for
statement (also called a for-loop) loops over each
item in an iterable. This generally means the loop is bounded by the
number of items in the iterable, and it’s used when the iterable items
themselves are used inside the loop. See the Python Tutorial’s section
on for
Statements.
= [1, 2, 3]
numbers for number in numbers:
= number**2
square print(f'the square of {number} is {square}')
while
A while
statement (also called a while-loop) loops as
long as a condition is met. This condition can be just about anything,
and sometimes it is True
, meaning an infinite loop. See the
Python Tutorial’s example of while
statements. While-loops are often used when the number of iterations
is not fixed or known in advance, or when there is no iterable whose
items are used inside the loop.
# print Fibonacci numbers less than 100
= 0
a = 1
b while a < 100:
print(a)
= b # temporary variable for the old value of b
tmp = b + a
b = tmp a
Comprehensions
Built-in Functions
print()
len()
range()
enumerate()
any()
all()
open()
Raw data
open()
str.encode()
bytes.decode()
urllib.request.urlopen()
Pandas
Pandas is a library for working with dataframes, a table-like structure that allows for rapid calculations. It also helps us to organize data into a human-readable structure. You can used Pandas by importing it and initializing a dataframe:
import pandas as pd
= pd.DataFrame() df
You can also use pandas to read and write character-delimited files or excel spreadsheets:
= pd.read_csv('randomfile.csv', delimiter="\t")
df = pd.to_csv('randomfile.csv', sep="\t")
df
= pd.read_excel('randomfile.xlsx')
df = pd.to_excel('randomfile.xlsx') df
NLP/CL
This category is for concepts and techniques related to natural language processing or computational linguistics.
Concepts
types-vs-tokens
tokenization
normalization
Frequency Distributions
Conditional Frequency Distributions
Text Corpora
Lexical Resources
Tools
NLTK
Software Engineering
This category is for practices of software engineering as well as programming concepts that are relevant to programming languages beyond Python.