Regular Expressions in Python

  1. Sometimes we want to check whether one string is a substring of another.

    >>> 'tho' in 'python'
    True
    >>> 'pho' in 'python'
    False
    
  2. But we might like to look for more general patterns in strings, for example, whether a Dutch post code appears in the string

    'Science Park 113, 1098 XG Amsterdam, The Netherlands'
    

    We can describe the post code as four digits followed by a space followed by two upper-case letters. Other patterns we might like to look for are email addresses or URLs in text.
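
    As a preview of what follows, that description translates almost directly into a pattern for Python's re module (the pattern syntax is explained later on this page):

    ```python
    import re

    address = 'Science Park 113, 1098 XG Amsterdam, The Netherlands'

    # Four digits, followed by a space, followed by two upper-case letters.
    postcode_re = re.compile('[0-9]{4} [A-Z]{2}')

    match = postcode_re.search(address)
    print(match.group())  # -> 1098 XG
    ```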

  3. Regular expressions are a way of performing such searches and retrieving the parts we want from strings.
  4. First we import the regular expressions module

    import re
    
  5. What can the re module do?

    >>> dir(re)
    ['DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'S',
     'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE',
     '__all__', '__builtins__', '__doc__', '__file__', '__name__', '__package__',
     '__version__', '_alphanum', '_cache', '_cache_repl', '_compile', '_compile_repl',
     '_expand', '_pattern_type', '_pickle', '_subx', 'compile', 'copy_reg', 'error',
     'escape', 'findall', 'finditer', 'match', 'purge', 'search', 'split', 'sre_compile',
     'sre_parse', 'sub', 'subn', 'sys', 'template']
    >>> help(re.compile)
    Help on function compile in module re:
    
    compile(pattern, flags=0)
    Compile a regular expression pattern, returning a pattern object.
    
  6. We compile a pattern into a regular expression, in this case the simple pattern that matches a single letter 'a'

    pattern = re.compile('a')
    
  7. Like other things in Python, we can ask pattern what it can do

    >>> dir(pattern)
    ['findall', 'finditer', 'flags', 'groupindex', 'groups', 'match',
     'pattern', 'scanner', 'search', 'split', 'sub', 'subn']
    

    (Note that I leave out the names beginning with underscores here and in the rest of this page.)

  8. Let's try searching in strings with and without an a

    >>> pattern.search('xxxxx')
    >>> pattern.search('xxxaxxx')
    <_sre.SRE_Match at 0x177c3d8>
    

    We see that when the string does not contain the pattern, nothing is returned, but when it does contain the pattern, some kind of SRE_Match object is returned.

  9. Let's assign that object to a variable and ask what it can do

    >>> m = pattern.search('xxxaxxx')
    >>> dir(m)
    ['end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup',
     'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']
    >>> m.start
    <function start>
    >>> m.start()
    3
    >>> m.end()
    4
    >>> 'xxxaxxx'[m.start():m.end()]
    'a'
    >>> m.string
    'xxxaxxx'
    >>> m.re
    re.compile(r'a')
    

    We see that we can access the start and end points in the string where the pattern matched. We can also recover the string searched in and the regular expression sought.

  10. Suppose now that we want to search not just for the character a but for any vowel, one of the characters a, e, i, o, u. We can indicate that we want one of a set of characters by enclosing them in square brackets

    >>> pattern = re.compile('[aeiou]')
    >>> pattern.search('xxxxxx')
    >>> pattern.search('xxxexxx')
    <_sre.SRE_Match at 0x177c578>
    
  11. Using square brackets to match character classes is just one feature of regular expressions. Let's look at some others from the Python Regular Expression HOWTO.
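
    For instance, here is a small sampler of a few of the features described there (not a complete list):

    ```python
    import re

    # \d matches a digit; + means one or more of the preceding item.
    print(re.findall(r'\d+', 'Science Park 113, 1098 XG'))  # -> ['113', '1098']

    # . matches any character except a newline.
    print(bool(re.search(r'p.thon', 'python')))             # -> True

    # ^ and $ anchor the match to the start and end of the string.
    print(bool(re.match(r'^[a-z]+$', 'python')))            # -> True
    print(bool(re.match(r'^[a-z]+$', 'Python3')))           # -> False

    # ? makes the preceding item optional.
    print(re.findall(r'colou?r', 'color colour'))           # -> ['color', 'colour']
    ```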
  12. Now we can write a short script to grab a list of all postcodes on the AUC contacts page.

    """Using the urllib module, read the AUC contact page. Using the re module, get
    all occurrences of a Dutch postcode and return them as a list."""
    
    from urllib.request import urlopen
    import re
    
    postcode_re = re.compile("[0-9]{4} [A-Z]{2}")
    auc_contacts_url = "http://www.auc.nl/about-auc/contact-us/contact-us"
    
    
    def postcodes(url=auc_contacts_url):
        with urlopen(url) as p:
            # Decode the bytes returned by read() into a string.
            s = p.read().decode('utf-8', errors='replace')
            return postcode_re.findall(s)
    
    
    print(postcodes(auc_contacts_url))
    

    Here is the same script using the requests library instead of the urllib.request library.

    """Using the urllib module, read the AUC contact page. Using the re module, get
    all occurrences of a Dutch postcode and return them as a list."""
    
    import requests
    import re
    
    postcode_re = re.compile("[0-9]{4} [A-Z]{2}")
    auc_contacts_url = "http://www.auc.nl/about-auc/contact-us/contact-us"
    
    
    def postcodes(url=auc_contacts_url):
        response = requests.get(url)
        if response.ok:
            return postcode_re.findall(response.text)
        else:
            # Return an empty list rather than an error string, so that
            # callers always receive a list.
            return []
    
    
    print(postcodes(auc_contacts_url))
    
  13. Suppose we're looking at the table of contents page for the Learn Python the Hard Way book and we'd like to download and save all of the chapters. We could click on each of them in turn and then save the pages, but there are 56 of them! Shouldn't we be able to write a Python program to grab the contents page, find each of the URLs in it, and then download and save each page in turn? How could we do that?
  14. The URL for the table of contents page is

    http://learnpythonthehardway.org/book/
    

    and, if we look at the HTML source for that page we can see that each chapter appears on a line that looks something like these:

    <li><a class="reference external" href="ex28.html">Exercise 28: Boolean Practice</a></li>
    <li><a class="reference external" href="ex29.html">Exercise 29: What If</a></li>
    

    If we generalise over all of those links, we get something like

    <li><a class="reference external" href="exXX.html">YY</a></li>
    

    where XX and YY are different substrings on each line. What we would like to be able to describe is a string like that line where it doesn't matter what the XX and YY are. A regular expression which matches this is

    r'<li><a class="reference external" href="ex[0-9]+\.html">.*</a></li>'
    

    From this regular expression we would like to grab the href part so that we can save the file using that name. To do this we group it:

    r'<li><a class="reference external" href="(ex[0-9]+\.html)">.*</a></li>'
    
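
    Applied to the example lines above, findall then returns just the grouped href parts:

    ```python
    import re

    html = '''<li><a class="reference external" href="ex28.html">Exercise 28: Boolean Practice</a></li>
    <li><a class="reference external" href="ex29.html">Exercise 29: What If</a></li>'''

    chapter_re = re.compile(
        r'<li><a class="reference external" href="(ex[0-9]+\.html)">.*</a></li>')

    # findall returns the contents of the group for each match.
    print(chapter_re.findall(html))  # -> ['ex28.html', 'ex29.html']
    ```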
  15. Here is a Python script that downloads all chapters. You'll notice that it has some shortcomings.

    1. It doesn't download any of the necessary CSS style files.
    2. It doesn't download any of the images required by some of the pages.
    3. It doesn't download any of the Javascript files needed for the dynamic aspects of the pages.

    However, the script can be readily adapted to remedy all of the above.
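
    A sketch of such a script, using only the pieces introduced above (the exact page structure is an assumption, and the book's site may have changed since this was written):

    ```python
    """Download all chapters of Learn Python the Hard Way, by finding
    the chapter links on the table of contents page."""

    import re
    from urllib.request import urlopen

    base_url = 'http://learnpythonthehardway.org/book/'

    # Group the href part so that findall returns just the file names.
    chapter_re = re.compile(
        r'<li><a class="reference external" href="(ex[0-9]+\.html)">.*</a></li>')


    def download_chapters(url=base_url):
        with urlopen(url) as page:
            contents = page.read().decode('utf-8', errors='replace')
        for filename in chapter_re.findall(contents):
            with urlopen(url + filename) as chapter:
                with open(filename, 'wb') as f:
                    f.write(chapter.read())
    ```

    Calling download_chapters() saves each chapter in the current directory under the file name taken from its href, for example ex28.html.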

  16. The process of looking into web pages and extracting information from them is sometimes called Web Scraping. Be aware of the reasons why scraping people's web content might make them unhappy.
  17. Moving through the web link by link is known as Web crawling. It forms the basis of how web search engines such as Google index the content of the web so that we can search it.
  18. When you have a hammer, everything looks like a nail, so remember that regular expressions are not a panacea; they should be used sparingly. For example, writing a regular expression for matching email addresses can get out of hand. Here is an evaluation of the trade-offs. And here is one that works in almost all cases:

    email_re = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
    
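    Trying the compiled expression on a couple of simple cases:

    ```python
    import re

    email_re = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")

    print(bool(email_re.match('o@uva.nl')))      # -> True
    print(bool(email_re.match('not-an-email')))  # -> False
    ```
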

Author: Breanndán Ó Nualláin <o@uva.nl>

Date: 2025-09-04 Thu 08:55