Simplifying Regex Matching
So… One of the most difficult things I found when learning about string matching was understanding regex. A quick rundown if you’re not sure what it is: regex stands for regular expression, a pattern that describes which characters (letters, digits, whitespace, and so on) to look for within a string (or within a variable holding one).
Things like \d+\s+\w+, \d\d\d-\d\d\d-\d\d\d\d, or even \d{3}-\d{3}-\d{4} look confusing…
Essentially, it lets you find a string, or any string that fits a pattern, inside of another string.
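To make that a little more concrete, here is a minimal sketch showing that the spelled-out phone pattern and the {3} shorthand above find the same text. The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, used only for illustration
text = 'Call 415-555-9999 today'

# Spelled-out form: one \d per digit
long_form = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text)
# Shorthand form: {n} repeats the preceding \d n times
short_form = re.search(r'\d{3}-\d{3}-\d{4}', text)

print(long_form.group())   # 415-555-9999
print(short_form.group())  # 415-555-9999
```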
Matching Numbers
Say you need to match a phone number within a string. You could be pulling it out to add to a database, or to save it as a variable. You could use:
# import the regex library at the start of the block
import re
def phoneRegexGroup():
    # findall() returns one tuple of groups per match; loop over them
    phonenum = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
    mo = phonenum.findall('Cell: 415-555-9999 Work: 212-555-0000')
    for m in mo:
        print(m)
# call the function so the matches are actually printed
phoneRegexGroup()
The above block prints each match as a tuple of its three groups: ('415', '555', '9999'), then ('212', '555', '0000').
So what is going on here? First we are defining what kind of thing we should be matching, in this case numbers. In the example above we can see that the regex \d{3} is used: ‘\d’ matches a single digit, and ‘{3}’ asks for exactly three of them in a row. On its own, \d{3} matches any three consecutive digits in a string; search() returns only the first such match, while findall() returns every one.
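As a rough sketch of how \d{3} behaves on its own, the snippet below runs it against the same sample string with both search() and findall(). The variable names are only illustrative.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'
three_digits = re.compile(r'\d{3}')

# search() stops at the first run of three digits
first = three_digits.search(text)
print(first.group())  # 415

# findall() keeps scanning and returns every non-overlapping match
every = three_digits.findall(text)
print(every)  # ['415', '555', '999', '212', '555', '000']
```

Notice that ‘9999’ contributes ‘999’: \d{3} by itself will happily take three digits out of a longer run, which is why the full phone pattern spells out the hyphens and the final \d{4}.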
So we have found the digits that we were matching for, now what? Well, we can group the matches and print them out. findall() returns a plain list, which has no group() method, so if you only wanted the first number you’d call search() instead: mo = phonenum.search('Cell: 415-555-9999 Work: 212-555-0000'), then print(mo.group()), which would print the number ‘415-555-9999’.
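Here is a small sketch of group() in action, assuming the same pattern and sample string as above. The names phone_pattern and match are illustrative, not from the original example.

```python
import re

phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
match = phone_pattern.search('Cell: 415-555-9999 Work: 212-555-0000')

# search() returns None when nothing matches, so check before using the match
if match is not None:
    print(match.group())    # whole match: 415-555-9999
    print(match.group(1))   # first parenthesised group: 415
    print(match.groups())   # all groups: ('415', '555', '9999')
```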
Matching Letters
Say we want to match something that isn’t just numbers… like a word or part of a string, for example. In this example we’ll be trying to match each number together with the word that follows it.
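import re
def xmasRegex():
    # one or more digits, one whitespace character, one or more word characters
    xmas = re.compile(r'\d+\s\w+')
    mo = xmas.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
    for m in mo:
        print(m)
# call the function so the matches are actually printed
xmasRegex()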
So what does the code block above do? Let’s break it down. Firstly, we have set our regex variable up: xmas = re.compile(r'\d+\s\w+'). We’ve added something new here, ‘+\s\w+’, so what does this do? We know that \d is for a digit, and + means ‘one or more of the preceding item’, but what are \s and \w? Well, \s matches any space, tab, or newline, essentially making \s the ‘space matcher’ for our regex. \w matches any digit, letter, or underscore. So what we have with the string above is essentially ‘find one or more digits, followed by a single whitespace character, followed by one or more word characters’, and findall() repeats that search across the whole string.
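If you want the number and the word separately, one option is to wrap each part in parentheses, as in the phone number example. The sketch below uses a shortened version of the string, and the names gift_pattern and pairs are made up for illustration.

```python
import re

# Same idea as \d+\s\w+, with a capturing group around each part
gift_pattern = re.compile(r'(\d+)\s(\w+)')
pairs = gift_pattern.findall('12 drummers, 11 pipers, 10 lords')

for count, gift in pairs:
    # the groups come back as strings, so convert the count if you need a number
    print(int(count), gift)
```

Because \w does not match a comma, each word stops right before the ‘,’ that follows it.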
Custom Character Classes
Say we wanted to match just a specific set of letters and pull them from a string? We can do that with custom character classes. A simple example would be removing all the vowels from the alphabet:
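import re
def vowelRegex():
    # the ^ right after [ negates the class: match anything that is NOT a vowel
    vowelRegexExclude = re.compile(r'[^aeiouAEIOU]')
    mo = vowelRegexExclude.findall('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
    print(mo)
# call the function so the result is actually printed
vowelRegex()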
The above example matches every character that is not a vowel, so the vowels are left out of the result. How’d this happen? Well, the caret placed right after the opening bracket negates the character class, meaning it matches anything except its contents. If we take out the caret, the class becomes [aeiouAEIOU].
Removing the caret after the bracket causes it to ONLY display the wanted letters: the vowels themselves.
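A minimal sketch of the caret-less version, using the same alphabet string (the variable names are illustrative):

```python
import re

alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

# No caret: the class matches only the characters listed inside the brackets
vowel_pattern = re.compile(r'[aeiouAEIOU]')
vowels = vowel_pattern.findall(alphabet)
print(vowels)  # ['A', 'E', 'I', 'O', 'U', 'a', 'e', 'i', 'o', 'u']
```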
Dive In
The world of regex can be confusing and a little frustrating at times, but once you understand the basics you can move forward to much more complex matching. I recommend testing your patterns with regexpal and Regexr to help expand your understanding a little more!
Character Class Cheat Sheet
Shorthand Character Class | Represents |
---|---|
\d | Any digit from 0 to 9 |
\D | Any character that isn’t a digit from 0 to 9 |
\w | Any letter, digit, or underscore (a ‘word’ character) |
\W | Any character that isn’t a letter, digit, or underscore |
\s | Any space, tab, or newline (a ‘space’ character) |
\S | Any character that is not a space, tab, or newline |
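To see the cheat sheet in action, the sketch below runs each shorthand class against one small sample string. The sample string and the names classes and results are made up for illustration. One caveat worth knowing: with ordinary Python 3 strings, \d and \w also match digits and letters from other scripts unless you pass the re.ASCII flag.

```python
import re

# Hypothetical sample: letter, digit, underscore, space, letter, digit, tab, '!'
sample = 'A1_ b2\t!'

classes = [r'\d', r'\D', r'\w', r'\W', r'\s', r'\S']

# Map each shorthand class to the list of single characters it matches
results = {pattern: re.findall(pattern, sample) for pattern in classes}

for pattern, matches in results.items():
    print(pattern, matches)
```

Each uppercase class returns exactly the characters its lowercase partner skips, which is a handy way to remember the pairs.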