Regular Expressions in Python

Anusha Kodavanti
4 min read · Jul 13, 2022

Have you ever wondered how a website verifies that your new password meets the requirements of its sign-up form? Or how the find option retrieves every occurrence of a specific word on a page? Tasks like these can be handled with regular expressions, known as “regex” for short.

Regular expressions are extremely useful for matching common patterns of text. Simply put, the program reads an input and checks it against a pattern you give it.

A regular expression is built from literal characters and character ranges (a-z, A-Z, 0-9), metacharacters (\d, \w, \W, \S) and quantifiers (+, *, ?).
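As a quick illustration of these three building blocks, here is a minimal sketch. The sample strings and variable names are made up for this example.

```python
import re

sample = "order 66 was issued in year 19"  # hypothetical sample text

# Character range: runs of lowercase letters only
words = re.findall(r'[a-z]+', sample)

# Metacharacter \d with quantifier +: one or more digits in a row
numbers = re.findall(r'\d+', sample)

# Quantifier ?: the 'u' is optional, so both spellings match
spellings = re.findall(r'colou?r', "color colour")

print(words)      # ['order', 'was', 'issued', 'in', 'year']
print(numbers)    # ['66', '19']
print(spellings)  # ['color', 'colour']
```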

Recently, one of my use cases was to extract a portion of a word and then add a prefix to it. For fun, I have created a list of Game of Thrones characters, and I want to show each character's house next to their name. This could be done with ordinary string and list methods, but I wanted to use regex. When writing a regex we put r before the string literal, which tells Python to treat it as a raw string and leave backslashes alone instead of processing them as escape sequences. For example, in a normal string "\b" is a single backspace character, whereas r"\b" reaches the regex engine as a backslash plus b, which it reads as a word boundary.
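To see what the r prefix changes in practice, the sketch below compares a normal string and a raw string built around \b. The sample text is invented for illustration.

```python
import re

text = "the cat sat on a catalog"  # hypothetical sample text

# Normal string: Python turns '\b' into one backspace character
plain = '\bcat\b'
# Raw string: the backslash survives, so the regex engine sees a word boundary
raw = r'\bcat\b'

plain_matches = re.findall(plain, text)
raw_matches = re.findall(raw, text)

print(len('\b'), len(r'\b'))  # 1 2
print(plain_matches)          # [] because the text contains no backspace characters
print(raw_matches)            # ['cat'], the standalone word only, not 'catalog'
```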

# Import the regex module and pandas (pandas is needed for the data frame below)
import re
import pandas as pd

# Sample data in a list
GOT = ['Bran Stark', 'Cersei Lannister', 'Tyrion Lannister', 'Theon Greyjoy', 'Daenerys Targaryen', 'Tyrion Lannister', 'Robert Baratheon', 'Robin Arryn', 'Robb Stark']
# Prefix placed in front of each last name (assumed value; the original snippet never defined it)
prefix = 'House '
# Initialize empty lists to store values
last_name = []
House_names = []
# Use findall from the re module to capture everything after the first whitespace
for i in GOT:
    last_name.extend(re.findall(r'\s(.*)', i))
# Add the prefix to each last name
for x in last_name:
    b = prefix + str(x)
    House_names.append(b)
# Assign to a data frame
df = {'Name': GOT, 'Last Name': last_name, 'House': House_names}
df = pd.DataFrame(df)
df.head()

I have my data in a list called GOT, and I want to loop over each element and pull out the character's last name. In the parent list (GOT) there is a space between the first and last name, which I use as the starting point of the regex pattern. To express this we use \s, which matches a whitespace character; the dot (.), which matches any character except a newline (letters, digits, spaces and so on); and *, which repeats the preceding item zero or more times. The parentheses form a capturing group, so the “findall” function of the re module, which returns all matches, gives back only the part after the space. The next step is to add a prefix to each last name with a simple for loop (no regex needed) and then store the results in a data frame.

Output with prefix added
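To make the role of the parentheses concrete, this small sketch runs the pattern on a single name, once with a capturing group and once without. The name comes from the list above; the variable names are only illustrative.

```python
import re

name = 'Daenerys Targaryen'

# With a capturing group, findall returns only the captured part
with_group = re.findall(r'\s(.*)', name)
# Without a group, findall returns the whole match, leading space included
without_group = re.findall(r'\s.*', name)

print(with_group)     # ['Targaryen']
print(without_group)  # [' Targaryen']
```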

Using Metacharacters in Regex

A regex is essentially a way to describe a pattern in a form a machine can understand. It works on digits as well as letters. For this example I made up a list in which phone numbers and Social Security numbers are mixed together and need to be separated. In the US, phone numbers follow the pattern xxx-xxx-xxxx and SSNs follow xxx-xx-xxxx, and we will use these as the basis for our regex.

# Initializing lists
d = ['321-456-7654', '111-11-1111', '222-222-2222', '123-45-6789', '123.987.6457']
phonenumber = []
ssn = []
# Phone numbers: 3 digits, separator, 3 digits, separator, 4 digits
for i in d:
    phonenumber.extend(re.findall(r'\d{3}[-.]\d{3}[-.]\d{4}', i))
print('Phone Numbers :' + str(phonenumber))
# SSNs: 3 digits, separator, 2 digits, separator, 4 digits
for j in d:
    ssn.extend(re.findall(r'\d{3}[-.]\d{2}[-.]\d{4}', j))
print('ssn: ' + str(ssn))

In our regex, \d tells Python to look for a digit, the number inside {} says how many digits to expect, and the character class in square brackets, [-.], lists the separators allowed between the groups of digits. The output for this would be:

Output with two separate lists
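One thing to keep in mind is that findall looks for the pattern anywhere inside a string, so a longer run of digits could still produce a match. A possible variation, sketched below, uses re.fullmatch so that the whole string has to fit the pattern. The classify helper and its return labels are hypothetical names introduced for this example.

```python
import re

PHONE = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')
SSN = re.compile(r'\d{3}[-.]\d{2}[-.]\d{4}')

def classify(value):
    """Hypothetical helper: label a string as 'phone', 'ssn' or 'unknown'."""
    # fullmatch succeeds only if the entire string fits the pattern
    if PHONE.fullmatch(value):
        return 'phone'
    if SSN.fullmatch(value):
        return 'ssn'
    return 'unknown'

d = ['321-456-7654', '111-11-1111', '222-222-2222', '123-45-6789', '123.987.6457']
labels = [classify(v) for v in d]
print(labels)                     # ['phone', 'ssn', 'phone', 'ssn', 'phone']
print(classify('9321-456-7654'))  # 'unknown' because of the extra leading digit
```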

Back to the initial question of how a website validates your sign-up password against its requirements. Let’s say you are asked to create a password made up of letters (uppercase or lowercase), followed by at least one special character, followed by at least one digit, in that order. The text string below holds the passwords entered by various users.

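text = '''abc, Abc, ABC, A9@, Ab@9,Abc9, #Abc'''

# Letters, then special characters, then digits (A-Z, not A-z, which would also admit some punctuation)
pattern = re.compile(r'[a-zA-Z]+[@#$]+[0-9]+')

matches = pattern.finditer(text)

for match in matches:
    print(match)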

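This regex first matches one or more letters using the character class in the first pair of square brackets, which covers both the uppercase and lowercase ranges; the ‘+’ quantifier means one or more occurrences. The second part looks for at least one special character, and the third for at least one digit. Note that the letter class accepts either case, so the pattern does not by itself require both an uppercase and a lowercase letter. Of the sample passwords, only Ab@9 matches.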

Output with matching Password Requirement
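If the requirement is instead "at least one uppercase letter, one lowercase letter, one digit and one special character, in any order", one common approach is to use lookaheads. The sketch below is one possible way to write that check, not the method used above; the is_valid helper and the allowed special characters (@, # and $) are assumptions carried over from the example pattern.

```python
import re

# Each (?=...) lookahead checks one requirement without consuming characters
STRICT = re.compile(r'(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@#$])[A-Za-z\d@#$]+')

def is_valid(password):
    """Hypothetical helper: True if the whole password meets all four requirements."""
    return STRICT.fullmatch(password) is not None

candidates = ['abc', 'Abc', 'ABC', 'A9@', 'Ab@9', 'Abc9', '#Abc']
valid = [p for p in candidates if is_valid(p)]
print(valid)  # ['Ab@9']
```

Unlike the ordered pattern, this version would also accept a password such as 9@bA, because the lookaheads do not care where each kind of character appears.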

Sources:

Regex Cheat sheet: https://www.rexegg.com/regex-quickstart.html

More on Regex: https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
