Regex: General

Table of Contents

  1. Regex: General
    1. What is regex?
      1. Using Built-in Python Function
      2. Use Regex module
    2. Special char vs. Regular char
      1. Using Regex
      2. Special string char
    3. General char
    4. Counts
      1. match fixed length
      2. match a ranged length
    5. Greedy vs. Non-Greedy Search
      1. Greedy
      2. Non-greedy
    6. Matching Times
      1. Matching 0 or infinite times Use *
      2. Matching 1 or infinite times, Use +
      3. Matching 1 or 0 time, use ?
      4. Note on ?
    7. Boundary Match
      1. Validate an account
    8. Grouping
      1. Check certain times
    9. The third parameter
      1. match the case
      2. Don’t care the case
      3. match with .

Regex: General

I will go through regex (regular expressions) in Python here.

What is regex?

  • A regex is a special string that describes a pattern
  • It checks whether a target string matches the pattern we are looking for
  • It’s used for quick text search and replace
  • In Python, simple searches can use the built-in string functions; pattern searches use the re module
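
The "search and replace" use mentioned above could look roughly like the sketch below. It only uses re.search and re.sub from the standard library; the replacement text "CSharp" is just an illustrative choice.

```python
import re

a = "C|C++|Java|C#|Python|Javascript"

# Search: re.search returns a match object for the first hit, or None.
m = re.search("Python", a)
print(m.start())  # index where the match begins

# Replace: re.sub returns a new string with every match replaced.
b = re.sub("C#", "CSharp", a)
print(b)
```

The original string a is left unchanged, since Python strings are immutable.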

Using Built-in Python Function

# use the built-in str.index method
a = "C|C++|Java|C#|Python|Javascript"
a.index("Python")

14

# Alternatively we can use the in keyword
"Python" in a

True

Use Regex module

import re
# use re module
result = re.findall("Python", a)
result

['Python']

Special char vs. Regular char

  • Goal: get all the digits from the string “C20|C8+8+8|Java|C#|Python|Javascript”
  • This can be done with a for loop, which is clumsy, or with a regex
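
For comparison, the for-loop version might look like the sketch below. It relies on str.isdigit, which is fine for the plain ASCII digits in this string.

```python
a = "C20|C8+8+8|Java|C#|Python|Javascript"

# Walk the string char by char and keep only the digits.
digits = []
for ch in a:
    if ch.isdigit():
        digits.append(ch)

print(digits)
```

It works, but the loop has to be rewritten for every new kind of pattern, which is where regex pays off.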

Using Regex

  1. Find all the digits: use the special sequence \d

    import re
    
    a = "C20|C8+8+8|Java|C#|Python|Javascript"
    
    # in re, \d is a special char for digits; r"" keeps the backslash as-is
    
    result = re.findall(r"\d", a)
    
    result
    
    ['2', '0', '8', '8', '8']
    
  2. Find all the non-digits: use the special sequence \D

    import re
    
    a = "C20|C8+8+8|Java|C#|Python|Javascript"
    
    # in re, \D is a special char for non-digits
    
    result = re.findall(r"\D", a)
    
    result
    
    ['C', '|', 'C', '+', '+', '|', 'J', 'a', 'v', 'a', '|', 'C', '#', '|', 'P', 'y', 
    't', 'h', 'o', 'n', '|', 'J', 'a', 'v', 'a', 's', 'c', 'r', 'i', 'p', 't']
    

Special string char

  1. Find strings that contain certain chars

    Given a list of strings, find the ones whose middle letter is c or f. We can use [] to list the candidate chars; any one of them may match (or logic). The regular chars a and c around it define the boundary

    import re
    
    s = "abc, acc, adc, aec, afc, ahc"
    
    # no commas inside []: a comma there would itself be matched literally
    r = re.findall("a[cf]c", s)
    
    r
    
    ['acc', 'afc']
    
  2. Find strings that don’t contain certain chars: use ^ as the first char inside []

    import re
    
    s = "abc, acc, adc, aec, afc, ahc"
    
    r = re.findall("a[^cfe]c", s)
    
    r
    
    ['abc', 'adc', 'ahc']
    
  3. A more concise way

    In [] we don’t have to list all the chars. Instead, we can just give a range as [start_char-end_char]

    import re
    
    s = "abc, acc, adc, aec, afc, ahc, arc, azc,"
    
    r = re.findall("a[b-f]c", s)
    
    r
    
    ['abc', 'acc', 'adc', 'aec', 'afc']
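
Ranges can also be combined inside one [], and a hyphen placed first in the brackets is treated as a literal char. A small sketch, with an extra "a-c" entry added to the string purely for illustration:

```python
import re

s = "abc, acc, adc, aec, afc, ahc, arc, azc, a-c"

# Two ranges in one class: middle letter is b-d or x-z.
r1 = re.findall(r"a[b-dx-z]c", s)
print(r1)

# A leading hyphen is literal: middle char is "-" or "b".
r2 = re.findall(r"a[-b]c", s)
print(r2)
```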
    

General char

Get all the digit chars

import re

a = "python11111java6789python"

# get all the numbers, can use \d or [0-9]

r = re.findall("[0-9]", a)

r

['1', '1', '1', '1', '1', '6', '7', '8', '9']

Get all the non-digit chars

import re

a = "python11111java6789python"

# get all the non-digits, can use \D or [^0-9]

r = re.findall("[^0-9]", a)

r

['p', 'y', 't', 'h', 'o', 'n', 'j', 'a', 'v', 'a', 'p', 'y', 't', 'h', 'o', 'n']

Get all the letter and digit chars: \w matches letters, digits and _

import re

# add a number of non-word chars
a = "!@#@!#@!#@!#@python11111java6789python"

# get all the word chars

r = re.findall(r"\w", a)

# for ASCII text, \w is equivalent to [a-zA-Z0-9_]

r

['p', 'y', 't', 'h', 'o', 'n', '1', '1', '1', '1', '1', 'j', 'a', 
'v', 'a', '6', '7', '8', '9', 'p', 'y', 't', 'h', 'o', 'n']
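
One caveat worth a quick sketch: with str patterns in Python 3, \w also matches non-ASCII letters unless the re.ASCII flag is passed. The sample string below is made up for illustration.

```python
import re

# "naïve_1 café", written with escapes so the accented letters are unambiguous
a = "na\u00efve_1 caf\u00e9"

# Default (Unicode) mode: accented letters count as word chars.
unicode_hits = re.findall(r"\w", a)

# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_hits = re.findall(r"\w", a, re.ASCII)

print(unicode_hits)
print(ascii_hits)
```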

Get all the non-word chars with \W, including whitespace chars

import re

# add whitespace chars at the end
a = "!@#@!#@!#@!#@python11111java6789python\t\n\r"

# get all the non-word chars

r = re.findall(r"\W", a)

# for ASCII text, \W is equivalent to [^a-zA-Z0-9_]

r

['!', '@', '#', '@', '!', '#', '@', '!', '#', '@', '!', '#', '@', '\t', '\n', '\r']

Get all the whitespace chars: use \s

import re

# the string ends with whitespace chars
a = "!@#@!#@!#@!#@python11111java6789python\t\n\r"

# get all the whitespace chars

r = re.findall(r"\s", a)

# \s matches whitespace such as space, \t, \n and \r

r

['\t', '\n', '\r']

Counts

  • So far, each pattern matches only 1 char at a time
  • How do we match a whole word?

match fixed length

Put the count in {} after the pattern, e.g. [a-z]{3} matches exactly 3 lowercase letters

import re

a = "!@#@!#@!#@!#@python11111java6789python\t\n\r"

r = re.findall("[a-z]{3}", a)

r

['pyt', 'hon', 'jav', 'pyt', 'hon']

match a ranged length

import re

# {3,10} matches 3 to 10 lowercase letters
a = "!@#@!#@!#@!#@python11111java6789javascript\t\n\r"

r = re.findall("[a-z]{3,10}", a)

r

['python', 'java', 'javascript']

Greedy vs. Non-Greedy Search

Greedy

  • In the example above, when searching for “python”, the regex already has a valid match at “pyt”, so it could have stopped there. Why does it return the whole word?
  • By default, the search is greedy. After matching “pyt” the regex does not stop; it keeps consuming chars, up to 10, and only stops when the pattern fails on “1”, which yields the whole “python”

Non-greedy

How to achieve a non-greedy search? Add ? right after the quantifier, and the regex will stop at the minimum count

import re

# add a number of non-word chars
a = "!@#@!#@!#@!#@python11111java6789javascript\t\n\r"

# add ? at the end
r = re.findall("[a-z]{3,10}?", a)

r

['pyt', 'hon', 'jav', 'jav', 'asc', 'rip']
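
The difference matters most when a pattern sits between delimiters. Here is a minimal sketch on a made-up HTML-like string, using + and +? (the same idea applies to {m,n} and {m,n}?):

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last ">" in the string.
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first ">" it can.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```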

Matching Times

  • A quantifier controls how many times the preceding char (or group) is matched; there are several options

Matching 0 or infinite times Use *

* applies to the char right before it, here the final n. “pyth0n” will not be matched because of the 0, but “pytho” will be, since * lets the n appear 0 or more times

import re

a = "pyth0npytho1python2pythonn3"

r = re.findall("python*", a)

r

['pytho', 'python', 'pythonn']

Matching 1 or infinite times, Use +

Neither “pyth0n” nor “pytho” will be matched, since + requires at least one n

import re

a = "pyth0npytho1python2pythonn3"

r = re.findall("python+", a)

r

['python', 'pythonn']

Matching 1 or 0 time, use ?

“pytho” and “python” will be matched. “pythonn” is matched only up to “python”: ? allows at most one n, so the last n is left out

import re

a = "pyth0npytho1python2pythonn3"

r = re.findall("python?", a)

r

['pytho', 'python', 'python']

Note on ?

After a quantifier such as {m,n}, ? switches it to non-greedy mode. By default, it’s the greedy mode

import re

a = "pyth0npytho1python2pythonn3"

r = re.findall("python{1,2}", a)

r

['python', 'pythonn']

Now in non-greedy mode, it will only take the minimum count in {m,n}

import re

a = "pyth0npytho1python2pythonn3"

r = re.findall("python{1,2}?", a)

r

['python', 'python']

After a plain char or pattern, ? means matching 0 or 1 time

import re

a = "pyth0npytho1python2pythonn3"

r = re.findall("python?", a)

r

['pytho', 'python', 'python']

Boundary Match

  • Boundary match can be used for data validation, e.g. emails, phone numbers etc.

Validate an account

  1. Without boundaries, an overly long account still passes

    Assume an account is 4 to 8 digits long. \d{4,8} alone can’t reject an account longer than 8 digits: it simply matches the first 8 digits

    import re
    
    qq_1 = "100010011"
    
    r = re.findall(r"\d{4,8}", qq_1)
    
    r
    
    ['10001001']
    
  2. Use boundary match

    Use "^\d{4,8}$": ^ means the string starts here and $ means the string ends here, so the whole string has to match

    import re
    
    qq_1 = "100010011"
    
    r = re.findall(r"^\d{4,8}$", qq_1)
    
    r
    
    []
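
For validation, an alternative to ^ and $ is re.fullmatch, which only succeeds when the whole string matches. The helper name is_valid_account below is hypothetical, and the 4 to 8 digit rule is the same assumption as above.

```python
import re

def is_valid_account(account):
    """Return True if account consists of exactly 4 to 8 digits."""
    # fullmatch anchors the pattern at both ends of the string.
    return re.fullmatch(r"\d{4,8}", account) is not None

print(is_valid_account("10001"))      # within 4 to 8 digits
print(is_valid_account("100010011"))  # 9 digits, too long
print(is_valid_account("1234\n"))     # trailing newline is rejected
```

One reason to prefer fullmatch here: $ also matches just before a newline at the very end of the string, so "^\d{4,8}$" would accept "1234\n".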
    

Grouping

Grouping is used to check whether a string contains a pattern repeated a certain number of times

Check certain times

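Check whether the string below contains Python repeated 3 times followed by JS repeated 2 times. Use (), which is and logic: every char in the group must match, in order, whereas [] is or logic. Note that when the pattern contains groups, findall returns the text captured by each group rather than the whole match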

import re

a = "PythonPythonPythonPythonJSJSJSJSJSPyPythonJSJSJSJSJSPython"

r = re.findall("(Python){3}(JS){2}", a)

r

[('Python', 'JS')]
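
To get the whole matched text instead of the group contents, two options are sketched below: non-capturing groups (?:...) with findall, or re.search with group(0).

```python
import re

a = "PythonPythonPythonPythonJSJSJSJSJSPyPythonJSJSJSJSJSPython"

# (?:...) groups without capturing, so findall returns the full match.
whole = re.findall(r"(?:Python){3}(?:JS){2}", a)
print(whole)

# With capturing groups, group(0) is still the whole match.
m = re.search(r"(Python){3}(JS){2}", a)
print(m.group(0), m.start(), m.groups())
```

The match starts at index 6 rather than 0, because the string opens with four Pythons in a row and the pattern needs exactly three directly before JS.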

The third parameter

match the case

import re
lan = "PythonC#Java"
r = re.findall("c#", lan)
r

[]

Don’t care the case

import re
lan = "PythonC#Java"
r = re.findall("c#", lan, re.I)
r

['C#']

match with .

. will match any char except for \n; with the re.S flag, it matches \n as well

import re

lang="pythonC#java\njavascript"

r = re.findall("c#.{1}", lang, re.I | re.S) # | combines flags: both re.I and re.S are switched on

r

['C#j']
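
In the example above the char after C# is j, so re.S makes no visible difference. A variant where the next char is \n shows the effect; the string here is adjusted for illustration.

```python
import re

# the newline now comes right after "C#"
lang = "pythonC#\njavascript"

# Without re.S, . refuses to match \n, so nothing is found.
without_s = re.findall(r"c#.", lang, re.I)

# With re.S, . matches \n too.
with_s = re.findall(r"c#.", lang, re.I | re.S)

print(without_s)
print(with_s)
```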