Regex: Sub

Table of Contents

  1. Introducing re.sub
  2. The power of re.sub
  3. Function as a parameter

There is more to regex than just finding matches. Here I will discuss replacement.

Introducing re.sub

re.sub first finds every match of the pattern, then replaces each matched string.

import re

lang = "PythonC#PHPC#C#C#C#"

# replace C# with GO

r = re.sub("C#", "GO", lang, 0)

r

'PythonGOPHPGOGOGOGO'

The 4th parameter, count, is 0 here, which means every occurrence of C# is replaced (0 is also the default). If we change 0 to 1, only the first C# is replaced.

import re

lang = "PythonC#PHPC#C#C#C#"

r = re.sub("C#", "GO", lang, 1)

r

'PythonGOPHPC#C#C#C#'
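The same calls can be written with count as a keyword argument, which is easier to read, and passing count positionally is deprecated since Python 3.13. A small sketch (the variable names replace_all and replace_two are only for illustration):

```python
import re

lang = "PythonC#PHPC#C#C#C#"

# count=0 (the default) replaces every match
replace_all = re.sub("C#", "GO", lang, count=0)

# count=2 replaces only the first two matches, scanning left to right
replace_two = re.sub("C#", "GO", lang, count=2)

print(replace_all)  # PythonGOPHPGOGOGOGO
print(replace_two)  # PythonGOPHPGOC#C#C#
```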

With a plain string as the pattern, re.sub behaves much like the built-in str.replace method.

lang = "PythonC#PHPC#C#C#C#"

r = lang.replace("C#", "GO")

r

'PythonGOPHPGOGOGOGO'

str.replace also takes an optional count, so we can replace only once.

lang = "PythonC#PHPC#C#C#C#"

r = lang.replace("C#", "GO", 1)

r

'PythonGOPHPC#C#C#C#'
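Where the two differ is that str.replace only handles a literal substring, while re.sub takes a pattern. As a rough sketch, the alternation "C#|PHP" below is my own illustration (it is not part of the examples above) and replaces either language in one pass:

```python
import re

lang = "PythonC#PHPC#C#C#C#"

# str.replace: literal substring only
literal = lang.replace("C#", "GO")

# re.sub: the pattern "C#|PHP" matches either "C#" or "PHP"
either = re.sub("C#|PHP", "GO", lang)

print(literal)  # PythonGOPHPGOGOGOGO
print(either)   # PythonGOGOGOGOGOGO
```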

The power of re.sub

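re.sub is pretty powerful because its 2nd parameter can be a function. re.sub calls that function once per match and uses the return value as the replacement. Start with an empty function: it returns nothing, so the first C# is replaced with nothing and disappears from the result.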

import re

def convert(x):
    pass

lang = "PythonC#PHPJavascriptC#"

r = re.sub("C#{1,2}", convert, lang, 1)

r

'PythonPHPJavascriptC#'
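The empty function works because it returns None, which re.sub treats as an empty replacement. If the goal really is to delete the match, returning an empty string says so explicitly. A minimal sketch (the function name remove is hypothetical):

```python
import re

def remove(match):
    # Returning "" deletes the matched text explicitly
    return ""

lang = "PythonC#PHPJavascriptC#"

# count=1: only the first match of C followed by one or two # is removed
result = re.sub("C#{1,2}", remove, lang, count=1)

print(result)  # PythonPHPJavascriptC#
```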

Now let's do something in the function. The argument it receives is a match object, and calling .group() on it returns the matched string.

import re

def convert(x):
    return "!!"+x.group()+"!!"

lang = "PythonC#PHPJavascriptC#"

r = re.sub("C#", convert, lang)

r

'Python!!C#!!PHPJavascript!!C#!!'
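The match object carries more than the matched text. As a sketch, the function below also records where each match starts and ends using .start() and .end() (the names tag and seen are just for this illustration):

```python
import re

seen = []  # collects (text, start, end) for every match

def tag(match):
    # match.start() and match.end() give the position in the original string
    seen.append((match.group(), match.start(), match.end()))
    return "!!" + match.group() + "!!"

lang = "PythonC#PHPJavascriptC#"
result = re.sub("C#", tag, lang)

print(result)  # Python!!C#!!PHPJavascript!!C#!!
print(seen)    # [('C#', 6, 8), ('C#', 21, 23)]
```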

When the function is a one-liner, a lambda is a natural fit.

import re

lang = "PythonC#PHPJavascriptC#"

r = re.sub("C#", lambda x:"GO", lang)

r

'PythonGOPHPJavascriptGO'
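A lambda that ignores its argument does the same job as a plain replacement string, so a lambda is most useful when it actually uses the match. A small sketch that lowercases each match instead of replacing it with a constant:

```python
import re

lang = "PythonC#PHPJavascriptC#"

# The lambda receives the match object and derives the replacement from it
result = re.sub("C#", lambda m: m.group().lower(), lang)

print(result)  # Pythonc#PHPJavascriptc#
```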

Function as a parameter

Assume we have a string s = "ABC3721D86". If a digit is smaller than 6, replace it with 0; otherwise replace it with 9.

import re

s = "ABC3721D86"

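def convert(x):
    # x.group() is a str, so convert it to int before comparing
    return "0" if int(x.group()) <= 5 else "9"

# Raw string, so the backslash in \d reaches the regex engine untouched
r = re.sub(r"\d", convert, s)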

r

'ABC0900D99'
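If the threshold and the two replacement values need to vary, one option is to build the function with a small factory. This is only a sketch: make_converter and its parameters are hypothetical names, not part of the re module.

```python
import re

def make_converter(threshold, low, high):
    """Return a replacement function for re.sub (hypothetical helper)."""
    def convert(match):
        # The matched text is a str, so convert it before comparing
        return low if int(match.group()) <= threshold else high
    return convert

s = "ABC3721D86"

# Same rule as above: digits up to 5 become "0", the rest become "9"
result = re.sub(r"\d", make_converter(5, "0", "9"), s)

print(result)  # ABC0900D99
```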

Accepting a function as a parameter is a really good design whenever a customized operation is needed.

Now take the digits two at a time in s = "A88C3271221D8162". If the two-digit number is 50 or less, replace it with -sm; otherwise replace it with -lg.

import re

s = "A88C3271221D8162"

def convert(x):
    return "-sm" if int(x.group()) <= 50 else "-lg"

r = re.sub(r"\d{2}", convert, s)

r

'A-lgC-sm-lg-sm1D-lg-lg'
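The stray 1 in the result comes from how matching proceeds: matches are found left to right and do not overlap, so 3271221 is split into 32, 71, 22 and the leftover 1 has no partner. A quick sketch with re.findall makes the pairs visible:

```python
import re

s = "A88C3271221D8162"

# Every non-overlapping pair of digits, in the order re.sub sees them
pairs = re.findall(r"\d{2}", s)

# Deleting the pairs shows what the pattern never matched
leftover = re.sub(r"\d{2}", "", s)

print(pairs)     # ['88', '32', '71', '22', '81', '62']
print(leftover)  # AC1D
```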