Regex: Search and Match

Table of Contents

  1. search
    1. Check the returned value
  2. match
    1. Check the returned value
  3. match only once
  4. group

There are two more methods in regex, search and match, although I don’t think they are as handy as .sub or .findall. Assume we have a string s="A8C3721D86".

search

Check the returned value

.search returns a re.Match object. It scans through the string and stops at the first position where the pattern matches; if the pattern matches nowhere, it returns None

import re

s="A8C3721D86"

r = re.search(r"\d", s)

r

<re.Match object; span=(1, 2), match='8'>
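As a small sketch of this behaviour, the patterns below are chosen only for illustration (they are not part of the original example). Each call returns the first match found when scanning from the left, or None when there is no match at all:

```python
import re

s = "A8C3721D86"

m1 = re.search(r"\d", s)      # first single digit in the string
m2 = re.search(r"\d{3}", s)   # first run of three consecutive digits
m3 = re.search(r"x", s)       # there is no "x" in s, so this is None

print(m1.group(), m1.span())
print(m2.group(), m2.span())
print(m3)
```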

match

Check the returned value

.match returns None here, because match only tries the pattern at the very beginning of the string, and s starts with the letter A rather than a digit

import re

s="A8C3721D86"

r = re.match(r"\d", s)

type(r)

NoneType
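To see the difference side by side, the sketch below runs both functions on the same string. The variable names and the extra [A-Z] pattern are only for illustration:

```python
import re

s = "A8C3721D86"

at_start = re.match(r"\d", s)     # only tries position 0, where "A" is not a digit
anywhere = re.search(r"\d", s)    # keeps scanning until it finds a digit
letter = re.match(r"[A-Z]", s)    # succeeds because s starts with an uppercase letter

print(at_start)
print(anywhere.group())
print(letter.group())
```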

Now change the start of the string to digits, so that the pattern can match at position 0

import re

s="99C3721D86"

r = re.match(r"\d{2}", s)

r

<re.Match object; span=(0, 2), match='99'>

Now check the returned object. .group() returns the matched text

r.group()

'99'

.span() returns the (start, end) positions of the match as a tuple, where end is exclusive

r.span()

(0, 2)
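Since the end index is exclusive, the span can be used directly as a slice. A minimal sketch, with variable names chosen just for this example:

```python
import re

s = "99C3721D86"
m = re.match(r"\d{2}", s)

start, end = m.span()    # end is exclusive, like a normal Python slice
piece = s[start:end]     # the same text that m.group() returns

print(start, end, piece)
```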

match only once

Unlike .findall and .sub, which process every match in the string, .search and .match return at most one match: the first one found
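A short sketch of the contrast, using the same string as before. The \d+ pattern and the use of re.finditer are additions for illustration only:

```python
import re

s = "A8C3721D86"

first = re.search(r"\d+", s)                          # stops at the first run of digits
every = re.findall(r"\d+", s)                         # collects every run of digits
spans = [m.span() for m in re.finditer(r"\d+", s)]    # one match object per run

print(first.group())
print(every)
print(spans)
```

If match objects are needed for every occurrence (for example to read .span()), re.finditer is one option, since .findall only returns strings.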

group

Now let’s look at the .group method. When scraping web pages, the data often contains HTML tags. Say I have scraped a string s="<span>Life is short, I use python</span>" and I want to get the contents between the tags

import re

s = "<span>Life is short, I use python</span>"

r = re.search("<span></span>", s)

type(r)

NoneType

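It returns None because the pattern only matches the literal text <span></span> with nothing between the tags. Now recall that we used . to match anything except for \n, so .* matches any run of characters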

r = re.search("<span>.*</span>", s)

r.group()

'<span>Life is short, I use python</span>'

It returns the whole string, tags included. We can use (.*) to capture the contents between the tags as a group

r = re.search("<span>(.*)</span>", s)

r.group()

'<span>Life is short, I use python</span>'

group’s default argument is 0, which returns the whole matching string. In this case, we should use 1 to get our desired result

r.group(1)

'Life is short, I use python'
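Because .search returns None when the pattern fails (as in the first attempt above), calling .group(1) directly on the result would raise an AttributeError. A minimal sketch of a guard follows; the helper name span_text is hypothetical and not part of the re module:

```python
import re

def span_text(html):
    """Return the text inside <span>...</span>, or None if the tags are absent."""
    m = re.search(r"<span>(.*)</span>", html)
    if m is None:          # no match, so there is no group to read
        return None
    return m.group(1)      # group 1 is the first parenthesised group

found = span_text("<span>Life is short, I use python</span>")
missing = span_text("<p>Life is short, I use python</p>")
print(found)
print(missing)
```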

This is the result I want! How about .findall?

r = re.findall("<span>(.*)</span>", s)
r

['Life is short, I use python']

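It returns exactly what I want in a list. When the pattern contains a group, .findall returns the group’s contents rather than the whole match.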
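What .findall returns depends on how many groups the pattern has. The sketch below uses a made-up string to compare the three cases: no group, one group, and two groups:

```python
import re

# Hypothetical input, chosen only to illustrate the three cases
s = "<span>Life is short</span><h3>I use python</h3>"

no_group = re.findall(r"<span>.*</span>", s)                  # whole matches
one_group = re.findall(r"<span>(.*)</span>", s)               # contents of the single group
two_groups = re.findall(r"<span>(.*)</span><h3>(.*)</h3>", s)  # one tuple per match

print(no_group)
print(one_group)
print(two_groups)
```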

Now let’s add one more group.

s = "<span>Life is short, I use python</span> So <h3>This is Python</h3>"
r = re.search("<span>(.*)</span>(.*)<h3>(.*)</h3>", s)
r.group(0)

'<span>Life is short, I use python</span> So <h3>This is Python</h3>'

r.group(1)

'Life is short, I use python'

r.group(2)

' So '

r.group(3)

'This is Python'

.groups() is a more convenient way to get all the groups at once, returned as a tuple

r.groups()

('Life is short, I use python', ' So ', 'This is Python')
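Since .groups() returns a plain tuple, it can be unpacked straight into variables. The sketch below also passes several indices to .group() in one call, which returns a tuple of just those groups; the variable names are illustrative:

```python
import re

s = "<span>Life is short, I use python</span> So <h3>This is Python</h3>"
m = re.search(r"<span>(.*)</span>(.*)<h3>(.*)</h3>", s)

span_text, between, h3_text = m.groups()   # unpack all three groups at once
tags_only = m.group(1, 3)                  # several indices give a tuple of those groups

print(span_text)
print(repr(between))
print(h3_text)
print(tags_only)
```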