Posted on Apr 24, 2016 at 1:24 AM, 7 min read

[Ruby] Get insights into Regular Expression

Last week a newbie asked me about regular expressions, and I reflexively gave him a simple but powerful answer: "Stack Overflow and Rubular" (lol). At that moment, though, I realized that my Ruby knowledge was missing a Swiss Army knife for working with strings, so I decided to spend my weekend getting some insight into regular expressions. This post isn't a handbook on Regexp, but I think it can at least help you read and understand most of the regexps you come across.

1. What is a regular expression (Regexp)?

A regular expression (Regexp for short) is a sequence of characters that describes a pattern for matching strings. In simple terms, a Regexp acts like an airport scanner that "makes hidden things visible" and "hides regular stuff".

Nowadays, regular expressions are an indispensable part of practically every modern programming language, and Ruby is no exception. A Regexp literal is delimited by two forward slashes (/), and we can easily check that Ruby has a built-in class for regular expressions:

2.2.4 :001 > //.class
 => Regexp

The space between the two slashes is reserved for the pattern, which is what the Regexp looks for in a string. For example:

2.2.4 :005 > /Hello/ =~ "Hello World"
 => 0
2.2.4 :009 > /Hello/ =~ "Hellio World"
 => nil
2.2.4 :006 > /Hello/.match("Hello World")
 => #<MatchData "Hello">
2.2.4 :006 > /Hello/.match("Hellio World")
 => nil

Here, "Hello" is the text we want to match, and it is also the pattern of this Regexp. To test for a match, Ruby provides two methods: =~, which returns the index where the match starts, and match, which returns a MatchData object. Both return nil when the string doesn't match the pattern. Sounds easy? Yes, because it's just the beginning. The next part will bring a headache to all of us.
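As a small sketch of how the two methods tend to be used (the variable names are only illustrative and not from the original session): =~ fits naturally into a condition because nil is falsy, while match hands back an object that keeps the matched text.

```ruby
greeting = "Hello World"

# =~ returns the index of the first match, or nil when there is none,
# so the result can be used directly as a condition.
position = (/Hello/ =~ greeting)
if position
  puts "Found at index #{position}"
else
  puts "Not found"
end

# match returns a MatchData object (or nil); index 0 holds the matched text.
result = /World/.match(greeting)
matched_text = result ? result[0] : nil
puts matched_text

# A pattern that does not occur in the string gives nil.
missing = /Hellio/.match(greeting)
```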

2. Digging into Regexp

2.1. Build a pattern in Regexp

We already know that a Regexp is built from the pattern sitting between two forward slashes. The pattern can contain not only literal text but also a set of rules and constraints describing what we want to look for in a string. Basically, a pattern can have three kinds of components:

Literal characters: "match this character"
The dot wildcard character (.): "match any character" (except a newline, by default)
Character classes: "match one of these characters"

Let's look at the structure of each type.

2.1.1. Literal characters

This is the simplest kind of pattern: its mission is just to match itself in the string. In the first example, we used exactly the text we wanted to match ("Hello") to build the Regexp (/Hello/). However, that only works for ordinary characters. Regexp also has a list of special characters that are reserved for its own functions, just like reserved words such as "def", "if" and "while" in Ruby. They are (^ $ ? . / \ [ ] { } ( ) + * |). If we want to match one of these reserved characters literally, we just put a backslash (\) before it. For instance, if we want to match a question mark (?):

2.2.4 :021 > /?/ =~ "How are you?"
SyntaxError: (irb):21: target of repeat operator is not specified: /?/
	from /Users/apple/.rvm/rubies/ruby-2.2.4/bin/irb:11:in `<main>'
2.2.4 :022 > /\?/ =~ "How are you?"
 => 11
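When the text to match comes from a variable, escaping by hand gets tedious. A minimal sketch, assuming we want to treat a piece of text literally (the needle variable is hypothetical), is to let Regexp.escape add the backslashes and then interpolate the result into a pattern:

```ruby
# Hypothetical input that should be matched literally, not read as a pattern.
needle = "you?"

# Regexp.escape puts a backslash in front of every special character.
escaped = Regexp.escape(needle)

# Interpolate the escaped text into a regexp literal.
pattern = /#{escaped}/

index = (pattern =~ "How are you?")
puts index
```

Without the escape step, /#{needle}/ would read the question mark as a quantifier applied to "u" instead of a literal character.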

2.1.2. The wildcard character "dot" (.)

Now I have two strings, "mangoes" and "Superman". Clearly both of them match /man/. How can I kick out "Superman" and collect my fruit? First, let's look at the structure of the two strings. Both contain "man": in one it is a prefix, in the other a suffix. To pick out my "mangoes", I use /man./

2.2.4 :038 > "mangoes" =~ /man./
 => 0
2.2.4 :040 > "superman" =~ /man./
 => nil
2.2.4 :044 > /man./.match "Superman"
 => nil
2.2.4 :045 > /man./.match "mangoes"
 => #<MatchData "mang">

From the match result, we can see that . stands for exactly one character of any kind, so here it takes the place of the "g" in "mangoes", while "Superman" has no character following "man" and therefore doesn't match. Similarly, if we want to welcome only "Superman", we use the regexp /.man/.
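To see both patterns side by side, here is a small sketch that filters a word list with Array#grep (the extra words "manual" and "human" are made up for illustration):

```ruby
words = %w[mangoes Superman manual human]

# "man" followed by at least one more character
man_then_more = words.grep(/man./)

# "man" preceded by at least one character
more_then_man = words.grep(/.man/)

p man_then_more
p more_then_man
```

The first list keeps the words where something comes after "man", the second keeps the words where something comes before it.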

2.1.3. Character classes

Awkwardly, I have three different strings, "boot", "foot" and "hoot", and I want to get just "foot" and "hoot". Writing a separate regexp for each isn't a smart idea. Using the dot (.)? Oh no, /.oot/ matches all of them. Now we need another type of pattern: character classes.

A character class is an explicit list of characters placed inside square brackets [], and it matches any one of them. For the requirement above, the regexp I need is /[fh]oot/.

2.2.4 :001 > /[fh]oot/ =~ "boot"
 => nil
2.2.4 :002 > /[fh]oot/ =~ "hoot"
 => 0
2.2.4 :003 > /[fh]oot/ =~ "foot"
 => 0

Interestingly, a character class can also be written as a range of characters, and some common classes have shorthands:

/[a-z]/: all lowercase letters from a to z
/[A-Z]/: same as above but for uppercase letters
/[0-9]/: digits from zero to nine, which has the shortcut /\d/
/\w/: matches any digit, letter or underscore
/\s/: matches any whitespace character (space, tab, newline)

But hold on a second, what should I write if I don't want to match them? List every other character and write the class the same way? Regexp is much smarter than that: just put a caret (^) at the beginning of the class and you get the negated class.

2.2.4 :004 > /[^fh]oot/ =~ "foot"
 => nil
2.2.4 :005 > /[^fh]oot/ =~ "hoot"
 => nil
2.2.4 :006 > /[^fh]oot/ =~ "boot"
 => 0

For \d, \w and \s, just switch them to upper case (\D, \W, \S) to get the negated versions.

2.2.4 :011 > /\D/ =~ "10"
 => nil
2.2.4 :012 > /\d/ =~ "10"
 => 0
2.2.4 :013 > /\d/ =~ "ab"
 => nil
2.2.4 :014 > /\D/ =~ "ab"
 => 0
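The classes above can be tried out together with String#scan, which collects every match instead of only the first one. This is just a sketch: scan is not used in the original sessions, and the sample text is made up.

```ruby
text = "Room 42, floor 7"

digits      = text.scan(/\d/)       # every single digit
capitals    = text.scan(/[A-Z]/)    # every uppercase letter
punctuation = text.scan(/[^\w\s]/)  # anything that is neither a word character nor whitespace

p digits
p capitals
p punctuation
```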

2.2. Matching data

As we have seen, =~ returns the position of the match in the original string, which only tells us whether (and where) our regexp matched. If we switch to the match method, the result we get is an object:

2.2.4 :017 > /Hello/.match "Hello world"
 => #<MatchData "Hello">

In Ruby, everything is an object, and every object is an instance of a class. Here we can see that the MatchData class exists to store the results of a match. However, this simple example can't show off the power of the match method. Let's move on to matching with parentheses.

Parentheses

I have a longer string, "Clark Ken,journalist" (stored in a variable named string in the sessions below), and these are its main features:

One or more letters, in either upper or lower case: /[a-zA-Z]+/
A whitespace character: /\s/
One or more letters again: /[a-zA-Z]+/
A comma: /,/
Finally, one or more lowercase letters: /[a-z]+/

So, from the requirements above, we get the regexp /[a-zA-Z]+\s[a-zA-Z]+,[a-z]+/ (the + quantifier is covered in section 2.3.1).

2.2.4 :028 > /[a-zA-Z]+\s[a-zA-Z]+,[a-z]+/.match string
 => #<MatchData "Clark Ken,journalist">

This object still isn't very useful, because from this string we want to get the first name and last name as well as Superman's occupation. So we need to split the pattern into sub-groups with parentheses. Let's see what happens:

2.2.4 :030 > m = /([a-zA-Z]+)\s([a-zA-Z]+),([a-z]+)/.match string
 => #<MatchData "Clark Ken,journalist" 1:"Clark" 2:"Ken" 3:"journalist">

The MatchData object has changed, and from the m variable we can retrieve the data in two ways:

Index:

2.2.4 :040 > m
 => #<MatchData "Clark Ken,journalist" 1:"Clark" 2:"Ken" 3:"journalist">
2.2.4 :042 > m[0]
 => "Clark Ken,journalist"
2.2.4 :043 > m[1]
 => "Clark"
2.2.4 :044 > m[2]
 => "Ken"
2.2.4 :045 > m[3]
 => "journalist"
2.2.4 :046 > m[4]
 => nil

Captures array:

2.2.4 :047 > m.captures
 => ["Clark", "Ken", "journalist"]

Remember that each pair of parentheses defines a group, and groups are numbered by their opening parenthesis, counting from the left. If I add another pair of parentheses around the full name, we get:

2.2.4 :048 > x = /(([a-zA-Z]+)\s([a-zA-Z]+)),([a-z]+)/.match string
 => #<MatchData "Clark Ken,journalist" 1:"Clark Ken" 2:"Clark" 3:"Ken" 4:"journalist">

Awesome, now we don't need an extra join() call to build the full name.
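Counting parentheses gets error-prone as a pattern grows. Ruby also supports named groups, written (?<name>...), which the sessions above do not use. The sketch below applies them to the same string; the group names first, last and job are my own choice.

```ruby
string = "Clark Ken,journalist"

# (?<name>...) gives a group a name in addition to its number.
pattern = /(?<first>[a-zA-Z]+)\s(?<last>[a-zA-Z]+),(?<job>[a-z]+)/
m = pattern.match(string)

first = m[:first]
last  = m[:last]
job   = m[:job]
names = m.names   # the group names, in order of appearance

puts "#{first} #{last} works as a #{job}"
```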

2.3. Quantifiers, Anchors and Modifiers

This is another outstanding feature, which makes Regexp look like a programming language rather than just a plain string. Basically, their missions are:

Quantifiers: specify how many times something should match
Anchors: require the match to occur at a certain position
Modifiers: switches that change the behavior of the regexp

Let's go through each type.

2.3.1 Quantifiers

Remember that each element of a pattern matches exactly once by default, so we need a way to tell Regexp to repeat it.

2.3.1.1. Greedy quantifiers

Greedy means that the quantifier grabs as many repetitions as it can, as long as the rest of the pattern can still match. Regexp provides two greedy quantifiers.

Zero or more

This quantifier is written with the star character (*). For instance, I need to match the closing marvel tag in this terrible XML code, however it happens to be spaced:

</marvel>
<  /marvel>
</  marvel>
</marvel   >

And everything we need here is just /<\s*\/\s*marvel\s*>/

2.2.4 :049 > rexg = /<\s*\/\s*marvel\s*>/
 => /<\s*\/\s*marvel\s*>/
2.2.4 :050 > rexg.match "</marvel>"
 => #<MatchData "</marvel>">
2.2.4 :051 > rexg.match "<    /marvel>"
 => #<MatchData "<    /marvel>">
2.2.4 :052 > rexg.match "</   marvel>"
 => #<MatchData "</   marvel>">
2.2.4 :053 > rexg.match "</marvel    >"
 => #<MatchData "</marvel    >">

Finally, we've matched the marvel tag in all its forms, and I swear I'd have words with the guy who wrote this XML if Regexp didn't have zero-or-more.

One or more

The syntax for one-or-more is the plus sign (+). For example, my outdated website has just a single text field for entering a name, and I want to get the first name as well as the last name. The problem in this case is that we don't know exactly how many spaces there are between the two words.

2.2.4 :061 > /([a-zA-Z]+)\s+([a-zA-Z]+)/.match "Bruce    Wayne"
 => #<MatchData "Bruce    Wayne" 1:"Bruce" 2:"Wayne">
2.2.4 :062 > /([a-zA-Z]+)\s+([a-zA-Z]+)/.match "Bruce Wayne"
 => #<MatchData "Bruce Wayne" 1:"Bruce" 2:"Wayne">

=> Mission accomplished

2.3.1.2. Bounded quantifiers

In contrast with the quantifiers above, these repeat the match only a limited number of times. (Strictly speaking they are still greedy; the truly non-greedy forms are written *? and +?.)

Zero or one

I want to identify either "Mr" or "Mrs" in a text, always followed by a dot (.). That means the "s" character may occur once or not at all. Regexp solves this with a question mark (?) placed after the optional character:

2.2.4 :063 > /Mrs?\./.match "Mr. Tam"
 => #<MatchData "Mr.">
2.2.4 :064 > /Mrs?\./.match "Mrs. Tam"
 => #<MatchData "Mrs.">

Specific number of repetitions

Sometimes we want to match a character a fixed number of times, for example when matching a phone number like "0164-1233456", where the first block is the network carrier code. It's clear that every character is numeric except the "-", and that the first block has exactly four digits. In Regexp, if we want to set a bound on repetition, we use curly braces ({}) and specify the number inside them. For the specs above, the regexp is /\d{4}-\d+/

2.2.4 :065 > /\d{4}-\d+/.match "0164-123456"
 => #<MatchData "0164-123456">
2.2.4 :066 > /\d{4}-\d+/.match "0164-123"
 => #<MatchData "0164-123">
2.2.4 :067 > /\d{4}-\d+/.match "0164-1"
 => #<MatchData "0164-1">
2.2.4 :068 > /\d{4}-\d+/.match "0164-"
 => nil
2.2.4 :069 > /\d{4}-\d+/.match "0164123456"
 => nil
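To make "greedy" concrete, here is a short sketch on a made-up HTML snippet. It also uses two forms that the text above does not cover: the lazy quantifier +? and the range form {min,max}.

```ruby
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs as far as it can, so the match ends at the LAST ">".
greedy = html[/<.+>/]

# Lazy: .+? stops as soon as the rest of the pattern can match,
# so the match ends at the FIRST ">".
lazy = html[/<.+?>/]

# {2,3} means "at least two and at most three"; being greedy, it takes three.
prefix = "0164-123456"[/\d{2,3}/]

puts greedy
puts lazy
puts prefix
```

String#[] with a regexp returns the matched text (or nil), which keeps the comparison short.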

2.3.2. Anchors

Anchors let us pin a match to a particular position in the string, such as the start or end of a line or a word boundary, instead of matching a character (see the table below).

Notation	Description	Sample regexp	Sample string
^	Beginning of line	/^\s*#/	"#Ruby comment"
$	End of line	/.$/	"I'm Bond\nJames Bond\n"
\A	Beginning of string	/\ADear Sir\/Madam/	"Dear Sir/Madam\nI'm Foo"
\z	End of string	/goodbye\.\z/	"Let's say goodbye."
\Z	End of string, ignoring one trailing newline	/goodbye\.\Z/	"Let's say goodbye.\n"
\b	Word boundary	/\b\w+\b/	"%%%%hello^&%$"
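The difference between the line anchors and the string anchors is easiest to see on a multi-line string. The following sketch uses a made-up two-line text:

```ruby
text = "first line\nsecond line\n"

line_start   = (text =~ /^second/)   # ^ matches at the start of any line
string_start = (text =~ /\Asecond/)  # \A matches only at the very start of the string
line_end     = (text =~ /line$/)     # $ matches at the end of any line
strict_end   = (text =~ /line\z/)    # \z demands the very end; the final "\n" is in the way
lenient_end  = (text =~ /line\Z/)    # \Z tolerates one trailing newline

# \b marks the border between word and non-word characters.
word = '%%%%hello^&%$'[/\b\w+\b/]

p [line_start, string_start, line_end, strict_end, lenient_end, word]
```

Because ^ and $ work per line, \A and \z are usually the safer choice when the whole string has to fit a pattern.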

2.3.3. Modifiers

Unlike everything above, modifiers sit outside the pattern, right after the closing slash, as in /abc/i. The i modifier makes the regexp case-insensitive, so it also matches "Abc", "aBc", "ABC"... Besides that, there are several other useful modifiers:

m: make the dot (.) match any character, including a newline
x: ignore whitespace and comments inside the pattern, so a long regexp can be broken across multiple lines
o: perform any Ruby interpolation through #{} only once, the first time the regexp is evaluated
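A quick sketch of the i, m and x modifiers in action follows. The phone pattern is the one from section 2.3.1.2, only spread over several lines; the variable names are illustrative.

```ruby
# i: ignore case
case_insensitive = ("ABC" =~ /abc/i)

# m: without it the dot refuses to match a newline, with it the dot matches anything
dot_default   = ("a\nb" =~ /a.b/)
dot_multiline = ("a\nb" =~ /a.b/m)

# x: whitespace and comments inside the pattern are ignored
phone = /
  \d{4}   # carrier code: exactly four digits
  -       # a literal dash
  \d+     # the rest of the number
/x
phone_match = phone.match("0164-123456")[0]

p [case_insensitive, dot_default, dot_multiline, phone_match]
```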

3. Conclusion

Phew, we've finally walked through the main pieces of Regexp. There's no shortcut to learning it, but I hope this post is helpful for anyone who wants to take on the challenge of understanding Regexp, so that it's no longer your nightmare.
