
Monday, December 9, 2013

R: Regex Telephone Number Matching

Hello Readers,


Firstly, here is an encouraging message from President Obama, calling for more students to learn to code! Hooray, SCIENCE!


Now back to the post. Here we shall discuss using regular expressions to match North American telephone numbers in RegExr. For matching text from HTML5, check out this post.

A typical telephone number consists of a three-digit area code followed by a seven-digit number sequence, with the groups separated by hyphens, periods, or even spaces. Sometimes the area code is enclosed in parentheses as well. So how do we match and extract the numbers that we require? We turn to the flexibility of regular expressions to describe the target string we want to match. We will be using the RegExr program, found here. Let us get started.


RegExr


The picture of the RegExr window, above, shows a variety of telephone numbers, some separated with hyphens, some with periods, and others just with spaces. So how can we match the digits in the phone numbers? To demonstrate, we can first use:

1. Literal Strings

  By using the actual numbers, we can manually match the telephone numbers. For example, if we type 498, then RegExr matches the sequence of digits 498 twice, as shown below.


RegExr Matching 498
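For readers following along outside RegExr, here is a minimal sketch of the same literal match using Python's re module. The sample numbers are hypothetical stand-ins, not the ones from the screenshot.

```python
import re

# Hypothetical sample text; the numbers in the original screenshot are not reproduced here.
sample = "707-498-1234\n(498) 555-0199\n212.555.0147"

# A literal pattern matches exactly the characters typed, wherever they occur.
literal_matches = re.findall(r"498", sample)
print(literal_matches)  # ['498', '498']
```

A literal pattern only finds the exact digits we typed, which is why we need something more general for arbitrary phone numbers.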

2. Shorthand

  By using the power of regular expressions, we can incorporate shorthand character classes, which are special escape sequences that each stand for a whole set of characters. A backslash followed by a lowercase d (\d) denotes any digit. Capitalizing the d (\D) reverses the meaning: it matches any character that is not a digit, such as hyphens, periods, parentheses, etc. Both are shown below.


Matches Any Digit- Shorthand
And the reverse (conveniently the capitalized letter D):

Matches Any Non-Digit- Shorthand
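As a rough illustration of the two shorthands, here is a small Python sketch. The phone number is a made-up example, and Python's re module is used in place of RegExr.

```python
import re

# Hypothetical phone number used only for illustration.
text = "(555) 010-9999"

digits = re.findall(r"\d", text)      # every single digit, one match per character
non_digits = re.findall(r"\D", text)  # everything that is not a digit

print("".join(digits))  # 5550109999
print(non_digits)       # ['(', ')', ' ', '-']
```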
3. Quantifiers

Since telephone numbers follow a pattern of 3 digits, 3 digits, and 4 digits, we can tell the match to repeat a certain number of times by using a quantifier. For example, placing a number in curly brackets, {n}, after an expression instructs RegExr to match that expression exactly n times. This is shown below, as \d (any digit) is matched exactly three times. Note how in the last set of 4 digits, only the first 3 are matched.


Shorthand Matching with a Quantifier
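The same behaviour can be reproduced with a short Python sketch (the phone number is hypothetical). Notice that the final group of four digits only gives up its first three.

```python
import re

# Hypothetical phone number used only for illustration.
number = "555-010-9999"

# \d{3} matches exactly three consecutive digits per match.
triples = re.findall(r"\d{3}", number)
print(triples)  # ['555', '010', '999']
```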
To match the set of 4 digits as well, we can use {n1,n2} as a quantifier, where RegExr will match the preceding expression anywhere from n1 to n2 times (write it without a space after the comma, since many regex engines do not treat {3, 4} as a quantifier). Additionally, the quantifier ? matches 0 or 1 of the preceding token, whereas the + will match 1 or more. The asterisk * will match 0 or more times.


More Quantifiers- It Works!
The ? after each parenthesis makes it optional, so the expression matches the area code whether or not it is enclosed in parentheses, as well as the other digit groups in the phone number. Each digit set uses {3,4} to match three to four digits, and the optional period (.?) matches any single character, whether it be a hyphen, period, or white space. Finally, the + sign requires the whole expression to match one or more times.
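Since the screenshot is the only place the full expression appears, the pattern below is a reconstruction from the description above and may not be character-for-character what was typed into RegExr. It is a minimal Python sketch with hypothetical sample numbers.

```python
import re

# Assumed reconstruction of the expression described above:
#   \(?      optional opening parenthesis
#   \d{3,4}  three to four digits
#   \)?      optional closing parenthesis
#   .?       optional single character of any kind (hyphen, period, space, ...)
#   (?:...)+ the whole group repeated one or more times
pattern = re.compile(r"(?:\(?\d{3,4}\)?.?)+")

# Hypothetical sample numbers, one per separator style.
samples = ["(555) 010-9999", "555.010.9999", "555 010 9999"]

matched = [pattern.search(s).group(0) for s in samples]
print(matched)  # ['(555) 010-9999', '555.010.9999', '555 010 9999']
```

Because the period matches any character, this pattern is quite permissive. That is fine for exploring in RegExr, but a stricter separator would be safer on real data.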

4. Putting It Together


There are, however, many different ways to match a given string. Below is another method, which uses anchors.

Simply put, an anchor does not match a specific character in the string; it matches a position. So while we can use the caret sign, ^, to match the beginning of a line, on its own it consumes no characters and only marks where the rest of the expression must start. Likewise, the $ matches the end of a line.

See how the ^ and the $ constrain the match: the first set of tokens must match the area code, with or without the parentheses, at the start of the line, and the last set of 4 digits must finish at the end of the line. The non-digits are matched with the \D token.


Another Method Using Anchors
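Here is a hedged Python sketch of an anchored pattern along these lines. The exact expression from the screenshot is not recoverable, so both the pattern and the sample lines are assumptions made for illustration.

```python
import re

# Assumed anchored pattern: optional parentheses around a 3-digit area code,
# an optional non-digit separator (\D?) between groups, then 3 and 4 digits.
# re.MULTILINE makes ^ and $ match at the start and end of every line.
anchored = re.compile(r"^\(?\d{3}\)?\D?\d{3}\D?\d{4}$", re.MULTILINE)

# Hypothetical sample text; the third line has extra words around the number.
text = "(555) 010-9999\n555.010.9999\ncall 555-010-9999 now\n555 010 9999"

whole_line_numbers = anchored.findall(text)
print(whole_line_numbers)
# ['(555) 010-9999', '555.010.9999', '555 010 9999']
```

The third line is skipped because the anchors demand that the number fill the entire line, which is the main difference from the unanchored expression.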

Now we have seen the power and flexibility of regular expressions and how they can match numbers and different symbols. They will be very important when analyzing text data as well, especially when making sense of data in HTML5 format.

We will continue to solve more regular expression problems in future posts. Stay tuned!
As always, thanks for reading,


Wayne
