Substitution & Translation in Regular Expressions with Perl

Substitution and translation are two of Perl's most useful operators: they find text that matches a pattern and change it. In the previous article, Matching Regular Expression with Perl, I explained what the match operators are and how you can use regular expressions to find patterns in strings. Substitution and translation sit alongside matching as an integral part of Perl's regular expression operators. They are used to change strings, which gives us the tools to manipulate our information in any way we wish. We could scan an entire text file and change every “sky” to “earth” if we wanted to.


Let’s take a quick example of substitutions in regular expressions. In its simplest form, substitution works as follows:

$string =~ s/a/b/;

This will replace the first “a” in $string with a “b”. If we wanted to replace all “a”s with “b”s, all we need to do is put a “g” (for global) at the end of the expression, like so:

$string =~ s/a/b/g;
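To see the difference between the two forms side by side, here is a minimal sketch; the variable names and the sample word are made up for illustration.

```perl
use strict;
use warnings;

# Two copies of the same made-up sample word
my $first_only = "banana";
my $global     = "banana";

$first_only =~ s/a/b/;    # without "g": only the first "a" is replaced
$global     =~ s/a/b/g;   # with "g": every "a" is replaced

print "$first_only\n";    # prints "bbnana"
print "$global\n";        # prints "bbnbnb"
```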

We can use all of the special operators with substitution that we used with match. For example, if we were working on the phone number example from the previous article and wanted to smooth over user input issues by removing everything that is not a digit, we could use the following:

$string =~ s/[^0-9]//g;
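As a quick illustration, the sketch below runs that substitution on a made-up phone number. It works on a copy so the original input is left alone, which is a choice of this example rather than a requirement.

```perl
use strict;
use warnings;

my $input = "(555) 123-4567 ext. 89";      # made-up sample input

# Copy $input into $digits, then strip everything that is not a digit
(my $digits = $input) =~ s/[^0-9]//g;

print "$digits\n";   # prints "555123456789"
```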

This replaces anything matched by the first expression, i.e. anything except a digit, with what’s in the second expression, which is empty. However, we can’t do something like the following if we want to make all vowels uppercase:

$string =~ s/[aeiou]/[AEIOU]/g;

Square bracket notation does not work in the replacement side of the substitution, since in general there would be no way of knowing which character should be inserted. Instead this will replace every vowel with the string “[AEIOU]”. To properly replace all lowercase vowels with their uppercase equivalents, we can use another tool, translation:

$string =~ tr/aeiou/AEIOU/;
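Running both attempts on the same made-up word shows the difference; this is only a small sketch and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $with_s  = "perl";
my $with_tr = "perl";

$with_s  =~ s/[aeiou]/[AEIOU]/g;   # the replacement is the literal text "[AEIOU]"
$with_tr =~ tr/aeiou/AEIOU/;       # each vowel maps to its uppercase partner

print "$with_s\n";    # prints "p[AEIOU]rl"
print "$with_tr\n";   # prints "pErl"
```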

Translation works on a per character basis, replacing each character in the first list with the character at the same position in the second list. Handily, if the second list is shorter than the first, its last character is repeated to fill the gap, allowing us to write an expression like:

$string =~ tr/0-9/ /;

which replaces every digit with a space. Note that ranges in a translation are written without square brackets; inside tr, a bracket is just another character to translate. Translation is a simple operation: there’s no way to handle repetition or grouping, so it’s suitable only for basic replacements. For anything more substantial you’re better off with a series of substitutions.
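Two details of tr are easy to check with a short sketch: a range is written without square brackets, and the operator returns the number of characters it matched, so an empty second list can be used just to count. The sample text and variable names below are made up.

```perl
use strict;
use warnings;

my $text = "Room 101, floor 7";            # made-up sample input

# Blank out every digit in a copy of the text
(my $blanked = $text) =~ tr/0-9/ /;

# With an empty second list tr changes nothing and simply counts matches
my $digit_count = ($text =~ tr/0-9//);

print "[$blanked]\n";
print "$digit_count\n";    # prints "4"
```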

Now let’s look at how you can use these regular expression tools in a real program: a simple command line utility to help you cheat at crossword puzzles. We want a program which takes in incomplete information about a word and then searches a word list for possible solutions. Virtually all UNIX-based systems (e.g. Linux and Mac) come with a reasonable word list, usually found at /usr/share/dict/words, but Windows users can pick one up here.

A Perl program to solve this task could be written like this:
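What follows is a minimal sketch rather than a definitive listing. The helper name clue_to_regex is hypothetical, and anchoring the pattern with ^ and $ so that only whole words match is an assumption of this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a clue into a pattern, one period per gap.
sub clue_to_regex {
    my ($clue) = @_;
    $clue =~ s/ /./g;        # each space (an unknown letter) becomes "any character"
    return qr/^$clue$/;      # assumption: anchor so the whole word must fit the clue
}

# Act as a filter only when a clue was given on the command line.
if (@ARGV) {
    my $regex = clue_to_regex(shift @ARGV);
    while (my $line = <STDIN>) {
        print $line if $line =~ $regex;   # keep the lines that fit the clue
    }
}
```

Saved under a file name of your choice, the script reads the word list on standard input and takes the clue, with one space per unknown letter, as its first argument.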

Running quickly through this example: first we take the first command line argument, then replace all gaps with periods, then use the result as the pattern in a regular expression match, filtering standard input for lines that match the pattern. When I run it like so:

cat /usr/share/dict/words | perl "h l"

the following output is printed:


Or, more usefully:

cat /usr/share/dict/words | perl "ab lu y"

prints “absolutely”.

Command line aficionados may notice that we’ve just implemented a very stripped down version of the common utility “grep”. In fact, the previous command could easily be replaced by:

grep "" /usr/share/dict/words

grep is an extremely handy utility for searching text files using regular expressions, but be careful: the syntax grep uses is not identical to Perl’s. For more info, take a look at the grep manual page by typing “man grep” in your shell.

A lot of the time you’ll want to change a line subtly, rather than replace static text with completely different text. One of the most common ways of doing this is by using groups in the replacement expression. In a previous article I showed how you can combine parts of an expression by surrounding them with parentheses. For example, the following expression replaces a hyphen and space at the start of the string, or any run of spaces and tabs, with a tab character:

$string =~ s/(^- )|([ \t]+)/\t/g;
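Here is a small sketch of that expression at work on a made-up line. The second substitution exists only to make the tabs visible when printing.

```perl
use strict;
use warnings;

my $string = "- item   one \t two";        # made-up sample input

# Leading "- " and every run of spaces or tabs each become a single tab
$string =~ s/(^- )|([ \t]+)/\t/g;

# Show each tab as the two characters \t so the result can be read
(my $visible = $string) =~ s/\t/\\t/g;
print "$visible\n";    # prints "\titem\tone\ttwo"
```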

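The other advantage of groups is that you can insert the characters matched by a group into the replacement. Perl automatically puts the text matched by each group into the numbered variables $1, $2, $3 and so on, counting opening parentheses from the left. ($0 is not one of them; it holds the name of the running program.) For example, the following wraps a line that ends in a <BR> tag in paragraph tags:

$string =~ s/^(.+)<BR\/?>/<p>$1<\/p>/g;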
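The sketch below shows group variables in action, first with the paragraphing expression (with the slash in the tag escaped so that it does not end the pattern early) and then with a second, made-up example that uses two groups to swap their order.

```perl
use strict;
use warnings;

# One group: reuse the captured text inside paragraph tags
my $line = "First paragraph<BR>";          # made-up sample input
$line =~ s/^(.+)<BR\/?>/<p>$1<\/p>/g;
print "$line\n";    # prints "<p>First paragraph</p>"

# Two groups: $1 and $2 can appear in any order in the replacement
my $name = "Wall, Larry";
$name =~ s/^(\w+), (\w+)$/$2 $1/;
print "$name\n";    # prints "Larry Wall"
```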

Similarly, we can convert Comma Separated Values (.csv) files into HTML tables quite easily, by applying a few regular expressions to each line:

$string =~ s/([^,]+)[,\n]/<td>$1<\/td>/g;
$string =~ s/^(.+)$/<tr>$1<\/tr>/g;
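As a rough sketch, the two expressions can be applied to each line in turn. This assumes that every line still ends in a newline and that no field is empty or contains a comma or quotes; the sample rows are made up.

```perl
use strict;
use warnings;

my @rows = ("name,language\n", "Larry,Perl\n");   # made-up sample data

my $html = "<table>\n";
for my $row (@rows) {
    my $string = $row;
    $string =~ s/([^,]+)[,\n]/<td>$1<\/td>/g;   # each field becomes a table cell
    $string =~ s/^(.+)$/<tr>$1<\/tr>/g;         # the cells are wrapped in a table row
    $html .= "$string\n";
}
$html .= "</table>\n";

print $html;
```

Real CSV files with quoted fields need a proper parser; this only covers the simple case.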

These expressions, particularly the paragraphing one, share a flaw: regular expressions are case sensitive by default, whilst the HTML they run over may not be consistent in its use of case. We can tell Perl to treat our regular expressions as case insensitive by using a pattern modifier. We’ve already been using the modifier “g” to tell Perl to match globally, and we can tell it to be case insensitive in the same way with “i”:

$string =~ s/^(.+)<BR\/?>/<p>$1<\/p>/gi;

works the same as before, but will now also pick up <br> and <Br>. There are four more pattern modifiers that may be of use to you:

  • m: Treat the string as multiple lines, so that ^ and $ match at the start and end of every line rather than only at the start and end of the whole string.
  • o: Only compile the expression once, even if the variables included in it change later.
  • s: Treat the string as a single line, so that a period also matches a new line character.
  • x: Use extended syntax for regular expressions. This means that any white space that is not escaped is ignored, and regular expressions can be broken up over multiple lines. This allows you to write your more complicated expressions in an easier to read format, and lets you insert comments.
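The effect of the m and s modifiers is easiest to see on a string with embedded new lines. The following is a small sketch with made-up data and hypothetical variable names.

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree";              # made-up sample input

# Without m, ^ matches only at the very start of the string
my @without_m = $text =~ /^(\w+)/g;
# With m, ^ matches at the start of every line
my @with_m    = $text =~ /^(\w+)/mg;

# Without s, a period refuses to match the new line between the words
my $without_s = $text =~ /one.two/  ? "match" : "no match";
# With s, a period matches the new line as well
my $with_s    = $text =~ /one.two/s ? "match" : "no match";

print scalar(@without_m), " ", scalar(@with_m), "\n";   # prints "1 3"
print "$without_s, $with_s\n";                          # prints "no match, match"
```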

Let’s run through a quick usage of the extended syntax on the paragraphing expression:
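One way to lay it out is sketched below; the sample input string is made up, and the wording of the comments is only a suggestion.

```perl
use strict;
use warnings;

my $string = "Hello world<br>";            # made-up sample input

$string =~ s/
    ^(.+)      (?# capture the text at the start of the line )
    <BR        (?# the start of the line break tag )
    \/?>       (?# an optional slash, then the closing bracket )
/<p>$1<\/p>/gix;

print "$string\n";    # prints "<p>Hello world</p>"
```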


It’s the same expression, but the match pattern is broken up into three lines with comments at the end of each line explaining the three parts of the match. Comments inside extended regular expressions are contained within (?# and ). With the x modifier you can also start a comment with an unescaped # and it will run to the end of the line. For this example the comments might seem a little trivial, but for longer and more complicated expressions they can greatly increase the readability of your regular expressions.

That should be enough to get you started with substitution and translation, and to give you the confidence to explore them further.

