Using regex and sed

Using regex (regular expressions) and a stream editor like sed can save you tons of time when working with files.

In this tutorial I'll show you a few examples.

Let's say you have many text files. Each file is one big paragraph like the one shown below. (I removed most of the paragraph for simplicity's sake.)

1{A Census of Israel's Warriors} Yahweh spoke to Moses in the wilderness of Sinai in the Tent of Meeting, on the first day of the second month in the second year after they had come out of the land of Egypt, saying, 2“Take a census of all the congregation of the children of Israel by their families, by their fathers’ houses, according to the number of the names, every male, one by one 3from twenty years old and upward, all who are able to go out to war in Israel. You and Aaron are to number them by their divisions. 4With you there is to be a man of every tribe, everyone head of his fathers’ house. 5These are the names of the men who are to stand with you: Of Reuben: Elizur the son of Shedeur.

And let's say you're going to import this file into a program that requires a particular format, like the one below. Sure, you could do this by hand, but that might take you hours and hours. And if you have many files it could even take you hundreds of hours. Or you might need to do this on a regular basis. That's where regex and a stream editor come in handy.

$$ Nu 1:1
{A Census of Israel's Warriors} Yahweh spoke to Moses in the wilderness of Sinai in the Tent of Meeting, on the first day of the second month in the second year after they had come out of the land of Egypt, saying,
$$ Nu 1:2
“Take a census of all the congregation of the children of Israel by their families, by their fathers’ houses, according to the number of the names, every male, one by one
$$ Nu 1:3
from twenty years old and upward, all who are able to go out to war in Israel. You and Aaron are to number them by their divisions.
$$ Nu 1:4
With you there is to be a man of every tribe, everyone head of his fathers’ house.
$$ Nu 1:5
These are the names of the men who are to stand with you: Of Reuben: Elizur the son of Shedeur.

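I'm running Cygwin, a Unix-like environment for Windows, on a Windows 10 computer, with the sed version 4.2.2 package installed. You don't need to install Cygwin if you don't want to, because sed has been ported to Windows. Either way, you'll need sed. The examples use GNU sed features (the -r flag, \n in a replacement, and \s), so other versions of sed may behave differently.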

Okay, now what?

There are four steps needed to get the file in the correct format.

  1. Insert a line before and after a pattern
  2. Remove leading and trailing whitespace
  3. Add a string before a pattern
  4. Remove blank lines

In step number one, we need to put each number on its own line and the text that follows it on its own line as well.

These two expressions will do that: 's/[0-9]+/\n&/g; s/[0-9]+/&\n/g'. The ; separates the two expressions.

Explanation of first expression:

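In sed, the "s" before the first forward slash means substitute: s/pattern/replacement/ replaces whatever matches the pattern on the left-hand side with the replacement on the right-hand side.

/[0-9]+/ is on the left side; it's the pattern we're looking for. [0-9] matches any single digit and the + sign means one or more, so we're looking for one or more digits in a row. The + only has this meaning in extended mode, which is why the -r flag needs to be used when the command is run.

On the right side is the replacement, /\n&/. The \n is a new line and the & stands for whatever text the pattern matched, so \n& means to replace the matched digits with a new line followed by those same digits.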

In sed, the g at the end (/g) means global: replace every match of the pattern on the line, not just the first one. sed already goes through the file line by line until the end of the file is reached, with or without the g.
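
The effect of the g is easy to check on a short line. This is only a sketch, assuming GNU sed; the sample text and the variable names are made up for illustration.

```shell
# Made-up sample line, used only for illustration.
sample='1a 2b 3c'

# Without g, only the first run of digits on the line is replaced.
first_only=$(printf '%s\n' "$sample" | sed -r 's/[0-9]+/X/')

# With g, every run of digits on the line is replaced.
all_matches=$(printf '%s\n' "$sample" | sed -r 's/[0-9]+/X/g')

printf '%s\n' "$first_only"   # Xa 2b 3c
printf '%s\n' "$all_matches"  # Xa Xb Xc
```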

So to put this together, when sed finds one or more digits, it starts a new line in front of them.

Explanation of second expression:

The s/[0-9]+/ has the same meaning as in the first expression. However, the replacement /&\n/ puts the new line after the matched digits instead of before them, so the text following each number starts on its own line.
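
Here is a small sketch of both expressions at work on a made-up line, assuming GNU sed. The sample text and variable names are mine, chosen only to keep the output short.

```shell
# Made-up sample standing in for one of the real files.
sample='1First verse. 2Second verse.'

# The first expression puts a new line before each number,
# the second puts a new line after each number.
step1=$(printf '%s\n' "$sample" | sed -r 's/[0-9]+/\n&/g; s/[0-9]+/&\n/g')

# Print one line per row, wrapped in brackets so that blank lines
# and trailing spaces are visible.
printf '%s\n' "$step1" | sed 's/.*/[&]/'
```

Because the sample starts with a number, the first line of the output is empty, and the line before each later number keeps the space that used to sit in front of that number.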

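The result of step one is shown below. The actual output also starts with a blank line, and the text lines keep the space that came before the next number; steps two and four clean those up.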

sed -r 's/[0-9]+/\n&/g; s/[0-9]+/&\n/g' < filename

1
{A Census of Israel's Warriors} Yahweh spoke to Moses in the wilderness of Sinai in the Tent of Meeting, on the first day of the second month in the second year after they had come out of the land of Egypt, saying,
2
“Take a census of all the congregation of the children of Israel by their families, by their fathers’ houses, according to the number of the names, every male, one by one
3
from twenty years old and upward, all who are able to go out to war in Israel. You and Aaron are to number them by their divisions.
4
With you there is to be a man of every tribe, everyone head of his fathers’ house.
5
These are the names of the men who are to stand with you: Of Reuben: Elizur the son of Shedeur.

For step number two, we want to clean up the file by deleting all leading and trailing spaces.

To do this we use these two expressions: 's/^\s*//g; s/\s*$//g'

Explanation for the first expression:

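In sed, the s/ means to substitute whatever matches the pattern on the left with the replacement on the right.

In regex, the ^ means the beginning of a line.

In GNU sed, \s matches a single whitespace character, such as a space or a tab. The backslash is what gives the s this special meaning; without it, s would match the letter s. The -r flag is not needed for \s itself, but it does no harm to leave it on.

The * means zero or more.

In sed, the last two forward slashes // have nothing between them, so the replacement is empty.

In sed, the g means to replace every match on the line. Since a line has only one beginning and one end, the g makes no difference here and can be left out.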

Explanation for the second expression:

The main difference between the first and second is the $ sign. It's the opposite of the ^ sign, and means the end of a line.

So putting this together, sed looks at each line and finds any whitespace at the beginning of it, s/^\s*/, and replaces that whitespace with nothing, //g, effectively removing it. Next, sed looks at the end of each line and removes any whitespace it finds there.
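
A quick way to see the trimming is to wrap the result in brackets. This is a sketch on a made-up string, assuming GNU sed (for \s); the variable names are only for illustration.

```shell
# Made-up line with three spaces on each side.
sample='   some text   '

# Remove whitespace at the start of the line, then at the end.
trimmed=$(printf '%s\n' "$sample" | sed -r 's/^\s*//; s/\s*$//')

# The brackets show that no spaces are left on either side.
printf '[%s]\n' "$trimmed"   # [some text]
```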

The result of step two

sed -r 's/^\s*//g; s/\s*$//g' < filename

1
{A Census of Israel's Warriors} Yahweh spoke to Moses in the wilderness of Sinai in the Tent of Meeting, on the first day of the second month in the second year after they had come out of the land of Egypt, saying,
2
“Take a census of all the congregation of the children of Israel by their families, by their fathers’ houses, according to the number of the names, every male, one by one
3
from twenty years old and upward, all who are able to go out to war in Israel. You and Aaron are to number them by their divisions.
4
With you there is to be a man of every tribe, everyone head of his fathers’ house.
5
These are the names of the men who are to stand with you: Of Reuben: Elizur the son of Shedeur.

It's hard to see any results, but if you open the file in a text editor like Notepad++, you can turn on "Show Symbol" and choose White space. Spaces will show as a small red dot. Other editors may show the spaces differently.

Step three, add a string of text before a pattern.

To do this we use this expression: '/^[0-9]+/s/^/$$ Nu 1:/g'

Explanation:

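The expression /^[0-9]+/ in front of the s is an address: it tells sed to run the command that follows only on lines that begin with one or more digits.

The s command means to substitute, so s/^/$$ Nu 1:/ replaces the left-hand pattern, ^ (the beginning of the line), with the right-hand replacement, $$ Nu 1:. Nothing is removed, because ^ matches a position rather than a character, so the text is simply inserted at the start of the line. The g is not needed here, since a line has only one beginning. The single quotes keep the shell from treating $$ as a variable.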
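
To show that only the lines starting with a digit are touched, here is a sketch on four made-up lines, assuming GNU sed. The sample text and variable name are for illustration only.

```shell
# Four made-up lines: two verse numbers and two lines of text.
prefixed=$(printf '1\nSome verse text\n2\nMore verse text\n' |
  sed -r '/^[0-9]+/s/^/$$ Nu 1:/')

# Lines 1 and 3 now start with "$$ Nu 1:"; lines 2 and 4 are unchanged.
printf '%s\n' "$prefixed"
```

Keep in mind that a line of verse text that happens to begin with a digit would get the prefix as well.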

Result of step three:

sed -r '/^[0-9]+/s/^/$$ Nu 1:/g' < filename

$$ Nu 1:1
{A Census of Israel's Warriors} Yahweh spoke to Moses in the wilderness of Sinai in the Tent of Meeting, on the first day of the second month in the second year after they had come out of the land of Egypt, saying,
$$ Nu 1:2
“Take a census of all the congregation of the children of Israel by their families, by their fathers’ houses, according to the number of the names, every male, one by one
$$ Nu 1:3
from twenty years old and upward, all who are able to go out to war in Israel. You and Aaron are to number them by their divisions.
$$ Nu 1:4
With you there is to be a man of every tribe, everyone head of his fathers’ house.
$$ Nu 1:5
These are the names of the men who are to stand with you: Of Reuben: Elizur the son of Shedeur.

Now for the last step, removing blank lines.

To do this we use this expression: '/^\s*$/d'

Explanation:

Again we use ^ to look at the start of a line, \s to match a whitespace character, and * to match zero or more of them. The $ matches the end of the line, so the whole pattern only matches lines that are empty or contain nothing but whitespace. The d command deletes each line that matches.

To run the command we type sed -r '/^\s*$/d' < filename into the command line
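
As a small sketch, assuming GNU sed, here is the same command on made-up input that has one empty line and one line holding only spaces. The variable name is for illustration only.

```shell
# Made-up input: "a", an empty line, a line of three spaces, then "b".
cleaned=$(printf 'a\n\n   \nb\n' | sed -r '/^\s*$/d')

# Only the two lines with text are left.
printf '%s\n' "$cleaned"
```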

Once any blank lines are removed, the result looks like the output of step three above.
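
The four steps can also be chained with pipes, so that each sed works on the output of the one before it. This is a sketch on a made-up line, assuming GNU sed; the sample text and variable names are mine. The steps are kept as separate sed commands because, inside a single sed run, the ^ and $ of the later steps would still only match the start and end of the original line, not the new lines created in step one.

```shell
# Made-up sample standing in for one of the real files.
sample='1First verse, 2second verse. 3Third verse.'

result=$(printf '%s\n' "$sample" |
  sed -r 's/[0-9]+/\n&/g; s/[0-9]+/&\n/g' |  # step 1: split at the numbers
  sed -r 's/^\s*//; s/\s*$//' |              # step 2: trim whitespace
  sed -r '/^[0-9]+/s/^/$$ Nu 1:/' |          # step 3: add the prefix
  sed -r '/^\s*$/d')                         # step 4: drop blank lines

printf '%s\n' "$result"
```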

Using files

The last thing I'd like to cover is putting your expressions into a file and then executing that file instead of having to type in the expression every time. This is helpful if you need to run the same expression many times.

Additionally I'll cover exporting the results into a separate file.

First, open a text editor, type in the expression s/[0-9]+/\n&/g; s/[0-9]+/&\n/g and then save it. You don't need to add the ' ' (single quotes) when using a file, because the quotes are only there to protect the expression from the shell.

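To use this file in sed you need to use the -f flag followed by the name of the file you just saved. I'll use step 1 as an example.

Open a shell and type sed -r -f scriptfile < filename | less, where scriptfile is the file holding the expressions.

The -f flag tells sed to read its expressions from the file named right after it. The -r flag is still needed, because the expressions use the + sign.

The < filename is the file you want sed to work on.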

The | sign (shift + the backslash key) is the pipe symbol which sends the results (in this case) to the less program.

The less program will display the results on your screen one page at a time. sed does not change the original file unless you use the -i flag, so this is a safe way to check the results first.

The q key exits the less program.

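If you are satisfied with the results and you want to keep the original file and send the results of the expression to a new file, you only need to add > newfile to the end of the command.

Example: sed -r -f scriptfile < filename > newfile

The > symbol writes the results to a new file. Make sure the new file has a different name from the original, because the shell empties the output file before sed gets a chance to read it.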
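
Here is the whole round trip as a sketch, assuming GNU sed and the mktemp command. The file names step1.sed, input.txt and output.txt are made up for this example, and everything is written to a temporary folder so no real files are touched.

```shell
# Temporary folder for the example files (names are made up).
workdir=$(mktemp -d)

# Save the step 1 expressions in a script file, without the single quotes.
printf '%s\n' 's/[0-9]+/\n&/g; s/[0-9]+/&\n/g' > "$workdir/step1.sed"

# A made-up input file standing in for one of the real files.
printf '%s\n' '1First verse. 2Second verse.' > "$workdir/input.txt"

# Read the expressions with -f, read input.txt, write to output.txt.
sed -r -f "$workdir/step1.sed" < "$workdir/input.txt" > "$workdir/output.txt"

# The original file is unchanged; the results are in the new file.
cat "$workdir/output.txt"
```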