Lab No. 8: Text Processing

In this Lab we are going to learn about common Unix/Linux tools for processing of text data.

sed

sed (which stands for stream editor) is used to perform transformations to text files or a stream of text. Common uses are editing files without using an interactive editor (e.g. on scripts), alter the output of a command in a command pipeline, or carry out batch edits on large groups of files.

sed accepts input in the form of text lines. During its execution, it goes through its input one line at a time, and applies one or more instructions (known as sed commands) to each one of them individually, and then outputs the processed lines to standard output.

Let’s create a file called testfile.txt with some text so we can use it to test sed functionality in the following sections.

[user@blue ~]$ cat << EOF > testfile.txt
> 1 a quick brown fox jumped over the quick wolf quickly
> 2 the quick brown fox jumped over the quick wolf quickly
> 3 no quick brown fox jumped over the quick wolf quickly
> 4 the quick brown fox jumped over the quick wolf quickly
> 5 the quick BROWN fox jumped over the quick wolf quickly
> 6 the quick BROWN fox jumped over the quick wolf quickly
> 7 the quick BROWN fox jumped over the quick wolf quickly
> 8 the quick BROWN fox jumped over the quick wolf quickly
> EOF

Just to make sure you typed that file correctly, use the following checksum:

[user@blue ~]$ md5sum testfile.txt
e2a8f3ce72f773b4b3ecd2c0feafdd50  testfile.txt

sed basic syntax

The basic sintax to invoke sed is:

sed [options] script filename

options are flags that modify the defaul behavior. The script specifies the sed commands that sed will apply to the input. If a filename is specified, input is taken from that file, otherwise, sed will read input from stdin.

Let’s see sed in action. In the following command we replace the word “quick” by “fast” in that we created earlier (testfile.txt):

[user@blue ~]$ sed 's/quick/fast/' testfile.txt
1 a fast brown fox jumped over the quick wolf quickly
2 the fast brown fox jumped over the quick wolf quickly
3 no fast brown fox jumped over the quick wolf quickly
4 the fast brown fox jumped over the quick wolf quickly
5 the fast BROWN fox jumped over the quick wolf quickly
6 the fast BROWN fox jumped over the quick wolf quickly
7 the fast BROWN fox jumped over the quick wolf quickly
8 the fast BROWN fox jumped over the quick wolf quickly

In this invocation of sed, we did not provide any options, the 's/quick/fast' part is the script argument with a single command (enclosed in single quotes), and we specified input from the testfile.txt filename.

Notice how when running this command the output of sed was written to stdout, and the contents of testfile.txt were not modified. We could actually modify the contents of testfile by using the -i option which stands for edit files in place. (But we don’t want to do that yet, we’ll do it later).

In this command, the argument s/quick/fast tells sed to replace the first instance of the word “quick” on every line of the input. Notice how we enclosed the script argument in single quotes. This is not required in all cases, but it is a common practice (and you should make a habit of it) because it prevents the shell from expanding any expressions in the sed command (recall what we leared in Lab 4), and also the shell uses spaces to determine the arguments to a program, so if the sed command has spaces on it, it absolutely needs to be enclosed in quotes.

For example, suppose you want to replace “quick brown” by “slow white”. Let’s see what happens if we do not use quotes:

[user@blue ~]$ sed s/quick brown/slow white/ testfile.txt
sed: -e expression #1, char 7: unterminated `s' command

What happened here is that the script argument was interpreted as s/quick because of the space between “quick” and “brown”. As mentioned before, this problem is avoided by enclosing the script argument in single quotes. You could also use double quotes, but be mindful that text within double quotes is still subject to shell expansions. That is why it is recommended to form the habit of enclosing the script argument in single quotes.

 [user@blue ~]$ sed 's/quick brown/slow white/' testfile.txt
1 a slow white fox jumped over the quick wolf quickly
2 the slow white fox jumped over the quick wolf quickly
3 no slow white fox jumped over the quick wolf quickly
4 the slow white fox jumped over the quick wolf quickly
5 the quick BROWN fox jumped over the quick wolf quickly
6 the quick BROWN fox jumped over the quick wolf quickly
7 the quick BROWN fox jumped over the quick wolf quickly
8 the quick BROWN fox jumped over the quick wolf quickly

We mentioned that sed can operate on a stream of text. To do this, you can need to write the input text to stdin of sed and omit the filename argument that we used before. In the following command we pipe the output of the echo command to sed:

[user@blue ~]$ echo -e "The quick brown fox\njumped over the lazy dog." | sed 's/quick/fast/'
The fast brown fox
jumped over the lazy dog.

You can specify more than one sed command as part of the script argument. In the next example we use two different commands to substitute “fox” with “rabbit” and anoter to replace “wolf” with “lynx”:

[user@blue ~]$ sed 's/fox/rabbit/;s/wolf/lynx/' testfile.txt
1 a quick brown rabbit jumped over the quick lynx quickly
2 the quick brown rabbit jumped over the quick lynx quickly
3 no quick brown rabbit jumped over the quick lynx quickly
4 the quick brown rabbit jumped over the quick lynx quickly
5 the quick BROWN rabbit jumped over the quick lynx quickly
6 the quick BROWN rabbit jumped over the quick lynx quickly
7 the quick BROWN rabbit jumped over the quick lynx quickly
8 the quick BROWN rabbit jumped over the quick lynx quickly

Note that we separated the sed commands using a semicolon. There is an alternative syntax where you can specify multiple script arguments to achieve the same results, which has the advantage that it is a little easier to read (specially if and of the text to replace or its replacement contains a semicolon).

[user@blue ~]$ sed -e 's/fox/rabbit/' -e 's/wolf/lynx/' testfile.txt
1 a quick brown rabbit jumped over the quick lynx quickly
2 the quick brown rabbit jumped over the quick lynx quickly
3 no quick brown rabbit jumped over the quick lynx quickly
4 the quick brown rabbit jumped over the quick lynx quickly
5 the quick BROWN rabbit jumped over the quick lynx quickly
6 the quick BROWN rabbit jumped over the quick lynx quickly
7 the quick BROWN rabbit jumped over the quick lynx quickly
8 the quick BROWN rabbit jumped over the quick lynx quickly

When you have a complex set of sed commands that you want to apply for a given task, you can have sed read them from a script file which could be as simple as a file with each command written on separate lines (more complex scripts can be created, but they are out of scope for this class). You specify the script file with the -f option:

[user@blue ~]$ cat << EOF > sedscript
> s/fox/rabbit/
> s/wolf/lynx/
> EOF
[user@blue ~]$ sed -f sedscript testfile.txt
1 a quick brown rabbit jumped over the quick lynx quickly
2 the quick brown rabbit jumped over the quick lynx quickly
3 no quick brown rabbit jumped over the quick lynx quickly
4 the quick brown rabbit jumped over the quick lynx quickly
5 the quick BROWN rabbit jumped over the quick lynx quickly
6 the quick BROWN rabbit jumped over the quick lynx quickly
7 the quick BROWN rabbit jumped over the quick lynx quickly
8 the quick BROWN rabbit jumped over the quick lynx quickly

One thing that you should be aware is that each sed command is applied one by one, and the output from each sed command is applied as the input to the next:

[user@blue ~]$ sed -e 's/fox/rabbit/' -e 's/rabbit/grasshopper/' testfile.txt
1 a quick brown grasshopper jumped over the quick wolf quickly
2 the quick brown grasshopper jumped over the quick wolf quickly
3 no quick brown grasshopper jumped over the quick wolf quickly
4 the quick brown grasshopper jumped over the quick wolf quickly
5 the quick BROWN grasshopper jumped over the quick wolf quickly
6 the quick BROWN grasshopper jumped over the quick wolf quickly
7 the quick BROWN grasshopper jumped over the quick wolf quickly
8 the quick BROWN grasshopper jumped over the quick wolf quickly

sed supports many several types of commands.

In this course we are going to learn the “substitute”, “print” and “delete” sed commands.

sed substitute command

The commands that we executed previously are examples of the substitute command, which is the most commonly used. To be precise, a substitute command has the following syntax (notice the “s” after the address elements)

[address]s/pattern/replacement/flags

As you have seen so far, sed command elements are typically separated by a “slash” (/) character, which is called a “delimiter”. In the substitute command the delimiter needs to appear three times, even if there are no flags. Also, you can use any character as a delimiter except a new line character. Using a delimiter other than slash can be useful when the slash character is part of the pattern or replacement. A typical example of this is when you have to make a substitution of strings that refer to paths in the filesystem. In the following example we use the underscore (_) as separator:

[user@blue ~]$ sed 's_quick_slow_' testfile.txt
1 a slow brown fox jumped over the quick wolf quickly
2 the slow brown fox jumped over the quick wolf quickly
3 no slow brown fox jumped over the quick wolf quickly
4 the slow brown fox jumped over the quick wolf quickly
5 the slow BROWN fox jumped over the quick wolf quickly
6 the slow BROWN fox jumped over the quick wolf quickly
7 the slow BROWN fox jumped over the quick wolf quickly
8 the slow BROWN fox jumped over the quick wolf quickly

When using the substitute command, the second token (quick) corresponds to a regular expression (regex) to match, the third token (slow) indicates the replacement text. The substitute command also accepts a fourth token, which in our examples is ommitted, and that corresponds to a flag that could be (There are other flags “p” and “w’ that we are not going to cover as part of this course)

  • ommited: only the first instance of the matching text is replaced
  • n: A number (between 1 and 512) that indicates that the replacement should be made for only the n th occurrence of the pattern
  • g: make changes globally on all ocurrences of the pattern.
  • p: print the lines where a pattern match occurred

We already saw in the first example how the word “quick” had only the first instance replaced. Let’s see how we can change specific instances within a line:

[user@blue ~]$ sed 's/quick/fast/2' testfile.txt
1 a quick brown fox jumped over the fast wolf quickly
2 the quick brown fox jumped over the fast wolf quickly
3 no quick brown fox jumped over the fast wolf quickly
4 the quick brown fox jumped over the fast wolf quickly
5 the quick BROWN fox jumped over the fast wolf quickly
6 the quick BROWN fox jumped over the fast wolf quickly
7 the quick BROWN fox jumped over the fast wolf quickly
8 the quick BROWN fox jumped over the fast wolf quickly
[user@blue ~]$ sed 's/quick/fast/3' testfile.txt
1 a quick brown fox jumped over the quick wolf fastly
2 the quick brown fox jumped over the quick wolf fastly
3 no quick brown fox jumped over the quick wolf fastly
4 the quick brown fox jumped over the quick wolf fastly
5 the quick BROWN fox jumped over the quick wolf fastly
6 the quick BROWN fox jumped over the quick wolf fastly
7 the quick BROWN fox jumped over the quick wolf fastly
8 the quick BROWN fox jumped over the quick wolf fastly

If we use the g flag (which stands for “global”) we can replace all the instances:

[user@blue ~]$ sed 's/quick/fast/g' testfile.txt
1 a fast brown fox jumped over the fast wolf fastly
2 the fast brown fox jumped over the fast wolf fastly
3 no fast brown fox jumped over the fast wolf fastly
4 the fast brown fox jumped over the fast wolf fastly
5 the fast BROWN fox jumped over the fast wolf fastly
6 the fast BROWN fox jumped over the fast wolf fastly
7 the fast BROWN fox jumped over the fast wolf fastly
8 the fast BROWN fox jumped over the fast wolf fastly

By default, sed will write to stdout every line that it processes, regardless if there is a match or not. We can fine control the output by using the -n option (“quiet mode”). This option will make sed to only output lines that are explicitly printed by either using a sed print command, or in the case of a sed substitute command, that use the p flag. Let’s see that in action:

[user@blue ~]$ sed -n 's/BROWN/brown/gp' testfile.txt
5 the quick brown fox jumped over the quick wolf quickly
6 the quick brown fox jumped over the quick wolf quickly
7 the quick brown fox jumped over the quick wolf quickly
8 the quick brown fox jumped over the quick wolf quickly

Notice how in the previous example only lines that originally matched the pattern BROWN were output. If we omit the p flag at the end, then we would get no output.

If we want to modify a specific line, we can do so by using the address element of the command. Adresses can be specified in different ways:

Notation Description
n The command will be applied only to line number n
$ The last line
n1,n2 A range of lines from n1 to n2
n~y Line number n then each subsequent line at y intervals
n1,+n2 Line n1 and the following n2 lines
n! All lines except line n

Let’s try some examples:

[user@blue ~]$ sed '2s/quick/fast/g' testfile.txt
1 a quick brown fox jumped over the quick wolf quickly
2 the fast brown fox jumped over the fast wolf fastly
3 no quick brown fox jumped over the quick wolf quickly
4 the quick brown fox jumped over the quick wolf quickly
5 the quick BROWN fox jumped over the quick wolf quickly
6 the quick BROWN fox jumped over the quick wolf quickly
7 the quick BROWN fox jumped over the quick wolf quickly
8 the quick BROWN fox jumped over the quick wolf quickly
[user@blue ~]$ sed '$s/quick/fast/g' testfile.txt
1 a quick brown fox jumped over the quick wolf quickly
2 the quick brown fox jumped over the quick wolf quickly
3 no quick brown fox jumped over the quick wolf quickly
4 the quick brown fox jumped over the quick wolf quickly
5 the quick BROWN fox jumped over the quick wolf quickly
6 the quick BROWN fox jumped over the quick wolf quickly
7 the quick BROWN fox jumped over the quick wolf quickly
8 the fast BROWN fox jumped over the fast wolf fastly
[user@blue ~]$ sed '3,5s/quick/fast/g' testfile.txt
1 a quick brown fox jumped over the quick wolf quickly
2 the quick brown fox jumped over the quick wolf quickly
3 no fast brown fox jumped over the fast wolf fastly
4 the fast brown fox jumped over the fast wolf fastly
5 the fast BROWN fox jumped over the fast wolf fastly
6 the quick BROWN fox jumped over the quick wolf quickly
7 the quick BROWN fox jumped over the quick wolf quickly
8 the quick BROWN fox jumped over the quick wolf quickly
[user@blue ~]$ sed '3~2s/quick/fast/g' testfile.txt
1 a quick brown fox jumped over the quick wolf quickly
2 the quick brown fox jumped over the quick wolf quickly
3 no fast brown fox jumped over the fast wolf fastly
4 the quick brown fox jumped over the quick wolf quickly
5 the fast BROWN fox jumped over the fast wolf fastly
6 the quick BROWN fox jumped over the quick wolf quickly
7 the fast BROWN fox jumped over the fast wolf fastly
8 the quick BROWN fox jumped over the quick wolf quickly
[user@blue ~]$ sed '2,+3s/quick/fast/g' testfile.txt
1 a quick brown fox jumped over the quick wolf quickly
2 the fast brown fox jumped over the fast wolf fastly
3 no fast brown fox jumped over the fast wolf fastly
4 the fast brown fox jumped over the fast wolf fastly
5 the fast BROWN fox jumped over the fast wolf fastly
6 the quick BROWN fox jumped over the quick wolf quickly
7 the quick BROWN fox jumped over the quick wolf quickly
8 the quick BROWN fox jumped over the quick wolf quickly
[user@blue ~]$ sed '4!s/quick/fast/g' testfile.txt
1 a fast brown fox jumped over the fast wolf fastly
2 the fast brown fox jumped over the fast wolf fastly
3 no fast brown fox jumped over the fast wolf fastly
4 the quick brown fox jumped over the quick wolf quickly
5 the fast BROWN fox jumped over the fast wolf fastly
6 the fast BROWN fox jumped over the fast wolf fastly
7 the fast BROWN fox jumped over the fast wolf fastly
8 the fast BROWN fox jumped over the fast wolf fastly

More elaborate patterns

So far we have used patterns that match literals (they do not use any of the metacharacters that we learned in Lab 7). Let’s try some tasks that require the usage of metacharacters.

Suppose we want to capitalize the word at the begining of the line (right after the line number) Let’s try the following expression, in which we try to capitalize the words “the”, “a” and “no”.

[user@blue ~]$ sed 's/the/The/;s/a/A/;s/no/No/'  testfile.txt
1 A quick brown fox jumped over The quick wolf quickly
2 The quick brown fox jumped over the quick wolf quickly
3 No quick brown fox jumped over The quick wolf quickly
4 The quick brown fox jumped over the quick wolf quickly
5 The quick BROWN fox jumped over the quick wolf quickly
6 The quick BROWN fox jumped over the quick wolf quickly
7 The quick BROWN fox jumped over the quick wolf quickly
8 The quick BROWN fox jumped over the quick wolf quickly

We can see that it makes substitutions that we did not want (on lines 1 and 3) One first attempt to solve this problem by using the begining of the line metacharacter, and for the sake of simplicity, let’s try only to take care of the word “the”

[user@blue ~]$ sed 's/^[0-9] the/The/' testfile.txt
1 a quick brown fox jumped over the quick wolf quickly
The quick brown fox jumped over the quick wolf quickly
3 no quick brown fox jumped over the quick wolf quickly
The quick brown fox jumped over the quick wolf quickly
The quick BROWN fox jumped over the quick wolf quickly
The quick BROWN fox jumped over the quick wolf quickly
The quick BROWN fox jumped over the quick wolf quickly
The quick BROWN fox jumped over the quick wolf quickly

That worked, but unfortunately, we lost the line number. We need a way to tell sed to preserve the ^[0-9] part. To do this, we need to make use of back references. To make use of backreferences, we need to enclose the subexpression in escaped parenthesis, and then we need to refer to them in the replacement part using the \digit special sequence, where digit is a single digit (between 1 and 9)

[user@blue ~]$ sed 's/\(^[0-9] \)the/\1The/' testfile.txt
1 a quick brown fox jumped over the quick wolf quickly
2 The quick brown fox jumped over the quick wolf quickly
3 no quick brown fox jumped over the quick wolf quickly
4 The quick brown fox jumped over the quick wolf quickly
5 The quick BROWN fox jumped over the quick wolf quickly
6 The quick BROWN fox jumped over the quick wolf quickly
7 The quick BROWN fox jumped over the quick wolf quickly
8 The quick BROWN fox jumped over the quick wolf quickly

We can now take care of the desired substitution as:

[user@blue ~]$ sed 's/\(^[0-9] \)the/\1The/;s/\(^[0-9] \)a/\1A/;s/\(^[0-9] \)no/\1No/' testfile.txt
1 A quick brown fox jumped over the quick wolf quickly
2 The quick brown fox jumped over the quick wolf quickly
3 No quick brown fox jumped over the quick wolf quickly
4 The quick brown fox jumped over the quick wolf quickly
5 The quick BROWN fox jumped over the quick wolf quickly
6 The quick BROWN fox jumped over the quick wolf quickly
7 The quick BROWN fox jumped over the quick wolf quickly
8 The quick BROWN fox jumped over the quick wolf quickly

That was quite a lot, and we can see how complex expressions can quickly turn into a hard to understand pattern. Let’s try to simplify this a little. We know that we want to capitalize the first lower case letter. Just to try something, let’s see if the [[:lower::]] capturing group could help. For the moment, let’s replace everything with a capital T:

[user@blue ~]$ sed 's/\(^[0-9] \)[[:lower:]]/\1T/' testfile.txt
1 T quick brown fox jumped over the quick wolf quickly
2 The quick brown fox jumped over the quick wolf quickly
3 To quick brown fox jumped over the quick wolf quickly
4 The quick brown fox jumped over the quick wolf quickly
5 The quick BROWN fox jumped over the quick wolf quickly
6 The quick BROWN fox jumped over the quick wolf quickly
7 The quick BROWN fox jumped over the quick wolf quickly
8 The quick BROWN fox jumped over the quick wolf quickly

Now, we are closer, but we need to preserve the lowercase character that was matched and convert it to uppercase. We know that we can preserve a match by using a back reference. We can solve the conversion to uppercase by using a special sequence \u which converts the next character in the replacement to uppercase:

[user@blue ~]$ sed 's/\(^[0-9] \)\([[:lower:]]\)/\1\U\2/' testfile.txt
1 A quick brown fox jumped over the quick wolf quickly
2 The quick brown fox jumped over the quick wolf quickly
3 No quick brown fox jumped over the quick wolf quickly
4 The quick brown fox jumped over the quick wolf quickly
5 The quick BROWN fox jumped over the quick wolf quickly
6 The quick BROWN fox jumped over the quick wolf quickly
7 The quick BROWN fox jumped over the quick wolf quickly
8 The quick BROWN fox jumped over the quick wolf quickly

The following table lists other special sequences that can be used with sed: .. cssclass:: minitable

Sequence Description
L Turn the replacement to lowercase (until a U or E is found)
U Turn the replacement to uppercase (until a L or E is found)
l Turn the next character to lowercase
u Turn the next character to uppercase
E Stop case conversion started by U or L
& Output the matched pattern

Just as grep, sed supports extended regular expressions, but you need to activate them by using the the -E option.

Assume for example, that we have a file called ccnumbers.txt that contains 16-digit credit card numbers:

[user@blue ~]$ cat ccnumbers.txt
4737458963699782
4916875767809262
4556856715310335
4556636000469644
4024007127553401

Suppose you want to print the credit card numbers with a space separating each one of the 4 digit segments. You can use the following command:

[user@blue ~]$ sed -E 's/([0-9]{4})([0-9]{4})([1234567890]{4})([[:digit:]]{4})/\1 \2 \3 \4/' ccnumbers.txt
4737 4589 6369 9782
4916 8757 6780 9262
4556 8567 1531 0335
4556 6360 0046 9644
4024 0071 2755 3401

See how we used back references, repetition quantifiers, and character classes ([1234567890] [[:digit:]] and [0-9] they all do the same and are used here for illustration purposes) Notice how when you enabled extended regular expressions you did not have to escape the parenthesis. Yes this can be maddening!

sed print command

The print command is used (as you might have guessed) to print specific lines from the input. It has the following syntax:

[address]/pattern/p

The address and pattern behave exactly in the same way as in the case of the substitute command. There is one modification that we need to make to the way are invoking sed which is to add the -n flag:

[user@blue ~]$ sed -n '1p' testfile.txt
1 a quick brown fox jumped over the quick wolf quickly

By using the pattern element, we can make sed behave just like grep:

[user@blue ~]$ sed -n '/BROWN/p' testfile.txt
5 the quick BROWN fox jumped over the quick wolf quickly
6 the quick BROWN fox jumped over the quick wolf quickly
7 the quick BROWN fox jumped over the quick wolf quickly
8 the quick BROWN fox jumped over the quick wolf quickly
[user@blue ~]$ grep BROWN testfile.txt
5 the quick BROWN fox jumped over the quick wolf quickly
6 the quick BROWN fox jumped over the quick wolf quickly
7 the quick BROWN fox jumped over the quick wolf quickly
8 the quick BROWN fox jumped over the quick wolf quickly

This is obviously not the prime usage of sed, but it does have it’s place in complex sed scripts.

sed delete command

The delete command is used (as you might have also guessed) to remove specific lines from the input: It requires the following syntax:

[address]/pattern/p

In the following example we remove all the lines that match BROWN:

[user@blue ~]$ sed  '/BROWN/d' testfile.txt
1 a quick brown fox jumped over the quick wolf quickly
2 the quick brown fox jumped over the quick wolf quickly
3 no quick brown fox jumped over the quick wolf quickly
4 the quick brown fox jumped over the quick wolf quickly

cut

cut is used to extract a section of text from a line and output the extrated section. It is a simple alternative when we have text that is organized in columns or fields. By default it uses the tab character as separator. We are going to show some examples of its usage here, but please refer to the man pages of cut to get familiar with the different options it supports.

Let’s first take our testfile.txt, replace spaces with tabs and create another file called data.txt

[user@blue ~]$ sed 's/ /\t/g' testfile.txt > data.txt
[user@blue ~]$ cat data.txt
1       a       quick   brown   fox     jumped  over    the     quick   wolf    quickly
2       the     quick   brown   fox     jumped  over    the     quick   wolf    quickly
3       no      quick   brown   fox     jumped  over    the     quick   wolf    quickly
4       the     quick   brown   fox     jumped  over    the     quick   wolf    quickly
5       the     quick   BROWN   fox     jumped  over    the     quick   wolf    quickly
6       the     quick   BROWN   fox     jumped  over    the     quick   wolf    quickly
7       the     quick   BROWN   fox     jumped  over    the     quick   wolf    quickly
8       the     quick   BROWN   fox     jumped  over    the     quick   wolf    quickly
[user@blue ~]$ md5sum data.txt
c1d1d33855ec4751d3c6aef158a1fdfa  data.txt

Notice how we used the escaped sequence \t for the tab character. Also, make sure you verify the checksum.

Let’s use cut got print the second column

[user@blue ~]$ cut -f 2 data.txt
a
the
no
the
the
the
the
the

We can also print a list or range of columns:

[user@blue ~]$ cut -f 2,5 data.txt
a       fox
the     fox
no      fox
the     fox
the     fox
the     fox
the     fox
the     fox
[user@blue ~]$ cut -f 2-5 data.txt
a       quick   brown   fox
the     quick   brown   fox
no      quick   brown   fox
the     quick   brown   fox
the     quick   BROWN   fox
the     quick   BROWN   fox
the     quick   BROWN   fox
the     quick   BROWN   fox

You could use cut with files that are not delimited with tabs by using -d option

[user@blue ~]$ cut -d ' ' -f 2-5 testfile.txt
a quick brown fox
the quick brown fox
no quick brown fox
the quick brown fox
the quick BROWN fox
the quick BROWN fox
the quick BROWN fox
the quick BROWN fox

Or you could select specific characters with the -c option:

[user@blue ~]$ cut  -c 3- testfile.txt
a quick brown fox jumped over the quick wolf quickly
the quick brown fox jumped over the quick wolf quickly
no quick brown fox jumped over the quick wolf quickly
the quick brown fox jumped over the quick wolf quickly
the quick BROWN fox jumped over the quick wolf quickly
the quick BROWN fox jumped over the quick wolf quickly
the quick BROWN fox jumped over the quick wolf quickly
the quick BROWN fox jumped over the quick wolf quickly

awk

awk, named after its developers (Aho, Weinberger and Kernighan), is a programming language that permits manipulation of structured data.

awk is way more than a tool, it is a programming language by itself. In this class we are going to learn only its most basic usage through some very basic examples.

In the following example we extract datafields from a string. awk uses whitespace characters as its defaul field delimiter. The awk field variables start at $1 and increment up through the end of the string. In the example, there are 9 fields. The variable $0 corresponds to entire line, the variable NR holds the current line number, and the variable NF contains the number of fields in the current line.

[user@blue ~]$ awk '{ print $1, $9, $5, $6, $7, $4, NR}' testfile.txt
1 quick fox jumped over brown 1
2 quick fox jumped over brown 2
3 quick fox jumped over brown 3
4 quick fox jumped over brown 4
5 quick fox jumped over BROWN 5
6 quick fox jumped over BROWN 6
7 quick fox jumped over BROWN 7
8 quick fox jumped over BROWN 8

Just like sed, awk accepts input from stdin:

[user@blue ~]$ awk '{ print $1, $9, $5, $6, $7, $4, NR}' < testfile.txt
1 quick fox jumped over brown 1
2 quick fox jumped over brown 2
3 quick fox jumped over brown 3
4 quick fox jumped over brown 4
5 quick fox jumped over BROWN 5
6 quick fox jumped over BROWN 6
7 quick fox jumped over BROWN 7
8 quick fox jumped over BROWN 8

Notice how you can print the last field by using the $NF variable:

[user@blue ~]$ echo 'The quick brown fox jumped over the lazy dog' | awk '{ print "The input was: "$0". There are a total of " NF " fields. The last field value is "$NF}'
The input was: The quick brown fox jumped over the lazy dog. There are a total of 9 fields. The last field value is dog

If the data is not formatted using spaces as separators, we can specify the delimiter by using the -F option.

[user@blue ~]$ echo 'this-is-a-hyphen-delimited-text' | awk -F'-' '{ print $4, $5}'
hyphen delimited

A common need is to be able to pass variables from bash to awk. The following example demonstrate an example of how to achieve this (the -v option is used to assign variables and their values)

[user@blue ~]$ VAR=3
[user@blue ~]$ echo "The quick brown fox" | awk -v myvar=$VAR '{print $myvar}'
brown

You can also perform substring extraction with awk through its substr function. The syntax of this function is

substr(string, position of first character, length of substring)

The position of the first character is 1-indexed (that is the first character has a position equal to one, not zero)

[user@blue ~]$ echo -e "This is linenumberone\nThis is linenumbertwo" | awk '{print substr($3,11,3)}'
one
two

In the next example we produce output only if the second field matches a certain value (“no”):

[user@blue ~]$ awk '$2 == "no" {print $0}' testfile.txt
3 no quick brown fox jumped over the quick wolf quickly

We can also apply regular expressions by using the ~ operator. In the following example, we tell awk to filter lines whose first field matches the pattern [713]

[user@blue ~]$ awk '$1 ~ /[713]/  {print $1,$2,$3,$4}' testfile.txt
1 a quick brown
3 no quick brown
7 the quick BROWN

tr

The tr command is used to translate or delete characters. It takes input from stdin it does not read files.

For example, lets replace spaces with tabs. We did that with sed earlier, but we can do it with tr as well:

[user@blue ~]$ tr ' ' '\t' < testfile.txt
1       a       quick   brown   fox     jumped  over    the     quick   wolf    quickly
2       the     quick   brown   fox     jumped  over    the     quick   wolf    quickly
3       no      quick   brown   fox     jumped  over    the     quick   wolf    quickly
4       the     quick   brown   fox     jumped  over    the     quick   wolf    quickly
5       the     quick   BROWN   fox     jumped  over    the     quick   wolf    quickly
6       the     quick   BROWN   fox     jumped  over    the     quick   wolf    quickly
7       the     quick   BROWN   fox     jumped  over    the     quick   wolf    quickly
8       the     quick   BROWN   fox     jumped  over    the     quick   wolf    quickly

You can also use character classes with tr

[user@blue ~]$ tr [:upper:] [:lower:] < testfile.txt
1 a quick brown fox jumped over the quick wolf quickly
2 the quick brown fox jumped over the quick wolf quickly
3 no quick brown fox jumped over the quick wolf quickly
4 the quick brown fox jumped over the quick wolf quickly
5 the quick brown fox jumped over the quick wolf quickly
6 the quick brown fox jumped over the quick wolf quickly
7 the quick brown fox jumped over the quick wolf quickly
8 the quick brown fox jumped over the quick wolf quickly

Notice that you did not need to enclose the character classes in double square brackets as you have done before (e.g. with grep).

tail

The tail command is used to output the last lines of files or input from stdin. Consult the man pages to see the different options it supports.

sort

We have seen the sort command in previous labs. Its purpose is to sort the contents of standard input or one or more files. The default sort order is ascending by character value. Consult the man pages to see the different options it supports.

uniq

We have also seen the uniq command in previous labs. uniq removes duplicated entries from a file. Note that uniq does not sort the input, it detects a duplicate if the previous line was identical to the current line that is processing. In order for uniq to remove duplicates, most likely you need to process the data with sort beforehand. Also note that the GNU version of sort has the -u option to remove duplicates. which renders a subsequent call to uniq unnecessary. However uniq has some interesting options to instead of removing duplicates, actually print duplicates.

Report (20 pts)

1. (4 pts) We saw in Lab No.3 that the file /etc/passwd is used to store user account information. Provide a command that will print a list of full user names (i.e. the first name and last names) of all the users in that file.

2. (4 pt) A python program saved in a file named ‘bad.py` was written using incorrect syntax and we need to change all the square brackets [] with parentheses (). Provide a command that will perform this task.

For example, if the contents of bad.py are:

def bad_function[]:
    print["this does not work"]
    print[[35+4]]

it should produce:

def bad_function():
    print("this does not work")
    print((35+4))
  1. (2 pt) Given a file named “alongfile.txt” which has more than 1000 lines, provide a command that will print lines 100-200 (both inclusive) of this file.
  2. (2 pt) Given a file named “alongfile.txt” which has more than 1000 lines, provide a command that will print the last 100 lines of this file.
  3. (4 pts) Recall the sed example with credit card numbers. Since we are guaranteed a consistent format in the ccnumbers.txt file, we could simplify with this expression:
[user@blue ~]$ sed -E 's/(.{4})/& /g;s/ $//' ccnumbers.txt
4737 4589 6369 9782
4916 8757 6780 9262
4556 8567 1531 0335
4556 6360 0046 9644
4024 0071 2755 3401

Explain (verbally) the script argument of the previous command.

  1. (4 pts) Provide a command that will take a file called input.txt and will output to stdout the line content with the line number prepended. For example, if the contents of input.txt are:
This is line one
This is line two
This is line three

The command output should be:

1 This is line one
2 This is line two
3 This is line three