Unix text editing - sed, tr, cut, od

Very short guide to sed, tr, cut and od

Author:  Sakari Mattila
Updated: 14-Sep-1999

Sed and tr are old, simple batch editors found on Unix systems. You can use sed to change any string of printable characters into any other string of printable characters. With tr you can change any single character into any other character, except (in older implementations) the null character (code 0). You use cut to handle fixed-position data and to select parts of a line. When you want to see what really happened, you use od to look at the file character by character.
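As a first taste of the string replacement mentioned above, here is a minimal sketch of a sed substitution. The sample text and the words being swapped are made up for illustration; s/old/new/g is the standard sed substitute command, and the trailing g replaces every occurrence on a line rather than only the first.

```shell
#!/bin/sh
# Replace every "colour" with "color" in the text arriving on standard input.
# The input line here is invented sample data, not from any real file.
printf 'the colour of the colour chart\n' | sed 's/colour/color/g'
# prints: the color of the color chart
```

To edit a file instead, you would give sed the input file name and redirect the output to a different file, in the same way as the tr and cut commands in this guide.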

Sed is in fact a full editor, but nowadays it is mostly used to run simple scripts. You will find full instructions and the sed commands on the sed man page. To get that manual page on a Unix system, type:

> man sed

and press return. Getting the tr man page is similar:

> man tr

and press return. Manual pages for other Unix commands are available in the same way. The > at the beginning of the line is the Unix prompt; on your system it may be a different character or a character string.

A model for an automatic sed script is available. This script changes four characters forbidden in HTML into their proper escaped equivalents.
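The model script itself is not reproduced here, so the following is only a sketch of what such a script might look like. It assumes the four characters are &, <, > and the double quote; the function name html_escape is made up for this example.

```shell
#!/bin/sh
# html_escape: hypothetical helper, not the author's model script.
# Reads text on standard input and writes it with &, <, > and "
# replaced by HTML entities (the choice of these four is an assumption).
# The & rule must come first; otherwise the ampersands produced by the
# later rules would themselves be escaped again.
# In a sed replacement a bare & means "the matched text", so a literal
# ampersand is written as \&.
html_escape() {
    sed -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&quot;/g'
}

# Example with one invented line of input.
printf '%s\n' 'a < b & "c" > d' | html_escape
# prints: a &lt; b &amp; &quot;c&quot; &gt; d
```

To convert a whole file you could run html_escape < filein > fileout, with filein and fileout replaced by your own file names.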

A tr script that removes all non-printing characters from a file is shown below. Non-printing characters may be invisible, but they cause problems when printing the file or sending it by electronic mail. You run it from the Unix command prompt, with everything on one line:

> tr -d '\001'-'\011''\013''\014''\016'-'\037''\200'-'\377' < filein > fileout

This tr script deletes all characters with octal values from 001 to 011, the characters 013 and 014, the characters from 016 to 037 and the characters from 200 to 377. All other characters are copied from filein to fileout; apart from newline (012), carriage return (015) and delete (177), these are printable. Please remember to keep the tr command on one line, however long it becomes; the shell continues a command onto the next line only when the line ends with a backslash. In practice, this script solves some mysterious Unix printing problems.
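To see the effect of this deletion list without touching a real file, you can feed tr a short test string. The sketch below wraps the same list of octal ranges in a function; the name strip_nonprinting and the sample text are invented for this example, and LC_ALL=C is an added assumption so that tr treats the input as single bytes in any locale.

```shell
#!/bin/sh
# strip_nonprinting: hypothetical wrapper around the tr command above,
# with the whole deletion list written as one quoted argument.
# Deletes octal 001-011, 013, 014, 016-037 and 200-377 from standard input.
strip_nonprinting() {
    LC_ALL=C tr -d '\001-\011\013\014\016-\037\200-\377'
}

# Sample input containing a bell (007), an escape (033) and a tab (011).
printf 'Hel\007lo,\033 wor\011ld\n' | strip_nonprinting
# prints: Hello, world
```

Note that the tab is deleted along with the other control characters, while the newline (012) at the end of the line is kept.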

Type the line starting with tr above into a text file named "f127.TR". Display the file on screen with the cat f127.TR command, replace "filein" and "fileout" with your own file names (they must not be the same file), then copy and paste the line and run (execute) it. Please remember that this does not solve the Unix end-of-file problem, that is, the character '\000', also known as a 'null', in the file. Nor does it handle the binary file problem, that is, a file starting with two zeroes, '\060' and '\060'.
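If null characters are the problem, a tr that follows the POSIX specification is expected to accept '\000' in its character list, although older tr versions may not. The sketch below assumes such a tr; the function name strip_nul is made up for this example.

```shell
#!/bin/sh
# strip_nul: hypothetical helper, assuming a POSIX-conforming tr that
# accepts '\000' as the null character. Older tr versions may not.
strip_nul() {
    tr -d '\000'
}

# Sample input with one null byte between "a" and "b".
# wc -c counts the bytes that are left: a, b and the newline.
printf 'a\000b\n' | strip_nul | wc -c
```

If the count shown is 3 rather than 4, the null byte was removed on your system.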

Sometimes invisible characters cause havoc. This tr command line converts tab characters into hashes (#) and form feed characters into stars (*).

> tr '\011\014' '#*'  < filein > fileout

The numeric value of tab is decimal 9, hex 09, octal 011; in C notation it is \t or \011. Form feed is decimal 12, hex 0C, octal 014; in C notation it is \f or \014. Please note that tr replaces each character in the first (leftmost) group with the character in the same position in the second group. A character written in octal format, like \014, counts as one character.
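The position-by-position replacement can be tried on a short test line before running it on a file. In this sketch the input text is invented; printf turns \t into a tab and \f into a form feed.

```shell
#!/bin/sh
# The first character of the first group (tab, 011) becomes the first
# character of the second group (#); the second (form feed, 014) becomes *.
printf 'one\ttwo\fthree\n' | tr '\011\014' '#*'
# prints: one#two*three
```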

The commands for taking only part of the text on a line are awk and cut. With awk you can manipulate lines in several ways. Here is a short awk guide. Cut takes characters at fixed positions on the line, or fields delimited by a field separator character. You use the character version of cut this way:

> cut -c3,10-13 filein > fileout

This cut script takes the characters in positions 3, 10, 11, 12 and 13 from each line, handling one line at a time, and puts these characters into the output file fileout. The -c option puts cut into character mode, a comma (,) separates the numbers indicating individual character positions and a hyphen (-) gives a position range.
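Both modes of cut can be checked on a known line of text. In the sketch below the alphabet makes the character positions easy to count, and the second command shows the field version with the standard -d (separator) and -f (field list) options; the colon-separated sample line is invented.

```shell
#!/bin/sh
# Character mode: position 3 is "c", positions 10 to 13 are "jklm".
printf 'abcdefghijklmnopqrstuvwxyz\n' | cut -c3,10-13
# prints: cjklm

# Field mode: -d sets the field separator, -f selects fields by number.
printf 'alice:x:1001:users\n' | cut -d: -f1,3
# prints: alice:1001
```

In field mode, cut writes the selected fields with the same separator between them.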

When you want to see the contents of your file character by character, use od, which produces output called an octal dump. Try od first with a small text file (tens of characters only), then use it with larger files:

> od -c < filein

You see your file with each character in its own column, padded with spaces. Non-printable characters appear either as octal codes or in C notation, like \t or \r. The explanation is in the od manual; use man od to look at it.
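A quick way to get used to the dump format is to give od a line whose contents you already know. The input in this sketch is invented; it contains a tab, a carriage return and a newline.

```shell
#!/bin/sh
# Dump five known bytes: a, tab, b, carriage return, newline.
# In the output the letters appear as themselves and the three control
# characters appear as \t, \r and \n. The number at the start of each
# output line is the offset into the input, in octal.
printf 'a\tb\r\n' | od -c
```

The exact column spacing may differ between od versions, so it is not shown here.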

Unix shell scripts can be used to manipulate text. There are several programmable editors, of which the best known is Emacs. Perl (www.phlab.missouri.edu/perl/perlcourse.html) is a still more powerful text manipulator, but it is also more difficult to program. The ultimate text manipulators are programs written in Ada, C, C++ or other programming languages.

You will need some knowledge of Unix regular expressions when using sed. (tr works with lists and ranges of characters rather than regular expressions.) There is an introduction to regular expressions at the end of the Short awk guide.

Unix man pages are a fairly good guide to sed; Dale Dougherty's and Arnold Robbins' book sed & awk (ISBN 1-56592-225-5, 1990-1995) is better.

sed, tr and cut come with all Unix and Linux operating systems. They are command-line utilities. sed, tr and cut are also included in several Unix-like utility packages for MS Windows 95/98, MS Windows NT and other operating systems. Cygwin (http://sourceware.cygnus.com/cygwin) is one source of these packages. Source code in C is available with full Linux packages and GNU packages.

Sakari.Mattila@canberra.edu.au