Formats

[This is a special Blast from the Past post where I republish Randal Schwartz’s original “Formats” chapter from the first edition of Learning Perl. I’ve really liked this output feature which has mostly been left behind by the online world that doesn’t do physical pages. It hasn’t been worth the 10 or so pages it would take up in the print version of the book, so I present it here mostly as it appeared—historical warts and all.]

What Is a Format?

Perl stands, among other things, for “Practical Extraction and Report Language.” It’s time to learn about that “…Report language” business.

Perl provides the notion of a simple report writing template, called a format. A format defines a constant part (the column headers, labels, fixed text, or whatever) and a variable part (the current data you’re reporting). The shape of the format is very close to the shape of the output, similar to formatted output in COBOL or the print using clauses of some BASICs.

Using a format consists of doing three things:

  1. Defining a format
  2. Loading up the data to be printed into the variable portions of the format (fields)
  3. Invoking the format

Most often, the first step is done once (in the program text so that it gets defined at compile-time), and the other two steps are performed repeatedly. (You can also create formats at run-time using the eval function, as described in Programming Perl and in the perlform(1) manpage.)

Defining a Format

A format is defined using a format definition. This format definition can appear anywhere in your program text, like a subroutine. A format definition looks like this:

format someformatname =
fieldline
value_one, value_two, value_three
fieldline
value_one, value_two
fieldline
value_one, value_two, value_three
.

The first line contains the reserved word format, followed by the format name and then an equal sign (=). The format name is chosen from yet another namespace, and follows the same rule as everything else. Because format names are never used within the body of the program (except within string values), you can safely use names that are identical to reserved words. As you’ll see in the next section, “Invoking a Format”, most of your format names will probably be the same as filehandle names (which then makes them not the same as reserved words… oh well).

Following the first line comes the template itself, spanning zero or more text lines. The end of the template is indicated by a line consisting of a single dot by itself. (In text files, the last line needs to end with a newline to work properly.) Templates are sensitive to whitespace; this is one of the few places where the kind and amount of whitespace (space, newline, or tab) matters in the text of a Perl program.

The template definition contains a series of fieldlines. Each fieldline may contain fixed text—text that will be printed out literally when the format is invoked. Here’s an example of a fieldline with fixed text:

Hello, my name is Fred Flintstone.

Fieldlines may also contain fieldholders for variable text. If a line contains fieldholders, the following line of the template (called the value line) dictates a series of scalar values—one per fieldholder—that provide the values that will be plugged into the fields. Here’s an example of a fieldline with one fieldholder and the value line that follows:

Hello, my name is @<<<<<<<<<<
$name

The fieldholder is the @<<<<<<<<<<, which specifies a left-justified text field with 11 characters. More complete details about fieldholders will be given in the upcoming section, "More About the Fieldholders".

If the fieldline has multiple fieldholders, it needs multiple values, so the values are separated on the value line by commas:

Hello, my name is @<<<<<<<<<< and I'm @<< years old.
$name, $age

Putting all this together, we can create a simple format for an address label:

format ADDRESSLABEL =
===============================
| @<<<<<<<<<<<<<<<<<<<<<<<<<< |
$name
| @<<<<<<<<<<<<<<<<<<<<<<<<<< |
$address
| @<<<<<<<<<<<<<<<<, @< @<<<< |
$city,          $state, $zip
===============================
.

Note that the lines of equal signs at the top and bottom of the format have no fields and thus have no value lines following. (If you put a value line following such a fieldline, it will be interpreted as another fieldline, probably not doing what you want.)

Whitespace within the value line is ignored. Some people choose to use additional whitespace in the value line to line up the variable with the fieldholder on the preceding line (such as putting $zip underneath the third field of the previous line in this example), but that's just for looks. Perl doesn't care, and it doesn't affect your output.

Text after the first newline in a value is discarded (except in the special case of multiline fieldholders, described later).
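As a small illustration of the last two points, here is a sketch that feeds a value containing a newline into an ordinary text field. It uses the write function, covered in the next section, to invoke the format. The in-memory filehandle (an open onto a reference to $buffer) is an assumption of this sketch and not part of the original chapter; it is there only so the result can be inspected as a string.

```perl
use strict;
use warnings;

# Package variables, so the format can refer to them by name.
our $name = "Fred\nFlintstone";   # note the embedded newline
our $age  = 30;

format GREETING =
Hello, my name is @<<<<<<<<<< and I'm @<< years old.
$name,                                $age
.

# Assumption: an in-memory filehandle, used only to capture the output.
my $buffer = '';
open(GREETING, '>', \$buffer) or die "can't open in-memory handle: $!";
write(GREETING);    # fills in the GREETING format and sends it to GREETING
close(GREETING);

print $buffer;      # "Flintstone" never appears; each field keeps its full width
```

The extra spaces in front of $age on the value line are only there to line it up with its fieldholder. Because each field keeps its declared width, the output has several spaces after "Fred" and two spaces after "30".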

A format definition is like a subroutine definition. It doesn't contain immediately executed code, and can therefore be placed anywhere in the file with the rest of the program. We tend to put ours toward the end of the file, ahead of our subroutine definitions.

Invoking a Format

You invoke a format with the write function. This function takes the name of a filehandle and generates text for that filehandle using the current format for that filehandle. By default, the current format for a filehandle is a format with the same name (so for the STDOUT filehandle, the STDOUT format is used), but we'll soon see that you can change it.

Let's take another look at that address label format, and create a file full of address labels. Here's a program segment:

format ADDRESSLABEL =
===============================
| @<<<<<<<<<<<<<<<<<<<<<<<<<< |
$name
| @<<<<<<<<<<<<<<<<<<<<<<<<<< |
$address
| @<<<<<<<<<<<<<<<<, @< @<<<< |
$city,          $state, $zip
===============================
.


open(ADDRESSLABEL,">labels-to-print") || die "can't create";
open(ADDRESSES,"addresses") || die "cannot open addresses";
while (<ADDRESSES>) {
    chomp; # remove newline
    ($name,$address,$city,$state,$zip) = split(/:/);
        # load up the global variables
    write (ADDRESSLABEL); # send the output
}

Here we see our previous format definition, but now we also have some executable code. First, we open a filehandle onto an output file, which is called labels-to-print. Note that the filehandle name (ADDRESSLABEL) is the same as the name of the format. This is important. Next, we open a filehandle on an address list. The format of the address list is presumed to be something like this:

Stonehenge:4470 SW Hall Suite 107:Beaverton:OR:97005
Fred Flintstone:3737 Hard Rock Lane:Bedrock:OZ:999bc

In other words, five colon-separated fields, which our code parses as described below.

The while loop in the program reads each line of the address file, gets rid of the newline, and then splits the remainder into five variables. Note that the variable names are the same names as the ones we used when we defined the format. This, too, is important.

Once we have all of the variables loaded up (so that the values used by the format are correct), the write function invokes the format. Note that the parameter to write is the filehandle to be written to, and by default, the format of the same name is also used.

Each field in the format is replaced with the corresponding value from the next line of the format. After the two sample records given above are processed, the file labels-to-print contains:

===============================
| Stonehenge                  |
| 4470 SW Hall Suite 107      |
| Beaverton        , OR 97005 |
===============================
===============================
| Fred Flintstone             |
| 3737 Hard Rock Lane         |
| Bedrock          , OZ 999bc |
===============================

More About the Fieldholders

So far, by example, you know that the fieldholder @<<<< means a five-character left-justified field and that @<<<<<<<<<< means an 11-character left-justified field. Here's the whole scoop, as promised earlier.

Text Fields

Most fieldholders start with @. The characters following the @ indicate the type of field, while the number of characters (including the @) indicates the field width.

If the characters following the @ are left-angle brackets (<<<<), you get a left-justified field; that is, the value will be padded on the right with spaces if the value is shorter than the field width. (If a value is too long, it's truncated automatically; the layout of the format is always preserved.)

If the characters following the @ are right-angle brackets (>>>>), you get a right-justified field; that is, if the value is too short, it gets padded on the left with spaces.

Finally, if the characters following the @ are vertical bars (||||), you get a centered field: if the value is too short, it gets padded on both sides with spaces, enough on each side to make the value mostly centered within the field.
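To see the three justifications side by side, here is a sketch with four fieldholders, each eight characters wide. The format name JUSTIFY, the variables, and the square brackets marking the field edges are made up for this illustration, and the in-memory filehandle is an assumption used only to capture the text.

```perl
use strict;
use warnings;

our $word = "Fred";          # shorter than the field, so it gets padded
our $long = "Flintstone";    # longer than the field, so it gets truncated

# Every fieldholder below is eight characters wide, counting the @.
format JUSTIFY =
[@<<<<<<<] left
$word
[@>>>>>>>] right
$word
[@|||||||] centered
$word
[@<<<<<<<] truncated
$long
.

# Assumption: an in-memory filehandle, used only to capture the output.
my $buffer = '';
open(JUSTIFY, '>', \$buffer) or die "can't open in-memory handle: $!";
write(JUSTIFY);
close(JUSTIFY);

print $buffer;
# [Fred    ] left
# [    Fred] right
# [  Fred  ] centered
# [Flintsto] truncated
```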

Numeric Fields

Another kind of fieldholder is a fixed-precision numeric field, useful for those big financial reports. This field also begins with @, and is followed by one or more #'s with an optional dot (indicating a decimal point). Once again, the @ counts as one of the characters of the field. For example:

format MONEY =
Assets: @#####.## Liabilities: @#####.## Net: @#####.##
$assets, $liabilities, $assets-$liabilities
.

The three numeric fields allow for six places to the left of the decimal place, and two to the right (useful for dollars and cents). Note the use of an expression in the format—perfectly legal and frequently used.
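Here is the same MONEY format wrapped in just enough code to run it. The sample amounts are made up for this sketch, and the in-memory filehandle is an assumption used only to capture the text.

```perl
use strict;
use warnings;

our ($assets, $liabilities) = (32125.12, 45212.15);   # sample values

format MONEY =
Assets: @#####.## Liabilities: @#####.## Net: @#####.##
$assets, $liabilities, $assets-$liabilities
.

# Assumption: an in-memory filehandle, used only to capture the output.
my $buffer = '';
open(MONEY, '>', \$buffer) or die "can't open in-memory handle: $!";
write(MONEY);
close(MONEY);

print $buffer;   # each amount is right-justified in nine columns
```

Each value comes out rounded to two decimal places and right-justified within its nine-character field, so the negative net amount fills its field exactly.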

Perl provides nothing fancier than this; you can't get floating currency symbols or brackets around negative values or anything interesting. To do that, you have to write your own spiffy subroutine, like so:

format MONEY =
Assets: @<<<<<<<<< Liabilities @<<<<<<<< Net: @<<<<<<<<<
&pretty($assets,10), &pretty($liab,9), &pretty($assets-$liab,10)
.

sub pretty {
    my($n,$width) = @_;
    $width -= 2; # back off for negative stuff
    $n = sprintf("%.2f",$n); # sprintf is in later chapter
    if ($n < 0) {
        return sprintf("[%$width.2f]", -$n);
            # negative numbers get brackets
    } else {
        return sprintf(" %$width.2f ", $n);
            # positive numbers get spaces instead
    }
}

## body of program:
$assets = 32125.12; 
$liab = 45212.15; 
write (MONEY);

Multiline Fields

As mentioned earlier, Perl normally stops at the first newline of a value when placing the result into the output. One kind of fieldholder, the multiline fieldholder, allows you to include a value that may have many lines of information. This fieldholder is denoted by @* on a line by itself: as always, the following line defines the value to be substituted into the field, which in this case may be an expression that results in a value containing many newlines.

The substituted value will look just like the original text: four lines of value become four lines of output. For example:

format STDOUT =
Text Before.
@*
$long_string
Text After.
.

$long_string = "Fred\nBarney\nBetty\nWilma\n";
write;

generates the output:

Text Before.
Fred
Barney
Betty
Wilma
Text After.

Filled Fields

Another kind of fieldholder is a filled field. This fieldholder allows you to create a filled paragraph, breaking the text into conveniently sized lines at word boundaries, wrapping the lines as needed. There are a few parts that work together here, but let's look at them separately.

First, a filled field is denoted by replacing the @ marker in a text fieldholder with a caret (so you get ^<<<, for example). The corresponding value for a filled field (on the following line of the format) must be a scalar variable containing text (this includes a single scalar element of an array or hash, like $a[3] or $h{"fred"}), rather than an expression that returns a scalar value. The reason for this is that Perl will alter the variable while filling the filled field, and it's pretty hard to alter an expression.

When Perl is filling the filled field, it takes the value of the variable and grabs as many words as will fit into the field, using a reasonable definition of "word" (the word separator characters are defined by the $: variable). These words are actually ripped out of the variable; the value of the variable after filling this field is whatever is left over after removing the words. You'll see why in a minute.

So far, this isn't much different from how a normal text field works; we're printing only as much as will fit (except that we're respecting a word boundary rather than just cutting it off at the field width). The beauty of this filled field appears when you have multiple references to the same variable in the same format. Take a look at this:

format PEOPLE =
Name: @<<<<<<<<<<<<< Comment: ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      $name,                  $comment
                              ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                              $comment
                              ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                              $comment
                              ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                              $comment
.

Note that the variable $comment appears four times. The first line (the one with the name field) prints the person's name and the first few words of the value in $comment. But in the process of computing this line, $comment is altered so that the words disappear. The second line once again refers to the same variable ($comment), and so will take the next few words from the same variable. This is also true for the third and fourth lines. Effectively, what we've created is a rectangle in the output that will be filled as best it can with the words from $comment spread over four lines.

What happens if the complete text occupies less than four lines? Well, you'll get a blank line or two. This is probably OK if you are printing out labels and need exactly the same number of lines for each entry to match them up with the labels. But if you are printing out a report, many blank lines merely use up your printer paper budget.

To fix this, use the suppression indicator. Any line that contains a tilde (~) character is suppressed (not output) if the line would have otherwise printed blank (just whitespace). The tilde itself always prints as a blank and can be placed anywhere a space could have been placed in the line. Rewriting that last example:

format PEOPLE =
Name: @<<<<<<<<<<<<< Comment: ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      $name,                  $comment
~                             ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                              $comment
~                             ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                              $comment
~                             ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                              $comment
.

Now, if the comment covers only two lines, the third and fourth lines are automatically suppressed.

What if the comment is more than four lines? Well, we could make about 20 copies of the last two lines of that format, hoping that 20 lines will cover it. But that goes against the idea that Perl helps you to be lazy, so there's a lazy way to do it. Any line that contains two consecutive tildes will be repeated automatically until the result is a completely blank line. (The blank line is suppressed.) This changes our format to look like this:

format PEOPLE =
Name: @<<<<<<<<<<<<< Comment: ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      $name,                  $comment
~~                            ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                              $comment
.

This way, if the comment takes one line, two lines, or 20 lines, we are still OK.

Note that the criterion for stopping the repeated line requires the line to be blank at some point. That means you probably don't want any constant text (other than blanks or tildes) on the line, or else it will never become blank.
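A runnable sketch of the repeating version follows. The sample text is invented for the illustration, and the in-memory filehandle is an assumption used only to capture the output. Because filling consumes the variable, the sketch keeps the original text in $text and lets the format chew on a copy in $comment.

```perl
use strict;
use warnings;

our $name = "Fred";
my $text  = "Works at the quarry, bowls on weekends, and is a long-time "
          . "member of the Water Buffalo lodge.";
our $comment = $text;   # filling consumes $comment, so work on a copy

format PEOPLE =
Name: @<<<<<<<<<<<<< Comment: ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      $name,                  $comment
~~                            ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                              $comment
.

# Assumption: an in-memory filehandle, used only to capture the output.
my $buffer = '';
open(PEOPLE, '>', \$buffer) or die "can't open in-memory handle: $!";
write(PEOPLE);
close(PEOPLE);

print $buffer;
print "left in \$comment: '$comment'\n";   # nothing is left once every word is placed
```

The exact line breaks depend on the word separators in $:, so the sketch makes no claim about where each line ends, only that the comment is spread over as many lines as it needs.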

The Top-of-Page Format

Many reports end up on some hardcopy device, like a printer. Printer paper is generally clipped into page-size chunks, because most of us stopped reading paper in scrolls a long time ago. So the text being fed to a printer typically has to take page boundaries into consideration by putting in blank lines or formfeed characters to skip across the perforations. Now, you could take the output of a Perl program and feed it through some utility (maybe even one written in Perl) that does this pagination, but there's an easier way.

Perl allows you to define a top-of-page format that triggers a page-processing mode. Perl counts each line of output generated by any format invocation to a particular filehandle. When the next format output cannot fit on the remainder of the current page, Perl spits out a formfeed followed by an automatic invocation of the top-of-page format, and finally the text from the invoked format. That way, the result of one write invocation will never be split across page boundaries (unless it is so large that it won't even fit on a page by itself).

The top-of-page format is defined just like any other format. The default name of a top-of-page format for a particular filehandle is the name of the filehandle followed by _TOP (in uppercase only).

Perl defines the variable $% to be the number of times the top-of-page format has been called for a particular filehandle, so you can use this variable in your top-of-page format to number the pages properly. For example, adding the following format definition to the previous program fragment prevents labels from being broken across page boundaries and also numbers consecutive pages:

format ADDRESSLABEL_TOP =
My Addresses -- Page @<
                     $%
.

The default page length is 60 lines. You can change this by setting a special variable, described shortly.

Perl doesn't notice whether you also print to the same filehandle, so that might throw the number of lines on the current page off a bit. You can either rewrite your code to use formats to send everything or fudge the "number of lines on the current page" variable after you do your print. In a moment, we'll see how to change this value.

Changing Defaults for Formats

We have often referred to the "default" for this or that. Well, Perl provides a way to override the defaults for just about every step. Let's talk about these.

Using select() to Change the Filehandle

Back when we talked about print, in Chapter 6, "Basic I/O", I mentioned that print and print STDOUT were identical, because STDOUT was the default for print. Not quite. The real default for print (and write, and a few other operations that we'll get to in a moment) is an odd notion called the currently selected filehandle.

The currently selected filehandle starts out as STDOUT, which makes it easy to print things on the standard output. However, you can change the currently selected filehandle with the select function. This function takes a single filehandle (or a scalar variable containing the name of a filehandle) as an argument. Once the currently selected filehandle is changed, it affects all future operations that depend on the currently selected filehandle. For example:

print "hello world\n";       # like print STDOUT "hello world\n";
select (LOGFILE);            # select a new filehandle
print "howdy, world\n";      # like print LOGFILE "howdy, world\n";
print "more for the log\n";  # more for LOGFILE
select (STDOUT);             # re-select STDOUT
print "back to stdout\n";    # this goes to standard output

Note that the select operation is sticky; once you've selected a new handle, it stays in effect until the next select.

So, a better definition for STDOUT with respect to print and write is that STDOUT is the default currently selected handle, or the default handle.

Subroutines may find a need to change the currently selected filehandle. However, it would be shocking to call a subroutine and then find out that all of your carefully crafted text lines were going into some bit bucket because the subroutine changed the currently selected filehandle without restoring it. So what's a well-behaved subroutine to do? If the subroutine knows that the current handle is STDOUT, the subroutine can restore the selected handle with code similar to that above. However, what if the caller of the subroutine had already changed the selected filehandle?

Well, it turns out that the return value from select is a string containing the name of the previously selected handle. You can capture this value to restore the previously selected filehandle later, using code like this:

$oldhandle = select LOGFILE;
print "this goes to LOGFILE\n";
select ($oldhandle); # restore the previous handle

Yes, for these examples, it's much easier simply to put LOGFILE explicitly as the filehandle for the print, but there are some operations that require the currently selected filehandle to change, as we will soon see.
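The save-and-restore idea can be wrapped up in a well-behaved subroutine along these lines. The subroutine name log_line is hypothetical, and the in-memory LOGFILE handle is an assumption of this sketch so that it runs without touching a real file.

```perl
use strict;
use warnings;

# Assumption: LOGFILE is an in-memory filehandle so the sketch is self-contained.
my $log = '';
open(LOGFILE, '>', \$log) or die "can't open in-memory handle: $!";

# Hypothetical helper: prints one line to LOGFILE without disturbing the caller.
sub log_line {
    my ($text) = @_;
    my $oldhandle = select(LOGFILE);   # remember whatever was selected
    print "$text\n";                   # goes to LOGFILE
    select($oldhandle);                # put the caller's handle back
}

my $before = select();   # with no argument, select just reports the current handle
log_line("this goes to LOGFILE");
my $after  = select();

close(LOGFILE);
print "still printing to $after\n";    # same handle as before the call
```

The caller never has to know that log_line changed the selected handle, whatever handle happened to be selected at the time of the call.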

Changing the Format Name

The default format name for a particular filehandle is the same as the filehandle. However, you can change this for the currently selected filehandle by setting the new format name to a special variable called $~. You can also examine the value of the variable to see what the current format is for the currently selected filehandle.

For example, to use the ADDRESSLABEL format on STDOUT, it's as easy as:

$~ = "ADDRESSLABEL";

But what if you want to set the format for the REPORT filehandle to SUMMARY? Just a few steps to do it here:

$oldhandle = select REPORT;
$~ = "SUMMARY";
select ($oldhandle);

The next time we say

write (REPORT);

we get text out on the REPORT filehandle but using the SUMMARY format. (The object-oriented FileHandle module, part of the Perl standard distribution, provides a simpler way to accomplish the same thing.)

Note that we saved the previous handle into a scalar variable and then restored it later. This is good programming practice. In fact, in production code we probably would have handled the previous one-line example similarly and not assumed that STDOUT was the default handle.

By setting the current format for a particular filehandle, you can interleave many different formats in a single report.
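Here is a sketch of that interleaving: detail lines followed by a summary line, all on one REPORT filehandle. The DETAIL format, the sample rows, and the in-memory filehandle are assumptions made for the illustration. REPORT stays selected while the formats are switched, since $~ applies to the currently selected filehandle.

```perl
use strict;
use warnings;

our ($item, $amount, $total);

format DETAIL =
@<<<<<<<<< @###.##
$item,     $amount
.

format SUMMARY =
Total:     @###.##
$total
.

# Assumption: an in-memory filehandle, used only to capture the output.
my $buffer = '';
open(REPORT, '>', \$buffer) or die "can't open in-memory handle: $!";

my $oldhandle = select(REPORT);   # $~ applies to the selected handle
$~ = "DETAIL";
$total = 0;
for my $row (["Widgets", 12.50], ["Gadgets", 7.25]) {
    ($item, $amount) = @$row;
    $total += $amount;
    write;                        # REPORT is selected, so this writes to REPORT
}
$~ = "SUMMARY";                   # switch formats for the last line
write;
select($oldhandle);               # restore the previous handle

close(REPORT);
print $buffer;
```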

Changing the Top-of-Page Format Name

Just as we can change the name of the format for a particular filehandle by setting the $~ variable, we can change the top-of-page format by setting the $^ variable. This variable holds the name of the top-of-page format for the currently selected filehandle and is read/write, meaning that you can examine its value to see the current format name, and you can change it by assigning to it.

Changing the Page Length

If a top-of-page format is defined, the page length becomes important. By default, the page length is 60 lines; that is, when a write won't fit by the end of line 60, the top-of-page format is invoked automatically before printing the text.

Sometimes 60 lines isn't right. You can change this by setting the $= variable. This variable holds the current page length for the currently selected filehandle. Once again, to change it for a filehandle other than STDOUT (the default currently selected filehandle), you'll need to use the select() operator. Here's how to change the LOGFILE filehandle to have 30-line pages:

$old = select LOGFILE; # select LOGFILE and save old handle
$= = 30;
select $old;

Changing the page length won't have any effect until the next time the top-of-page format is invoked. If you set it before any text is output to a filehandle through a format, it'll work just fine because the top-of-page format is invoked immediately at the first write.
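The following sketch puts the top-of-page format, the page number in $%, and a short page length together. The record text and the five-line page are made up for the illustration, and the in-memory filehandle is an assumption used only to capture the output. REPORT stays selected for the whole loop, so $= and $% both refer to REPORT.

```perl
use strict;
use warnings;

our $line;

format REPORT_TOP =
Page @<
$%
.

format REPORT =
@<<<<<<<<<<<<<<<
$line
.

# Assumption: an in-memory filehandle, used only to capture the output.
my $buffer = '';
open(REPORT, '>', \$buffer) or die "can't open in-memory handle: $!";

my $oldhandle = select(REPORT);   # $= and $% refer to the selected handle
$= = 5;                           # five-line pages, set before the first write
for my $n (1 .. 6) {
    $line = "record $n";
    write;                        # REPORT is selected, so this writes to REPORT
}
select($oldhandle);
close(REPORT);

my @pages = split /\f/, $buffer;  # pages are separated by a formfeed
print scalar(@pages), " pages\n";
print $buffer;
```

The header line counts against the page length, so with five-line pages each page has room for the header and four records; the six records therefore spill onto a second page.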

Changing the Position on the Page

If you print your own text to a filehandle, it messes up the page-position line count because Perl isn't counting lines for anything but a write. If you want to let Perl know that you've output a few extra lines, you can adjust Perl's internal line count by altering the $- variable. This variable contains the number of lines left on the current page on the currently selected filehandle. Each write decrements the lines remaining by the lines actually output. When this count reaches zero, the top-of-page format is invoked, and the value of $- is then copied from $= (the page length).

For example, to tell Perl that you've sent an extra line to STDOUT, do something like this:

write; # invoke STDOUT format on STDOUT
...;
print "An extra line... oops!\n"; # this goes to STDOUT
$- --; # decrement $- to indicate non-write line went to STDOUT
...;

write; # this will still work, taking extra line into account

At the beginning of the program, $- is set to zero for each filehandle. This ensures that the top-of-page format will be the first thing invoked for each filehandle upon the first write.

Exercises

  1. Write a program to open the /etc/passwd file by name and print out the username, user ID (number), and real name in formatted columns. Use format and write.
  2. Add a top-of-page format to the previous program. (If your password file is relatively short, you might need to set the page length to something like 10 lines so that you can get multiple instances of the top of the page.)
  3. Add a sequentially increasing page number to the top of the page, so that you get page 1, page 2, and so on, in the output.

Exercise Answers

Exercise 1

Here's one way to do it:

open(PW,"/etc/passwd") || die "How did you get logged in?";
while (<PW>) {
    ($user,$uid,$gcos) = (split /:/)[0,2,4];
    ($real) = split /,/,$gcos;
    write;
}


format STDOUT =
@<<<<<<< @>>>>>> @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$user, $uid, $real
.

The first line opens the password file. The while loop processes the password file line by line. Each line is torn apart (with colon delimiters), loading up the scalar variables. The real name of the user is pulled out of the GCOS field. The final statement of the while loop invokes write to display all of the data.

The format for the STDOUT filehandle defines a simple line with three fields. The values come from the three scalar variables that are given values in the while loop.

Exercise 2

Here's one way to do it:

# append to program from the first problem...
format STDOUT_TOP =
Username User ID Real Name
======== ======= =========
.

All it takes to get page headers for the previous program is to add a top-of-page format. Here, we put column headers on the columns.

To get the columns to line up, we copied the text of format STDOUT and used overstrike mode in our text editor to replace @<<< fields with ==== bars. That's the nice thing about the one-character-to-one-character correspondence between a format and the resulting display.

Exercise 3

Here's one way to do it:

# append to program from the first problem...
format STDOUT_TOP =
Page @<<<
$%

Username User ID Real Name
======== ======= =========
.

Well, here again, to get stuff at the top of the page, I've added a top-of-page format. This format also contains a reference to $%, which gives me a page number automatically.
