The Dictionary Page Project

Data Step Features for Creating a Web Page Sequence

Rick Aster

When I created Global Statements Dictionary, the dictionary of SAS words and computer terms that appears on the Global Statements web site, I wrote a SAS program to create the initial empty web pages. The pages contained titles, headers, and each page’s links to the previous and next pages.

Program Files

First program: This program reads a large text sample and generates a list of the first two letters of words. This list determines the pages that are included in the dictionary.

Second program: This program generates the HTML files for the web pages.

Data Files

alpha.txt: This is an excerpt of a sample output file from the first program. This file would be manually edited and used as input to the second program.

a.html (code): This is an example of the generated HTML code.

a.html (web page): This shows what the generated web page could look like with an appropriate CSS file for formatting.


Global Statements Dictionary was envisioned as a computer dictionary for SAS programmers, combining computer terms of interest with the SAS words that make up a major part of SAS programs. After I had planned the scope of the dictionary, I designed its presentation, which included the decision to break it into web pages based on the first two letters of each word. So, for example, the entry for access appears on the AC page, which is the file name ac.html.

My next two questions were: what pages do I need, and what is an easy way to create them? The two SAS programs in this project provide answers to these questions.

The dictionary would only include pages for those two-letter sequences that form the beginnings of actual words. There would also be a page for every initial letter. Theoretically, you could create this list by first creating a list of all SAS words and all words used in discussions of SAS work. In practice, you could create a good approximation by extracting a word list from a large sample of carefully selected text. The ideal document for this purpose would be SAS-related writing that did not contain many abbreviated names. A well-chosen sample of about 1 million words could be sufficient to identify all the two-letter prefixes needed for the dictionary.

If the text sample is placed into a single text file, the first program reads the text and creates a list of the prefixes used for those words. It places the list in the file alpha.txt. After any manual additions and deletions, this file can be used as input for the second program.
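The prefix extraction described above can be sketched roughly as follows. This is a minimal illustration, not the original program: the input file name sample.txt, the one-prefix-per-line layout of alpha.txt, and all variable names are assumptions. It relies on the K modifier of the SCAN and COUNTW functions, which treats every character outside the listed alphabet as a delimiter.

```sas
/* Sketch only: sample.txt, the alpha.txt layout, and all variable
   names are assumptions made for illustration. */
data prefixes (keep=prefix);
   infile 'sample.txt' truncover lrecl=32767;
   length line $ 32767 word $ 40 prefix $ 2;
   input line $char32767.;
   line = lowcase(line);
   /* With the K modifier, anything that is not a letter is a delimiter. */
   do i = 1 to countw(line, 'abcdefghijklmnopqrstuvwxyz', 'k');
      word = scan(line, i, 'abcdefghijklmnopqrstuvwxyz', 'k');
      /* Every word contributes its initial letter ... */
      prefix = substr(word, 1, 1);
      output;
      /* ... and, if it is long enough, its two-letter prefix. */
      if length(word) >= 2 then do;
         prefix = substr(word, 1, 2);
         output;
      end;
   end;
run;

/* Keep one observation per distinct prefix, in alphabetical order. */
proc sort data=prefixes nodupkey;
   by prefix;
run;

/* Write the list as plain text, one prefix per line. */
data _null_;
   set prefixes;
   file 'alpha.txt';
   put prefix;
run;
```

Because a blank sorts before any letter, each single-letter entry lands immediately ahead of the two-letter prefixes that start with it, which is the order the page sequence needs.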

In actual practice, for my text data, I used the list of SAS words from Professional SAS Programmer’s Pocket Reference and a very small text sample of perhaps 20,000 words. I then attempted to fill in any missing prefixes manually. The process I followed is reconstructed in the first program.

Ordinarily, a set of web pages is created from a template, a starting document that contains the common elements of the pages. That approach was insufficient for Global Statements Dictionary; in particular, an automated method was needed to create the links from each page to the preceding and following pages and from each letter page to all the pages that begin with that letter. For example, the A page would contain links to the AB, AC, AD, etc., pages; the AB page would link back to A and forward to AC. The second program writes pages that include these links.
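One way to gather, for each letter page, the list of two-letter pages it should link to is a PROC TRANSPOSE step with a BY statement. The sketch below uses a small made-up page list; the data set names pages and sublist and the variable names page, letter, and sub1, sub2, ... are assumptions for illustration.

```sas
/* Sketch only: data set names, variable names, and the sample
   page list are assumptions made for illustration. */
data pages;
   length page $ 2 letter $ 1;
   input page $;
   letter = substr(page, 1, 1);   /* initial letter of each page */
   datalines;
a
ab
ac
ad
b
ba
be
;
run;

/* Turn the two-letter pages for each letter into one observation
   with variables sub1, sub2, ... (the input is already in letter order). */
proc transpose data=pages (where=(length(page) = 2))
               out=sublist (drop=_name_)
               prefix=sub;
   by letter;
   var page;
run;

proc print data=sublist;
run;
```

With this sample, the observation for letter a holds ab, ac, and ad in sub1 through sub3, and the observation for letter b holds ba and be with sub3 left blank. A later data step can loop over these variables to write the links on each letter page.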

HTML is ASCII text, so it is easy to write a data step that writes HTML code. The challenge is in configuring the data so that all the elements that appear on one page are brought together. The PROC TRANSPOSE step brings together the two-letter prefixes associated with each initial letter. Three SET statements are arranged slightly out of sync so that they identify the current, preceding, and following pages.
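The idea of three SET statements running slightly out of sync can be illustrated with the following sketch. It is a common look-behind and look-ahead pattern rather than a copy of the original program, and the data set and variable names (pages, links, page, prev, next) are assumptions.

```sas
/* Sketch only: data set and variable names are assumptions. */
data pages;
   length page $ 2;
   input page $;
   datalines;
a
ab
ac
b
;
run;

data links;
   /* Current page. */
   set pages end=last;
   /* One observation behind: skipped on the first iteration,
      so prev stays blank for the first page. */
   if _n_ > 1 then set pages (keep=page rename=(page=prev));
   /* One observation ahead: starts at the second observation and is
      skipped on the last iteration so the step does not stop early. */
   if not last then set pages (firstobs=2 keep=page rename=(page=next));
   else next = ' ';   /* values from SET are retained, so reset explicitly */
run;
```

Each SET statement keeps its own position in the data set, so on any iteration the three statements are reading three consecutive observations. Making the second and third statements conditional matters because a data step ends as soon as any SET statement tries to read past the end of its input.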

The revised version of the program shown here is updated to reflect the HTML 4.0 formatting used in Global Statements Dictionary until 2005. It is also optimized somewhat, although the performance improvements are not important as the program runs in a matter of seconds.

The output of the program is a sequence of text files that contain HTML coding.
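Writing one file per page from a single data step can be done with the FILEVAR= option of the FILE statement, as in the sketch below. The markup shown is a bare placeholder and not the actual page layout of Global Statements Dictionary; the data set links, the variable names, and the choice to write files to the current directory are all assumptions.

```sas
/* Sketch only: the markup, names, and output location are assumptions. */
data links;
   infile datalines truncover;
   input page $ 1-2 prev $ 4-5 next $ 7-8;
   datalines;
a     ab
ab a  ac
ac ab
;
run;

data _null_;
   set links;
   length htmlfile $ 40 line $ 200;
   htmlfile = cats(page, '.html');
   /* "out" is only a placeholder fileref; FILEVAR= supplies the real
      file name, and a new file is opened whenever its value changes. */
   file out filevar=htmlfile;
   put '<html>';
   line = cats('<head><title>', upcase(page), '</title></head>');
   put line;
   put '<body>';
   line = cats('<h1>', upcase(page), '</h1>');
   put line;
   /* Link to the preceding and following pages where they exist. */
   if prev ne ' ' then do;
      line = cats('<a href="', prev, '.html">Previous</a>');
      put line;
   end;
   if next ne ' ' then do;
      line = cats('<a href="', next, '.html">Next</a>');
      put line;
   end;
   put '</body>';
   put '</html>';
run;
```

Building each line with CATS before the PUT statement avoids the stray blank that list output would otherwise place after a variable value inside a tag.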