SAS in context
- Software categories
  - Business intelligence
  - Analytics
  - Data warehousing
  - ETL (extract, transform, and load)
- Software comparisons
  - Database management system
  - Development tools
  - Spreadsheet
- Programming language categories
  - Procedural
  - 3GL vs. 4GL
  - Interpreted and compiled
Business intelligence
The computing world places SAS in the business intelligence software category.
What is business intelligence? CSIRO offers this answer:
Geography
SAS is used worldwide, and its use is especially concentrated in commercial and research centers. You can get a general idea of where SAS is used most from the locations of SAS offices:
SAS world headquarters is in Cary, NC, in the area known as the Research Triangle.
Computing standards in SAS
- ASCII, a standard character set considered the basic character set for computing and communications
- Unicode, a comprehensive standard character set, widely used on the Internet
- EBCDIC, an IBM character set
- SQL, a common database access language
- XML, a markup language for data exchange
- HTML, the markup language of web pages
- Regular expression, a text search language
Character sets: ASCII, Unicode, EBCDIC
A character set makes it possible for a computer to display text. The character set, or encoding, tells how digital values are converted to the characters that make up text data.
For example, the ASCII character set says that the digital value 103 indicates the lowercase letter g. When a program that displays ASCII text finds a 103 in a byte, it displays the letter g.
To use character data correctly on a computer, you need to know what character set is used to encode the text. Fortunately, to keep things simple, the vast majority of commercial data and computer programming is done in the ASCII character set. EBCDIC and Unicode are two other character sets that are specifically supported in SAS.
ASCII is the original character set of computer networks. Although it is a standard in its own right, it is essentially the same as the Basic Latin script of Unicode. The document below provides an encoding chart and a formal name for each character:
http://www.unicode.org/charts/PDF/U0000.pdf (PDF with restricted permissions)
The ASCII character set contains control characters and the following visible characters (mapped from the numeric ranges indicated):
33–47: !"#$%&'()*+,-./
48–57: 0123456789
58–64: :;<=>?@
65–90: ABCDEFGHIJKLMNOPQRSTUVWXYZ
91–96: [\]^_`
97–122: abcdefghijklmnopqrstuvwxyz
123–126: {|}~
Other important ASCII codes are 32, for the space character, which provides the space between words; 9, the tab (HT, or horizontal tab) character, which may separate columns or fields; and 0, the null character, which is sometimes used as a terminator for a text data value.
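The mapping between code values and characters described above can be checked outside SAS. The following is a minimal sketch in Python, used here only for illustration; the helper name ascii_range is hypothetical and not part of any SAS or Python library.

```python
def ascii_range(first, last):
    """Return the characters whose ASCII codes run from first to last, inclusive."""
    return bytes(range(first, last + 1)).decode("ascii")

# Rebuild some of the visible-character rows from their numeric ranges.
print(ascii_range(33, 47))    # punctuation
print(ascii_range(48, 57))    # digits
print(ascii_range(65, 90))    # uppercase letters
print(ascii_range(97, 122))   # lowercase letters

# A byte holding the value 103 decodes to the lowercase letter g.
print(bytes([103]).decode("ascii"))

# Code values of the space, tab, and null characters.
print(ord(" "), ord("\t"), ord("\0"))
```

Decoding with the "ascii" codec fails for any value above 127, which is a convenient way to confirm that a range really lies inside ASCII.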
Unicode is a painstaking collection of all characters that are widely used. The most common Unicode encoding, especially on the Internet, is UTF-8 (corresponding to the SAS format $UTF8X). SAS currently cannot display UTF-8 text data, but it can process data files that contain UTF-8 text data.
When you use only ASCII characters, UTF-8 text data is identical to ASCII text data. To represent a non-ASCII character, UTF-8 uses a sequence of 2 to 4 bytes. The possibility of a character using more than one byte makes it more difficult to sort, measure, and modify UTF-8 character data in a SAS program.
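The difference between counting characters and counting bytes can be seen in a short sketch. It is written in Python rather than SAS, purely as an illustration of the encoding itself; the sample characters are arbitrary choices.

```python
# One ASCII character and three non-ASCII characters, written as escapes
# so the source text itself stays plain ASCII.
samples = ["g", "\u00e9", "\u20ac", "\U0001F600"]   # g, e-acute, euro sign, an emoji

byte_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} takes {len(encoded)} byte(s): {encoded.hex()}")

# ASCII-only text produces the same bytes in both encodings.
same_bytes = "SAS".encode("utf-8") == "SAS".encode("ascii")

# A four-character word that occupies five bytes in UTF-8.
word = "caf\u00e9"
char_count = len(word)
byte_count = len(word.encode("utf-8"))
print(char_count, byte_count)
```

Any operation that measures or cuts text by byte position, such as a fixed-width substring, can split a multi-byte character in the middle, which is the difficulty the paragraph above refers to.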
Unicode is maintained by the Unicode Consortium:
On the Unicode web site, start with “What is Unicode?” to get a general understanding of Unicode and character encodings. Look at “Code Charts” for reference material on Unicode characters.
A formal definition of UTF-8 can be found in:
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (PDF)
In this document, look for UTF-8 in sections 3.9 and 3.10.
The EBCDIC character set was created by IBM to make it easier to operate computers using punch cards, a common scenario prior to about 1977. IBM still uses EBCDIC in its larger computer systems. The following chart shows both the ASCII and EBCDIC character sets:
http://www.natural-innovations.com/computing/asciiebcdic.html
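To see how the same character gets a different code value in each character set, here is a small sketch in Python. It assumes code page 037 (the cp037 codec, a US/Canada EBCDIC variant) as a stand-in for EBCDIC; other EBCDIC variants assign some characters differently, and the helper name code_values is hypothetical.

```python
# Assumption: code page 037 stands in for "EBCDIC"; other EBCDIC variants differ.
EBCDIC_VARIANT = "cp037"

def code_values(ch):
    """Return (ASCII code, EBCDIC code) for one ASCII-range character."""
    return ch.encode("ascii")[0], ch.encode(EBCDIC_VARIANT)[0]

for ch in ["g", "G", "0", " "]:
    ascii_code, ebcdic_code = code_values(ch)
    print(f"{ch!r}: ASCII {ascii_code:3d}  EBCDIC {ebcdic_code:3d}")
```

Because the code values differ, a file written on an EBCDIC system has to be converted, not merely copied, before its text can be read as ASCII.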
The ASCII character set is commonly used with added characters, and there are several variations on the EBCDIC character set. To see what ASCII and EBCDIC might look like in your SAS environment, you can run SAS programs that generate all possible characters, as shown here:
SQL
SQL is a formalized database management language originally developed at IBM. Wikipedia describes its purpose and background:
http://en.wikipedia.org/wiki/SQL
SQL is not a strict standard, so it is typically not possible to take an SQL program from one database management system to another without modifications. A database administrator has made a list of the key differences among major SQL implementations:
http://troels.arvin.dk/db/rdbms
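As a rough illustration of both points, the sketch below runs a few SQL statements against SQLite through Python's built-in sqlite3 module. The table name sales and its rows are made up for the example. The first query uses only widely supported syntax; the second uses LIMIT, which SQLite accepts but which is not part of the SQL standard, so it would need rewriting for some other systems.

```python
import sqlite3

# Hypothetical table and data, held in a temporary in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120.0), ("West", 75.5), ("East", 30.0)],
)

# Widely portable: grouping and aggregation.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Dialect-specific: LIMIT works in SQLite, but other systems use
# different syntax (for example TOP or FETCH FIRST) for the same request.
largest = conn.execute(
    "SELECT region, amount FROM sales ORDER BY amount DESC LIMIT 1"
).fetchall()

conn.close()
print(totals)
print(largest)
```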
An online course on SQL programming:
XML
XML is a markup language created by the World Wide Web Consortium (W3C). It serves as the lowest common denominator for the exchange of structured data and as a starting point for defining specific, specialized standards for particular kinds of data.
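A small example may make the idea of structured data exchange more concrete. The sketch below uses Python's standard xml.etree.ElementTree module to read a made-up document; the element and attribute names (orders, order, customer, total, id) are invented for the illustration and do not come from any particular standard.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML document describing two orders.
document = """<orders>
  <order id="1001">
    <customer>Acme</customer>
    <total>250.00</total>
  </order>
  <order id="1002">
    <customer>Globex</customer>
    <total>99.50</total>
  </order>
</orders>"""

root = ET.fromstring(document)

# Turn each <order> element into a flat row, much as a table would store it.
rows = [
    (order.get("id"), order.findtext("customer"), float(order.findtext("total")))
    for order in root.findall("order")
]
print(rows)
```

The receiving program needs to know only the element names, not the software that produced the file, which is what makes XML useful as a common denominator.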
The W3C is the source for definitive information on XML:
HTML
HTML is the markup language of web pages, including the web pages that SAS generates.
The formal definitions of HTML are provided by the W3C:
The latest versions of HTML are implemented as a kind of XML and are called XHTML.
Regular expression
A regular expression (regexp, regex) is a kind of coded text search string. SAS uses functions and CALL routines to implement two different kinds of regular expressions. The names of these routines begin with RX and PRX.
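To give a sense of what a coded search string looks like, here is a minimal sketch using Python's re module. The PRX routines in SAS use Perl-style syntax, which Python's syntax resembles closely for simple patterns like this one, but the two are not identical, so treat this as an illustration of the idea and not as SAS code. The sample text and pattern are made up.

```python
import re

# Pattern: three digits, a hyphen, four digits, bounded by word breaks.
pattern = re.compile(r"\b(\d{3})-(\d{4})\b")

text = "Call 555-0142 or 555-0199 after 5 pm."

# Find the first match and where it starts (Python counts positions from 0).
first = pattern.search(text)
first_text = first.group(0)
first_position = first.start()

# Find every match in the text.
all_matches = [m.group(0) for m in pattern.finditer(text)]

print(first_text, first_position)
print(all_matches)
```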
A web site for regular expression information: