A CGI Framework in Python

A Unified Framework Makes Site Monitoring

By A.M. Kuchling
amk@magnet.com

Web Techniques
February 1998
Volume 3, Issue 2

As Web developers create an increasing number of CGI scripts for their sites, they often find themselves repeatedly writing things like error-handling code. Fortunately, it's rather easy to construct a top-level CGI script that acts as a framework for your site and handles errors cleanly. This approach results in less code in the long run, and lets you concentrate on the interesting parts of the job.

In this article, I'll develop a simple CGI framework in Python that can easily be extended and customized to greatly simplify site development. Because of Python's powerful standard library, it doesn't require much code. Finally, I'll develop a simple user-registration scheme that puts the CGI framework into practice.

Design Considerations

In creating the framework, I had several design considerations. First, the user should never see a server error. If a programming error occurs, the framework should trap it and send its own error page to the browser, rather than leaving the user staring at a bare "Server error" page. The error page should reassure users and refer them to your tech-support line or help page.

Another requirement was that the errors be logged to help track down problems. Python provides tracebacks pinpointing the line of code where the error occurred. Tracebacks can be emailed to a maintainer, which eliminates the trial-and-error process of figuring out where the problem lies; I often just keep a mail program running and read the tracebacks as they arrive. This is particularly useful for problems found during quality-assurance testing, and immediately warns you when something breaks on the live site.

The final design consideration is that the framework should provide useful support routines and variables that simplify the code for each page.

Digging into the Code

The framework script, showpage.cgi, is shown in Listing One, and contains only about 60 lines of code, excluding comments. To start off, showpage.cgi needs to know where to find the code for the page to be executed; we'll pass that by appending it to the path portion of the URL. A sample URL will then look like http://www.yourhost.xxx/cgi-bin/showpage.cgi/pages/foo.py. The script can usually get the path information from the PATH_INFO environment variable, and the location of the server's document tree from the DOCUMENT_ROOT environment variable. Environment variables can be accessed via the environ dictionary in the os module, which will have to be imported first with "import os".

Python's dictionaries are the same as associative arrays in awk or hashes in Perl: they allow data to be retrieved quickly using a unique key. The construct os.environ['PATH_INFO'] retrieves the value of the PATH_INFO environment variable, raising a KeyError exception if it doesn't exist. You can either catch the exception, or use os.environ.has_key() to check if the key is present before trying to retrieve it.
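Both approaches can be sketched as follows. This is a minimal illustration written for present-day Python 3, where the "in" operator takes the place of has_key(); the helper names get_path_info and has_path_info and the sample dictionary are assumptions made for the example, not code from Listing One.

```python
import os

def get_path_info(environ=os.environ):
    """Return PATH_INFO, or None if the server did not set it."""
    # Approach 1: attempt the lookup and catch the KeyError.
    try:
        return environ['PATH_INFO']
    except KeyError:
        return None

def has_path_info(environ=os.environ):
    """Approach 2: test for the key before trying to retrieve it."""
    return 'PATH_INFO' in environ

# A plain dictionary stands in for the real CGI environment here.
sample = {'PATH_INFO': '/pages/foo.py', 'DOCUMENT_ROOT': '/var/www'}
print(get_path_info(sample))
print(has_path_info({}))
```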

The framework has to execute a chunk of Python code containing the logic for each script. Python's execfile() function does exactly what we want, taking a string containing a filename and two optional dictionaries used for the code's global and local namespaces. The top-level script simply sets up a dictionary containing the desired variables and calls execfile().
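Here is a rough sketch of that step. Note that execfile() no longer exists in Python 3, so this version reads the file and hands it to compile() and exec(), which has approximately the same effect; the function name run_page and the throwaway page file are hypothetical and only serve the demonstration.

```python
import os
import tempfile

def run_page(filename, namespace):
    """Execute the Python source in `filename` with `namespace` as its globals."""
    with open(filename) as f:
        source = f.read()
    # Passing the filename to compile() makes tracebacks point at the page file.
    code = compile(source, filename, 'exec')
    exec(code, namespace)
    return namespace

# Demonstration: write a one-line page, run it, and inspect the namespace.
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as tmp:
    tmp.write("greeting = 'Hello, ' + visitor\n")
page_namespace = run_page(tmp.name, {'visitor': 'amk'})
os.remove(tmp.name)
print(page_namespace['greeting'])
```

The page's code sees the variables placed in the dictionary, and anything it assigns is left behind in the same dictionary for the framework to read afterward.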

How do we trap errors that occur while the page's code is being run? Python signals errors by raising an exception if an illegal operation is attempted. For example, trying to add a string and a number will raise a TypeError exception. The language doesn't attempt to guess your intention and perform an automatic conversion of any of the operands. Typos commonly cause SyntaxError or NameError exceptions.

To catch exceptions, the execfile can be enclosed inside a try...except...else statement. Unless you specify an exception in the except statement, any exceptions raised inside the try: block will invoke the exception-handling code in the except: block. If the try: block runs without raising an exception, the code in the else: block will then be executed.
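The control flow might look something like the sketch below. It assumes Python 3 and uses exec() on a string in place of execfile(); it also names the broad Exception class rather than using a bare except, which is a common present-day habit and not necessarily what Listing One does. The function run_guarded is invented for the example.

```python
def run_guarded(code, namespace):
    """Run `code`; return 'ok' on success, or the exception's class name."""
    try:
        exec(code, namespace)
    except Exception as exc:
        # Reached for any error raised while the code was compiled or run.
        return type(exc).__name__
    else:
        # Reached only if the try: block finished without an exception.
        return 'ok'

print(run_guarded("total = 1 + 2", {}))
print(run_guarded("total = 'abc' + 1", {}))       # string plus number
print(run_guarded("total = undefined_name", {}))  # misspelled variable
print(run_guarded("total = = 1", {}))             # typo in the syntax
```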

The traceback module contains functions that generate a stack trace giving the line at which the exception was triggered. The exception handler can send the traceback to sys.stderr, where it will appear in the HTTP server's error log, and also return an apologetic HTML page to the user's browser.
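A handler along those lines could be sketched as follows, assuming Python 3. The function handle_error and the wording of the error page are placeholders of my own; traceback.format_exc() is the standard-library call that returns the current exception's stack trace as a string.

```python
import io
import sys
import traceback

ERROR_PAGE = """<html><head><title>Sorry!</title></head>
<body><h1>Something went wrong</h1>
<p>The problem has been logged. Please try again later.</p>
</body></html>"""

def handle_error(log=sys.stderr):
    """Call from inside an except block: log the traceback, return an error page."""
    log.write(traceback.format_exc())
    return ERROR_PAGE

# Demonstration: a StringIO object stands in for the server's error log.
log = io.StringIO()
try:
    1 / 0
except Exception:
    page = handle_error(log)
print(log.getvalue())
```

Mailing the traceback to a maintainer would be a matter of passing the same string to whatever mail-sending routine the site uses.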

Capturing the Output

We want the else: block to send the output to the user only if the code was error free. Otherwise, a partially output page will interfere with the HTML for the error page. To make this work, we need to capture any output from the code. However, this presents a problem: CGI scripts are used both to generate HTTP header lines, such as redirections or cache-control directives, and to produce output that is actually displayed, such as an HTML document or a GIF file. Sometimes scripts need to do both, depending on their input. If the script can only send its results to standard output and the decision can only be made late in the script's execution, this can lead to weird contortions in the program's logic; all output must be delayed until it's certain that no more headers need to be generated.

Luckily, Python's StringIO module provides a StringIO class that mimics a file object, and saves the data written to it as a string. The string's value can be retrieved later with the StringIO object's getvalue() method. This simplifies our scripts considerably: before doing the execfile(), we save the sys.stdout file object and replace it with a StringIO object. It'll grab whatever output is produced by the page's code, and after the execfile() has finished, the output will be waiting for us.
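The swap can be sketched like this. In Python 3 the class lives in the io module rather than in a module named StringIO, and exec() on a string stands in for execfile(); the function name capture_output is an assumption for the example.

```python
import io
import sys

def capture_output(code, namespace):
    """Run `code` with sys.stdout redirected; return everything it printed."""
    saved_stdout = sys.stdout
    sys.stdout = buffer = io.StringIO()
    try:
        exec(code, namespace)
    finally:
        # Always put the real standard output back, even after an error.
        sys.stdout = saved_stdout
    return buffer.getvalue()

body = capture_output("print('<h1>Hello</h1>')", {})
print(repr(body))
```

Restoring sys.stdout in a finally: clause matters here, because the error page itself has to be written to the real standard output.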

To handle HTTP headers, we'll put a dictionary called "headers" in the namespace used by the page's code. Since each header a page sets normally appears only once, we can simply use the header name as a key, and the corresponding value will just be the contents of the header line. Producing the headers is simply a matter of iterating over the key/value pairs and sending them to the real standard output. Next, a blank line is output, followed by the contents of the StringIO object that the script used as sys.stdout.
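The output step is short enough to sketch in full. The function name send_response and its out parameter are assumptions for illustration; the parameter defaults to the real standard output and is only overridden here so the result can be inspected.

```python
import io
import sys

def send_response(headers, body, out=sys.stdout):
    """Write the header lines, a blank line, and then the captured body."""
    for name, value in headers.items():
        out.write('%s: %s\n' % (name, value))
    out.write('\n')   # the blank line ends the header section
    out.write(body)

demo = io.StringIO()
send_response({'Content-type': 'text/html'}, '<p>Hi</p>\n', demo)
print(demo.getvalue())
```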

Useful Variables

Since most CGI scripts produce HTML, headers can be initialized to {'Content-type': 'text/html'}. Most scripts can use the default value, and those that return something different (like a GIF image) can simply do headers['Content-type'] = 'image/gif'.

Scripts will also commonly need access to the environment variables in the os.environ dictionary. Since execfile() doesn't limit what the code can do, it could just import the os module and then access os.environ, but I find it's handy to make os.environ available as environ.

Most importantly, any fields passed to the script must be available. The cgi module contains classes that encapsulate fields passed to the script. The most commonly used class is cgi.FieldStorage, because it has the most features. FieldStorage objects mimic a dictionary whose keys are the field names. The framework automatically creates a FieldStorage object and places it in the namespace under the name webvars.
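Putting the three variables together, the namespace setup might be sketched as below. One caveat: the cgi module has been removed from recent Python 3 releases, so this sketch swaps in urllib.parse.parse_qs and handles only fields passed in the query string; it is a stand-in for FieldStorage, not an equivalent. The function name build_namespace is hypothetical.

```python
from urllib.parse import parse_qs

def build_namespace(environ):
    """Build the dictionary of variables handed to each page's code."""
    # parse_qs maps each field name to a list of values; keep the first one.
    fields = parse_qs(environ.get('QUERY_STRING', ''))
    webvars = {name: values[0] for name, values in fields.items()}
    return {
        'headers': {'Content-type': 'text/html'},
        'environ': environ,
        'webvars': webvars,
    }

namespace = build_namespace({'QUERY_STRING': 'user=amk&color=blue'})
print(namespace['webvars'])
```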

You can add more variables to the namespace, depending on your application. If every page requires database access, you can automatically open a database connection using a Python database extension, and pass it to the script. It's often useful to have a UserAgent class that looks at os.environ['HTTP_USER_AGENT'] and supports various informative methods such as .isNetscape(), .isMac(), or .hasSSL(); it's a simple class to write, and it's a good learning exercise, too.
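As a starting point for that exercise, here is one possible sketch. The server passes the browser's User-Agent header to CGI scripts as HTTP_USER_AGENT. The string tests below are rough heuristics of my own, based on how browsers of the period identified themselves, and should be treated as assumptions; .hasSSL() is left out because the user-agent string alone doesn't settle it.

```python
class UserAgent:
    """Answers simple questions about the browser making the request."""

    def __init__(self, environ):
        self.agent = environ.get('HTTP_USER_AGENT', '')

    def isNetscape(self):
        # Netscape identified itself as "Mozilla"; browsers that borrowed
        # the prefix generally added the word "compatible".
        return self.agent.startswith('Mozilla') and 'compatible' not in self.agent

    def isMac(self):
        return 'Mac' in self.agent

ua = UserAgent({'HTTP_USER_AGENT': 'Mozilla/4.04 (Macintosh; I; PPC)'})
print(ua.isNetscape(), ua.isMac())
```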

In Practice

Since many Web sites let users register to receive customized content or services, I'll implement a skeletal registration system. The User.py module actually requires twice as many lines of code as showpage.cgi, despite being simpler to describe; see Listing Two.

Registration systems require a database; users who join create individual records. In an object-oriented language like Python, it's natural to think about writing a User class, with attributes for user ID, password, street address, option settings, and so on. But how do we save Python objects?

The Python library includes the pickle module, which is similar to Java's serialization mechanism. Pickling a Python value converts it to a string representation that can be stuffed into a database or shipped over a socket, and subsequently unpickled to recreate the original object. Built-in Python types such as integers, strings, lists (even recursive ones), and dictionaries can all be pickled, as can most class instances. Only Python functions, open files, Tk widgets, and some class instances can't be pickled. This makes storing instances of the User class trivial; just pickle them and save the string representation.
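A round trip through pickle can be sketched in a few lines. The example uses built-in types only, including a list that contains itself; instances of a class such as User are pickled with the same two calls, provided the class can be imported wherever the data is unpickled. The record's field names are made up for the illustration.

```python
import pickle

# A user record built from ordinary Python values.
record = {'user_id': 'amk', 'options': {'color': 'blue'}, 'visits': 3}

data = pickle.dumps(record)      # a string of bytes, ready to store or send
restored = pickle.loads(data)    # an equal, but separate, copy

# Recursive structures survive the trip, too.
loop = [1, 2]
loop.append(loop)
loop_copy = pickle.loads(pickle.dumps(loop))

print(restored == record, loop_copy[2] is loop_copy)
```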

Accessing databases from Python is a far-ranging issue. There are extension modules for many relational databases (such as Oracle, Sybase, and MySQL) and some even follow a standard programming interface. For this case, however, a full relational database is overkill. Since the only operation is to retrieve user records given their IDs, user registration requires only a disk-based version of a dictionary. Libraries like GDBM or the Berkeley DB package provide this functionality, and they have standard Python interfaces that imitate dictionaries, imposing the requirement that the keys and values must always be strings.
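To give a feel for the disk-based dictionary, here is a small sketch using Python 3's generic dbm module, which picks whichever backend is installed (GDBM among them). Keys and values are byte strings, so records are pickled before being stored. The helper names save_record and load_record, the file name, and the record contents are all assumptions for the example.

```python
import dbm
import os
import pickle
import shutil
import tempfile

def save_record(db, user_id, record):
    """Pickle `record` and store it under the user's ID."""
    db[user_id.encode('utf-8')] = pickle.dumps(record)

def load_record(db, user_id):
    """Return the stored record, or None if the ID is unknown."""
    try:
        return pickle.loads(db[user_id.encode('utf-8')])
    except KeyError:
        return None

# Demonstration in a temporary directory, removed afterward.
workdir = tempfile.mkdtemp()
with dbm.open(os.path.join(workdir, 'users'), 'c') as db:
    save_record(db, 'amk', {'user_id': 'amk', 'color': 'blue'})
    found = load_record(db, 'amk')
    missing = load_record(db, 'nobody')
shutil.rmtree(workdir)
print(found, missing)
```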

The User class has only a few methods. The save() method tells the object to pickle itself and store the resulting string in the GDBM file; it should be called whenever a page makes changes to the object representing a user. The getCookie() method returns a string to be put in a Set-Cookie header to give the browser a cookie containing the authentication token. The code to set the cookie is headers['Set-Cookie'] = userObj.getCookie().

To safely store the current user ID in a cookie, the cookie must be difficult to forge. Just storing the user name is insufficient, since anyone could impersonate a user named "amk" by editing their cookies.txt file and setting the cookie's value to "amk". To avoid this, an authentication token is generated containing both the user's ID and an MD5 hash of the ID combined with some secret information. On reading the cookie, we can recompute the hash value and check that it matches.
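The token scheme might be sketched as follows, assuming Python 3's hashlib in place of the old md5 module. The token layout (ID, a colon, then the hex digest), the function names, and the secret value are my own assumptions, not details taken from Listing Two. MD5 is used because it is what the text describes; code written today would more likely use the hmac module with SHA-256.

```python
import hashlib
import hmac

SECRET = b'change-me'   # hypothetical; a real site keeps this value private

def _digest(user_id, secret):
    return hashlib.md5(user_id.encode('utf-8') + secret).hexdigest()

def make_token(user_id, secret=SECRET):
    """Return 'user_id:hash', suitable for storing in a cookie."""
    return '%s:%s' % (user_id, _digest(user_id, secret))

def check_token(token, secret=SECRET):
    """Return the user ID if the hash matches, otherwise None."""
    user_id, sep, digest = token.rpartition(':')
    if not sep:
        return None
    expected = _digest(user_id, secret)
    # compare_digest avoids leaking information through timing differences.
    if hmac.compare_digest(digest.encode('utf-8'), expected.encode('utf-8')):
        return user_id
    return None

token = make_token('amk')
print(check_token(token), check_token('amk'))
```

Someone who edits the cookie to claim another ID cannot produce the matching hash without knowing the secret, so the check fails and the request is treated as unregistered.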

The User module provides three functions: createUser(), for creating a new user; loginUser(), for logging in as an existing user; and getCurrentUser(), for getting the object representing the current user. Pages that should be customized for the user need only do import User ; userObj = User.getCurrentUser(environ). If userObj == None, the client hasn't registered with the site; otherwise, userObj is the object containing the user's vital data.

Compared with the User.py module, the pages that handle user registration are almost trivial. Listings Three, Four, and Five are the pages that handle first-time registration, logging in under an existing user ID, and displaying a customized page for a registered user, respectively. For the simple examples shown here, none of these pages require more than 20 lines of code.

Possible Enhancements

This CGI framework is very simple, but frameworks can be extended in many ways. For example, scripts often contain large chunks of HTML code, and if both Python programmers and HTML writers are developing the site, both parties may want to work on the same files simultaneously. This often leads to inadvertent errors in the code.

A common solution is to separate Python and HTML by implementing a templating scheme using Python's regular-expression module. The code executed for each page will leave its local variables in the namespace dictionary, so it's easy to read a template file, search for a placeholder string like [username], get a username variable set by the page's code, and substitute its value into the template. In this model, the code for each page simply sets a certain number of variables to the right values, and the framework performs the rest of the task.
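A bare-bones version of such a scheme can be sketched with the re module. It handles only the square-bracket placeholder style; the function name fill_template and the choice to leave unknown placeholders untouched are assumptions made for the example.

```python
import re

# Matches placeholders such as [username]: brackets around a variable name.
PLACEHOLDER = re.compile(r'\[(\w+)\]')

def fill_template(template, namespace):
    """Replace each [name] with the value of `name` from the page's namespace."""
    def substitute(match):
        name = match.group(1)
        if name in namespace:
            return str(namespace[name])
        # Leave unknown placeholders in place so mistakes stay visible.
        return match.group(0)
    return PLACEHOLDER.sub(substitute, template)

page = fill_template('<p>Welcome back, [username]!</p>', {'username': 'amk'})
print(page)
```

With this in place, HTML writers edit only the template file, and the Python code for the page never needs to contain markup at all.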

If the site becomes busy enough that the cost of starting a Python interpreter for each CGI becomes significant, you can use an Apache or Netscape server module to embed a Python interpreter in the server; this avoids the startup time.

Finally, Python code can also be invoked on top of Active Server Pages under IIS, but that makes writing your own framework unnecessary in the first place.

Andrew is a Web developer at Magnet Interactive (Washington, DC). He works in several languages, but Python is his favorite. He can be reached at amk@magnet.com.