A CGI Framework in Python
A Unified Framework Makes Site Monitoring
By A.M. Kuchling
amk@magnet.com
As Web developers create an increasing number of CGI scripts for their sites, they often find themselves repeatedly writing things like error-handling code. Fortunately, it's rather easy to construct a top-level CGI script that acts as a framework for your site and handles errors cleanly. This approach results in less code in the long run, and lets you concentrate on the interesting parts of the job.
In this article, I'll develop a simple CGI framework in Python that can easily be extended and customized to greatly simplify site development. Because of Python's powerful standard library, it doesn't require much code. Finally, I'll develop a simple user-registration scheme that puts the CGI framework into practice.
Another requirement was that the errors be logged to help track down problems. Python provides tracebacks pinpointing the line of code where the error occurred. Tracebacks can be emailed to a maintainer, which eliminates the trial-and-error process of figuring out where the problem lies; I often just keep a mail program running and read the tracebacks as they arrive. This is particularly useful for problems found during quality-assurance testing, and immediately warns you when something breaks on the live site.
The final design consideration is that the framework should provide useful support routines and variables that simplify the code for each page.
PATH_INFO environment variable, and the location of the server's document tree from the DOCUMENT_ROOT environment variable. Environment variables can be accessed via the environ dictionary in the os module, which will have to be imported first with "import os".
Python's dictionaries are the same as associative arrays in awk or hashes in Perl; they allow data to be retrieved quickly using a unique key. The construct os.environ['PATH_INFO'] retrieves the value of the PATH_INFO environment variable, raising a KeyError exception if it doesn't exist. You can either catch the exception, or use os.environ.has_key() to check whether the key is present before trying to retrieve it.
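The two approaches to a possibly missing variable can be sketched as follows. This is a minimal illustration rather than part of the framework's listings; the helper name get_env is an assumption, and the membership test is written with the "in" operator because has_key() is not available in current Python versions.

```python
import os

def get_env(name, default=None):
    """Return an environment variable, or `default` if it is not set.

    get_env is a hypothetical helper name used only for this sketch.
    """
    try:
        return os.environ[name]      # raises KeyError when the variable is missing
    except KeyError:
        return default

# Alternative: test for the key before retrieving it.
if "PATH_INFO" in os.environ:
    path_info = os.environ["PATH_INFO"]
else:
    path_info = ""
```

Catching the exception keeps the common case short, while the explicit test reads more naturally when a missing variable is expected.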
The framework has to execute a chunk of Python code containing the logic for each script. Python's execfile() function does exactly what we want, taking a string containing a filename and two optional dictionaries used for the code's global and local namespaces. The top-level script simply sets up a dictionary containing the desired variables and calls execfile().
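As a rough sketch of the idea, the function below runs a file of page code inside a prepared namespace dictionary. execfile() is a built-in only in older Python versions; this sketch assumes a current interpreter and uses compile() and exec() to get the same effect. The names run_page, visitor, and greeting are illustrative assumptions.

```python
import os
import tempfile

def run_page(filename, namespace):
    """Execute the Python source in `filename` with `namespace` as its globals."""
    with open(filename) as f:
        source = f.read()
    # Passing the filename to compile() keeps tracebacks pointing at the page file.
    code = compile(source, filename, "exec")
    exec(code, namespace)
    return namespace

# Demonstration with a throwaway page script (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("greeting = 'Hello, ' + visitor\n")
page_namespace = run_page(tmp.name, {"visitor": "world"})
os.remove(tmp.name)
```

Any variable the page code assigns is left behind in the dictionary, which is how the top-level script can read results back after the page has run.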
How do we trap errors that occur while the page's code is being run? Python signals errors by raising an exception if an illegal operation is attempted. For example, trying to add a string and a number will raise a TypeError exception. The language doesn't attempt to guess your intention and perform an automatic conversion of any of the operands. Typos commonly cause SyntaxError or NameError exceptions.
To catch exceptions, the execfile() call can be enclosed inside a try...except...else statement. Unless you name a specific exception in the except clause, any exception raised inside the try: block will invoke the exception-handling code in the except: block. If the try: block runs without raising an exception, the code in the else: block will then be executed.
The traceback module contains functions to generate a stack trace, giving the line at which the exception was triggered, so the exception handler can send the traceback to sys.stderr, where it will appear in the HTTP server's error log, and also return an apologetic HTML page to the user's browser.
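The error-trapping structure described above might look something like this. It is a sketch under assumptions: the page code is passed as a string and run with exec() for brevity, the names run_guarded and ERROR_PAGE are invented for the example, and the page is assumed to leave its HTML in a variable called page.

```python
import sys
import traceback

# Hypothetical apology page shown to the user when the page code fails.
ERROR_PAGE = "<html><body><h1>Sorry</h1><p>An internal error occurred.</p></body></html>"

def run_guarded(code_text, namespace):
    """Run page code; return (ok, html) where html is the page or the apology."""
    try:
        exec(code_text, namespace)
    except Exception:
        # Send the stack trace to stderr, which ends up in the server's error log.
        traceback.print_exc(file=sys.stderr)
        return False, ERROR_PAGE
    else:
        # Reached only when the try: block raised nothing.
        return True, namespace.get("page", "")
```

For example, run_guarded("page = 'x' + 1", {}) logs a TypeError traceback and returns the apology page, while error-free code returns its own output.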
else: block to send the output to the user only if the code was error free. Otherwise, a partially output page will interfere with the HTML for the error page. To make this work, we need to capture any output from the code. However, this presents a problem: CGI scripts are used both to generate HTTP header lines, such as redirections or cache-control directives, and to produce output that is actually displayed, such as an HTML document or a GIF file. Sometimes scripts need to do both, depending on their input. If the script can only send its results to standard output and the decision can only be made late in the script's execution, this can lead to weird contortions in the program's logic; all output must be delayed until it's certain that no more headers need to be generated.
Luckily, Python's StringIO module provides a StringIO class that mimics a file object, and saves the data written to it as a string. The string's value can be retrieved later with the StringIO object's getvalue() method. This simplifies our scripts considerably: before doing the execfile(), we save the sys.stdout file object and replace it with a StringIO object. It'll grab whatever output is produced by the page's code, and after the execfile() has finished, the output will be waiting for us.
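The capture step can be sketched as below. In current Python versions the StringIO class lives in the io module, which is what this example assumes; the function name capture_output is hypothetical, and exec() on a string stands in for running the page file.

```python
import io
import sys

def capture_output(code_text, namespace):
    """Run page code with sys.stdout redirected; return everything it printed."""
    saved_stdout = sys.stdout          # remember the real standard output
    buffer = io.StringIO()
    sys.stdout = buffer
    try:
        exec(code_text, namespace)
    finally:
        sys.stdout = saved_stdout      # always restore, even if the code fails
    return buffer.getvalue()

captured = capture_output("print('<p>Hello</p>')", {})
```

Restoring sys.stdout in a finally: clause matters here, because the error page still has to reach the real standard output when the page code raises an exception.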
To handle HTTP headers, we'll put a dictionary called "headers" in the namespace used by the page's code. HTTP header names must be unique, so we can simply use the header name as a key, and the corresponding value will just be the contents of the header line. Producing the headers is simply a matter of iterating over the key/value pairs and sending them to the real standard output. Next, a blank line is output, followed by the contents of the StringIO object that the script used as sys.stdout.
{'Content-type': 'text/html'}. Most scripts can use the default value, and those that return something different (like a GIF image) can simply do headers['Content-type'] = 'image/gif'.
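Putting the headers dictionary and the captured output together might look like the following sketch. The function name send_response is an assumption, and the example writes to an in-memory buffer instead of the real standard output so that the result can be inspected.

```python
import io

def send_response(headers, body, out):
    """Write the header lines, a blank line, then the captured page body."""
    for name, value in headers.items():
        out.write("%s: %s\n" % (name, value))
    out.write("\n")                    # blank line separates headers from the body
    out.write(body)

headers = {"Content-type": "text/html"}    # the default described in the text
destination = io.StringIO()                # stands in for the real sys.stdout
send_response(headers, "<p>Hi</p>", destination)
response = destination.getvalue()
```

A page that returns an image would only need to change headers['Content-type'] before the framework calls the output routine.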
Scripts will also commonly need access to the environment variables in the os.environ dictionary. Since execfile() doesn't limit what the code can do, it could just import the os module and then access os.environ, but I find it's handy to make os.environ available as environ.
Most importantly, any fields passed to the script must be available. The cgi module contains classes that encapsulate fields passed to the script. The most commonly used class is cgi.FieldStorage, because it has the most features. FieldStorage objects mimic a dictionary whose keys are the field names. The framework automatically creates a FieldStorage object and places it in the namespace under the name webvars.
You can add more variables to the namespace, depending on your application. If every page requires database access, you can automatically open a database connection using a Python database extension, and pass it to the script. It's often useful to have a UserAgent class that looks at os.environ['HTTP_USER_AGENT'] and supports various informative methods such as .isNetscape(), .isMac(), or .hasSSL(); it's a simple class to write, and it's a good learning exercise, too.
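One possible shape for such a class is sketched below. The substring tests are illustrative assumptions, not a reliable browser-detection scheme, and the environment variable is read as HTTP_USER_AGENT, which is the standard CGI name for the browser's identification string.

```python
class UserAgent:
    """Answers simple questions about the browser, based on its User-Agent string."""

    def __init__(self, environ):
        # A missing header is treated as an empty string.
        self.agent = environ.get("HTTP_USER_AGENT", "")

    def isMac(self):
        # Assumption: Macintosh browsers mention "Mac" somewhere in the string.
        return "Mac" in self.agent

    def isNetscape(self):
        # Assumption: Netscape strings start with "Mozilla/" and, unlike
        # imitators, do not contain the word "compatible".
        return self.agent.startswith("Mozilla/") and "compatible" not in self.agent

browser = UserAgent({"HTTP_USER_AGENT": "Mozilla/4.0 (Macintosh; I; PPC)"})
```

The framework could create one instance per request and place it in the page's namespace alongside environ and webvars.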
User.py module actually requires twice as many lines of code as showpage.cgi, despite being simpler to describe; see Listing Two.
Registration systems require a database; users who join create individual records. In an object-oriented language like Python, it's natural to think about writing a User class, with attributes for user ID, password, street address, option settings, and so on. But how do we save Python objects?
The Python library includes the pickle module, which is similar to Java's serialization mechanism. Pickling a Python value converts it to a string representation that can be stuffed into a database or shipped over a socket, and subsequently unpickled to recreate the original object. Built-in Python types such as integers, strings, lists (even recursive ones), and dictionaries can all be pickled, as can most class instances. Only Python functions, open files, Tk widgets, and some class instances can't be pickled. This makes storing instances of the User class trivial; just pickle them and save the string representation.
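A round trip through pickle can be sketched like this. The attribute names and sample values are assumptions made for the example and are not taken from Listing Two; note also that current Python versions produce a bytes object rather than a text string.

```python
import pickle

class User:
    """A stripped-down stand-in for the article's User class (hypothetical fields)."""

    def __init__(self, user_id, password, address=""):
        self.user_id = user_id
        self.password = password
        self.address = address

original = User("amk", "secret", "123 Example St.")
data = pickle.dumps(original)      # serialized form, ready to be stored
restored = pickle.loads(data)      # a new object with the same attributes
```

Unpickling needs the User class to be importable under the same name, so the class definition has to live in a module that both the saving and the loading code can reach.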
Accessing databases from Python is a far-ranging issue. There are extension modules for many relational databases (such as Oracle, Sybase, and MySQL), and some even follow a standard programming interface. For this case, however, a full relational database is overkill. Since the only operation is to retrieve user records given their IDs, user registration requires only a disk-based version of a dictionary. Libraries like GDBM or the Berkeley DB package provide this functionality, and they have standard Python interfaces that imitate dictionaries, with the restriction that the keys and values must always be strings.
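To show the dictionary-like interface without depending on a particular database library, the sketch below uses dbm.dumb, a simple pure-Python implementation in the standard library, as a stand-in for GDBM. The file name and the record's fields are assumptions for illustration.

```python
import dbm.dumb
import os
import pickle
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "users")       # hypothetical database name

    # Store a pickled record under the user's ID, as with an ordinary dictionary.
    db = dbm.dumb.open(path, "c")
    db["amk"] = pickle.dumps({"password": "secret", "city": "Anytown"})
    db.close()

    # Reopen the file and fetch the record back by its key.
    db = dbm.dumb.open(path, "r")
    record = pickle.loads(db["amk"])
    has_missing = "nobody" in db
    db.close()
```

Because the values must be flat strings of bytes, pickling is what lets a structured record fit into this kind of database.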
The User class has only a few methods. The save() method tells the object to pickle itself and store the resulting string in the GDBM file; it should be called whenever a page makes changes to the object representing a user. The getCookie() method returns a string to be put in a Set-Cookie header to give the browser a cookie containing the authentication token. The code to set the cookie is headers['Set-Cookie'] = userObj.getCookie().
To safely store the current user ID in a cookie, the cookie must be difficult to forge. Just storing the user name is insufficient, since anyone could impersonate a user named "amk" by editing their cookies.txt file and setting the cookie's value to "amk." To avoid this, an authentication token is generated containing both the user's ID and an MD5 hash of the ID combined with some secret information. On reading the cookie, we can recompute the hash value and check that it matches.
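The token scheme can be sketched as follows. The token layout (ID, a colon, then the hex digest), the SECRET value, and the function names are all assumptions for this example. MD5 is used only because the text names it; for new code, an HMAC built on a stronger hash such as SHA-256 would generally be the safer choice.

```python
import hashlib
import hmac

SECRET = b"replace-with-a-private-value"   # hypothetical server-side secret

def make_token(user_id):
    """Return 'user_id:digest', where the digest covers the secret and the ID."""
    digest = hashlib.md5(SECRET + user_id.encode("utf-8")).hexdigest()
    return "%s:%s" % (user_id, digest)

def check_token(token):
    """Return the user ID if the token is genuine, otherwise None."""
    user_id, sep, digest = token.rpartition(":")
    if not sep:
        return None                    # no digest part at all
    expected = hashlib.md5(SECRET + user_id.encode("utf-8")).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    if hmac.compare_digest(digest.encode("utf-8"), expected.encode("utf-8")):
        return user_id
    return None
```

Someone who edits the cookie to read "amk" cannot produce the matching digest without knowing the secret, so check_token() rejects the forgery.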
The User module provides three functions: createUser(), for creating a new user; loginUser(), for logging in as an existing user; and getCurrentUser(), for getting the object representing the current user. Pages that should be customized for the user need only do import User; userObj = User.getCurrentUser(environ). If userObj == None, the client hasn't registered with the site; otherwise, userObj is the object containing the user's vital data.
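A function in the spirit of getCurrentUser() could be sketched as below. This is not the code from Listing Two: the cookie name, the extra parameters, and the use of a plain dictionary for the user store are assumptions, and the token check is passed in as a function so that the sketch stays self-contained.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "userid"   # hypothetical cookie name

def get_current_user(environ, users, check_token):
    """Return the stored record for the logged-in user, or None.

    `users` is any dictionary-like store keyed by user ID, and `check_token`
    is a function returning the user ID for a genuine token, else None.
    """
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    if COOKIE_NAME not in cookie:
        return None                          # the browser sent no login cookie
    user_id = check_token(cookie[COOKIE_NAME].value)
    if user_id is None:
        return None                          # the cookie was forged or damaged
    return users.get(user_id)
```

A page would then branch on the result, showing the registration form when it is None and the customized content otherwise.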
Compared with the User.py module, the pages that handle user registration are almost trivial. Listings Three, Four, and Five are the pages that handle first-time registration, logging in under an existing user ID, and displaying a customized page for a registered user, respectively. For the simple examples shown here, none of these pages require more than 20 lines of code.
A common solution is to separate Python and HTML by implementing a templating scheme using Python's regular-expression module. The code executed for each page will leave its local variables in the namespace dictionary, so it's easy to read a template file, search for a placeholder string like [username] or <!--var username-->, get a username variable set by the page's code, and substitute its value into the template. In this model, the code for each page simply sets a certain number of variables to the right values, and the framework performs the rest of the task.
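A minimal version of such a templating step, assuming the square-bracket placeholder style, might look like this. The function name fill_template and the choice to leave unknown placeholders untouched are assumptions of the sketch.

```python
import re

# Matches placeholders such as [username]: a name in square brackets.
PLACEHOLDER = re.compile(r"\[(\w+)\]")

def fill_template(template, namespace):
    """Replace each [name] with the value the page code stored under that name."""
    def substitute(match):
        name = match.group(1)
        if name in namespace:
            return str(namespace[name])
        return match.group(0)          # leave unknown placeholders untouched
    return PLACEHOLDER.sub(substitute, template)

rendered = fill_template("<p>Welcome back, [username]!</p>", {"username": "amk"})
```

The namespace argument would be the same dictionary in which the page's code ran, so any variable the page sets becomes available to the template.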
If the site becomes busy enough that the cost of starting a Python interpreter for each CGI becomes significant, you can use an Apache or Netscape server module to embed a Python interpreter in the server; this avoids the startup time.
Finally, Python code can also be invoked on top of Active Server Pages under IIS, but that makes writing your own framework unnecessary in the first place.
Copyright © Web Techniques. All rights reserved.
Web Techniques Magazine