Tag: python

Bad data and the wonder of RegEx

Regular expressions aren’t something that I use regularly but I think that’s about to change. I came up in a programming world where C, Cobol and Fortran were just the best thing, considering that the alternative was writing assembler. And in that world, regular expressions don’t really exist. The basic toolkit I developed internally for solving programming problems didn’t include them. To this day, I sometimes still think about iterating over a character sequence in memory until a NULL is found. But in the modern programming world, languages like Python have tools built into the language and standard libraries that provide a lot of help to a developer. One of those tools is regular expression matching.

I’m currently working on a project to convert a FileMaker 5.5 database to SalesForce. Inside the FileMaker database, the data is in a very strange state because no data validation was ever done. So a date field can contain strings like ‘1/1/01’, ‘6/24/2007’, ‘9-1-2005’, ‘2/27’ or ‘Jan 1.’ These different formats cause a world of grief when moving the data to a database that expects a date to be structured, something like 1/1/2001. It doesn’t know about the various formats of our dates and rejects them as invalid. So the dates had to be filtered into something more standard.

I started to write some code that searches the strings for ‘/’ and ‘-’ and the text names of months. And naturally, it quickly became a rat’s nest of nested if conditions that made understanding the code very cumbersome. So I went looking for another way to solve the problem. I remembered that Python had this module called re that provides regular expression processing, but I hadn’t really used it and wasn’t sure how it would work. So I started searching the web to find some help on using regular expressions and the re module. What I found was just wonderful. A.M. Kuchling wrote up the fantastic Regular Expression HOWTO. In almost no time flat I was starting to put together a regular expression that would match most of the permutations of dates (those based on ‘1/1/01’) that I’ve seen in our database and a second that would match the ones that had the months written out. The quickest way that I found to test the regular expressions was using Python’s unittest module. I would edit the regular expression, run the unit tests, re-edit, re-run, etc. until it worked and all the tests passed.

The code that I developed looks like this:

import unittest
import re
 
 
class TestDateRegEx(unittest.TestCase):
    def test_slashes(self):
        slashes = re.compile(r'(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})$')
 
        self.assertEqual(slashes.match('1/1/10').group(),     '1/1/10')
        self.assertEqual(slashes.match('1/1/10').group(1),    '1')
        self.assertEqual(slashes.match('1/1/10').group(2),    '1')
        self.assertEqual(slashes.match('1/1/10').group(3),    '10')
        self.assertEqual(slashes.match('01/1/10').group(),    '01/1/10')
        self.assertEqual(slashes.match('11/1/10').group(),    '11/1/10')
        self.assertEqual(slashes.match('11/13/10').group(),   '11/13/10')
        self.assertEqual(slashes.match('1/1/2010').group(),   '1/1/2010')
        self.assertEqual(slashes.match('12/1/2010').group(),  '12/1/2010')
        self.assertEqual(slashes.match('1/21/2010').group(),  '1/21/2010')
        self.assertEqual(slashes.match('1/21/2010').group(1), '1')
        self.assertEqual(slashes.match('1/21/2010').group(2), '21')
        self.assertEqual(slashes.match('1/21/2010').group(3), '2010')
 
        self.assertEqual(slashes.match('111/1/10'),    None)
        self.assertEqual(slashes.match('11/111/10'),   None)
        self.assertEqual(slashes.match('11/11/10100'), None)
 
        self.assertEqual(slashes.match('1-1-10').group(),     '1-1-10')
        self.assertEqual(slashes.match('01-1-10').group(),    '01-1-10')
        self.assertEqual(slashes.match('11-1-10').group(),    '11-1-10')
        self.assertEqual(slashes.match('11-13-10').group(),   '11-13-10')
        self.assertEqual(slashes.match('1-1-2010').group(),   '1-1-2010')
        self.assertEqual(slashes.match('12-1-2010').group(),  '12-1-2010')
        self.assertEqual(slashes.match('1-21-2010').group(),  '1-21-2010')
        self.assertEqual(slashes.match('1-21-2010').group(1), '1')
        self.assertEqual(slashes.match('1-21-2010').group(2), '21')
        self.assertEqual(slashes.match('1-21-2010').group(3), '2010')
 
        self.assertEqual(slashes.match('111-1-10'),    None)
        self.assertEqual(slashes.match('11-111-10'),   None)
        self.assertEqual(slashes.match('11-11-10100'), None)
 
        self.assertEqual(slashes.match('1'),           None)
        self.assertEqual(slashes.match('1/'),          None)
        self.assertEqual(slashes.match('1/1'),         None)
        self.assertEqual(slashes.match('Jan 1'),       None)
        self.assertEqual(slashes.match('Jan 1, 2010'), None)
 
 
    def test_names(self):
        names = re.compile(r'(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s(\d{1,2})(,\s(\d{2,4}))?$',
                           re.IGNORECASE)
 
        self.assertEqual(names.match('Jan 1').group(),         'Jan 1')
        self.assertEqual(names.match('Jan 1').group(1),        'Jan')
        self.assertEqual(names.match('Jan 1').group(2),        '1')
        self.assertEqual(names.match('feb 21').group(),        'feb 21')
        self.assertEqual(names.match('feb 21').group(1),       'feb')
        self.assertEqual(names.match('feb 21').group(2),       '21')
        self.assertEqual(names.match('Jan 1, 2010').group(),   'Jan 1, 2010')
        self.assertEqual(names.match('Jan 1, 2010').group(1),  'Jan')
        self.assertEqual(names.match('Jan 1, 2010').group(2),  '1')
        self.assertEqual(names.match('Jan 1, 2010').group(4),  '2010')
        self.assertEqual(names.match('MAR 14, 2010').group(),  'MAR 14, 2010')
        self.assertEqual(names.match('MAR 14, 2010').group(1), 'MAR')
        self.assertEqual(names.match('MAR 14, 2010').group(2), '14')
        self.assertEqual(names.match('MAR 14, 2010').group(4), '2010')
 
        self.assertEqual(names.match('MA 20, 2010'), None)
        self.assertEqual(names.match('xyz 2,'),      None)
        self.assertEqual(names.match('jan 2,'),      None)
 
 
if __name__ == '__main__':
    unittest.main()

Using regular expressions, with their ability to match and group the pieces of a string, makes the code needed to clean up the dates much simpler and a lot easier to maintain. Next time I need to parse strings it might still take me a moment to think of regular expressions, and I’m pretty sure I’ll have to refer to the HOWTO a couple more times, but I’m really happy with how powerful they are and how much simpler they make text processing.
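To give an idea of how the groups can feed the actual cleanup, here is a minimal sketch written for Python 3. It is not the conversion code from the project. The names normalize_date and expand_year, the M/D/YYYY target format, the rule that two-digit years belong to the 2000s and the default_year parameter are all assumptions made for illustration.

```python
import re

# The two patterns from the tests, with the digit and whitespace escapes.
SLASHES = re.compile(r'(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})$')
NAMES = re.compile(
    r'(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\s(\d{1,2})(,\s(\d{2,4}))?$',
    re.IGNORECASE)

MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']


def expand_year(year):
    """Assumption: one- and two-digit years belong to the 2000s ('01' -> 2001)."""
    value = int(year)
    return value + 2000 if len(year) <= 2 else value


def normalize_date(raw, default_year=None):
    """Return the date as 'M/D/YYYY', or None if the string is not recognized."""
    text = raw.strip()

    m = SLASHES.match(text)
    if m:
        month, day, year = m.groups()
        return '%d/%d/%d' % (int(month), int(day), expand_year(year))

    m = NAMES.match(text)
    if m:
        month = MONTHS.index(m.group(1).lower()) + 1
        if m.group(4) is not None:
            year = expand_year(m.group(4))
        elif default_year is not None:
            # Hypothetical fallback for strings like 'feb 21' with no year.
            year = default_year
        else:
            return None
        return '%d/%d/%d' % (month, int(m.group(2)), year)

    return None
```

Anything that returns None (‘2/27’, for instance) would still need to be looked at by hand.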


Web service monitoring w/ Nagios and JSON

I’m using Nagios to act as a watchdog for my network and the various services that live on it. Nagios does the job pretty well. It lets me know when there’s a problem, when things are back to normal and generally keeps an eye on things for me.

The checks that Nagios performs are done through a series of check commands. These commands are your typical Unix-style programs, with the exception that they produce a single line of text that describes the state of the item being checked, and the exit value lets Nagios know what’s going on.

So for instance, to check the health of the web service on the localhost:

peter@sybil:~$ /usr/lib/nagios/plugins/check_http -H localhost
HTTP OK HTTP/1.1 200 OK - 361 bytes in 0.001 seconds |time=0.001021s;;;0.000000 size=361B;;;0
peter@sybil:~$ echo $?
0
peter@sybil:~$

The single line of text that is displayed follows a specific format. It starts with the prefix of what’s being tested, HTTP. Next is the status, OK. This can be OK, WARNING, CRITICAL or UNKNOWN. Everything after the status is eye candy that provides details specific to the test being done. Nagios doesn’t really care about it, but it does provide important details when looking into problems that may be occurring.
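As a rough illustration of that format, the sketch below (Python 3) splits an output line into its parts and maps the status to the matching exit value. The function name parse_check_output is made up for this example, and it assumes a single-word prefix followed by the status and at least one word of detail, with optional performance data after the ‘|’.

```python
# Exit values a check command returns for each status.
EXIT_CODES = {'OK': 0, 'WARNING': 1, 'CRITICAL': 2, 'UNKNOWN': 3}


def parse_check_output(line):
    """Split a plugin output line into prefix, status, detail and perfdata."""
    # Everything after the first '|' is performance data, if there is any.
    text, _, perfdata = line.partition('|')
    # Assumption: one-word prefix, then the status, then the free-form detail.
    prefix, status, detail = text.strip().split(None, 2)
    return {
        'prefix': prefix,
        'status': status,
        'exit_code': EXIT_CODES[status],
        'detail': detail,
        'perfdata': perfdata.strip(),
    }
```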

Writing these check programs in Python is pretty straightforward.

I recently had a situation where our ISP moved our web servers from one physical machine to another. This caused the credit card processing for our online store to fail. The payment provider uses the IP address of the server as part of the authentication process when submitting credit cards for processing. Since the server changed, the IP address changed. Things went around in circles for a while until we figured out the problem and gave the new IP address to the payment provider.

I thought it would be a good additional Nagios check for the store web site to check the IP address of the physical server. Unfortunately, the ISP doesn’t provide access to the IP address. But they do provide access to the hostname.

To get the hostname, I added a simple CGI program that determines the hostname and then packages it up as a JSON data structure.

#!/usr/bin/env python
 
"""
Bundle the hostname up as a JSON data structure.
 
Copyright (c) 2009 Peter Kropf. All rights reserved.
"""
 
import cgi
import popen2
import sys
sys.path.insert(1, '/home/crucible/tools/lib/python2.4/site-packages')
sys.path.insert(1, '/home/crucible/tools/lib/python2.4/site-packages/simplejson-2.0.9-py2.4-linux-x86_64.egg')
 
import simplejson as json
 
field = cgi.FieldStorage()
print "Content-Type: application/json\n\n"
 
r, w, e = popen2.popen3('hostname')
host = r.readline()
r.close()
w.close()
e.close()
 
fields = {'hostname': host.split('\n')[0]}
 
print json.dumps(fields)

One thing to note: since the ISP provides a very restrictive environment, I have to add the location of the simplejson module to the path before it can be imported. It’s a bit annoying but it does work.
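On a host with a current Python 3, the same idea needs neither popen2 nor simplejson. The following is only a sketch of that variation, not what runs at the ISP: it swaps in socket.gethostname() for the external hostname command and the standard json module for simplejson, and the function name hostname_response is made up for the example.

```python
import json
import socket


def hostname_response():
    """Build the CGI response: header, a blank line, then the JSON body."""
    # Assumption: socket.gethostname() reports the same name as `hostname`.
    body = json.dumps({'hostname': socket.gethostname()})
    return 'Content-Type: application/json\n\n' + body


if __name__ == '__main__':
    print(hostname_response())
```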

On the Nagios service side, I created a new check program called check_json. It takes the name of a field, the expected value and the URI from which to pull the JSON data.

#! /usr/bin/env python
 
"""
Nagios plugin to check a value returned from a uri in json format.
 
Copyright (c) 2009 Peter Kropf. All rights reserved.
 
Example:
 
Compare the "hostname" field in the json structure returned from
http://store.example.com/hostname.py against a known value.
 
    ./check_json hostname buenosaires http://store.example.com/hostname.py
"""
 
 
import urllib2
import simplejson
import sys
from optparse import OptionParser
 
prefix = 'JSON'
 
class nagios:
    ok       = (0, 'OK')
    warning  = (1, 'WARNING')
    critical = (2, 'CRITICAL')
    unknown  = (3, 'UNKNOWN')
 
 
def exit(status, message):
    print prefix + ' ' + status[1] + ' - ' + message
    sys.exit(status[0])
 
 
parser = OptionParser(usage='usage: %prog field_name expected_value uri')
options, args = parser.parse_args()
 
 
if len(args) < 3:
    exit(nagios.unknown, 'missing command line arguments')
 
 
field = args[0]
value = args[1]
uri = args[2]
 
try:
    j = simplejson.load(urllib2.urlopen(uri))
except urllib2.HTTPError, ex:
    exit(nagios.unknown, 'invalid uri')
 
if field not in j:
    exit(nagios.unknown, 'field: ' + field + ' not present')
 
if j[field] != value:
    exit(nagios.critical, j[field] + ' != ' + value)
 
exit(nagios.ok, j[field] + ' == ' + value)

Some checking is done to ensure that the JSON data can be retrieved, that the needed field is in the data and then that the field’s value matches what’s expected.
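The decision logic itself is small enough to pull out into a function that can be tested without a network connection. This is a sketch in Python 3 rather than part of the plugin above; check_field and status_line are hypothetical names, and the data is assumed to be a dictionary that has already been decoded from JSON.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: 'OK', WARNING: 'WARNING', CRITICAL: 'CRITICAL', UNKNOWN: 'UNKNOWN'}


def check_field(data, field, expected):
    """Return (exit_code, message) for one field of already-decoded JSON."""
    if field not in data:
        return UNKNOWN, 'field: %s not present' % field
    # Compare as text, since the expected value arrives from the command line.
    actual = str(data[field])
    if actual != expected:
        return CRITICAL, '%s != %s' % (actual, expected)
    return OK, '%s == %s' % (actual, expected)


def status_line(code, message, prefix='JSON'):
    """Format the single line of output that Nagios displays."""
    return '%s %s - %s' % (prefix, LABELS[code], message)
```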

These examples show the basic testing that’s done and the return values:

peter@sybil:~$ /usr/lib/nagios/plugins/check_json hostname buenosaires http://store.thecrucible.org/hostname.py
JSON OK - buenosaires == buenosaires
peter@sybil:~$ echo $?
0
peter@sybil:~$ /usr/lib/nagios/plugins/check_json hostname buenosaires http://store.thecrucible.org/hostname.p
JSON UNKNOWN - invalid uri
peter@sybil:~$ echo $?
3
peter@sybil:~$ /usr/lib/nagios/plugins/check_json hostname buenosairs http://store.thecrucible.org/hostname.py
JSON CRITICAL - buenosaires != buenosairs
peter@sybil:~$ echo $?
2
peter@sybil:~$ /usr/lib/nagios/plugins/check_json ostname buenosaires http://store.thecrucible.org/hostname.py
JSON UNKNOWN - field: ostname not present
peter@sybil:~$ echo $?
3
peter@sybil:~$

Once the Nagios server is configured with the new command, the hostname on the server can be monitored, which will hopefully ease any problems that may occur the next time things change at the ISP.

More details on Nagios can be found at http://nagios.org and on developing check programs at http://nagiosplug.sourceforge.net/developer-guidelines.html.


Running External Django Scripts

Django is pretty good at creating a database-driven website. The documentation is clear and the tutorials show how to use the framework to create web-based applications. But one part that I wish were a bit more straightforward is running scripts outside the web server. The issue is that Django code expects to have a certain environment configured and set up for the framework. With this in place, you can perform tasks like polling an IMAP server for incoming email messages or monitoring a directory for new files or whatever else needs to be done. There are several posts online to help you get the environment set up, but some of them seem not to work correctly because of the changes to Django for the 1.0 release or for other reasons.

I have a fairly straightforward example of how to set up the Django environment and allow the rest of your code to access the Django framework for your web application. It’s remarkably simple.

Suppose that I’ve created a Django project in my tmp directory called demo_scripts and within that project, I create an app called someapp.

peter@fog:~/tmp> django-admin-2.5.py startproject demo_scripts
peter@fog:~/tmp> cd demo_scripts/
peter@fog:~/tmp/demo_scripts> django-admin-2.5.py startapp someapp
peter@fog:~/tmp/demo_scripts>

I create a model in someapp that looks like:

from django.db import models
 
class Foo(models.Model):
    name = models.CharField(max_length=21,
                            unique=True,
                            help_text="Name of the foo.")
 
    def __unicode__(self):
        return self.name
 
    class Meta:
        ordering = ('name',)

Next step is to sync the database:

peter@fog:~/tmp/demo_scripts> ./manage.py syncdb
Creating table auth_permission
Creating table auth_group
Creating table auth_user
Creating table auth_message
Creating table django_content_type
Creating table django_session
Creating table django_site
Creating table someapp_foo
 
You just installed Django's auth system, which means you don't have any superusers defined.
Would you like to create one now? (yes/no): yes
Username (Leave blank to use 'peter'):
E-mail address: pkropf@gmail.com
Password:
Password (again):
Superuser created successfully.
Installing index for auth.Permission model
Installing index for auth.Message model
peter@fog:~/tmp/demo_scripts>

And add some initial data to the database:

peter@fog:~/tmp/demo_scripts> ./manage.py shell
Python 2.5.4 (r254:67916, May  1 2009, 17:14:50)
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from someapp.models import Foo
>>> Foo(name='A Foo').save()
>>> Foo(name='Another Foo').save()
>>>
peter@fog:~/tmp/demo_scripts>

Now we can write a standalone script to do something with the data model. For simplicity’s sake, I’ll just print out all the Foo objects. The script is going to live in a new directory called scripts. Here’s the source:

#! /usr/bin/env python
#coding:utf-8
 
import sys
import os
import datetime
 
sys.path.insert(0, os.path.expanduser('~/tmp/demo_scripts'))
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
 
from someapp.models import *
print Foo.objects.all()

When I run the script, it prints the list of the two Foo objects that I previously created:

peter@fog:~/tmp/demo_scripts> ./scripts/show_foo.py
[<Foo: A Foo>, <Foo: Another Foo>]
peter@fog:~/tmp/demo_scripts>

Lines 8 and 9, the sys.path.insert() call and the os.environ assignment, are the critical lines in the script. The first adds the project directory to the Python system path so that the settings module can be found. The second tells Django which module to import to determine the project settings.
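If several scripts need the same setup, the two steps can be wrapped in a small helper. This is a sketch for Python 3, and configure_django_env is a name made up for this example. One thing I am sure of: on Django 1.7 and later, django.setup() also has to be called before any models are imported, which the Django 1.0 era script above doesn’t need.

```python
import os
import sys


def configure_django_env(project_dir, settings_module='settings'):
    """Make the project importable and tell Django which settings to load."""
    project_dir = os.path.abspath(os.path.expanduser(project_dir))
    if project_dir not in sys.path:
        sys.path.insert(0, project_dir)
    os.environ['DJANGO_SETTINGS_MODULE'] = settings_module
    return project_dir


# Hypothetical usage, before importing any models:
#   configure_django_env('~/tmp/demo_scripts')
#   import django; django.setup()   # only on Django 1.7 and later
#   from someapp.models import Foo
```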


Django Google Apps Authentication

Django has an excellent user management and authentication system built into the framework. With it you can easily create users that can be authenticated against the website. But there are times when you just need to authenticate against a different system. In the case of an app I recently developed, I originally wanted to authenticate against an OS X Server. The OpenDirectory service on OS X Server is an LDAP server; under the hood you’ll find slapd from OpenLDAP running. So it should be pretty straightforward to create an authentication module that uses Python’s LDAP module. And this article from the Carthage WebDev site shows you how to do it.

After I got the ldap_auth.py module working on my site, I realized the site would be better served if the authentication happened against Google Apps. Since Google Apps is currently being used by the organization for email, calendaring and sharing documents, everyone already has an account there. And with the ldap_auth.py module from Carthage Webdev, I thought it would be pretty simple to provide a google_auth.py module.

To get started, I had to install gdata. The installation instructions found on the Google Apps APIs page were pretty easy to follow. Specifically, I had to install the Provisioning API.

On a side note, I’m using Python 2.5 as installed via MacPorts. Before I could use the gdata APIs, I had to install py25-socket-ssl.

The APIs are pretty well documented via the examples from the Python Developer’s Guide. Here’s how I’m authenticating a Django project with users on Google Apps.

To start, there are three configuration variables that I added to the Django project’s settings.py module:

# Google Apps Settings
GAPPS_DOMAIN = 'your_domain.com'
GAPPS_USERNAME = 'name_of_an_admin_user'
GAPPS_PASSWORD = 'admin_users_password'

These will allow the module to authenticate against Google Apps and ask for specific details about the user.

Here’s the code for google_auth.py:

import logging
from django.contrib.auth.models import User
from django.conf import settings
from gdata.apps.service import AppsService, AppsForYourDomainException
from gdata.docs.service import DocsService
from gdata.service import BadAuthentication
 
 
logging.debug('GoogleAppsBackend')
 
 
class GoogleAppsBackend:
    """ Authenticate against Google Apps """

    def authenticate(self, username=None, password=None):
        # Mask the password in the log; tolerate a missing password.
        logging.debug('GoogleAppsBackend.authenticate: %s - %s' % (username, '*' * len(password or '')))
        admin_email = '%s@%s' % (settings.GAPPS_USERNAME, settings.GAPPS_DOMAIN)
        email = '%s@%s' % (username, settings.GAPPS_DOMAIN)

        try:
            # Check user's password
            logging.debug('GoogleAppsBackend.authenticate: gdocs')
            gdocs = DocsService()
            gdocs.email = email
            gdocs.password = password
            gdocs.ProgrammaticLogin()

            # Get the user object
            logging.debug('GoogleAppsBackend.authenticate: gapps')
            gapps = AppsService(email=admin_email,
                                password=settings.GAPPS_PASSWORD,
                                domain=settings.GAPPS_DOMAIN)
            gapps.ProgrammaticLogin()
            guser = gapps.RetrieveUser(username)

            logging.debug('GoogleAppsBackend.authenticate: user - %s' % username)
            user, created = User.objects.get_or_create(username=username)

            if created:
                # Copy the account details from Google Apps on first login.
                logging.debug('GoogleAppsBackend.authenticate: created')
                user.email = email
                user.last_name = guser.name.family_name
                user.first_name = guser.name.given_name
                user.is_active = guser.login.suspended != 'true'
                user.is_superuser = guser.login.admin == 'true'
                user.is_staff = True
                user.save()

        except BadAuthentication:
            logging.debug('GoogleAppsBackend.authenticate: BadAuthentication')
            return None

        except AppsForYourDomainException:
            logging.debug('GoogleAppsBackend.authenticate: AppsForYourDomainException')
            return None

        return user


    def get_user(self, user_id):

        user = None
        try:
            logging.debug('GoogleAppsBackend.get_user')
            user = User.objects.get(pk=user_id)

        except User.DoesNotExist:
            logging.debug('GoogleAppsBackend.get_user - DoesNotExist')
            return None

        return user

It was pretty easy to write and debug this code using the ldap_auth.py module as a working example.
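For Django to actually use a custom backend, it has to be listed in the AUTHENTICATION_BACKENDS setting. The fragment below is a sketch of what that could look like in settings.py. It assumes google_auth.py is importable as a top-level module, and keeping Django’s own ModelBackend as a second entry is just one possible choice.

```python
# settings.py (fragment)
# Assumption: google_auth.py sits on the Python path as a top-level module.
AUTHENTICATION_BACKENDS = (
    'google_auth.GoogleAppsBackend',
    # Optional fallback for accounts that exist only in Django's own database.
    'django.contrib.auth.backends.ModelBackend',
)
```

Django tries the backends in the order listed, so with this arrangement Google Apps is asked first.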

One downside to this code is that any newly created users in the Django auth database don’t have any rights. So if the Django project expects to change its contents dynamically based on the rights that the user has, the account will have to be modified manually via the Django admin interface. Not too bad, but annoying.
