Regular Expressions in JavaScript

Parsing Data with Regular Expressions

The previous page showed examples that modify string data. This page deals with breaking pieces out of strings so they can be used to form new strings or be processed in some way outside of the original string context.

Parsing data usually implies that there is some knowledge about the form of the source data. For the most part, these examples are dependent on having good source data. In a system, however, the code should also handle bad data in an appropriate manner.

Mapping columns from a delimited string

Here is an easy start that is similar to the example of inverting names on the previous page. In this example the first, third, and sixth columns are extracted from a space-delimited record and returned in sixth, first, and third order in a new string. Spaces are not allowed within the column data.

The first method parses strings of non-whitespace characters (words) between word boundaries into an array. The meta-character \b for word boundary lets the first and last words of the string match. A word boundary is a zero-width position between a word character and a non-word character, such as a space, or at the beginning or end of the line.

The second method, which creates an array whose element 0 contains all six words and the spaces, uses the meta-character ^ to force the match to start at the beginning of the line. Instead of declaring an array, the components are retrieved from the static RegExp object.

function xtractReportType1(data) {
    var array = data.match(/\b[\S]+\b/g);
    return array[5] + " " + array[0] +
            " paint retail price: $" + array[2] + " ea.";
}

function xtractReportType2(data){
    data.match(/^(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\b/);
    return RegExp.$6 + " " + RegExp.$1 +
            " paint retails for $" + RegExp.$3 + " ea.";
}

The static RegExp object is limited to nine captured substrings, i.e. $1 through $9.
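
As a hedged alternative to the static RegExp properties, which are legacy features in current JavaScript engines, the same captures can be read from the array that match() returns. The sketch below assumes the same six-column record. The name xtractReportType3 and the sample record are made up for illustration, and returning null for bad data is only one possible way to handle it.

```javascript
// Hypothetical variant of xtractReportType2: reads the captures from the
// array returned by match() instead of RegExp.$1 through RegExp.$9.
function xtractReportType3(data) {
    var m = data.match(/^(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\b/);
    if (!m) {
        return null; // bad source data: fewer than six columns
    }
    // m[0] is the whole match; m[1] through m[6] are the six columns.
    return m[6] + " " + m[1] + " paint retails for $" + m[3] + " ea.";
}

// Assumed sample record: color, quantity, price, unit, stock number, brand.
var report = xtractReportType3("Blue 5 14.99 gal 0045 Acme");
// report is "Acme Blue paint retails for $14.99 ea."
```

Because the captures live in a local array, this version is not affected by other matches performed elsewhere in the page between the call to match() and the use of the results.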

Extracting numeric fields from text

Still an easy start. This example picks out numeric fields embedded in text. Perhaps this is in a loop reading a report and building a table.

A simple one-line expression returns an array of all numbers in the string. Two expressions are shown, depending on whether the numbers in the string have formatting characters. The first processes strings where all numbers are unformatted; it is the less complicated expression. The second handles properly formatted numbers, with thousands separators and up to two decimal places.

Nonformatted:
        function xtractNums(str){
            return str.match(/\d+/g);
        }

Formatted:
        function xtractFormattedNums(str) {
            return str.match(/\d+(,\d{3})*(\.\d{1,2})?/g);
        }
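
To see why the second expression is needed, the following sketch runs both functions over the same line. The functions are repeated so the block runs on its own; the sample line is invented for illustration.

```javascript
// Copied from the two examples above.
function xtractNums(str) {
    return str.match(/\d+/g);
}
function xtractFormattedNums(str) {
    return str.match(/\d+(,\d{3})*(\.\d{1,2})?/g);
}

// Hypothetical line from a report.
var line = "Total: 1,234.56 for 12 items";

// The simple expression splits the formatted amount at the comma and the dot.
var plain = xtractNums(line);              // ["1", "234", "56", "12"]

// The second expression keeps the formatted amount in one piece.
var formatted = xtractFormattedNums(line); // ["1,234.56", "12"]
```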

When the regular expression used with the string method match has the global (g) flag, an array of all matches is returned. If the global flag is not present, the return is the same as that of the RegExp method exec: array element 0 contains the complete matched string and subsequent elements contain the captured substrings ($1, $2, etc.). In either case, no match returns null.
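
The difference is easiest to see with one pattern used both ways. This is a small sketch with an invented sample string:

```javascript
var text = "10 apples, 25 pears"; // hypothetical sample

// With the g flag: every complete match, no captured substrings.
var withG = text.match(/(\d)(\d)/g);   // ["10", "25"]

// Without the g flag: first match only, followed by its captures.
var withoutG = text.match(/(\d)(\d)/); // ["10", "1", "0"]

// No match returns null in either form.
var noMatch = text.match(/\d{3}/g);    // null
```

Since both forms can return null, code that indexes into the result should test it first.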

Parsing URLs

Because Web development frequently deals with URLs, there are times when a portion of a URL needs to be evaluated. A branch may occur based on the filename, or the path may need to be appended to links in a dynamic menu before further processing. The source of the URL can be user input, document.referrer, or window.location.href, although the location object has already parsed the URL. In the case of the location object, the only issue would be to separate the file name from the path. For that, see the next example.

This example offers two options; both are restricted to the http and ftp protocols, which can easily be changed. The first method requires that the URL be the only text in the string, as it might be in an object property or a field used exclusively for entering a URL. The second option extracts the URL from a string of text.

Both regular expressions look very much the same, but notice that the first begins with ^ and ends with $. These meta-characters indicate that the pattern must start at the beginning of the line and finish at its end. Also note that the return from match() is tested before generating the return object: RegExp contains the last successful match and does not reflect a failed match attempt. The elements of the array returned by match() could have been used instead of the properties of RegExp to populate the return object.

function parseUrl1(data) {
    var e=/^((http|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+\.[^#?\s]+)(#[\w\-]+)?$/;

    if (data.match(e)) {
        return  {url: RegExp['$&'],
                protocol: RegExp.$2,
                host:RegExp.$3,
                path:RegExp.$4,
                file:RegExp.$6,
                hash:RegExp.$7};
    }
    else {
        return  {url:"", protocol:"",host:"",path:"",file:"",hash:""};
    }
}

function parseUrl2(data) {
    var e=/((http|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+\.[^#?\s]+)(#[\w\-]+)?/;

    if (data.match(e)) {
        return  {url: RegExp['$&'],
                protocol: RegExp.$2,
                host:RegExp.$3,
                path:RegExp.$4,
                file:RegExp.$6,
                hash:RegExp.$7};
    }
    else {
        return  {url:"", protocol:"",host:"",path:"",file:"",hash:""};
    }
}
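
For comparison, here is a sketch of the first function populated from the match array rather than from RegExp. The name parseUrl3 and the sample URLs are assumptions made for illustration; the expression is the same one used in parseUrl1.

```javascript
// Hypothetical variant of parseUrl1 that uses the array returned by match().
function parseUrl3(data) {
    var e = /^((http|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+\.[^#?\s]+)(#[\w\-]+)?$/;
    var m = data.match(e);

    if (!m) {
        return {url: "", protocol: "", host: "", path: "", file: "", hash: ""};
    }
    // Optional groups that did not take part in the match are undefined in
    // the array, so they are replaced with "" to keep every field a string.
    return {url: m[0],
            protocol: m[2] || "",
            host: m[3],
            path: m[4],
            file: m[6],
            hash: m[7] || ""};
}

var parts = parseUrl3("http://www.example.com/docs/guide/page.html#top");
// parts.protocol is "http", parts.host is "www.example.com",
// parts.path is "/docs/guide/", parts.file is "page.html", parts.hash is "#top"
```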

Parsing path and filename from the location object

This example is similar to the previous one, although it is much simpler. Our goal is to separate the path and filename from the data in the location object. You can use the window.location object or document.URL; they should hold the same data. Most of the parsing in the previous example is already done by the location object. It has properties for protocol, host (includes port), hostname (without port), pathname, hash, and search (the query string). For this example we will use pathname, which includes the filename but not the protocol or host.

Getting the path without the filename, or the filename without the path, is a common enough programming need that one can think of at least a few scenarios. The need for the filename without the extension might be a little more obscure. I have used the filename as an ID, which doesn't allow periods, for elements that are common across several pages, and used onload DHTML to make page-specific changes.

Without extension
        function xtractFile_sans(data){
            var m = data.match(/(.*)[\/\\]([^\/\\]+)\.\w+$/);
            return {path: m[1], file: m[2]};
        }

Full Name
        function xtractFile(data){
            var m = data.match(/(.*)[\/\\]([^\/\\]+\.\w+)$/);
            return {path: m[1], file: m[2]};
        }

This code is simple. Notice the use of the meta-character $ to match at the end of the string and the use of an excluded set of characters, [^\/\\], which matches any character except the / (or \ on Windows) that separates the filename from the path.
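
Both functions assume the pathname ends in a filename with an extension; when it does not, match() returns null and reading m[1] throws an error. A minimal sketch that guards against this and returns both forms of the filename might look like the following. The name splitPathname, the field fileSansExt, and the sample paths are hypothetical; in a browser the argument would typically be window.location.pathname.

```javascript
// Hypothetical helper combining the two functions above, with a null check.
function splitPathname(pathname) {
    // Same expression as above, with the extension captured as a third group.
    var m = pathname.match(/(.*)[\/\\]([^\/\\]+)\.(\w+)$/);
    if (!m) {
        return null; // no filename with an extension at the end of the string
    }
    return {path: m[1],
            file: m[2] + "." + m[3],
            fileSansExt: m[2]};
}

var page = splitPathname("/docs/regex/page3.html");
// page.path is "/docs/regex", page.file is "page3.html",
// page.fileSansExt is "page3"

var dir = splitPathname("/docs/regex/");
// dir is null because the pathname ends in a directory, not a file
```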

Extracting File Name and Extension

This example is similar to the previous one. Our goal is to separate the file name and extension from a string, which is useful for verifying a file name or type.

        function xtractFile(data){
            data = data.replace(/^\s+|\s+$/g, ""); // trims all leading and trailing whitespace

            if (/\.\w+$/.test(data)) {
                var m = data.match(/([^\/\\]+)\.(\w+)$/);
                if (m)
                    return {filename: m[1], ext: m[2]};
                else
                    return {filename: "no file name", ext:null};
            } else {
                var m = data.match(/([^\/\\]+)$/);
                if (m)
                    return {filename: m[1], ext: null};
                else
                    return {filename: "no file name", ext:null};
            }
        }
or
        function xtractFile(data){
            data = data.replace(/^\s+|\s+$/g, ""); // trims string

            if (/\.\w+$/.test(data)) {
                if (data.match(/([^\/\\]+)\.(\w+)$/) )
                    return {filename: RegExp.$1, ext: RegExp.$2};
                else
                    return {filename: "no file name", ext:null};
            }
            else {
                if (data.match(/([^\/\\]+)$/) )
                    return {filename: RegExp.$1, ext: null};
                else
                    return {filename: "no file name", ext:null};
            }
        }

This code should have been simple, but modern operating systems allow dots in filenames and paths, so we must allow them too. This forces us to test whether there is something we can call an extension before deciding how to handle the string.
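
The effect of dots in names and paths can be seen by running a few strings through the function. The block below repeats the first version so it runs on its own, with the trim pattern written as \s+ so that all surrounding whitespace is stripped; the sample strings are invented for illustration.

```javascript
// Copy of the first version above; the trim pattern uses \s+ rather than \s.
function xtractFile(data) {
    data = data.replace(/^\s+|\s+$/g, "");
    var m;

    if (/\.\w+$/.test(data)) {
        // The string ends in something that looks like an extension.
        m = data.match(/([^\/\\]+)\.(\w+)$/);
        if (m)
            return {filename: m[1], ext: m[2]};
        else
            return {filename: "no file name", ext: null};
    } else {
        m = data.match(/([^\/\\]+)$/);
        if (m)
            return {filename: m[1], ext: null};
        else
            return {filename: "no file name", ext: null};
    }
}

// A dot inside the name: only the last dot starts the extension.
var a = xtractFile("  report.final.pdf  "); // filename "report.final", ext "pdf"

// A dot inside the path but none in the name: no extension is reported.
var b = xtractFile("C:\\docs\\v1.2\\notes"); // filename "notes", ext null

// A string that ends in a separator has no file name at all.
var c = xtractFile("/home/user/");          // filename "no file name", ext null
```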

An option, if you know what file extensions you're allowing and you're not allowing files without an extension, is to test for the file extensions.

        function xtractFile(data){
            data = data.replace(/^\s+|\s+$/g, ""); //trims string

            if (data.match(/([^\/\\]+)\.(asp|html|htm|shtml|php)$/i) )
                return {filename: RegExp.$1, ext: RegExp.$2};
            else
                return {filename: "invalid file type", ext: null};
        }
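
A brief usage sketch of this approach follows, assuming the same list of extensions. The function is renamed xtractAllowedFile here (a hypothetical name) and reads the captures from the match array; the sample file names are invented.

```javascript
// Hypothetical variant of the function above, using the match array.
function xtractAllowedFile(data) {
    data = data.replace(/^\s+|\s+$/g, ""); // trim surrounding whitespace

    var m = data.match(/([^\/\\]+)\.(asp|html|htm|shtml|php)$/i);
    if (m)
        return {filename: m[1], ext: m[2]};
    else
        return {filename: "invalid file type", ext: null};
}

// The i flag makes the extension test case-insensitive; the extension is
// returned exactly as it was typed.
var ok = xtractAllowedFile("/site/index.HTML"); // filename "index", ext "HTML"

// Extensions outside the list are rejected.
var bad = xtractAllowedFile("setup.exe");       // filename "invalid file type", ext null
```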