Regular Expressions in JavaScript

Data validation is where regular expressions become a Web application developer's friend. A log-in ID or password has to have an exact match; however, the expected value of most data is unknown. Instead, checking the validity of entered data usually means testing whether the data fits a specific pattern. Regular expressions are designed to find patterns in strings. One line can do a complete pattern test that would otherwise require more extensive loops and branches using string methods.

Verify Field Contains Data  

The most fundamental test is whether or not the field has data. Below are three different tests to determine whether the string contains data.


The tests all ask the same question: "Does a minimal amount of data exist anywhere in the string?"

        function anyChar(str) {
                return /\S+/.test(str);
        }

        function anyWord(str) {
                return /\w+/.test(str);
        }

        function any3letters(str) {
                return /[a-z]{3}/i.test(str);
        }

The first test uses the metacharacter \S and the + to ask whether one or more non-whitespace characters are in the string. This includes anything other than whitespace such as a space, tab, CR, or LF. Of course, the Enter key only produces a line break in multi-line text boxes. We worked with the results of pressing the Enter key in Convert Returns to <br>.

The second test is a little more restrictive. It uses the metacharacter \w and the + to test for one or more word characters. This includes letters, digits, and the underscore.

The third test is even more restrictive. It uses the character set [a-z] in combination with the case insensitive flag i and the {3} to require three letters together. These can be anywhere in the string and combined with any other characters, just as long as there are three letters together. The set could have been written [a-zA-Z] to include both upper and lower case letters, and the i flag would have been unnecessary.
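To see how the three patterns differ, the sketch below runs them against a few sample strings. The helper name runPresenceTests and the sample strings are illustrative assumptions, not part of the original page.

```javascript
// The same three patterns used by anyChar, anyWord, and any3letters.
const presencePatterns = {
    anyChar: /\S+/,
    anyWord: /\w+/,
    any3letters: /[a-z]{3}/i
};

// Hypothetical helper: report which patterns a string satisfies.
function runPresenceTests(str) {
    const results = {};
    for (const name of Object.keys(presencePatterns)) {
        results[name] = presencePatterns[name].test(str);
    }
    return results;
}

// An empty or all-space string fails every test; "?!" passes only anyChar;
// "a1" passes anyChar and anyWord; "x Abc 9" passes all three.
for (const sample of ["", "   ", "?!", "a1", "x Abc 9"]) {
    console.log(JSON.stringify(sample), runPresenceTests(sample));
}
```

Each test is strictly narrower than the one before it, so a string that passes any3letters always passes the other two.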

Verify Data is Required Length  

After ensuring there is data, the next basic test is whether there is the right amount of data. Are there enough characters to make a valid phone number, employee ID or password? Using a length test eliminates the need for the empty test.


The three functions used in this example go beyond the basic test to test for and report common errors. The first condition is for correct data and returns the message Valid. Then common errors are checked starting with the most specific. Each test returns with an appropriate error message.

Note: all three functions use the beginning-of-string character, ^, and the end-of-string character, $. This forces the complete string to match the pattern, not just a part of the string as was the case above, thus restricting the length of the string.

        function any10Char(str) {
            if (/^\S{10}$/.test(str))
                return "Valid";
            else if (/\s/.test(str))
                return "Invalid space";
            else if (/^\S{11,}$/.test(str))
                return "Too long!";
            else
                return "Too short!";
        }

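        function anyVar10(str) {
            // Valid: a letter, $, or _ followed by exactly nine word characters.
            if (/^[a-z$_]\w{9}$/i.test(str))
                return "Valid";
            else if (/^[^a-z$_]/i.test(str))
                return "First Char Invalid!";
            // A non-word character anywhere after the first character.
            else if (/^[a-z$_]\w*\W/i.test(str))
                return "Invalid Character!";
            // Eleven or more characters, allowing $ or a letter in first position.
            else if (/^[a-z$_]\w{10,}$/i.test(str))
                return "Too long!";
            else
                return "Too short!";
        }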

        function anyClass5(str) {
            if (/^[a-z][a-z0-9\-]{4,}$/i.test(str))
                return "Valid";
            else if (/^[^a-z]/i.test(str))
                return "First Char Not Alpha!";
            else if (/[^a-z0-9\-]/i.test(str))
                return "Invalid Character!";
            else
                return "Too short!";
        }

The first function tests for exactly 10 non-space characters with \S{10}. There cannot be additional characters before or after the matched 10 because the expression pattern requires the 10 characters to start at the beginning of the string and go to the end by using ^ and $: /^\S{10}$/. If that fails, then the string is tested for a space character with \s (the difference is whether the s is upper or lower case). If that wasn't the cause of failure, the string is tested for too many characters, 11 or more, with {11,}. Otherwise, it must be too short.
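The effect of ^ and $ is easiest to see by comparing an anchored pattern with its unanchored twin. This is a minimal sketch; the variable names and sample strings are illustrative assumptions.

```javascript
const unanchored = /\S{10}/;  // ten non-space characters anywhere in the string
const anchored = /^\S{10}$/;  // the whole string must be exactly ten non-space characters

const thirteenChars = "abcdefghijKLM";

console.log(unanchored.test(thirteenChars)); // true: the first ten characters satisfy it
console.log(anchored.test(thirteenChars));   // false: three characters are left over
console.log(anchored.test("abcdefghij"));    // true: exactly ten
```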

In the second pattern we are looking for a valid JavaScript variable name of ten characters. JavaScript restricts the first character to letters, $, or an underscore. The pattern /^[a-z$_] specifies the set of characters allowed for the first character. It is followed by \w{9}$/ meaning nine more valid characters reaching the end of the string. The \w metacharacter includes the characters programming languages customarily allow in identifiers (letters, digits and underscore). It does not include $, so this pattern is slightly stricter than JavaScript itself, which also allows $ after the first character. The i flag at the end of the expression makes the pattern case insensitive. The expression could have been written /^[a-zA-Z$_]\w{9}$/, specifying both upper and lower case letters in the character set and eliminating the need for the flag. As before, the string is then tested for possible errors. The only new pattern is [^...]. The caret as the first character inside square brackets is the not metacharacter, meaning anything but the characters in this set.
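A short sketch of the two ideas in this paragraph: the negated set [^...] and the characters \w does and does not cover. The variable names and sample strings are illustrative assumptions.

```javascript
// Negated set: true when the first character is NOT a letter, $, or underscore.
const badFirstChar = /^[^a-z$_]/i;
// The ten-character variable-name pattern discussed above.
const tenCharName = /^[a-z$_]\w{9}$/i;

console.log(badFirstChar.test("9lives"));    // true: a digit may not come first
console.log(badFirstChar.test("$price"));    // false: $ is allowed first
console.log(badFirstChar.test("_tmp"));      // false: underscore is allowed first

console.log(/\w/.test("$"));                 // false: \w does not include $
console.log(tenCharName.test("$abcdefghi")); // true: $ followed by nine word characters
console.log(tenCharName.test("ab$cdefghi")); // false: $ is rejected after the first position
```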

The last function tests for a valid CSS class name. The pattern rules are different for a CSS class name than for a variable name. The original CSS 2 grammar did not permit underscores (CSS 2.1 and later do, and browsers accept them), so this pattern takes the stricter route and avoids the metacharacter \w, which includes the underscore. Instead, the pattern specifies that the first character is a letter followed by four or more characters that can be letters, digits or a dash, for a minimum length of five. The dash is escaped, i.e. preceded with a backslash (\), because it has special meaning inside square brackets as a range separator. Since it is the last character in the set, the escape is not strictly necessary in JavaScript.
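The following sketch shows why the position of the dash inside square brackets matters. The three patterns are illustrative assumptions chosen to isolate the behavior.

```javascript
const range = /[a-c]/;     // dash between two characters: the range a through c
const escaped = /[a\-c]/;  // escaped dash: the three literal characters a, -, c
const trailing = /[ac-]/;  // dash in last position: also a literal dash

console.log(range.test("b"));    // true: b falls inside the range
console.log(range.test("-"));    // false: the dash is only a separator here
console.log(escaped.test("b"));  // false: no range, so b is not in the set
console.log(escaped.test("-"));  // true
console.log(trailing.test("-")); // true: no escape needed at the end of the set
```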

Verify Data has Minimum Word Count  

This is a less commonly needed test; however, checking word count can be useful for fields where a set number of words is required or where a minimum number can indicate valid data, for instance as a test that a comment or help request has meaningful information. The basic word unit in regular expressions is a run of letters, digits, or underscores (the \w metacharacter) surrounded by space characters or by the beginning or end of the string. However, regular expression patterns can be defined in many ways. Each word is bordered by a word boundary (\b). This is not a character but a zero-width position: the point between a word character and a neighboring non-word character or the end of the string.


In this example, there are two approaches. The first counts every word, including single-letter words such as a. The next is more rigorous because it counts multi-lettered words only, in this case words with three or more letters or digits. The words do not have to be consecutive; there can be groups of single or double characters between the words.

        function any4Words(str) {
            return /(\b[a-z0-9]+\b.*){4,}/i.test(str);
        }

        function Three3lettered(str) {
            return /(\b[a-z0-9]{3,}\b.*){3,}/i.test(str);
        }

Both examples make use of the metacharacter \b for word boundary. That lets the first and last words, which do not have spaces on both sides, get counted. The patterns don't work, however, without including the metacharacter . for any character followed by the * symbol for zero or more times. That metacharacter combination allows matching the required number of words scattered anywhere in the string.
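As an alternative sketch, not the approach used in the functions above, the words can be collected with String match and the g flag and then counted. The helper names countWords and hasMinWords are hypothetical. Because this version does not rely on the . metacharacter, words separated by line breaks are counted too.

```javascript
// Hypothetical helper: count words made of at least minLength letters or digits.
function countWords(str, minLength = 1) {
    const wordPattern = new RegExp("\\b[a-z0-9]{" + minLength + ",}\\b", "gi");
    const matches = str.match(wordPattern); // null when nothing matches
    return matches ? matches.length : 0;
}

// Hypothetical helper: true when the string has at least minWords such words.
function hasMinWords(str, minWords, minLength = 1) {
    return countWords(str, minLength) >= minWords;
}

console.log(countWords("one two three four"));    // 4
console.log(countWords("I am a big red dog", 3)); // 3 (big, red, dog)
console.log(hasMinWords("a b c", 4));             // false
```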

Validating an SSN  

Although SSN tests may be rare, the generalized form of this test is common. Here we have chosen to verify that a string matches the pattern of an American Social Security Number. This can be rewritten to accommodate a part number, Library of Congress number, or any other character label that has a pattern with formatting characters, including some positions restricted to letters and others to digits.


        function ssnExp1(data) {
            return /^\d{3}-\d{2}-\d{4}$/.test(data);
        }

        function ssnExp2(data) {
            return /^\d{3}-?\d{2}-?\d{4}$/.test(data);
        }

The first option requires formatting characters: the two dashes. The pattern in the expression is easy to read. It looks for 3 digits, a dash, 2 digits, a dash, and 4 digits. These are common formatting characters in American style Social Security Numbers. The dash doesn't have to be escaped (preceded by \) since it is not inside square brackets.

The second expression makes use of the ? metacharacter meaning zero or one of the preceding character to make the formatting characters optional.
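Because each dash is optional on its own, the second expression also accepts a number with one dash but not the other. If that is not wanted, one possible variation (an assumption, not part of the original recipe) captures the first separator and requires the same one again with the backreference \1. The function name ssnConsistent is hypothetical.

```javascript
// The second expression from above, for comparison.
function ssnExp2(data) {
    return /^\d{3}-?\d{2}-?\d{4}$/.test(data);
}

// Hypothetical variant: (-?) captures a dash or nothing, and \1 demands the same again.
function ssnConsistent(data) {
    return /^\d{3}(-?)\d{2}\1\d{4}$/.test(data);
}

console.log(ssnExp2("123-456789"));        // true: mixed formatting slips through
console.log(ssnConsistent("123-456789"));  // false
console.log(ssnConsistent("123-45-6789")); // true
console.log(ssnConsistent("123456789"));   // true
```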

Validating a Phone Number  

This is similar to the preceding example with a slightly more complex set of formatting characters. There are parentheses, a space, and a dash, or none of those, with periods separating the parts of the phone number instead. This shows how the pattern used in the SSN example can be modified to fit other needs.


Assuming the area code, exchange, and phone are not separate input fields, a valid American or Canadian phone number would consist of the area code within parentheses, a space, the 3 digit exchange, a dash, and the 4 digit phone number. There should be no leading or trailing characters. The first number of the area code may not be a zero. A regular expression matching that pattern would look like this:

        function phoneExp1(data) {
            return /^\(\d{3}\) \d{3}-\d{4}$/.test(data);
        }

        function phoneExp2(data) {
            return /^\([1-9]\d{2}\)\s?\d{3}\-\d{4}$/.test(data);
        }

        function phoneExp3(data) {
            return /^\(?([1-9]\d{2})(\) ?|[.-])?(\d{3})[.-]?(\d{4})$/.test(data);
        }

The first expression just checks the number of digits and the formatting characters. The parentheses need to be escaped with a \ because unescaped parentheses have special meaning in regular expressions: they create groups.

The second pattern restricts the first digit of the area code to 1 through 9 and not 0, which was the full requirement, and makes the space optional.

The third pattern makes all formatting optional. It also allows the use of periods to separate the parts of the phone number, a style gaining popularity. The group (\) ?|[.-]) matches either a right parenthesis with an optional space, or a period or dash. It is followed by a ? meaning that the whole group is optional (i.e. zero or one).
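The parentheses in the third pattern also capture the area code, exchange, and number, so the same expression can be used to rewrite any accepted input in one standard format. This is a sketch under that assumption; the function name normalizePhone and the output format are hypothetical choices.

```javascript
// The third pattern from above. Groups: 1 = area code, 2 = separator,
// 3 = exchange, 4 = number.
const phonePattern = /^\(?([1-9]\d{2})(\) ?|[.-])?(\d{3})[.-]?(\d{4})$/;

// Hypothetical helper: return "(AAA) EEE-NNNN", or null when the input is invalid.
function normalizePhone(data) {
    const parts = data.match(phonePattern);
    if (!parts) return null;
    return "(" + parts[1] + ") " + parts[3] + "-" + parts[4];
}

console.log(normalizePhone("555.123.4567"));  // (555) 123-4567
console.log(normalizePhone("(555)123-4567")); // (555) 123-4567
console.log(normalizePhone("5551234567"));    // (555) 123-4567
console.log(normalizePhone("055-123-4567"));  // null: area code starts with 0
```

Since every formatting character is independently optional, the pattern also accepts unbalanced input such as (555.123.4567; a stricter pattern would be needed to rule that out.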

Verify Data is Valid Identifiers  

Identifiers are the names given to functions, variables and other parts of code. Each programming language has specific rules that can be converted to regular expressions. This is not as likely to be needed in a Web form as some of our other examples, but it does show how to design regular expressions to match different patterns and could be useful in a script to validate code.


        function varTest(data) {
            if (/^[a-z$_][\w$]*$/i.test(data))
                return "Valid JavaScript Name";
            else if (/^[^a-z$_]/i.test(data))
                return "Invalid First Character!";
            else
                return "Invalid Character!";
        }

        function classTest(data) {
            if (/^[a-z][a-z0-9\-]*$/i.test(data))
                return "Valid Class Name";
            else if (/^[^a-z]/i.test(data))
                return "Invalid First Character!";
            else
                return "Invalid Character!";
        }

        function phpTest(data) {
            // A $, then a letter or underscore, then any number of word characters.
            if (/^\$[a-z_]\w*$/i.test(data))
                return "Valid PHP/Perl Identifier";
            else if (/^[^$]/.test(data))
                return "Invalid First Character!";
            else
                return "Invalid Character!";
        }

The first two regular expressions are repeated from the earlier discussion on verifying length. The differences here are that the length is not restricted and that the JavaScript pattern also allows $ after the first character.

The third pattern fits PHP variable names, which must have a $ as the first character, followed by a letter or underscore.

Validating File Name and Extension  

This recipe validates a file name. The file name is restricted to alphanumeric characters (first character is alpha) and one of the following extensions: asp, html, htm, shtml, or php. You might combine this with an expression to parse the file name and extension like Parsing path and filename ... or Extracting File Name and Extension.

    function validate_file(data){
        data = data.replace(/^\s+|\s+$/g, ""); // trims leading and trailing whitespace
        return /^[a-z][\w]*\.(asp|html|htm|shtml|php)$/i.test(data);
    }//eof - validate_file

Here we rely on the built-in metacharacter \w, which allows alphanumeric characters and the underscore character. The square brackets around the metacharacter, this part: [\w]*, are not needed, but they let you add acceptable characters; for example, to allow a $ you would write [\w$]*. The * lets the name run to any length after the first letter; a ? in its place would limit the name to two characters.
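If the list of extensions changes often, the expression can be assembled from an array with the RegExp constructor. This is a sketch only: the function name makeFileValidator is hypothetical, it assumes file names of any length are acceptable, and it assumes the extensions contain only letters and digits, so they need no escaping.

```javascript
// Hypothetical helper: build a file-name validator from a list of extensions.
// Assumes each extension is plain alphanumeric text (no regex metacharacters).
function makeFileValidator(extensions) {
    const pattern = new RegExp(
        "^[a-z]\\w*\\.(" + extensions.join("|") + ")$", "i");
    return function (name) {
        return pattern.test(name.trim()); // trim() strips surrounding whitespace
    };
}

const isWebPage = makeFileValidator(["asp", "html", "htm", "shtml", "php"]);

console.log(isWebPage("index.html")); // true
console.log(isWebPage(" page.PHP ")); // true: trimmed, and case insensitive
console.log(isWebPage("1page.html")); // false: first character must be a letter
console.log(isWebPage("notes.txt"));  // false: extension not in the list
```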

Validating an Email Address  

Every feedback form, guest book, or forum needs to verify that email addresses are valid. While a regular expression cannot actually verify that an address exists, it can make sure the address fits the pattern of a valid email address. This is the most complex pattern to validate. There are several examples floating around the internet. The three examples shown typify the patterns and cover some of the complexities.


function emailExp1(data) {
    return /^[a-zA-Z0-9_\-.]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9]+$/.test(data);
}

function emailExp2(data) {
    return /^[\w.\-]+@[\w\-]+\.[a-zA-Z0-9]+$/.test(data);
}

function emailExp3(data) {
    return /^([\w]+)(\.[\w]+)*@([\w\-]+)(\.[\w]{2,7})(\.[a-z]{2})?$/i.test(data);
}

The first and second expressions simply say there will be some characters from a set of valid characters before an @, some more characters after it, then a period followed by characters from a reduced set. You can't get much more general than that. This pattern can be tightened up with more careful restrictions on the first character and character counts in each part of the email address.

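The second expression does nearly the same thing as the first, but its size is reduced by using the metacharacter \w instead of the larger set of characters. Because \w includes the underscore, the second expression also accepts underscores in the domain part.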

The third expression is a little more complete. It does not allow a period as the first character, doesn't allow dashes in the first part of the address, restricts the length of the top-level domain to between 2 and 7 characters, and allows an optional 2-letter country code. The i flag is needed because upper case characters were not specified in the country code character set.
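The differences between the three expressions show up when they are run against the same addresses. In this sketch the helper name checkEmail and the sample addresses are illustrative assumptions.

```javascript
// The three expressions from above, gathered in one object.
const emailPatterns = {
    emailExp1: /^[a-zA-Z0-9_\-.]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9]+$/,
    emailExp2: /^[\w.\-]+@[\w\-]+\.[a-zA-Z0-9]+$/,
    emailExp3: /^([\w]+)(\.[\w]+)*@([\w\-]+)(\.[\w]{2,7})(\.[a-z]{2})?$/i
};

// Hypothetical helper: list the names of the expressions an address passes.
function checkEmail(address) {
    return Object.keys(emailPatterns)
        .filter(function (name) { return emailPatterns[name].test(address); });
}

console.log(checkEmail("first.last@example.com")); // all three
console.log(checkEmail(".user@example.com"));      // emailExp1 and emailExp2 only
console.log(checkEmail("user-name@example.com"));  // emailExp1 and emailExp2 only
console.log(checkEmail("user@example.co.uk"));     // emailExp3 only
```

The last case shows a limit of the first two expressions: they allow only one period after the @, so an address with a country code is rejected.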

Validating a Date  

A common field on forms is a date field.


The code is more complex even though the regular expressions are simple. It appears this way because we are matching both the date string and a format string, and because we need to do some testing that cannot easily be done with regular expressions. (For our example years prior to 1000 or after 9999 are invalid. Most applications will require years nearer the current year. Therefore only 2 and 4 digit years are valid.)

function isValidDate(date_string, format) {
    var days = [0,31,28,31,30,31,30,31,31,30,31,30,31];
    var year, month, day, date_parts = null;
    var rtrn = false;
    var decisionTree = {
        'm/d/y':{
            're':/^(\d{1,2})[./-](\d{1,2})[./-](\d{2}|\d{4})$/,
            'month': 1,'day': 2, year: 3
        },
        'mm/dd/yy':{
            're':/^(\d{1,2})[./-](\d{1,2})[./-](\d{2})$/,
            'month': 1,'day': 2, year: 3
        },
        'mm/dd/yyyy':{
            're':/^(\d{1,2})[./-](\d{1,2})[./-](\d{4})$/,
            'month': 1,'day': 2, year: 3
        },
        'y/m/d':{
            're':/^(\d{2}|\d{4})[./-](\d{1,2})[./-](\d{1,2})$/,
            'month': 2,'day': 3, year: 1
        },
        'yy/mm/dd':{
            're':/^(\d{2})[./-](\d{1,2})[./-](\d{1,2})$/,
            'month': 2,'day': 3, year: 1
        },
        'yyyy/mm/dd':{
            're':/^(\d{4})[./-](\d{1,2})[./-](\d{1,2})$/,
            'month': 2,'day': 3, year: 1
        }
    };
    var test = decisionTree[format];
    if (test) {
        date_parts = date_string.match(test.re);
        if (date_parts) {
            year = date_parts[test.year] * 1;
            month = date_parts[test.month] * 1;
            day = date_parts[test.day] * 1;

            test = (month == 2 && 
                    isLeapYear() && 
                    29 || 
                    days[month] || 0);

            rtrn = 1 <= day && day <= test;
        }
    }

    // Leap year: divisible by 4, except century years not divisible by 400.
    function isLeapYear() {
        return (year % 4 != 0 ? false : 
            ( year % 100 != 0? true: 
            ( year % 400 != 0? false : true)));
    }
    return rtrn;
}//eof isValidDate

The old if-else if ladder used to determine the expected date format has been replaced with an object, decisionTree. It's not much of a decision tree, but it does allow us to determine the regular expression that the date is expected to match and the order in which year, month, and day appear in the array returned by match.

String match returns an array with each date part, those parts of the regular expression within (), in different elements. The first—zero—element is the complete string. Using (\d{2}|\d{4}) in the m/d/y or y/m/d format allows the year to be either 2 or 4 digits but not 1 or 3. If the regular expression doesn't find a match, then date_parts is null.
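A short sketch of what match returns for the m/d/y expression. The variable names and sample dates are illustrative assumptions.

```javascript
// The m/d/y expression from decisionTree.
const mdyPattern = /^(\d{1,2})[./-](\d{1,2})[./-](\d{2}|\d{4})$/;

const parts = "12/25/2023".match(mdyPattern);
console.log(parts[0]); // 12/25/2023 (element zero is the complete match)
console.log(parts[1]); // 12   (month)
console.log(parts[2]); // 25   (day)
console.log(parts[3]); // 2023 (year)

// A three-digit year fits neither \d{2} nor \d{4}, so there is no match.
const noMatch = "12/25/202".match(mdyPattern);
console.log(noMatch); // null
```

The captured parts are strings, which is why isValidDate multiplies each one by 1 before doing arithmetic with it.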

A function has been added to test for leap years. This function is nested within isValidDate, so it has access to the variable year.
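For reference, here is a standalone sketch of the usual Gregorian leap year rule: every fourth year is a leap year, except century years, unless the century year is divisible by 400. Unlike the nested function, this hypothetical isGregorianLeapYear takes the year as a parameter.

```javascript
// Hypothetical standalone version of the leap year test.
function isGregorianLeapYear(year) {
    if (year % 4 !== 0) return false;   // not divisible by 4: common year
    if (year % 100 !== 0) return true;  // divisible by 4 but not a century year
    return year % 400 === 0;            // century years must be divisible by 400
}

console.log(isGregorianLeapYear(2024)); // true
console.log(isGregorianLeapYear(2023)); // false
console.log(isGregorianLeapYear(1900)); // false
console.log(isGregorianLeapYear(2000)); // true
```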

test is re-assigned a value. Recycling variables is not necessarily a good practice, but since the code is short and clear let's let it slide.

The second assignment to test may be a little obscure. It is assigned the last day of the month if the month is valid, or zero if the month is not valid. Because of operator precedence, the three expressions separated by the and operator, &&, are treated as one compound expression. So we have three expressions separated by the or operator, ||: a compound expression and two simple expressions.

If the month is 2 and it is a leap year, 29 is returned. In this case, all three sub-expressions evaluate as true, so the expressions after the or operator are not evaluated. A series of and-joined expressions returns its first false expression or, if none is false, its last expression.

However, if the month is not 2 or it's not a leap year (29 is always true), then the and-joined expression is false (stopping at the first false) and evaluation passes to the expression following the or operator. The days array is accessed and will return 0, a nonzero number, or undefined. The days array has thirteen entries, the first of which, at index zero, has a value of zero. The month is entered as a number between 1 and 12, which matches the remaining array elements. Those elements hold the number of days in each month. If the month is invalid, the array returns zero or undefined.

If the array lookup evaluated to zero or undefined, then the expression following the next or operator is evaluated. It is simply 0. The days in an invalid month are zero.
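The same && and || chain can be tried on its own. In this sketch the function name lastDayOfMonth is hypothetical, and the leap year result is passed in as a parameter rather than computed.

```javascript
const days = [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];

// Hypothetical helper: last day of the month, or 0 for an invalid month.
function lastDayOfMonth(month, leapYear) {
    // && binds tighter than ||, so this reads (a && b && 29) || days[month] || 0.
    return month === 2 && leapYear && 29 || days[month] || 0;
}

console.log(lastDayOfMonth(2, true));   // 29: all three and-joined parts are true
console.log(lastDayOfMonth(2, false));  // 28: falls through to days[2]
console.log(lastDayOfMonth(4, false));  // 30
console.log(lastDayOfMonth(13, false)); // 0: days[13] is undefined
console.log(lastDayOfMonth(0, false));  // 0: days[0] is 0, which is also false
```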

Finally, the code tests whether the day is between 1 and the last day of the month. This also makes sure the month is between 1 and 12, since the day cannot be both greater than or equal to 1 and less than or equal to zero.