Regular Expression Performance Comparison

The following document provides performance comparison for the following regular expression libraries:

What is tested

Most of the tests performed are taken from a de facto standard regexp engine test set available from the Web: computing times are measured when matching small texts or searching medium/big size texts for expression occurrences. The last test measures the performance decrease with the increase of the expression complexity, namely the number of alternatives in the expression.

Architecture

Linux localhost.localdomain 2.6.30.9-99.fc11.x86_64 #1 SMP Tue Nov 17 21:30:38 EST 2009 x86_64 x86_64 x86_64 GNU/Linux


processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 30
model name	: Intel(R) Core(TM) i5 CPU         750  @ 2.67GHz
stepping	: 5
cpu MHz		: 1197.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology tsc_reliable nonstop_tsc pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5467.63
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

SHORT MATCHES

The following tests evaluate the time taken to match the expression to the input string.
For each result, the top number has been normalized relative to the fastest time, so 1 is as good as it gets.

astl boost::regex boost::xpressive greta pcre Text Expression
2.61111 2.66667 1.41667 1 1.13889 100- this is a line of ftp response which contains a message string ^([0-9]+)(\-| |$)(.*)$
1 3.19231 2.15385 2 1.57692 1234-5678-1234-456 ([[:digit:]]{4}[- ]){3}[[:digit:]]{3,4}
1 3.575 3.125 2.425 2.675 john_maddock@compuserve.com ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
1 6.6 5.8 4.2 4.9 foo12@foo.edu ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
1 5.48 4.64 3.36 3.68 bob.smith@foo.tv ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
1 4.73333 1.86667 1.53333 2.2 EH10 2QQ ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$
1 5.08333 2.33333 1.66667 2.75 G1 1AA ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$
1 4.61538 2.15385 1.84615 2.53846 SW1 1ZZ ^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$
1 3.66667 1.8 1.53333 1.8 4/1/2001 ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$
1 3.29412 1.64706 1.35294 1.64706 12/12/2001 ^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$
1 7.5 3.25 2.125 3.5 123 ^[-+]?[[:digit:]]*\.?[[:digit:]]*$
1 5.06667 1.8 1.33333 2 +3.14159 ^[-+]?[[:digit:]]*\.?[[:digit:]]*$
1 5.06667 1.8 1.33333 1.93333 -3.14159 ^[-+]?[[:digit:]]*\.?[[:digit:]]*$

SUMMARY FOR SHORT MATCHES

These are average values: the perfect library would get 1, which means best performance for all tests.

astl 1.12393
greta 1.97762
pcre 2.48767
boost::xpressive 2.59896
boost::regex 4.65693

LONG MATCHES

The next test measures the time to find all matches in a long English text. The text is the complete works of Mark Twain, from Project Gutenberg. The text is 19Mb long.
For each result, the top number has been normalized relative to the fastest time, so 1 is as good as it gets.

astl boost::regex boost::xpressive greta pcre Text Expression
1 1.21429 1 1 1.21429 complete works of Mark Twain Twain
1 1.30769 1.07692 1.07692 1.23077 complete works of Mark Twain Huck[[:alpha:]]+
1 2.40566 2.0566 7.46226 11.1887 complete works of Mark Twain [[:alpha:]]+ing
2.76316 1.36842 1 4.44737 1.97368 complete works of Mark Twain ^[^ ]*?Twain
2.21739 1 1.13043 4.73913 1.3913 complete works of Mark Twain Tom|Sawyer|Huckleberry|Finn
1.2 1.10909 1 2.4 2.36364 complete works of Mark Twain (Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawyer|Huckleberry|Finn)

SUMMARY FOR LONG MATCHES

These are average values: the perfect library would get 1, which means best performance for all tests.

boost::xpressive 1.21066
boost::regex 1.40086
astl 1.53009
pcre 3.22706
greta 3.52095

CODE SEARCH

For each of the following regular expressions the time taken to find all occurrences of the expression within the boost C++ source file crc.hpp was measured.
For each result, the top number has been normalized relative to the fastest time, so 1 is as good as it gets.

astl boost::regex boost::xpressive greta pcre Text Expression
3.38462 1 1.23077 5.07692 2.46154 /home/vintz/src/sourceforge/svn/astl/trunk/bench/crc.hpp ^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?(class|struct)[[:space:]]*(\<\w+\>([ ]*$[^)]*$)?[[:space:]]*)*(\<\w*\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?(\{|:[^;\{()]*\{)
1 11.9535 10.7907 5.76744 24.3953 /home/vintz/src/sourceforge/svn/astl/trunk/bench/crc.hpp (^[ ]+#(?:[^\\\n]|\\[^\n_[:punct:][:alnum:]]*[\n[:punct:][:alnum:]_])*)|(//[^\n]*|/\*.*?\*/)|\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\.)?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\>|('(?:[^\\']|\\.)*'|"(?:[^\\"]|\\.)*")|\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned|using|virtual|void|volatile|wchar_t|while)\>
2.26316 1.05263 1 3.68421 1.42105 /home/vintz/src/sourceforge/svn/astl/trunk/bench/crc.hpp ^[ ]*#[ ]*include[ ]+("[^"]+"|<[^>]+>)
2.26316 1.05263 1 3.68421 1.42105 /home/vintz/src/sourceforge/svn/astl/trunk/bench/crc.hpp ^[ ]*#[ ]*include[ ]+("boost/[^"]+"|]+>)

SUMMARY FOR CODE SEARCH

These are average values: the perfect library would get 1, which means best performance for all tests.

astl 2.22773
boost::xpressive 3.50537
boost::regex 3.76469
greta 4.5532
pcre 7.42475

HTML SEARCH

For each of the following regular expressions the time taken to find all occurrences of the expression within the boost html documentation file libs/libraries.htm was measured.
For each result, the top number has been normalized relative to the fastest time, so 1 is as good as it gets.

astl boost::regex boost::xpressive greta pcre Text Expression
1.04348 1.17391 1 2.6087 1.6087 /home/vintz/src/sourceforge/svn/astl/trunk/bench/libraries.htm beman|john|dave
1 1.9 1.2 1 2.1 /home/vintz/src/sourceforge/svn/astl/trunk/bench/libraries.htm <p>.*?</p>
1 1.5 1.5 1.82143 1.92857 /home/vintz/src/sourceforge/svn/astl/trunk/bench/libraries.htm <a[^>]+href=("[^"]*"|[^[:space:]]+)[^>]*>
1 1.11765 1.05882 1 1.35294 /home/vintz/src/sourceforge/svn/astl/trunk/bench/libraries.htm <h[12345678][^>]*>.*?</h[12345678]>
1 2.85714 1.14286 1.28571 2.85714 /home/vintz/src/sourceforge/svn/astl/trunk/bench/libraries.htm <img[^>]+src=("[^"]*"|[^[:space:]]+)[^>]*>
1 2.71429 1.14286 1 2.85714 /home/vintz/src/sourceforge/svn/astl/trunk/bench/libraries.htm <font[^>]+face=("[^"]*"|[^[:space:]]+)[^>]*>.*?</font>

SUMMARY FOR HTML SEARCH

These are average values: the perfect library would get 1, which means best performance for all tests.

astl 1.00725
boost::xpressive 1.17409
greta 1.45264
boost::regex 1.87716
pcre 2.11742

OVERALL SUMMARY

These are average values: the perfect library would get 1, which means best performance for all tests.

astl 1.33607
boost::xpressive 2.14194
greta 2.54357
pcre 3.24502
boost::regex 3.28507

SCALING

The next test evaluates the behavior of the libraries when the complexity of the expression increases: starting from a simple expression, at each step a new expression is concatenated as an alternative and the time to find all matches 10 times is measured.

Step 1: [[:upper:]]\. +[[:upper:]]\. +
Step 2: [[:upper:]]\. +[[:upper:]]\. +|Mr\.? [[:upper:]][[:lower:]]+ ([[:upper:]][[:lower:]] )*
Step 3: [[:upper:]]\. +[[:upper:]]\. +|Mr\.? [[:upper:]][[:lower:]]+ ([[:upper:]][[:lower:]] )*|days?
...
Step 20: [[:upper:]]\. +[[:upper:]]\. +|Mr\.? [[:upper:]][[:lower:]]+ ([[:upper:]][[:lower:]] )*|days?|P\. *S\.--|[[:upper:]][[:alpha:]]+ +[Ii]sles?|[$£]\d+([.,]\d+)*|"[[:alnum:]]+"|Mrs\.? [[:upper:]][[:lower:]]+ ([[:upper:]][[:lower:]] )*|(\d|1[0-2]) ?[aApP]\.?[Mm]\.?|([aA]\.?[Dd]|[bB]\.?[Cc])\.? \d{3,4}|[[:upper:]][[:lower:]]+[!?]|[[:upper:]]+( [[:upper:]]+)*|[hH]orses?|[oO]nce upon a time|Twain|(?i:CHAPTER [IVXLDC]+)|good terms?|Section \d+|New York|[Dd][Ee][Aa][Rr]

The text is the complete works of Mark Twain, from Project Gutenberg. The text is 19Mb long.

astl	boost::regex	boost::xpressive	greta	pcre	Text	Expression
2.61111	2.66667	1.41667	1	1.13889	100- this is a line of ftp response which contains a message string	`^([0-9]+)(\-\| \|$)(.*)$`
1	3.19231	2.15385	2	1.57692	1234-5678-1234-456	`([[:digit:]]{4}[- ]){3}[[:digit:]]{3,4}`
1	3.575	3.125	2.425	2.675	john_maddock@compuserve.com	`^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)\|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}\|[0-9]{1,3})(\]?)$`
1	6.6	5.8	4.2	4.9	foo12@foo.edu	`^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)\|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}\|[0-9]{1,3})(\]?)$`
1	5.48	4.64	3.36	3.68	bob.smith@foo.tv	`^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)\|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}\|[0-9]{1,3})(\]?)$`
1	4.73333	1.86667	1.53333	2.2	EH10 2QQ	`^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$`
1	5.08333	2.33333	1.66667	2.75	G1 1AA	`^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$`
1	4.61538	2.15385	1.84615	2.53846	SW1 1ZZ	`^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$`
1	3.66667	1.8	1.53333	1.8	4/1/2001	`^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$`
1	3.29412	1.64706	1.35294	1.64706	12/12/2001	`^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$`
1	7.5	3.25	2.125	3.5	123	`^[-+]?[[:digit:]]\.?[[:digit:]]$`
1	5.06667	1.8	1.33333	2	+3.14159	`^[-+]?[[:digit:]]\.?[[:digit:]]$`
1	5.06667	1.8	1.33333	1.93333	-3.14159	`^[-+]?[[:digit:]]\.?[[:digit:]]$`

astl	1.12393
greta	1.97762
pcre	2.48767
boost::xpressive	2.59896
boost::regex	4.65693

boost::xpressive	1.21066
boost::regex	1.40086
astl	1.53009
pcre	3.22706
greta	3.52095

astl	2.22773
boost::xpressive	3.50537
boost::regex	3.76469
greta	4.5532
pcre	7.42475

astl	1.00725
boost::xpressive	1.17409
greta	1.45264
boost::regex	1.87716
pcre	2.11742

astl	1.33607
boost::xpressive	2.14194
greta	2.54357
pcre	3.24502
boost::regex	3.28507