The Header Field Name Parser
The purpose of the header field name parser is to recognize the type of
a header field. The following header field types are recognized:
Via, To, From, CSeq, Call-ID, Contact, Max-Forwards, Route,
Record-Route, Content-Type, Content-Length, Authorization, Expires,
Proxy-Authorization, WWW-Authorization, Supported, Require,
Proxy-Require, Unsupported, Allow, Event.
All other header field types will be marked as HDR_OTHER.
The main function of the header name parser is parse_hname2, which can
be found in file parse_hname2.c. The function accepts pointers to the
beginning and end of a header field and fills in a hdr_field structure:
the name field will point to the header field name, the body field will
point to the header field body, and the type field will contain the
type of the header field if known and HDR_OTHER if unknown.
The parser is 32-bit, which means that it processes 4 characters of the
header field name at a time. The 4 characters are converted to an
integer and the integer is then compared. This is much faster than
comparing byte by byte. Because the server is compiled for
architectures that are at least 32-bit, such a comparison compiles into
one instruction instead of 4 instructions.
We did some performance measurements, and 32-bit parsing is about 3
times faster for a typical SIP message than the corresponding automaton
comparing byte by byte. Performance may vary depending on the message
size, the header fields parsed and their types. Tests showed that it
was always at least as fast as the corresponding automaton comparing 1
byte at a time.
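The 4-byte comparison can be sketched as follows. This is only an
illustration, not the code of the parser itself: the helper read4 and
the names CHUNK_Max_ and starts_with_Max are hypothetical. The sketch
assembles the bytes explicitly with the first character in the lowest
byte, because a direct memory read on a big-endian host would yield a
different integer.

```c
#include <stdint.h>

/* Hypothetical helper: pack 4 bytes into one 32-bit integer,
 * first character in the least significant byte. */
static uint32_t read4(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;

    return (uint32_t)u[0]
         | ((uint32_t)u[1] << 8)
         | ((uint32_t)u[2] << 16)
         | ((uint32_t)u[3] << 24);
}

/* "Max-" packed the same way: '-'=0x2d, 'x'=0x78, 'a'=0x61, 'M'=0x4d */
#define CHUNK_Max_ 0x2d78614dU

/* One integer comparison instead of four byte comparisons.
 * The caller must make sure at least 4 bytes are readable. */
static int starts_with_Max(const char *name)
{
    return read4(name) == CHUNK_Max_;
}
```

Note that the comparison is case sensitive: "max-" packs to a different
integer than "Max-".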
Since header field names must be compared case-insensitively, it is
necessary to convert the name to lower case first and then compare.
Since converting byte by byte would slow down the parser considerably,
we have implemented a hash table that can convert 4 bytes at once.
Since the set of keys that need to be converted is known (it consists
of all possible 4-byte parts of all recognized header field names), we
can pre-calculate the size of the hash table so that it is synonym-less
(no two keys share a slot). That simplifies and speeds up the lookup
considerably. The hash table must be initialized at server startup
(function init_hfname_parser).
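One way to pre-calculate a synonym-less table size is sketched below.
It is a sketch under assumptions, not the server's code: it assumes a
plain "key modulo size" hash (the text does not say which hash function
is used), and the names pack4, case_variants and collision_free_size
are hypothetical.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Pack 4 bytes, first character in the lowest byte. */
static uint32_t pack4(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;

    return (uint32_t)u[0] | ((uint32_t)u[1] << 8)
         | ((uint32_t)u[2] << 16) | ((uint32_t)u[3] << 24);
}

/* Write every upper/lower case variant of a 4-byte chunk to out[]
 * and return how many were written (at most 16). */
static int case_variants(const char *chunk, uint32_t *out)
{
    int n = 0;

    for (int mask = 0; mask < 16; mask++) {
        char v[4];
        int ok = 1;

        for (int i = 0; i < 4 && ok; i++) {
            unsigned char c = (unsigned char)chunk[i];

            if (mask & (1 << i)) {
                if (isalpha(c))
                    v[i] = (char)toupper(c);
                else
                    ok = 0;          /* '-', ':' etc. have no case */
            } else {
                v[i] = (char)tolower(c);
            }
        }
        if (ok)
            out[n++] = pack4(v);
    }
    return n;
}

/* Smallest size in [n, limit) for which key % size maps all keys to
 * distinct slots; 0 if there is none or memory runs out. */
static uint32_t collision_free_size(const uint32_t *keys, int n,
                                    uint32_t limit)
{
    if (n <= 0)
        return 0;
    for (uint32_t size = (uint32_t)n; size < limit; size++) {
        unsigned char *used = calloc(size, 1);
        int ok = 1;

        if (!used)
            return 0;
        for (int i = 0; i < n && ok; i++) {
            uint32_t slot = keys[i] % size;

            if (used[slot])
                ok = 0;              /* synonym found, try next size */
            used[slot] = 1;
        }
        free(used);
        if (ok)
            return size;
    }
    return 0;
}
```

With a size found this way, a lookup is a single modulo operation and
one array access, with no collision chain to walk.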
The header name parser consists of several files, all of them under the
parser subdirectory. The main file is parse_hname2.c; this file
contains the parser itself and the functions used to initialize and
look up the hash table. File keys.h contains an automatically generated
set of macros. Each macro is a group of 4 bytes converted to an
integer. The macros are used for comparison and for the hash table
initialization. For example, for the Max-Forwards header field name,
the following macros are defined in the file:
#define _max__ 0x2d78616d /* "max-" */
#define _maX__ 0x2d58616d /* "maX-" */
#define _mAx__ 0x2d78416d /* "mAx-" */
#define _mAX__ 0x2d58416d /* "mAX-" */
#define _Max__ 0x2d78614d /* "Max-" */
#define _MaX__ 0x2d58614d /* "MaX-" */
#define _MAx__ 0x2d78414d /* "MAx-" */
#define _MAX__ 0x2d58414d /* "MAX-" */
#define _forw_ 0x77726f66 /* "forw" */
#define _forW_ 0x57726f66 /* "forW" */
#define _foRw_ 0x77526f66 /* "foRw" */
#define _foRW_ 0x57526f66 /* "foRW" */
#define _fOrw_ 0x77724f66 /* "fOrw" */
#define _fOrW_ 0x57724f66 /* "fOrW" */
#define _fORw_ 0x77524f66 /* "fORw" */
#define _fORW_ 0x57524f66 /* "fORW" */
#define _Forw_ 0x77726f46 /* "Forw" */
#define _ForW_ 0x57726f46 /* "ForW" */
#define _FoRw_ 0x77526f46 /* "FoRw" */
#define _FoRW_ 0x57526f46 /* "FoRW" */
#define _FOrw_ 0x77724f46 /* "FOrw" */
#define _FOrW_ 0x57724f46 /* "FOrW" */
#define _FORw_ 0x77524f46 /* "FORw" */
#define _FORW_ 0x57524f46 /* "FORW" */
#define _ards_ 0x73647261 /* "ards" */
#define _ardS_ 0x53647261 /* "ardS" */
#define _arDs_ 0x73447261 /* "arDs" */
#define _arDS_ 0x53447261 /* "arDS" */
#define _aRds_ 0x73645261 /* "aRds" */
#define _aRdS_ 0x53645261 /* "aRdS" */
#define _aRDs_ 0x73445261 /* "aRDs" */
#define _aRDS_ 0x53445261 /* "aRDS" */
#define _Ards_ 0x73647241 /* "Ards" */
#define _ArdS_ 0x53647241 /* "ArdS" */
#define _ArDs_ 0x73447241 /* "ArDs" */
#define _ArDS_ 0x53447241 /* "ArDS" */
#define _ARds_ 0x73645241 /* "ARds" */
#define _ARdS_ 0x53645241 /* "ARdS" */
#define _ARDs_ 0x73445241 /* "ARDs" */
#define _ARDS_ 0x53445241 /* "ARDS" */
As you can see, the Max-Forwards name was divided into three 4-byte
chunks: Max-, Forw, ards. The file contains macros for every possible
lower and upper case character combination of the chunks. Because the
name (and therefore the chunks) can contain a colon (":"), a minus or a
space, and these characters are not allowed in a macro name, they must
be substituted. A colon is substituted by "1", a minus by an underscore
("_") and a space by "2".
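The naming rule can be sketched as a small function. The function name
macro_name is hypothetical; the sketch only applies the substitutions
described above and wraps the result in underscores, as the listed
macros do.

```c
#include <stddef.h>

/* Hypothetical helper: build the macro name for a 4-byte chunk.
 * out must have room for 7 bytes: '_' + 4 characters + '_' + NUL. */
static void macro_name(const char *chunk, char *out)
{
    size_t i;

    out[0] = '_';
    for (i = 0; i < 4; i++) {
        char c = chunk[i];

        if (c == ':')
            c = '1';        /* colon -> "1" */
        else if (c == '-')
            c = '_';        /* minus -> underscore */
        else if (c == ' ')
            c = '2';        /* space -> "2" */
        out[i + 1] = c;
    }
    out[5] = '_';
    out[6] = '\0';
}
```

For the chunk "Max-" this yields _Max__, matching the listing above.
Under the same rules a chunk such as "To: " would be named _To12_.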
When the hash table is initialized, all these macros are used as keys.
For each chunk, one of its upper and lower case combinations is used as
the value. Which one?
There is a convention that each word of a header field name starts with
an upper case character. For example, most user agents will send
"Max-Forwards"; messages containing some other combination of upper and
lower case characters (for example "max-forwards", "MAX-FORWARDS",
"mAX-fORWARDS") are very rare, though possible.
Considering the previous paragraph, we optimized the parser for the
most common case: the values stored in the hash table are the chunks
spelled according to the convention. When all header fields have upper
and lower case characters according to the convention, there is no need
to do hash table lookups at all, which is another speed-up.
For example, suppose we are trying to figure out if the header field
name is Max-Forwards and the header field name is formed according to
the convention (i.e. "Max-Forwards"):
1. Get the first 4 bytes of the header field name ("Max-"),
   convert it to an integer and compare to "_Max__"
   macro. Comparison succeeded, continue with the next step.
2. Get next 4 bytes of the header field name ("Forw"), convert
   it to an integer and compare to "_Forw_" macro. Comparison
   succeeded, continue with the next step.
3. Get next 4 bytes of the header field name ("ards"), convert
   it to an integer and compare to "_ards_" macro. Comparison
   succeeded, continue with the next step.
4. If the following characters are spaces and tabs followed by
   a colon (or colon directly without spaces and tabs), we
   found Max-Forwards header field name and can set
   type field to
   HDR_MAXFORWARDS. Otherwise (other characters than colon,
   spaces and tabs) it is some other header field and set
   type field to HDR_OTHER.
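The steps above can be sketched in C as follows. This is a simplified
illustration rather than the real parse_hname2: it recognizes only the
conventional spelling of Max-Forwards, the macro values are copied from
the listing above, and read4, classify and the numeric values of the
type constants are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define _Max__ 0x2d78614d /* "Max-" */
#define _Forw_ 0x77726f46 /* "Forw" */
#define _ards_ 0x73647261 /* "ards" */

/* Placeholder values; only the names come from the text. */
enum hdr_type { HDR_OTHER = 0, HDR_MAXFORWARDS = 1 };

/* Pack 4 bytes, first character in the lowest byte. */
static uint32_t read4(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;

    return (uint32_t)u[0] | ((uint32_t)u[1] << 8)
         | ((uint32_t)u[2] << 16) | ((uint32_t)u[3] << 24);
}

/* Classify the header field in [begin, end). */
static enum hdr_type classify(const char *begin, const char *end)
{
    const char *p = begin;

    if (end - p < 12)                 /* "Max-Forwards" is 12 bytes */
        return HDR_OTHER;
    if (read4(p) != _Max__)           /* step 1 */
        return HDR_OTHER;
    if (read4(p + 4) != _Forw_)       /* step 2 */
        return HDR_OTHER;
    if (read4(p + 8) != _ards_)       /* step 3 */
        return HDR_OTHER;
    p += 12;

    /* step 4: optional spaces and tabs, then a colon */
    while (p < end && (*p == ' ' || *p == '\t'))
        p++;
    return (p < end && *p == ':') ? HDR_MAXFORWARDS : HDR_OTHER;
}
```

A name such as "Max-Forwardsx" passes the three integer comparisons but
fails the final colon check, so it is reported as HDR_OTHER.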
As you can see, there is no need to do hash table lookups if the header
field was formed according to the convention, and the comparison was
very fast (only 3 comparisons needed).
Now let's consider another example in which the header field was not
formed according to the convention, for example "MAX-forwards":
1. Get the first 4 bytes of the header field name ("MAX-"),
   convert it to an integer and compare to "_Max__" macro.
   Comparison failed, try to look up "MAX-" converted to
   integer in the hash table. It was found, result is "Max-"
   converted to integer.
2. Try to compare the result from the hash table to "_Max__"
   macro. Comparison succeeded, continue with the next step.
3. Get next 4 bytes of the header field name ("forw"),
   convert it to an integer and compare to "_Forw_" macro.
   Comparison failed, try to look up "forw" converted to
   integer in the hash table. It was found, result is "Forw"
   converted to integer.
4. Try to compare the result from the hash table to "_Forw_"
   macro. Comparison succeeded, continue with the next step.
5. Get next 4 bytes of the header field name ("ards"),
   convert it to an integer and compare to "_ards_"
   macro. Comparison succeeded, continue with the next step.
6. If the following characters are spaces and tabs followed by
   a colon (or colon directly without spaces and tabs), we
   found Max-Forwards header field name and can set
   type field to
   HDR_MAXFORWARDS. Otherwise (other characters than colon,
   spaces and tabs) it is some other header field and set
   type field to HDR_OTHER.
In this example, we had to do 2 hash table lookups and 2 more
comparisons. Even this variant is still very fast because the hash
table is synonym-less, so each lookup is cheap.
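The fallback can be sketched as below. To keep the example short, the
lookup is a stand-in that scans a small array instead of a real
synonym-less hash table, the colon check is left out, and all names
(pack4, add_chunk, init_table, lookup, chunk_is, is_max_forwards) are
hypothetical. The lookups counter exists only to show how many lookups
a given spelling costs.

```c
#include <ctype.h>
#include <stdint.h>

#define MAX_KEYS 48

struct entry { uint32_t key, value; };
static struct entry table[MAX_KEYS];
static int table_len;

/* Pack 4 bytes, first character in the lowest byte. */
static uint32_t pack4(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;

    return (uint32_t)u[0] | ((uint32_t)u[1] << 8)
         | ((uint32_t)u[2] << 16) | ((uint32_t)u[3] << 24);
}

/* Register every case variant of chunk; each maps to chunk as given,
 * i.e. to the conventional spelling. */
static void add_chunk(const char *chunk)
{
    for (int mask = 0; mask < 16; mask++) {
        char v[4];
        int ok = 1;

        for (int i = 0; i < 4 && ok; i++) {
            unsigned char c = (unsigned char)chunk[i];

            if (mask & (1 << i)) {
                if (isalpha(c))
                    v[i] = (char)toupper(c);
                else
                    ok = 0;
            } else {
                v[i] = (char)tolower(c);
            }
        }
        if (ok && table_len < MAX_KEYS) {
            table[table_len].key = pack4(v);
            table[table_len].value = pack4(chunk);
            table_len++;
        }
    }
}

static void init_table(void)
{
    table_len = 0;
    add_chunk("Max-");
    add_chunk("Forw");
    add_chunk("ards");
}

/* Stand-in for the hash lookup; unknown keys are returned unchanged. */
static uint32_t lookup(uint32_t key)
{
    for (int i = 0; i < table_len; i++)
        if (table[i].key == key)
            return table[i].value;
    return key;
}

/* Direct comparison first; consult the table only on a mismatch. */
static int chunk_is(const char *p, uint32_t expected, int *lookups)
{
    uint32_t v = pack4(p);

    if (v == expected)
        return 1;
    (*lookups)++;
    return lookup(v) == expected;
}

/* name must point to at least 12 readable bytes. */
static int is_max_forwards(const char *name, int *lookups)
{
    return chunk_is(name, pack4("Max-"), lookups)
        && chunk_is(name + 4, pack4("Forw"), lookups)
        && chunk_is(name + 8, pack4("ards"), lookups);
}
```

After init_table(), "MAX-forwards" is accepted with 2 lookups, as in
the walk-through above, while "Max-Forwards" is accepted with none.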