The Header Field Name Parser

The purpose of the header field name parser is to recognize the type of a header field. The following header field types are recognized: Via, To, From, CSeq, Call-ID, Contact, Max-Forwards, Route, Record-Route, Content-Type, Content-Length, Authorization, Expires, Proxy-Authorization, WWW-Authenticate, Supported, Require, Proxy-Require, Unsupported, Allow, Event. All other header field types are marked as HDR_OTHER.

The main function of the header name parser is parse_hname2, which can be found in the file parse_hname2.c. The function accepts pointers to the beginning and end of a header field and fills in an hdr_field structure: the name field points to the header field name, the body field points to the header field body, and the type field contains the type of the header field if it is known and HDR_OTHER otherwise.

The parser is 32-bit, which means that it processes 4 characters of the header field name at a time. The 4 characters are converted to an integer and the integer is then compared. This is much faster than comparing byte by byte: because the server is compiled on at least 32-bit architectures, such a comparison compiles into one instruction instead of 4.

We measured the performance, and for a typical SIP message 32-bit parsing is about 3 times faster than the corresponding automaton that compares byte by byte. Performance varies with the message size, the header fields parsed and their types. In our tests the 32-bit parser was always at least as fast as the corresponding byte-by-byte automaton.

Since header field names must be compared case-insensitively, a name has to be converted to a known case before it is compared. Converting byte by byte would slow the parser down considerably, so we have implemented a hash table that again converts 4 bytes at once.

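The four-bytes-at-a-time comparison described above can be sketched as follows. This is a minimal illustration, not the code from parse_hname2.c: the helper names read4 and chunk_equals are hypothetical, and the sketch assembles the integer from shifts (first byte in the lowest-order position) so that it does not depend on host byte order or alignment, whereas the real parser may read the integer differently. The value 0x2d78614d for "Max-" is the one this document lists for that chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack 4 bytes into a 32-bit integer, with the first
 * byte in the lowest-order position. "Max-" packs to 0x2d78614d. */
uint32_t read4(const char *p)
{
    const unsigned char *b = (const unsigned char *)p;

    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Hypothetical helper: returns 1 if the first 4 bytes at p equal the
 * precomputed key, 0 otherwise (including when fewer than 4 bytes remain).
 * One integer comparison replaces four byte comparisons. */
int chunk_equals(const char *p, size_t len, uint32_t key)
{
    assert(p != NULL);
    if (len < 4)
        return 0;
    return read4(p) == key;
}
```

Note that the comparison is exact: "MAX-" does not equal the key for "Max-", which is why a separate case-conversion step is needed.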
Because the set of keys that need to be converted to lower case is known (it consists of all possible 4-byte parts of all recognized header field names), we can precalculate the size of the hash table so that it contains no synonyms (colliding keys). This simplifies and speeds up the lookup considerably. The hash table must be initialized at server startup (function init_hfname_parser).

The header name parser consists of several files, all of them under the parser subdirectory. The main file is parse_hname2.c; it contains the parser itself and the functions used to initialize and look up the hash table. The file keys.h contains an automatically generated set of macros. Each macro is a group of 4 bytes converted to an integer. The macros are used for comparison and for hash table initialization. For example, the following macros are defined in the file for the Max-Forwards header field name:

    #define _max__ 0x2d78616d /* "max-" */
    #define _maX__ 0x2d58616d /* "maX-" */
    #define _mAx__ 0x2d78416d /* "mAx-" */
    #define _mAX__ 0x2d58416d /* "mAX-" */
    #define _Max__ 0x2d78614d /* "Max-" */
    #define _MaX__ 0x2d58614d /* "MaX-" */
    #define _MAx__ 0x2d78414d /* "MAx-" */
    #define _MAX__ 0x2d58414d /* "MAX-" */
    #define _forw_ 0x77726f66 /* "forw" */
    #define _forW_ 0x57726f66 /* "forW" */
    #define _foRw_ 0x77526f66 /* "foRw" */
    #define _foRW_ 0x57526f66 /* "foRW" */
    #define _fOrw_ 0x77724f66 /* "fOrw" */
    #define _fOrW_ 0x57724f66 /* "fOrW" */
    #define _fORw_ 0x77524f66 /* "fORw" */
    #define _fORW_ 0x57524f66 /* "fORW" */
    #define _Forw_ 0x77726f46 /* "Forw" */
    #define _ForW_ 0x57726f46 /* "ForW" */
    #define _FoRw_ 0x77526f46 /* "FoRw" */
    #define _FoRW_ 0x57526f46 /* "FoRW" */
    #define _FOrw_ 0x77724f46 /* "FOrw" */
    #define _FOrW_ 0x57724f46 /* "FOrW" */
    #define _FORw_ 0x77524f46 /* "FORw" */
    #define _FORW_ 0x57524f46 /* "FORW" */
    #define _ards_ 0x73647261 /* "ards" */
    #define _ardS_ 0x53647261 /* "ardS" */
    #define _arDs_ 0x73447261 /* "arDs" */
    #define _arDS_ 0x53447261 /* "arDS" */
    #define _aRds_ 0x73645261 /* "aRds" */
    #define _aRdS_ 0x53645261 /* "aRdS" */
    #define _aRDs_ 0x73445261 /* "aRDs" */
    #define _aRDS_ 0x53445261 /* "aRDS" */
    #define _Ards_ 0x73647241 /* "Ards" */
    #define _ArdS_ 0x53647241 /* "ArdS" */
    #define _ArDs_ 0x73447241 /* "ArDs" */
    #define _ArDS_ 0x53447241 /* "ArDS" */
    #define _ARds_ 0x73645241 /* "ARds" */
    #define _ARdS_ 0x53645241 /* "ARdS" */
    #define _ARDs_ 0x73445241 /* "ARDs" */
    #define _ARDS_ 0x53445241 /* "ARDS" */

As you can see, the Max-Forwards name is divided into three 4-byte chunks: Max-, Forw, ards. The file contains macros for every possible combination of lower and upper case characters in each chunk. Because the name (and therefore the chunks) can contain a colon (":"), a minus sign or a space, and these characters are not allowed in a macro name, they must be substituted: a colon is replaced by "1", a minus sign by an underscore ("_") and a space by "2".

When the hash table is initialized, all these macros are used as keys. For each chunk, one of its upper and lower case combinations is used as the value. Which one? By convention, each word of a header field name starts with an upper case character. For example, most user agents send "Max-Forwards"; messages containing some other combination of upper and lower case characters (for example "max-forwards", "MAX-FORWARDS" or "mAX-fORWARDS") are very rare, although possible. The value stored is therefore the chunk as it appears in the conventionally capitalized name ("Max-", "Forw", "ards").

With this in mind, we optimized the parser for the most common case. When all header field names follow the capitalization convention, no hash table lookups are needed, which is a further speedup.

For example, suppose we are trying to find out whether the header field name is Max-Forwards, and the name follows the convention (i.e. "Max-Forwards"):

1. Get the first 4 bytes of the header field name ("Max-"), convert them to an integer and compare it to the "_Max__" macro. The comparison succeeds; continue with the next step.
2. Get the next 4 bytes of the header field name ("Forw"), convert them to an integer and compare it to the "_Forw_" macro. The comparison succeeds; continue with the next step.
3. Get the next 4 bytes of the header field name ("ards"), convert them to an integer and compare it to the "_ards_" macro. The comparison succeeds; continue with the next step.
4. If the following characters are spaces and tabs followed by a colon (or a colon directly, without spaces and tabs), we have found the Max-Forwards header field name and can set the type field to HDR_MAXFORWARDS. Otherwise (characters other than colon, spaces and tabs) it is some other header field, and the type field is set to HDR_OTHER.

As you can see, no hash table lookups are needed when the header field name follows the convention, and the comparison is very fast (only 3 comparisons needed!).

Now let's consider another example, in which the header field name does not follow the convention, for example "MAX-forwards":

1. Get the first 4 bytes of the header field name ("MAX-"), convert them to an integer and compare it to the "_Max__" macro. The comparison fails, so look up "MAX-" converted to an integer in the hash table. It is found, and the result is "Max-" converted to an integer. Compare the result from the hash table to the "_Max__" macro. The comparison succeeds; continue with the next step.
2. Get the next 4 bytes of the header field name ("forw"), convert them to an integer and compare it to the "_Forw_" macro. The comparison fails, so look up "forw" converted to an integer in the hash table. It is found, and the result is "Forw" converted to an integer. Compare the result from the hash table to the "_Forw_" macro. The comparison succeeds; continue with the next step.
3. Get the next 4 bytes of the header field name ("ards"), convert them to an integer and compare it to the "_ards_" macro. The comparison succeeds; continue with the next step.
4. If the following characters are spaces and tabs followed by a colon (or a colon directly, without spaces and tabs), we have found the Max-Forwards header field name and can set the type field to HDR_MAXFORWARDS. Otherwise (characters other than colon, spaces and tabs) it is some other header field, and the type field is set to HDR_OTHER.

In this example we needed 2 hash table lookups and 2 more comparisons. Even this variant is still very fast, because the hash table contains no synonyms and its lookups are therefore cheap.
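
The two walk-throughs above can be condensed into a small sketch. It is an illustration under stated assumptions, not the code from parse_hname2.c: all names in it (read4, init_fold_table, fold_lookup, chunk_matches, classify, the KEY_ and KIND_ constants) are hypothetical, the KIND_ values merely stand in for HDR_MAXFORWARDS and HDR_OTHER, and a linear scan over a small array stands in for the synonym-free hash table. Only the three key values are taken from the macros listed in this document.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define KEY_Max_ 0x2d78614dU /* "Max-", same value as _Max__ */
#define KEY_Forw 0x77726f46U /* "Forw", same value as _Forw_ */
#define KEY_ards 0x73647261U /* "ards", same value as _ards_ */

enum hdr_kind { KIND_OTHER, KIND_MAXFORWARDS }; /* hypothetical stand-ins */

struct fold_entry { uint32_t variant; uint32_t canonical; };
static struct fold_entry fold_table[3 * 16];
static size_t fold_count;

/* Pack 4 bytes into an integer, first byte in the lowest-order position. */
uint32_t read4(const char *p)
{
    const unsigned char *b = (const unsigned char *)p;
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Startup step: map every case variant of each chunk to its conventional
 * form. Non-letters such as '-' yield duplicate entries, which are harmless
 * for the linear scan used here. */
void init_fold_table(void)
{
    static const char *const canon[] = { "Max-", "Forw", "ards" };
    size_t c;
    unsigned mask;
    int j;

    fold_count = 0;
    for (c = 0; c < 3; c++) {
        for (mask = 0; mask < 16; mask++) {
            char v[4];
            for (j = 0; j < 4; j++) {
                unsigned char ch = (unsigned char)canon[c][j];
                v[j] = (char)(((mask >> j) & 1U) ? toupper(ch) : tolower(ch));
            }
            fold_table[fold_count].variant = read4(v);
            fold_table[fold_count].canonical = read4(canon[c]);
            fold_count++;
        }
    }
}

/* Stand-in for the hash table lookup; returns 0 for an unknown chunk. */
uint32_t fold_lookup(uint32_t key)
{
    size_t i;
    for (i = 0; i < fold_count; i++)
        if (fold_table[i].variant == key)
            return fold_table[i].canonical;
    return 0;
}

/* Fast path first; fall back to the lookup only when the direct
 * comparison fails. *lookups counts the fallbacks taken. */
int chunk_matches(const char *p, uint32_t key, int *lookups)
{
    if (read4(p) == key)
        return 1;
    (*lookups)++;
    return fold_lookup(read4(p)) == key;
}

/* Classify the header field starting at s (len bytes available). */
enum hdr_kind classify(const char *s, size_t len, int *lookups)
{
    size_t i = 12; /* length of "Max-Forwards" */

    assert(s != NULL && lookups != NULL);
    *lookups = 0;
    if (len < 12)
        return KIND_OTHER;
    if (!chunk_matches(s, KEY_Max_, lookups)
        || !chunk_matches(s + 4, KEY_Forw, lookups)
        || !chunk_matches(s + 8, KEY_ards, lookups))
        return KIND_OTHER;
    while (i < len && (s[i] == ' ' || s[i] == '\t'))
        i++;
    return (i < len && s[i] == ':') ? KIND_MAXFORWARDS : KIND_OTHER;
}
```

After init_fold_table() has run once, classify reports zero lookups for "Max-Forwards: 70" and two lookups for "MAX-forwards : 70", matching the counts in the two examples. A name such as "Max-Forwardsx" is rejected at the colon check.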