lcgid implementation

(reload) (page class:public)
Looking into implementing a cut-down HTTP server for local CGI. Felt like trying to code it myself as HTTP headers are so basic and I'd like to put in the features I'd want.

First order of the day of course is parsing those HTTP headers.

Most of the scanner/lexer options seem to really suck unfortunately. A long sequence of scanf/strcmp type statements or whathaveyou is a pretty inefficient way of doing it, likewise a long sequential list of regexp tests. I really want a scanner system similar to lex, but where for each pattern, the action code can extract sections of text matched by subpatterns, just like Perl's (or SED's) backreferencey stuff. Unfortunately you can't do that in lex, your only options are: (a)make the patterns extremely basic, just individual tokens, and wrap the lexer in a Yacc parser, even if the grammar is really basic; this is almost certainly what is expected. (b)put in countless start states to represent steps from one token, through delimiters, to the next token, but at least you've got it all done in lex; (c)for each pattern matched, feed yytext to scanf or PCRE or similar, to extract the variables- in other words, parsing it a second time. The difference between this and the Yacc option, is that in Yacc you're merely writing lots of redundant code, whereas here you're also executing it.

I think past experience with Yacc left me wary, it seemed too flaky to me at the time, and not being something you use every day, there's not much opportunity to get used to it.

So I looked into Ragel at last, I'd seen it some years ago and thought "Hmm, that sounds rather good", but then "ehh, sounds a bit complicated and hard to understand. Maybe some other time". Well this time I downloaded and installed it, and found to my delight a large PDF guide to it! Not that much free software has docs like that. Unfortunately, after wading through all 49 pages, I was feeling much less delighted. A whole load of operators that I had quite some trouble distinguishing, and, as it slowly dawned on me, some very commonly needed functionality left for me to implement myself, and quite some uncertainty over whether side-effects would come about from actions applied to branches that turned out to be dead ends. It simply seemed far too low-level in many ways.
I could still try it, but too many times I've seen something, thought it'd be cool and make life easier, and it just made things complicated for me. This sounds that way right off the bat. All the same, here is the current homepage, also someone's blog entries about Ragel. It's strictly possible I suppose that there've been features added or documentation improved since the version I downloaded (which is a Debian package, from 2y ago), but I don't see anything very compelling in the site's Changelog. (Update: Next day, some insight into a more efficient way of extracting data in Ragel prompted me to have another look at it and make that page about it)

A further option occurred to me after seeing how one was expected to use Ragel: For the purposes of parsing HTTP headers, Flex could simply use patterns for the header name, EG:
"Content-type: " { sscanf(inputtoeol(),"%s",ctype); }
"Host: " { sscanf(inputtoeol(),"%s,hostname); }
"Something-complicated: " { scan_result=pcre_exec(complex_regexp,NULL, inputtoeol(),inputlength(),stuff,stuff,stuff); }
etc. That's only pseudocode of course. Such a thing would need to run input() (supplied by lex) up till it reached \r\n, feeding it into a buffer, and then returning that. And of course of course, stopping before the buffer overflowed.

However, I don't really want to use lex in this fashion all that much, if I'm also planning on adding an embedded language to the server. It's kinda awkward to use with multiple lexers, and besides, Flex has been made incompatible in weird ways with its old behaviour, so who knows what it'll do. This leads me to want to make a very primitive DFA compiler thingy, for this sort of task. I made one a few years ago, crappy though it was, for a simple task. Here all I want is something that'll recognise a set of string literals, each terminated in the same way, and return an identifier for the one matched. The sort of thing lex generally does quite well. I could make an implementation with relatively space-efficient sparse arrays, not sure how fast that'd be though.
(update: couple months later, I don't actually remember what I was talking about re sparse arrays in Lex, meh... Serves me right for not filling the page in. Maybe I'll remember later)

Thoughts on the reparsing issue re flex: don't sweat the small stuff, it's only meant for local CGI use, although it might be interesting to make as a sort of prototype for a outward-facing embedded server-side language. However it still remains the case that I would prefer not to use flex due to that multiple-lexers problem.



Page source

Warning:Only I can edit Mwuki!