alternate DBs

(reload) (page class:public)
SQL databases are best for data with complex structure and varied sorts of queries. And in the case of MySQL, networked systems too.

They aren't the only choice though, embedded non-relational databases such as the old DBM, Berkeley DB, and flatfile type databases have their uses too, especially when very low latency/high throughput is needed, or only very simple, consistent queries are necessary.

Berkeley DB (aka libdb2, libdb3, libdb4)

Berkeley DB, formerly by Sleepycat Software and now owned by Oracle, is one of the higher profile representatives of this class, although maligned by some stupid people who seem to think SQLite is inherently better (SQLite does not have any better locking over NFS than Berkeley DB, both are perfectly capable of locking on local filesystems; SQLite uses SQL for queries, but gives no other options and its SQL support is quite incomplete, whereas Berkeley DB gives only lower level interfaces such as key+value, but these are fast and complete. MySQL used to support Berkeley DB for one of its storage mechanisms for instance).

Funnily enough, it appears that Berkeley DB, despite not being an SQL API, has low level support to aid with doing table joins; I think this doesn't apply to all types of join, but still. (NB, Berkeley DB does not exactly have the concept of "tables" IIRC, each database file is the equivalent of a table really. IIRC.)

Having said that, it seems that Berkeley DB's issue with NFS isn't just about locking, but also its access method involved memory mapped files which most NFS systems won't do. Or something like that. So by the sound of things, there's some NFSes in which it won't even work single-user???

Others

PHP interfaces worth looking at for this sort of thing mostly would be DBA, which is an abstraction layer rather than an interface to a specific database system, but it seems this is something that you compile into the PHP binary (rather than being able to link it as a module), and that Dreamhost haven't. AFAICT. You can compile your own PHP interpreter apparently, but I don't want to do that as then I'd be responsible for keeping it up to date, and I REALLY don't want that...

Other DB systems of interest (apart from the other "dbm" like libraries, ndbm and gdbm), maybe include qdbm, and especially, CDB.

CDB and Write-rarely databases

CDB is designed for the not-that-unusual situation of write-rarely data. It has a command-line utility that compiles data in key-value pairs, assembles them in an efficiently structured format in a temporary file, and then moves said file over the old location of the database (thereby atomically replacing the existing db). As long as clients (which are using it in read-only fashion) don't close the file handle between reads, there is no race condition there. There could be a bit of a race condition if multiple processes try to remake the database file though, so that has to be managed fairly well. The keys in the database are variable-length strings, so hashing is performed on them to do lookups. All the same, it's designed to be very efficient to read from. Apparently.

Systems like that could conceivably be used to run a site like a blog with a small number of authors (especially if only 1!) but a lot of readers. Updates would go to a standard database, and periodically (every 15 minutse or something??) a cron job would check the database to see if it'd been updated since the last time the CDB style database was written. If so, the cron job would rebuild the CDB style database from the data in the standard database, so it'd be up to date. Readers meanwhile, would normally get the front page stuff from the CDB style database, likewise the search results. Things like comments however would have to go into the main database for sanity's sake, assuming they were enabled (some blogs do not have them).

A related idea to this, is a very simple system I produced for Shingo: it has various datafiles that are variable width, 1 record per line plaintext files. As they are, these are not very efficient for random access, which is what Shingo happens to want. With very small files you can simply load the whole file into an array and get the one you want, but that's not scalable. Sequential searching is little better.

So instead, I made a separate datafile that I called LIDX for LineInDeX, with 4-byte, fixed width records for each line wanted. Each record was simply the file offset in the corresponding plaintext datafile. As the LIDX file is fixed-width and sequential, it has random-access, and the offsets it stores confer access to the main datafile. So finding the right entry becomes very quick. The default approach is for the lookup to be in terms of the line number of the main file (hence the 57th LIDX record points to the 57th line), but it's also possible for the lookup to be based on a numeric key given on each line of the datafile (eg, one of the fields if the lines have field delimiters, or else some other piece of unique data that can be extracted automatically), and then it can be used with relatively sparse datasets. If it's not too sparse, the LIDX file needn't get too huge (only 4 bytes/record), whereas if it's sufficiently sparse, the LIDX could conceivably be implemented as a sparse file (also see here, or Sparse_file at wikipedia), in which case it uses actually less disk space.

However, using the sparse-file type of approach, we do have the problem that the LIDX records for unusued keys, all point to the 0000, the beginning of the main datafile, and we'd probably like to be able to distinguish keys actually pointing to the start from such unused keys. One approach might be to make the first entry in the main datafile be dummy data that's never referenced, another might be to complement (bitwise-invert) the records in the LIDX, so that pointing to the start of the file would be done with 0xFFFFFFFF instead, and 0x00000000 as returned from sparse areas would refer to position 4GB-1. Alternatively, as such sparse files would still be (mostly??) for datasets with keys stored with each record, the reader code could check the key matches the lookup.

But I digress a bit with these variations on the idea. Again, like CDB, the files are not written often, so that can be handled by a separate utility outside the system, and the data can remain nicely packed. Unlike CDB though, access was based on integer recordnumbers, rather than key strings. I'm someone who likes his integers and pointers, and tend to think in those terms, but sometimes that's not really applicable.

There was also a tool (and IIRC matching library?) called "strfile" distributed with the Fortunes package, which enabled you to make indices like these, but you had to put a delimiter like % between the entries, and I wanted to just have single-line entries instead. There may've also been an issue with the output files too, I forget. IIRC it also took surprisingly long to find another copy for some reason. Either way, I decided to roll my own, as described. It'd be a bizarre world in which only the Fortunes author and myself had produced systems such as these, I know, but I rarely seem to see them referred to.

Ideas

Also question of making keyword/tag/set based databases that are capable of quick efficient lookups for arbitrary queries- like a specialised, cut-down SQL type dealie.
Query optimisation could be done using the logic identities which in the bit-representation context should be equivalent to the set operation identityish thingies which I've not actually seen. Perhaps they are absolutely the same, I don't know. The logic identities are listed in Table 8.3 of my ancient copy of Art Of Electronics (page 332). Probably different pages on other editions. The optimisation might also be manageable via Karnaugh maps but I suspect those are too limited to be usable for complex queries?? Also I'm not sure how either of these methods intersect with knowing how many items are in each set (we ideally want to work from a small number (esp 1) of already short lists, rather than the whole table)
TODO: write up the specialist keyword db system idea... eventually

See also SQLite, or MySQL, or return to the Main Page.



Page source

Warning:Only I can edit Mwuki!