Unpicking XCom .cat files

Everybody loves the X-Com games. Well no, ok, everybody with taste loves the X-Com games. Ehm well fair enough, everybody with taste loves the early X-Com games, UFO:Enemy Unknown (aka 'X-Com UFO Defense' in Narnia), X-Com: Terror From The Deep, and X-Com Apocalypse. Their chief weapons are fear, surprise, ruthless efficiency, and an almost fanatical devotion to the Pope.

Sorry, where was I? Right yes, X-Com, the truly fantabular mid-90s isometric turn-based strategy. Has a cult following, amongst whom the moody in-game music is also quite satisfying (although not so much to anybody else if my old flatmates are anything to go by). Unfortunately the OPL FM-synth data for that music is not too accessible in the early DOS versions of X-Com 1 and 2, and though it was in commodity file formats for the Windows rereleases of them, that format was General Midi files, sacrificing the flexibility of instrument choice that direct synth control gives a composer, so it wasn't quite the same. Like a cover of Baker Street with kazoo instead of saxophone. Well ok, not that heinous, the problem with GM is more that it's often rather middle-of-the-road (a charge some might level at the aforementioned song for that matter), but you get the idea.

This page is about the actual process of getting stuff out of those old obscure files without the benefit of someone having worked it all out for you already (even if someone somewhere may have). It's written not so much for the benefit of other X-Com fans (though many might appreciate the end result), as for those interested in how you actually do that sort of thing. It's impossible to give people general, universal instructions for reverse-engineering file formats, because it just doesn't work like that- especially when some are made deliberately difficult; instead it's more to give a picture of what sort of thing is often involved in practice. Hope it's some help.

Starting steps


I've often wanted to get at those mysterious .cat files with the music data from UFO and Terror From The Deep, but never really got anywhere. I didn't know what they were and neither did any of my software. Then recently, during a big Debian upgrading-spree, I remembered that "AdPlug" package that deals with various music files for OPL/2 and OPL/3 FM synth chips (like those in the famous old AdLib soundcards, and various commodity devices). I didn't really need its OPL emulation ability (which sadly is the only playback method available in AdPlug's main audio player!!), as my soundcard has OPL synth support, but I figured it might be able to access all those .cat files at long last.

...Nope. File format not recognised. But unfortunately it was too late, I was interested again. Then in having a rummage around the net for info on the files, I spotted this forum posting about them. Some of the .cat files in there are collections of plain old PCM audio bundled together with headers. If you feed those to a program that can treat them as such, and tell it the right parameters, it'll play back all the game's sound effects in sequence, with little clicks between from the embedded headers. This rang a loud bell, as I've done this sort of thing with the audio data files in Dungeon Keeper 1/Gold (another lovely game), and, I'm fairly sure, with the ones in X-Com Apocalypse too (and certainly its music file, which is presynthesized PCM). If I've ever done it with some of the .cat files from X-Com 1 and 2 though, I'd forgotten by now.

Of course, this is only partly helpful- not only do we not have those sound effects separated, we still don't have access to the actual music or instruments, because those are not PCM audio but FM synth data (and presumably sequencing data too). However, the reasonable assumption at this point, must be that the .cat files are generic container files that the X-Com folks used for a variety of content, similar to the .pak files used by id software, and that if we can properly figure out the headers in those sound effects files, we can open up all the different containers and get at the juicy data within, conceivably in a commodity format that AdPlug or similar can deal with.

Getting My Hands Dirty


Judging by all the tiny clicks between the sound effects, I figured that the general format of the .cat files was likely to be a set of individual file contents, each prefixed with a tiny header describing the length of that data, and then when the game starts up it would read through the files to work out where each subfile started and ended, and then either keep the pieces loaded in memory, or just cache those file offsets for reasonably fast access. Conceivably there could still be a main header in front of the whole thing (eg to say how many pieces there were), but I'd have to find out, ultimately.

If I could see what those headers looked like I could probably figure them out, it's something I've done before with some undocumented primitive file formats, where the creators are generally not interested in making things terribly obscure, but are probably concerned about efficiency. If you have a vague idea what you're looking for (eg you have a reasonable sense of how you'd write it), it's not as much of a stab-in-the-dark as it might sound.

First I had to find the buggers. Looking at the file through a hexdump, I saw that it started off with what looked conspicuously like header data rather than audio data (considering the audio is 8-bit mono). It reads like a bunch of small numbers stored in 32-bit integers, due to all the alternating neatly aligned columns of 0000.

(For the benefit of those vaguely aware of hexadecimal but who've never seen a hex dump, I've also annotated parts of it other than the area of interest, but maybe it's still clear as mud). But then I went to find the other pieces of header data in the rest of the file (between the individual sounds), and that sort of pattern didn't noticeably crop up again. I took a different tack, and tried feeding the data through 'dd' to an audio player, having dd transfer only the first few K of it. Then by varying the amount of the file played and listening carefully, I could find the approximate amount of data before a click (header), and then look that figure up in the hex-dump as the file offset for it. Easy! Ish. After a few goes I found the first click at somewhere around 2000 bytes in, but it was a bit vague, and I didn't feel entirely certain that the likely-looking patch in the hex-dump really was the header, nor how much of it was if so.

Then I had the bright idea of trying to load the .cat file into Audacity, which I only installed a few days ago after a long time of avoiding it (it turned out to be pretty damn cool mostly, although some parts are a bit lacking). Couldn't unfortunately load it straight in from the command line, but was able to import the file as raw audio once the program was started. The results were rather good.

The headers stuck out like little pointy sore thumbs in the waveform display, especially one at around 11 or 12 seconds, preceding the final sample. I set about zooming in on it to get a closer look, as it was so much more clearly delimited from the surrounding audio than the other headers that I figured I could isolate it quite effectively. I was right, Audacity made it very quick and easy.


You can't get much clearer than that. Then the cherry on top: I found that yes, I could tell Audacity to show the selection times in terms of samples (in the samples-per-second sense) rather than seconds or minutes.

Seeing as it's 8-bit audio, the sample number is the same as the byte offset, so I could cross reference it with the hex dump! Although I'd have to convert the address to hex: 130091 becomes 0x1fc2b, and 130140 becomes 0x1fc5c (an aside: I could really do with a hex-converter app rather than having to use the bc calculator, presumably so called because it feels 2000 years old. It's not very convenient for that purpose). And here it is, again rather conspicuous once you've got a good idea what you're looking for. In case you don't, I highlighted the "obvious" (see later) header bytes in yellow on the screenshot; the highlighted "001fc" part however is from the program itself where I was searching for that range of addresses.
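For anyone else who finds bc a chore for this, a tiny Python sketch does the same conversions in both directions. The helper name `dec_to_hex` is made up for illustration; the numbers are the two selection offsets quoted above.

```python
def dec_to_hex(n: int) -> str:
    """Format a decimal byte offset as a lowercase hex address."""
    return f"0x{n:x}"


# The Audacity selection boundaries, in samples (equal to bytes for 8-bit mono).
selection_start = 130091
selection_end = 130140

print(dec_to_hex(selection_start))  # 0x1fc2b
print(dec_to_hex(selection_end))    # 0x1fc5c

# And back again, for checking an address read off the hex dump.
print(int("1fc2b", 16))             # 130091
```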

I didn't try to select the header area perfectly in Audacity, just reasonably close. The idea was more that I could look it up in the hex dump and see the actual addresses in there. I figured from this that the header started at 1fc2c, ended at 1fc5b, and the sample started at 1fc5c. The bytes of the audio surrounding it are noticeably values around 0x7e-0x81, which (as you'll realise when you're reasonably awake, which I hadn't been straight away) represent a ~50% (or midpoint) level, the sort you generally get when an audio signal goes silent. If you prefer to think of audio signals as signed data, as most people do, it basically represents a level of about 0, but we're currently looking at unsigned 8-bit audio. Also the length of the header seems to be 48 bytes, which is a relatively round number to a computer (0x30).
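A small sketch of the two bits of arithmetic in that paragraph, in Python. It assumes the usual convention for unsigned 8-bit PCM, where 128 (0x80) is the zero level, and it uses the header addresses exactly as read off the dump at this stage of the investigation. The function name is only illustrative.

```python
def unsigned8_to_signed(sample: int) -> int:
    """Map an unsigned 8-bit PCM sample (0..255) to a signed level (-128..127)."""
    return sample - 128


# The sort of byte values seen around the header: near-silence.
quiet_bytes = bytes([0x7E, 0x7F, 0x80, 0x81])
levels = [unsigned8_to_signed(b) for b in quiet_bytes]
print(levels)  # [-2, -1, 0, 1]

# Length of the suspected header, both end addresses inclusive.
header_start, header_end = 0x1FC2C, 0x1FC5B
header_length = header_end - header_start + 1
print(header_length, hex(header_length))  # 48 0x30
```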

Surprise surprise, chuck


I then went to dig out some more header locations from Audacity, pinpointed one and... then got a bit distracted as I am wont to do, and for whatever reason wound up looking at one of the other .cat files for a bit- the one for the actual AdLib/OPL synth data.
I guess I was wondering if I'd see the same sorts of headers maybe. What I did notice was the tiny fragments of game-related text such as "Geoscape", suggesting maybe a filename or such, who knows. I had actually seen them in the past, but forgot about them.

Note that this hex dump looks quite different to the previous ones, because it's done through the "hd" program instead of "hexdump", and they present the data in different ways (strictly speaking, hd and hexdump are the same program, but it acts differently depending on what name you call it by, so it might as well be 2 different programs). The main differences are that "hd" shows each row as a set of 16 separate bytes, vs 8 pairs of bytes in "hexdump", and rather usefully here, "hd" also shows in the rightmost column the ASCII equivalents of the bytes on each row, where applicable (unprintable values just display as "."). I forget what led me to use hd instead that time, it's quite lucky regardless.

TODO: :redo screenshot with annotation
So anyway. I wondered if maybe the headers for the PCM audio .cat files had names like that in their headers too, so I tried the "hd" version on that too. (Or maybe I read it via the "strings" command, I forget; but let's say I used hd anyway, as that's what the screenshot here is.)

If you're not a programmer well you'll probably see a bunch of gibberish and some semi-random words. If you are a programmer, you will probably be aware of an elephant in the room as it were. Yes, RIFF WAVE. The guy in those forum posts was wrong (actually it later transpired he was still half right), the .cat files don't contain chunks of raw PCM data, they contain plain old common or garden .WAV files (which are also PCM data too of course, but have headers of their own). So I had to reexamine the headers-between-files idea completely.
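As a rough illustration of what "spotting a .WAV" means in byte terms: a RIFF/WAVE file begins with the four bytes "RIFF", then a 32-bit little-endian size, then the four bytes "WAVE". The Python sketch below checks only that much, and the function name and test bytes are invented for the example.

```python
import struct


def looks_like_wav(chunk: bytes) -> bool:
    """True if chunk starts with a RIFF container whose form type is WAVE."""
    if len(chunk) < 12:
        return False
    tag, _size, form = struct.unpack_from("<4sI4s", chunk, 0)
    return tag == b"RIFF" and form == b"WAVE"


# A made-up 12-byte stub, just enough to carry the two tokens.
stub = b"RIFF" + struct.pack("<I", 4) + b"WAVE"
print(looks_like_wav(stub))          # True
print(looks_like_wav(b"\x80" * 16))  # False: raw near-silent PCM, no tokens
```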

I then had another look at the 0x1fc2c area, still with the hd program, and made some further discoveries.

A few things leapt out here; firstly, the "RIFF" token is supposed to be the very first thing in any .WAV file; I suppose I could be wrong but I've never seen an actual .WAV file with it elsewhere (or missing). Yet here we clearly have 3 other bytes between the end of the previous file's audio data, and the RIFF token, so there's 4 possible explanations:


I'm assuming option #4, if it ain't abundantly clear. The second thing visible here: This hexdump shows that the header-like data (which we've decided is actually part of the subfile, not the container's subheaders) ends at 0x1fc5a, whereas previously I'd claimed it was 0x1fc5b. Is one of these hex dumps broken somehow? Are the tools screwing up? Nope, I was. I'd forgotten about the property known as "endianness" (or else had been oblivious to its application in the 2-byte version of hexdump), which is to do with the order that bytes occur in multi-byte words in a CPU, and varies between computer architectures. ISTR PCs are "Little-Endian" and some other systems are "Big-Endian". Either way, in that original set of hex dumps, the pairs of bytes were shown with every 2nd byte coming before the 1st, and I was thinking the bytes were in order, as they are in the later set of hex dumps. What's my excuse? Eh, the dog ate it.
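To make the mix-up concrete, here is a Python sketch that renders the same four bytes both ways. It assumes plain "hexdump" prints little-endian 16-bit words (as it does on a PC) and that "hd" prints one byte at a time. Both function names are invented for the illustration.

```python
import struct


def as_hexdump_words(data: bytes) -> str:
    """Render bytes as little-endian 16-bit words, two bytes per group."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd trailing byte with zero
    words = struct.unpack(f"<{len(data) // 2}H", data)
    return " ".join(f"{w:04x}" for w in words)


def as_hd_bytes(data: bytes) -> str:
    """Render bytes one at a time, in file order."""
    return " ".join(f"{b:02x}" for b in data)


token = b"RIFF"
print(as_hd_bytes(token))       # 52 49 46 46
print(as_hexdump_words(token))  # 4952 4646  (each pair shown swapped)
```

Reading "4952" as if the bytes were in file order puts everything one byte out, which is the sort of off-by-one described above.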

Having seen this, I did a quick check with grep using binary mode and various other flags. After a couple false starts and another look in the manual,
grep -abo -e ...RIFF /tmp/sample.cat |less
turned out rather handy, it shows each instance of the RIFF header, along with their 3 preceding bytes! It showed that most of these little headers were practically identical, there were only 2 versions in there! 0x02, 0x31, 0x00 (shows as "^B1^@" in the grep output, or ".1." in the hd output), and 0x02, 0x30, 0x00 (shows as "^B0^@" in grep output, or ".0." in hd). It's rather hard to suppose this could really be length data.

Unfortunately, in this grep invocation the file offsets shown (the -b option there) are the offsets of the start of each line, treating the file as text, which is not helpful here. I.e., it doesn't show the offsets of the individual results. For instance, for the first result it shows the offset as 0, and we know that one doesn't appear until several lines down.
grep -aboz -e RIFF /tmp/sample.cat |less
however brings home the goods for this- the -z option tells it to work in terms of NUL-terminated strings instead of newline-terminated lines, and because of the NULs in those header bytes, we see each RIFF instance start on a new "line" in those terms. Why not grep -aboz -e ...RIFF in there, to get both things at once? Because it's thinking of the lines as being terminated at the end of the header bytes, so that combination actually returns no results whatsoever!
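If juggling grep flags isn't appealing, a few lines of Python can report both the offset of every RIFF token and the bytes just before it in one pass. This is only a sketch: `find_riff` is a made-up name, and the test data is a synthetic stand-in rather than a real .cat file.

```python
def find_riff(data: bytes, lead: int = 3):
    """Return a list of (offset of b'RIFF', the `lead` bytes preceding it)."""
    results = []
    pos = data.find(b"RIFF")
    while pos != -1:
        results.append((pos, data[max(0, pos - lead):pos]))
        pos = data.find(b"RIFF", pos + 1)
    return results


# Synthetic stand-in: padding, a 3-byte mini header, a fake subfile, and again.
fake = (b"\x00" * 5 + b"\x02\x30\x00" + b"RIFFxxxx"
        + b"\x80" * 4 + b"\x02\x31\x00" + b"RIFFyyyy")

for offset, before in find_riff(fake):
    print(offset, hex(offset), before.hex(" "))
# 8 0x8 02 30 00
# 23 0x17 02 31 00
```

On a real file the data would come from something like `open(path, "rb").read()`.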

So we now have the offsets of each of the RIFF tokens in the file, albeit expressed in decimal rather than hex, and we know each of the mini headers seems to be the same size. We can also be fairly sure that data about the length and/or location of each subfile is probably stored together in some other part of the file, because it's fairly unlikely it could be encoded in those 3-byte headers, considering how identical they all are - unless most of the samples are exactly the same length. So maybe at this point we can make a bit of a leap ahead, try something out.

Mapping out the main header


Looking back at the hd-based hexdump, we can see the first entry very clearly: the 3-byte header starts at 0x0070, with the RIFF token starting on the same line at 0x0073. That's 115 decimal, exactly what the grep output said for it, as we should expect. Obviously the big chunk of header data at the start of the file, with those alternating columns of 0000 that we spotted at the beginning, has to be the place where the location or length data is, and sure enough the first 2 bytes happen to be 70 00. Remembering that PCs often use data in a byte-swapped fashion (little-endian, least significant byte first), that would be 0x0070.

Does this hold up? The next instance of the RIFF token is, according to grep, at 2193 decimal, so the header would be at 2190 decimal, or 0x088e. Scan along a couple of columns, to the 9th byte (address 0x08 because we're using 0-indexed addresses!) and we see 8e 08, which is the little-endian form of that. So it's probably fair to say yes it does. It could technically have been done such that the 2nd subfile onwards were all located just by describing the length of each of the files (knowing where the first one started), but there wouldn't have been much point, and it would have resulted in an unnecessarily messy function to read it.

Meanwhile we do have other data between these two values in the top row- we can suppose the 00 00 after the 70 00 is just because it's using 32-bit addressing, so that's 0x00000070 in little-endian, but there's also a "1b 08 00 00" after that. A little maths shows us though:
0x0070 + 0x081b = 0x088b
The 2nd address given was 0x088e, and each sub-header is 3 bytes, so the 0x081b figure works out as the length of the subfile, not counting the 3-byte sub-header. We've cracked it! Well apart from what on earth those 3 bytes mean.
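The same check can be sketched in Python with the struct module, using only the byte values quoted above. The variable names are illustrative, and the length field of the second entry is left out because it isn't quoted here.

```python
import struct

# First table entry as it appears in the dump: offset, then length,
# both as 32-bit little-endian integers.
first_entry = bytes.fromhex("70 00 00 00 1b 08 00 00")
offset, length = struct.unpack("<II", first_entry)
print(hex(offset), hex(length))  # 0x70 0x81b

# Offset field of the second entry.
second_offset = struct.unpack("<I", bytes.fromhex("8e 08 00 00"))[0]

SUBHEADER_SIZE = 3  # the mini header sitting in front of each RIFF token
print(hex(offset + length))                   # 0x88b
print(hex(offset + SUBHEADER_SIZE + length))  # 0x88e, matches second_offset
```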

Another look


Now with our newfound knowledge of what this is all about, we have a look back at the OPL data file, adlib.cat, as shown earlier. We know the first 4 bytes (probably just the first 1 or 2 in practice) are the address of the first subheader, so we should see something along the lines of 0x02 0x30 0x00 or one of those. Maybe by finding a few more datapoints we'll be able to work out what on earth that means. Perhaps it's some sort of media-identifier, eg 0x02 might mean "PCM audio", with the 0x30 and 0x31 indicating the context in which it was used or something.

Well the first 4 bytes are b0 00 00 00, so our subheader starts at 0x00b0. Where we find...
0a 47 65 6f 73 etc...
and in the handily aligned ASCII display on the right, we can see that the 2nd byte onwards, is in fact the "Geoscape1" text we saw before, ending with a NUL byte, as one does with strings in languages such as C.

Well that wasn't what I'd expected. I had by this point been assuming that the "Geoscape1" type text must've been something embedded in the OPL data files, as we'd found the subheaders were only 3 bytes long. But if it were the case that all the subheaders were only 3 bytes long, then that'd imply that this text was split between the end of the subheader and the start of the subfile, which doesn't make any sense. The reasonable interpretation, is that these subheaders are longer. But what're these other bytes here, and in the 3-byte versions?

Well each version ends with 0x00, the NUL byte, usually used to terminate strings as pointed out. The starting byte in the "Geoscape1" version, is 0x0a, which is 10 decimal. The 3-byte versions all have 0x02 as their starting byte, which is 2 decimal. Well "Geoscape1" is 9 bytes long, and if you also count the terminating NUL byte, that's 10 bytes. If we assume that the starting byte is the total length of the following string, then for the 3-byte versions that works out as 2 in each case.

If we look at the next entry in the adlib.cat file, at 0x37a5, we see um...
05 6e 6f 77 74 00
or (5-bytes long) n o w t (NUL) *ahem*
God bless ya, Mr Gollop. So uh yes, this does seem to work out too. Each of the headers seems to be a name for some purpose or other, presumably not meant as a unique filename, as most of the names in the sample.cat file are identical. Perhaps in the other files they are meant as unique, who knows.
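Here is that interpretation as a small Python sketch: one length byte, then that many bytes of name including the terminating NUL. The function name `read_name` is invented, and the three test inputs are the sub-headers quoted above.

```python
def read_name(data: bytes, offset: int = 0):
    """Return (name, offset of the first byte after the sub-header)."""
    name_len = data[offset]  # counts the name plus its NUL
    raw = data[offset + 1:offset + 1 + name_len]
    name = raw.rstrip(b"\x00").decode("ascii", errors="replace")
    return name, offset + 1 + name_len


print(read_name(b"\x0aGeoscape1\x00"))      # ('Geoscape1', 11)
print(read_name(bytes.fromhex("02 30 00")))  # ('0', 3)
print(read_name(bytes.fromhex("05 6e 6f 77 74 00")))  # ('nowt', 6)
```

So the mysterious 3-byte headers in sample.cat would simply be the one-character names "0" and "1".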

Well we seem to have figured out the file format anyway, even if we don't know what all the names mean, or what is inside half of the subfiles. We can make an extractor!
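A minimal sketch of what such an extractor might look like in Python, under some loudly stated assumptions. The table is taken to be a run of (offset, length) pairs as 32-bit little-endian integers; the table is assumed to fill all the space before the first subfile, so the entry count is the first offset divided by 8 (nothing above confirms this); and the length field is assumed to exclude the name sub-header whatever its size. The `build_cat` helper is purely hypothetical and exists only to make a synthetic file for trying the parser out.

```python
import struct


def parse_cat(data: bytes):
    """Return a list of (name, payload) tuples from a .cat image."""
    first_offset = struct.unpack_from("<I", data, 0)[0]
    count = first_offset // 8  # ASSUMPTION: table ends where subfile 1 begins
    entries = []
    for i in range(count):
        offset, length = struct.unpack_from("<II", data, i * 8)
        name_len = data[offset]  # length byte, counts the NUL
        raw = data[offset + 1:offset + 1 + name_len]
        start = offset + 1 + name_len  # payload follows the name
        name = raw.rstrip(b"\x00").decode("ascii", "replace")
        entries.append((name, data[start:start + length]))
    return entries


def build_cat(items):
    """Hypothetical helper: build a synthetic .cat image for testing."""
    toc_size = 8 * len(items)
    toc, body = b"", b""
    for name, payload in items:
        header = bytes([len(name) + 1]) + name.encode("ascii") + b"\x00"
        toc += struct.pack("<II", toc_size + len(body), len(payload))
        body += header + payload
    return toc + body


demo = build_cat([("0", b"RIFFdemo"), ("nowt", b"\x01\x02\x03")])
print([(name, len(payload)) for name, payload in parse_cat(demo)])
# [('0', 8), ('nowt', 3)]
```

Writing each payload out to its own file is then just a loop over the result, though since the names aren't unique in sample.cat the output filenames would want the entry index in them.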

Finally


An XCom catfile extractor app, based on all this we've found here. Work's started on it, it can already read the TOC and list the names used.

Even more finally


Report on WTF is inside the subfiles, after that.

other linkies


XCom 1.4 GUS patch (appears to be something quite old, I think that stuff is already included in the version I have at least).
List_of_X-COM:_Terror_from_the_Deep_races

