Sorry, where was I? Right yes, X-Com, the truly fantabular mid-90s isometric turn-based strategy. Has a cult following, amongst whom the moody in-game music is also quite satisfying (although not so much to anybody else if my old flatmates are anything to go by). Unfortunately the OPL FM-synth data for that music is not too accessible in the early DOS versions of X-Com 1 and 2, and though it was in a commodity file format for the Windows rereleases of them, that format was General MIDI, sacrificing the flexibility of instrument choice that direct synth control gives a composer, so it wasn't quite the same. Like a cover of Baker Street with kazoo instead of saxophone. Well ok, not that heinous, the problem with GM is more that it's often rather middle-of-the-road (a charge some might level at the aforementioned song for that matter), but you get the idea.
This page is about the actual process of getting stuff out of those old obscure files without the benefit of someone having worked it all out for you already (even if someone somewhere may have). It's written not so much for the benefit of other X-Com fans (though many might appreciate the end result), as for those interested in how you actually do that sort of thing. It's impossible to give people general, universal instructions for reverse-engineering file formats, because it just doesn't work like that- especially when some are made deliberately difficult; instead it's more to give a picture of what sort of thing is often involved in practice. Hope it's some help.
...Nope. File format not recognised. But unfortunately it was too late, I was interested again. Then in having a rummage around the net for info on the files, I spotted this forum posting about them. Some of the .cat files in there are collections of plain old PCM audio bundled together with headers. If you feed those to a program that can treat it as such, and tell it the right parameters, it'll play back all the game's sound effects in sequence, with little clicks between from the embedded headers. This rang a loud bell, as I've done this sort of thing with the audio data files in Dungeon Keeper 1/Gold (another lovely game), and, I'm fairly sure, with the ones in X-Com Apocalypse too (and certainly its music file, which is presynthesized PCM). If I've ever done it with some of the .cat files from X-Com 1 and 2 though, I'd forgotten by now.
Of course, this is only partly helpful- not only do we not have those sound effects separated, we still don't have access to the actual music or instruments, because those are not PCM audio but FM synth data (and presumably sequencing data too). However, the reasonable assumption at this point, must be that the .cat files are generic container files that the X-Com folks used for a variety of content, similar to the .pak files used by id software, and that if we can properly figure out the headers in those sound effects files, we can open up all the different containers and get at the juicy data within, conceivably in a commodity format that AdPlug or similar can deal with.
If I could see what those headers looked like I could probably figure them out, it's something I've done before with some undocumented primitive file formats, where the creators are generally not interested in making things terribly obscure, but are probably concerned about efficiency. If you have a vague idea what you're looking for (eg you have a reasonable sense of how you'd write it), it's not as much of a stab-in-the-dark as it might sound.
First I had to find the buggers. Looking at the file through a hexdump, I saw that it started off with what looked conspicuously like header data rather than audio data (considering the audio is 8-bit mono). It reads like a bunch of small numbers stored in 32-bit integers, judging by all the neatly aligned, alternating columns of 0000.
(For the benefit of those vaguely aware of hexadecimal but who've never seen a hex dump, I've also annotated parts of it other than the area of interest, but maybe it's still clear as mud). But then I went to find the other pieces of header data in the rest of the file (between the individual sounds), and that sort of pattern didn't noticeably crop up again. I took a different tack, and tried feeding the data through 'dd' to an audio player, having dd transfer only the first few K of it. Then by varying the amount of the file played and listening carefully, I could find the approximate amount of data before a click (header), and then look that figure up in the hex-dump as the file offset for it. Easy! Ish. After a few goes I found the first click at somewhere around 2000 bytes in, but it was a bit vague, and I didn't feel entirely certain that the likely-looking patch in the hex-dump really was the header, nor how much of it was if so.
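The same listen-to-a-prefix trick can be sketched in Python for anyone without dd and a raw-audio player to hand. This is only a rough sketch: the function name is made up for illustration, and the 11025 Hz default is an assumption (the sample rate isn't stated here), so change it if playback sounds too fast or slow.

```python
import wave


def write_prefix_as_wav(raw: bytes, out_path: str, n_bytes: int,
                        sample_rate: int = 11025) -> int:
    """Wrap the first n_bytes of raw unsigned 8-bit mono PCM in a WAV file.

    sample_rate is an ASSUMPTION, not a figure taken from the game files.
    Returns the number of sample bytes actually written.
    """
    chunk = raw[:n_bytes]
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(1)          # 1 byte per sample; 8-bit WAV is unsigned
        wav.setframerate(sample_rate)
        wav.writeframes(chunk)
    return len(chunk)
```

Calling it repeatedly with a growing n_bytes and listening for where the click first appears gives the same rough offset as the dd approach.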
Then I had the bright idea of trying to load the .cat file into Audacity, which I only installed a few days ago after a long time of avoiding it (it turned out to be pretty damn cool mostly, although some parts are a bit lacking). Couldn't unfortunately load it straight in from the command line, but was able to import the file as raw audio once the program was started. The results were rather good.
The headers stuck out like little pointy sore thumbs in the waveform display, especially one at around 11 or 12 seconds, preceding the final sample. I set about zooming in on it to get a closer look, as it was so much more clearly delimited from the surrounding audio than the other headers that I figured I could isolate it quite effectively. I was right, Audacity made it very quick and easy.


You can't get much clearer than that. Then the cherry on top: I found that yes, I could tell Audacity to show the selection times in terms of samples (in the samples-per-second sense) rather than seconds or minutes.
Seeing as it's 8-bit audio, the sample number is the same as the byte offset, so I could cross reference it with the hex dump! Although I'd have to convert the address to hex: 130091 becomes 0x1fc2b, and 130140 becomes 0x1fc5c (an aside: I could really do with a hex-converter app rather than having to use the bc calculator, presumably so called because it feels 2000 years old. It's not very convenient for that purpose). And here it is, again rather conspicuous once you've got a good idea what you're looking for. In case you don't, I highlighted the "obvious" (see later) header bytes in yellow on the screenshot; the highlighted "001fc" part however is from the program itself where I was searching for that range of addresses.
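If bc feels like hard work, the conversion is a one-liner in Python. A tiny sketch; the helper name is just for illustration, and the 8-digit padding is chosen to look like a typical hex dump's address column.

```python
def to_hex_offset(sample_index: int) -> str:
    """Format a byte offset as an 8-digit, zero-padded hex address."""
    return f"{sample_index:08x}"


# The two selection boundaries read off in Audacity (8-bit mono, so
# sample number == byte offset).
for n in (130091, 130140):
    print(n, hex(n), to_hex_offset(n))
# 130091 0x1fc2b 0001fc2b
# 130140 0x1fc5c 0001fc5c
```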
I didn't try to select the header area perfectly in Audacity, just reasonably close. The idea was more that I could look it up in the hex dump and see the actual addresses in there. I figured from this that the header started at 1fc2c, ended at 1fc5b, and the sample started at 1fc5c. The bytes of the audio surrounding it are noticeably values around 0x7e-0x81, which (once you're reasonably awake, which I hadn't been straight away) you recognise as a ~50% (or midpoint) level, the sort you generally get when an audio signal goes silent. If you prefer to think of audio signals as signed data, as most people do, it basically represents a level of about 0, but we're currently looking at unsigned 8-bit audio. Also the length of the header seems to be 48 bytes, which is a relatively round number to a computer (0x30).
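Both bits of arithmetic there are easy to check. A small sketch, with a hypothetical helper name, showing that bytes around 0x7e-0x81 sit at roughly zero once you re-centre unsigned 8-bit samples, and that the header span is 48 bytes when both end addresses are counted:

```python
def unsigned8_to_signed(sample: int) -> int:
    """Map an unsigned 8-bit PCM sample (0..255) to a signed level (-128..127)."""
    return sample - 128


near_silence = [0x7E, 0x7F, 0x80, 0x81]
print([unsigned8_to_signed(s) for s in near_silence])  # [-2, -1, 0, 1]

# Header addresses as read from the hex dump, both inclusive.
header_start, header_end = 0x1FC2C, 0x1FC5B
header_len = header_end - header_start + 1
print(header_len, hex(header_len))  # 48 0x30
```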

So anyway. I wondered if maybe the headers for the PCM audio .cat files had names like that in their headers too, so I tried the "hd" version on that too. (Or maybe I read it via the "strings" command, I forget; but let's say I used hd anyway, as that's what the screenshot here is.)
If you're not a programmer well you'll probably see a bunch of gibberish and some semi-random words. If you are a programmer, you will probably be aware of an elephant in the room as it were. Yes, RIFF WAVE. The guy in those forum posts was wrong (actually it later transpired he was still half right), the .cat files don't contain chunks of raw PCM data, they contain plain old common or garden .WAV files (which are also PCM data too of course, but have headers of their own). So I had to reexamine the headers-between-files idea completely.
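Spotting the RIFF WAVE elephant doesn't have to be done by eye. Here's a minimal sketch of a check for it at a given offset; the function name is invented, and the only format knowledge it relies on is the standard RIFF layout (the 4 bytes "RIFF", a little-endian 32-bit size that excludes those first 8 bytes, then "WAVE").

```python
import struct


def riff_wave_size(data: bytes, offset: int):
    """Return the total size of a RIFF/WAVE file starting at offset, else None."""
    if data[offset:offset + 4] != b"RIFF":
        return None
    if data[offset + 8:offset + 12] != b"WAVE":
        return None
    # The RIFF size field counts everything after the first 8 bytes.
    (chunk_size,) = struct.unpack_from("<I", data, offset + 4)
    return chunk_size + 8
```

The returned size is also a handy cross-check against any length you later find stored elsewhere in the container.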
I then had another look at the 0x1fc2c area, still with the hd program, and made some further discoveries.
A few things leapt out here; firstly, the "RIFF" token is supposed to be the very first thing in any .WAV file; I suppose I could be wrong but I've never seen an actual .WAV file with it elsewhere (or missing). Yet here we clearly have 3 other bytes between the end of the previous file's audio data, and the RIFF token, so there's 4 possible explanations:
Having seen this, I did a quick check with grep using binary mode and various other flags. After a couple false starts and another look in the manual,
grep -abo -e ...RIFF /tmp/sample.cat |less
turned out rather handy, it shows each instance of the RIFF header, along with their 3 preceding bytes! It showed that most of these little headers were practically identical, there were only 2 versions in there! 0x02, 0x31, 0x00 (shows as "^B1^@" in the grep output, or ".1." in the hd output), and 0x02, 0x30, 0x00 (shows as "^B0^@" in grep output, or ".0." in hd). It's rather hard to suppose this could really be length data.
Unfortunately, in this grep invocation the file offsets it shows (the -b option there) are those of the start of the line, treating the file as text, which is not helpful here. I.e., it doesn't show the offsets of the individual results. For instance, for the first result it shows the offset as 0, and we know that one doesn't appear until several lines down.
grep -aboz -e RIFF /tmp/sample.cat |less
however brings home the goods for this- the -z option tells it to work in terms of NUL-terminated strings instead of newline-terminated lines, and because of the NULs in those header bytes, we see each RIFF instance start on a new "line" in those terms. Why not grep -aboz -e ...RIFF in there, to get both things at once? Because it's thinking of the lines as being terminated at the end of the header bytes, so that combination actually returns no results whatsoever!
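Since grep wouldn't hand over both things at once, here's a rough Python sketch that does: it reports the byte offset of every RIFF token along with the bytes just before it. The function name and the lead parameter are made up for illustration; it's a plain byte search with no notion of lines, so NULs and newlines in the data don't get in the way.

```python
def find_riff_tokens(data: bytes, lead: int = 3):
    """Return a list of (offset, preceding_bytes) for each b"RIFF" in data.

    preceding_bytes holds up to `lead` bytes found immediately before the token.
    """
    results = []
    pos = data.find(b"RIFF")
    while pos != -1:
        results.append((pos, data[max(0, pos - lead):pos]))
        pos = data.find(b"RIFF", pos + 1)
    return results
```

Run over the contents of sample.cat, something like `for off, pre in find_riff_tokens(data): print(off, hex(off), pre.hex())` would list decimal and hex offsets side by side, saving the bc step too.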
So we now have the offsets of each of the RIFF tokens in the file, albeit expressed in decimal rather than hex, and we know each of the mini headers seems to be the same size. We can also be fairly sure that data about the length and/or location of each subfile is probably stored together in some other part of the file, because it's fairly unlikely it could be encoded in those 3-byte headers, considering how identical they all are - unless most of the samples are exactly the same length. So maybe at this point we can make a bit of a leap ahead, try something out.
Does this hold up? The next instance of RIFF token is, according to grep, at 2193 decimal, so the header would be at 2190 decimal, or 0x088e. Scan along a couple of columns, to the 9th byte (address 0x08 because we're using 0-indexed addresses!) and we see 8e 08, which is the little-endian form of that. So it's probably fair to say yes it does. It could technically have been done such that the 2nd subfile onwards were all located just by describing the length of each of the files (knowing where the first one started), but there wouldn't have been much point, and it would have resulted in an unnecessarily messy function to read it.
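That check can be spelled out in a few lines of Python. A small sketch: the 8e 08 bytes are the ones read off the hex dump, while the trailing 00 00 is an assumption that this field is 32 bits wide like the others appear to be.

```python
riff_offset = 2193                 # decimal offset of the RIFF token, from grep
header_offset = riff_offset - 3    # the 3-byte sub-header sits just before it

# Bytes at address 0x08; the final "00 00" is assumed (32-bit field).
table_bytes = bytes.fromhex("8e 08 00 00")
decoded = int.from_bytes(table_bytes, "little")

print(hex(header_offset), hex(decoded))  # 0x88e 0x88e
```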
Meanwhile we do have other data between these two values in the top row- we can suppose the 00 00 after the 70 00 is just because it's using 32-bit addressing, so that's 0x00000070 in little-endian, but there's also a "1b 08 00 00" after that. A little maths shows us though:
0x0070 + 0x081b = 0x088b
The 2nd address given was 0x088e, and each sub-header is 3 bytes, so the 0x081b figure works out as the length of the subfile, not counting the 3-byte sub-header. We've cracked it! Well apart from what on earth those 3 bytes mean.
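Put as code, the table at the top of the file looks like a run of (offset, length) pairs, each a little-endian 32-bit integer. The sketch below reads entries on that assumption and repeats the sum above; read_entry and SUBHEADER_SIZE are names invented here, and the last 4 bytes of the sample are placeholders, because the second entry's length isn't quoted anywhere in the text.

```python
import struct

SUBHEADER_SIZE = 3  # size of the short sub-headers seen in sample.cat


def read_entry(table: bytes, index: int):
    """Return (offset, length) for entry number `index` of the table."""
    return struct.unpack_from("<II", table, index * 8)


# First 12 bytes as read from the hex dump, then 4 PLACEHOLDER bytes.
table = bytes.fromhex("70000000 1b080000 8e080000 00000000")

off0, len0 = read_entry(table, 0)
off1, _unknown = read_entry(table, 1)
print(hex(off0), hex(len0))                        # 0x70 0x81b
print(hex(off0 + SUBHEADER_SIZE + len0), hex(off1))  # 0x88e 0x88e
```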
Well the first 4 bytes are b0 00 00 00, so our subheader starts at 0x00b0. Where we find...
0a 47 65 6f 73 etc...
and in the handily aligned ASCII display on the right, we can see that the 2nd byte onwards, is in fact the "Geoscape1" text we saw before, ending with a NUL byte, as one does with strings in languages such as C.
Well that wasn't what I'd expected. I had by this point been assuming that the "Geoscape1" type text must've been something embedded in the OPL data files, as we'd found the subheaders were only 3 bytes long. But if it were the case that all the subheaders were only 3 bytes long, then that'd imply that this text was split between the end of the subheader and the start of the subfile, which doesn't make any sense. The reasonable interpretation, is that these subheaders are longer. But what're these other bytes here, and in the 3-byte versions?
Well each version ends with 0x00, the NUL byte, usually used to terminate strings as pointed out. The starting byte in the "Geoscape1" version, is 0x0a, which is 10 decimal. The 3-byte versions all have 0x02 as their starting byte, which is 2 decimal. Well "Geoscape1" is 9 bytes long, and if you also count the terminating NUL byte, that's 10 bytes. If we assume that the starting byte is the total length of the following string, then for the 3-byte versions that works out as 2 in each case.
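That reading of the sub-header (one length byte, then that many bytes of NUL-terminated name) is simple enough to sketch. parse_subheader is a hypothetical name, and the whole thing rests on the guess just made about what the first byte means.

```python
def parse_subheader(data: bytes, offset: int):
    """Return (name, subheader_size) for the sub-header at `offset`.

    Assumes byte 0 is the length of the name field including its NUL.
    """
    name_len = data[offset]
    field = data[offset + 1:offset + 1 + name_len]
    name = field.split(b"\x00", 1)[0].decode("ascii", errors="replace")
    return name, 1 + name_len


print(parse_subheader(b"\x0aGeoscape1\x00", 0))  # ('Geoscape1', 11)
print(parse_subheader(b"\x021\x00", 0))          # ('1', 3)
```

The second result shows why the short sub-headers came out as 3 bytes: one length byte, one character, one NUL.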
If we look at the next entry in the adlib.cat file, at 0x37a5, we see um...
05 6e 6f 77 74 00
or (5-bytes long) n o w t (NUL) *ahem*
God bless ya, Mr Gollop. So uh yes, this does seem to work out too. Each of the headers seems to be a name for some purpose or other, presumably not meant as a unique filename, as most of the names in the sample.cat file are identical. Perhaps in the other files they are meant as unique, who knows.
Well we seem to have figured out the file format anyway, even if we don't know what all the names mean, or what is inside half of the subfiles. We can make an extractor!
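As a starting point, here's a minimal sketch of what such an extractor could look like in Python, built only from what's been worked out above. One thing in it is a guess rather than something established here: it assumes the table runs right up to the first sub-header, so the number of entries is the first offset divided by 8. The function name is made up too.

```python
import struct


def extract_cat(data: bytes):
    """Return a list of (name, payload) tuples from a .cat container.

    Layout assumed: a table of little-endian (offset, length) uint32 pairs,
    each offset pointing at a length-prefixed, NUL-terminated name that is
    followed by `length` bytes of payload.
    """
    (first_offset,) = struct.unpack_from("<I", data, 0)
    count = first_offset // 8  # ASSUMPTION: the table fills the space before subfile 0
    entries = []
    for i in range(count):
        offset, length = struct.unpack_from("<II", data, i * 8)
        name_len = data[offset]                      # length byte of the sub-header
        field = data[offset + 1:offset + 1 + name_len]
        name = field.split(b"\x00", 1)[0].decode("ascii", errors="replace")
        start = offset + 1 + name_len                # payload follows the name
        entries.append((name, data[start:start + length]))
    return entries
```

Because most of the names in sample.cat are identical, writing each payload out under its index in the list (rather than its name) is probably the safer way to turn the result into files.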