Revision as of 21:33, 16 December 2009

Based on the write-up of prior DBPF/Compression formats.

Overview

The idea behind the compression is to reuse previously decoded strings. For example, if the word "heureka" occurs twice in a file, the second occurrence would be encoded by pointing to the first, thus lowering the size of the file.

The compression is done by defining control characters that tell three things:

How many characters of plain text that follow should be appended to the output.
How many characters should be read from the already decoded text (and appended to the output)
Where to read the characters from in the already decoded text.

Thus, the algorithm to decompress these files goes like this:

Read the header, which is formatted like so:

Offset 00 - 0xFB
Offset 01 - Compression type (0x10 | [0x40] | [0x80])
Offset 02 - Uncompressed Size of file (4 bytes if type contains 0x80, otherwise 3 bytes)
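
The header layout above can be sketched in Python as follows. This is only an illustration: the function name `parse_header` is hypothetical, the byte offsets are taken at face value from the table above (they are named constants so they can be swapped if real data disagrees), and the size field is assumed to be stored big-endian, which the table does not state.

```python
MAGIC_OFFSET = 0  # per the table above: 0xFB at offset 00
TYPE_OFFSET = 1   # per the table above: compression type at offset 01

def parse_header(data: bytes) -> tuple[int, int, int]:
    """Return (compression_type, uncompressed_size, data_start)."""
    if len(data) < 5 or data[MAGIC_OFFSET] != 0xFB:
        raise ValueError("not a recognised compression header")
    ctype = data[TYPE_OFFSET]
    # The 0x80 flag widens the size field from 3 to 4 bytes.
    size_len = 4 if ctype & 0x80 else 3
    if len(data) < 2 + size_len:
        raise ValueError("header is truncated")
    # Assumption: the size is stored most significant byte first.
    size = int.from_bytes(data[2:2 + size_len], "big")
    return ctype, size, 2 + size_len
```

The third return value is the offset of the first control character (5 or 6).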

After the header (Offset 5 or 6, depending on the compression type flags) is the start of the actual compressed file data, which is handled like so:

{ 
	- Read the next control character. 
	- Depending on the control character, read 0-3 more bytes that are a part of the control character.
	- Inspect the control character.  From this, find out how many characters should be read and where from.
	- Read 0-n characters from source and append them to the output. (n being the "how many" data from above)
	- Copy 0-n characters from somewhere in the output to the end of the output. (n in this case is the "num to copy" value from the control character)
}

Control Characters

There are 4 types of control characters. They differ in how many characters can be read and in how far back in the output those characters can be copied from. The following conventions are used to describe them:

CC length

Length of control character.

Num plain text

Number of characters immediately after the control character that should be read and appended to output.

Num to copy

Number of chars that should be copied from somewhere in the already decoded output and added to the end of the output.

Copy offset

Where to start reading characters when copying from somewhere in the already decoded output.

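This is given as an offset from the current end of the output buffer, i.e. an offset of 0 means that you should copy the last character in the output and append it to the output. An offset of 1 means that you should copy the second-to-last character. Note that the "Copy offset" formulas below add 1 to this zero-based value, so a formula result of 1 refers to the last character; the "Maximum Offset" figures are given in the zero-based form.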

byte0

first byte of control character.

Bits

Bits of the control character.

p - num plain text
c - num to copy
o - copy offset
i - identifier.

Note: It can sometimes be confusing when a control character states that you should copy, for example, 10 characters starting 5 steps from the end of the output. Clearly, you cannot read more than 5 characters before you reach the end of the buffer. The solution is to read and write one character at a time. Each time you read a character you copy it to the end, thereby increasing the size of the output. By doing this, even offset 0 is possible and would result in duplicating the last character a number of times. This is utilized by the compression to recreate repeating text, for example bars of repeating dashes.
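
The character-by-character copy described in the note can be sketched like this. The helper name `copy_from_output` is hypothetical, and `distance` is assumed to be the value produced by the "Copy offset" formulas (which include the + 1), so a distance of 1 refers to the last character in the output.

```python
def copy_from_output(out: bytearray, distance: int, count: int) -> None:
    """Append `count` bytes to `out`, reading `distance` bytes back from its end."""
    if not 1 <= distance <= len(out):
        raise ValueError("copy distance reaches before the start of the output")
    for _ in range(count):
        # out[-distance] is re-evaluated after every append, so the read
        # position advances together with the growing output.
        out.append(out[-distance])

buf = bytearray(b"abcde")
copy_from_output(buf, 5, 10)   # copy 10 characters starting 5 back
print(buf.decode())            # abcdeabcdeabcde
```

Because each appended character immediately becomes readable, a distance of 1 simply repeats the last character `count` times.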

0x00 - 0x7F

CC length: 2 bytes
Num plain text: byte0 & 0x03
Num to copy: ((byte0 & 0x1C) >> 2) + 3
Copy offset: ((byte0 & 0x60) << 3) + byte1 + 1

Bits: 0oocccpp oooooooo
Num plain text limit: 0-3
Num to copy limit: 3-10
Maximum Offset: 1023
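
As a sketch, the formulas for this 2-byte form translate directly to Python. The function name `decode_cc2` is hypothetical; the returned offset is the formula result, including the + 1.

```python
def decode_cc2(byte0: int, byte1: int) -> tuple[int, int, int]:
    """Return (num_plain, num_copy, copy_offset) for a 0x00-0x7F control character."""
    if not 0x00 <= byte0 <= 0x7F:
        raise ValueError("byte0 is not in the range 0x00-0x7F")
    num_plain = byte0 & 0x03                         # bits: ......pp
    num_copy = ((byte0 & 0x1C) >> 2) + 3             # bits: ...ccc..
    copy_offset = ((byte0 & 0x60) << 3) + byte1 + 1  # bits: .oo..... oooooooo
    return num_plain, num_copy, copy_offset
```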

0x80 - 0xBF

CC length: 3 bytes
Num plain text: ((byte1 & 0xC0) >> 6) & 0x03
Num to copy: (byte0 & 0x3F) + 4
Copy offset: ((byte1 & 0x3F) << 8) + byte2 + 1

Bits: 10cccccc ppoooooo oooooooo
Num plain text limit: 0-3
Num to copy limit: 4-67
Maximum Offset: 16383
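
A sketch of the same formulas for the 3-byte form, with a hypothetical function name `decode_cc3`. Here the plain text count lives in the top two bits of the second byte rather than in the first byte.

```python
def decode_cc3(byte0: int, byte1: int, byte2: int) -> tuple[int, int, int]:
    """Return (num_plain, num_copy, copy_offset) for a 0x80-0xBF control character."""
    if not 0x80 <= byte0 <= 0xBF:
        raise ValueError("byte0 is not in the range 0x80-0xBF")
    num_plain = (byte1 & 0xC0) >> 6                  # bits: pp...... of byte1
    num_copy = (byte0 & 0x3F) + 4                    # bits: ..cccccc of byte0
    copy_offset = ((byte1 & 0x3F) << 8) + byte2 + 1  # bits: ..oooooo oooooooo
    return num_plain, num_copy, copy_offset
```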

0xC0 - 0xDF

CC length: 4 bytes
Num plain text: byte0 & 0x03
Num to copy: ((byte0 & 0x0C) << 6) + byte3 + 5
Copy offset: ((byte0 & 0x10) << 12) + (byte1 << 8) + byte2 + 1

Bits: 110occpp oooooooo oooooooo cccccccc
Num plain text limit: 0-3
Num to copy limit: 5-1028
Maximum Offset: 131071
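
The 4-byte form can be sketched the same way. The name `decode_cc4` is hypothetical; the copy count and the offset each combine bits from the first byte with one or two full bytes that follow.

```python
def decode_cc4(byte0: int, byte1: int, byte2: int, byte3: int) -> tuple[int, int, int]:
    """Return (num_plain, num_copy, copy_offset) for a 0xC0-0xDF control character."""
    if not 0xC0 <= byte0 <= 0xDF:
        raise ValueError("byte0 is not in the range 0xC0-0xDF")
    num_plain = byte0 & 0x03                         # bits: ......pp
    num_copy = ((byte0 & 0x0C) << 6) + byte3 + 5     # bits: ....cc.. plus byte3
    copy_offset = ((byte0 & 0x10) << 12) + (byte1 << 8) + byte2 + 1
    return num_plain, num_copy, copy_offset
```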

0xE0 - 0xFB

This is the simplest form of control character. The only thing it does is tell how many plain text characters follow. The formula for this is: (byte0 - 0xDF) * 4. Thus a value of 0xE0 means that you should read 4 characters of plain text and append them to the output.

CC length: 1 byte 
Num plain text: ((byte0 & 0x1F) << 2) + 4
Num to copy: 0 
Copy offset: -

Bits: 111ppppp 
Num plain text limit: 4-112 (Multiples of 4)
Num to copy limit: 0 
Maximum Offset: -
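
The two ways of writing the plain text count for this form give the same result for every byte in the range, which a short sketch can confirm (the function name `plain_run_length` is hypothetical):

```python
def plain_run_length(byte0: int) -> int:
    """Return the number of plain text bytes that follow a 0xE0-0xFB control character."""
    if not 0xE0 <= byte0 <= 0xFB:
        raise ValueError("byte0 is not in the range 0xE0-0xFB")
    return ((byte0 & 0x1F) << 2) + 4

# The bit formula agrees with (byte0 - 0xDF) * 4 across the whole range.
assert all(plain_run_length(b) == (b - 0xDF) * 4 for b in range(0xE0, 0xFC))
```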

0xFC - 0xFF

CC length: 1 byte 
Num plain text: (byte0 & 0x03)
Num to copy: 0 
Copy offset: -

Bits: 111111pp 
Num plain text limit: 0-3
Num to copy limit: 0 
Maximum Offset: -

Compressed data MUST end with a code in the range 0xFC to 0xFF. If the data is an exact fit to the size, 0xFC can be used as a null code. While community tools properly handle data without the ending byte, Sims 3 will happily keep reading until it encounters it, usually resulting in a crash.
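
Putting the control characters together, a decoder for the data that follows the header might look like the sketch below. It is not taken from any existing tool: the name `decode_stream` is hypothetical, the input is assumed to start at the first control character, and truncated input surfaces as an ordinary `IndexError` or `ValueError` rather than being handled gracefully.

```python
def decode_stream(body: bytes) -> bytes:
    """Decode control characters until a stop code (0xFC-0xFF) is reached."""
    out = bytearray()
    pos = 0
    while True:
        if pos >= len(body):
            raise ValueError("data ended without a 0xFC-0xFF stop code")
        b0 = body[pos]
        copy, offset = 0, 0
        if b0 <= 0x7F:      # 2-byte form
            plain = b0 & 0x03
            copy = ((b0 & 0x1C) >> 2) + 3
            offset = ((b0 & 0x60) << 3) + body[pos + 1] + 1
            pos += 2
        elif b0 <= 0xBF:    # 3-byte form
            plain = (body[pos + 1] & 0xC0) >> 6
            copy = (b0 & 0x3F) + 4
            offset = ((body[pos + 1] & 0x3F) << 8) + body[pos + 2] + 1
            pos += 3
        elif b0 <= 0xDF:    # 4-byte form
            plain = b0 & 0x03
            copy = ((b0 & 0x0C) << 6) + body[pos + 3] + 5
            offset = ((b0 & 0x10) << 12) + (body[pos + 1] << 8) + body[pos + 2] + 1
            pos += 4
        elif b0 <= 0xFB:    # plain text only
            plain = ((b0 & 0x1F) << 2) + 4
            pos += 1
        else:               # stop code with 0-3 trailing plain bytes
            plain = b0 & 0x03
            pos += 1
        out += body[pos:pos + plain]
        pos += plain
        for _ in range(copy):
            out.append(out[-offset])  # one byte at a time, so overlap works
        if b0 >= 0xFC:
            return bytes(out)
```

For example, `decode_stream(bytes.fromhex("e061626364010365fc"))` returns `b"abcdebcd"`: four plain bytes, then one more plain byte followed by a 3-byte copy from 4 back, then the 0xFC stop code.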

Compression Types

In addition to compression tagged as 0x10FB, there are other values that can be ORed into the 0x10 byte.

0x40 : Sims 3 seems to have two compression and decompression routines. The coding is identical between them; however, data tagged with 0x40 only uses a subset of the available codes, and limits the window to a much smaller size than would otherwise be possible. If data is written that goes beyond these limits, it will crash Sims 3. (Q: What codes are restricted? What window size is allowed?)
0x80 : If the uncompressed data is longer than 16 MB, the size won't fit in the normal 3 bytes in the header. ORing 0x80 into the compression type increases the uncompressed size field to 4 bytes.
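
When writing data, the choice of the 0x80 flag follows from the size alone: 3 bytes can hold values up to 0xFFFFFF (16,777,215). A minimal sketch, assuming a hypothetical helper `encode_size_field` and a big-endian size field (the byte order is an assumption, not something stated above):

```python
def encode_size_field(uncompressed_size: int, base_type: int = 0x10) -> tuple[int, bytes]:
    """Return (compression_type, size_bytes) for the given uncompressed size."""
    if uncompressed_size < 0 or uncompressed_size > 0xFFFFFFFF:
        raise ValueError("size does not fit in 4 bytes")
    if uncompressed_size > 0xFFFFFF:
        # Too large for 3 bytes: set the 0x80 flag and use a 4-byte field.
        return base_type | 0x80, uncompressed_size.to_bytes(4, "big")
    return base_type, uncompressed_size.to_bytes(3, "big")
```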

Example Code

Example code for the Sims 2 variety of DBPF compression can be found at DBPF/Compression.


Difference between revisions of "Sims 3:DBPF/Compression"
