Sapper's Blog: Introduction to File Formats

A computer file can be considered as a sequence of zeroes and ones. Each one or zero is known as a bit. Representing a file in ones and zeroes is known as binary format.

1000000100000000000000000000000000010111000001100000000000000000

It is convenient to organise the bits into groups of eight. Each group of eight bits is known as a byte. This file is eight bytes long.

10000001 00000000 00000000 00000000 00010111 00000110 00000000 00000000

Representing a file in binary format takes up a lot of space so files are usually displayed in hexadecimal (hex) format.
To change a file from binary format to hexadecimal format, replace each group of four bits with the number or letter in the following table which corresponds to the pattern of four bits. Because we are replacing four bits at a time, one byte is now represented by two alphanumeric (letter or number) characters.

0000	0001	0010	0011	0100	0101	0110	0111	1000	1001	1010	1011	1100	1101	1110	1111
0	1	2	3	4	5	6	7	8	9	A	B	C	D	E	F

Below is our file with each byte shown in hexadecimal format.

81 00 00 00 17 06 00 00
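
The conversion described above can be sketched in a few lines of Python. This is only an illustration of the grouping and lookup steps, not how any particular hex editor does it; the variable names are my own.

```python
# The example file written as groups of eight bits, with the spaces removed.
bits = "10000001 00000000 00000000 00000000 00010111 00000110 00000000 00000000".replace(" ", "")

# Split the bit string into groups of eight bits (one group per byte).
byte_groups = [bits[i:i + 8] for i in range(0, len(bits), 8)]

# Read each group as a base-2 number and format it as two hex characters.
hex_bytes = [format(int(group, 2), "02X") for group in byte_groups]

print(len(byte_groups))     # 8
print(" ".join(hex_bytes))  # 81 00 00 00 17 06 00 00
```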

This is the way hex editors usually display the file. A hex editor is an editor that can display and edit files in the form that they are stored on disk and so can open any file.
Some freeware hex editors:

WARNING: When investigating a file in a hex editor, make a copy of the file you want to study and open the copy in the hex editor. Otherwise you may accidentally save a modification that stops the file working (corrupts the file).

If you have access to the source code of the program that created the file you can study the file saving routine to discover what each byte represents.

The challenge with deciphering an unknown file format is knowing what each byte or group of bytes represents. Is it a positive or negative whole number? Is it a letter and, if so, which encoding was used? Is it a date? Is it a decimal number?

We could guess that the file is a sequence of characters. Characters are what make up text in a text editor program like Notepad. So characters are letters or numeric digits or punctuation symbols or special symbols (#) or arithmetic signs or whitespace (non printable characters like spaces and tabs).

Different numbers of bytes can be used to store characters in a file. We guess that one byte represents one character.

We need to find a table (known as a character set) that links a one byte hex value to a character symbol. Getting a byte value for a character is called encoding and getting a character from a byte value is decoding.

One such table is the ASCII table but there are others for non English languages and others for English as well. We will try the ASCII table. Note that any entry displayed in the table with more than one symbol is known as a control character and is not printable and does not print the symbols. NUL does not print N, U, L for example.

The eight bytes of our file decoded as ASCII are shown below.

81	00	00	00	17	06	00	00
invalid	NUL	NUL	NUL	ETB	ACK	NUL	NUL

Our first byte has no entry in the table, because ASCII only defines the values 00 to 7F, and the rest of the bytes are all non printable control characters. The only non printing characters normally found in an ASCII text file are SPACE, TAB and the line ending characters, CR (carriage return) and/or LF (linefeed). Our file is not an ASCII text file then.
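
As a rough sketch, the same check can be done in Python. The function name describe_ascii and the labels "invalid" and "control" are my own choices for illustration; the ranges used are the standard ASCII ones (00 to 1F and 7F are control characters, anything above 7F is outside ASCII).

```python
data = bytes.fromhex("81 00 00 00 17 06 00 00")

def describe_ascii(value):
    """Classify one byte value against the ASCII table."""
    if value > 0x7F:
        return "invalid"   # ASCII only defines 00 to 7F
    if value < 0x20 or value == 0x7F:
        return "control"   # non printable control character
    return chr(value)      # printable character (including space)

labels = [describe_ascii(b) for b in data]
print(labels)
```

None of the eight bytes comes back as a printable character, which supports the conclusion that this is not an ASCII text file.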

Hex editors usually display the file in hex format side by side with a text format so you can see if there is any readable text in the file. If your file is a text file you can use a text editor like Notepad to examine the format. Non text files look very odd when opened in a text editor.
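
A minimal sketch of that side by side display is shown below. The layout (an offset, then the hex bytes, then the text with a dot for anything non printable) is a common convention, but the exact format here is an assumption and real hex editors differ in the details.

```python
def hex_dump(data, width=8):
    """Return lines showing offset, hex bytes and printable ASCII text."""
    lines = []
    pad = width * 3 - 1  # width of a full row of hex bytes
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Show printable ASCII as itself and everything else as a dot.
        text_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(f"{offset:08X}  {hex_part.ljust(pad)}  {text_part}")
    return lines

for line in hex_dump(bytes.fromhex("81 00 00 00 17 06 00 00")):
    print(line)
# 00000000  81 00 00 00 17 06 00 00  ........
```

A file containing readable text would show that text in the right hand column instead of dots.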

Maybe the file is a sequence of numbers. Let’s assume it is a sequence of whole numbers (integers). Different numbers of bytes can be used to store integers in a file. Integers can be stored in a single byte or across two bytes or across four bytes or across eight bytes or other numbers of bytes.

Let’s now assume that each integer is stored in one byte.

The next thing we need to decide is whether each integer is a positive integer only (unsigned) or whether each integer may be positive or negative (signed).

Let’s decide that each integer is unsigned. We need a table that links a one byte hex value to a positive integer.

The eight bytes of our file decoded as byte sized unsigned integers are shown below.

81	00	00	00	17	06	00	00
129	0	0	0	23	6	0	0

These are the values you also get from Windows Calculator when switching between Hex mode and Dec mode.
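
The same values can be checked in Python, where iterating over a bytes object already yields each byte as an unsigned integer from 0 to 255. This is just a quick way to verify the table, not a replacement for it.

```python
data = bytes.fromhex("81 00 00 00 17 06 00 00")

# Each element of a bytes object is an unsigned integer (0 to 255).
unsigned_values = list(data)
print(unsigned_values)  # [129, 0, 0, 0, 23, 6, 0, 0]

# A single hex value can also be converted directly.
print(int("81", 16))    # 129
```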

If we decided instead that each byte was a signed integer we would have to use a table that links a one byte hex value to a signed integer. You will notice that the negative integers correspond to byte values that start with 8 or 9 or A or B or C or D or E or F, that is, those bytes that start with a one in binary format.

The eight bytes of our file decoded as byte sized signed integers are shown below.

81	00	00	00	17	06	00	00
-127	0	0	0	23	6	0	0
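
Here is a small sketch of the signed decoding, assuming the usual two's complement representation (the helper name to_signed_byte is my own). Byte values from 80 to FF hex have their top bit set and are read as the unsigned value minus 256; values from 00 to 7F are unchanged, so hex 17 is 23 whether it is read as signed or unsigned.

```python
data = bytes.fromhex("81 00 00 00 17 06 00 00")

def to_signed_byte(value):
    """Interpret an unsigned byte value (0 to 255) as two's complement."""
    return value - 256 if value >= 0x80 else value

signed_values = [to_signed_byte(b) for b in data]
print(signed_values)  # [-127, 0, 0, 0, 23, 6, 0, 0]

# The built-in conversion gives the same answer for a single byte.
print(int.from_bytes(b"\x81", "little", signed=True))  # -127
```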

Some freeware calculators to convert between hex and unsigned and signed integers and between hex and binary and decimal are listed below.

BinCalc

Longsoft Calc++

So far we have only been using one byte to determine an integer or a character. However, a character or integer may be stored in more than one byte. Whenever you use more than one byte to determine some value, you must take into account another file property – byte order.

The byte order of a file simply tells you what order the bytes are written for any multi-byte data stored in the file.

The two common byte orders are known as big-endian and little-endian. Many files you examine on a PC will be in little-endian form since this is the form used by Intel CPUs, but a file format can specify either order, so it is worth trying both.

Big-endian is the way we write numbers in everyday life. For example when we write the number one hundred and twenty three we write the number from left to right with the number of hundreds on the left, the number of ones on the right and the number of tens in the middle. That is, 123. This is called big-endian because the multipliers (100, 10, 1) get smaller as you read from left to right so the biggest multiplier is first.

One hundred and twenty three written in little-endian order would look like this, 321. Here the multipliers (1, 10, 100) get bigger from left to right so the smallest (littlest) multiplier is first. It is important to note that we are representing the same numerical value in two different ways but if you assume the bytes are in one form when in fact they are in the other you will not get the correct values.

So to change the endianness of a number all you do is reverse the digits.

To change the endianness of multi-byte data you reverse the bytes of the data. You don’t reverse individual bits or hex characters but the whole bytes.

If we consider the first two bytes of our file to be one integer stored little end first, then to change it to big end first all we do is reverse the bytes.

little end first	81 00
big end first	00 81

Here is an example for four bytes.

little end first	81 12 37 A0
big end first	A0 37 12 81
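
The four byte example can be reproduced in Python as a quick check. Reversing a bytes object with a slice swaps whole bytes, not bits or hex characters, and reading the original bytes as little-endian gives the same number as reading the reversed bytes as big-endian.

```python
little = bytes.fromhex("81 12 37 A0")

# Reverse the order of the whole bytes.
big = little[::-1]
print(" ".join(f"{b:02X}" for b in big))  # A0 37 12 81

# Both byte sequences describe the same numerical value
# when each is read with its own byte order.
value_from_little = int.from_bytes(little, "little")
value_from_big = int.from_bytes(big, "big")
print(value_from_little == value_from_big)  # True
```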

In a Hex Editor you need to nominate an endianness in the settings. Most Hex Editors default to little endian.

Let’s now consider our file as a sequence of unsigned integers, with each integer stored in two bytes.

We need a table that links a two byte hex value to an unsigned integer. This table would be very large since using two bytes gives many more choices for integers so I will only show some entries.

The eight bytes of our file decoded as double-byte sized unsigned integers for each endian-ness are shown below.

hex	81 00	00 00	17 06	00 00
little endian value	129	0	1559	0
big endian value	33024	0	5894	0
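
These values can be checked with Python's standard struct module. In a struct format string "<" means little-endian, ">" means big-endian and "H" is a two byte unsigned integer, so "4H" reads four of them.

```python
import struct

data = bytes.fromhex("81 00 00 00 17 06 00 00")

# Read the eight bytes as four two-byte unsigned integers in each byte order.
little_endian_values = struct.unpack("<4H", data)
big_endian_values = struct.unpack(">4H", data)

print(little_endian_values)  # (129, 0, 1559, 0)
print(big_endian_values)     # (33024, 0, 5894, 0)
```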

The tables we have been using are built into Hex Editors so when you place the cursor in the hex display, the Hex Editor can decode the data for you. Hex Editors usually decode the data at the cursor as many different data types at the same time, such as ASCII, byte-sized unsigned integers, byte-sized signed integers, double-byte sized unsigned integers, double-byte sized signed integers, floating point single numbers, floating point double numbers and others.

For the multi-byte data the Hex Editor looks at bytes after the one next to the cursor and uses the endianness setting.
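
The sketch below imitates that behaviour using Python's struct module. The function name inspect_at, the type names and the choice of which types to show are all my own assumptions for illustration; a real Hex Editor will offer its own list. The byte_order argument takes "<" for little-endian or ">" for big-endian.

```python
import struct

# struct format codes for a few common data types (names are illustrative).
TYPE_CODES = {
    "uint8": "B", "int8": "b",
    "uint16": "H", "int16": "h",
    "uint32": "I", "int32": "i",
    "float32": "f", "float64": "d",
}

def inspect_at(data, offset, byte_order="<"):
    """Decode the bytes starting at offset as every type that fits."""
    results = {}
    for name, code in TYPE_CODES.items():
        fmt = byte_order + code
        size = struct.calcsize(fmt)
        # Skip types that would run past the end of the data.
        if offset + size <= len(data):
            results[name] = struct.unpack_from(fmt, data, offset)[0]
    return results

data = bytes.fromhex("81 00 00 00 17 06 00 00")
for name, value in inspect_at(data, 0).items():
    print(name, value)
```

With the cursor on the first byte and little-endian selected, the uint8, uint16 and uint32 readings are all 129 while int8 is -127, which shows how much the answer depends on the type you assume.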

Sometimes the integers are named for the number of bits and not bytes and there are also some names in common use for various byte sizes and types.

byte sized unsigned integer	char, byte, uint8
byte sized signed integer	signed char, shortint, int8, sint8
double-byte sized unsigned integer	word, uint16
double-byte sized signed integer	short, smallint, int16, sint16
quad-byte sized unsigned integer	double word, dword, unsigned int, cardinal, uint32
quad-byte sized signed integer	int, integer, int32, sint32

In the Hex Editor screenshot below see how the integer values for unsigned and signed 8 bit to 64 bit integers are displayed. The endianness is little endian and can be changed in the Format menu.

For further information see this website.

http://www.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html

Wednesday, 5 March 2014
