Wednesday 5 March 2014

Introduction to File Formats

A computer file can be considered as a sequence of zeroes and ones. Each one or zero is known as a bit. Representing a file in ones and zeroes is known as binary format.

1000000100000000000000000000000000010111000001100000000000000000

It is convenient to organise the bits into groups of eight. Each group of eight bits is known as a byte. This file is eight bytes long.

10000001 00000000 00000000 00000000 00010111 00000110 00000000 00000000
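The grouping step can be sketched in a few lines of Python. This is only an illustration: the variable names (`bits`, `groups`, `data`) are made up for this example, and the bit string is the example file from above typed in as text.

```python
# The 64 bits of the example file, written as eight adjacent string
# literals (Python joins adjacent literals into one string).
bits = ("10000001" "00000000" "00000000" "00000000"
        "00010111" "00000110" "00000000" "00000000")

# Split the bit string into groups of eight bits, one group per byte.
groups = [bits[i:i + 8] for i in range(0, len(bits), 8)]

# Convert each group of eight bits into a byte value (0 to 255).
data = bytes(int(group, 2) for group in groups)

print(" ".join(groups))
print(len(data), "bytes")  # 8 bytes
```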



Representing a file in binary format takes up a lot of space so files are usually displayed in hexadecimal (hex) format.
To change a file from binary format to hexadecimal format, replace each group of four bits with the number or letter in the following table which corresponds to the pattern of four bits. Because we are replacing four bits at a time, one byte is now represented by two alphanumeric (letter or number) characters. 

0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F

Below is our file with each byte shown in hexadecimal format.

81 00 00 00 17 06 00 00
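As a rough sketch, the four-bits-to-one-character replacement described above could be written in Python like this. The names `NIBBLE_TO_HEX` and `bits_to_hex` are invented for this example.

```python
# Build the lookup table: each pattern of four bits maps to one hex character.
NIBBLE_TO_HEX = {format(n, "04b"): format(n, "X") for n in range(16)}

def bits_to_hex(bits):
    """Replace each group of four bits with its hex character."""
    return "".join(NIBBLE_TO_HEX[bits[i:i + 4]] for i in range(0, len(bits), 4))

print(bits_to_hex("10000001"))  # 81
print(bits_to_hex("00010111"))  # 17
```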

This is the way hex editors usually display the file. A hex editor is an editor that can display and edit files in the form that they are stored on disk and so can open any file.
Some freeware hex editors:
WARNING: When investigating files in a hex editor make a copy of the file you want to study and open the copy in the hex editor because you may save a modification that stops the file working (corrupts the file).
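If you prefer to script the copy, a minimal Python sketch might look like the following. The function name `make_working_copy` and the `.copy` suffix are assumptions made for this example, not part of any tool.

```python
import shutil
from pathlib import Path

def make_working_copy(original):
    """Copy `original` to a sibling file with '.copy' appended; return the new path."""
    source = Path(original)
    target = source.with_name(source.name + ".copy")
    shutil.copy2(source, target)  # copies the contents and the file metadata
    return target
```

Calling `make_working_copy("mystery.dat")` (a hypothetical file name) would create `mystery.dat.copy`, which is the file to open in the hex editor.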

If you have access to the source code of the program that created the file you can study the file saving routine to discover what each byte represents.

The challenge with deciphering an unknown file format is knowing what each byte or group of bytes represents. Is it a positive or negative whole number? Is it a letter, and if so, which encoding was used? Is it a date? Is it a decimal number?

We could guess that the file is a sequence of characters. Characters are what make up text in a text editor program like Notepad: letters, numeric digits, punctuation symbols, special symbols (such as #), arithmetic signs and whitespace (characters such as spaces and tabs that leave no visible mark).

Different numbers of bytes can be used to store characters in a file. We guess that one byte represents one character.

We need to find a table (known as a character set) that links a one byte hex value to a character symbol. Getting a byte value for a character is called encoding and getting a character from a byte value is decoding.

One such table is the ASCII table, but there are others for non-English languages and others for English as well. We will try the ASCII table. Note that any entry displayed in the table with more than one symbol is a control character: it is not printable, and the symbols shown are only its name. NUL does not print N, U, L, for example.

ASCII table 
 The eight bytes of our file decoded as ASCII are shown below.

81       00   00   00   17   06   00   00
invalid  NUL  NUL  NUL  ETB  ACK  NUL  NUL

Our first byte has no entry in the table and the rest of the bytes are all non-printable control characters. The only control characters normally found in an ASCII text file are TAB and the line ending characters, CR (carriage return) and/or LF (linefeed). Our file is therefore not an ASCII text file.
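The same check can be sketched in Python. The `CONTROL_NAMES` table and the `describe_ascii` function are invented for this example, and the table only names the few control characters needed here rather than the whole ASCII set.

```python
data = bytes.fromhex("8100000017060000")

# Names for a handful of ASCII control characters (not the full list).
CONTROL_NAMES = {0x00: "NUL", 0x06: "ACK", 0x09: "TAB",
                 0x0A: "LF", 0x0D: "CR", 0x17: "ETB"}

def describe_ascii(byte):
    """Return a label for one byte value interpreted as ASCII."""
    if byte > 0x7F:
        return "invalid"      # ASCII only defines the values 00 to 7F
    if byte in CONTROL_NAMES:
        return CONTROL_NAMES[byte]
    if byte < 0x20 or byte == 0x7F:
        return "control"      # some other non-printable control character
    return chr(byte)          # a printable character

decoded = [describe_ascii(b) for b in data]
print(decoded)
# ['invalid', 'NUL', 'NUL', 'NUL', 'ETB', 'ACK', 'NUL', 'NUL']
```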

Hex editors usually display the file in hex format side by side with a text format so you can see if there is any readable text in the file. If your file is a text file you can use a text editor like Notepad to examine the format. Non text files look very odd when opened in a text editor.
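A very simplified version of that side-by-side display can be sketched in Python. The function name `hex_dump_line` is made up, and showing a dot for each non-printable byte is a common convention assumed here rather than the behaviour of any particular hex editor.

```python
def hex_dump_line(data):
    """Format bytes as hex on the left and printable ASCII text on the right."""
    hex_part = " ".join(f"{b:02X}" for b in data)
    # Show printable ASCII characters as themselves and everything else as a dot.
    text_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in data)
    return f"{hex_part}  {text_part}"

print(hex_dump_line(bytes.fromhex("8100000017060000")))
# 81 00 00 00 17 06 00 00  ........
print(hex_dump_line(b"Hello"))
# 48 65 6C 6C 6F  Hello
```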

Maybe the file is a sequence of numbers. Let’s assume it is a sequence of whole numbers (integers). Different numbers of bytes can be used to store integers in a file: an integer can be stored in a single byte, or across two, four or eight bytes, or across other numbers of bytes.

Let’s now assume that each integer is stored in one byte.

The next thing we need to decide is whether each integer is a positive integer only (unsigned) or whether each integer may be positive or negative (signed).

Let’s decide that each integer is unsigned. We need a table that links a one byte hex value to a positive integer.

  One byte to Unsigned Integer Table1

One byte to Unsigned Integer Table2

The eight bytes of our file decoded as byte sized unsigned integers are shown below.

81   00  00  00  17  06  00  00
129  0   0   0   23  6   0   0
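In Python the lookup table is effectively built in, so the decoding above can be reproduced with a short sketch. The variable names are only illustrative.

```python
data = bytes.fromhex("8100000017060000")

# Iterating over a bytes object yields each byte as an unsigned integer (0 to 255).
unsigned_values = list(data)
print(unsigned_values)  # [129, 0, 0, 0, 23, 6, 0, 0]

# A single hex value can also be converted directly.
print(int("81", 16))  # 129
```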

These are the values you also get from Windows Calculator when switching between Hex mode and Dec mode.

If we decided instead that each byte was a signed integer we would have to use a table that links a one byte hex value to a signed integer. You will notice that the negative integers correspond to byte values that start with 8 or 9 or A or B or C or D or E or F, that is, those bytes that start with a one in binary format.
 One byte to Signed Integer Table1

One byte to Signed Integer Table2

The eight bytes of our file decoded as byte sized signed integers are shown below.

81    00  00  00  17  06  00  00
-127  0   0   0   23  6   0   0
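Instead of a table, the signed value can be computed. The sketch below assumes the usual two's complement convention, in which byte values of 128 and above (those starting with a one in binary) stand for the value minus 256. The function name `to_signed_byte` is invented for this example.

```python
import struct

data = bytes.fromhex("8100000017060000")

def to_signed_byte(value):
    """Interpret an unsigned byte value (0 to 255) as a two's complement signed byte."""
    return value - 256 if value >= 128 else value

signed_values = [to_signed_byte(b) for b in data]
print(signed_values)  # [-127, 0, 0, 0, 23, 6, 0, 0]

# The standard struct module gives the same result: "b" means one signed byte.
print(list(struct.unpack("8b", data)))  # [-127, 0, 0, 0, 23, 6, 0, 0]
```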

Some freeware calculators to convert between hex and unsigned and signed integers and between hex and binary and decimal are listed below.

Screenshot Megatops BinCalc

So far we have only been using one byte to determine an integer or a character. However, a character or integer may be stored in more than one byte. Whenever you use more than one byte to determine some value, you must take into account another file property: byte order.

The byte order of a file simply tells you what order the bytes are written for any multi-byte data stored in the file.

The two common byte orders are known as big-endian and little-endian. Many files you examine will be in little-endian form since this is the form used by Intel CPUs, although some file formats specify big-endian order regardless of the CPU.

Big-endian is the way we write numbers in everyday life. For example when we write the number one hundred and twenty three we write the number from left to right with the number of hundreds on the left, the number of ones on the right and the number of tens in the middle. That is, 123. This is called big-endian because the multipliers (100, 10, 1) get smaller as you read from left to right so the biggest multiplier is first.

One hundred and twenty three written in little-endian order would look like this, 321. Here the multipliers (1, 10, 100) get bigger from left to right so the smallest (littlest) multiplier is first. It is important to note that we are representing the same numerical value in two different ways but if you assume the bytes are in one form when in fact they are in the other you will not get the correct values.

So to change the endianness of a number all you do is reverse the digits.

To change the endianness of multi-byte data you reverse the bytes of the data. You don’t reverse individual bits or hex characters but the whole bytes.

If we consider the first two bytes of our file to be one integer stored little end first,  then to change it to big end first all we do is reverse the bytes.

little end first   81 00
big end first      00 81

Here is an example for four bytes.

little end first   81 12 37 A0
big end first      A0 37 12 81
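The byte reversal can be sketched in Python as follows. The helper name `show` is made up for this example. Note that whole bytes are reversed, not individual hex characters.

```python
def show(data):
    """Format bytes as two-character hex values separated by spaces."""
    return " ".join(f"{b:02X}" for b in data)

little_end_first = bytes.fromhex("811237A0")

# Slicing with a step of -1 reverses the order of the whole bytes.
big_end_first = little_end_first[::-1]

print(show(little_end_first))  # 81 12 37 A0
print(show(big_end_first))     # A0 37 12 81
```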

In a Hex Editor you need to nominate an endianness in the settings. Most Hex Editors default to little endian.

Let’s now consider our file as a sequence of unsigned integers, with each integer stored in two bytes.

We need a table that links a two byte hex value to an unsigned integer. This table would be very large since using two bytes gives many more choices for integers so I will only show some entries.

Two bytes to Unsigned Integer Table

The eight bytes of our file decoded as double-byte sized unsigned integers for each endianness are shown below.

hex                   81 00   00 00   17 06   00 00
little endian value   129     0       1559    0
big endian value      33024   0       5894    0
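Rather than using a very large table, the two-byte values can be computed. Here is a minimal Python sketch using the built-in `int.from_bytes`. The variable names are illustrative only.

```python
data = bytes.fromhex("8100000017060000")

# Split the file into consecutive two-byte groups.
pairs = [data[i:i + 2] for i in range(0, len(data), 2)]

# Decode each pair as an unsigned integer under each byte order.
little_endian = [int.from_bytes(pair, "little") for pair in pairs]
big_endian = [int.from_bytes(pair, "big") for pair in pairs]

print(little_endian)  # [129, 0, 1559, 0]
print(big_endian)     # [33024, 0, 5894, 0]
```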

The tables we have been using are built into Hex Editors, so when you place the cursor in the hex display, the Hex Editor can decode the data for you. Hex Editors usually decode the data for many different data types at the same time, such as ASCII, byte-sized unsigned integers, byte-sized signed integers, double-byte sized unsigned integers, double-byte sized signed integers, floating point single numbers, floating point double numbers and others.

For multi-byte data the Hex Editor reads the byte at the cursor together with the bytes that follow it, and applies the endianness setting.

Sometimes the integers are named for the number of bits and not bytes and there are also some names in common use for various byte sizes and types.

byte sized unsigned integer          char, byte, uint8
byte sized signed integer            signed char, shortint, int8, sint8
double-byte sized unsigned integer   word, uint16
double-byte sized signed integer     short, smallint, int16, sint16
quad-byte sized unsigned integer     double word, dword, unsigned int, cardinal, uint32
quad-byte sized signed integer       int, integer, int32, sint32
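As one illustration of how these types appear in a programming language, Python's standard `struct` module uses one-letter codes for them. The sketch below decodes the example file as several of the types in the list, assuming little-endian order (the `<` prefix). The variable names follow the bit-size names above and are otherwise arbitrary.

```python
import struct

data = bytes.fromhex("8100000017060000")

# "<" selects little-endian order (">" would select big-endian).
# B/b = unsigned/signed 8 bit, H/h = 16 bit, I/i = 32 bit.
uint8 = struct.unpack("<8B", data)
int8 = struct.unpack("<8b", data)
uint16 = struct.unpack("<4H", data)
int16 = struct.unpack("<4h", data)
uint32 = struct.unpack("<2I", data)
int32 = struct.unpack("<2i", data)

print(uint8)   # (129, 0, 0, 0, 23, 6, 0, 0)
print(int8)    # (-127, 0, 0, 0, 23, 6, 0, 0)
print(uint16)  # (129, 0, 1559, 0)
print(uint32)  # (129, 1559)
```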

In the Hex Editor screenshot below see how the integer values for unsigned and signed 8 bit to 64 bit integers are displayed. The endianness is little endian and can be changed in the Format menu.

Mitec Hex Editor ScreenShot


For further information see this website.

http://www.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html
