What is a DOC file?
Files with the .doc extension are documents created by Microsoft Word or other word processing applications in a binary file format. The extension was originally used for plain-text documentation on several operating systems. A DOC file can contain many kinds of data, including images, formatted and plain text, graphs, charts, embedded objects, links, pages, page formatting, and print settings. The format was popular for all sorts of documentation because of the range of options it offers for writing manuals, proposals, specifications, resumes, articles, and similar documents. DOC has been succeeded by DOCX, which is based on Office Open XML, whose specifications are openly available.
Brief History
WordPerfect, now a Corel product, used DOC as the extension of its proprietary format. In the 1980s, WordPerfect was the word processor of choice on most computers because it was readily available and ran on most machines and operating systems. However, WordPerfect lost ground on Windows when Microsoft introduced Microsoft Word and chose the DOC extension for its own proprietary document format. As Microsoft Word became more popular, the DOC file format underwent several revisions through the Microsoft Word 97 - 2003 releases. In 2007, the Office Open XML format (known as DOCX) replaced DOC as the default, and newer versions of Microsoft Word use the new extension by default.
DOC File Format Specifications - More Information
Microsoft did not publish the DOC file format specifications until 2008. In February 2008, the specifications for the .doc file format were released under the Microsoft Open Specification Promise. Although the specification does not describe every feature used by the DOC format, it provides ample information for working with it. Some reverse engineering is still required to make full use of the available information. The specifications have been updated several times; the latest revision is 8.0, dated August 2018.
Some Fundamental Concepts
Before going into the details of the DOC file format specifications, it helps to understand a few fundamental concepts needed to work with this format.
File Information Block (Fib): The Fib structure contains information about the document and specifies the file pointers to the various portions that make up the document. The Fib is a variable-length structure. With the exception of the base portion, which is fixed in size, every section is preceded by a count field that specifies the size of the next section.
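As an illustration of the count-prefixed layout described above, the following Python sketch walks a Fib and records where each section starts. The 32-byte base size, the section names (fibRgW, fibRgLw, fibRgFcLcb, fibRgCswNew), and the unit size that each count is multiplied by are assumptions recalled from the published specification rather than details given in this article, so verify them before relying on the result.

```python
import struct

FIB_BASE_SIZE = 32  # assumed size in bytes of the fixed base portion

# (section name, bytes per counted unit). Both columns are assumptions
# recalled from the published specification, not taken from this article.
SECTIONS = [
    ("fibRgW", 2),
    ("fibRgLw", 4),
    ("fibRgFcLcb", 8),
    ("fibRgCswNew", 2),
]


def split_fib(stream: bytes) -> dict:
    """Return {section name: (offset, length in bytes)} for a Fib."""
    if len(stream) < FIB_BASE_SIZE:
        raise ValueError("stream too short to hold the Fib base")
    layout = {"base": (0, FIB_BASE_SIZE)}
    pos = FIB_BASE_SIZE
    for name, unit in SECTIONS:
        if pos + 2 > len(stream):
            raise ValueError(f"missing count field before {name}")
        # Each section is preceded by a 16-bit little-endian count field.
        (count,) = struct.unpack_from("<H", stream, pos)
        pos += 2
        length = count * unit
        if pos + length > len(stream):
            raise ValueError(f"{name} runs past the end of the stream")
        layout[name] = (pos, length)
        pos += length
    return layout
```

Given the first bytes of a WordDocument stream, split_fib returns the offset and length of each section, which a reader can then decode field by field.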
Character Position: A CP, or Character Position, is an unsigned 32-bit integer that serves as the zero-based index of a character in the document text. The location and size of each character in the file cannot be read directly and must be computed using a pre-specified algorithm. Characters include:
- Text of the document
- Anchors of objects such as footnotes or textboxes
- Control characters such as paragraph marks and table cell marks
PLC: The PLC structure is an array of CPs followed by an array of data elements. All data elements in a given PLC must be the same size, which can be zero or more bytes, and the number of CPs must be one more than the number of data elements. There are different types of PLC structure, and each type specifies whether duplicate CPs are allowed. A PLC structure consists of:
- aCP (variable length): An array of CP elements. Each type of PLC structure specifies the meaning of the CP elements and the allowed range.
- aData (variable length): Each type of PLC structure specifies the structure and meaning of the data elements, any restrictions on the number of data elements, and any restrictions on the data contained therein. It also specifies the relationship between the data elements and the corresponding CPs.
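Because a PLC holds one more CP than it has data elements, the element count can be derived from the total size once the per-type element size is known. The sketch below shows that arithmetic in Python. The function name parse_plc and its cb_data parameter are hypothetical, and little-endian byte order for the CPs is an assumption.

```python
import struct


def parse_plc(blob: bytes, cb_data: int):
    """Split a PLC into (aCP, aData), given the size of one data element."""
    # n data elements and n + 1 CPs occupy 4 * (n + 1) + cb_data * n bytes,
    # so n = (total size - 4) / (4 + cb_data).
    n, remainder = divmod(len(blob) - 4, 4 + cb_data)
    if len(blob) < 4 or remainder != 0:
        raise ValueError("size is not consistent with a PLC of this type")
    # aCP: n + 1 unsigned 32-bit integers (little-endian is assumed here).
    a_cp = list(struct.unpack_from(f"<{n + 1}I", blob, 0))
    # aData: n fixed-size elements, left as raw bytes for the caller to decode.
    start = 4 * (n + 1)
    a_data = [blob[start + i * cb_data:start + (i + 1) * cb_data]
              for i in range(n)]
    return a_cp, a_data
```

Each data element i then relates to the CPs at positions i and i + 1 in whatever way the specific PLC type defines.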
Valid Selection: Many constructs in a .DOC file are described by a range of CPs. Microsoft specifies a number of rules that such a range must follow to be valid.
STTB: The STTB is a string table that is made up of a header that is followed by an array of elements. The cData value specifies the number of elements that are contained in the array.
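A rough Python sketch of reading such a string table follows. Only cData comes from the description above; the other header details used here are assumptions recalled from the published specification and should be checked against it: an optional 0xFFFF marker that signals 2-byte characters, a 2-byte cData, a 2-byte cbExtra giving the size of the extra data attached to each string, and a length prefix before each string. Some STTB types are understood to use a 4-byte cData, which this sketch does not handle.

```python
import struct


def parse_sttb(blob: bytes):
    """Return a list of (text, extra_data) entries from an STTB."""
    pos = 0
    # Assumption: a leading 0xFFFF marker means strings use 2-byte characters.
    extended = blob[:2] == b"\xff\xff"
    if extended:
        pos = 2
    # cData: number of elements; cbExtra: extra bytes stored after each string.
    c_data, cb_extra = struct.unpack_from("<HH", blob, pos)
    pos += 4
    entries = []
    for _ in range(c_data):
        if extended:
            (cch,) = struct.unpack_from("<H", blob, pos)
            pos += 2
            nbytes = cch * 2
            text = blob[pos:pos + nbytes].decode("utf-16-le")
        else:
            cch = blob[pos]
            pos += 1
            nbytes = cch
            # The real code page depends on the document; latin-1 is a stand-in.
            text = blob[pos:pos + nbytes].decode("latin-1")
        pos += nbytes
        extra = blob[pos:pos + cb_extra]
        pos += cb_extra
        entries.append((text, extra))
    return entries
```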
Property Storage: A Word file may have different elements such as text, paragraphs, tables, pictures, and sections, each of which can have its own properties. These properties are stored in the Word file as differences from the defaults. Each difference is specified by a Prl, which consists of a Single Property Modifier (Sprm) and its operand. An application determines the final set of properties by applying lists of Prls.
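The idea of applying Prls on top of defaults can be sketched in Python as follows. The bit layout of the 16-bit Sprm (ispmd, fSpec, sgc, spra) and the operand sizes per spra value are assumptions recalled from the published specification. The handling of variable-length operands is simplified to "first byte holds the length", which does not cover every Sprm, and the function names are hypothetical.

```python
import struct

# Operand size in bytes for each spra value (assumed from the specification).
# None marks a variable-length operand whose first byte is taken as its length.
OPERAND_SIZE = {0: 1, 1: 1, 2: 2, 3: 4, 4: 2, 5: 2, 6: None, 7: 3}


def decode_sprm(sprm: int) -> dict:
    """Split a 16-bit Sprm into its assumed bit fields."""
    return {
        "ispmd": sprm & 0x01FF,      # property identifier within its group
        "fSpec": (sprm >> 9) & 0x1,  # special-handling flag
        "sgc": (sprm >> 10) & 0x7,   # property group (paragraph, character, ...)
        "spra": (sprm >> 13) & 0x7,  # operand size code
    }


def parse_prls(blob: bytes):
    """Read consecutive Prls and return a list of (sprm, operand bytes)."""
    prls, pos = [], 0
    while pos + 2 <= len(blob):
        (sprm,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        size = OPERAND_SIZE[decode_sprm(sprm)["spra"]]
        if size is None:
            if pos >= len(blob):
                raise ValueError("missing length byte for variable operand")
            size = 1 + blob[pos]  # length byte plus that many payload bytes
        operand = blob[pos:pos + size]
        if len(operand) != size:
            raise ValueError("truncated operand")
        prls.append((sprm, operand))
        pos += size
    return prls


def apply_prls(defaults: dict, prls) -> dict:
    """Overlay Prls on default properties; later Prls override earlier ones."""
    props = dict(defaults)
    for sprm, operand in prls:
        props[sprm] = operand
    return props
```

A real implementation would interpret each operand according to its Sprm instead of storing raw bytes, but the overlay order is the part this sketch is meant to show.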
Password Protection: Word files can also be password protected, using one of the following mechanisms:
- XOR Obfuscation
- Office binary document RC4 encryption
- Office binary document RC4 CryptoAPI encryption
If FibBase.fEncrypted and FibBase.fObfuscation are both 1, the file is obfuscated by using XOR obfuscation.
If FibBase.fEncrypted is 1 and FibBase.fObfuscation is 0, the file is encrypted by using either Office Binary Document RC4 Encryption or Office Binary Document RC4 CryptoAPI Encryption, with the EncryptionHeader stored in the first FibBase.lKey bytes of the Table stream. The EncryptionHeader.EncryptionVersionInfo specifies which encryption mechanism was used to encrypt the file.
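The decision described above can be sketched in Python. The byte offsets of the flags word and lKey within FibBase, the bit masks for the two flags, and the version numbers that distinguish the two RC4 variants are all assumptions recalled from the published specifications ([MS-DOC] and [MS-OFFCRYPTO]); none of them is stated in this article, so treat them as values to verify.

```python
import struct

FLAGS_OFFSET = 10       # assumed offset of the 16-bit flags word in FibBase
LKEY_OFFSET = 14        # assumed offset of the 32-bit lKey field in FibBase
F_ENCRYPTED = 0x0100    # assumed mask for FibBase.fEncrypted
F_OBFUSCATION = 0x8000  # assumed mask for FibBase.fObfuscation


def protection_mechanism(word_document: bytes, table_stream: bytes = b"") -> str:
    """Classify how a binary Word file is protected, if at all."""
    (flags,) = struct.unpack_from("<H", word_document, FLAGS_OFFSET)
    (l_key,) = struct.unpack_from("<I", word_document, LKEY_OFFSET)
    if not flags & F_ENCRYPTED:
        return "none"
    if flags & F_OBFUSCATION:
        return "XOR obfuscation"
    # The EncryptionHeader occupies the first lKey bytes of the Table stream
    # and begins with EncryptionVersionInfo (two 16-bit version numbers).
    header = table_stream[:l_key]
    if len(header) < 4:
        return "RC4 (variant unknown)"
    major, minor = struct.unpack_from("<HH", header, 0)
    if (major, minor) == (1, 1):          # assumed version for plain RC4
        return "RC4"
    if major in (2, 3, 4) and minor == 2:  # assumed versions for CryptoAPI
        return "RC4 CryptoAPI"
    return "unknown"
```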
File Structure
A binary Word file is an OLE compound file made up of several storages and streams. Each storage and stream has its own structure and size constraints, which govern how it is read and written. These are:
WordDocument Stream
This stream contains the document text and other information referenced from other parts of the file. It has no predefined structure other than the mandatory FIB, which must be at offset 0. This stream MUST NOT be larger than 2147 MB.
1Table Stream or 0Table Stream
A binary Word file can contain Table streams named 1Table or 0Table. At least one of these must be present in the document. If a document contains both the 1Table and 0Table streams, only the stream referenced by base.fWhichTblStm in the Fib is used. The unreferenced stream MUST be ignored. The Table stream MUST NOT be larger than 2147 MB.
Data Stream
The Data stream has no predefined structure. It contains data that is referenced from the FIB or from other parts of the file. This stream need not be present if there are no references to it. The Data stream MUST NOT be larger than 2147 MB.
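The rules for the WordDocument, Table, and Data streams can be checked with a few lines of Python. This sketch assumes the 2147 MB limit corresponds to 0x7FFFFFFF bytes and that fWhichTblStm equal to 1 selects the stream named "1Table". The function names and the sizes mapping (stream name to size in bytes) are hypothetical.

```python
MAX_STREAM_SIZE = 0x7FFFFFFF  # assumed byte value behind the "2147 MB" limit


def table_stream_name(f_which_tbl_stm: int) -> str:
    """Pick the Table stream referenced by base.fWhichTblStm."""
    return "1Table" if f_which_tbl_stm else "0Table"


def check_streams(sizes: dict, f_which_tbl_stm: int) -> str:
    """Validate the core streams and return the Table stream name to read.

    `sizes` maps stream names found in the compound file to their sizes.
    """
    if "WordDocument" not in sizes:
        raise ValueError("WordDocument stream is missing")
    table = table_stream_name(f_which_tbl_stm)
    if table not in sizes:
        raise ValueError(f"referenced stream {table} is missing")
    # The Data stream is optional, so a missing entry counts as size 0.
    # The unreferenced Table stream is ignored and therefore not checked.
    for name in ("WordDocument", table, "Data"):
        if sizes.get(name, 0) > MAX_STREAM_SIZE:
            raise ValueError(f"{name} stream exceeds the size limit")
    return table
```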
Object Pool Storage
The Object Pool storage contains storages for embedded OLE objects. This storage need not be present if there are no embedded OLE objects in the document.
Custom XML Data Storage
The Custom XML Data storage is an optional storage whose name MUST be “MsoDataStore”.
Summary Information Stream
The Summary Information stream is an optional stream whose name MUST be “\005SummaryInformation”, where \005 is the character with value 0x0005, and not the string literal “\005”.
Document Summary Information Stream
The Document Summary Information stream is an optional stream whose name MUST be “\005DocumentSummaryInformation”, where \005 is the character with value 0x0005, not the string literal “\005”.
Encryption Stream
The Encryption stream is an optional stream whose name MUST be “encryption”. This stream MUST NOT be present unless both of the following conditions are met:
- The document is encrypted with Office Binary Document RC4 CryptoAPI Encryption.
- The fDocProps value is set in the EncryptionHeader.Flags.
Macros Storage
The Macros storage is an optional storage that contains the macros for the file. If present, it MUST be a Project Root Storage.
XML Signatures Storage
The XML signatures storage is an optional storage whose name MUST be “_xmlsignatures”.
Signatures Stream
The signatures stream is an optional stream whose name MUST be “_signatures”. This stream contains digital signatures.
Information Rights Management Data Space Storage
The Information Rights Management Data Space storage is an optional storage whose name MUST be “\006DataSpaces”, where \006 is the character with value 0x0006, and not the string literal “\006”. If this storage is present, the Protected Content Stream MUST also be present. If this storage is present, all specified streams and storages other than this storage and the Protected Content Stream SHOULD be read from the Protected Content Stream as specified in [MS-OFFCRYPTO] and if any of those streams and storages exist outside of the Protected Content Stream, they SHOULD be ignored.
Protected Content Stream
The Protected Content Stream is an optional stream whose name MUST be “\009DRMContent”, where \009 is the character with value 0x0009, and not the string literal “\009”. If this stream is present, the Information Rights Management Data Space Storage MUST also be present.
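Several of the names above begin with a control character, which is easy to get wrong in code. The Python sketch below writes them with hexadecimal escapes; note that a Python literal such as "\009" would not produce the character 0x0009, because 9 is not an octal digit. The constant and function names are hypothetical. The consistency check encodes only the rule stated above: the Information Rights Management Data Space storage and the Protected Content Stream must appear together.

```python
# Names with a leading control character, written with explicit hex escapes.
SUMMARY_INFORMATION = "\x05SummaryInformation"
DOCUMENT_SUMMARY_INFORMATION = "\x05DocumentSummaryInformation"
DATA_SPACES = "\x06DataSpaces"
DRM_CONTENT = "\x09DRMContent"


def irm_entries_consistent(names) -> bool:
    """True if the IRM storage and Protected Content Stream are both present
    or both absent in the given collection of entry names."""
    present = set(names)
    return (DATA_SPACES in present) == (DRM_CONTENT in present)


def printable(name: str) -> str:
    """Render a name with non-printable characters shown as \\xNN escapes."""
    return "".join(c if c.isprintable() else f"\\x{ord(c):02x}" for c in name)
```

The printable helper is handy for logging, since printing these names directly can emit control characters to the terminal.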