DOCX is a well-known format for Microsoft Word documents. Introduced in 2007 with the release of Microsoft Office 2007, the new document format changed the file structure from plain binary to a combination of XML and binary files. DOCX files can be opened with Word 2007 and later versions, but not with earlier versions of MS Word, which support the DOC file extension.
After Microsoft opened the specifications for the DOC file format, it was easy for its competitors to reverse engineer the format and provide the same support in their own applications. In addition, competition from OpenOffice, in the form of its OpenDocument Format, compelled Microsoft to adopt more open and widely accepted standards. In the early 2000s, Microsoft decided to make the change and accommodate the Office Open XML standard. Documents under this new standard were given the .docx extension, the “X” standing for XML. In 2007, this new file format became part of Office 2007, and it has been carried forward in newer versions of Microsoft Office as well. The new file type has the added advantages of smaller file sizes, fewer chances of corruption, and well-formatted image representation.
File Format Specifications
A DOCX file consists of a collection of XML files contained inside a ZIP archive. The contents of a Word document can be viewed by unzipping the archive. The XML files in the collection are categorized as:
Metadata Files - contain information about other files available in the archive
Document - contains the actual contents of the document
Microsoft Word uses these files to find the relationships between files and to locate the document contents. When a Word document archive is extracted, it yields a number of such files, as detailed below.
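Because a DOCX file is an ordinary ZIP archive, its parts can be listed with any ZIP library. The following is a minimal sketch using Python's standard zipfile module. It builds a tiny archive in memory so that it runs without a real document. The part bodies are placeholders chosen for illustration, and the helper names build_archive and list_parts are hypothetical. A file saved by Word contains more parts than the three shown here.

```python
import io
import zipfile

# Placeholder parts for illustration only; the bodies are not valid Word content.
PARTS = {
    "[Content_Types].xml": "<Types/>",
    "_rels/.rels": "<Relationships/>",
    "word/document.xml": "<document/>",
}


def build_archive(parts):
    """Return the bytes of a ZIP archive holding the given name -> text parts."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in parts.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def list_parts(data):
    """Return the part names stored in a DOCX (ZIP) archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


names = list_parts(build_archive(PARTS))
for name in names:
    print(name)
```

To inspect a real document instead, the same zipfile.ZipFile call accepts a file path in place of the in-memory buffer.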
Relationships - _rels/.rels
This file contains information that tells MS Word where to look for the document contents and other references. Each relationship is identified by a unique relationship ID and specifies the referenced XML file as its target.
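The relationship lookup described above can be sketched in Python. The embedded XML is an assumed, minimal example of a _rels/.rels part with a single relationship pointing to the main document. It is not taken from a specific file, and the function name read_relationships is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by package relationship parts.
RELS_NS = {"r": "http://schemas.openxmlformats.org/package/2006/relationships"}

# Assumed minimal sample of a _rels/.rels part (illustrative, not from a real file).
RELS_XML = """<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
    Target="word/document.xml"/>
</Relationships>"""


def read_relationships(xml_text):
    """Map each relationship Id to the Target part it references."""
    root = ET.fromstring(xml_text)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.findall("r:Relationship", RELS_NS)
    }


relationships = read_relationships(RELS_XML)
print(relationships)
```

Here the relationship rId1 resolves to word/document.xml, which is how a reader of the archive can locate the main document part.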
Content Types - [Content_Types].xml
A document can contain several media types, such as images, themes, and word art. The [Content_Types].xml file contains information about the media types present in the document.
Information about resources, such as images embedded in the document, is referenced in this XML file.
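As an illustration of how [Content_Types].xml describes media types, the sketch below parses an assumed sample of that part. The sample entries (a PNG default and an override for the main document) are typical but chosen here for illustration, and the function name read_content_types is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the content types part.
CT_NS = {"ct": "http://schemas.openxmlformats.org/package/2006/content-types"}

# Assumed sample content (illustrative); real files usually list more entries.
CONTENT_TYPES_XML = """<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels"
    ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="png" ContentType="image/png"/>
  <Override PartName="/word/document.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def read_content_types(xml_text):
    """Return (defaults by file extension, overrides by part name)."""
    root = ET.fromstring(xml_text)
    defaults = {
        d.get("Extension"): d.get("ContentType")
        for d in root.findall("ct:Default", CT_NS)
    }
    overrides = {
        o.get("PartName"): o.get("ContentType")
        for o in root.findall("ct:Override", CT_NS)
    }
    return defaults, overrides


defaults, overrides = read_content_types(CONTENT_TYPES_XML)
print(defaults["png"])
print(overrides["/word/document.xml"])
```

Default entries apply to every part with a given file extension, while Override entries assign a type to one specific part.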
Main Document Contents
This refers to the main XML file of the archive, which contains the document’s text content. This content is represented by a variety of nodes as per the Office Open XML specification. Most of this file consists of paragraphs and tables, though there can be other nodes as well.
File Format Nodes
The main document.xml file is a collection of nodes that represent the overall contents of a file. Each node has a start and an end tag that encapsulate either further nodes or content.
The following are some of the nodes used in a DOCX file to represent content.
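To make the nesting of nodes concrete, the following sketch parses an assumed, heavily simplified document.xml and prints its element tree. The sample XML and the helper name outline are illustrative. A document produced by Word carries many more namespaces, attributes, and property nodes.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, conventionally bound to the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumed simplified sample of word/document.xml (illustrative only).
DOCUMENT_XML = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello DOCX</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>"""


def outline(element, depth=0):
    """Return one indented line per element, showing how nodes nest."""
    # ElementTree stores tags as "{namespace}local"; keep only the local name.
    # The "w:" prefix is re-added on the assumption that every node in this
    # sample belongs to the WordprocessingML namespace.
    local_name = element.tag.split("}", 1)[-1]
    lines = ["  " * depth + "w:" + local_name]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(DOCUMENT_XML))
print("\n".join(tree_lines))
```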
<w:document> - Represents the root element of the main content of the file.
<w:body> - Represents the body of the document, which can comprise many other element nodes such as paragraphs, tables, and sections.
A paragraph is the main content holder within a document. It is represented by the <w:p> element. A paragraph consists of one or more runs, <w:r>, that contain the actual text of the paragraph. In addition to runs, paragraphs may also contain other document elements such as hyperlinks and comments. An example paragraph structure is shown below:
<w:spacing w:before="120" w:after="120"/>
<w:t xml:space="preserve">A paragraph is the main container in a document that further consists of one or more runs where the text of the paragraph is actually contained.</w:t>
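The paragraph, run, and text structure can be read back with a few lines of Python. This is a sketch under stated assumptions: the sample paragraph below is written for illustration, with the spacing element placed inside a <w:pPr> properties node and the text split across two runs. The function names paragraph_text and paragraph_spacing are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Assumed sample paragraph (illustrative): properties first, then two runs.
PARAGRAPH_XML = f"""<w:p xmlns:w="{W}">
  <w:pPr>
    <w:spacing w:before="120" w:after="120"/>
  </w:pPr>
  <w:r>
    <w:t xml:space="preserve">First run, </w:t>
  </w:r>
  <w:r>
    <w:t>second run.</w:t>
  </w:r>
</w:p>"""


def paragraph_text(paragraph):
    """Concatenate the text of every <w:t> node in document order."""
    return "".join(t.text or "" for t in paragraph.iter(f"{{{W}}}t"))


def paragraph_spacing(paragraph):
    """Return the (before, after) spacing attributes, or None if not set."""
    spacing = paragraph.find("w:pPr/w:spacing", NS)
    if spacing is None:
        return None
    # The values are stored as strings, in twentieths of a point.
    return spacing.get(f"{{{W}}}before"), spacing.get(f"{{{W}}}after")


paragraph = ET.fromstring(PARAGRAPH_XML)
text = paragraph_text(paragraph)
spacing = paragraph_spacing(paragraph)
print(text)
print(spacing)
```

The xml:space="preserve" attribute on the first run keeps its trailing space, so the two runs join into a single readable sentence.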