    What's on this Page

      • What is a DOCX file?
      • Brief History
      • File Format Specifications
        • Metadata Files
        • File Format Nodes
      • References

    What is a DOCX file?

    DOCX is the well-known file format for Microsoft Word documents. Introduced with the release of Microsoft Office 2007, it changed the structure of Word documents from a plain binary format to a combination of XML and binary files. DOCX files can be opened with Word 2007 and later versions, but not with earlier versions of MS Word, which support the DOC file extension.

    Brief History

    After Microsoft opened the specifications for the DOC file format, it became easy for competitors to reverse engineer the format and provide the same support in their own applications. In addition, competition from OpenOffice, in the form of its OpenDocument Format, compelled Microsoft to adopt more open and widely accepted standards. In the early 2000s, Microsoft decided to make the change and adopt the Office Open XML standard. Documents under this new standard were given the .docx extension, the "X" standing for XML. In 2007, the new file format became part of Office 2007, and it has been carried forward in later versions of Microsoft Office. The new file type has the added advantages of smaller file sizes, fewer chances of corruption, and well-formatted image representation.

    File Format Specifications

    A DOCX file consists of a collection of XML files contained inside a ZIP archive. The contents of a Word document can be viewed by unzipping the archive. The XML files in the collection are categorized as:

    • Metadata Files - contain information about the other files available in the archive
    • Document - contains the actual contents of the document
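
Because a DOCX file is an ordinary ZIP archive, its parts can be listed with any ZIP library. The following is a minimal sketch using Python's standard `zipfile` module. The helper names `list_parts` and `build_sample_package` are hypothetical, and the in-memory archive only imitates the layout of a real document.

```python
import io
import zipfile


def list_parts(source):
    """Return the names of all parts stored in a DOCX (ZIP) package.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        return package.namelist()


def build_sample_package():
    """Build a tiny in-memory ZIP that imitates a DOCX layout (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_parts(build_sample_package()):
        print(name)
```

Passing the path of a real .docx file to `list_parts` should work the same way, since `zipfile.ZipFile` accepts both paths and file-like objects.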

    Metadata Files

    Microsoft Word uses these files to find the relationship between files and to locate the document contents. When a Word document archive is extracted, it contains a number of such files as detailed below.

    Relationships - _rels/.rels

    This file tells MS Word where to look for the document contents and other references. Each relationship is identified by a unique relationship ID and specifies the referenced XML file as its target. A sample relationship entry is shown below:

    <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
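
As a rough illustration of how a reader might use this file, the sketch below parses a relationships part and returns the target of the `officeDocument` relationship. The function name `find_main_document` and the `SAMPLE_RELS` string are assumptions made for this example. The sample wraps the entry above in a `Relationships` root element carrying the package relationships namespace.

```python
import xml.etree.ElementTree as ET

# Namespace of package relationship parts.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type that points at the main document part.
OFFICE_DOCUMENT = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)


def find_main_document(rels_xml):
    """Return the Target of the officeDocument relationship, or None if absent."""
    root = ET.fromstring(rels_xml)
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        if rel.get("Type") == OFFICE_DOCUMENT:
            return rel.get("Target")
    return None


# Hypothetical sample modelled on the entry shown above.
SAMPLE_RELS = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    "</Relationships>"
)

if __name__ == "__main__":
    print(find_main_document(SAMPLE_RELS))
```

Looking the part up by relationship type rather than by a hard-coded path is the safer choice, because the target name is declared by the package rather than fixed.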
    

    Content Types

    A document can contain several media types, such as images, themes, and WordArt. The [Content_Types].xml file contains information about the media types present in the document. A sample entry from this file is shown below:

    <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
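
To show how such entries can be read, the sketch below collects every `Override` element into a dictionary that maps a part name to its content type. The name `read_overrides` and the sample string are hypothetical. The sample adds a `Types` root element and two `Default` entries around the `Override` shown above so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# Namespace of the [Content_Types].xml part.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def read_overrides(content_types_xml):
    """Map each Override PartName to its ContentType."""
    root = ET.fromstring(content_types_xml)
    return {
        override.get("PartName"): override.get("ContentType")
        for override in root.findall(f"{{{CT_NS}}}Override")
    }


# Hypothetical sample built around the entry shown above.
SAMPLE_CONTENT_TYPES = (
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" '
    'ContentType="application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.document.main+xml"/>'
    "</Types>"
)

if __name__ == "__main__":
    for part, content_type in read_overrides(SAMPLE_CONTENT_TYPES).items():
        print(part, "->", content_type)
```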
    

    References To Resources - word/_rels/document.xml.rels

    Resources such as images embedded in the document are referenced in this XML file.
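
The sketch below illustrates one way to find embedded images through the document-level relationships part, which Word stores at word/_rels/document.xml.rels. It assumes that targets are relative to the `word` folder and ignores external links. The function name `image_parts`, the `base_dir` parameter, and the sample data (including `media/image1.png`) are hypothetical.

```python
import posixpath
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"


def image_parts(document_rels_xml, base_dir="word"):
    """Map relationship IDs of image relationships to package-level part paths."""
    root = ET.fromstring(document_rels_xml)
    parts = {}
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        if rel.get("Type") == IMAGE_TYPE:
            # Targets are resolved relative to the folder of the source part.
            path = posixpath.join(base_dir, rel.get("Target"))
            parts[rel.get("Id")] = posixpath.normpath(path)
    return parts


# Hypothetical sample: one styles relationship and one image relationship.
SAMPLE_DOCUMENT_RELS = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" '
    'Target="styles.xml"/>'
    '<Relationship Id="rId2" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" '
    'Target="media/image1.png"/>'
    "</Relationships>"
)

if __name__ == "__main__":
    print(image_parts(SAMPLE_DOCUMENT_RELS))
```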

    Main Document Contents

    This refers to the main XML file of the archive, which contains the document's text content. This content is represented by a variety of nodes as per the Office Open XML specifications. The contents of this file consist mostly of paragraphs and tables, though there can be other nodes as well.

    File Format Nodes

    The main document.xml file is a collection of nodes that represent the overall contents of a file. Each node has a start and an end tag that enclose either further nodes or the contents. A simplified example of such an XML file is as follows:

    <w:document>
       <w:body>
           <w:p w:rsidR="005F670F" w:rsidRDefault="005F79F5">
               <w:r><w:t>Example Document</w:t></w:r>
           </w:p>
           <w:sectPr w:rsidR="005F670F">
               <w:pgSz w:w="12240" w:h="15840"/>
               <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720"
                        w:gutter="0"/>
               <w:cols w:space="720"/>
               <w:docGrid w:linePitch="360"/>
           </w:sectPr>
       </w:body>
    </w:document>
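
The page dimensions in `<w:pgSz>` are expressed in twentieths of a point, so 1440 units make one inch. The sketch below reads the page size from a document and converts it to inches. The function name `page_size_inches` and the sample string are hypothetical. Unlike the simplified example above, the sample declares the `w` namespace on the root element, which a parser needs and which real files include.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace, bound to the "w" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TWIPS_PER_INCH = 1440  # 20 twips per point, 72 points per inch


def page_size_inches(document_xml):
    """Return (width, height) in inches from the first w:pgSz, or None if missing."""
    root = ET.fromstring(document_xml)
    pg_sz = root.find(f".//{{{W_NS}}}pgSz")
    if pg_sz is None:
        return None
    width = int(pg_sz.get(f"{{{W_NS}}}w")) / TWIPS_PER_INCH
    height = int(pg_sz.get(f"{{{W_NS}}}h")) / TWIPS_PER_INCH
    return width, height


# Hypothetical sample reduced to the section properties of the example above.
SAMPLE_SECTION_DOC = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    '<w:sectPr><w:pgSz w:w="12240" w:h="15840"/></w:sectPr>'
    "</w:body>"
    "</w:document>"
)

if __name__ == "__main__":
    print(page_size_inches(SAMPLE_SECTION_DOC))
```

For the values in the example, 12240 / 1440 and 15840 / 1440 give 8.5 by 11 inches, which is US Letter. The 1440-unit margins correspond to one inch each.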
    

    The following are some of the nodes used in a DOCX file to represent its contents.

    <w:document> - Represents the root element of the main content of the file.

    <w:body> - Represents the body of the document, which can consist of many other element nodes such as paragraphs, tables and sections.

    Paragraphs

    A paragraph is the main content holder within a document. It is represented by the <w:p> element. A paragraph consists of one or more runs, <w:r>, that contain the actual text of the paragraph. In addition to runs, paragraphs may also contain other document elements such as hyperlinks and comments. An example paragraph structure is shown below:

    <w:p>
        <w:pPr>
            <w:pStyle w:val="MyStyle"/>
            <w:spacing w:before="120" w:after="120"/>
        </w:pPr>
        <w:r>
            <w:t xml:space="preserve">A paragraph is the main container in a document and consists of one or more runs, where the text of the paragraph is actually contained.</w:t>
        </w:r>
    </w:p>
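
Since the text of a paragraph is spread across the `<w:t>` elements of its runs, extracting it means joining those pieces in document order. The following is a minimal sketch of that idea. The name `paragraph_texts` and the sample string are hypothetical, and the sketch does not handle tabs, line breaks, or paragraphs nested inside other paragraphs' content.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_texts(document_xml):
    """Return the text of every w:p element, joining the w:t pieces of its runs."""
    root = ET.fromstring(document_xml)
    texts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        pieces = (t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        texts.append("".join(pieces))
    return texts


# Hypothetical sample: one paragraph split across two runs, then an empty one.
SAMPLE_PARAGRAPH_DOC = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    '<w:p><w:r><w:t xml:space="preserve">Example </w:t></w:r>'
    "<w:r><w:t>Document</w:t></w:r></w:p>"
    "<w:p/>"
    "</w:body>"
    "</w:document>"
)

if __name__ == "__main__":
    for text in paragraph_texts(SAMPLE_PARAGRAPH_DOC):
        print(repr(text))
```

The `xml:space="preserve"` attribute on the first run signals that its trailing space is significant, which is why the two runs join into "Example Document" rather than "ExampleDocument".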
    

    References

    • [MS-DOCX] - .docx File Format
    • Office Open XML