Extract tables from Microsoft Office Word documents

To extract tables from a Microsoft Office Word document, use the getStructure method. It returns an XML representation of the document in which each table is represented by a “table” tag. For more details, see Extract text structure.

Warning
The getStructure method returns null if text structure extraction isn’t supported for the document. For example, text structure extraction isn’t supported for TXT files, so getStructure returns null for a TXT file. If a Microsoft Office Word document has no text, getStructure returns an empty org.w3c.dom.Document object.
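The cases described in the warning can be told apart before walking the XML. The sketch below is only an illustration: the StructureCheck class and its describe helper are hypothetical names, and treating a Document with no root element (or a root without children) as "empty" is an assumption about what the library returns for a document without text.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of the library
class StructureCheck {
    // Classifies the value returned by getStructure
    static String describe(Document structure) {
        if (structure == null) {
            // Structure extraction isn't supported (for example, a TXT file)
            return "unsupported";
        }
        Element root = structure.getDocumentElement();
        // Assumption: an "empty" Document has no root element or a childless root
        if (root == null || !root.hasChildNodes()) {
            return "empty";
        }
        return "has-content";
    }

    // Builds an empty DOM document to stand in for a Word file without text
    static Document emptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(null));            // prints "unsupported"
        System.out.println(describe(emptyDocument())); // prints "empty"
    }
}
```

In real code the argument would be the result of parser.getStructure(), and only the "has-content" case needs to be traversed.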

Here are the steps to extract tables from a Microsoft Office Word document:

1. Instantiate a Parser object for the initial document;
2. Call the getStructure method to obtain an org.w3c.dom.Document object;
3. Iterate through the XML document and process the “table” nodes.

The following example demonstrates how to extract tables from a Microsoft Office Word document:

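import com.groupdocs.parser.Parser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // Extract the text structure as an XML document
    Document document = parser.getStructure();
    // getStructure returns null if structure extraction isn't supported
    if (document != null && document.getDocumentElement() != null) {
        // Read the XML document
        readNode(document.getDocumentElement());
    }
}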

private static void readNode(Node node) {
    NodeList nodes = node.getChildNodes();
    // Iterate over the child nodes
    for (int i = 0; i < nodes.getLength(); i++) {
        Node n = nodes.item(i);
        // If it's a table
        // Compare string contents; == would compare object references
        if ("table".equalsIgnoreCase(n.getNodeName())) {
            System.out.println("table");
            // Process node
            processNode(n);
        }
        readNode(n);
    }
}
private static void processNode(Node node) {
    NodeList nodes = node.getChildNodes();
    // Iterate over the child nodes
    for (int i = 0; i < nodes.getLength(); i++) {
        Node n = nodes.item(i);
        switch (n.getNodeName().toLowerCase()) {
            // In the case of a row or cell
            case "tr":
            case "td": {
                // Print the name
                System.out.println(n.getNodeName());
                // Process sub-nodes
                processNode(n);
                System.out.println();
                System.out.println("/" + n.getNodeName());
                break;
            }
            default:
                // Print the node value (if it's not null)
                String value = n.getNodeValue();
                if(value != null) {
                    System.out.print(value);
                }
                processNode(n);
                break;
        }
    }
}
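
Instead of printing node names, the same “table”, “tr” and “td” tags can be collected into rows and cells. The following is a minimal sketch under stated assumptions: the TableRows class, its methods and the SAMPLE XML are hypothetical stand-ins for the document returned by getStructure, tag names are assumed to be lowercase, and nested tables are not handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the library
class TableRows {
    // Made-up structure XML; real output may contain additional elements
    static final String SAMPLE =
            "<document><table>"
            + "<tr><td>Name</td><td>Qty</td></tr>"
            + "<tr><td>Apple</td><td>3</td></tr>"
            + "</table></document>";

    // Parses an XML string into a DOM document
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XML", e);
        }
    }

    // Returns the cell text of the first table, one inner list per row
    static List<List<String>> firstTable(Document doc) {
        List<List<String>> rows = new ArrayList<>();
        NodeList tables = doc.getElementsByTagName("table");
        if (tables.getLength() == 0) {
            return rows;
        }
        Element table = (Element) tables.item(0);
        // getElementsByTagName searches all descendants, so the rows of a
        // nested table would be mixed in with the rows of the outer table
        NodeList trs = table.getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            NodeList tds = ((Element) trs.item(i)).getElementsByTagName("td");
            List<String> cells = new ArrayList<>();
            for (int j = 0; j < tds.getLength(); j++) {
                cells.add(tds.item(j).getTextContent().trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(firstTable(parse(SAMPLE))); // prints [[Name, Qty], [Apple, 3]]
    }
}
```

With the library, the Document passed to firstTable would come from parser.getStructure() after the null check described in the warning above.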

More resources

GitHub examples

You can run the code above and see the feature in action in our GitHub examples.

Free online document parser App

Along with the full-featured Java library, we provide simple but powerful free apps.

You are welcome to parse documents and extract data from PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, emails and more with our Free Online Document Parser App.
