Extract text from Microsoft Office Word documents

To extract text from Microsoft Office Word documents, use the getText and getText(int) methods. They extract text from the entire document or from a selected page, respectively. The TextOptions parameter is ignored for Microsoft Office Word documents.

Here are the steps to extract text from a Microsoft Office Word document:

  • Instantiate a Parser object for the initial document;
  • Call the getText method to obtain a TextReader object;
  • Read the text from the reader.

The following example demonstrates how to extract text from a Microsoft Office Word document:

// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // Extract the document text into the reader
    try (TextReader reader = parser.getText()) {
        // Print the text of the document
        System.out.println(reader.readToEnd());
    }
}

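Here are the steps to extract text from a page of a Microsoft Office Word document:

  • Instantiate a Parser object for the initial document;
  • Call the getDocumentInfo method to obtain an IDocumentInfo object;
  • Call the getPageCount method to get the number of pages;
  • Call the getText(int) method with a zero-based page index to obtain a TextReader object;
  • Read the text from the reader.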

The following example demonstrates how to extract text from each page of a Microsoft Office Word document:

// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SampleDocxWithToc)) {
    // Get the document info
    IDocumentInfo documentInfo = parser.getDocumentInfo();
    // Iterate over pages (the page index is zero-based)
    for (int p = 0; p < documentInfo.getPageCount(); p++) {
        // Print the one-based page number and the total page count
        System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
        // Extract the page text into the reader
        try (TextReader reader = parser.getText(p)) {
            // Print the text of the page
            System.out.println(reader.readToEnd());
        }
    }
}
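
When the per-page text needs further processing rather than printing, it can help to gather it into a list first. The sketch below is not part of GroupDocs.Parser: PageTextCollector and its pageReader argument are hypothetical names, and the IntFunction<String> merely stands in for a call such as parser.getText(p) followed by readToEnd(). It only illustrates the zero-based page index used by the loop above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class PageTextCollector {

    // Collects the text of each page into a list, in page order.
    // pageReader is a hypothetical stand-in for "getText(p), then readToEnd()";
    // it receives a zero-based page index, as in the loop above.
    public static List<String> collect(int pageCount, IntFunction<String> pageReader) {
        List<String> pages = new ArrayList<>(Math.max(pageCount, 0));
        for (int p = 0; p < pageCount; p++) {
            String text = pageReader.apply(p);
            // Store an empty string when a page yields no text
            pages.add(text == null ? "" : text);
        }
        return pages;
    }

    public static void main(String[] args) {
        // Fake three-page document, used only for demonstration
        String[] fake = {"Contents", "Chapter 1", "Chapter 2"};
        List<String> pages = collect(fake.length, p -> fake[p]);
        for (int p = 0; p < pages.size(); p++) {
            // Page labels are one-based, list indexes are zero-based
            System.out.println(String.format("Page %d/%d: %s", p + 1, pages.size(), pages.get(p)));
        }
    }
}
```

In real code, the lambda passed to collect would open the TextReader in a try-with-resources block, as the example above does, so that each reader is closed before the next page is read.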

GroupDocs.Parser can also extract text from Microsoft Office Word documents as HTML, Markdown, and formatted plain text. For more details, see Extract Formatted Text.

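Here are the steps to extract text from a Microsoft Office Word document as HTML:

  • Instantiate a Parser object for the initial document;
  • Create a FormattedTextOptions object with FormattedTextMode.Html;
  • Call the getFormattedText method to obtain a TextReader object;
  • Read the text from the reader.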

The following example shows how to extract text from a Microsoft Office Word document as HTML:

// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // Extract the formatted text (HTML) into the reader
    try (TextReader reader = parser.getFormattedText(new FormattedTextOptions(FormattedTextMode.Html))) {
        // Print the formatted text of the document
        System.out.println(reader.readToEnd());
    }
}
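
Extracted HTML is often written to a file rather than printed. The following is a minimal sketch that uses only the Java standard library. HtmlSaver is a hypothetical helper, not a GroupDocs.Parser class, and the html string stands in for the value returned by reader.readToEnd(). The sketch makes no assumption about whether the markup is a complete page or a fragment and writes the string as-is, encoded as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlSaver {

    // Writes the given markup to the target file as UTF-8,
    // creating the file or overwriting its previous content.
    public static Path save(String html, Path target) throws IOException {
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder markup; in practice this would come from reader.readToEnd()
        String html = "<p>Sample <b>text</b></p>";
        Path out = Files.createTempFile("extracted", ".html");
        save(html, out);
        System.out.println("Saved to " + out);
    }
}
```

Encoding the text explicitly as UTF-8 keeps non-ASCII characters intact regardless of the platform's default charset.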

More resources

GitHub examples

You can run the code above and see the feature in action in our GitHub examples.

Free online document parser App

Along with the full-featured library, we provide simple but powerful free apps.

You are welcome to parse documents and extract data from PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, emails, and more with our Free Online Document Parser App.