The following example shows how to extract HTML-formatted text from a document:
// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
    // Extract formatted text into the reader
    try (TextReader reader = parser.getFormattedText(new FormattedTextOptions(FormattedTextMode.Html))) {
        // Print the formatted text from the document.
        // If formatted text extraction isn't supported, the reader is null.
        System.out.println(reader == null ? "Formatted text extraction isn't supported" : reader.readToEnd());
    }
}
| Tag | Description |
|---|---|
| p | A paragraph is wrapped in a p tag |
| a | Hyperlinks |
| b | Bold text is wrapped in a b tag |
| i | Italic text is wrapped in an i tag |
| h1 - h6 | A heading with the ‘Heading X’ style is wrapped in an <hX> tag |
| ol / ul | Numbered and bulleted lists |
| table | Tables |
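
As a rough way to check which of the tags listed above appear in an extracted string, one could count opening tags with a regular expression. The sketch below uses only the Java standard library. The class name `TagSummary`, the method `countTags`, and the sample HTML string are hypothetical and not part of GroupDocs.Parser, and the actual output of `getFormattedText` may differ in detail.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagSummary {
    // Matches an opening tag such as <p>, <h2>, or <a href="...">.
    // Closing tags (</p>) do not match because '/' is not a letter.
    private static final Pattern OPENING_TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    // Returns a map from lower-cased tag name to the number of opening tags found.
    static Map<String, Integer> countTags(String html) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher matcher = OPENING_TAG.matcher(html);
        while (matcher.find()) {
            String tag = matcher.group(1).toLowerCase(Locale.ROOT);
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; in practice this would be the string
        // returned by reader.readToEnd() in the example above.
        String html = "<h1>Title</h1><p>Some <b>bold</b> and <i>italic</i> text with a "
                + "<a href=\"https://example.com\">link</a>.</p>"
                + "<ul><li>One</li><li>Two</li></ul>";
        countTags(html).forEach((tag, n) -> System.out.println(tag + ": " + n));
    }
}
```

A regular expression is enough for a quick tally like this, but it is not a full HTML parser. For anything beyond counting, a dedicated HTML parsing library would be the safer choice.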
The following Microsoft Word document is used as the input document:
The following HTML document is extracted using the example above:
More resources
GitHub examples
You may easily run the code above and see the feature in action in our GitHub examples: