GroupDocs.Parser for Java 20.3 Release Notes
This page contains release notes for GroupDocs.Parser for Java 20.3
Major Features
There are the following improvements in this release:
- Fixed the bug: Cannot parse large sized PDF file to HTML
- Improved table of contents extraction API
Full List of Issues Covering all Changes in this Release
Key | Summary | Category |
---|---|---|
PARSERJAVA-110 | Cannot parse large sized PDF file to HTML | Bug |
PARSERNET-1432 | Improve the support of text structure extraction | Improvement |
Public API and Backward Incompatible Changes
Improve the support of text structure extraction
Description
This feature adds text extraction from shapes, word art objects and text boxes for Microsoft Office formats. Also added hyperlink extraction for spreadsheets and presentations.
The structure of XML representation of a document was changed. For details, see Extract text structure.
Public API changes
There are no changes in public API
Usage
The following example shows how to extract hyperlinks from the document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleHyperlinksDocx)) {
// Extract text structure to the XML reader
Document document = parser.getStructure();
// Check if text structure extraction is supported
if (document == null) {
System.out.println("Text structure extraction isn't supported.");
return;
}
// Read XML document
readNode(document.getDocumentElement());
}
private static void readNode(Node node) {
NodeList nodes = node.getChildNodes();
for (int i = 0; i < nodes.getLength(); i++) {
Node n = nodes.item(i);
if (n.getNodeName().toLowerCase() == "hyperlink") {
Node a = n.getAttributes().getNamedItem("link");
if (a != null) {
System.out.println(a.getNodeValue());
}
}
if(n.hasChildNodes()) {
readNode(n);
}
}
}