Extract hyperlinks from Microsoft Office Word documents
To extract hyperlinks from a Microsoft Office Word document, use the getStructure method. This method returns an XML representation of the document. Hyperlinks are represented by the “hyperlink” tag; its “link” attribute contains the hyperlink’s URL. For more details, see Extract text structure. A hyperlink can contain text:
<hyperlink link="www.google.com">google.com</hyperlink>
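As a minimal sketch of how such an element can be read with the standard Java DOM API, the snippet below parses the fragment shown above from a string and returns the value of its link attribute together with its text. The class name HyperlinkElementDemo and the describe helper are hypothetical, and the XML is supplied by hand here rather than produced by getStructure.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the parser library.
class HyperlinkElementDemo {

    // Parses a single <hyperlink> element and returns "url -> text".
    static String describe(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element hyperlink = doc.getDocumentElement();
        // getAttribute returns an empty string when the attribute is absent
        String url = hyperlink.getAttribute("link");
        // getTextContent returns the text nested inside the element
        String text = hyperlink.getTextContent();
        return url + " -> " + text;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<hyperlink link=\"www.google.com\">google.com</hyperlink>";
        System.out.println(describe(xml)); // prints "www.google.com -> google.com"
    }
}
```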
Here are the steps to extract hyperlinks from Microsoft Office Word documents:
- Instantiate a Parser object for the initial document;
- Call the getStructure method and obtain an org.w3c.dom.Document object;
- Iterate through the XML document and read the “link” attribute of each “hyperlink” element.
The following example demonstrates how to extract hyperlinks from a Microsoft Office Word document:
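import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SampleHyperlinksDocx)) {
    // Extract the text structure as an XML document
    Document document = parser.getStructure();
    // Walk the XML document, starting from the root element
    readNode(document.getDocumentElement());
}

// Recursively prints the "link" attribute of every hyperlink element
private static void readNode(Node node) {
    NodeList nodes = node.getChildNodes();
    for (int i = 0; i < nodes.getLength(); i++) {
        Node n = nodes.item(i);
        // Use equalsIgnoreCase: == compares object references, not string contents
        if ("hyperlink".equalsIgnoreCase(n.getNodeName())) {
            Node a = n.getAttributes().getNamedItem("link");
            if (a != null) {
                System.out.println(a.getNodeValue());
            }
        }
        // Descend into child elements
        if (n.hasChildNodes()) {
            readNode(n);
        }
    }
}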
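The recursive walk above can also be replaced by a DOM lookup. The following is a minimal sketch, assuming the structure XML spells the tag name exactly as lowercase "hyperlink" (getElementsByTagName is case-sensitive, unlike the case-insensitive comparison above). The class HyperlinkCollector, its collectLinks and parse helpers, and the sample XML with its "document" and "p" wrapper elements are made up for illustration; in real use the Document would come from parser.getStructure().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of the parser library.
class HyperlinkCollector {

    // Returns the "link" attribute of every <hyperlink> element, in document order.
    static List<String> collectLinks(Document document) {
        List<String> links = new ArrayList<>();
        NodeList hyperlinks = document.getElementsByTagName("hyperlink");
        for (int i = 0; i < hyperlinks.getLength(); i++) {
            Element hyperlink = (Element) hyperlinks.item(i);
            // Skip hyperlink elements that carry no URL
            if (hyperlink.hasAttribute("link")) {
                links.add(hyperlink.getAttribute("link"));
            }
        }
        return links;
    }

    // Builds a DOM document from a string; stands in for getStructure in this sketch.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Illustrative XML only; the real structure output may use other wrapper tags
        String xml = "<document>"
                + "<p>See <hyperlink link=\"www.google.com\">google.com</hyperlink></p>"
                + "<p><hyperlink link=\"https://example.com\">example</hyperlink></p>"
                + "<p><hyperlink>no url</hyperlink></p>"
                + "</document>";
        for (String link : collectLinks(parse(xml))) {
            System.out.println(link);
        }
    }
}
```

Collecting the URLs into a list, rather than printing them inside the loop, keeps the extraction step separate from whatever the caller does with the results.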
More resources
GitHub examples
You may easily run the code above and see the feature in action in our GitHub examples:
Free online document parser App
Along with the full-featured library, we provide simple but powerful free apps.
You are welcome to parse documents and extract data from PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, emails, and more with our Free Online Document Parser App.