GroupDocs.Parser provides the functionality to extract the text structure from documents by the GetStructure method:
XmlReaderGetStructure();
This method returns XML representation of a document. A document has the following structure:
Tag
Description
document
The root tag
section
Represents a section of the document. Depending on the document type, it can represent a worksheet, a slide and so on. Can contain the following attributes:
style - the style of the section
name - the name of the section (for example, the name of the sheet)
p
Represents a text paragraph. Can contain the following attribute:
style - the style of paragraph
ul
Represents an unordered list
ol
Represents an ordered list
li
Represents a list item
shape
Represents a shape object.
table
Represents a table
tr
Represents a table row
td
Represents a table cell. Can contain the following attributes:
rowIndex - the zero-based index of the row
columnIndex - the zero-based index of the column
rowSpan - the total number of rows that contain the table cell
columnSpan - the total number of columns that contain the table cell
Tags have the following relations:
document tag can contain any number of section tags
section tag can contain any number of p, ul, ol or table tags in any sequence
table tag can contain any number of tr tags
tr tag can contain any number of td tags
td tag can contain one p tag
The p and li tags can contain hyperlink, strong, em tags and the value that represents a text:
)
Tag
Description
hyperlink
Represents a hyperlink. Can contain the following attribute:
link - URL
strong
Represents a strong emphasis (bold text)
em
Represents a regular emphasis (italic text)
br
Represents a line break (empty tag)
Features of text extraction for different formats
Word processing documents
Word processing documents have a more complex table cell and paragraph can contain any number of shapes.
A table cell can contain any number of paragraphs, lists and tables:
A shape can contain a single hyperlink (empty tag) for the entire shape and any number of paragraphs, lists or tables:
Presentations
Presentations have a more complex table cell and section can contain any number of shapes.
A table cell can contain any number of paragraphs or lists:
A shape can contain a single hyperlink (empty tag) for the entire shape and any number of paragraphs or lists:
Spreadsheets
Spreadsheets have the following document structure:
It’s more simple than others. A section can contain any number of shapes and only one table. A shape can contain a single hyperlink (empty tag) for the entire shape and any numbers of paragraphs.
Examples
Here are the steps to extract hyperlinks from the document:
Instantiate Parser object for the initial document;
Check if document isn’t null (text structure extraction is supported for the document);
Process the XML document.
The following example shows how to extract hyperlinks from the document:
// Create an instance of Parser class
try(Parserparser=newParser(Constants.SampleHyperlinksDocx)){// Extract text structure to the XML reader
Documentdocument=parser.getStructure();// Check if text structure extraction is supported
if(document==null){System.out.println("Text structure extraction isn't supported.");return;}// Read XML document
readNode(document.getDocumentElement());}privatestaticvoidreadNode(Nodenode){NodeListnodes=node.getChildNodes();for(inti=0;i<nodes.getLength();i++){Noden=nodes.item(i);if(n.getNodeName().toLowerCase()=="hyperlink"){Nodea=n.getAttributes().getNamedItem("link");if(a!=null){System.out.println(a.getNodeValue());}}if(n.hasChildNodes()){readNode(n);}}}
The following document:
has the following text structure:
<?xml version="1.0"?><document><section><pstyle="HEADING1"><shape><pstyle="HEADING1"> Lorem ipsum dolor sit amet, <hyperlinklink="google.com">consectetuer</hyperlink> adipiscing elit. Maecenas porttitor congue massa.
</p></shape>Lorem
</p><p> Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed <hyperlinklink="bing.com">pulvinar ultricies</hyperlink>, purus lectus <strong>malesuada</strong> libero, sit amet commodo magna eros quis urna.
</p><ol><li>Nunc viverra imperdiet enim. </li><li>Fusce est.</li></ol><p>Vivamus a tellus. </p><ul><li> Pellentesque <em><strong>habitant morbi tristique senectus et netus et</strong></em> malesuada fames ac turpis egestas.
</li><li>Proin pharetra nonummy pede. </li><li>Mauris et orci. </li><li>Aenean nec lorem.</li></ul><table><tr><tdrowIndex="0"columnIndex="0"rowSpan="1"columnSpan="1"><p>In porttitor.</p></td><tdrowIndex="0"columnIndex="1"rowSpan="1"columnSpan="1"><p> Donec laoreet <strong>nonummy</strong> augue.
</p></td><tdrowIndex="0"columnIndex="2"rowSpan="1"columnSpan="1"><p> Suspendisse dui purus, <em><strong>scelerisque</strong></em> at, vulputate vitae, pretium mattis, nunc.
</p></td><tdrowIndex="0"columnIndex="3"rowSpan="1"columnSpan="1"><p>Mauris eget neque at sem venenatis eleifend. </p></td></tr><tr><tdrowIndex="1"columnIndex="0"rowSpan="1"columnSpan="1"><p>Ut nonummy.</p></td><tdrowIndex="1"columnIndex="1"rowSpan="1"columnSpan="1"><p> Fusce <hyperlinklink="google.com">aliquet pede</hyperlink> non pede.
</p></td><tdrowIndex="1"columnIndex="2"rowSpan="1"columnSpan="1"><p>Suspendisse dapibus lorem pellentesque magna.</p></td><tdrowIndex="1"columnIndex="3"rowSpan="1"columnSpan="1"><p>Integer nulla.</p></td></tr><tr><tdrowIndex="2"columnIndex="0"rowSpan="3"columnSpan="1"><p>Donec blandit feugiat ligula.</p></td><tdrowIndex="4"columnIndex="1"rowSpan="1"columnSpan="1"><p> Donec hendrerit, felis et <em>imperdiet</em> euismod, purus ipsum pretium metus, in lacinia nulla nisl eget sapien.
</p></td><tdrowIndex="2"columnIndex="2"rowSpan="3"columnSpan="1"><p><strong>Donec</strong> ut est in lectus consequat consequat.
</p></td><tdrowIndex="2"columnIndex="3"rowSpan="3"columnSpan="1"><p>Etiam eget dui. Aliquam erat volutpat.</p></td></tr></table><p>Sed at lorem in nunc porta tristique. Proin nec augue.</p></section></document>
More resources
GitHub examples
You may easily run the code above and see the feature in action in our GitHub examples: