GroupDocs.Parser can extract a document's text structure as XML. The XML preserves the document hierarchy, including sections, paragraphs, tables, and lists.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents for testing
Understanding of XML structure
Document structure elements
The XML structure contains the following tags:
Tag        Description
---------  ------------------------------------------------------------------------------
document   The root tag
section    Represents a section (worksheet, slide, etc.). Attributes: style, name
p          Represents a text paragraph. Attribute: style
ul         Represents an unordered list
ol         Represents an ordered list
li         Represents a list item
shape      Represents a shape object
table      Represents a table
tr         Represents a table row
td         Represents a table cell. Attributes: rowIndex, columnIndex, rowSpan, columnSpan
hyperlink  Represents a hyperlink. Attribute: link
strong     Represents bold text
em         Represents italic text
br         Represents a line break
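To make the tag list concrete, here is a minimal sketch that walks a small XML fragment and prints its element tree. The fragment is hand-written from the tag descriptions above and is only an assumption about what real output looks like. The `outline` helper is hypothetical and not part of GroupDocs.Parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, hand-written from the tag table; real output may differ.
SAMPLE_XML = """<document>
  <section name="Intro" style="Normal">
    <p style="Heading1">Welcome</p>
    <p>Read the <hyperlink link="https://example.com">guide</hyperlink><br/>first.</p>
    <ul><li>One</li><li>Two</li></ul>
    <table>
      <tr>
        <td rowIndex="0" columnIndex="0">A</td>
        <td rowIndex="0" columnIndex="1">B</td>
      </tr>
    </table>
  </section>
</document>"""


def outline(element, depth=0):
    """Return one indented line per element: the tag name plus its attributes."""
    attrs = " ".join(f'{key}="{value}"' for key, value in element.attrib.items())
    label = f"{element.tag} {attrs}" if attrs else element.tag
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE_XML)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Printing the outline for a real document is a quick way to see which of the tags above a given format actually produces.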
Extract text structure
To extract the document structure as XML:
    from groupdocs.parser import Parser
    import xml.etree.ElementTree as ET

    # Create an instance of the Parser class
    with Parser("./sample.docx") as parser:
        # Extract the text structure
        xml_reader = parser.get_structure()

        # Check whether structure extraction is supported
        if xml_reader is None:
            print("Text structure extraction isn't supported")
        else:
            # Read the XML content
            xml_content = xml_reader
            print(xml_content)
The following sample file is used in this example: sample.docx
Expected behavior: Returns an XML representation of the document structure, or None if structure extraction is not supported.
Parse and analyze document structure
To parse and analyze the XML structure:
    from groupdocs.parser import Parser
    import xml.etree.ElementTree as ET

    # Create an instance of the Parser class
    with Parser("sample.docx") as parser:
        # Extract the text structure
        xml_reader = parser.get_structure()

        if xml_reader is None:
            print("Structure extraction not supported")
        else:
            # Parse the XML
            xml_content = xml_reader
            root = ET.fromstring(xml_content)

            # Count elements
            sections = root.findall('.//section')
            paragraphs = root.findall('.//p')
            tables = root.findall('.//table')
            hyperlinks = root.findall('.//hyperlink')

            print("Document Structure Analysis:")
            print(f" Sections: {len(sections)}")
            print(f" Paragraphs: {len(paragraphs)}")
            print(f" Tables: {len(tables)}")
            print(f" Hyperlinks: {len(hyperlinks)}")
The following sample file is used in this example: sample.docx
Expected behavior: Parses the XML structure and provides statistics about document elements.
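When it is not known in advance which tags a document contains, counting every tag in one pass can be more convenient than one `findall` call per tag. The following is a minimal sketch using only the standard library. The `count_tags` helper and the sample XML are hypothetical, and the sample is an assumption about the output shape.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_tags(xml_content):
    """Count how often each tag occurs anywhere in the structure XML."""
    root = ET.fromstring(xml_content)
    # root.iter() visits the root element and all of its descendants
    return Counter(element.tag for element in root.iter())


# Hypothetical structure XML used only for illustration
sample = (
    "<document><section>"
    "<p>a</p><p>b<strong>c</strong></p>"
    "<table><tr><td>1</td></tr></table>"
    "</section></document>"
)

counts = count_tags(sample)
for tag, number in counts.most_common():
    print(f"{tag}: {number}")
```

A `Counter` returns 0 for tags that never occur, so `counts["hyperlink"]` is safe to read even when the document has no links.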
Extract all hyperlinks from structure
To extract hyperlinks using the structure:
    from groupdocs.parser import Parser
    import xml.etree.ElementTree as ET

    def extract_hyperlinks_from_structure(file_path):
        """
        Extract all hyperlinks using document structure.
        """
        with Parser(file_path) as parser:
            xml_reader = parser.get_structure()

            if xml_reader is None:
                print("Structure extraction not supported")
                return []

            # Parse the XML
            xml_content = xml_reader
            root = ET.fromstring(xml_content)

            # Find all hyperlink elements
            hyperlinks = []
            for hyperlink in root.findall('.//hyperlink'):
                url = hyperlink.get('link', '')
                text = hyperlink.text or ''
                if url:
                    hyperlinks.append({'text': text, 'url': url})

            return hyperlinks

    # Usage
    links = extract_hyperlinks_from_structure("sample.pdf")
    print(f"Found {len(links)} hyperlinks:")
    for link in links:
        print(f" {link['text']} -> {link['url']}")
The following sample file is used in this example: sample.pdf
Expected behavior: Extracts all hyperlinks with their text and URLs from the document structure.
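In ElementTree, an element's `text` attribute holds only the text that precedes its first child element. If a link label is wrapped in `strong` or `em`, reading `hyperlink.text` alone would return nothing for that label. The sketch below collects the full label with `itertext()`. The `link_label` helper and the sample XML are hypothetical, and whether real output nests formatting inside `hyperlink` is an assumption.

```python
import xml.etree.ElementTree as ET


def link_label(hyperlink):
    """Join all text inside the element, including nested strong/em children."""
    return "".join(hyperlink.itertext()).strip()


# Hypothetical structure XML: the second label is partly bold
sample = (
    "<document><p>"
    '<hyperlink link="https://example.com/a">Plain label</hyperlink>'
    '<hyperlink link="https://example.com/b"><strong>Bold</strong> label</hyperlink>'
    "</p></document>"
)

root = ET.fromstring(sample)
links = [(link_label(h), h.get("link", "")) for h in root.iter("hyperlink")]
for label, url in links:
    print(f"{label} -> {url}")
```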
Extract table structure
To analyze table structure:
    from groupdocs.parser import Parser
    import xml.etree.ElementTree as ET

    def analyze_tables(file_path):
        """
        Analyze table structure in document.
        """
        with Parser(file_path) as parser:
            xml_reader = parser.get_structure()

            if xml_reader is None:
                print("Structure extraction not supported")
                return

            # Parse the XML
            xml_content = xml_reader
            root = ET.fromstring(xml_content)

            # Find all tables
            tables = root.findall('.//table')
            print(f"Found {len(tables)} tables:")

            for idx, table in enumerate(tables, 1):
                rows = table.findall('.//tr')
                print(f"Table {idx}:")
                print(f" Rows: {len(rows)}")

                # Get the column count from the first row
                if rows:
                    cells = rows[0].findall('.//td')
                    print(f" Columns: {len(cells)}")

                # Show cell spanning; count each cell once, even if it
                # carries both a rowSpan and a columnSpan attribute
                cells_with_span = [
                    td for td in table.findall('.//td')
                    if td.get('rowSpan') is not None or td.get('columnSpan') is not None
                ]
                if cells_with_span:
                    print(f" Cells with spanning: {len(cells_with_span)}")
                print()

    # Usage
    analyze_tables("sample.xlsx")
The following sample file is used in this example: sample.xlsx
Expected behavior: Provides detailed analysis of table structures including rows, columns, and cell spanning.
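The `rowIndex` and `columnIndex` attributes on `td` can also be used to rebuild a table as a two-dimensional list. The sketch below assumes the indexes are zero-based integers and leaves positions covered by a spanning cell empty; neither point is confirmed by the text above. It also ignores nested tables. The `table_to_grid` helper and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def table_to_grid(table):
    """Place each cell's text at [rowIndex][columnIndex] in a list of lists."""
    cells = {}
    for td in table.iter("td"):
        # Assumption: indexes are zero-based integers; default to 0 if missing
        row = int(td.get("rowIndex", 0))
        col = int(td.get("columnIndex", 0))
        cells[(row, col)] = "".join(td.itertext()).strip()

    if not cells:
        return []

    n_rows = max(row for row, _ in cells) + 1
    n_cols = max(col for _, col in cells) + 1
    # Positions without a td (for example, covered by a span) stay empty
    return [[cells.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical structure XML for one table with a header spanning two columns
sample = (
    "<table>"
    '<tr><td rowIndex="0" columnIndex="0" columnSpan="2">Header</td></tr>'
    '<tr><td rowIndex="1" columnIndex="0">a</td>'
    '<td rowIndex="1" columnIndex="1">b</td></tr>'
    "</table>"
)

grid = table_to_grid(ET.fromstring(sample))
for row in grid:
    print(row)
```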
Extract formatted text
To extract text with formatting information:
    from groupdocs.parser import Parser
    import xml.etree.ElementTree as ET

    def extract_formatted_text(file_path):
        """
        Extract text with formatting (bold, italic).
        """
        with Parser(file_path) as parser:
            xml_reader = parser.get_structure()

            if xml_reader is None:
                print("Structure extraction not supported")
                return

            # Parse the XML
            xml_content = xml_reader
            root = ET.fromstring(xml_content)

            # Find formatted text
            bold_texts = root.findall('.//strong')
            italic_texts = root.findall('.//em')

            print("Bold text:")
            for text in bold_texts:
                if text.text:
                    print(f" - {text.text}")

            print("\nItalic text:")
            for text in italic_texts:
                if text.text:
                    print(f" - {text.text}")

    # Usage
    extract_formatted_text("sample.docx")
The following sample file is used in this example: sample.docx
Expected behavior: Extracts all bold and italic text from the document.
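Formatting tags can be nested and mixed with plain text inside a paragraph, so it is sometimes useful to flatten a whole `p` element into a single string. The sketch below walks the element recursively and turns each `br` into a newline. The `paragraph_text` helper and the sample paragraph are hypothetical, and the nesting shown is an assumption about real output.

```python
import xml.etree.ElementTree as ET


def paragraph_text(element):
    """Flatten an element to plain text, turning each br into a newline."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "br":
            parts.append("\n")
        else:
            # Recurse so text inside nested strong/em elements is kept
            parts.append(paragraph_text(child))
        # tail is the text that follows the child inside its parent
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical paragraph with nested formatting and a line break
sample = "<p>Hello <strong>bold <em>both</em></strong> world<br/>next</p>"
flat = paragraph_text(ET.fromstring(sample))
print(flat)
```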
Save structure to file
To save the XML structure to a file:
    from groupdocs.parser import Parser

    def save_structure_to_file(file_path, output_xml):
        """
        Extract and save document structure to XML file.
        """
        with Parser(file_path) as parser:
            xml_reader = parser.get_structure()

            if xml_reader is None:
                print("Structure extraction not supported")
                return False

            # Read and save the XML
            xml_content = xml_reader
            with open(output_xml, 'w', encoding='utf-8') as f:
                f.write(xml_content)

            print(f"Structure saved to {output_xml}")
            return True

    # Usage
    save_structure_to_file("sample.pdf", "structure.xml")
Expected behavior: Saves the document structure as an XML file for further processing.
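If the saved file feeds later processing, it can help to re-parse it immediately after writing to confirm that it is well-formed and that non-ASCII text survived the UTF-8 round trip. This is a minimal sketch. The `write_and_verify` helper is hypothetical, and the sample XML stands in for real extracted structure.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_and_verify(xml_content, output_xml):
    """Write XML text as UTF-8, then re-parse the file to confirm it is well-formed."""
    with open(output_xml, "w", encoding="utf-8") as f:
        f.write(xml_content)
    # ET.parse raises xml.etree.ElementTree.ParseError on malformed XML
    return ET.parse(output_xml).getroot()


# Hypothetical structure XML containing a non-ASCII character
sample = "<document><section><p>Café menu</p></section></document>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "structure.xml")
    saved_root = write_and_verify(sample, path)
    saved_text = saved_root.find(".//p").text

print(saved_root.tag, saved_text)
```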
Extract section information
To extract information about document sections:
    from groupdocs.parser import Parser
    import xml.etree.ElementTree as ET

    def extract_section_info(file_path):
        """
        Extract section information from document.
        """
        with Parser(file_path) as parser:
            xml_reader = parser.get_structure()

            if xml_reader is None:
                print("Structure extraction not supported")
                return

            # Parse the XML
            xml_content = xml_reader
            root = ET.fromstring(xml_content)

            # Find all sections
            sections = root.findall('.//section')
            print(f"Document has {len(sections)} sections:")

            for idx, section in enumerate(sections, 1):
                name = section.get('name', f'Section {idx}')
                style = section.get('style', 'default')

                # Count elements in the section
                paragraphs = section.findall('.//p')
                tables = section.findall('.//table')
                lists = section.findall('.//ul') + section.findall('.//ol')

                print(f"Section: {name}")
                print(f" Style: {style}")
                print(f" Paragraphs: {len(paragraphs)}")
                print(f" Tables: {len(tables)}")
                print(f" Lists: {len(lists)}")
                print()

    # Usage
    extract_section_info("sample.pptx")
The following sample file is used in this example: sample.pptx
Expected behavior: Provides detailed information about each section in the document.
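Beyond counting elements, the section boundaries can be used to group content, for example to collect the paragraph text of each slide or worksheet. The sketch below returns a dictionary keyed by section name. The `paragraphs_by_section` helper and the sample XML are hypothetical, and sections that share a name would overwrite each other in this simple form.

```python
import xml.etree.ElementTree as ET


def paragraphs_by_section(xml_content):
    """Map each section name to the list of its paragraph texts."""
    root = ET.fromstring(xml_content)
    result = {}
    for idx, section in enumerate(root.iter("section"), 1):
        # Fall back to a positional name when the name attribute is absent
        name = section.get("name") or f"Section {idx}"
        result[name] = ["".join(p.itertext()).strip() for p in section.iter("p")]
    return result


# Hypothetical structure XML with one named and one unnamed section
sample = (
    "<document>"
    '<section name="Sheet1"><p>Total</p><p>42</p></section>'
    "<section><p>Notes</p></section>"
    "</document>"
)

grouped = paragraphs_by_section(sample)
for name, texts in grouped.items():
    print(f"{name}: {texts}")
```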
Notes
The get_structure() method returns None if structure extraction is not supported
XML structure varies by document type (Word, Excel, PowerPoint)
Use Python’s xml.etree.ElementTree or other XML parsers to process the structure
Structure extraction preserves document hierarchy and formatting information
Useful for document analysis, content extraction, and format conversion
Each document type has its own structure features; see the GroupDocs.Parser documentation for details
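Because get_structure() can return None and the XML varies by document type, calling code may want a single place that separates "unsupported" from "unparseable" before any analysis runs. The sketch below is one way to do that with the standard library. The `parse_structure` helper and its status strings are hypothetical and not part of GroupDocs.Parser.

```python
import xml.etree.ElementTree as ET


def parse_structure(xml_content):
    """Return (root, status); status is 'ok', 'unsupported', or 'malformed'."""
    if xml_content is None:
        # Mirrors the documented None result for unsupported formats
        return None, "unsupported"
    try:
        return ET.fromstring(xml_content), "ok"
    except ET.ParseError:
        return None, "malformed"


# Hypothetical inputs covering the three outcomes
for candidate in (None, "<document><p>text</p></document>", "<document>"):
    root, status = parse_structure(candidate)
    print(status, root.tag if root is not None else "-")
```

Returning a status alongside the root keeps the two failure cases distinguishable without raising, which suits batch jobs that process many files of mixed types.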