GroupDocs.Parser provides functionality to extract text areas with position information (coordinates) and formatting details from documents.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents for testing
Understanding of coordinate systems in documents
What are text areas?
Text areas represent rectangular regions on a page containing text. Each text area includes:
Page index: The page where the text appears
Rectangle: Position and size (x, y, width, height)
Text: The actual text content
Baseline: The baseline position of the text
Text style: Formatting information (font, size, etc.)
Child areas: For composite text areas
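To make the shape of this data concrete, the following sketch models a text area with plain Python dataclasses and walks its child areas recursively. The classes `Rect` and `TextArea` and the helper `iter_leaf_areas` are hypothetical stand-ins for illustration only. They are not GroupDocs.Parser types, and only the field names loosely follow the list above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Rect:
    """Hypothetical stand-in for a rectangle: position plus size."""
    left: float
    top: float
    width: float
    height: float


@dataclass
class TextArea:
    """Hypothetical stand-in for a text area (not the library's class)."""
    page_index: int
    rectangle: Rect
    text: str
    areas: List["TextArea"] = field(default_factory=list)  # child areas


def iter_leaf_areas(area: TextArea) -> Iterator[TextArea]:
    """Yield the areas that have no children, depth first."""
    if not area.areas:
        yield area
        return
    for child in area.areas:
        yield from iter_leaf_areas(child)


# A composite area made of two smaller child areas.
line = TextArea(
    page_index=0,
    rectangle=Rect(10, 20, 120, 12),
    text="Invoice 2024",
    areas=[
        TextArea(0, Rect(10, 20, 60, 12), "Invoice"),
        TextArea(0, Rect(75, 20, 55, 12), "2024"),
    ],
)

leaf_texts = [leaf.text for leaf in iter_leaf_areas(line)]
print(leaf_texts)
```

Flattening a composite area this way is one option when only the smallest positioned fragments matter. Keeping the parent is preferable when the combined text is needed.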
Extract text areas from document
To extract all text areas from a document:
from groupdocs.parser import Parser

# Create an instance of Parser class
with Parser("./sample.pdf") as parser:
    # Extract text areas
    text_areas = parser.get_text_areas()

    # Check if text areas extraction is supported
    if text_areas is None:
        print("Text areas extraction isn't supported")
    else:
        # Iterate over text areas
        for area in text_areas:
            # Print page index, rectangle, and text
            print(f"Page: {area.page.index}, Rectangle: {area.rectangle}, Text: {area.text}")
The following sample file is used in this example: sample.pdf
Expected behavior: Returns a collection of PageTextArea objects with position and text information for each text area in the document.
Extract text areas from document page
To extract text areas from a specific page:
from groupdocs.parser import Parser


def print_text_areas_by_page(file_path):
    # Create an instance of Parser class
    with Parser(file_path) as parser:
        # Check if text areas extraction is supported
        if not parser.features.text_areas:
            print("Document doesn't support text areas extraction")
            return

        # Get document info
        info = parser.get_document_info()

        # Check if document has pages
        if info.page_count == 0:
            print("Document has no pages")
            return

        # Iterate over pages
        for page_index in range(info.page_count):
            print(f"Page {page_index + 1}/{info.page_count}")

            # Extract text areas from the page
            text_areas = parser.get_text_areas(page_index)

            # Iterate over text areas
            if text_areas:
                for area in text_areas:
                    # Print rectangle and text
                    rect = area.rectangle
                    print(f"Position: ({rect.left}, {rect.top}), Size: ({rect.width}x{rect.height})")
                    print(f"Text: {area.text}")


print_text_areas_by_page("./sample.pdf")
The following sample file is used in this example: sample.pdf
Expected behavior: Returns text areas only for the specified page, allowing page-by-page processing.
Extract text areas with options
To extract text areas from a specific region with filtering:
from groupdocs.parser import Parser
from groupdocs.parser.options import PageTextAreaOptions, Rectangle, Point, Size

# Create an instance of Parser class
with Parser("./sample.pdf") as parser:
    # Define the region (upper-left corner, 300x100 pixels)
    region = Rectangle(Point(0, 0), Size(300, 100))

    # Create options to extract only text areas with digits in the specified region
    options = PageTextAreaOptions(r"\d+", region)

    # Extract text areas
    text_areas = parser.get_text_areas(options)

    if text_areas is None:
        print("Text areas extraction isn't supported")
    else:
        # Iterate over filtered text areas
        for area in text_areas:
            print(f"Page: {area.page.index}, Rectangle: {area.rectangle}, Text: {area.text}")
The following sample file is used in this example: sample.pdf
Expected behavior: Returns only text areas that match the regular expression pattern and fall within the specified rectangular region.
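The combined effect of a pattern and a region can be pictured with a small library-independent sketch. `Box`, `contains`, and `filter_areas` are hypothetical names, and the rule that an area must lie entirely inside the region is an assumption made for illustration. The library may treat partially overlapping areas differently.

```python
import re
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Hypothetical rectangle: top-left corner plus size."""
    left: float
    top: float
    width: float
    height: float


def contains(region: Box, box: Box) -> bool:
    """True if `box` lies entirely inside `region` (assumed rule)."""
    return (
        box.left >= region.left
        and box.top >= region.top
        and box.left + box.width <= region.left + region.width
        and box.top + box.height <= region.top + region.height
    )


def filter_areas(areas: List[Tuple[str, Box]], pattern: str, region: Box) -> List[str]:
    """Keep texts whose box is inside `region` and whose text matches `pattern`."""
    regex = re.compile(pattern)
    return [text for text, box in areas if contains(region, box) and regex.search(text)]


# (text, box) pairs standing in for extracted text areas.
areas = [
    ("Order 1042", Box(10, 10, 80, 12)),   # inside region, has digits
    ("Customer", Box(10, 30, 60, 12)),     # inside region, no digits
    ("Total 99", Box(10, 400, 60, 12)),    # has digits, outside region
]

matched = filter_areas(areas, r"\d+", Box(0, 0, 300, 100))
print(matched)
```

Only the first entry passes both conditions; the other two each fail one of them.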
Extract text areas with formatting
To extract text areas and access formatting information:
from groupdocs.parser import Parser

# Create an instance of Parser class
with Parser("./sample.docx") as parser:
    # Extract text areas
    text_areas = parser.get_text_areas()

    if text_areas is None:
        print("Text areas extraction isn't supported")
    else:
        # Iterate over text areas
        for area in text_areas:
            # Get text style if available
            if area.text_style:
                print(f"Text: {area.text}")
                print(f"Font: {area.text_style.name}")
                print(f"Size: {area.text_style.font_size}")
                print(f"Bold: {area.text_style.is_bold}")
                print(f"Italic: {area.text_style.is_italic}")
                print("---")
The following sample file is used in this example: sample.docx
Expected behavior: Extracts text areas with detailed formatting information, useful for document analysis.
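As one possible form of such analysis, the sketch below picks heading candidates from font size and weight. It works on plain tuples that stand in for extracted areas, so it runs without the library. The sample values and the heuristic (bold text larger than the most common font size) are assumptions for illustration, not behavior provided by GroupDocs.Parser.

```python
from collections import Counter

# (text, font_size, is_bold) tuples standing in for extracted text areas.
styled_areas = [
    ("Quarterly Report", 18.0, True),
    ("Revenue grew in every region.", 11.0, False),
    ("Outlook", 14.0, True),
    ("Demand is expected to stay flat.", 11.0, False),
]

# Treat the most common font size as the body text size.
body_size = Counter(size for _, size, _ in styled_areas).most_common(1)[0][0]

# Heuristic: bold text larger than the body size is a heading candidate.
headings = [text for text, size, bold in styled_areas if bold and size > body_size]
print(headings)
```

With real documents, the tuples would be built from each area's text and text style inside the extraction loop.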
Extract specific text from regions
To extract text from multiple predefined regions:
from groupdocs.parser import Parser
from groupdocs.parser.options import PageTextAreaOptions, Rectangle, Point, Size


def extract_from_regions(file_path, regions):
    """
    Extract text from specific regions of a document.

    Args:
        file_path: Path to the document
        regions: List of tuples (x, y, width, height)
    """
    with Parser(file_path) as parser:
        if not parser.features.text_areas:
            print("Text areas extraction not supported")
            return {}

        results = {}

        for idx, (x, y, w, h) in enumerate(regions):
            # Create rectangle for this region
            rect = Rectangle(Point(x, y), Size(w, h))
            options = PageTextAreaOptions(None, rect)

            # Extract text areas from this region
            areas = parser.get_text_areas(options)

            # Collect text from all areas in this region
            text_list = [area.text for area in areas] if areas else []
            results[f"region_{idx}"] = " ".join(text_list)

        return results


# Define regions (e.g., header, body, footer)
regions = [
    (0, 0, 600, 100),    # Header region
    (0, 100, 600, 700),  # Body region
    (0, 800, 600, 100),  # Footer region
]

# Extract text from regions
extracted = extract_from_regions("sample.pdf", regions)

for region_name, text in extracted.items():
    print(f"{region_name}: {text[:100]}...")
The following sample file is used in this example: sample.pdf
Notes
Text area extraction returns position and formatting details for each piece of text, which simple text extraction does not
Not all document formats support text area extraction - check parser.features.text_areas first
Coordinates are in document-specific units (usually points or pixels)
Composite text areas contain child text areas in the areas property
Use regular expressions in PageTextAreaOptions to filter text areas by content
Rectangle coordinates start from the top-left corner (0, 0)
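When coordinates need to be compared across tools, two small conversions are often useful. The sketch below assumes the values are in typographic points (1 point = 1/72 inch) and shows a conversion to pixels plus a flip from a top-left origin to a bottom-left origin. Both helper names are hypothetical, and which unit a given document format actually reports is not established here.

```python
def points_to_pixels(value_pt: float, dpi: float = 96.0) -> float:
    """Convert typographic points (1/72 inch) to pixels at the given DPI."""
    return value_pt * dpi / 72.0


def top_left_to_bottom_left(top: float, height: float, page_height: float) -> float:
    """Return the y of a box's bottom edge measured from the bottom of the page."""
    return page_height - (top + height)


# A 72 pt (one inch) wide box is 96 px wide at 96 DPI.
width_px = points_to_pixels(72.0)

# On a 792 pt tall page (US Letter), a 100 pt tall box at top=0
# has its bottom edge 692 pt above the bottom of the page.
y_from_bottom = top_left_to_bottom_left(top=0.0, height=100.0, page_height=792.0)
print(width_px, y_from_bottom)
```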