GroupDocs.Parser can extract images from a specific rectangular area of a document page, so you can target a defined region instead of the whole page.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents with images
Understanding of coordinate systems
Extract images from page area
To extract images from a specific rectangular area:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size

    # Create an instance of Parser class
    with Parser("./sample.pdf") as parser:
        # Check if image extraction is supported
        if not parser.features.images:
            print("Image extraction not supported")
        else:
            # Define the area (upper-left corner, 300x100 pixels)
            area = Rectangle(Point(0, 0), Size(300, 100))
            # Create options for the area
            options = PageAreaOptions(area)
            # Extract images from the specified area
            images = parser.get_images(options)
            if images:
                print("Images found in area:")
                for idx, image in enumerate(images):
                    print(f"  Image {idx + 1}: {image.file_type}, Page {image.page.index + 1}")
            else:
                print("No images found in the specified area")
The following sample file is used in this example: sample.pdf
Expected behavior: Returns only images that are located within the specified rectangular area.
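To make "located within the specified rectangular area" concrete, here is a minimal sketch of a rectangle overlap test in plain Python. It assumes that partial overlap counts as a match, as the Notes on this page state; the Rect type and the overlaps function are hypothetical helpers for illustration and are not part of GroupDocs.Parser, whose exact hit-testing rule is not shown here.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Plain rectangle: upper-left corner plus size (hypothetical helper type)."""
    left: float
    top: float
    width: float
    height: float


def overlaps(area: Rect, image: Rect) -> bool:
    """Return True if the image rectangle shares any interior region with the area."""
    return (
        image.left < area.left + area.width
        and image.left + image.width > area.left
        and image.top < area.top + area.height
        and image.top + image.height > area.top
    )


area = Rect(0, 0, 300, 100)
print(overlaps(area, Rect(250, 50, 100, 100)))  # image straddles the area boundary
print(overlaps(area, Rect(400, 200, 50, 50)))   # image lies entirely outside the area
```

Under this assumption, an image that only touches the edge of the area without sharing any interior space is treated as outside.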
Extract images from multiple regions
To extract images from several predefined areas:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size
    import os


    def extract_images_from_regions(file_path, regions, output_dir):
        """
        Extract images from multiple predefined regions.

        Args:
            file_path: Path to document
            regions: List of tuples (x, y, width, height)
            output_dir: Directory to save images
        """
        os.makedirs(output_dir, exist_ok=True)
        with Parser(file_path) as parser:
            if not parser.features.images:
                print("Image extraction not supported")
                return
            for region_idx, (x, y, w, h) in enumerate(regions):
                print(f"Processing region {region_idx + 1}: ({x}, {y}, {w}x{h})")
                # Create rectangle for this region
                area = Rectangle(Point(x, y), Size(w, h))
                options = PageAreaOptions(area)
                # Extract images from this region
                images = parser.get_images(options)
                if images:
                    for img_idx, image in enumerate(images):
                        filename = f"region{region_idx + 1}_img{img_idx + 1}{image.file_type.extension}"
                        filepath = os.path.join(output_dir, filename)
                        image.save(filepath)
                        print(f"  Saved: {filename}")


    # Define regions (e.g., header, body, footer)
    regions = [
        (0, 0, 600, 100),    # Header region
        (0, 100, 600, 700),  # Body region
        (0, 800, 600, 100),  # Footer region
    ]

    # Extract images from regions
    extract_images_from_regions("document.pdf", regions, "region_images")
The following sample file is used in this example: document.pdf
Expected behavior: Extracts and saves images organized by the region they were found in.
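The header, body, and footer tuples in the example are hard-coded for a 600x900 page. As a sketch, the same (x, y, width, height) tuples could be derived from the page size; split_into_bands is a hypothetical helper name, and the page dimensions are assumed to be known in the same units the document uses.

```python
def split_into_bands(page_width, page_height, header_height, footer_height):
    """Return header, body, and footer regions as (x, y, width, height) tuples."""
    body_height = page_height - header_height - footer_height
    if body_height < 0:
        raise ValueError("Header and footer are taller than the page")
    return [
        (0, 0, page_width, header_height),                            # header band
        (0, header_height, page_width, body_height),                  # body band
        (0, page_height - footer_height, page_width, footer_height),  # footer band
    ]


regions = split_into_bands(600, 900, header_height=100, footer_height=100)
print(regions)
```

With these inputs the result matches the three regions used in the example, so the list can be passed to extract_images_from_regions unchanged.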
Extract images from page area with page index
To extract images from a specific area on a specific page:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size

    # Create an instance of Parser class
    with Parser("document.pptx") as parser:
        if not parser.features.images:
            print("Image extraction not supported")
        else:
            # Get document info
            info = parser.get_document_info()
            # Process specific page (e.g., first page)
            page_index = 0
            if page_index < info.page_count:
                # Define area (center of page, 400x300)
                area = Rectangle(Point(100, 100), Size(400, 300))
                options = PageAreaOptions(area)
                # Extract images from the area on the specific page
                images = parser.get_images(page_index, options)
                if images:
                    print(f"Images on page {page_index + 1} in specified area:")
                    for image in images:
                        print(f"  Type: {image.file_type}")
                        print(f"  Position: ({image.rectangle.left}, {image.rectangle.top})")
                        print(f"  Size: {image.rectangle.width}x{image.rectangle.height}")
The following sample file is used in this example: document.pptx
Expected behavior: Extracts images from a specific area on a specific page.
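The example places a 400x300 area at (100, 100), which is centered only if the page measures 600x500. A minimal sketch of computing a centered area for any page size follows; centered_area is a hypothetical helper, the 600x500 page size is an assumption, and the result may need rounding if Point and Size expect integers.

```python
def centered_area(page_width, page_height, width, height):
    """Return (x, y, width, height) for an area centered on the page."""
    if width > page_width or height > page_height:
        raise ValueError("Area is larger than the page")
    x = (page_width - width) / 2   # equal margin left and right
    y = (page_height - height) / 2  # equal margin top and bottom
    return (x, y, width, height)


x, y, w, h = centered_area(600, 500, 400, 300)
print(x, y, w, h)
```

The four values can then be passed to Rectangle(Point(x, y), Size(w, h)) as in the example above.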
Extract logo from document header
To extract logo images from the header area:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size
    import os


    def extract_header_logos(file_path, output_dir):
        """
        Extract logo images from document header area.
        """
        os.makedirs(output_dir, exist_ok=True)
        with Parser(file_path) as parser:
            if not parser.features.images:
                print("Image extraction not supported")
                return
            # Define header area (top 150 pixels, full width)
            header_area = Rectangle(Point(0, 0), Size(1000, 150))
            options = PageAreaOptions(header_area)
            # Extract images from header and materialize them once,
            # so they can be counted and then saved without a second extraction
            images = list(parser.get_images(options) or [])
            if images:
                print(f"Found {len(images)} images in header")
                for idx, image in enumerate(images):
                    filename = f"logo_{idx + 1}{image.file_type.extension}"
                    filepath = os.path.join(output_dir, filename)
                    image.save(filepath)
                    print(f"Saved logo: {filename}")
            else:
                print("No images found in header area")


    # Usage
    extract_header_logos("invoice.pdf", "extracted_logos")
The following sample file is used in this example: invoice.pdf
Expected behavior: Extracts logo or brand images typically found in document headers.
The following sample file is used in this example: layout.pdf
Expected behavior: Categorizes and counts images by their location in document quadrants.
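A minimal sketch of how images could be categorized and counted by quadrant, written in plain Python. It assumes each image is described by a (left, top, width, height) tuple, that the page size is known, and that an image belongs to the quadrant containing its center; quadrant_of and count_by_quadrant are hypothetical helpers, not GroupDocs.Parser API.

```python
from collections import Counter


def quadrant_of(left, top, width, height, page_width, page_height):
    """Name the page quadrant that contains the center of the rectangle."""
    center_x = left + width / 2
    center_y = top + height / 2
    # The origin is the top-left corner, so smaller y values are nearer the top.
    vertical = "top" if center_y < page_height / 2 else "bottom"
    horizontal = "left" if center_x < page_width / 2 else "right"
    return f"{vertical}-{horizontal}"


def count_by_quadrant(rects, page_width, page_height):
    """Count rectangles per quadrant."""
    return Counter(quadrant_of(*rect, page_width, page_height) for rect in rects)


# Illustrative rectangles on an assumed 600x800 page
rects = [(10, 10, 100, 50), (400, 20, 100, 100), (50, 600, 200, 100)]
counts = count_by_quadrant(rects, 600, 800)
for name, count in sorted(counts.items()):
    print(f"{name}: {count}")
```

In practice the tuples would be built from each extracted image's rectangle (left, top, width, height), as printed in the page-index example.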
Extract images larger than specific size
To filter images by size within an area:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import PageAreaOptions, Rectangle, Point, Size
    import os


    def extract_large_images_from_area(file_path, area_rect, min_width, min_height, output_dir):
        """
        Extract images larger than specified size from an area.
        """
        os.makedirs(output_dir, exist_ok=True)
        with Parser(file_path) as parser:
            if not parser.features.images:
                print("Image extraction not supported")
                return
            options = PageAreaOptions(area_rect)
            images = parser.get_images(options)
            if not images:
                print("No images found in area")
                return
            saved_count = 0
            for image in images:
                # Check image size
                if image.rectangle.width >= min_width and image.rectangle.height >= min_height:
                    filename = f"large_image_{saved_count + 1}{image.file_type.extension}"
                    filepath = os.path.join(output_dir, filename)
                    image.save(filepath)
                    print(f"Saved: {filename} ({image.rectangle.width}x{image.rectangle.height})")
                    saved_count += 1
            print(f"Total large images saved: {saved_count}")


    # Usage - extract images larger than 100x100 from center area
    center_area = Rectangle(Point(100, 100), Size(400, 600))
    extract_large_images_from_area("document.pdf", center_area, 100, 100, "large_images")
The following sample file is used in this example: document.pdf
Expected behavior: Extracts only images that meet the minimum size criteria within the specified area.
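The size check can be separated from file handling so it is easy to verify on its own. The sketch below uses plain (width, height) tuples in place of image rectangles; meets_min_size is a hypothetical helper. Note that the comparison is inclusive, so an image exactly at the minimum size passes.

```python
def meets_min_size(width, height, min_width, min_height):
    """Return True when both dimensions reach the minimum (inclusive)."""
    return width >= min_width and height >= min_height


# Illustrative image sizes as (width, height)
sizes = [(50, 50), (100, 100), (640, 80), (320, 240)]
large = [size for size in sizes if meets_min_size(*size, 100, 100)]
print(large)
```

A wide but short image such as 640x80 is rejected here because both dimensions must meet the threshold.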
Notes
Always check parser.features.images before extracting images
Coordinates are in document-specific units (points or pixels)
The origin (0, 0) is at the top-left corner
Images partially overlapping the area boundary are included
Use get_images(page_index, options) to extract from a specific page area
Empty collections are returned if no images are found in the area (not None)
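Because coordinates may be in points or pixels, an area measured in one unit sometimes has to be converted to the other. The sketch below relies only on the standard definition of a point as 1/72 inch; the 96 DPI default is an assumption, and which unit a given document format uses is not specified on this page. Both function names are hypothetical.

```python
POINTS_PER_INCH = 72  # standard typographic point: 1/72 inch


def points_to_pixels(points, dpi=96):
    """Convert a length in points to pixels at the given resolution."""
    return points * dpi / POINTS_PER_INCH


def pixels_to_points(pixels, dpi=96):
    """Convert a length in pixels to points at the given resolution."""
    return pixels * POINTS_PER_INCH / dpi


# A 300x100 point area expressed in pixels at an assumed 96 DPI
width_px = points_to_pixels(300)
height_px = points_to_pixels(100)
print(width_px, height_px)
```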