GroupDocs.Parser allows you to extract tables from specific pages of a document, providing precise control over which tables to extract.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents with tables on multiple pages
Basic understanding of page indexing (zero-based)
Extract tables from a specific page
To extract tables from a particular page:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import PageTableAreaOptions


    def extract_tables_from_page(file_path, page_index):
        # Create an instance of Parser class
        with Parser(file_path) as parser:
            # Check if table extraction is supported
            if not parser.features.tables:
                print("Table extraction not supported")
                return

            # Create options for page-specific extraction
            options = PageTableAreaOptions(page_index)

            # Extract tables from the specified page
            tables = parser.get_tables(options)

            if tables:
                print(f"Tables on page {page_index + 1}:")
                for table_idx, table in enumerate(tables):
                    print(f"Table {table_idx + 1}:")
                    print(f"Size: {table.row_count} rows × {table.column_count} columns")

                    # Print table content
                    for row in range(table.row_count):
                        row_data = []
                        for col in range(table.column_count):
                            cell = table[row, col]
                            cell_text = cell.text if cell else ""
                            row_data.append(cell_text)
                        print(" | ".join(row_data))
            else:
                print(f"No tables found on page {page_index + 1}")


    # Usage: index 0 is the first page
    extract_tables_from_page("./report.pdf", 0)
The following sample file is used in this example: report.pdf
Expected behavior: Extracts only tables from the specified page, returning an empty collection if no tables exist on that page.
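The nested row and column loop above can be factored into a reusable helper. The following is a minimal sketch, not part of GroupDocs.Parser: `table_to_rows` is a hypothetical name, and `FakeCell` and `FakeTable` are stand-ins that only mimic the `row_count`, `column_count`, `table[row, col]` and `cell.text` members used in the example, so the helper can be tried without a document.

```python
def table_to_rows(table):
    """Return the table content as a list of rows, each a list of strings."""
    rows = []
    for row in range(table.row_count):
        row_data = []
        for col in range(table.column_count):
            cell = table[row, col]
            # Missing cells and cells without text become empty strings.
            text = cell.text if cell is not None else ""
            row_data.append(text or "")
        rows.append(row_data)
    return rows


class FakeCell:
    """Hypothetical stand-in for a table cell; only `text` is modelled."""

    def __init__(self, text):
        self.text = text


class FakeTable:
    """Hypothetical stand-in exposing row_count, column_count and table[row, col]."""

    def __init__(self, grid):
        self._grid = grid
        self.row_count = len(grid)
        self.column_count = len(grid[0]) if grid else 0

    def __getitem__(self, key):
        row, col = key
        value = self._grid[row][col]
        return None if value is None else FakeCell(value)


demo = FakeTable([["Item", "Qty"], ["Widget", None]])
for line in table_to_rows(demo):
    print(" | ".join(line))
```

Returning plain lists keeps printing, JSON export and comparison code independent of the table object itself.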
Extract tables from all pages iteratively
To process tables page by page:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import PageTableAreaOptions


    def extract_tables_page_by_page(file_path):
        # Create an instance of Parser class
        with Parser(file_path) as parser:
            # Check if table extraction is supported
            if not parser.features.tables:
                print("Table extraction not supported")
                return

            # Get document info
            info = parser.get_document_info()
            if info.page_count == 0:
                print("Document has no pages")
                return

            # Iterate over pages
            total_tables = 0
            for page_index in range(info.page_count):
                print("=" * 50)
                print(f"Page {page_index + 1}/{info.page_count}")
                print("=" * 50)

                # Create options for this page
                options = PageTableAreaOptions(page_index)

                # Extract tables from current page
                tables = parser.get_tables(options)

                if tables:
                    page_table_count = 0
                    for table in tables:
                        page_table_count += 1
                        total_tables += 1
                        print(f"Table {page_table_count}:")
                        print(f"Size: {table.row_count}×{table.column_count}")
                        print(f"Position: ({table.rectangle.left}, {table.rectangle.top})")
                    print(f"Tables on this page: {page_table_count}")
                else:
                    print("No tables on this page")

            print("=" * 50)
            print(f"Total tables in document: {total_tables}")


    # Usage
    extract_tables_page_by_page("./document.docx")
The following sample file is used in this example: document.docx
Expected behavior: Iterates through all pages and extracts tables from each page individually, providing page-specific statistics.
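If the per-page counts are collected in a list instead of only printed, the page-specific statistics can be summarized in one step. This is a minimal sketch under that assumption; `summarize_table_counts` is a hypothetical helper, and the input list is assumed to hold the number of tables found at each zero-based page index.

```python
def summarize_table_counts(tables_per_page):
    """Summarize counts where item i is the number of tables on page index i."""
    # Report 1-based page numbers, matching the printed output above.
    pages_with_tables = [i + 1 for i, n in enumerate(tables_per_page) if n > 0]
    return {
        "total_tables": sum(tables_per_page),
        "pages_with_tables": pages_with_tables,
        "empty_pages": len(tables_per_page) - len(pages_with_tables),
    }


print(summarize_table_counts([2, 0, 1]))
```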
Extract tables from specific pages only
To extract tables from selected pages:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import PageTableAreaOptions


    def extract_tables_from_pages(file_path, pages_to_process):
        # Create an instance of Parser class
        with Parser(file_path) as parser:
            if not parser.features.tables:
                print("Table extraction not supported")
                return

            for page_index in pages_to_process:
                print(f"Processing page {page_index + 1}...")

                # Create options for this page
                options = PageTableAreaOptions(page_index)

                # Extract tables from this page
                tables = parser.get_tables(options)

                if tables:
                    for table_idx, table in enumerate(tables):
                        print(f"  Table {table_idx + 1}: {table.row_count}×{table.column_count}")

                        # Show first row as sample
                        if table.row_count > 0:
                            print("  First row:", end=" ")
                            # Show first 3 columns
                            for col in range(min(3, table.column_count)):
                                cell = table[0, col]
                                if cell:
                                    print(f"[{cell.text}]", end=" ")
                            print("...")
                else:
                    print(f"  No tables found on page {page_index + 1}")


    # Usage: pages 1, 3, and 5 as zero-based indices
    extract_tables_from_pages("report.xlsx", [0, 2, 4])
The following sample file is used in this example: report.xlsx
Expected behavior: Extracts tables only from specified pages, skipping others for efficiency.
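Page selections often arrive as 1-based text such as "1,3,5-7" rather than as a list of zero-based indices. The sketch below shows one way to convert such input; `parse_page_selection` is a hypothetical helper, not a GroupDocs.Parser function, and dropping out-of-range pages silently is a design assumption rather than library behavior.

```python
def parse_page_selection(spec, page_count):
    """Convert a 1-based selection such as "1,3,5-7" to sorted zero-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start > end:
            raise ValueError(f"Invalid range: {part!r}")
        for page_number in range(start, end + 1):
            # Keep only pages that exist in the document.
            if 1 <= page_number <= page_count:
                indices.add(page_number - 1)
    return sorted(indices)


print(parse_page_selection("1,3,5", 10))
```

The resulting list can be used directly as the `pages_to_process` value in the example above, with `page_count` taken from `get_document_info()`.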
Compare tables across pages
To analyze table structure across different pages:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import PageTableAreaOptions


    def compare_tables_across_pages(file_path, page_indices):
        """
        Compare table structures across multiple pages.
        """
        with Parser(file_path) as parser:
            if not parser.features.tables:
                print("Table extraction not supported")
                return

            page_table_info = {}

            for page_index in page_indices:
                options = PageTableAreaOptions(page_index)
                tables = parser.get_tables(options)

                if tables:
                    table_list = list(tables)
                    page_table_info[page_index + 1] = {
                        'count': len(table_list),
                        'structures': [(t.row_count, t.column_count) for t in table_list]
                    }
                else:
                    page_table_info[page_index + 1] = {
                        'count': 0,
                        'structures': []
                    }

            # Print comparison
            print("Table Structure Comparison:")
            print(f"{'Page':<10}{'Tables':<10}{'Structures':<40}")
            print("-" * 60)

            for page_num, info in sorted(page_table_info.items()):
                structures_str = ", ".join([f"{r}×{c}" for r, c in info['structures']])
                print(f"{page_num:<10}{info['count']:<10}{structures_str:<40}")


    # Usage
    compare_tables_across_pages("financial_report.pdf", [0, 1, 2, 3, 4])
Expected behavior: Provides a comparative view of table structures across specified pages.
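To find tables that repeat with the same shape on several pages, the `page_table_info` dictionary built in the function above can be inverted. This is a minimal sketch that assumes the dictionary is made available to the caller, for example by returning it; `group_pages_by_structure` is a hypothetical helper.

```python
from collections import defaultdict


def group_pages_by_structure(page_table_info):
    """Map each (rows, columns) shape to the sorted page numbers where it appears."""
    pages_by_shape = defaultdict(set)
    for page_num, info in page_table_info.items():
        for shape in info["structures"]:
            pages_by_shape[tuple(shape)].add(page_num)
    return {shape: sorted(pages) for shape, pages in pages_by_shape.items()}


# Sample data in the same layout as page_table_info (values are illustrative).
sample = {
    1: {"count": 2, "structures": [(5, 3), (2, 2)]},
    2: {"count": 1, "structures": [(5, 3)]},
    3: {"count": 0, "structures": []},
}
for shape, pages in group_pages_by_structure(sample).items():
    print(f"{shape[0]}×{shape[1]} appears on pages {pages}")
```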
Export tables from specific page to JSON
To export page-specific tables to structured JSON:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import PageTableAreaOptions
    import json


    def export_page_tables_to_json(file_path, page_index, output_file):
        """
        Export tables from a specific page to JSON format.
        """
        with Parser(file_path) as parser:
            if not parser.features.tables:
                print("Table extraction not supported")
                return

            options = PageTableAreaOptions(page_index)
            tables = parser.get_tables(options)

            if not tables:
                print(f"No tables found on page {page_index + 1}")
                return

            # Build JSON structure
            page_data = {
                'page_number': page_index + 1,
                'tables': []
            }

            for table_idx, table in enumerate(tables):
                table_data = {
                    'table_index': table_idx + 1,
                    'rows': table.row_count,
                    'columns': table.column_count,
                    'data': []
                }

                # Extract table data
                for row in range(table.row_count):
                    row_data = []
                    for col in range(table.column_count):
                        cell = table[row, col]
                        row_data.append(cell.text if cell else "")
                    table_data['data'].append(row_data)

                page_data['tables'].append(table_data)

            # Save to JSON
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(page_data, f, indent=2, ensure_ascii=False)

            print(f"Exported {len(page_data['tables'])} tables from page {page_index + 1} to {output_file}")


    # Usage
    export_page_tables_to_json("data.pdf", 0, "page1_tables.json")
The following sample file is used in this example: data.pdf
The example writes its output to: page1_tables.json
Expected behavior: Creates a JSON file containing all tables from the specified page with complete data.
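Once exported, the JSON can be consumed without GroupDocs.Parser. The sketch below turns each table's `data` rows into dictionaries keyed by column name. It assumes the first row of every table is a header row, which the export itself does not guarantee; `tables_to_records` is a hypothetical helper and the sample content is illustrative.

```python
import json


def tables_to_records(page_data):
    """Return one list of row dictionaries per table, keyed by the first row."""
    records = []
    for table in page_data["tables"]:
        rows = table["data"]
        if len(rows) < 2:
            # A header without body rows (or an empty table) yields no records.
            records.append([])
            continue
        header, body = rows[0], rows[1:]
        records.append([dict(zip(header, row)) for row in body])
    return records


# Same layout as the exported file; normally loaded with json.load(open(...)).
sample_json = """
{"page_number": 1,
 "tables": [{"table_index": 1, "rows": 3, "columns": 2,
             "data": [["Item", "Qty"], ["Widget", "4"], ["Gadget", "7"]]}]}
"""
page_data = json.loads(sample_json)
for row in tables_to_records(page_data)[0]:
    print(row)
```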
Extract tables with custom layout from specific page
To use custom table layout for a specific page:
    from groupdocs.parser import Parser
    from groupdocs.parser.options import PageTableAreaOptions
    from groupdocs.parser.templates import TemplateTableLayout


    def extract_tables_with_layout(file_path, page_index):
        # Create an instance of Parser class
        with Parser(file_path) as parser:
            if not parser.features.tables:
                print("Table extraction not supported")
                return

            # Define custom table layout
            layout = TemplateTableLayout(
                [50, 150, 300, 450],   # Column separators
                [200, 220, 240, 260]   # Row separators
            )

            # Create options for specific page with custom layout
            options = PageTableAreaOptions(page_index, layout)

            # Extract tables
            tables = parser.get_tables(options)

            if tables:
                print(f"Tables extracted from page {page_index + 1} with custom layout:")
                for table in tables:
                    print(f"Table: {table.row_count}×{table.column_count}")
                    for row in range(table.row_count):
                        for col in range(table.column_count):
                            cell = table[row, col]
                            if cell:
                                print(f"{cell.text:20}", end=" | ")
                        print()


    # Usage: index 0 is the first page
    extract_tables_with_layout("invoice.pdf", 0)
The following sample file is used in this example: invoice.pdf
Expected behavior: Extracts tables from the specified page using the custom layout definition.
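Separator lists are easier to maintain when derived from column widths and row heights instead of typed by hand. The sketch below reproduces the two lists used in the example above from a starting coordinate and a list of sizes. `separators_from_sizes` is a hypothetical helper; how the library interprets the coordinates (units, origin, whether outer edges are included) is not stated here, so the values are treated as plain numbers.

```python
from itertools import accumulate


def separators_from_sizes(origin, sizes):
    """Return separator positions: `origin`, then one more after each size."""
    return list(accumulate(sizes, initial=origin))


# Three columns of width 100, 150 and 150 starting at x = 50.
column_separators = separators_from_sizes(50, [100, 150, 150])
# Three rows of height 20 starting at y = 200.
row_separators = separators_from_sizes(200, [20, 20, 20])
print(column_separators)
print(row_separators)
```

The two resulting lists can then be passed to `TemplateTableLayout` in place of the literal values.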
Notes
Page indices are zero-based (first page is index 0)
Use PageTableAreaOptions(page_index) to extract from a specific page
Combine with TemplateTableLayout for structured documents with known table positions
Use get_document_info() to determine the total number of pages
Page-by-page extraction can be more memory-efficient for large documents, since only one page's tables are processed at a time
Empty collections are returned for pages without tables (not None)
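Code that relies on `if tables:` or `len(tables)` is simpler to reason about when the result is first normalized to a plain list, because a lazy iterable can be truthy even when it yields nothing. The following is a defensive sketch that handles both an empty collection and None; `as_table_list` is a hypothetical helper, not part of GroupDocs.Parser.

```python
def as_table_list(tables):
    """Return the extraction result as a plain list; None becomes an empty list."""
    if tables is None:
        return []
    return list(tables)


print(as_table_list(None))
print(len(as_table_list(iter([]))))
```

In the examples above it would wrap the extraction call, as in `tables = as_table_list(parser.get_tables(options))`.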