GroupDocs.Parser provides functionality to extract tables from various document formats including Excel spreadsheets (XLS, XLSX), Word documents, PDFs, and PowerPoint presentations.
Prerequisites
GroupDocs.Parser for Python via .NET installed
Sample documents containing tables
Understanding of table structure (rows, columns, cells)
Extract tables from document
To extract all tables from a document:
fromgroupdocs.parserimportParser# Create an instance of Parser classwithParser("./sample.xlsx")asparser:# Check if table extraction is supportedifnotparser.features.tables:print("Table extraction isn't supported")return# Extract tables from the documenttables=parser.get_tables(None)iftables:table_index=0fortableintables:table_index+=1print(f"===Table{table_index}===")print(f"Rows: {table.row_count}, Columns: {table.column_count}")# Iterate over rowsforrow_indexinrange(table.row_count):# Iterate over columnsforcol_indexinrange(table.column_count):# Get table cellcell=table[row_index,col_index]ifcell:print(f"{cell.text}\t",end="")print()# New line after each row
The following sample file is used in this example: sample.xlsx
Expected behavior: Extracts all tables from the document with automatic table detection, returning structured table data with rows, columns, and cell values.
Extract tables from Excel spreadsheets
To extract tables from Excel files:
fromgroupdocs.parserimportParser# Excel file pathexcel_path="data.xlsx"# Create an instance of Parser classwithParser(excel_path)asparser:# Check if table extraction is supportedifnotparser.features.tables:print("Table extraction not supported for this file")return# Extract tables (pass None for automatic detection)tables=parser.get_tables(None)iftables:fortable_idx,tableinenumerate(tables):print(f"Table{table_idx+1}:")print(f"Size: {table.row_count} rows × {table.column_count} columns")print(f"Page: {table.page.index+1}")# Print table dataforrowintable.rows:row_data=[]forcellinrow.cells:cell_text=cell.textifcellelse""row_data.append(cell_text)print(" | ".join(row_data))
The following sample file is used in this example: data.xlsx
Expected behavior: Extracts structured table data from Excel spreadsheets, including all sheets and tables.
Extract tables with custom layout
To extract tables when you know the exact table structure:
fromgroupdocs.parserimportParserfromgroupdocs.parser.optionsimportPageTableAreaOptionsfromgroupdocs.parser.templatesimportTemplateTableLayout# Create an instance of Parser classwithParser("invoice.pdf")asparser:# Check if table extraction is supportedifnotparser.features.tables:print("Table extraction not supported")return# Define table layout (column and row separators in points)# Columns: x-coordinates of vertical separators# Rows: y-coordinates of horizontal separatorslayout=TemplateTableLayout([50,95,275,415,485,545],# Column separators[325,340,365,395]# Row separators)# Create options for table extractionoptions=PageTableAreaOptions(layout)# Extract tables with custom layouttables=parser.get_tables(options)iftables:fortableintables:print(f"Table:{table.row_count}rows×{table.column_count}columns")# Print table contentforrowintable.rows:forcellinrow.cells:ifcell:print(f"{cell.text} | ",end="")print()
The following sample file is used in this example: invoice.pdf
Expected behavior: Extracts tables using predefined column and row positions, useful for structured documents like invoices or forms.
Convert table to CSV
To export extracted tables to CSV format:
fromgroupdocs.parserimportParserimportcsvimportosdefexport_tables_to_csv(file_path,output_dir):"""
Extract tables from document and save as CSV files.
"""os.makedirs(output_dir,exist_ok=True)withParser(file_path)asparser:ifnotparser.features.tables:print("Table extraction not supported")returntables=parser.get_tables(None)ifnottables:print("No tables found")returnfortable_idx,tableinenumerate(tables):# Create CSV filenamecsv_filename=f"table_{table_idx+1}.csv"csv_path=os.path.join(output_dir,csv_filename)# Write table to CSVwithopen(csv_path,'w',newline='',encoding='utf-8')ascsvfile:writer=csv.writer(csvfile)# Write table rowsforrowintable.rows:row_data=[]forcellinrow.cells:cell_text=cell.textifcellelse""row_data.append(cell_text)writer.writerow(row_data)print(f"Saved: {csv_path}")# Usageexport_tables_to_csv("report.xlsx","exported_tables")
The following sample file is used in this example: report.xlsx
Expected behavior: Converts each extracted table to a separate CSV file preserving the table structure.
Get table cell properties
To access detailed cell information:
fromgroupdocs.parserimportParser# Create an instance of Parser classwithParser("document.docx")asparser:ifnotparser.features.tables:print("Table extraction not supported")returntables=parser.get_tables(None)iftables:fortableintables:print(f"Table:{table.row_count}×{table.column_count}")print(f"Position: {table.rectangle}")# Access specific celliftable.row_count>0andtable.column_count>0:cell=table[0,0]# First cellifcell:print(f"Firstcell:")print(f" Text: {cell.text}")print(f" Row index: {cell.row_index}")print(f" Column index: {cell.column_index}")print(f" Row span: {cell.row_span}")print(f" Column span: {cell.column_span}")
The following sample file is used in this example: document.docx
Expected behavior: Provides access to individual cell properties including text, position, and span information.
Extract tables from multiple pages
To extract tables with page information:
fromgroupdocs.parserimportParser# Create an instance of Parser classwithParser("report.pdf")asparser:ifnotparser.features.tables:print("Table extraction not supported")return# Get document infoinfo=parser.get_document_info()print(f"Document has {info.page_count} pages")# Extract all tablestables=parser.get_tables(None)iftables:# Group tables by pagetables_by_page={}fortableintables:page_num=table.page.index+1ifpage_numnotintables_by_page:tables_by_page[page_num]=[]tables_by_page[page_num].append(table)# Print summaryprint(f"Found{len(list(tables))}tablesacross{len(tables_by_page)}pages")forpage_numinsorted(tables_by_page.keys()):page_tables=tables_by_page[page_num]print(f"Page{page_num}:{len(page_tables)}table(s)")foridx,tableinenumerate(page_tables):print(f" Table {idx+1}: {table.row_count}×{table.column_count}")
The following sample file is used in this example: report.pdf
Expected behavior: Extracts all tables and organizes them by page number.
Process large tables efficiently
To handle large tables with many rows:
fromgroupdocs.parserimportParserdefprocess_large_table(file_path,max_rows_to_print=10):"""
Process large tables efficiently, showing only a preview.
"""withParser(file_path)asparser:ifnotparser.features.tables:print("Table extraction not supported")returntables=parser.get_tables(None)iftables:fortable_idx,tableinenumerate(tables):print(f"\nTable {table_idx+1}:")print(f"Total size: {table.row_count} rows × {table.column_count} columns")# Print header (first row)iftable.row_count>0:print("\nHeader:")forcolinrange(table.column_count):cell=table[0,col]ifcell:print(f" Column {col+1}: {cell.text}")# Print preview of datarows_to_show=min(max_rows_to_print,table.row_count)print(f"Showingfirst{rows_to_show}rows:")forrowinrange(rows_to_show):row_data=[]forcolinrange(table.column_count):cell=table[row,col]cell_text=cell.text[:20]ifcellandcell.textelse""# Truncate long textrow_data.append(cell_text)print(" | ".join(row_data))# Usageprocess_large_table("large_spreadsheet.xlsx")