
HTML
<script src="https://gist.github.com/GroupDocsGists/ea14da20df6908943201c73d872c85c9.js?file=TextExtractors-PDFDocuments-extractTextByLinesPdfTextExtractor.java"></script>

Extract Data from PDF Forms

Info

This feature is supported by version 18.9 or greater.

Sometimes it is necessary to extract data from the forms in a PDF document. GroupDocs.Parser supports this through the getFormData method of the PdfTextExtractor class, as shown in the following code sample.

HTML
<script src="https://gist.github.com/GroupDocsGists/c7ee0f8eba5e5fb668b15dfdb6e87512.js?file=PDFDocuments-extractDataFromPDFForms_18.9.java"></script>

Extract Tables from PDF Documents

Info

This feature is supported by version 18.12 or greater.

This feature extracts tables from PDF documents using the TableAreaParser class. An instance of TableAreaParser is available via the property with the same name in the PdfTextExtractor class. Its parseTableArea method extracts a table from a document page.

Code Block
TableArea parseTableArea(int pageIndex, TableAreaLayout tableAreaLayout)


This method accepts the zero-based page index and the layout of the table. The layout is represented by the TableAreaLayout class, which has the following members:

Member                 Description
VerticalSeparators     A collection of vertical separators
HorizontalSeparators   A collection of horizontal separators


These collections represent the bounds of columns and rows. For example, a 2x2 table has 3 vertical and 3 horizontal separators:

 

Panel
---------
|   |   |
---------
|   |   |
---------
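
The separator count above (one more than the number of columns or rows) can be illustrated with a small standalone sketch. It does not call GroupDocs.Parser; the evenSeparators helper, the SeparatorSketch class and the 200 x 100 region are assumptions made purely for illustration, and real tables rarely have evenly spaced bounds.

```java
import java.util.Arrays;

public class SeparatorSketch {

    // Hypothetical helper, not part of GroupDocs.Parser: returns count + 1
    // evenly spaced separator positions from start to end (both inclusive).
    static double[] evenSeparators(double start, double end, int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count must be at least 1");
        }
        double[] positions = new double[count + 1];
        double step = (end - start) / count;
        for (int i = 0; i <= count; i++) {
            positions[i] = start + i * step;
        }
        return positions;
    }

    public static void main(String[] args) {
        // A 2x2 table occupying an assumed 200 x 100 region of the page.
        double[] vertical = evenSeparators(0, 200, 2);   // bounds of 2 columns
        double[] horizontal = evenSeparators(0, 100, 2); // bounds of 2 rows

        System.out.println("Vertical separators:   " + Arrays.toString(vertical));
        System.out.println("Horizontal separators: " + Arrays.toString(horizontal));
    }
}
```

Each array holds three positions: the two outer edges of the table plus the single inner line between the two columns (or rows).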

The TableArea class has the following members:

Member                                    Description
double GetColumnWidth(int column)         Width of the column
double GetRowHeight(int row)              Height of the row
int ColumnCount                           Number of table columns
int RowCount                              Number of table rows
TableAreaCell this[int row, int column]   Cell of the table

The TableAreaCell class has the following members:

Member       Description
TextArea     Content of the cell.
Row          Zero-based index of the row.
Column       Zero-based index of the column.
RowSpan      Number of rows which the cell spans across.
ColumnSpan   Number of columns which the cell spans across.
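
To show how the row, column and span members listed above relate to grid positions, here is a minimal standalone sketch. The Cell class is a hypothetical stand-in that only mirrors those members (with plain text in place of the cell content); it is not the GroupDocs.Parser class, and expand is an illustrative helper.

```java
public class CellSpanSketch {

    // Hypothetical stand-in mirroring the members listed above;
    // this is not the GroupDocs.Parser class itself.
    static final class Cell {
        final String text;
        final int row;
        final int column;
        final int rowSpan;
        final int columnSpan;

        Cell(String text, int row, int column, int rowSpan, int columnSpan) {
            this.text = text;
            this.row = row;
            this.column = column;
            this.rowSpan = rowSpan;
            this.columnSpan = columnSpan;
        }
    }

    // Expands the cells into a rowCount x columnCount grid so that every
    // grid position covered by a span shows the text of the owning cell.
    static String[][] expand(Cell[] cells, int rowCount, int columnCount) {
        String[][] grid = new String[rowCount][columnCount];
        for (Cell cell : cells) {
            for (int r = cell.row; r < cell.row + cell.rowSpan; r++) {
                for (int c = cell.column; c < cell.column + cell.columnSpan; c++) {
                    grid[r][c] = cell.text;
                }
            }
        }
        return grid;
    }

    public static void main(String[] args) {
        Cell[] cells = {
            new Cell("Header", 0, 0, 1, 2), // merged cell spanning both columns
            new Cell("A", 1, 0, 1, 1),
            new Cell("B", 1, 1, 1, 1)
        };
        for (String[] row : expand(cells, 2, 2)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

A cell with a span of 1 in both directions covers exactly one grid position; the merged header cell covers two.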

There are two ways to extract a table from a PDF document:

Creating TableAreaLayout Object Manually

The following code sample shows how to create a TableAreaLayout object manually to extract tables:

HTML
<script src="https://gist.github.com/GroupDocsGists/ea14da20df6908943201c73d872c85c9.js?file=extractTablesManually_PDF_18.12.java"></script>

Using TableAreaDetector Class

The TableAreaDetector class finds table bounds automatically. An instance of TableAreaDetector is available via the property with the same name in the PdfTextExtractor class. Its detectLayouts method searches for tables on a page of the document and returns a collection of table layouts:

Code Block
IList<TableAreaLayout> detectLayouts(int pageIndex, params TableAreaDetectorParameters[] parameters)

This method accepts the zero-based page index and optional parameters that help the detector find tables. If parameters are set, the detector searches only for tables that meet these criteria; in this case, the total number of detected tables must equal the number of parameters passed.

The TableAreaDetectorParameters class has the following members:

Member             Description
MinRowCount        Minimum number of table rows.
MinColumnCount     Minimum number of table columns.
HasMergedCells     Value indicating whether the table has merged cells.
MinVerticalSpace   Minimum width of vertical separators.
Rectangle          Rectangle which bounds a table detection region.

By setting these parameters, a user can tune the detector's behavior: for example, limit the page area in which to search for a table, or disable the search for complex tables (those with merged cells). The following code sample shows how to use TableAreaDetector.

HTML
<script src="https://gist.github.com/GroupDocsGists/ea14da20df6908943201c73d872c85c9.js?file=extractTablesUsingTableAreaDetector_PDF_18.12.java"></script>