...

To extract tables from a PDF document, the TableAreaParser class is used. An instance of TableAreaParser is available via the property of the same name in the PdfTextExtractor class:

Code Block
titleJava
languagejava
PdfTextExtractor extractor = new PdfTextExtractor("document.pdf");
TableAreaParser parser = extractor.getTableAreaParser();

The parseTableArea method is used to extract a table from a document page:

...

 

Panel
---------
|   |   |
---------
|   |   |
---------

TableArea class has the following members:

Member | Description
int RowCount | Number of table rows
int ColumnCount | Number of table columns
TableAreaCell this[int row, int column] | Cell of the table
double GetRowHeight(int row) | Height of the row
double GetColumnWidth(int column) | Width of the column

The TableAreaCell class has the following members:

Member | Description
TextArea | Content of the cell.
Row | Zero-based index of the row.
Column | Zero-based index of the column.
RowSpan | Number of rows which the cell spans across.
ColumnSpan | Number of columns which the cell spans across.
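
The Row, Column, RowSpan and ColumnSpan members describe where a cell starts and how far it extends. The following is a minimal sketch of that idea in plain Java, not the library's implementation: SpanCell and SpanGrid are hypothetical stand-ins introduced only for illustration. It assumes that a merged cell is reported at every grid position it covers, which is why a cell is treated as "continuing" another one when its own row or column differs from the position being visited.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a table cell (not the library's cell class).
final class SpanCell {
    final int row;        // zero-based index of the first row the cell covers
    final int column;     // zero-based index of the first column the cell covers
    final int rowSpan;    // number of rows the cell spans
    final int columnSpan; // number of columns the cell spans
    final String text;

    SpanCell(int row, int column, int rowSpan, int columnSpan, String text) {
        this.row = row;
        this.column = column;
        this.rowSpan = rowSpan;
        this.columnSpan = columnSpan;
        this.text = text;
    }
}

// Hypothetical grid that stores the same cell object in every slot it covers.
final class SpanGrid {
    private final SpanCell[][] slots;

    SpanGrid(int rowCount, int columnCount) {
        slots = new SpanCell[rowCount][columnCount];
    }

    // Register the cell in each grid position covered by its spans.
    void put(SpanCell cell) {
        for (int r = cell.row; r < cell.row + cell.rowSpan; r++) {
            for (int c = cell.column; c < cell.column + cell.columnSpan; c++) {
                slots[r][c] = cell;
            }
        }
    }

    // Returns the cell covering the position, or null if the slot is empty.
    SpanCell get(int row, int column) {
        return slots[row][column];
    }

    // Collect each cell's text once, at the position where the cell starts.
    List<String> originTexts() {
        List<String> texts = new ArrayList<>();
        for (int r = 0; r < slots.length; r++) {
            for (int c = 0; c < slots[r].length; c++) {
                SpanCell cell = slots[r][c];
                // Skip empty slots and slots that continue a merged cell
                if (cell == null || cell.row != r || cell.column != c) {
                    continue;
                }
                texts.add(cell.text);
            }
        }
        return texts;
    }
}
```

In this model, a header cell that spans two columns is returned for both positions of the first row but contributes its text only once.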

Usage:

Code Block
titleJava
languagejava
void parse(String fileName) throws java.lang.Exception {
    // Create a text extractor
    try (PdfTextExtractor extractor = new PdfTextExtractor(fileName)) {
        // Get a table parser
        TableAreaParser parser = extractor.getTableAreaParser();
 
        // Create a table layout
        TableAreaLayout layout = new TableAreaLayout();
 
        // Add vertical separators (columns)
        layout.getVerticalSeparators().add(72.0);
        layout.getVerticalSeparators().add(125.0);
        layout.getVerticalSeparators().add(333.0);
        layout.getVerticalSeparators().add(454.0);
        layout.getVerticalSeparators().add(485.0);
 
        // Add horizontal separators (rows)
        layout.getHorizontalSeparators().add(390.0);
        layout.getHorizontalSeparators().add(417.0);
        layout.getHorizontalSeparators().add(440.0);
        layout.getHorizontalSeparators().add(500.0);
        layout.getHorizontalSeparators().add(521.0);
 
        // Extract a table area
        TableArea tableArea = parser.parseTableArea(0, layout);
 
        // Iterate over rows
        for (int row = 0; row < tableArea.getRowCount(); row++) {
            System.out.print("| ");
            // Iterate over columns
            for (int column = 0; column < tableArea.getColumnCount(); column++) {
                // Get a table cell
                TableAreaCell cell = tableArea.get_Item(row, column);
 
                // If a cell is empty or it continues another cell
                if (cell == null || cell.getColumn() != column || cell.getRow() != row) {
                    // Skip this cell
                    continue;
                }
 
                // Write content of the cell
                System.out.print(cell.getTextArea().getText());
                System.out.print(" | ");
            }
 
            System.out.println();
        }
    }
}
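
To make the separator values in the usage example easier to read, here is a small sketch in plain Java that does not call the library. SeparatorGrid is a hypothetical helper, and it rests on an assumption not stated in the text above: separators are listed in ascending order and each pair of neighbouring separators bounds one column (or one row).

```java
import java.util.List;

// Hypothetical helper; it only does arithmetic on separator positions.
final class SeparatorGrid {
    private SeparatorGrid() {
    }

    // Number of columns (or rows) bounded by the separators.
    // Assumption: N ascending separators bound N - 1 bands.
    static int bandCount(List<Double> separators) {
        return Math.max(0, separators.size() - 1);
    }

    // Width (or height) of each band: the distance between neighbouring separators.
    static double[] bandSizes(List<Double> separators) {
        double[] sizes = new double[bandCount(separators)];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = separators.get(i + 1) - separators.get(i);
        }
        return sizes;
    }
}
```

Under that assumption, the five vertical separators above (72, 125, 333, 454, 485) bound four columns with widths 53, 208, 121 and 31, and the five horizontal separators (390, 417, 440, 500, 521) bound four rows with heights 27, 23, 60 and 21.
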
A user can create a TableAreaLayout object manually or by using the TableAreaDetector class. An instance of TableAreaDetector is available via the property of the same name in the PdfTextExtractor class:
Code Block
titleJava
languagejava
PdfTextExtractor extractor = new PdfTextExtractor("document.pdf");
TableAreaDetector detector = extractor.getTableAreaDetector();
The TableAreaDetector class finds table bounds automatically. The detectLayouts method searches for tables on a document page and returns a collection of table layouts:

 

Code Block
titleJava
languagejava
java.util.List<TableAreaLayout> detectLayouts(int pageIndex, TableAreaDetectorParameters... parameters)

This method accepts a zero-based page index and optional parameters that help the detector find tables. If parameters are passed, the detector searches only for tables that meet the given criteria, and the number of detected tables must equal the number of parameters passed.

TableAreaDetectorParameters class has the following members:

Member | Description
MinRowCount | Minimum number of table rows.
MinColumnCount | Minimum number of table columns.
HasMergedCells | Value indicating whether the table has merged cells.
MinVerticalSpace | Minimum width of vertical separators.
Rectangle | Rectangle which bounds a table detection region.

By setting parameters, a user can tune the detector's behavior. For example, the page area searched for a table can be limited, and the search for complex tables (those with merged cells) can be disabled:

Code Block
titleJava
languagejava
void detectAndParse(String fileName) throws java.lang.Exception {
    // Create a text extractor
    try (PdfTextExtractor extractor = new PdfTextExtractor(fileName)) {
        // Get a table detector
        TableAreaDetector detector = extractor.getTableAreaDetector();
 
        int pageIndex = 0;
 
        // Get a page object
        DocumentPage page = extractor.getDocumentContent().getPage(pageIndex);
        // Create a parameter to help the detector to search a table
        TableAreaDetectorParameters parameter = new TableAreaDetectorParameters();
        // We assume that the table is placed in the middle of the page and is half the page height
        parameter.setRectangle(new Rectangle(0, page.getHeight() / 3, page.getWidth(), page.getHeight() / 2));
        // The table has no merged cells
        parameter.setMergedCells(false);
        // Table contains 3 or more rows
        parameter.setMinRowCount(3);
        // Table contains 4 or more columns
        parameter.setMinColumnCount(4);
 
        // Detect layouts
        java.util.List<TableAreaLayout> layouts = detector.detectLayouts(pageIndex, parameter);
 
        // If layouts collection is empty - exit
        if (layouts.size() == 0) {
            System.out.println("No tables found");
            return;
        }
 
        // Get a table parser
        TableAreaParser parser = extractor.getTableAreaParser();
        // Extract a table area. As we pass only one parameter, there is only one layout
        TableArea tableArea = parser.parseTableArea(pageIndex, layouts.get(0));
 
        // Iterate over rows
        for (int row = 0; row < tableArea.getRowCount(); row++) {
            System.out.print("| ");
            // Iterate over columns
            for (int column = 0; column < tableArea.getColumnCount(); column++) {
                // Get a table cell
                TableAreaCell cell = tableArea.get_Item(row, column);
 
                // If a cell is empty or it continues another cell
                if (cell == null || cell.getColumn() != column || cell.getRow() != row) {
                    // Skip this cell
                    continue;
                }
 
                // Write content of the cell
                System.out.print(cell.getTextArea().getText());
                System.out.print(" | ");
            }
 
            System.out.println();
        }
    }
}
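
The detection rectangle in the example above is built from the page size. As a rough illustration of that arithmetic only, the sketch below uses a hypothetical DetectionRegion class rather than the library's Rectangle, and assumes the four constructor arguments mean left, top, width and height, which the comment in the example suggests but the text does not state.

```java
// Hypothetical stand-in for a detection rectangle (not the library's Rectangle).
final class DetectionRegion {
    final double left;
    final double top;
    final double width;
    final double height;

    DetectionRegion(double left, double top, double width, double height) {
        this.left = left;
        this.top = top;
        this.width = width;
        this.height = height;
    }

    // Same arithmetic as the example: full page width, starting one third
    // of the way down the page and extending for half the page height.
    static DetectionRegion middleBand(double pageWidth, double pageHeight) {
        return new DetectionRegion(0, pageHeight / 3, pageWidth, pageHeight / 2);
    }

    // Vertical coordinate where the region ends.
    double bottom() {
        return top + height;
    }

    // True if the region lies completely within a page of the given size.
    boolean fitsInside(double pageWidth, double pageHeight) {
        return left >= 0 && top >= 0
                && left + width <= pageWidth
                && bottom() <= pageHeight;
    }
}
```

With these assumptions the region runs from one third to five sixths of the page height, so it always stays inside the page; for a page 792 units high it covers the band from 264 to 660.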

 

...

 

Code Block
titleJava
languagejava
void detect(String fileName, String password) throws java.lang.Exception {
    // Create load options
    LoadOptions loadOptions = new LoadOptions();
    // Set a password
    loadOptions.setPassword(password);
 
    // Get a default composite media type detector
    MediaTypeDetector detector = CompositeMediaTypeDetector.DEFAULT;
 
    // Create a stream to detect media type by content (not file extension)
    try (java.io.InputStream stream = new java.io.FileInputStream(fileName)) {
        // Detect a media type
        String mediaType = detector.detect(stream, loadOptions);
        // Print a detected media type
        System.out.println(mediaType);
    }
}

 

For batch document processing, PasswordProvider is used:

...