Both methods return a TextReader instance that contains the extracted text. The first method extracts text from the whole document. The second method extracts text from a single document page. To retrieve the total number of document pages, use the getDocumentInfo method (see below).
Warning
In raw mode, use the getRawPageCount method instead of the accurate-mode page count to avoid extra calculations.
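Here are the steps to extract raw text from the whole document:
Instantiate a Parser object for the initial document;
Instantiate a TextOptions object with the true parameter;
Call the getText method with the TextOptions object and obtain a TextReader object;
Check that the reader isn't null (a null reader means text extraction isn't supported for the document);
Read the text from the reader.
The following example shows how to extract raw text from a document: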
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // Extract a raw text into the reader
    try (TextReader reader = parser.getText(new TextOptions(true))) {
        // Print a text from the document
        // If text extraction isn't supported, a reader is null
        System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
    }
}
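The null check in the example can also be expressed with Optional, which keeps the "not supported" case explicit for the caller. The sketch below is only an illustration: ExtractedText is a hypothetical helper, not part of GroupDocs.Parser, and the Supplier stands in for the call that opens the reader and reads it to the end, so the snippet runs without the library.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper (an assumption for illustration, not a GroupDocs.Parser class).
class ExtractedText {
    // The supplier plays the role of "get the reader and call readToEnd()".
    // It is expected to return null when text extraction isn't supported.
    static Optional<String> from(Supplier<String> extraction) {
        return Optional.ofNullable(extraction.get());
    }

    public static void main(String[] args) {
        // Stand-ins for a supported and an unsupported document
        Optional<String> supported = from(() -> "Sample text");
        Optional<String> unsupported = from(() -> null);

        // Print the text, or a fallback message when nothing was extracted
        System.out.println(supported.orElse("Text extraction isn't supported"));
        System.out.println(unsupported.orElse("Text extraction isn't supported"));
    }
}
```

With the real API, the supplier body would presumably wrap the try-with-resources block shown above and return null when the reader is null.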
Extract text from page
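Here are the steps to extract raw text from a document page:
Instantiate a Parser object for the initial document;
Instantiate a TextOptions object with the true parameter;
Call the isText method (available through getFeatures) to check whether text extraction is supported for the document;
Call the getDocumentInfo method and then getRawPageCount to obtain the total number of document pages;
Call the getText method with the zero-based page index and the TextOptions object and obtain a TextReader object;
Read the text from the reader.
The following example shows how to extract raw text from a document page: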
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // Check if the document supports text extraction
    if (!parser.getFeatures().isText()) {
        System.out.println("Document doesn't support text extraction.");
        return;
    }
    // Get the document info
    IDocumentInfo documentInfo = parser.getDocumentInfo();
    // Check if the document has pages
    if (documentInfo == null || documentInfo.getRawPageCount() == 0) {
        System.out.println("Document has no pages.");
        return;
    }
    // Iterate over pages
    for (int p = 0; p < documentInfo.getRawPageCount(); p++) {
        // Print a page number (raw page count, to match the loop bound and raw mode)
        System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getRawPageCount()));
        // Extract a text into the reader
        try (TextReader reader = parser.getText(p, new TextOptions(true))) {
            // Print a text from the document
            // We ignore null-checking as we have checked text extraction feature support earlier
            System.out.println(reader.readToEnd());
        }
    }
}
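If the page text is needed for further processing rather than printing, the same loop can collect it into a list. The sketch below is a minimal illustration under stated assumptions: PageTextSource and PageTextCollector are hypothetical names, and the interface only stands in for the Parser calls used above (getRawPageCount and getText with a page index), so the snippet runs without the library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Parser calls used in the page loop (an assumption).
interface PageTextSource {
    // Plays the role of getDocumentInfo().getRawPageCount()
    int rawPageCount();

    // Plays the role of getText(p, new TextOptions(true)) followed by readToEnd()
    String rawPageText(int pageIndex);
}

class PageTextCollector {
    // Returns one string per page, in page order; an empty list if there are no pages.
    static List<String> collect(PageTextSource source) {
        List<String> pages = new ArrayList<>();
        int count = source.rawPageCount();
        // Page indexes are zero-based, as in the example above
        for (int p = 0; p < count; p++) {
            pages.add(source.rawPageText(p));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Fake two-page document used only to exercise the loop
        final String[] fake = {"first page", "second page"};
        PageTextSource source = new PageTextSource() {
            public int rawPageCount() { return fake.length; }
            public String rawPageText(int pageIndex) { return fake[pageIndex]; }
        };

        List<String> pages = collect(source);
        for (int p = 0; p < pages.size(); p++) {
            System.out.println(String.format("Page %d/%d: %s", p + 1, pages.size(), pages.get(p)));
        }
    }
}
```

Reading the page count once before the loop avoids calling the count method on every iteration; whether that matters for the real getRawPageCount is not stated in this document.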
More resources
GitHub examples
You may easily run the code above and see the feature in action in our GitHub examples: