To extract text from PDF documents, use the getText and getText(int) methods. They extract text from the entire document or from a selected page.
Here are the steps to extract text from a PDF document:
Instantiate a Parser object for the initial document;
Call the getText method and obtain a TextReader object;
Read the text from the reader.
getText returns null if text extraction isn't supported for the document. For example, text extraction isn't supported for Zip archives, so getText returns null for a Zip archive. For an empty PDF document, getText returns an empty TextReader object (its readToEnd method returns an empty string).
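The three outcomes described above (null for an unsupported format, an empty string for an empty document, text otherwise) can be handled in one place. The following is a minimal sketch, not part of GroupDocs.Parser: it assumes java.io.Reader as a stand-in for TextReader, and the ExtractedText class and its readAll method are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;

// Hypothetical helper, not part of GroupDocs.Parser.
// java.io.Reader stands in for the library's TextReader.
final class ExtractedText {
    private ExtractedText() {}

    // Returns Optional.empty() for a null reader (extraction unsupported);
    // otherwise returns the full text, which may be "" for an empty document.
    static Optional<String> readAll(Reader reader) {
        if (reader == null) {
            return Optional.empty();
        }
        try (Reader r = reader) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int read;
            while ((read = r.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            return Optional.of(text.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unsupported format: no reader at all
        System.out.println(readAll(null).isPresent());                      // false
        // Empty document: a reader that yields an empty string
        System.out.println(readAll(new StringReader("")).get().isEmpty());  // true
        // Regular document
        System.out.println(readAll(new StringReader("Hello")).get());       // Hello
    }
}
```

Returning an Optional keeps "extraction not supported" distinct from "document is empty", which a plain String return value would blur.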
The following example demonstrates how to extract text from a PDF document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // Extract text into the reader
    try (TextReader reader = parser.getText()) {
        // Print text from the document
        System.out.println(reader.readToEnd());
    }
}
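Here are the steps to extract text from a page of a PDF document:
Instantiate a Parser object for the initial document;
Call the getDocumentInfo method to obtain the page count;
Call the getText(int) method with a zero-based page index and obtain a TextReader object;
Read the text from the reader.
The following example demonstrates how to extract text from each page of a PDF document: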
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // Get the document info
    IDocumentInfo documentInfo = parser.getDocumentInfo();
    // Iterate over pages
    for (int p = 0; p < documentInfo.getPageCount(); p++) {
        // Print a page number
        System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
        // Extract text into the reader
        try (TextReader reader = parser.getText(p)) {
            // Print text from the document page
            System.out.println(reader.readToEnd());
        }
    }
}
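The loop above passes a zero-based page index to the extraction call and prints a one-based page number. That pattern can be isolated from the library, as in the sketch below. PageTexts and collect are hypothetical names, and the IntFunction argument is an assumed stand-in for a call such as parser.getText(p) followed by readToEnd.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper, not part of GroupDocs.Parser.
final class PageTexts {
    private PageTexts() {}

    // Calls extractPage with zero-based indexes 0..pageCount-1 and labels
    // each result with a one-based "Page n/total" prefix.
    static List<String> collect(int pageCount, IntFunction<String> extractPage) {
        List<String> pages = new ArrayList<>();
        for (int p = 0; p < pageCount; p++) {
            String text = extractPage.apply(p);
            pages.add(String.format("Page %d/%d: %s", p + 1, pageCount, text));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Made-up page contents standing in for real extraction results
        String[] fakePages = {"first page", "second page"};
        for (String line : collect(fakePages.length, p -> fakePages[p])) {
            System.out.println(line);
        }
    }
}
```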
Raw mode speeds up text extraction at the cost of formatting accuracy. Use the getText(TextOptions) and getText(int, TextOptions) methods to extract text in raw mode.
Warning
Some documents may have a different number of pages in raw and accurate modes. In raw mode, use getRawPageCount instead of getPageCount.
Here are the steps to extract raw text from a page of a PDF document:
Instantiate a Parser object for the initial document;
Instantiate a TextOptions object with the true parameter;
Call the getText(int, TextOptions) method and obtain a TextReader object;
Read the text from the reader.
The following example demonstrates how to extract raw text from each page of a PDF document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // Check if the document supports text extraction
    if (!parser.getFeatures().isText()) {
        System.out.println("Document doesn't support text extraction.");
        return;
    }
    // Get the document info
    DocumentInfo documentInfo = parser.getDocumentInfo() instanceof DocumentInfo
            ? (DocumentInfo) parser.getDocumentInfo()
            : null;
    // Check if the document has pages
    if (documentInfo == null || documentInfo.getRawPageCount() == 0) {
        System.out.println("Document has no pages.");
        return;
    }
    // Iterate over pages
    for (int p = 0; p < documentInfo.getRawPageCount(); p++) {
        // Print a page number (the raw page count is used because extraction runs in raw mode)
        System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getRawPageCount()));
        // Extract text into the reader
        try (TextReader reader = parser.getText(p, new TextOptions(true))) {
            // Print text from the document page
            // Null-checking is skipped because text extraction support was checked earlier
            System.out.println(reader.readToEnd());
        }
    }
}
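The warning about page counts matters because a page index that is valid in one mode may be out of range in the other. The sketch below illustrates the rule with made-up numbers. PageCounts is a hypothetical class, not a GroupDocs.Parser type, and its two fields are assumed to hold the values that getPageCount and getRawPageCount would report.

```java
// Hypothetical holder for the two page counts of one document.
final class PageCounts {
    private final int accurate; // assumed to mirror getPageCount
    private final int raw;      // assumed to mirror getRawPageCount

    PageCounts(int accurate, int raw) {
        this.accurate = accurate;
        this.raw = raw;
    }

    // Picks the page count that matches the extraction mode.
    int forMode(boolean rawMode) {
        return rawMode ? raw : accurate;
    }

    // A zero-based index is valid only against the count for the same mode.
    boolean isValidIndex(int pageIndex, boolean rawMode) {
        return pageIndex >= 0 && pageIndex < forMode(rawMode);
    }

    public static void main(String[] args) {
        PageCounts counts = new PageCounts(3, 4); // made-up numbers
        System.out.println(counts.forMode(true));           // 4
        System.out.println(counts.isValidIndex(3, true));   // true
        System.out.println(counts.isValidIndex(3, false));  // false
    }
}
```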
More resources
GitHub examples
You may easily run the code above and see the feature in action in our GitHub examples:
Along with the full-featured .NET library, we provide simple but powerful free apps.
You are welcome to parse documents and extract data from PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, emails, and more with our Free Online Document Parser App.