Extract text from PDF documents
To extract text from PDF documents, use the GetText and GetText(pageIndex) methods. GetText extracts text from the entire document, and GetText(pageIndex) extracts text from a selected page.
Here are the steps to extract text from a PDF document:
- Instantiate a Parser object for the document;
- Call the GetText method and obtain a TextReader object;
- Read the text from the reader.
The following example demonstrates how to extract text from a PDF document:
// Create an instance of the Parser class
using (Parser parser = new Parser(filePath))
{
    // Extract text into the reader
    using (TextReader reader = parser.GetText())
    {
        // Print the text from the document
        Console.WriteLine(reader.ReadToEnd());
    }
}
Here are the steps to extract text from a single page of a PDF document:
- Instantiate a Parser object for the document;
- Call the GetDocumentInfo method and obtain an IDocumentInfo object with the page count;
- Call the GetText(pageIndex) method with the zero-based page index and obtain a TextReader object;
- Read the text from the reader.
The following example demonstrates how to extract text from each page of a PDF document:
// Create an instance of the Parser class
using (Parser parser = new Parser(filePath))
{
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();
    // Iterate over pages (page indexes are zero-based)
    for (int p = 0; p < documentInfo.PageCount; p++)
    {
        // Print the page number
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.PageCount));
        // Extract the page text into the reader
        using (TextReader reader = parser.GetText(p))
        {
            // Print the text from the page
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
Raw mode speeds up text extraction at the cost of formatting accuracy. Use the GetText(TextOptions) and GetText(pageIndex, TextOptions) methods to extract text in raw mode.
Here are the steps to extract raw text from a page of a PDF document:
- Instantiate a Parser object for the document;
- Instantiate a TextOptions object, passing true to enable raw mode;
- Call the GetDocumentInfo method and obtain an IDocumentInfo object;
- Use RawPageCount instead of PageCount to avoid extra calculations;
- Call the GetText(pageIndex, TextOptions) method with the page index and obtain a TextReader object;
- Read the text from the reader.
The following example demonstrates how to extract raw text from each page of a PDF document:
// Create an instance of the Parser class
using (Parser parser = new Parser(filePath))
{
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();
    // Check if the document has pages
    if (documentInfo == null || documentInfo.RawPageCount == 0)
    {
        Console.WriteLine("Document has no pages.");
        return;
    }
    // Iterate over pages (page indexes are zero-based)
    for (int p = 0; p < documentInfo.RawPageCount; p++)
    {
        // Print the page number
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.RawPageCount));
        // Extract the raw page text into the reader
        using (TextReader reader = parser.GetText(p, new TextOptions(true)))
        {
            // Print the text from the document page
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
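The GetText(TextOptions) overload mentioned above can be used the same way to read the whole document in raw mode, without iterating over pages. The following is a minimal sketch rather than an official sample: the class name, the fallback file name "sample.pdf", and the null check on the reader (for the case where text extraction is not supported) are assumptions added for illustration, as is the GroupDocs.Parser.Options namespace for TextOptions.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options; // assumed namespace of TextOptions

class RawTextExample // hypothetical class name
{
    static void Main(string[] args)
    {
        // Hypothetical input path; pass a real PDF file as the first argument
        string filePath = args.Length > 0 ? args[0] : "sample.pdf";

        // Create an instance of the Parser class
        using (Parser parser = new Parser(filePath))
        {
            // Extract raw text from the whole document
            using (TextReader reader = parser.GetText(new TextOptions(true)))
            {
                // Assumption: the reader is null when text extraction isn't supported
                Console.WriteLine(reader == null
                    ? "Text extraction isn't supported."
                    : reader.ReadToEnd());
            }
        }
    }
}
```

Because no per-page calls are made, this variant does not need GetDocumentInfo or RawPageCount.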
More resources
GitHub examples
You can run the code above and see the feature in action in our GitHub examples.
Free online document parser App
Along with the full-featured .NET library, we provide simple but powerful free apps.
You are welcome to parse documents and extract data from PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, emails and more with our Free Online Document Parser App.