Extract text from PDF documents

Warning
The GetText method returns null if text extraction isn't supported for the document. For example, text extraction isn't supported for ZIP archives, so GetText returns null for them. For an empty PDF document, GetText returns an empty TextReader object (reader.ReadToEnd returns an empty string).

The following example demonstrates how to extract text from a PDF document:

// Create an instance of Parser class
using (Parser parser = new Parser(filePath))
{
    // Extract text into the reader
    using (TextReader reader = parser.GetText())
    {
        // If text extraction isn't supported, reader is null
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}
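
The warning above describes three possible outcomes: a null reader, an empty reader, and a reader with text. The sketch below shows one way to tell them apart. It is only an illustration: ExtractionOutcome and Describe are hypothetical names that are not part of the library, and a StringReader stands in for the reader that parser.GetText() would return.

```csharp
using System;
using System.IO;

public static class ExtractionOutcome
{
    // Hypothetical helper: classifies the reader returned by a text extraction call.
    public static string Describe(TextReader reader)
    {
        // A null reader means text extraction isn't supported for the document
        if (reader == null)
        {
            return "Text extraction isn't supported";
        }

        // An empty string means the document was parsed but contains no text
        string text = reader.ReadToEnd();
        return text.Length == 0 ? "Document contains no text" : text;
    }

    public static void Main()
    {
        // StringReader is a stand-in for the result of parser.GetText()
        Console.WriteLine(Describe(null));
        Console.WriteLine(Describe(new StringReader(string.Empty)));
        Console.WriteLine(Describe(new StringReader("Hello, PDF")));
    }
}
```

In real code the reader would come from parser.GetText() and should still be wrapped in a using statement so that it is disposed.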

Here are the steps to extract text from a single page of a PDF document:

Instantiate a Parser object for the document;
Call the GetDocumentInfo method to obtain an IDocumentInfo object with the page count;
Call the GetText(pageIndex) method with the zero-based page index to obtain a TextReader object;
Read the text from the reader.

The following example demonstrates how to extract text from each page of a PDF document:

// Create an instance of Parser class
using(Parser parser = new Parser(filePath))
{
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();
   
    // Iterate over pages
    for(int p = 0; p < documentInfo.PageCount; p++)
    {
        // Print a page number 
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.PageCount));
   
        // Extract a text into the reader
        using(TextReader reader = parser.GetText(p))
        {
            // Print the page text; reader is null if text extraction isn't supported
            Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
        }
    }
}

Raw mode speeds up text extraction at the cost of formatting accuracy. The GetText(TextOptions) and GetText(pageIndex, TextOptions) methods are used to extract text in raw mode.

Warning
Some documents may have a different number of pages in raw and accurate modes. In raw mode, use IDocumentInfo.RawPageCount instead of IDocumentInfo.PageCount.

Here are the steps to extract raw text from a single page of a PDF document:

Instantiate a Parser object for the document;
Instantiate a TextOptions object, passing true to enable raw mode;
Call the GetDocumentInfo method to obtain an IDocumentInfo object;
Use RawPageCount instead of PageCount to avoid extra calculations;
Call the GetText(pageIndex, TextOptions) method with the zero-based page index to obtain a TextReader object;
Read the text from the reader.

The following example demonstrates how to extract raw text from each page of a PDF document:

// Create an instance of Parser class
using (Parser parser = new Parser(filePath))
{
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();
    // Check if the document has pages
    if (documentInfo == null || documentInfo.RawPageCount == 0)
    {
        Console.WriteLine("Document has no pages.");
        return;
    }
    // Iterate over pages
    for (int p = 0; p < documentInfo.RawPageCount; p++)
    {
        // Print the page number
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.RawPageCount));
        // Extract a text into the reader
        using (TextReader reader = parser.GetText(p, new TextOptions(true)))
        {
            // Print the page text; reader is null if text extraction isn't supported
            Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
        }
    }
}

More resources

GitHub examples

You may easily run the code above and see the feature in action in our GitHub examples:

Free online document parser App

Along with the full-featured .NET library, we provide simple but powerful free apps.

You are welcome to parse documents and extract data from PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, emails and more with our Free Online Document Parser App.
