Extract text from Microsoft Office Excel spreadsheets

To extract text from Microsoft Office Excel spreadsheets, use the GetText and GetText(pageIndex) methods. They extract text from the entire document or from a selected sheet, respectively. Here are the steps to extract text from a Microsoft Office Excel spreadsheet:

  • Instantiate a Parser object for the spreadsheet;
  • Call the GetText method to obtain a TextReader object;
  • Read the text from the reader.
Warning
The GetText method returns null if text extraction isn’t supported for the document. For example, text extraction isn’t supported for Zip archives, so GetText returns null for a Zip archive. For an empty Microsoft Office Excel spreadsheet, GetText returns an empty TextReader object (reader.ReadToEnd returns an empty string).

The following example demonstrates how to extract text from a Microsoft Office Excel spreadsheet:

// Create an instance of the Parser class
using(Parser parser = new Parser(filePath))
{
    // Extract text into the reader
    using(TextReader reader = parser.GetText())
    {
        // Print the text from the spreadsheet
        Console.WriteLine(reader.ReadToEnd());
    }
}
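
The example above calls reader.ReadToEnd directly, which fails with a NullReferenceException if GetText returns null for an unsupported document. A minimal defensive sketch follows. The SafeTextExtraction class, the ReadOrFallback and ExtractText helpers, and the fallback message are hypothetical names introduced for illustration; they are not part of GroupDocs.Parser.

```csharp
using System.IO;
using GroupDocs.Parser;

public static class SafeTextExtraction
{
    // Hypothetical helper: returns the reader's full content,
    // or the fallback string when the reader is null
    public static string ReadOrFallback(TextReader reader, string fallback)
    {
        return reader == null ? fallback : reader.ReadToEnd();
    }

    // Hypothetical helper: extracts text from the document at filePath
    public static string ExtractText(string filePath)
    {
        // Create an instance of the Parser class
        using (Parser parser = new Parser(filePath))
        // GetText returns null if text extraction isn't supported;
        // a using statement accepts a null resource
        using (TextReader reader = parser.GetText())
        {
            return ReadOrFallback(reader, "Text extraction isn't supported.");
        }
    }
}
```

Calling SafeTextExtraction.ExtractText(filePath) then yields either the spreadsheet text or the fallback message, so the caller never dereferences a null reader.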

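Here are the steps to extract text from a sheet of a Microsoft Office Excel spreadsheet:

  • Instantiate a Parser object for the spreadsheet;
  • Call the GetDocumentInfo method to obtain the number of sheets (IDocumentInfo.PageCount);
  • Call the GetText(pageIndex) method with a zero-based sheet index to obtain a TextReader object;
  • Read the text from the reader.

The following example shows how to extract text from each sheet of a Microsoft Office Excel spreadsheet: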

// Create an instance of the Parser class
using(Parser parser = new Parser(filePath))
{
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();

    // Iterate over the sheets
    for(int p = 0; p < documentInfo.PageCount; p++)
    {
        // Print the sheet number
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.PageCount));

        // Extract the sheet text into the reader
        using(TextReader reader = parser.GetText(p))
        {
            // Print the text from the sheet
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}

Raw mode speeds up text extraction at the cost of formatting accuracy. Use the GetText(TextOptions) and GetText(pageIndex, TextOptions) methods to extract text in raw mode.

Warning
Some spreadsheets may have a different number of sheets in raw and accurate modes. In raw mode, use IDocumentInfo.RawPageCount instead of IDocumentInfo.PageCount.

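Here are the steps to extract raw text from a sheet of a Microsoft Office Excel spreadsheet:

  • Instantiate a Parser object for the spreadsheet;
  • Call the GetDocumentInfo method and check that IDocumentInfo.RawPageCount is greater than zero;
  • Call the GetText(pageIndex, TextOptions) method with a zero-based sheet index and a TextOptions object created with the raw mode flag set to true, and obtain a TextReader object;
  • Read the text from the reader.

The following example shows how to extract raw text from each sheet of a Microsoft Office Excel spreadsheet: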

// Create an instance of the Parser class
using (Parser parser = new Parser(filePath))
{
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();
    // Check whether the document has any pages in raw mode
    if (documentInfo == null || documentInfo.RawPageCount == 0)
    {
        Console.WriteLine("The document has no pages.");
        return;
    }
    // Iterate over the sheets
    for (int p = 0; p < documentInfo.RawPageCount; p++)
    {
        // Print the sheet number
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.RawPageCount));
        // Extract the sheet text in raw mode into the reader
        using (TextReader reader = parser.GetText(p, new TextOptions(true)))
        {
            // Print the text from the sheet
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
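
The GetText(TextOptions) overload is mentioned above without an example. The following is a minimal sketch of raw-mode extraction for the whole spreadsheet rather than a single sheet. It assumes that TextOptions lives in the GroupDocs.Parser.Options namespace and that its boolean constructor argument enables raw mode, as in the per-sheet example. The RawTextExtraction class and the ExtractRawText method are hypothetical names.

```csharp
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options; // assumed namespace of TextOptions

public static class RawTextExtraction
{
    // Hypothetical helper: returns the raw text of the whole document,
    // or null if text extraction isn't supported
    public static string ExtractRawText(string filePath)
    {
        // Create an instance of the Parser class
        using (Parser parser = new Parser(filePath))
        // Request raw mode for the entire document
        using (TextReader reader = parser.GetText(new TextOptions(true)))
        {
            return reader == null ? null : reader.ReadToEnd();
        }
    }
}
```

No page count is needed here, so the RawPageCount check only matters when iterating sheet by sheet.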

GroupDocs.Parser can also extract text from Microsoft Office Excel spreadsheets as HTML, Markdown, or formatted plain text. For more details, see Extract Formatted Text.

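Here are the steps to extract text from a Microsoft Office Excel spreadsheet as HTML:

  • Instantiate a Parser object for the spreadsheet;
  • Instantiate a FormattedTextOptions object with FormattedTextMode.Html;
  • Call the GetFormattedText method to obtain a TextReader object;
  • Read the text from the reader.

The following example shows how to extract text from a Microsoft Office Excel spreadsheet as HTML: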

// Create an instance of the Parser class
using (Parser parser = new Parser(filePath))
{
    // Extract formatted text (HTML) into the reader
    using (TextReader reader = parser.GetFormattedText(new FormattedTextOptions(FormattedTextMode.Html)))
    {
        // Print the formatted text from the spreadsheet
        Console.WriteLine(reader.ReadToEnd());
    }
}
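
Markdown output is mentioned above but only HTML is shown. The sketch below swaps the mode, assuming the enumeration member is named FormattedTextMode.Markdown and that the option types live in the GroupDocs.Parser.Options namespace. The null check assumes that GetFormattedText, like GetText, returns null when the feature isn't supported for a document. The MarkdownTextExtraction class and the ExtractMarkdown method are hypothetical names.

```csharp
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options; // assumed namespace of FormattedTextOptions

public static class MarkdownTextExtraction
{
    // Hypothetical helper: returns the document text as Markdown,
    // or null if formatted text extraction isn't supported
    public static string ExtractMarkdown(string filePath)
    {
        // Create an instance of the Parser class
        using (Parser parser = new Parser(filePath))
        {
            // Select Markdown as the output format (assumed member name)
            FormattedTextOptions options = new FormattedTextOptions(FormattedTextMode.Markdown);
            using (TextReader reader = parser.GetFormattedText(options))
            {
                return reader == null ? null : reader.ReadToEnd();
            }
        }
    }
}
```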

More resources

GitHub examples

You can run the code above and see the feature in action in our GitHub examples.