Extract text from Microsoft Office Excel spreadsheets

Overview

This guide shows how to extract text from Microsoft Office Excel spreadsheets (.xls, .xlsx) using the GroupDocs.Parser for .NET API. It covers four extraction methods: whole-document extraction, sheet-by-sheet extraction, high-speed raw mode, and formatted text output.

Extraction Methods Comparison

Method         | Use Case                            | Performance | Output Quality
Whole Document | Extract all text at once            | Fast        | Standard
Sheet-by-Sheet | Process individual worksheets       | Medium      | Standard
Raw Mode       | High-speed bulk processing          | Fastest     | Lower formatting accuracy
Formatted Text | Preserve formatting (HTML/Markdown) | Slower      | Highest

Method 1: Extract Text from Entire Spreadsheet

When to use: When you need all text content from the Excel workbook and don’t need to distinguish between worksheets.

Use the GetText method to extract text from a Microsoft Office Excel spreadsheet. Called without arguments, it retrieves the content of the entire document.

Steps:

  1. Instantiate a Parser object for the spreadsheet
  2. Call the GetText method to obtain a TextReader object
  3. Read the text from the reader

Warning
GetText returns null if text extraction isn’t supported for the document format. For example, text extraction isn’t supported for Zip archives, so GetText returns null for them. For an empty Microsoft Office Excel spreadsheet, GetText returns an empty TextReader object (reader.ReadToEnd returns an empty string).

Example:

// Create an instance of the Parser class
using(Parser parser = new Parser(filePath))
{
    // Extract the text into a reader
    using(TextReader reader = parser.GetText())
    {
        // Print the text from the spreadsheet
        Console.WriteLine(reader.ReadToEnd());
    }
}
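
Because GetText can return null, as the warning above notes, a caller may want to check the reader before reading from it. The following is a minimal sketch, assuming the GroupDocs.Parser package is referenced; the class name ExtractExample, the helper ExtractAllText, and the path "sample.xlsx" are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

public static class ExtractExample
{
    // Hypothetical helper: returns the workbook text, or null when
    // text extraction isn't supported for the document.
    public static string ExtractAllText(string filePath)
    {
        using (Parser parser = new Parser(filePath))
        using (TextReader reader = parser.GetText())
        {
            // A using statement accepts a null resource, so the null check is done here
            return reader == null ? null : reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // "sample.xlsx" is a placeholder path
        string text = ExtractAllText("sample.xlsx");
        Console.WriteLine(text ?? "Text extraction isn't supported.");
    }
}
```

For an empty spreadsheet the helper returns an empty string rather than null, which lets the caller tell "no text" apart from "extraction not supported".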

Method 2: Extract Text from Individual Sheets

When to use: When you need to process each Excel worksheet separately, keep the sheet organization, or perform sheet-specific data extraction.

This method uses GetText(pageIndex) to extract text from a specific sheet. Each worksheet is treated as a separate page of the document, and page indexes are zero-based.

Steps:

  1. Instantiate a Parser object for the spreadsheet
  2. Call the GetDocumentInfo method to obtain an IDocumentInfo object with the page count
  3. Call the GetText(pageIndex) method with the sheet index to obtain a TextReader object
  4. Read the text from the reader

Example:

// Create an instance of the Parser class
using(Parser parser = new Parser(filePath))
{
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();

    // Iterate over the sheets
    for(int p = 0; p < documentInfo.PageCount; p++)
    {
        // Print the sheet number
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.PageCount));

        // Extract the sheet text into a reader
        using(TextReader reader = parser.GetText(p))
        {
            // Print the text from the sheet
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
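
When the per-sheet text is needed for later processing rather than printing, the same loop can collect it into a list. This is a sketch under assumptions, not the library's prescribed pattern: SheetTextExample, ExtractSheets, and "sample.xlsx" are hypothetical names, and the null checks are defensive additions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

public static class SheetTextExample
{
    // Hypothetical helper: returns one string per sheet, in sheet order.
    public static List<string> ExtractSheets(string filePath)
    {
        var sheets = new List<string>();
        using (Parser parser = new Parser(filePath))
        {
            IDocumentInfo info = parser.GetDocumentInfo();
            // Defensive: treat missing document info as zero sheets
            int count = info == null ? 0 : info.PageCount;

            for (int p = 0; p < count; p++)
            {
                using (TextReader reader = parser.GetText(p))
                {
                    // Store an empty string when no reader is returned
                    sheets.Add(reader == null ? string.Empty : reader.ReadToEnd());
                }
            }
        }
        return sheets;
    }

    public static void Main()
    {
        // "sample.xlsx" is a placeholder path
        List<string> sheets = ExtractSheets("sample.xlsx");
        for (int i = 0; i < sheets.Count; i++)
        {
            Console.WriteLine("Sheet {0}: {1} characters", i + 1, sheets[i].Length);
        }
    }
}
```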

Method 3: High-Speed Raw Text Extraction

When to use: When processing large Excel files or many spreadsheets where parsing speed matters more than formatting accuracy, for example in data mining, content indexing, or bulk document processing.

Raw mode speeds up text extraction at the cost of formatting accuracy. Use the GetText(TextOptions) and GetText(pageIndex, TextOptions) methods to extract text in raw mode.

Warning
Some spreadsheets may have a different number of sheets in raw mode than in accurate mode. In raw mode, use IDocumentInfo.RawPageCount instead of IDocumentInfo.PageCount.

Steps:

  1. Instantiate a Parser object for the spreadsheet
  2. Instantiate a TextOptions object with true as the parameter
  3. Call the GetDocumentInfo method
  4. Use RawPageCount instead of PageCount to avoid extra calculations
  5. Call the GetText(pageIndex, TextOptions) method with the sheet index to obtain a TextReader object
  6. Read the text from the reader

Example:

// Create an instance of the Parser class
using (Parser parser = new Parser(filePath))
{
    // Create the text options once; true enables raw mode
    TextOptions options = new TextOptions(true);
    // Get the document info
    IDocumentInfo documentInfo = parser.GetDocumentInfo();
    // Check if the document has pages
    if (documentInfo == null || documentInfo.RawPageCount == 0)
    {
        Console.WriteLine("Document has no pages.");
        return;
    }
    // Iterate over the sheets
    for (int p = 0; p < documentInfo.RawPageCount; p++)
    {
        // Print the sheet number
        Console.WriteLine(string.Format("Page {0}/{1}", p + 1, documentInfo.RawPageCount));
        // Extract the sheet text into a reader
        using (TextReader reader = parser.GetText(p, options))
        {
            // Print the text from the sheet
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}

Method 4: Extract Formatted Text (HTML)

When to use: When you need to preserve the visual structure and formatting of the spreadsheet data, or when you integrate the extracted content into web applications or formatted reports.

GroupDocs.Parser can extract text from Microsoft Office Excel spreadsheets as HTML, Markdown, or formatted plain text. For more details, see Extract Formatted Text.

Steps:

  1. Instantiate a Parser object for the spreadsheet
  2. Call the GetFormattedText method to obtain a TextReader object
  3. Read the text from the reader

Example:

// Create an instance of the Parser class
using (Parser parser = new Parser(filePath))
{
    // Extract the formatted text (HTML) into a reader
    using (TextReader reader = parser.GetFormattedText(new FormattedTextOptions(FormattedTextMode.Html)))
    {
        // Print the formatted text from the spreadsheet
        Console.WriteLine(reader.ReadToEnd());
    }
}
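
Since the section also mentions Markdown output, the same call can be sketched with a different mode. This assumes FormattedTextMode.Markdown is available in the installed version of the library; MarkdownExample and "sample.xlsx" are hypothetical names, and the null check is a defensive addition for documents where formatted text extraction isn't supported.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

public static class MarkdownExample
{
    public static void Main()
    {
        // "sample.xlsx" is a placeholder path
        using (Parser parser = new Parser("sample.xlsx"))
        {
            // Request Markdown instead of HTML
            var options = new FormattedTextOptions(FormattedTextMode.Markdown);

            using (TextReader reader = parser.GetFormattedText(options))
            {
                Console.WriteLine(reader == null
                    ? "Formatted text extraction isn't supported."
                    : reader.ReadToEnd());
            }
        }
    }
}
```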

Supported Excel Formats

  • .xls - Microsoft Excel 97-2003 Workbook
  • .xlsx - Microsoft Excel Open XML Workbook

Common Use Cases

Enterprise Document Processing

  • Excel Data Mining: Extract text for search indexing and content analysis
  • Spreadsheet Report Generation: Convert Excel workbook data to text-based reports
  • Content Migration: Move spreadsheet data between different document management systems

Business Intelligence & Analytics

  • Text Analytics on Excel Files: Analyze comments, notes, and text data within .xls/.xlsx spreadsheets
  • Compliance Document Processing: Extract text content from Excel documents for regulatory reporting
  • Audit Trails: Track changes to spreadsheet text content over time

Integration Scenarios

  • Enterprise Search Systems: Index Excel spreadsheet content for full-text search capabilities
  • Automated Data Pipelines: Extract text as part of automated document processing workflows
  • Content Management Systems: Process uploaded Excel files automatically

Performance Considerations

File Size Impact

  • Small files (< 1MB): All methods perform similarly
  • Medium files (1-10MB): Raw mode provides noticeable speed improvement
  • Large files (> 10MB): Raw mode recommended for bulk processing

Memory Usage

  • Sheet-by-sheet processing uses less memory than whole document extraction
  • Raw mode is more memory-efficient for large files
  • Consider processing sheets individually for very large workbooks
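
One way to apply the sheet-by-sheet advice is to copy each sheet's text to a file line by line instead of calling ReadToEnd on the whole document. This is only a sketch: it avoids holding all of the text in a single string in the calling code, but how much memory the library itself uses internally is not something this example can guarantee. StreamingExample, ExportSheets, and both file names are hypothetical.

```csharp
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

public static class StreamingExample
{
    // Hypothetical helper: writes the text of every sheet to outputPath,
    // one sheet at a time and one line at a time.
    public static void ExportSheets(string filePath, string outputPath)
    {
        using (Parser parser = new Parser(filePath))
        using (StreamWriter writer = new StreamWriter(outputPath))
        {
            IDocumentInfo info = parser.GetDocumentInfo();
            // Defensive: treat missing document info as zero sheets
            int count = info == null ? 0 : info.PageCount;

            for (int p = 0; p < count; p++)
            {
                using (TextReader reader = parser.GetText(p))
                {
                    if (reader == null)
                    {
                        continue; // no text available for this sheet
                    }

                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        writer.WriteLine(line);
                    }
                }
            }
        }
    }

    public static void Main()
    {
        // Placeholder input and output paths
        ExportSheets("sample.xlsx", "sample.txt");
    }
}
```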

Troubleshooting

Common Issues

Null TextReader Response

  • Verify the file format is supported (.xls, .xlsx)
  • Check if the file is corrupted or password-protected
  • Ensure the file path is correct and accessible
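
The checks above can be sketched in code. The example below verifies that the file exists and then consults parser.Features.Text before calling GetText; it assumes that Features.Text reports whether text extraction is supported for the loaded document. SupportCheckExample and the default path "sample.xlsx" are hypothetical.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

public static class SupportCheckExample
{
    public static void Main(string[] args)
    {
        // Placeholder path; pass a real one as the first argument
        string filePath = args.Length > 0 ? args[0] : "sample.xlsx";

        // Rule out a wrong or inaccessible path first
        if (!File.Exists(filePath))
        {
            Console.WriteLine("File not found: " + filePath);
            return;
        }

        using (Parser parser = new Parser(filePath))
        {
            // Ask the parser whether text extraction is supported for this document
            if (!parser.Features.Text)
            {
                Console.WriteLine("Text extraction isn't supported for this document.");
                return;
            }

            using (TextReader reader = parser.GetText())
            {
                Console.WriteLine(reader == null
                    ? "GetText returned null."
                    : reader.ReadToEnd());
            }
        }
    }
}
```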

Empty Text Output

  • Confirm the spreadsheet contains text data (not just numbers/formulas)
  • Check if the sheets contain visible content
  • Verify the file isn’t completely empty

Performance Issues

  • Use raw mode for large files or bulk processing
  • Process sheets individually instead of extracting the entire document
  • Consider file size limitations in your environment

More resources

GitHub examples

You can run the code above and see the feature in action in our GitHub examples.

Free online document parser App

Along with the full-featured .NET library, we provide simple but powerful free apps.

You are welcome to parse documents and extract data from PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, emails, and more with our Free Online Document Parser App.
