Extract text from Microsoft Office Excel spreadsheets
To extract text from Microsoft Office Excel spreadsheets, use the getText and getText(int) methods. These methods extract text either from the entire document or from a single selected sheet (page).
Here are the steps to extract text from a Microsoft Office Excel spreadsheet:
- Instantiate a Parser object for the spreadsheet;
- Call the getText method to obtain a TextReader object;
- Read the text from the reader.
The following example demonstrates how to extract text from a Microsoft Office Excel spreadsheet:
// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SampleXlsx)) {
    // Extract the text into the reader
    try (TextReader reader = parser.getText()) {
        // Print the text from the spreadsheet
        System.out.println(reader.readToEnd());
    }
}
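Once the text has been read with readToEnd, it is an ordinary Java string and can be post-processed with the standard library. The following is a minimal sketch of one such step, splitting the text into non-blank lines. The ExtractedTextLines class and its nonBlankLines method are hypothetical helpers rather than part of GroupDocs.Parser, and the sample string is made up to stand in for real parser output, whose exact layout may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of GroupDocs.Parser): post-processes the
// string returned by reader.readToEnd().
final class ExtractedTextLines {

    // Splits the text on any line terminator, trims surrounding whitespace
    // from each line, and drops lines that are blank after trimming.
    static List<String> nonBlankLines(String text) {
        List<String> lines = new ArrayList<>();
        if (text == null) {
            return lines;
        }
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the result of reader.readToEnd()
        String extracted = "Name\tQty\r\n\r\nApples\t3\nPears\t5\n";
        for (String line : nonBlankLines(extracted)) {
            System.out.println(line);
        }
    }
}
```

In a real program, the string passed to nonBlankLines would be the value returned by reader.readToEnd() in the example above.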
Here are the steps to extract text from a single sheet of a Microsoft Office Excel spreadsheet:
- Instantiate a Parser object for the spreadsheet;
- Call the getDocumentInfo method to obtain an IDocumentInfo object, and read the number of sheets with its getPageCount method;
- Call the getText(int) method with the zero-based sheet index to obtain a TextReader object;
- Read the text from the reader.
// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SampleXlsx)) {
    // Get the spreadsheet info
    IDocumentInfo spreadsheetInfo = parser.getDocumentInfo();
    // Iterate over the sheets
    for (int p = 0; p < spreadsheetInfo.getPageCount(); p++) {
        // Print the sheet number
        System.out.println(String.format("Sheet %d/%d", p + 1, spreadsheetInfo.getPageCount()));
        // Extract the text of the sheet into the reader
        try (TextReader reader = parser.getText(p)) {
            // Print the text from the sheet
            System.out.println(reader.readToEnd());
        }
    }
}
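When the per-sheet text is needed later rather than printed immediately, the same loop can collect it into a map. The sketch below is only an illustration under stated assumptions. SheetTextCollector is a hypothetical class, and the sheetText function is a stand-in for code that would call parser.getText(index) and return reader.readToEnd(); here it is fed from a made-up array so the example runs without the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical helper (not part of GroupDocs.Parser).
final class SheetTextCollector {

    // Collects the text of each sheet into a map keyed by 1-based sheet number.
    // sheetText receives a zero-based sheet index; in real code it would wrap
    // parser.getText(index) and reader.readToEnd() (an assumption of this sketch).
    static Map<Integer, String> collect(int sheetCount, IntFunction<String> sheetText) {
        Map<Integer, String> result = new LinkedHashMap<>();
        for (int p = 0; p < sheetCount; p++) {
            String text = sheetText.apply(p);
            // Store an empty string when no text is available for a sheet
            result.put(p + 1, text == null ? "" : text);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up data standing in for the text of two sheets
        String[] fakeSheets = {"Sheet one text", "Sheet two text"};
        Map<Integer, String> bySheet = collect(fakeSheets.length, i -> fakeSheets[i]);
        bySheet.forEach((number, text) ->
                System.out.println(String.format("Sheet %d/%d: %s", number, bySheet.size(), text)));
    }
}
```

A LinkedHashMap is used so that iteration follows the sheet order of the spreadsheet.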
Raw mode speeds up text extraction at the cost of formatting accuracy. Use the getText(TextOptions) and getText(int, TextOptions) methods to extract text in raw mode.
Here are the steps to extract raw text from a single sheet of a Microsoft Office Excel spreadsheet:
- Instantiate a Parser object for the spreadsheet;
- Instantiate a TextOptions object, passing true as the parameter to request raw mode;
- Call the getDocumentInfo method;
- Use getRawPageCount instead of getPageCount to avoid extra calculations;
- Call the getText(int, TextOptions) method with the zero-based sheet index to obtain a TextReader object;
- Read the text from the reader.
// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SampleXlsx)) {
    // Get the document info (call getDocumentInfo only once)
    IDocumentInfo info = parser.getDocumentInfo();
    DocumentInfo documentInfo = info instanceof DocumentInfo
            ? (DocumentInfo) info
            : null;
    // Check if the document has pages
    if (documentInfo == null || documentInfo.getRawPageCount() == 0) {
        System.out.println("Document has no pages.");
        return;
    }
    // Request raw mode
    TextOptions options = new TextOptions(true);
    // Iterate over pages
    for (int p = 0; p < documentInfo.getRawPageCount(); p++) {
        // Print the page number (use the raw page count to avoid extra calculations)
        System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getRawPageCount()));
        // Extract the raw text into the reader
        try (TextReader reader = parser.getText(p, options)) {
            // Print the text from the page
            // Null-checking is omitted here; it assumes text extraction is supported for this document
            System.out.println(reader.readToEnd());
        }
    }
}
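To see whether raw mode is worth the loss of formatting accuracy for a particular spreadsheet, the two modes can be timed side by side. The following is a rough sketch of such a measurement, not a benchmark. ExtractionTimer and Timed are hypothetical names, and each Supplier stands in for a block of code that would perform one full extraction (for example, the loop above with or without TextOptions) and return the text.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of GroupDocs.Parser).
final class ExtractionTimer {

    // Holds the text produced by one extraction run and how long the run took.
    static final class Timed {
        final String text;
        final long elapsedNanos;

        Timed(String text, long elapsedNanos) {
            this.text = text;
            this.elapsedNanos = elapsedNanos;
        }
    }

    // Runs the extraction once and measures the wall-clock time it took.
    static Timed time(Supplier<String> extraction) {
        long start = System.nanoTime();
        String text = extraction.get();
        return new Timed(text, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // The suppliers below return made-up strings; in real code they would
        // run the accurate-mode and raw-mode extraction respectively.
        Timed accurate = time(() -> "text from accurate mode");
        Timed raw = time(() -> "text from raw mode");
        System.out.println(String.format("accurate: %d chars in %d ns",
                accurate.text.length(), accurate.elapsedNanos));
        System.out.println(String.format("raw: %d chars in %d ns",
                raw.text.length(), raw.elapsedNanos));
    }
}
```

A single run is noisy, so repeating the measurement several times on a representative file gives a more reliable picture.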
GroupDocs.Parser can also extract text from Microsoft Office Excel spreadsheets as HTML, Markdown, and formatted plain text. For more details, see Extract Formatted Text.
Here are the steps to extract text from a Microsoft Office Excel spreadsheet as HTML:
- Instantiate a Parser object for the spreadsheet;
- Call the getFormattedText(FormattedTextOptions) method to obtain a TextReader object;
- Read the text from the reader.
The following example shows how to extract text from a Microsoft Office Excel spreadsheet as HTML:
// Create an instance of the Parser class
try (Parser parser = new Parser(Constants.SampleXlsx)) {
    // Extract the formatted text into the reader
    try (TextReader reader = parser.getFormattedText(new FormattedTextOptions(FormattedTextMode.Html))) {
        // Print the formatted text from the spreadsheet
        System.out.println(reader.readToEnd());
    }
}
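HTML output is usually more useful in a file than on the console. The sketch below shows one way to save the extracted string with the standard java.nio.file API. HtmlOutput is a hypothetical helper, and the HTML snippet is made up to stand in for the value returned by reader.readToEnd(); the markup the library actually produces may look different.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of GroupDocs.Parser).
final class HtmlOutput {

    // Writes the HTML string to the given file as UTF-8,
    // creating the file or replacing its existing content.
    static Path save(String html, Path target) throws IOException {
        return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Made-up snippet standing in for the result of reader.readToEnd()
        String html = "<table><tr><td>Apples</td><td>3</td></tr></table>";
        Path target = Files.createTempFile("spreadsheet", ".html");
        save(html, target);
        System.out.println("Saved " + Files.size(target) + " bytes to " + target);
    }
}
```

Encoding the text explicitly as UTF-8 avoids depending on the platform default charset.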