This example demonstrates opening, editing, and saving XML documents, using different options and adjustments.
Introduction
GroupDocs.Editor supports importing documents in the XML (eXtensible Markup Language) format. This article describes the XML processing mechanism and the available editing options.
Loading XML documents
XML documents are loaded into the Editor class in the same way as any other format. There are no dedicated load options for XML; specifying the file through a file path or a byte stream is enough.
When loading through a file path, the file extension does not matter: an XML file may have the *.xml extension or any other, such as *.csproj or *.svg. Only a valid internal structure matters.
Note that HTML files cannot be treated as XML; only XHTML qualifies as valid XML.
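Because only the internal structure matters, a quick well-formedness check can indicate ahead of time whether a file is likely to be accepted as XML. The sketch below uses Python's standard xml.etree.ElementTree parser rather than GroupDocs.Editor, so it only approximates what the library accepts; the helper name is_well_formed_xml and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Plain HTML with an unclosed <br> tag is not well-formed XML.
html_fragment = "<p>First line<br>Second line</p>"
# The XHTML form closes every tag, so it is well-formed XML.
xhtml_fragment = "<p>First line<br/>Second line</p>"
# Content typical of a *.csproj file: the extension is irrelevant to the parser.
project_fragment = '<Project Sdk="Microsoft.NET.Sdk"><PropertyGroup /></Project>'

for sample in (html_fragment, xhtml_fragment, project_fragment):
    print(is_well_formed_xml(sample), sample)
```

The check works on content only, which mirrors the point made above: a file is XML because of its structure, not because of its name.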
    from groupdocs.editor import Editor

    # Load from a file path
    editor_from_path = Editor("sample.xml")

    # Load from a binary stream
    with open("sample.xml", "rb") as stream:
        editor_from_stream = Editor(stream)
Editing XML documents
As with other format families in GroupDocs.Editor, XML documents have a dedicated XmlEditOptions class. It is optional: the parameterless editor.edit() overload may be used, in which case GroupDocs.Editor detects the format automatically and applies the default options.
The XmlEditOptions class exposes several properties. The most important ones are described below:
encoding: sets the encoding applied when opening the input XML file (any XML document is, first of all, a text file). XML files are UTF-8 by default, so the default value of this option is also UTF-8.
fix_incorrect_structure: a boolean flag. GroupDocs.Editor can open any XML document without error, even one that is corrupted, truncated, or structurally invalid. When this flag is enabled (True), GroupDocs.Editor scans the XML document and tries to repair its structure: it escapes prohibited characters, closes unclosed tags, opens unopened tags, fixes overlapping tags, and so on. Disabled (False) by default.
recognize_uris: a boolean flag that enables recognition of URIs (web addresses). When enabled (True), GroupDocs.Editor scans the XML document for valid URIs and represents them as external links in the resulting HTML using the A element. Disabled (False) by default.
recognize_emails: a boolean flag similar to recognize_uris, but for email addresses. When enabled (True), all valid email addresses are represented as A elements with the mailto scheme. Disabled (False) by default.
trim_trailing_whitespaces: a boolean flag that enables trimming of trailing whitespace in text nodes. Disabled (False) by default, so trailing whitespace is preserved.
attribute_values_quote_type: redefines the quote type (single or double) used for attribute values in the resulting HTML. Double quotes are used by default.
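To make the effect of recognize_uris and recognize_emails more concrete, the sketch below imitates the idea on a single text node using only the standard library. It is a simplified illustration, not GroupDocs.Editor's actual algorithm: the regular expressions, the function name link_text_node, and the exact markup of the generated A elements are all assumptions.

```python
import html
import re

# Assumed, deliberately simple patterns; real URI and email grammars are broader.
TOKEN = re.compile(
    r"(?P<uri>https?://[^\s<>]+)"
    r"|(?P<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)


def link_text_node(text: str) -> str:
    """Escape a text node for HTML and wrap URIs and emails in A elements."""
    parts = []
    position = 0
    for match in TOKEN.finditer(text):
        # Plain text between matches is only escaped.
        parts.append(html.escape(text[position:match.start()]))
        token = match.group(0)
        # Email addresses get the mailto scheme; URIs are used as they are.
        href = token if match.group("uri") else "mailto:" + token
        parts.append(f'<a href="{html.escape(href)}">{html.escape(token)}</a>')
        position = match.end()
    parts.append(html.escape(text[position:]))
    return "".join(parts)


print(link_text_node("Write to info@example.com or visit https://example.com/docs now"))
```

With both flags left at their defaults, the equivalent behaviour would be to escape the text and emit it without any A elements.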
The runnable example below loads an XML file, edits it with adjusted boolean options, and obtains the resulting HTML content.
    import os
    from groupdocs.editor import Editor, License
    from groupdocs.editor.options import XmlEditOptions

    def edit_xml():
        # Optionally set a license
        license_path = os.path.abspath("./GroupDocs.Editor.lic")
        if os.path.exists(license_path):
            License().set_license(license_path)

        # Load an input XML file into the Editor
        with Editor("./sample.xml") as editor:
            # Create and adjust the XML edit options
            edit_options = XmlEditOptions()
            edit_options.fix_incorrect_structure = True
            edit_options.recognize_uris = True
            edit_options.recognize_emails = True
            edit_options.trim_trailing_whitespaces = True

            # Edit the XML document and obtain an EditableDocument
            editable = editor.edit(edit_options)

            # Obtain the HTML content (in practice it is sent to the WYSIWYG-editor)
            content = editable.get_content()
            print("Generated HTML content length:", len(content))
            editable.dispose()

    if __name__ == "__main__":
        edit_xml()
sample.xml is the sample file used in this example.
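If the original sample file is not at hand, a small stand-in can be generated with the standard library. The element names and values below are invented for illustration and do not reproduce the original sample.xml; they merely include a URI, an email address, and trailing whitespace so that the options enabled in the example have something to act on.

```python
import xml.etree.ElementTree as ET


def write_sample_xml(path: str) -> None:
    """Write a tiny, hypothetical XML document to the given path."""
    root = ET.Element("contacts")
    person = ET.SubElement(root, "person", {"id": "1"})
    # Trailing spaces give trim_trailing_whitespaces something to trim.
    ET.SubElement(person, "name").text = "Jane Doe   "
    # An email address and a URI for recognize_emails and recognize_uris.
    ET.SubElement(person, "email").text = "jane.doe@example.com"
    ET.SubElement(person, "site").text = "https://example.com/jane"
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    write_sample_xml("sample.xml")
```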
The XmlEditOptions class also has two compound properties, highlight_options and format_options, which wrap the XmlHighlightOptions and XmlFormatOptions types respectively. Each property already holds a pre-created instance; only the members of that instance are meant to be changed, and a new instance cannot be assigned.
highlight_options controls the fonts (name, size, color, weight, style, and decoration) used to represent XML tags, attribute names, attribute values, inner text, HTML comments, and CDATA sections in the resultant HTML. format_options controls how the XML hierarchy is laid out — whether each attribute goes on a new line, whether leaf text nodes go on a new line, and the size of the left indent per nesting level. These properties operate on complex CSS-related value types, so adjust their members carefully. The snippet below is a schematic, non-runnable illustration of accessing these compound properties:
    from groupdocs.editor.options import XmlEditOptions

    edit_options = XmlEditOptions()

    # Access the already-created compound sub-options (do not assign a new instance)
    highlight_options = edit_options.highlight_options
    format_options = edit_options.format_options

    # Adjust members of the compound options using CSS-related value types
    # (see the API reference for the exact value types and their constructors)
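The "left indent per nesting level" that format_options controls can be pictured with a standard-library analogy. The sketch below does not use GroupDocs.Editor or XmlFormatOptions at all; it applies xml.etree.ElementTree.indent (available since Python 3.9) to a made-up document to show how an indent size applied per nesting level shapes the layout. The helper name indent_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def indent_xml(xml_text: str, indent_size: int) -> str:
    """Re-serialize XML with indent_size spaces added per nesting level."""
    root = ET.fromstring(xml_text)
    # ET.indent rewrites whitespace-only text and tails in place.
    ET.indent(root, space=" " * indent_size)
    return ET.tostring(root, encoding="unicode")


print(indent_xml("<a><b><c>text</c></b></a>", 4))
```

Each nesting level shifts the element further to the right by the chosen number of spaces, which is the same kind of layout decision the indent setting in format_options expresses for the generated HTML.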
Getting document metainfo
The article Extracting document metainfo describes the get_document_info() method, which allows you to detect the document format and extract its metadata without editing it. The XML format is supported as well.
When get_document_info() is called for an Editor instance loaded with an XML document, the method returns a metadata view corresponding to the TextualDocumentInfo type — a common type for all document formats of a textual nature, like HTML, XML, and TXT. It exposes the format, page_count, size, is_encrypted, and encoding properties. For XML documents page_count always returns 1 and is_encrypted always returns False.