This demonstration shows how to open an input document, convert it to an intermediate EditableDocument, and get HTML markup in different forms depending on client requirements.
Preparations
When an input document is loaded into the Editor class and opened for editing by transforming it to the intermediate EditableDocument class, it is possible to generate and get HTML markup in different forms.
First, load the document into the Editor class and open it for editing, as shown in the code below.
from groupdocs.editor import Editor
from groupdocs.editor.options import WordProcessingLoadOptions

load_options = WordProcessingLoadOptions()
editor = Editor("document.docx", load_options)  # passing path and load options to the constructor
document = editor.edit()  # opening the document for editing
The code above prepares a ready-to-use instance of the EditableDocument class, which contains the original document in its own intermediate format and can generate HTML markup in different forms.
Getting the whole HTML content
The default way to generate HTML markup is the parameterless get_content() method:
html_content = document.get_content()
If the document has external resources (stylesheets, fonts, images), they are referenced via different HTML elements: stylesheets are specified through LINK elements and images through IMG elements. When the get_content() method is used, such external resources are referenced by external links.
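To make the idea of external references concrete, the sketch below uses Python's standard html.parser module to list the resources that a piece of markup points to. The sample markup, the file names, and the ExternalReferenceCollector class are hypothetical illustrations, not output produced by GroupDocs.Editor; in practice the string returned by get_content() could be passed in place of SAMPLE_HTML.

```python
from html.parser import HTMLParser

# Hypothetical markup of the kind get_content() might return; the file
# names are made up for illustration.
SAMPLE_HTML = """<html>
<head>
<link rel="stylesheet" href="stylesheet.css">
</head>
<body>
<p>Hello</p>
<img src="image1.png" alt="first">
<img src="image2.png" alt="second">
</body>
</html>"""


class ExternalReferenceCollector(HTMLParser):
    """Collects stylesheet and image URLs referenced from HTML markup."""

    def __init__(self):
        super().__init__()
        self.stylesheets = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case; attribute values may be None.
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel and attributes.get("href"):
            self.stylesheets.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def collect_external_references(html):
    """Return (stylesheet URLs, image URLs) found in the given markup."""
    collector = ExternalReferenceCollector()
    collector.feed(html)
    collector.close()
    return collector.stylesheets, collector.images


stylesheets, images = collect_external_references(SAMPLE_HTML)
print("Stylesheets:", stylesheets)
print("Images:", images)
```

A listing like this can help decide which resources the client must be able to fetch separately before the markup is handed to an editor.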
Many HTML WYSIWYG editors cannot process a whole HTML document with a HEAD section and so on. They can only process the inner content of the HTML->BODY element. To obtain this part of the HTML markup, the EditableDocument class provides the get_body_content() method:
body_content = document.get_body_content()
It is also possible to pass an external images prefix to get_body_content(); it is prepended to the URL in the src attribute of every IMG tag found inside the HTML->BODY markup.
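The effect of such a prefix can be illustrated without the library. The sketch below is not how GroupDocs.Editor implements it; it only mimics the described result by prepending a string to the src attribute of every IMG tag in a hypothetical BODY fragment. The function name add_image_prefix, the sample markup, and the example URL are assumptions made for illustration.

```python
import re

# Matches the double-quoted src attribute of an IMG tag. A regular
# expression is enough for this illustration; real-world HTML is better
# handled with a proper parser.
IMG_SRC_PATTERN = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]*)(")', re.IGNORECASE)


def add_image_prefix(body_html, prefix):
    """Return body_html with prefix prepended to every IMG src value."""
    return IMG_SRC_PATTERN.sub(
        lambda match: match.group(1) + prefix + match.group(2) + match.group(3),
        body_html,
    )


# Hypothetical BODY fragment with two image references.
body = '<p>Logo: <img src="image1.png" alt="logo"></p><img alt="x" src="image2.png">'
prefixed = add_image_prefix(body, "https://example.com/images/")
print(prefixed)
```

With a prefix of this kind, the client application can route every image request to its own handler, which then serves the matching resource.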
The get_css_content() method returns the CSS stylesheet(s) of the document. It can be called without arguments, or with prefixes for external images and fonts referenced from the stylesheets:
css_content = document.get_css_content()
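Once the stylesheets are obtained, a client may want to store them as separate files. The sketch below assumes that get_css_content() returns a list of strings, one per stylesheet; that assumption is not confirmed here. The save_stylesheets helper, the file naming scheme, and the sample CSS are hypothetical.

```python
import os
import tempfile


def save_stylesheets(stylesheets, directory):
    """Write each stylesheet string to its own .css file and return the paths."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for index, css_text in enumerate(stylesheets):
        path = os.path.join(directory, f"stylesheet_{index}.css")
        with open(path, "w", encoding="utf-8") as css_file:
            css_file.write(css_text)
        paths.append(path)
    return paths


# Stand-in for the value returned by document.get_css_content()
# (assumed here to be a list of strings).
sample_stylesheets = ["body { margin: 0; }", "p { color: #333; }"]
output_dir = tempfile.mkdtemp()
saved_paths = save_stylesheets(sample_stylesheets, output_dir)
print(saved_paths)
```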
Getting base64-encoded content
Sometimes it is necessary to obtain the entire content of the document, together with all used resources, in a single string. GroupDocs.Editor provides the get_embedded_html() method for this.
In such a string, all stylesheets are placed into STYLE elements in the HTML->HEAD section, and all images in IMG elements are serialized with base64 encoding and placed directly in the src attributes. Fonts and images used in the stylesheets are also serialized and stored in the appropriate locations. Such a string is fully autonomous and self-sufficient.
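Placing base64-encoded data directly in a src attribute relies on the data URI scheme. The sketch below shows the general technique with Python's standard base64 module; it is not the library's implementation, and the to_data_uri helper and the placeholder bytes are hypothetical.

```python
import base64


def to_data_uri(data, mime_type):
    """Encode raw bytes as a base64 data URI suitable for a src attribute."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes standing in for the contents of an image file.
image_bytes = b"hello"
uri = to_data_uri(image_bytes, "image/png")
img_tag = f'<img src="{uri}">'
print(img_tag)
```

Because base64 encodes every 3 bytes as 4 characters, an embedded string is noticeably larger than the sum of the original resources, which is worth keeping in mind for documents with many images.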
Complete code example
The example below loads a document, opens it for editing, and prints the lengths of the HTML markup obtained in different forms.
import os
from groupdocs.editor import Editor, License

def get_html_markup_in_different_forms():
    # Optionally set a license
    license_path = os.path.abspath("./GroupDocs.Editor.lic")
    if os.path.exists(license_path):
        License().set_license(license_path)

    with Editor("./sample-document.docx") as editor:
        document = editor.edit()

        # Generate the HTML markup in different forms and inspect their sizes
        print("Whole content length:", len(document.get_content()))
        print("Body content length:", len(document.get_body_content()))
        print("Embedded (base64) content length:", len(document.get_embedded_html()))

        # get_css_content() returns the stylesheet(s) of the document
        css = document.get_css_content()
        print("CSS stylesheets count:", len(css))

        document.dispose()

if __name__ == "__main__":
    get_html_markup_in_different_forms()
sample-document.docx is the sample file used in this example.