Creating content from scratch with XML

By Alexander Halser
December 4, 2006

Original article link

 

Target audience: Programmers and power users who want to create content for Help & Manual via XML.
Required skills: A basic knowledge of XML.

If you have ever looked at the XML code that Help & Manual exports, you may have found it daunting. "Better not touch that code manually" may come to mind. But that would be a mistake, because the code is really extremely simple. So, if the idea of generating content for Help & Manual via XML appeals to you, read on.


Importing data via XML

Help & Manual can import XML data according to its own XML schema. I have explained this in other articles - it must be the H&M schema, not "any" XML. This schema is described in the help file helpman_xml_ref.chm that you find in your H&M installation folder.

The easiest way to get an idea of what the XML looks like is to create a small new project and export it to XML. Help & Manual can either export one "single" XML file, where all the data (including topics) is in one file, or it can separate the configuration overhead from the topics by creating multiple XML files.

The XML import, on the other hand, doesn't care whether it's a single XML file or multiple files - it reads both. So, when creating XML from scratch, I prefer to put everything into a single file. It just saves us a few unnecessary headaches.

The following examples are plain text: put this into a text file with Notepad and save it with an .XML extension. Then import the examples in Help & Manual (click File > New > Import XML).



The most essential XML code

When importing XML, Help & Manual is quite tolerant if some data is missing: all configuration values that you usually set in Project > Properties have defaults which are filled in automatically. So when we create XML from scratch, we can leave most of this stuff out and just concentrate on the content.

Here is my first and most essential XML file.
You can copy and paste this code into a text file, save it and import the XML into Help & Manual. The result is a new project with one topic:

Code:
<?xml version="1.0"?>
<helpproject>

  <map id="table-of-contents">
    <topicref type="topic" href="new_topic">New item</topicref>
  </map>

  <topics>

    <topic id="new_topic">
    <body>
      <header>
        <para styleclass="Heading1"><text styleclass="Heading1">Topic Header</text></para>
      </header>
      <para styleclass="Normal"><text styleclass="Normal">Enter topic text here.</text></para>
    </body>
    </topic>

  </topics>

</helpproject>

The file starts with the XML declaration "<?xml version="1.0"?>", which must come first in the file, before anything else. This declaration usually includes an encoding, which I have intentionally left out. When H&M exports XML, it always encodes the files as UTF-8, but the import understands ANSI-encoded XML files as well. UTF-8, ANSI, what the heck...? Don't worry. I presume that you are creating English documentation, and the encoding requirements for English are trivial: plain English text is stored identically in both encodings, so we don't need to declare one. To get started, just forget about the encoding - one thing less to worry about.

The XML file has a root element which is <helpproject> and a matching closing tag </helpproject>. So far, so simple. Within this root tag are two sections, a "map" section and a "topics" section. The first one defines the table of contents and the second one the actual topic record.
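
If you would rather generate this file from a program than type it in Notepad, the structure described above maps directly onto a standard XML library. The following is only a minimal sketch, assuming Python and its standard-library xml.etree.ElementTree module; the function name build_minimal_project is made up for this illustration and is not part of Help & Manual.

```python
import xml.etree.ElementTree as ET


def build_minimal_project(topic_id, toc_caption, header_text, body_text):
    """Build the smallest project described above: one TOC entry, one topic."""
    root = ET.Element("helpproject")

    # <map> section: the table of contents, with one pointer to the topic.
    toc = ET.SubElement(root, "map", {"id": "table-of-contents"})
    ref = ET.SubElement(toc, "topicref", {"type": "topic", "href": topic_id})
    ref.text = toc_caption

    # <topics> section: the actual topic record.
    topics = ET.SubElement(root, "topics")
    topic = ET.SubElement(topics, "topic", {"id": topic_id})
    body = ET.SubElement(topic, "body")

    header = ET.SubElement(body, "header")
    para = ET.SubElement(header, "para", {"styleclass": "Heading1"})
    ET.SubElement(para, "text", {"styleclass": "Heading1"}).text = header_text

    para = ET.SubElement(body, "para", {"styleclass": "Normal"})
    ET.SubElement(para, "text", {"styleclass": "Normal"}).text = body_text
    return root


project = build_minimal_project(
    "new_topic", "New item", "Topic Header", "Enter topic text here."
)
# tostring() with encoding="unicode" returns text without a declaration,
# so the declaration from the example above is added by hand.
xml_text = '<?xml version="1.0"?>\n' + ET.tostring(project, encoding="unicode")
print(xml_text)
```

The output is written on one line without indentation, which should not matter for an import because the tags and their nesting are the same as in the hand-written file.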



The <map> section

This section requires an ID: for the table of contents, the id is "table-of-contents" (see above) and for the list of invisible topics, you guessed it, it's "invisible-topics". Within this map are <topicref> tags. A topicref tag designates a TOC entry. You can think of it as a kind of pointer to the actual topic record. Because it's a pointer, it has an "href" attribute which points to the topic ID. The "type" attribute declares the entry as a topic with text. The map in our example creates exactly one topic reference in the TOC. If you add two of those lines, you get two TOC entries:

Code:
<topicref id="1" type="topic" href="new_topic1">My first topic</topicref>
<topicref id="2" type="topic" href="new_topic2">My second topic</topicref>

Look carefully: I have added just one new attribute to each tag, id="1" and id="2". Help & Manual maintains an internal ID for every TOC reference to keep track of it when you export/import XML for translation. You can omit this ID, but if you want to update the TOC in a later import cycle, you must assign a number to it. The TOC entry ID is a unique 64-bit integer; in other words, a number between 1 and a very large number that must not occur twice.


Creating chapters with sub-topics

Sub-topics are <topicref> tags within other <topicref> tags! Here is an example:

Code:
  <map id="table-of-contents">
    <topicref id="1" type="topic" href="new_topic1">My first topic</topicref>
    <topicref id="2" type="chapter">Chapter 1 overview

       <topicref id="3" type="topic" href="chapter1_topic1">First topic of chapter 1</topicref>
       <topicref id="4" type="topic" href="chapter1_topic2">Second topic of chapter 1</topicref>

    </topicref>
  </map>


Note that the <topicref> tag on the third line is not closed on that line. Instead, after the text "Chapter 1 overview", another <topicref> tag is opened, which designates a sub-entry. I have added a chapter without text here, specified by the attribute type="chapter". Chapter entries don't need the href attribute.
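
When the TOC comes from a program, the nesting and the unique entry IDs can be produced in one pass. The sketch below assumes Python with the standard-library xml.etree.ElementTree and itertools modules; the helper add_entries and the tuple layout (caption, topic ID, children) are invented for this illustration. It only covers the two entry types shown in this article: "topic" entries with an href and "chapter" entries without text.

```python
import itertools
import xml.etree.ElementTree as ET


def add_entries(parent, entries, next_id):
    """Append <topicref> tags for (caption, topic_id, children) tuples.

    topic_id is None for a chapter without text. next_id is a shared
    counter, so every entry in the whole TOC gets a different number.
    """
    for caption, topic_id, children in entries:
        attrs = {"id": str(next(next_id))}
        if topic_id is None:
            attrs["type"] = "chapter"
        else:
            attrs["type"] = "topic"
            attrs["href"] = topic_id
        ref = ET.SubElement(parent, "topicref", attrs)
        ref.text = caption
        # Sub-entries are simply <topicref> tags inside this <topicref>.
        add_entries(ref, children, next_id)


toc_outline = [
    ("My first topic", "new_topic1", []),
    ("Chapter 1 overview", None, [
        ("First topic of chapter 1", "chapter1_topic1", []),
        ("Second topic of chapter 1", "chapter1_topic2", []),
    ]),
]

toc = ET.Element("map", {"id": "table-of-contents"})
add_entries(toc, toc_outline, itertools.count(1))
print(ET.tostring(toc, encoding="unicode"))
```

A plain counter is enough for a first import. If you plan to update the TOC in a later import cycle, you would have to store the numbers somewhere so that each entry keeps its ID.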



The <topics> section

So far we have defined a table of contents only, but the topics are still missing. Keep in mind that the TOC entries are pointers only while topics are separate records. They are defined in the <topics> section.

The <topics> section contains a list of <topic> tags. Plain simple... Inside the <topic> tags is the actual content of the topic. But let's have a look at the <topic> tag first.

Every topic in Help & Manual has a unique ID - the topic ID. It is defined by the id attribute, as in <topic id="new_topic1">. The ID is the same one that you enter in Help & Manual, and it corresponds to the href attributes in the <map> section above.


Code:
  <topics>

    <topic id="new_topic1" target="Main" helpcontext="1234">
    <body>
      <header>
        <para styleclass="Heading1"><text styleclass="Heading1">Topic Header</text></para>
      </header>
      <para styleclass="Normal"><text styleclass="Normal">Enter topic text here.</text></para>
    </body>
    </topic>

    <topic id="new_topic2" target="Main" helpcontext="5678">

      <!-- ... next topic goes here ... -->

    </topic>

  </topics>

The "id" attribute is in fact the only required attribute of the <topic> tag. If the ID matches an existing topic during the XML import, that topic will be overwritten with the new content.
It makes sense to specify the window type explicitly as well, which is done with the target="Main" attribute. If you omit it, H&M assumes the "Main" window type. Last but not least, help context numbers are defined here as well. The attribute helpcontext="1234" sets the help context number of "new_topic1" to "1234". If you omit that attribute, the topic has no help context.

Multiple topics are declared in a series of <topic> ... </topic> tags, before the final </topics> closing tag ends the section.
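
Because TOC entries are only pointers, it is easy to end up with an href that has no matching <topic> record in the generated file. The following sketch, assuming Python and the standard-library xml.etree.ElementTree module, lists such hrefs before you import; the function name find_dangling_hrefs is hypothetical. Whether a dangling href is actually a problem depends on your workflow, since the target topic may already exist in the project you import into.

```python
import xml.etree.ElementTree as ET


def find_dangling_hrefs(root):
    """Return href values of <topicref> tags that match no <topic id>."""
    topic_ids = {topic.get("id") for topic in root.iter("topic")}
    hrefs = [ref.get("href") for ref in root.iter("topicref") if ref.get("href")]
    return [href for href in hrefs if href not in topic_ids]


# Two TOC entries, but only the first topic record is present.
sample = """<helpproject>
  <map id="table-of-contents">
    <topicref id="1" type="topic" href="new_topic1">My first topic</topicref>
    <topicref id="2" type="topic" href="new_topic2">My second topic</topicref>
  </map>
  <topics>
    <topic id="new_topic1" target="Main" helpcontext="1234"><body></body></topic>
  </topics>
</helpproject>"""

print(find_dangling_hrefs(ET.fromstring(sample)))  # prints ['new_topic2']
```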



Inside a <topic> tag

Now, let's look at what's inside a <topic> tag. The opening <topic> tag is followed by a <body> tag, which contains the topic header and the body text (keywords are outside the <body> tag, see below).

The basic construction of a topic is like this:

Code:
<topic id="new_topic1">
  <keywords>
    <keyword>This is a keyword</keyword>
    <keyword>Another keyword</keyword>
  </keywords>
  <body>
      <header>
        <!-- ... paragraphs that go into the topic header -->
      </header>
      <!-- ... paragraphs that go into the topic text -->
  </body>
</topic>



Topic keywords

The XML snippet above shows a topic with two keywords, a topic header and some topic body text. Keywords are grouped in the <keywords> section (note the plural), and a keyword is the text between <keyword>...</keyword>. Sub-keywords are just sub-entries inside a <keyword> tag. Note that the first <keyword> tag in the following example is not closed immediately - there is another <keyword> tag in between:

Code:
  <keywords>
    <keyword>This is a keyword
        <keyword>with sub-keyword</keyword>
    </keyword>
    <keyword>Another keyword</keyword>
  </keywords>
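
If your index terms live in a program, the same nesting can be generated rather than typed. This is a small sketch, assuming Python and the standard-library xml.etree.ElementTree module; the helper build_keywords and its dictionary input (keyword mapped to a list of sub-keywords) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def build_keywords(index_terms):
    """Build a <keywords> section from {keyword: [sub-keywords]}."""
    keywords = ET.Element("keywords")
    for term, sub_terms in index_terms.items():
        keyword = ET.SubElement(keywords, "keyword")
        keyword.text = term
        # A sub-keyword is a <keyword> tag inside its parent <keyword>.
        for sub_term in sub_terms:
            ET.SubElement(keyword, "keyword").text = sub_term
    return keywords


keywords = build_keywords({
    "This is a keyword": ["with sub-keyword"],
    "Another keyword": [],
})
print(ET.tostring(keywords, encoding="unicode"))
```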



Adding text

All text in a topic is tied to a paragraph. It is therefore necessary to start with a paragraph tag - a <para> tag. A paragraph continues until its matching closing tag </para>. Note that hard returns in the text will not have any effect: as in HTML, such white space is not preserved.

A paragraph has either a styleclass (a style name reference) or hard formatting. In Help & Manual you can do both, but when creating XML from scratch, I strongly recommend using style classes and nothing else! So, a complete para tag will look like this:

Code:
<para styleclass="Heading1"> ... </para>
<para styleclass="Normal"> ... </para>


You could simply write some text directly inside the <para> tag, and the text would receive the same style name as the paragraph. However, while we are at it, let's do it right: let's define text with a dedicated <text> tag!

Code:
<para styleclass="Heading1">
    <text styleclass="Heading1">This is a heading</text>
</para>
<para styleclass="Normal">
    <text styleclass="Normal">This is regular topic text</text>
</para>
<para styleclass="Normal">
    <text styleclass="Normal">This text contains </text>
    <text styleclass="Code Example">a special term</text>
    <text styleclass="Normal"> and the paragraph continues until here.</text>
</para>

The example above contains 3 paragraphs: the first is a heading ("Heading1"), the second is regular text ("Normal") and the third contains mixed text - partly formatted as "Normal" and partly with a style called "Code Example". Note that this style must exist in your project (the style names used in this example are all predefined in an H&M project, so we don't need to worry about their existence here). I will discuss style definitions in the XML file further below.
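
Mixed paragraphs like the third one are a natural fit for a small helper that takes a list of text runs. The sketch below assumes Python and the standard-library xml.etree.ElementTree module; build_para and the (styleclass, text) pairs are names invented for this illustration.

```python
import xml.etree.ElementTree as ET


def build_para(para_style, runs):
    """Build a <para> from (styleclass, text) pairs, one <text> tag per pair."""
    para = ET.Element("para", {"styleclass": para_style})
    for text_style, content in runs:
        ET.SubElement(para, "text", {"styleclass": text_style}).text = content
    return para


mixed = build_para("Normal", [
    ("Normal", "This text contains "),
    ("Code Example", "a special term"),
    ("Normal", " and the paragraph continues until here."),
])
print(ET.tostring(mixed, encoding="unicode"))
```

Note that the blanks around the special term are part of the neighboring text runs, exactly as in the hand-written example.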


Adding links

One of the most essential things when creating content from scratch is to include links in the topic text. A link (in this case a text link) is very similar to a <text> tag, it's just called <link>. That took you by surprise, didn't it?


Code:
<para styleclass="Normal">
    <text styleclass="Normal">This text contains </text>
    <link type="topiclink" displaytype="text" styleclass="Code Example" href="chapter1_topic1">a special term</link>
    <text styleclass="Normal"> and the paragraph continues until here.</text>
</para>

Look at the <link> tag. It has four attributes: type="topiclink" defines the type of the link - here a link to another topic. This attribute is required. The attribute displaytype="text" is the default: it defines a text link (as opposed to a button or picture link). The attribute href="chapter1_topic1" is the target address - in this case the ID of the target topic. The styleclass attribute sets the text style, as it does in a <text> tag. Other link types are possible, for example Internet links:


Code:
<link displaytype="text" type="topiclink" href="new_topic1"                 styleclass="Normal">a topic link</link>
<link displaytype="text" type="weblink"   href="http://www.ec-software.com" styleclass="Normal">Internet link</link>

Got it? Ok.
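
Links can be generated the same way as text. The following sketch assumes Python and the standard-library xml.etree.ElementTree module; add_link is a hypothetical helper, and the second URL is a made-up address whose only purpose is to show that the library escapes the "&" in the href for you.

```python
import xml.etree.ElementTree as ET


def add_link(para, link_type, href, caption, styleclass="Normal"):
    """Append a text <link> of the given type ("topiclink" or "weblink")."""
    link = ET.SubElement(para, "link", {
        "displaytype": "text",
        "type": link_type,
        "href": href,
        "styleclass": styleclass,
    })
    link.text = caption
    return link


para = ET.Element("para", {"styleclass": "Normal"})
add_link(para, "topiclink", "new_topic1", "a topic link")
# Hypothetical URL with a query string, to show the "&" being escaped.
add_link(para, "weblink", "http://www.example.com/search?q=help&lang=en", "Internet link")
link_xml = ET.tostring(para, encoding="unicode")
print(link_xml)
```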


Inserting pictures

Pictures are simple to insert as well: they are defined by an <image> tag. Keep in mind that images are inline tags: like <text> tags, they are always inside a paragraph. The following picture gets its own line - it sits alone within a <para> tag:

Code:
<para styleclass="Normal">
    <image src="screenshot1.bmp" scale="100.00%" styleclass="Image Caption"></image>
</para>

The essential attribute here is src="screenshot1.bmp" which is the picture file name. Note that Help & Manual will ignore any path here, even a relative path! Pictures in H&M are internally saved with their file name only, while the search path from the configuration (Project > Properties) is used to locate it. The second attribute scale="100.00%" defines the zoom factor - for a half-size image you would write scale="50%". This attribute is optional, since the default is 100% anyway. The last attribute in this example is the text style of the picture. This style has no meaning unless you define an image caption.

Don't forget to include an image called "screenshot1.bmp" in the same folder as your XML file!


Images with captions and tooltips

Wow! We are really getting along well... images with captions are non-trivial! Here is an XML snippet for an image with a caption (I'll explain below):

Code:
<image src="screenshot1.bmp" scale="100.00%" styleclass="Image Caption">
   <title>Image tooltip</title>
   <caption>Visible text below image (image caption)</caption>
</image>

As you can see, the <image> tag contains two child tags: a <title> tag for the tooltip and a <caption> tag for the image caption. Here the attribute styleclass="Image Caption" matters: the text below the image will be formatted with this style. There is of course more to image tags: the XML syntax stores image alignment, margins, a picture ID and, last but not least, hotspots which may have their own link types and tooltips. But I don't want to over-complicate this tutorial, so for the images, that's it.


Empty lines, multiple blanks and special characters

Empty lines are simple: you create a <para> tag and close it right after the opening tag.

Code:
<para styleclass="Normal"></para>

But if you want to insert multiple blanks, it gets more difficult: multiple blanks are - like in HTML - simply ignored, because they are so-called "white-space". The multiple spaces in this example will have vanished after the import:

Code:
<para styleclass="Normal">
   <text styleclass="Normal" translate="true">Text with multiple            blanks.</text>
</para>

In order to retain multiple blanks, you need to encode them as "fixed spaces" or as an alternating series of spaces and fixed spaces:

Code:
<para styleclass="Normal">
   <text styleclass="Normal" translate="true">Text with multiple &#160; &#160; &#160; &#160; blanks.</text>
</para>


The fixed spaces are encoded as "&#160;". Like in HTML, the "&#" introduces a numeric character reference, "160" is the decimal code of the character, and the semicolon ends the reference.
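
When the text comes from a program, the alternating pattern can be produced automatically. This is a minimal sketch, assuming Python with the standard-library re and xml.etree.ElementTree modules; keep_blanks is a hypothetical helper name. It relies on the fact that ElementTree's tostring(), called without an encoding argument, writes ASCII and turns character 160 into a numeric character reference.

```python
import re
import xml.etree.ElementTree as ET

NBSP = "\u00a0"  # character code 160, the "fixed space"


def keep_blanks(text):
    """Replace each run of 2+ spaces with alternating normal and fixed spaces."""
    def encode_run(match):
        length = len(match.group(0))
        # Even positions stay normal spaces, odd positions become fixed spaces.
        return "".join(" " if i % 2 == 0 else NBSP for i in range(length))
    return re.sub(r" {2,}", encode_run, text)


text = ET.Element("text", {"styleclass": "Normal"})
text.text = keep_blanks("Text with multiple    blanks.")  # four blanks
# Default tostring() output is ASCII bytes, so the fixed spaces are
# written as numeric character references.
print(ET.tostring(text).decode("ascii"))
```

Single blanks between words are left alone, so ordinary text passes through unchanged.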

This directly leads us to special characters: While I assume for this article that you have a basic knowledge of XML, let me remind you that there are 5 special characters in XML that need to be "escaped". Those are the characters used by the XML syntax itself. You cannot simply put a "< >" in the text, because this would appear as a new XML tag that has to be interpreted. Instead, you must escape these 5 characters when using them in the text.

Code:
< ... is escaped as:  &lt;
> ... is escaped as:  &gt;
' ... is escaped as:  &apos;   (single quotation mark)
" ... is escaped as:  &quot;   

And because the ampersand symbol is used for escaping, you must escape the ampersand as well:

& ... is escaped as:  &amp;


Example XML code with escaped characters:

Code:
<para styleclass="Normal">
   <text styleclass="Normal">This is an &lt;XML&gt; example for Help &amp; Manual with escaped characters.</text>
</para>
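
If you assemble the XML as plain strings in a program, you do not have to replace these characters by hand. The sketch below assumes Python and uses escape() and unescape() from the standard-library xml.sax.saxutils module; by default escape() handles &, < and >, so the two quotation marks are passed in as extra entities. The name escape_text is made up for this example.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > on its own; the quotation marks are added here.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_text(raw):
    """Escape all five special characters for use in XML text or attributes."""
    return escape(raw, QUOTES)


raw = 'This is an <XML> example for Help & Manual with "escaped" characters.'
escaped = escape_text(raw)
print(escaped)

# Reversing the mapping gives the original text back.
restored = unescape(escaped, {"&quot;": '"', "&apos;": "'"})
print(restored == raw)  # prints True
```

If you build the file with an XML library instead of string concatenation, the library normally does this escaping for you, and escaping the text yourself beforehand would escape it twice.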




Exploring the configuration overhead

When I introduced my very simple XML example at the beginning of this article, I intentionally left out the configuration section that Help & Manual puts in when creating XML. The configuration section holds those values which you usually edit in Project > Properties. And because all these settings have default values, you can omit the section in the XML file.

However, if you really want to create a project from scratch, it may be necessary to include some configuration values as well.


The <config> section

This section is usually placed at the top of the XML file and starts with <config>.

Code:
<?xml version="1.0"?>
<helpproject>

  <config>
    <config-group name="project">
      <config-value name="title">Title of your project</config-value>
    </config-group>
  </config>

  <map id="table-of-contents">
    <topicref id="1" type="topic"  ....

The section is divided into groups. Basically, the structure consists of a <config> tag that contains <config-group> tags. The <config-group> tag has one required attribute: name="xyz". Within a group are the configuration values, each in a <config-value> tag, which also has a required name="..." attribute. For instance, the project title is in the group <config-group name="project"> and the value for the project title is defined by the <config-value name="title"> tag.

It would go beyond the scope of this article to describe them all in detail. I recommend exporting XML from a simple project to find out the group and config-value names. The most important values and groups are also explained in the XML reference manual helpman_XML_ref.chm in your H&M installation folder.
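
To get an overview of the group and value names in an exported file, a few lines of code can list them. The following is a sketch only, assuming Python and the standard-library xml.etree.ElementTree module; list_config_values is a hypothetical helper, and the inline sample simply repeats the snippet from above in place of a real export.

```python
import xml.etree.ElementTree as ET


def list_config_values(root):
    """Return (group name, value name, text) for every <config-value>."""
    found = []
    for group in root.iter("config-group"):
        for value in group.findall("config-value"):
            found.append((group.get("name"), value.get("name"), value.text))
    return found


# Inline sample taken from the example above; for a real export you would
# load the exported file with ET.parse() and pass its root element instead.
sample = """<helpproject>
  <config>
    <config-group name="project">
      <config-value name="title">Title of your project</config-value>
    </config-group>
  </config>
</helpproject>"""

for group_name, value_name, text in list_config_values(ET.fromstring(sample)):
    print(f"{group_name}.{value_name} = {text}")
```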


Declaring styles in the <config> section

Above we have discussed how to create topics with paragraphs and text. Both paragraphs and text use styles for formatting. The styles are referenced by the styleclass="stylename" attribute of the <para>, <text> and other tags. How do you declare those styles when creating a project from scratch?

First of all, there are a few predefined styles. At the time of writing this article, those styles are 'Normal', 'Heading1', 'Image Caption', 'Code Example', 'Comment' and 'Notes'. If you declare the styles in the XML file, the project styles get overwritten with your declaration. If you do not declare them, they are created with defaults.

Style classes are defined in the XML file within the <config> section. The structure is as follows:

Code:

  <config>
    <styleclasses>

      <styleclass name="stylename">
        <style-set>... CSS style definitions ...</style-set>
      </styleclass>

      <styleclass>
        ...
      </styleclass>

    </styleclasses>
  </config>

Styles are grouped in a <styleclasses> section (note the plural). There is only one <styleclasses> section in the XML structure. Within this tag comes at least one <styleclass> tag (singular!). The <styleclass> tag has a required attribute name="stylename" which defines the style name as you see it in Help & Manual. It also has an optional attribute, parentclass="styleclass", which makes the style inherit properties from a parent class - similar to the definition of styles in H&M, when you open the "Edit Style" dialog box.

Within the <styleclass> tag finally comes the actual style definition, designated by a <style-set> tag. A <style-set> defines a set of formatting attributes for a particular output medium - comparable to CSS media types. That's why the <style-set> tag has a required attribute called media="mediatype". Unless you want to define different style sets for online help and print, you can always use one <style-set media="all"> tag.

The text between <style-set media="all"> and </style-set> finally contains the style attributes. And this is rather simple: it's CSS 2.0 as used in HTML, with a few custom extensions (for tab stops that are still supported in H&M but don't exist in HTML) and with a few things left out, like complicated border definitions.

If a style defines paragraph attributes (e.g. "text-align:left;" or "margin-left:20px"), it's assumed to be a paragraph style. If a style defines font properties only (e.g. "font-family:Arial; font-size:10pt; color:#000000;") it's assumed to be a font style only.

Code:

  <config>
    <styleclasses>
      <styleclass name="Normal" parentclass="">
        <style-set media="all">font-family:Arial; font-size:10pt; color:#000000; text-align:left; text-indent:0px; margin-right:0px; margin-left:0px; </style-set>
      </styleclass>
      <styleclass name="Heading1" parentclass="Normal">
        <style-set media="all">font-size:14pt; font-weight:bold; color:#ffffff; </style-set>
      </styleclass>
      <styleclass name="Image Caption" parentclass="Normal">
        <style-set media="all">font-size:8pt; font-weight:bold; </style-set>
      </styleclass>
    </styleclasses>

    ...
</config>
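
If your style definitions already exist as data in a program, the <styleclasses> section can be generated from them. This is a minimal sketch, assuming Python and the standard-library xml.etree.ElementTree module; build_styleclasses and its input layout (style name mapped to a parent class and a dictionary of CSS properties) are assumptions made for this illustration, and every style gets a single media="all" style set.

```python
import xml.etree.ElementTree as ET


def build_styleclasses(styles):
    """Build <styleclasses> from {name: (parentclass, {css property: value})}."""
    section = ET.Element("styleclasses")
    for name, (parent, css) in styles.items():
        styleclass = ET.SubElement(
            section, "styleclass", {"name": name, "parentclass": parent}
        )
        style_set = ET.SubElement(styleclass, "style-set", {"media": "all"})
        # Join the properties into one CSS string: "prop:value; prop:value;"
        style_set.text = " ".join(f"{prop}:{value};" for prop, value in css.items())
    return section


styleclasses = build_styleclasses({
    "Normal": ("", {"font-family": "Arial", "font-size": "10pt", "text-align": "left"}),
    "Heading1": ("Normal", {"font-size": "14pt", "font-weight": "bold"}),
})
print(ET.tostring(styleclasses, encoding="unicode"))
```

The result is meant to be placed inside the <config> tag, as in the example above.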



The XML file with all the examples discussed in this article can be downloaded from the original article link (requires login to the user forum).
Good luck with your XML experiments!

Alexander Halser is CEO of EC Software and chief developer of Help & Manual. He has been developing software for more than 20 years and has a special talent for user interface design.