Convert XML tags into Orange Textable annotations

Goal

Convert XML markup into Orange Textable data structures such as segments and their annotations.

Prerequisites

Some text containing XML markup has been imported into Orange Textable (see Cookbook: Text input) and possibly further processed (see Cookbook: Segmentation manipulation).

Ingredients

Widget: Extract XML
Icon: extract_xml_icon
Quantity: 1

Procedure


Figure 1: Convert XML tags into Orange Textable annotations with an instance of Extract XML

  1. Create an instance of Extract XML on the canvas.
  2. Drag and drop from the output connection (right-hand side) of the widget instance that emits the data containing XML markup (e.g. Text Field) to the Extract XML widget instance’s input connection (left-hand side).
  3. Open the Extract XML instance’s interface by double-clicking on its icon on the canvas.
  4. In the XML Extraction section, enter the name of the XML element to extract (here w).
  5. Click the Send button (or make sure the Send automatically checkbox is selected).
  6. A segmentation containing one segment for each occurrence of the specified element is then available on the Extract XML instance’s output connections; to display or export it, see Cookbook: Text output.
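
To make the result of this procedure more concrete, here is a minimal Python sketch of the same idea outside of Orange Textable. It is not the widget's implementation: it relies on the standard library's xml.etree.ElementTree, and the sample string, the function name extract_elements and the dictionary keys "content" and "annotations" are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input: a sentence whose words are marked up with <w> tags.
SAMPLE = '<s><w pos="DET">the</w> <w pos="NOUN">cat</w></s>'


def extract_elements(xml_text, tag):
    """Return one dict per occurrence of `tag` in `xml_text`.

    Each dict holds the text content of the element (tags discarded)
    and a copy of its attributes, playing the role of annotations.
    """
    root = ET.fromstring(xml_text)
    segments = []
    for element in root.iter(tag):
        segments.append({
            # itertext() yields only text nodes, so the tags are dropped.
            "content": "".join(element.itertext()),
            # Attributes become key-value annotations of the segment.
            "annotations": dict(element.attrib),
        })
    return segments


segments = extract_elements(SAMPLE, "w")
for segment in segments:
    print(segment)
```

In this sketch each w element yields one entry whose content is the word itself and whose annotations hold the pos attribute, which mirrors how the procedure above produces one annotated segment per occurrence of the element.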

Comment

  • The matched XML tags themselves are discarded from the resulting segmentation: only their content is included in the output.
  • The attributes of the XML tags are automatically converted to annotations associated with the created segments.
  • Note that it is only possible to extract instances of a single XML element type at a time (here w).
  • However, several Extract XML instances can be chained in order to extract instances of different XML elements in succession: for example, a first instance extracts div elements, a second extracts w elements, and so on. In this case, make sure that the Remove markup option is not selected, since the later instances need the nested tags to still be present in their input.
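
The point about chaining can be illustrated with a small Python sketch. It is only a rough model under stated assumptions, not Orange Textable code: the extract function and its remove_markup parameter are hypothetical, the latter is assumed to strip any tags nested in the extracted content, and the regular expression handles only simple, non-nested occurrences of the requested element.

```python
import re


def extract(text, tag, remove_markup=False):
    """Return the content of each <tag ...>...</tag> occurrence in `text`.

    Hypothetical helper: handles only non-nested occurrences of `tag`.
    If `remove_markup` is True, tags nested in the content are stripped.
    """
    pattern = re.compile(
        r"<{0}(?:\s[^>]*)?>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL
    )
    contents = [match.group(1) for match in pattern.finditer(text)]
    if remove_markup:
        contents = [re.sub(r"<[^>]+>", "", content) for content in contents]
    return contents


# Hypothetical sample input: two div elements containing w elements.
SAMPLE = (
    '<div type="a"><w>the</w> <w>cat</w></div>'
    '<div type="b"><w>sleeps</w></div>'
)

# Chain with markup kept: the second pass still finds the w elements.
divs_kept = extract(SAMPLE, "div")
words_kept = [w for div in divs_kept for w in extract(div, "w")]

# Chain with markup removed in the first pass: nothing is left to extract.
divs_stripped = extract(SAMPLE, "div", remove_markup=True)
words_lost = [w for div in divs_stripped for w in extract(div, "w")]

print(words_kept)
print(words_lost)
```

With markup kept, the second pass recovers the three words; with markup removed in the first pass, the second pass returns an empty list because the w tags are already gone.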