Category

_images/Category_54.png

Build a table with categories defined by segments’ content or annotations.

Signals

Inputs:

  • Segmentation (multiple)

    Segmentation whose segments constitute the basis for category extraction.

Outputs:

  • Textable table

    Table displaying the extracted categories

Description

This widget inputs one or several segmentations and outputs a tabulated representation of categories associated to the segments of one of them; categories are typically defined on the basis of their annotation values of segments for a given annotation key, but may also be defined on the basis of the content of segments.

Typically, tables produced by the Category widget are destined to be merged (by means of the built-in Merge Data widget of Orange Canvas) with quantitative tables produced by widgets Count, Length, or Variety, in order to associate with each row the piece of categorical information required to train a text classifier (i.e. a system able to automatically predict the membership of a text to a category based on the quantitative profile associated with it). Here is an example of a table with this structure, where the second column would have been constructed by an instance of Category, and the columns to its right by an instance of Count:

__context__ __category__ noun verb ...
text1 news 35 12 ...
text2 news 20 8 ...
text3 poetry 27 18 ...
... ... ... ... ...

The tables produced by this widget only contain two columns. The first (header __context__) contains the headers corresponding to the contexts – which are essentially defined in the same way as with the Containing segmentation mode of widgets Count, Length, and Variety: by the segment types appearing in a segmentation. The second column (header __category__) contains the annotation(s) associated with each segment type.

To take a simple example, consider two segmentations of the string a simple example [1]:

  1. label = words
content start end part of speech word category
a 1 1 article grammatical
simple 3 8 adjective lexical
example 10 16 noun lexical
  1. label = letters (extract)
content start end letter category
a 1 1 vowel
s 3 3 consonant
i 4 4 vowel
... ... ... ...
e 16 16 vowel

Based on the latter segmentation, we can produce the following table, giving the annotation value associated with the key letter category for each distinct letter:

__context__ __category__
a vowel
s consonant
i vowel
m consonant
p consonant
l consonant
e vowel
x consonant

In this illustration, each letter is only associated to a single category. In a more general case, the contexts can be associated to several categories; for example, if the contexts are defined based on the word category annotation of the words segmentation and the extracted categories are defined as the segment contents of the letters segmentation:

__context__ __category__
grammatical a
lexical e-m-l-p-a-i-s-x

In this case, the user will have to choose (a) the order (frequential or ASCII-betical) in which the multiple values will be sorted and (b) whether they should all be shown or only the first (in the selected order).

The widget interface (see figure 1) has three separate sections, for unit specification (Units), for multiple values processing specification (Multiple Values), and for context specification (Contexts).

In the Units section, the Segmentation drop-down menu allows the user to select among the input segmentations the one whose segments will be examined to determine the categories. The Annotation key menu shows the possible annotation keys associated to the chosen segmentation; if one of these keys is selected, the corresponding annotation values will be used; if on the other hand the value (none) is selected, the content of the segments will be used. The Sequence length drop-down menu allows the user to indicate if the widget should consider the isolated segments or the n–grams of segments. In this latter case, the (optional) string specified in the Intra-sequence delimiter text field will be used to separate the content or the annotation value corresponding to each individual segment.

interface of the Category widget

Figure 1: Interface of the Category widget.

In the Multiple Values section, the Sort by drop-down menu allows the user to select the sorting criteria of multiple values, namely either the frequency (Frequency) or the ASCII order (ASCII). The Sort in reverse order checkbox reverses the sorting order, and the Keep only first value checkbox allows the program to retain only the first value (in the selected order). The Value delimiter field is used to indicate the character string to insert in-between multiple values.

Unlike other table contruction widgets , here the context specification can only be done in relation to a segmentation containing the unit segmentation (thus the equivalent of the Containing segmentation mode of widgets Count, Length, and Variety:). This segmentation is selected among the input segmentation by means of the Segmentation drop-down menu. The Annotation key menu shows the possible annotation keys associated to the selected segmentation; if one of these keys is selected, the corresponding annotation values will will constitute the row headers; if on the other hand the value (none) is selected, the content of the segments will be used.

The Info section indicates if a table has been correctly emitted, or the reasons why no table is emitted (no input data, typically).

The Compute button triggers the emission of a table in the internal format of Orange Textable, to the output connection(s). When it is selected, the Compute automatically checkbox disables the button and the widget attempts to automatically emit a segmentation at every modification of its interface or when its input data are modified (by deletion or addition of a connection, or because modified data is received through an existing connection).

Messages

Information

Data correctly sent to output.
This confirms that the widget has operated properly.
Settings were (or Input has) changed, please click ‘Compute’ when ready.
Settings and/or input have changed but the Compute automatically checkbox has not been selected, so the user is prompted to click the Compute button (or equivalently check the box) in order for computation and data emission to proceed.
No data sent to output yet: no input segmentation.
The widget instance is not able to emit data to output because it receives none on its input channel(s).
No data sent to output yet, see ‘Widget state’ below.
A problem with the instance’s parameters and/or input data prevents it from operating properly, and additional diagnostic information can be found in the Widget state box at the bottom of the instance’s interface (see Warnings below).

Warnings

Resulting table is empty.
No table has been emitted because the widget instance couldn’t find a single element in its input segmentation(s). A likely cause for this problem (when using the Containing segmentation mode) is that the unit and context segmentations do not refer to the same strings, so that the units are in effect not contained in the contexts. This is typically a consequence of the improper use of widgets Preprocess and/or Recode (see Caveat).

Footnotes

[1]By convention, we do not indicate here the string index associated with each segment but only its start and end positions, along with the various annotation values associated with it; moreover, for the sake of readability, we do indicate the content of each segment, though it is not formally part of the segmentation (but rather of the string to which the segmentation refers).