.. meta::
   :description: Orange Textable documentation, Category widget
   :keywords: Orange, Textable, documentation, Category, widget

.. _Category:

Category
========

.. image:: figures/Category_54.png

Build a table with categories defined by segments' content or annotations.

Signals
-------

Inputs:

* ``Segmentation`` (multiple)

  Segmentation whose segments constitute the basis for category extraction.

Outputs:

* ``Textable table``

  Table displaying the extracted categories

Description
-----------

This widget inputs one or several segmentations and outputs a tabulated
representation of *categories* associated with the segments of one of them;
categories are typically defined on the basis of the annotation values of
segments for a given annotation key, but may also be defined on the basis of
the content of segments.

Typically, tables produced by the **Category** widget are destined to be
merged (by means of the built-in **Merge Data** widget of Orange Canvas) with
quantitative tables produced by widgets :ref:`Count`, :ref:`Length`, or
:ref:`Variety`, in order to associate with each row the piece of categorical
information required to train a text classifier (i.e. a system able to
automatically predict the category to which a text belongs, based on the
quantitative profile associated with it). Here is an example of a table with
this structure, where the second column would have been constructed by an
instance of **Category**, and the columns to its right by an instance of
:ref:`Count`:

.. csv-table::
    :header: *__context__*, *__category__*, *noun*, *verb*, ...
    :stub-columns: 1
    :widths: 4 4 2 2 1

    *text1*, *news*, 35, 12, ...
    *text2*, *news*, 20, 8, ...
    *text3*, *poetry*, 27, 18, ...
    ..., ..., ..., ..., ...

The tables produced by this widget contain only two columns.
The first (header *__context__*) contains the headers corresponding to the
contexts, which are essentially defined in the same way as with the
**Containing segmentation** mode of widgets :ref:`Count`, :ref:`Length`, and
:ref:`Variety`: by the segment types appearing in a segmentation. The second
column (header *__category__*) contains the annotation(s) associated with each
segment type.

To take a simple example, consider two segmentations of the string
*a simple example* [#]_:

A) label = *words*

=========== ======= ===== ================== =================
content     start   end   *part of speech*   *word category*
=========== ======= ===== ================== =================
*a*         1       1     *article*          *grammatical*
*simple*    3       8     *adjective*        *lexical*
*example*   10      16    *noun*             *lexical*
=========== ======= ===== ================== =================

B) label = *letters* (extract)

========= ======= ===== ===================
content   start   end   *letter category*
========= ======= ===== ===================
*a*       1       1     *vowel*
*s*       3       3     *consonant*
*i*       4       4     *vowel*
...       ...     ...   ...
*e*       16      16    *vowel*
========= ======= ===== ===================

Based on the latter segmentation, we can produce the following table, giving
the annotation value associated with the key *letter category* for each
distinct letter:

.. csv-table::
    :header: *__context__*, *__category__*
    :stub-columns: 1
    :widths: 8 9

    *a*, *vowel*
    *s*, *consonant*
    *i*, *vowel*
    *m*, *consonant*
    *p*, *consonant*
    *l*, *consonant*
    *e*, *vowel*
    *x*, *consonant*

In this illustration, each letter is associated with a single category. In
the more general case, a context can be associated with several categories;
for example, if the contexts are defined based on the *word category*
annotation of the *words* segmentation and the extracted categories are
defined as the segment contents of the *letters* segmentation:

.. csv-table::
    :header: *__context__*, *__category__*
    :stub-columns: 1
    :widths: 8 9

    *grammatical*, *a*
    *lexical*, *e-m-l-p-a-i-s-x*

In this case, the user will have to choose (a) the order (frequential or
ASCII-betical) in which the multiple values will be sorted and (b) whether
they should all be shown or only the first (in the selected order).

The widget interface (see :ref:`figure 1 <category_fig1>`) has three separate
sections, for unit specification (**Units**), for multiple values processing
specification (**Multiple Values**), and for context specification
(**Contexts**).

In the **Units** section, the **Segmentation** drop-down menu allows the user
to select, among the input segmentations, the one whose segments will be
examined to determine the categories. The **Annotation key** menu shows the
annotation keys associated with the chosen segmentation; if one of these keys
is selected, the corresponding annotation values will be used; if on the
other hand the value *(none)* is selected, the *content* of the segments will
be used. The **Sequence length** drop-down menu allows the user to indicate
whether the widget should consider isolated segments or *n-grams* of
segments. In the latter case, the (optional) string specified in the
**Intra-sequence delimiter** text field will be used to separate the content
or the annotation value corresponding to each individual segment.

.. _category_fig1:

.. figure:: figures/category_widget.png
    :align: center
    :alt: interface of the Category widget

    Figure 1: Interface of the **Category** widget.

In the **Multiple Values** section, the **Sort by** drop-down menu allows the
user to select the sorting criterion for multiple values, namely either
frequency (**Frequency**) or ASCII order (**ASCII**). The **Sort in reverse
order** checkbox reverses the sorting order, and the **Keep only first
value** checkbox makes the widget retain only the first value (in the
selected order).
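
The processing of multiple values described above can be sketched in plain
Python. This is only an illustration under stated assumptions, not the
widget's actual implementation: the function ``build_category_table`` and its
parameters are hypothetical names, frequency sorting is assumed to place the
most frequent value first, and ties between equally frequent values are
broken alphabetically here, so their order may differ from what the widget
displays.

.. code-block:: python

    from collections import Counter, defaultdict


    def build_category_table(pairs, sort_by="frequency", reverse=False,
                             keep_first=False, delimiter="-"):
        """Map each context to a delimited string of its category values.

        pairs: iterable of (context, value) tuples, one per unit segment.
        All names here are hypothetical; this is not the Textable API.
        """
        counts = defaultdict(Counter)
        for context, value in pairs:
            counts[context][value] += 1

        table = {}
        for context, counter in counts.items():
            if sort_by == "frequency":
                # Assumption: most frequent first, ties broken alphabetically.
                ordered = sorted(counter, key=lambda v: (-counter[v], v))
            else:
                # Plain code-point ("ASCII") order.
                ordered = sorted(counter)
            if reverse:
                ordered.reverse()
            if keep_first:
                ordered = ordered[:1]
            table[context] = delimiter.join(ordered)
        return table


    # (word category, letter) pairs for the string "a simple example".
    words = [("a", "grammatical"), ("simple", "lexical"),
             ("example", "lexical")]
    pairs = [(category, letter) for word, category in words
             for letter in word]

    by_frequency = build_category_table(pairs)
    by_ascii = build_category_table(pairs, sort_by="ASCII")
    first_only = build_category_table(pairs, keep_first=True)

    print(by_frequency)
    print(by_ascii)
    print(first_only)

With these assumptions, the *lexical* context starts with *e* (the most
frequent letter) under frequency sorting and with *a* under ASCII sorting;
``keep_first=True`` retains only that first value for each context.
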
The **Value delimiter** field is used to indicate the character string to
insert between multiple values.

In the **Contexts** section, unlike in other table construction widgets, the
context specification can only be done in relation to a segmentation
containing the unit segmentation (thus the equivalent of the **Containing
segmentation** mode of widgets :ref:`Count`, :ref:`Length`, and
:ref:`Variety`). This segmentation is selected among the input segmentations
by means of the **Segmentation** drop-down menu. The **Annotation key** menu
shows the annotation keys associated with the selected segmentation; if one
of these keys is selected, the corresponding annotation values will
constitute the row headers; if on the other hand the value *(none)* is
selected, the *content* of the segments will be used.

The **Info** section indicates whether a table has been correctly emitted, or
the reasons why no table is emitted (no input data, typically).

The **Compute** button triggers the emission of a table in the internal
format of Orange Textable, to the output connection(s). When it is selected,
the **Compute automatically** checkbox disables the button and the widget
attempts to automatically emit a table at every modification of its interface
or when its input data are modified (by deletion or addition of a connection,
or because modified data is received through an existing connection).

Messages
--------

Information
~~~~~~~~~~~

*Data correctly sent to output.*
    This confirms that the widget has operated properly.

*Settings were* (or *Input has*) *changed, please click 'Compute' when ready.*
    Settings and/or input have changed but the **Compute automatically**
    checkbox has not been selected, so the user is prompted to click the
    **Compute** button (or equivalently check the box) in order for
    computation and data emission to proceed.

*No data sent to output yet: no input segmentation.*
    The widget instance is not able to emit data to output because it
    receives none on its input channel(s).

*No data sent to output yet, see 'Widget state' below.*
    A problem with the instance's parameters and/or input data prevents it
    from operating properly, and additional diagnostic information can be
    found in the **Widget state** box at the bottom of the instance's
    interface (see `Warnings`_ below).

Warnings
~~~~~~~~

*Resulting table is empty.*
    No table has been emitted because the widget instance couldn't find a
    single element in its input segmentation(s). A likely cause for this
    problem (when using the **Containing segmentation** mode) is that the
    unit and context segmentations do not refer to the same strings, so that
    the units are in effect *not* contained in the contexts. This is
    typically a consequence of the improper use of widgets :ref:`Preprocess`
    and/or :ref:`Recode` (see :ref:`anchor_to_caveat`).

Footnotes
---------

.. [#] By convention, we do not indicate here the string index associated
       with each segment but only its start and end positions, along with the
       various annotation values associated with it; moreover, for the sake
       of readability, we do indicate the content of each segment, though it
       is not formally part of the segmentation (but rather of the string to
       which the segmentation refers).