.. meta:: :description: Orange Textable documentation, Cooccurrence widget :keywords: Orange, Textable, documentation, Cooccurrence, widget .. _Cooccurrence: Cooccurrence ============ .. image:: figures/Cooccurrence_54.png Measure the cooccurrence of segments in documents. Signals ------- Inputs: * ``Segmentation`` (multiple) Segmentation whose segments constitute the units subject to measurement of their cooccurrence or the contexts in which unit cooccurrence will be measured Outputs: * ``Pivot Crosstab`` Table displaying the cooccurrence of units in the defined context Description ----------- This widget inputs one or several segmentations, measures the number of documents in which the input segments occur simultaneously, and sends the result in the form of a *cooccurrence matrix* [1]_. The cooccurrence matrix produced by this widget is of *IntPivotCrosstab* type, a subtype of the generic *Table* format (see :ref:`Convert` widget, section :ref:`Table formats `). Since this table is a cooccurrence matrix, both rows and columns correspond to *unit* types. The cell at the intersection of a given column and row represents the number of documents (*context* types) in which these two *unit* types occur simultaneously. As the measure of cooccurrence represents absolute frequency, the resulting table contains integer numbers, and as such it is of *IntPivotCrosstab* type, a subclass of *PivotCrosstab*. To take a simple example, consider two segmentations of the string *a simple example* [2]_: A) label = *words* =========== ======= ===== ================== ================= content start end *part of speech* *word category* =========== ======= ===== ================== ================= *a* 1 1 *article* *grammatical* *simple* 3 8 *adjective* *lexical* *example* 10 16 *noun* *lexical* =========== ======= ===== ================== ================= B) label = *letters* (extract) ========= ======= ===== =================== content start end *letter category* ========= ======= ===== =================== *a* 1 1 *vowel* *s* 3 3 *consonant* *i* 4 4 *vowel* ... ... ... ... *e* 16 16 *vowel* ========= ======= ===== =================== Typically, we could define unit types based on the content of the segments of the *letters* segmentation. As for the context types, there are two distinct forms of contexts for measuring the cooccurrence of the units: * **Sliding window*** **Containing segmentation** **Sliding window** relies on the notion of a "window" of *n* segments that we progressively "slide" from the beginning to the end of the segmentation. In our example, by applying this principle to the *letters* segmentation and by setting the window size to 3 segments, we thus define the following contexts: 1. *a si* 2. *sim* 3. *imp* 4. *mpl* 5. *ple* 6. *le e* 7. *e ex* 8. *exa* 9. *xam* 10. *amp* 11. *mpl* 12. *ple* Considering the letter segmentation as that of the unit types, we would obtain the following cooccurrence matrix[3]_: .. csv-table:: :header: *__unit__*, *a*, *s*, *i*, *m*, *p*, *l*, *e*, *x* :stub-columns: 1 :widths: 1 1 1 1 1 1 1 1 1 *a*, 4, 1, 1, 2, 1, 0, 1, 2 *s*, 1, 2, 2, 1, 0, 0, 0, 0 *i*, 1, 2, 3, 2, 1, 0, 0, 0 *m*, 2, 1, 2, 6, 4, 2, 0, 1 *p*, 1, 0, 1, 4, 6, 4, 2, 0 *l*, 0, 0, 0, 2, 4, 5, 3, 0 *e*, 1, 0, 0, 0, 2, 3, 5, 2 *x*, 2, 0, 0, 1, 0, 0, 2, 3 Alternatively, we could consider the *annotation values* of the units instead of their content. For example, by defining units based on the annotations associated to the key *letter category* in the *letters* segmentation, and choosing the mode **Sliding window** for the context with the window size of 3 (see :ref:`figure 1 `), we would obtain the following cooccurrence matrix: .. csv-table:: :header: *__unit__*, *vowel*, *consonant* :stub-columns: 1 :widths: 3 2 3 *vowel*, 10, 10 *consonant*, 10, 12 The mode **Containing segmentation** consists in measuring the cooccurrence of units in context defined by another segmentation. In the above example we consider *letter* as the segmentation for unit types and *word* as the segmentation for context types, and thus the following cooccurrence matrix will be obtained and is symmetric by definition: .. csv-table:: :header: *__unit__*, *a*, *s*, *i*, *m*, *p*, *l*, *e*, *x* :stub-columns: 1 :widths: 1 1 1 1 1 1 1 1 1 *a*, 2, 0, 0, 1, 1, 1, 1, 1 *s*, 0, 1, 1, 1, 1, 1, 1, 1 *i*, 0, 1, 1, 1, 1, 1, 1, 1 *m*, 1, 1, 1, 2, 2, 2, 2, 1 *p*, 1, 1, 1, 2, 2, 2, 2, 1 *l*, 1, 1, 1, 2, 2, 2, 2, 1 *e*, 1, 1, 1, 2, 2, 2, 2, 1 *x*, 1, 0, 0, 1, 1, 1, 1, 1 Each cell at the above table represents the number of words (segments of the context types) in which the unit in the column and the unit in the row are used simultaneously. For example, "2" in the fifth column and forth row, shows that there are two words in which *p* and *m* occur together. In the **Containing segmentation** mode, it is also possible to measure the cooccurrence of units belonging to distinct segmentations. For instance this would enable us to know how many times a given vowel and a given consonant occur simultaneously in each word. By ticking the **Secondary units** checkbox in the interface of the widget, we will be able to define a segmentation for secondary unit types. In this case, the resulting cooccurrence matrix will no longer be symmetric. Therefore, in the above example, vowels as the primary units segmentation constitute the rows, and consonants as the secondary units segmentation constitute the columns of the resulting cooccurrence matrix (see :ref:`figure 2 `): .. csv-table:: :header: *__unit__*, *s*, *m*, *p*, *l*, *x* :stub-columns: 1 :widths: 1 1 1 1 1 1 *a*, 0, 1, 1, 1, 1 *i*, 1, 1, 1, 1, 0 *e*, 1, 2, 2, 2, 1 As mentioned in the **Sliding window** mode, it is always possible to measure the cooccurrence of the annotation values of the units (primary and secondary) and those of the contexts instead of the content of segments. In the case of the above example with the secondary units, the resulting crosstab consists of only one cell indicating the number of words in which every letter with *vowel* and every letter with *consonant* annotation value have occurred at the same time: .. csv-table:: :header: *__unit__*, *consonant* :stub-columns: 1 :widths: 2 3 *vowel*, 2 Note that it is up to the user to provide a coherent definition of the units and contexts. In general, there are three conditions to be met in this respect: (a) the segment corresponding to the unit and the context are both associated to the same string, (b) the initial position of the unit segment in the string is higher or equal to that of the context segment, and (c) conversely the final position of the unit is lower or equal to that of the context. In short, the unit must be *contained* within the context. It is also noteworthy that in order to measure the cooccurrence, it is by definition necessary to specify a context. The context is set to the **Sliding window** mode by default. Finally, in every scenario considered here, we could also take an interest for the cooccurrence of sequences of 2, 3, ..., *n* segments (or *n--grams*) rather than for the frequency of isolated segments. The cooccurrence matrix of bigrams in **Sliding window** (size 3) is illustrated below: .. csv-table:: :header: *__unit__*, *as*, *si*, *im*, *mp*, *pl*, *le*, *ee*, *ex*, *xa*, *am* :stub-columns: 1 :widths: 1 1 1 1 1 1 1 1 1 1 1 *as*, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 *si*, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0 *im*, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0 *mp*, 0, 0, 1, 4, 2, 0, 0, 0, 0, 0 *pl*, 0, 0, 0, 2, 4, 2, 0, 0, 0, 0 *le*, 0, 0, 0, 0, 2, 3, 1, 0, 0, 0 *ee*, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0 *ex*, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0 *xa*, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1 *am*, 0, 0, 0, 1, 0, 0, 0, 0, 1, 2 Hereafter the interface of the widget will be introduced (see figures :ref:`1 ` to :ref:`4 `). It contains three separate sections for unit definition (**Units** and **Secondary units**) and context definition (**Contexts**). .. _cooccurrence_fig1: .. figure:: figures/cooc_example.png :align: center :alt: Cooccurrence widget in the default mode("Sliding window") Figure 1: **Cooccurrence** widget (**Sliding window** mode as the default mode). In the **Units** section, the **Segmentation** drop-down menu allows the user to select among the input segmentations, the one whose segment types will be subject to the cooccurrence measurement. The **Annotation key** menu displays the annotation keys associated to the chosen segmentation, if any; if one of the keys is selected, the corresponding annotation values will be considered; if on the other hand the value *(none)* is selected, the *content* of the segments will be taken into consideration. The **Sequence length** drop-down menu allows the user to indicate if isolated segments or segment *n--grams* should be considered; in the latter case, the (optional) string specified in the **Intra sequence delimiter** text field will be used to separate the content or the annotation value corresponding to each segment in the table headers. The **Secondary units** section has almost the same characteristics as the **Units** section, except for the fact that there is no **Sequence length** menu. This section is by default disabled due to the default mode of the **Contexts** section being **Sliding window**, in which only one unit segmentation can be considered for the measure of cooccurrence (see :ref:`figure 1 `). When changing the mode to *Containing segmentation*, the box becomes automatically enabled (see :ref:`figure 2 `). .. _cooccurrence_fig2: .. figure:: figures/cooc_secondary_units_example.png :align: center :alt: Secondary units box of Cooccurrence widget in mode "Sliding window" Figure 2: **Secondary units** box of **Cooccurrence** widget. The **Contexts** section is available in two forms, depending on the selected value in the **Mode** drop-down menu. This allows the user to choose between the two possible ways of defining contexts described earlier. The **Sliding window** mode (see :ref:`figure 3 `) implements the notion of a "sliding window" introduced earlier. Typically, it allows the user to observe the cooccurrence of the unit types with one another throughout the unit segmentation. The only parameter is the window size (in number of segments), defined by the **Window size** cursor, set to 2 by default. .. _cooccurrence_fig3: .. figure:: figures/cooc_mode_sliding_window_example.png :align: center :alt: Cooccurrence widget in mode "Sliding window" Figure 3: **Cooccurrence** widget (**Sliding window** mode). .. _cooccurrence_fig4: .. figure:: figures/cooc_mode_containing_segmentation_example.png :align: center :alt: Cooccurrence widget in mode "Containing segmentation" Figure 4: **Cooccurrence** widget (**Containing segmentation** mode). Finally, the **Containing segmentation** mode (see :ref:`figure 4 `) corresponds to the case where contexts are defined by the segment types that appear in another segmentation. This segmentation is selected among the input segmentations by means of the **Segmentation** drop-down menu. The **Annotation key** menu displays the annotation keys associated with the context segmentation, if any; if one of the keys is selected, the corresponding annotation value types will constitute the row headers; otherwise the value *(none)* is selected so that *content* of the segments will be exploited. The **Info** section indicates whether or not the data is correctly sent to the output table. If not, the respective error message will be given. The **Compute** button triggers the emission of a table in the internal format of Orange Textable, to the output connection(s). When it is selected, the **Compute automatically** checkbox disables the button and the widget attempts to automatically emit a segmentation at every modification of its interface or when its input data are modified (by deletion or addition of a connection, or because modified data is received through an existing connection). Messages -------- Information ~~~~~~~~~~~ *Data correctly sent to output.* This confirms that the widget has operated properly. *Settings were* (or *Input has*) *changed, please click 'Compute' when ready.* Settings and/or input have changed but the **Compute automatically** checkbox has not been selected, so the user is prompted to click the **Compute** button (or equivalently check the box) in order for computation and data emission to proceed. *No data sent to output yet: no input segmentation.* The widget instance is not able to emit data to output because it receives none on its input channel(s). *No data sent to output yet, see 'Widget state' below.* A problem with the instance's parameters and/or input data prevents it from operating properly, and additional diagnostic information can be found in the **Widget state** box at the bottom of the instance's interface (see `Warnings`_ below). Warnings ~~~~~~~~ *Resulting table is empty.* No table has been emitted because the widget instance couldn't find a single element in its input segmentation(s). A likely cause for this problem (when using the **Containing segmentation** mode) is that the unit and context segmentations do not refer to the same strings, so that the units are in effect *not* contained in the contexts. This is typically a consequence of the improper use of widgets :ref:`Preprocess` and/or :ref:`Recode` (see :ref:`anchor_to_caveat`). See also -------- * :ref:`Reference: Convert widget (section "Table formats") ` Footnotes --------- .. [1] The definition of cooccurrence may vary depending on the discipline in which this notion is used. In text analytics, the cooccurrence is the number of the documents in which two textual units simultaneously occur. Here by convention, cooccurrence is the dot product of the transposed term-document matrix with itself, which is symmetric when considering only one unit type. As a result, and contrary to other definitions, the diagonal members of the matrix are not zero; rather, they indicate the document frequency of the corresponding textual unit (i.e. the number of context types in which it occurs). .. [2] By convention, we do not indicate here the string index associated with each segment but only its start and end positions, along with the various annotation values associated with it; moreover, for the sake of readability, we do indicate the content of each segment, though it is not formally part of the segmentation (but rather of the string to which the segmentation refers). .. [3] The first column header, *__unit__*, is a name predefined by Orange Textable.