.. meta::
   :description: Orange Textable documentation, Convert widget
   :keywords: Orange, Textable, documentation, Convert, widget

.. _Convert:

Convert
=======

.. image:: figures/Convert_54.png

Convert, transform, or export Orange Textable tables.

Signals
-------

Inputs:

* ``Textable Table``

  Table in the internal format of Orange Textable.

Outputs:

* ``Orange Table`` (default)

  Data in the standard *Table* format of Orange Canvas (possibly transformed).

* ``Textable Table``

  Table in the internal format of Orange Textable (possibly transformed).

* ``Segmentation``

  Segmentation containing the output table in tab-delimited format.

Description
-----------

**Convert** takes data in the internal format of Orange Textable as input and lets the user modify them (sorting, normalization, etc.), convert them to other formats, in particular the standard *Table* format of Orange Canvas (suitable for further processing within Orange Canvas), or export them in tab-delimited text format (either to a file or to the clipboard).

.. _anchor_to_table_formats:

Table formats
~~~~~~~~~~~~~

The table representation format of Orange Canvas (*Table* type) has compatibility issues with Unicode-encoded data. Since this encoding is emerging as the most widely used standard for the languages of the world, Orange Textable provides its own Unicode-friendly table representation format. The :ref:`Count`, :ref:`Length`, :ref:`Variety`, :ref:`Category`, and :ref:`Context` widgets thus produce tables in Orange Textable format. To be manipulated by the numerous tabulated data processing widgets offered by Orange Canvas, these data must be converted to the standard *Table* format of Orange Canvas (and to an encoding supported by that format).

Note that the internal Orange Textable *Table* type is subdivided into several subtypes. In particular, contingency tables (see the :ref:`Count` widget) belong to the *Crosstab* subtype, which is itself subdivided into *PivotCrosstab*, *FlatCrosstab*, and *WeightedFlatCrosstab*. These three subtypes are equivalent with regard to the information they can store, and the easiest way to understand what differentiates them is to look at an example. Consider the following contingency table, of *IntPivotCrosstab* [#]_ type (such as produced by the :ref:`Count` widget):

.. csv-table::
    :header: *__context__*, *unit1*, *unit2*
    :stub-columns: 1
    :widths: 3 2 2

    *context1*, 1, 3
    *context2*, 2, 1

Here is the same information converted to *FlatCrosstab* format:

.. csv-table::
    :header: *__id__*, *__unit__*, *__context__*
    :stub-columns: 1
    :widths: 2 3 3

    1, *unit1*, *context1*
    2, *unit2*, *context1*
    3, *unit2*, *context1*
    4, *unit2*, *context1*
    5, *unit1*, *context2*
    6, *unit1*, *context2*
    7, *unit2*, *context2*

This representation contains three columns with the headers *__id__*, *__unit__* and *__context__*, and a number of rows equal to the total count of the contingency table. It is the standard way of encoding a contingency table in Orange Canvas, and it is required by widgets such as *Correspondence Analysis* (after conversion to the *Table* type defined by Orange Canvas).

The *WeightedFlatCrosstab* format produces a more compact representation by keeping only one copy of each distinct unit--context pair and by adding a column *__weight__* that stores the number of repetitions of each pair:

.. csv-table::
    :header: *__id__*, *__unit__*, *__context__*, *__weight__*
    :stub-columns: 1
    :widths: 2 3 3 3

    1, *unit1*, *context1*, 1
    2, *unit2*, *context1*, 3
    3, *unit1*, *context2*, 2
    4, *unit2*, *context2*, 1

This format is sometimes used to represent contingency tables in third-party data analysis software. It is often called "sparse" matrix format.

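
The relation between the three representations can be sketched in plain Python. This is only an illustration of the example tables above: the nested-dictionary layout and the function names ``to_weighted_flat`` and ``to_flat`` are assumptions made for this sketch and are not part of the Orange Textable API.

```python
# Contingency table from the example above, stored as {context: {unit: count}}.
# This nested-dict layout is an assumption made for illustration only.
pivot = {
    "context1": {"unit1": 1, "unit2": 3},
    "context2": {"unit1": 2, "unit2": 1},
}


def to_weighted_flat(table):
    """Return one row per unit-context pair with a nonzero count."""
    rows = []
    for context, counts in table.items():
        for unit, count in counts.items():
            if count > 0:
                rows.append({
                    "__id__": len(rows) + 1,
                    "__unit__": unit,
                    "__context__": context,
                    "__weight__": count,
                })
    return rows


def to_flat(table):
    """Return one row per occurrence, repeating each pair `count` times."""
    rows = []
    for context, counts in table.items():
        for unit, count in counts.items():
            for _ in range(count):
                rows.append({
                    "__id__": len(rows) + 1,
                    "__unit__": unit,
                    "__context__": context,
                })
    return rows


weighted = to_weighted_flat(pivot)
flat = to_flat(pivot)
print(len(weighted), len(flat))  # 4 7
```

The number of rows in the flat version equals the sum of the weights in the weighted version, which is why both formats carry the same information as the pivot table.
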
Output channels
~~~~~~~~~~~~~~~

Regardless of the selected output table format (or the transforms that have been applied to the data, see `Advanced interface`_ below), the **Convert** widget emits data on three distinct output channels:

* The default output channel (*Orange Table*) emits data converted to the standard *Table* format of Orange Canvas; it is typically used for passing them to built-in Orange Canvas table processing widgets.

* The *Textable Table* channel outputs a table in the internal format of Orange Textable (usually after applying some set of transforms); it can then be sent to another instance of **Convert** (in cases where it is useful to apply transforms in distinct steps) or to an instance of the built-in **Python script** widget of Orange Canvas, in order to access the content of the table programmatically.

* The *Segmentation* channel emits a segmentation with a single segment enclosing a version of the (possibly transformed) table in tab-delimited text format (in UTF-8 encoding), which is suitable for further textual processing using Orange Textable widgets such as :ref:`Recode` or :ref:`Segment`.

Basic interface
~~~~~~~~~~~~~~~

The basic version of the widget (see :ref:`figure 1 <convert_fig1>` below) is essentially limited to the **Encoding** section, which allows the user to select an encoding for the output data. This can be done separately for the data emitted in Orange Canvas *Table* format on the default output channel (**Orange table** drop-down menu) and for the data possibly exported to a text file in tab-delimited format (**Output file** drop-down menu). If certain characters cannot be converted to the specified encoding (for example accented characters in the ASCII encoding), they are automatically replaced by corresponding HTML entities (for example *&#233;* for *é*).

.. _convert_fig1:

.. figure:: figures/convert_basic_example.png
    :align: center
    :alt: Basic interface of the Convert widget

    Figure 1: **Convert** widget (basic interface).

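
The replacement of characters that cannot be represented in the target encoding can be illustrated with Python's standard ``xmlcharrefreplace`` error handler, which substitutes a numeric character reference for each such character. This is a minimal sketch: whether the widget relies on this exact handler is an assumption, and the function name ``encode_for_export`` is hypothetical.

```python
def encode_for_export(text, encoding):
    """Encode text, replacing unencodable characters by numeric character references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


sample = "caf\u00e9"  # "café"

# ASCII cannot represent "é", so it is replaced by the reference &#233;
ascii_bytes = encode_for_export(sample, "ascii")
print(ascii_bytes)  # b'caf&#233;'

# UTF-8 can represent every character, so nothing is replaced
utf8_bytes = encode_for_export(sample, "utf-8")
print(utf8_bytes.decode("utf-8") == sample)  # True
```

Choosing a Unicode encoding such as UTF-8 avoids the substitution altogether, at the cost of requiring downstream tools to support it.
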
The **Export** section allows the user to export a version of the (possibly transformed) table in tab-delimited text format, either to a text file (**Export to file**) or to the clipboard (**Copy to clipboard**), for instance in order to paste it into a spreadsheet opened in a third-party program. In the former case, the **Output file** drop-down menu (section **Encoding**) is used to indicate which encoding the data should be converted to before being saved. Unless the further processing planned for the saved data imposes a constraint (for example a specific data analysis program), it is usually best to preserve as much information as possible by specifying either the original encoding of the data or a more general encoding (a variant of Unicode for example). Note that when the data are copied to the clipboard, the UTF-8 encoding is used by default (regardless of what has been selected in the **Encoding** section).

Advanced interface
~~~~~~~~~~~~~~~~~~

The advanced version of the **Convert** widget (see :ref:`figure 2 <convert_fig2>` below) contains an additional section (**Transform**) allowing the user to apply a number of standard modifications to the incoming table. The operations defined in this section are applied to the input data in the order in which they appear in the interface, from top to bottom. The modified data can then be emitted on output connections or exported (either to a file or to the clipboard).

.. _convert_fig2:

.. figure:: figures/convert_advanced_example.png
    :align: center
    :alt: Advanced interface of the Convert widget

    Figure 2: **Convert** widget (advanced interface).

The **Sort rows by column** checkbox triggers row sorting. If it is selected, the column headers of the table appear in the drop-down menu immediately to the right, and the user can select the column on the basis of which the rows will be sorted. If the **Reverse** box to the right of the drop-down menu is checked, rows will be sorted by *decreasing* value.

**Sort columns by row** controls column sorting in a similar way. Note that in this case the first column (containing row headers) always stays in the same position; the sorting only affects the following columns. To sort the columns based on the header row, select the first option in the **Sort columns by row** drop-down menu on the right. This option typically bears a name predefined by Orange Textable that does not appear in the table itself (*__unit__* for a contingency table of *PivotCrosstab* type such as produced by the :ref:`Count` widget, and the generic header *__col__* in every other case).

The **Transpose** checkbox allows the user to transpose the table, that is, to swap its rows and columns. This option is only available for *PivotCrosstab* type contingency tables.

The **Normalize** checkbox triggers the normalization of the table (in a rather loose sense of the term); it is only applicable to *PivotCrosstab* type contingency tables. If it is selected, the user can choose in the drop-down menu immediately to the right whether the normalization should be applied by rows (**rows**) or by columns (**columns**); the **Norm** drop-down menu allows the user to select the type of normalization, either **L1** (division by the sum of the row/column) or **L2** (division by the square root of the sum of the squares of the row/column).

Three more operations (which are not usually classified as normalizations in the strict sense of the term) can be selected in the same drop-down menu, each of which deactivates the **Norm** drop-down menu on the right:

* In **quotients** mode, the count stored in each cell of a contingency table (of *PivotCrosstab* type) is divided by the corresponding "theoretical" count under the hypothesis of independence between table rows and columns. This quotient is greater than 1 if the row and the column in question attract each other, less than 1 if they repel each other, and equal to 1 if they neither attract nor repel each other.

* In **TF--IDF** mode, the count stored in each cell of a contingency table (of *PivotCrosstab* type) is multiplied by the natural log of the ratio of the total number of rows (i.e. contexts) to the number of rows having nonzero frequency for this column (i.e. unit).

* In **presence/absence** mode, counts greater than 1 are replaced by the value 1, so that the resulting table can contain only 0's and 1's.

The common property of all operations available in the **Normalize** drop-down menu is that they preserve the original dimensions of the input contingency table. By contrast, the **Convert to** checkbox (only applicable to *PivotCrosstab* type tables) allows the user to apply transforms that actually modify the dimensionality of the table:

* In **document frequency** mode, a new contingency table is created, which gives, for each column (i.e. unit), the number of distinct rows (i.e. contexts) that have nonzero frequency (hence the resulting table contains a single row).

* In **association matrix** mode, a new symmetric table is constructed, where each cell gives a measure of the (Markov) associativity between a pair of columns (i.e. units) in the original contingency table: two columns are thus strongly associated if they have similar profiles of attraction/repulsion with rows (i.e. contexts). Selecting this mode activates the **Bias** drop-down menu on the right, which allows the user to select between three predefined ways of weighting the contributions of high versus low frequencies in this computation: **frequent** emphasizes strong associations between frequent units; **none** provides a balanced compromise between frequent and rare units; **rare** emphasizes strong associations between rare units (note that in this particular case, values greater than 1 express an attraction and values less than 1 a repulsion) [#]_.

Note that the **Normalize** and **Convert to** checkboxes are mutually exclusive and deactivate one another.

Finally, the **Reformat to sparse crosstab** checkbox allows the user to convert a contingency table from the *PivotCrosstab* format to *WeightedFlatCrosstab*, or from *IntPivotCrosstab* to *IntWeightedFlatCrosstab* (see the `Table formats`_ section above). In turn, data in *IntWeightedFlatCrosstab* format can be converted to *FlatCrosstab* by further selecting the **Encode counts by repeating rows** option, which is only available for tables containing integer values.

Compared to its basic version (see `Basic interface`_ above), the advanced version of the **Export** section offers two extra controls. The **Column delimiter** drop-down menu allows the user to select the column separator that will be inserted between cell values when exporting a table in text format; possible choices are *tabulation (\\t)*, *comma (,)*, and *semicolon (;)*. The **Output Orange headers** checkbox lets the user indicate whether the output should include all header lines of the *.tab* format specific to Orange Canvas; this is useful only for re-importing the exported table using the built-in **File** widget of Orange Canvas (and is in fact often necessary in that case).
Both parameters (**Column delimiter** and **Output Orange headers**) also apply to the data sent on the *Segmentation* output channel.

The **Info** section indicates the number of rows and columns in the output table, or the reason why no table is emitted (no input data).

The **Send** button triggers data emission to the output connection(s) (see `Output channels`_ above). When it is selected, the **Send automatically** checkbox disables the button, and the widget attempts to send data automatically whenever its interface is modified or its input data change (by deletion or addition of a connection, or because modified data is received through an existing connection).

Messages
--------

Information
~~~~~~~~~~~

*Data correctly sent to output: table has <n> rows and <m> columns.*
    This confirms that the widget has operated properly.

*Settings were* (or *Input has*) *changed, please click 'Send' when ready.*
    Settings and/or input have changed but the **Send automatically** checkbox has not been selected, so the user is prompted to click the **Send** button (or equivalently check the box) in order for computation and data emission to proceed.

*No data sent to output yet: no input table.*
    The widget instance is not able to emit data to output because it receives none on its input channel(s).

See also
--------

* :doc:`Getting started: Converting between table formats `
* :doc:`Cookbook: Display table `
* :doc:`Cookbook: Export table `

Footnotes
---------

.. [#] *IntPivotCrosstab* is in turn a subtype of *PivotCrosstab* (and similarly *IntWeightedFlatCrosstab* is a subtype of *WeightedFlatCrosstab*), whose specificity is to be limited to integer values.

.. [#] For more details on the calculation of Markov associativities, see Bavaud F. and Xanthos A. (2005). Markov associativities. *Journal of Quantitative Linguistics, 12*:123--137. Details on the effect of the **bias** parameter can be found in Deneulin, P., Gautier, L., Le Fur, Y., and Bavaud, F. (2014). Corrélats textuels autour du concept de minéralité dans les vins. In Actes des 12èmes Journées internationales d'analyse statistique des données textuelles (JADT 2014), pp. 209--223; the predefined values of this parameter (**frequent**, **none**, and **rare**) correspond respectively to values 1, 0.5 and 0 of parameter *alpha* in the above-cited reference.