Rule types¶
This section describes each rule in terms of the following.
Name of rule
Required arguments
Optional arguments
Description
Notes
The types of rules supported in data preparation are as follows:
In addition to these rules, data preparation provides various expressions, thereby supporting almost every function required for general data preprocessing.
header¶
Required arguments: Row number that contains the column name (1-base)
Description
This rule sets the content in the designated row as the column name.
This is useful for reading a CSV file with column names in the first row.
Unless otherwise specified, data preparation automatically performs header. This rule may be deleted if header results are not desired, but such cases are not common.
settype¶
Required arguments
Column: A list of target columns
New type: Select one out of Long, Double, String, Boolean, and Timestamp
Optional arguments
Set format: A format string (Joda Time) in the case of timestamp
Description
This rule changes the type of the selected columns.
The rule is considered successful even if the result is a type mismatch, which should be separately addressed.
setformat¶
Required arguments
Column: A list of target columns
Set format: A Joda-Time format string
Description
This rule changes the display format of a Timestamp column.
The target column must be of the Timestamp type.
Notes
As shown below, the format input field lists different entries depending on the input. The candidate list is narrowed as more values are entered.
rename¶
Required arguments
Column: A single target column
New column name: New name
Description
This rule changes the name of the selected column.
To rename two or more columns at once, click the Rename multiple columns button at the bottom of the command input panel to display the following popup.
keep¶
Required arguments
Condition: A conditional expression returning a Boolean value
Description
All rows are deleted except the rows that return true for the conditional expression.
delete¶
Required arguments
Condition: A conditional expression returning a Boolean value
Description
All rows that return true for the conditional expression are deleted. This is the opposite of keep.
replace¶
Required arguments
Column: A list of target columns
Pattern: A string pattern to be replaced
In the case of a constant string: Characters enclosed inside
'
('Houston'
,'Naperville'
,'Philadelphia'
etc.)In the case of a regular expression: Characters enclosed inside
/
(/[ ,_]+/
,/\s+$/
, etc.)New value: A new string expression to replace the specified pattern
Constant string
Regular expression
$1_$2_$3
, etc.
Optional arguments
Ignore between characters: Does not make any replacement for content between the characters entered here
Match all occurrences: Whether all characters of a word must match
Ignore case: Whether to make the strings case-insensitive
Description
String replacement is performed for the selected columns.
Notes
Do not use
'
or/
in a new value.Values from other columns are not available as new values. replace performs string replacement for content in the selected columns only. (cf. set rule)
set¶
Required arguments
Column: A list of target columns
Expression: An expression to be applied to the values of the target column. Values from other columns may be referenced. (cf. replace rule)
When multiple columns are involved, use a
$col
variable, which will be substituted by the respective target column during each conversion.That is, when applying the set command on
column1
andcolumn2
,$col
becomescolumn1
during conversion ofcolumn1
, and$col
becomescolumn2
during conversion ofcolumn2
.
Optional arguments
Use only under the following conditions
The set rule is applied only to rows satisfying this condition.
This rule may be regarded the same as the
WHERE
statement in SQL.
Description
This rule replaces the values in the selected column with results returned by the expression.
When using a complex expression, click the Advanced editor to display the popup shown below:
In the Advanced editor, you can edit the expression in a larger window while viewing the column list and a list of functions and their descriptions, and also run a validity check before implementing the expression.
derive¶
Required arguments
Expression: An expression whose resulting values are to form a new column. Similar to the set rule, values from other columns may be referenced.
New column name
Description
While similar to the set rule, this rule creates a new column instead of replacing an existing one.
Notes
The new column is inserted after the last existing column in the expression.
split¶
Required arguments
Column: A list of target columns
Pattern: A string expression that serves as a separator that splits the target strings. Allows a regular expression as is the case for the replace rule.
Number: Number of columns to be divided into.
Description
Each row is split by the given Number - 1.
When the pattern is no longer matched, the rest columns contain a null.
Notes
Note that columns are created as many as the Number input.
merge¶
Required arguments
Column: A list of target columns
Delimiter: A constant string with which values of different columns are concatenated.
New column name
Description
The target columns are merged with the Delimiter into a new column.
Notes
Similar to the replace rule, enclosing with a
'
may be skipped. That is, strings not enclosed by/
or'
are automatically enclosed by'
.
extract¶
Required arguments
Column: A list of target columns
Pattern: A string pattern to be extracted. Allows a regular expression as is the case for the replace rule.
Number: Number of instances to be extracted
Optional arguments
Ignore between characters: Does not make any replacement for content between the characters entered here
Ignore case: Whether to make the strings case-insensitive
Description
A new column(s) with content matching the given pattern is created.
Notes
When there are multiple target columns, the resulting columns are inserted after each target column.
countpattern¶
Required arguments
Column: A list of target columns
Pattern: A string pattern to be detected. Allows a regular expression as is the case for the replace rule.
Optional arguments
Ignore between characters: Does not make any replacement for content between the characters entered here
Ignore case: Whether to make the strings case-insensitive
Description
New columns are created based on the number of matches with the pattern.
This is highly similar to extract. The only difference is that it counts the number of matches, rather than extracting the matched content.
Notes
When there are multiple target columns, the resulting columns are inserted after each target column.
nest¶
Required arguments
Column: A list of target columns
Type: Map or Array
New column name
Description
The target columns are grouped into a new column of the given type.
Below are examples of grouping columns into an array and map, respectively.
unnest¶
Required arguments
Column: A single target column
Select elements: 0-base index for an array, or key value for a map
Description
A new column is created by extracting the selected elements from an array or a map.
Notes
The target column must be of the array or map type.
flatten¶
Required arguments
Column: A single target column
Description
Rows are created from elements of an array.
Notes
The target column must be of the array type.
If the target array column has four elements as shown in the above example, each original row of the array results in four rows. Non-array columns result in the same columns.
aggregate¶
Required arguments
Expression: A list of aggregate functions
Group by: A list of columns that group values by.
Description
A new column is added from the results of grouping by each combination of the elements from the GroupBy columns.
A column is created for each expression. For example, two columns are created if average and count are designated as expressions.
The available aggregate functions are as follows:
count()
sum(colname)
avg(colname)
min(colname)
max(colname)
Notes
Calculations are performed only for sampling results. Therefore, the snapshot?the results for the entire data?may be different.
Note that
()
must be inserted when using the count function.count(colname) is currently not available.
pivot¶
Required arguments
Column: A list of columns subject to pivoting
Expression: A list of expressions whose resulting values form new columns (only aggregate functions are available)
Group by: A list of columns that group values by.
Description
Group By is performed for each combination of target columns and GroupBy columns. A dataset having the results as column values is created.
A set of columns is created for each expression. For example, if average and count are designated as expressions and the values in the pivoted columns are divided into ten groups, a total of 20 columns will be created.
Notes
This is used when performing GroupBy on at least two columns. (1 pivoted column, 1 GroupBy column)
Here, Rename multiple columns is useful as column names tend to get longer.
unpivot¶
Required arguments
Column: A list of target columns to be converted into values in new columns
GroupEvery: Number of columns (defaults to 1)
Description
Two columns are created?one contains the selected column names and the other contains their values. (If GroupEvery is set to 1)
If GroupEvery is the same as the number of selected columns, each resulting pair of columns contains the name and values of its respective original column. Therefore, If 10 columns are unpivoted with the GroupEvery argument set to 10, for example, a total of 20 columns are created.
Notes
Using the GroupEvery argument set to a factor of the number of columns will soon be supported.
<Where GroupEvery is set to 1>
<Where GroupEvery is set to the same as the number of columns>
join¶
Unlike other rules, join has a separate popup.
Required arguments (select in a popup or enter a value)
Dataset to join: A wrangled dataset in the same dataflow
Columns to join (toggle)
Join keys: Multiple values may be entered
Join type: Only inner join supported now
Description
Joins to the target dataset to create new columns.
This rule is the same as
join
used by a relational database.The results can be previewed by clicking the Show result button.
Notes
The join keys must be included in the columns to join.
union¶
Similar to join, union has a separate popup.
Required arguments (select in a popup)
Datasets to union: Multiple selections allowed.
Description
The content of the selected datasets is also processed.
This rule is the same as
union all
used by a relational database.
Notes
The target datasets must coincide with the dataset that unions them in terms of column name, type, and number of columns.
window¶
Required arguments
Expression: A list of window functions
Group by: A list of columns that group values by. Row order created within each group. If not specified, the whole data is sorted based on the Sort by setting.
Sort by: Specifies columns by which the order of rows is determined. If not specified, data is sorted in the order of being inputted.
Description
Column values are created by calculating with the values of the preceding and following rows.
The rows are grouped first and then sorted within each group in the specified column order.
In the above example, each row value is averaged with the three preceding and following rows within the same State group.
If an immediately preceding row does not have the same state, earlier rows are searched.
The currently available window functions are as follows:
row_number()
lead(colname, int)
lag(colname, int)
rolling_sum(colname, int, int)
rolling_avg(colname, int, int)
In addition to window functions, aggregate functions may be used.
Notes
When using window functions, error messages may not be properly displayed in the event of insufficient arguments.