Basic examples（基本示例）¶

This page explores examples of basic transforms to clean and prepare your data in the Preparations interface.

Remove leading and trailing whitespace¶

If some values have leading/trailing whitespace, they will be counted under Needs trim in the stats area.

Click the pink lightbulb next to Needs trim, then Trim whitespace to remove the leading and trailing whitespace from the values in that column.

Set empty strings to `null`¶

If some values are empty strings, they will be counted under Empty in the stats area.

Click the pink lightbulb next to Empty, then Set to null to set any empty string values in that column to null.

Normalize multiple string columns at once¶

Select the string columns you wish to normalize (or default to all string columns with no selected columns).
Choose the Normalize strings action and select the actions you wish to apply.

Normalize values to uppercase¶

The stats panel on the left will show how many values in the column are uppercase, lowercase or mixed case.

To normalize the values to uppercase, choose the Change Case action and click Uppercase (you can also use Lowercase or Title case as appropriate).

Parse currency strings into a numeric column¶

If numeric values have extraneous non-numeric characters (for example, $1,234.56) the column type will usually be detected as string. However, the column should be numeric to analyze it numerically.

Check the stats area to ensure the values are showing as Numeric.

:::callout{theme="neutral"} If some values are showing as non-alpha, uppercase, etc., you must first clean them to allow them to be parsed as numbers. Click the relevant category (e.g. non-alpha) to explore those values and begin cleaning. :::

Click the Change type button, choose either Integer (for whole numbers) or Double (for numbers with decimals) from the dropdown.

Nullify values that imply no data, e.g. N/A¶

Often, columns will end up with values that imply that there is no data available (e.g. N/A, Other, None, Unknown, etc). Typically, those values should be null to properly indicate that no data is available in that cell.

Select the value or values in the histogram.
If you don't see the values, try searching for them using the Filter... box.
Click the New value action, enter /NULL, and click Apply.

:::callout{theme="neutral"} Make sure you select one or more values before applying the change, or the entire column will be set to null. :::

Normalize ZIP codes to five digits¶

ZIP code columns can sometimes be a mix of five-digit (12345) and nine-digit (12345-6789) values. Typically, ZIP codes should be normalized to a single format to allow for grouping and other preparation workflows.

Normalize ZIP code columns to five digits using the Extract action. Click the Extract action, choose Indexed Substring from the Type dropdown, then use a start index of 1 and an end index of 5.

Rename all columns to snake case¶

Snake case (lowercase_with_underscores) is a common column naming standard across many deployments.

Click the Normalize column names action. If the action isn't visible, make sure no columns are selected. Choose Standard, and click Apply.

All columns will be instantly renamed to snake case, meaning the column values with have a format of lowercase letters with underscores instead of special characters.

Rename many columns¶

The Columns view is the easiest way to rename many columns. Switch to Columns view by clicking Columns at the top of the screen.

:::callout{theme="neutral"} Bulk column changes are not saved until you click Apply. :::

For any column that needs renaming, click into the column name and edit as necessary. The column name will turn green to indicate that the change is staged but not yet applied.

Once all the column names are fixed, click the Apply button at the top of the changelog on the right to save all the column name changes.

Delete null columns¶

Sometimes, columns only contain null values. You can clean up the dataset by removing the column entirely.

Check the stats panel to verify that the column contains entirely null values; the stats should say 100% for Null.

Then, click the Delete column button to remove the column.

Remove rows with a null value¶

Sometimes, rows with a null value in a particular column are irrelevant and can be removed.

Select the column, and check the Null section of the stats panel to see how many rows have a null value for that column.

Next, click on Null in the stats panel, then the Focus in button, to focus only on the rows with null values. Check to be sure that these rows are irrelevant and can be deleted.

To remove these rows, click the Remove rows button at the top of the screen.

Capture the row order of the original data¶

From the initial view, use the Add new column action with the following expression:

monotonically_increasing_id()

This will add a column of numbers that are guaranteed to be increasing based on the row order, but may be neither consecutive nor deterministic.

:::callout{theme="neutral"} The column values can change for each computation. This can lead to unexpected behavior, such as a dataset saved from the preparation having different numbers associated with the rows when built multiple times, or the histogram of the row number column not being consistent with the table data. :::

Add row numbers (for small datasets only)¶

From the initial view, use the Add new column action with the following expression:

row_number() over (order by monotonically_increasing_id())

:::callout{theme="warning"} This row number expression can be very expensive to compute, and it may significantly hinder the performance of the preparation when working with large datasets. :::

Insert changes into the middle of the changelog¶

From the initial view, choose the change above which you would like to insert changes. Then, enter preview mode by clicking on the change or selecting Preview data in the change dropdown menu.

Make changes to the preparation. The changes will appear above the change being previewed but below future changes.

Exit preview mode by clicking Cancel in the preview warning bar.

中文翻译¶

基本示例¶

本页介绍在 Preparations 界面中使用基本转换来清理和准备数据的示例。

删除前导和尾随空格¶

如果某些值包含前导或尾随空格，它们将在统计区域中被归类为 Needs trim（需要修剪）。

点击 Needs trim 旁边的粉色灯泡，然后选择 Trim whitespace（修剪空格），即可删除该列值中的前导和尾随空格。

将空字符串设置为 `null`¶

如果某些值为空字符串，它们将在统计区域中被归类为 Empty（空值）。

点击 Empty 旁边的粉色灯泡，然后选择 Set to null（设为空值），即可将该列中的所有空字符串值设置为 null。

同时标准化多个字符串列¶

选择需要标准化的字符串列（若未选择任何列，则默认对所有字符串列操作）。
选择 Normalize strings（标准化字符串）操作，并勾选需要应用的选项。

将值标准化为大写¶

左侧的统计面板会显示该列中大写、小写和混合大小写的值数量。

要将值标准化为大写，请选择 Change Case（更改大小写）操作，然后点击 Uppercase（大写）（您也可以根据需要选择 Lowercase（小写）或 Title case（首字母大写））。

将货币字符串解析为数值列¶

如果数值包含无关的非数字字符（例如 $1,234.56），列类型通常会被检测为 string（字符串）。但为了进行数值分析，该列应为数值类型。

检查统计区域，确保值显示为 Numeric（数值）。

:::callout{theme="neutral"} 如果某些值显示为非字母、大写等，您必须先清理它们，才能将其解析为数字。点击相关类别（例如 non-alpha）查看这些值并开始清理。 :::

点击 Change type（更改类型）按钮，从下拉菜单中选择 Integer（整数）或 Double（双精度浮点数，适用于带小数的数字）。

将表示无数据的值设为空值，例如 N/A¶

通常，列中会出现表示无数据的值（例如 N/A、Other、None、Unknown 等）。这些值应设为 null，以正确指示该单元格中无数据。

在直方图中选择一个或多个值。
如果看不到这些值，请尝试使用 Filter...（筛选）框进行搜索。
点击 New value（新值）操作，输入 /NULL，然后点击 Apply（应用）。

:::callout{theme="neutral"} 请确保在应用更改前已选择一个或多个值，否则整个列将被设置为 null。 :::

将邮政编码标准化为五位数字¶

邮政编码列有时会混合五位数字（12345）和九位数字（12345-6789）的值。通常，邮政编码应标准化为统一格式，以便进行分组和其他准备工作流。

使用 Extract（提取）操作将邮政编码列标准化为五位数字。点击 Extract 操作，从 Type（类型）下拉菜单中选择 Indexed Substring（索引子字符串），然后将起始索引设为 1，结束索引设为 5。

将所有列重命名为蛇形命名法¶

蛇形命名法（lowercase_with_underscores）是许多部署中常见的列命名标准。

点击 Normalize column names（标准化列名）操作。如果该操作不可见，请确保未选择任何列。选择 Standard（标准），然后点击 Apply（应用）。

所有列将立即重命名为蛇形命名法，即列名格式为小写字母加下划线，不含特殊字符。

批量重命名列¶

Columns（列）视图是批量重命名列的最简便方式。点击屏幕顶部的 Columns 切换到该视图。

:::callout{theme="neutral"} 批量列更改在点击 Apply（应用）之前不会保存。 :::

对于需要重命名的列，点击列名并进行编辑。列名将变为绿色，表示更改已暂存但尚未应用。

所有列名修改完成后，点击右侧 changelog（更改日志）顶部的 Apply（应用）按钮，保存所有列名更改。

删除空值列¶

有时，列中仅包含 null 值。您可以通过删除整列来清理数据集。

检查统计面板，确认该列完全由空值组成；统计信息中 Null（空值）应显示为 100%。

然后，点击 Delete column（删除列）按钮移除该列。

删除包含空值的行¶

有时，某列中包含 null 值的行无关紧要，可以删除。

选择该列，查看统计面板中的 Null（空值）部分，了解该列有多少行包含 null 值。

接下来，点击统计面板中的 Null，然后点击 Focus in（聚焦）按钮，仅查看包含 null 值的行。确认这些行无关紧要后即可删除。

要删除这些行，请点击屏幕顶部的 Remove rows（删除行）按钮。

捕获原始数据的行顺序¶

在初始视图中，使用 Add new column（添加新列）操作并输入以下表达式：

monotonically_increasing_id()

这将添加一列数字，这些数字保证按行顺序递增，但可能既不连续也不确定。

:::callout{theme="neutral"} 列值可能因每次计算而改变。这可能导致意外行为，例如从 Preparations 保存的数据集在多次构建时，行对应的数字不同，或者行号列的直方图与表格数据不一致。 :::

添加行号（仅适用于小数据集）¶