Create a simple preparation
:::callout{theme="warning"} Preparation has been superseded by Pipeline Builder and is therefore no longer the recommended approach for cleaning and preparing data. Pipeline Builder makes it easy to clean and prepare your data for pipelines, while also offering Marketplace support. :::
The following tutorial walks you through using Preparation to transform a spreadsheet of raw data into a cleaned and prepared dataset ready for analysis.
This tutorial uses data from The Meteoritical Society via the NASA Data Portal. You can follow along in your own Preparation instance with this sample dataset:
Download meteorite_landings_raw
This dataset contains raw data about meteorites that have been found on Earth.
The dataset includes name, mass, classification, and other identifying information for each meteorite, along with the year it was discovered and coordinates of where it was found.
We recommend opening the CSV to review the data before uploading into Foundry.
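If you prefer to review the file programmatically rather than in a spreadsheet application, a minimal Python sketch such as the following can print the header and the first few rows. The sample rows and the `preview_rows` helper are hypothetical and only illustrate the idea; the real file contains more columns than the three shown here.

```python
import csv
import io
from itertools import islice


def preview_rows(stream, limit=5):
    """Return the header and up to `limit` data rows from a CSV stream."""
    reader = csv.reader(stream)
    header = next(reader, [])
    rows = list(islice(reader, limit))
    return header, rows


# Hypothetical sample standing in for meteorite_landings_raw.csv.
# Only the name, year, and GeoLocation columns are shown.
SAMPLE = (
    "name,year,GeoLocation\n"
    ' Aachen,1880-01-01T00:00:00,"(50.775000, 6.083330)"\n'
    'Abee ,1952-01-01T00:00:00,"(0.000000, 0.000000)"\n'
)

header, rows = preview_rows(io.StringIO(SAMPLE), limit=1)
print(header)
for row in rows:
    print(row)
```

To run this against the downloaded file, replace the `io.StringIO(SAMPLE)` argument with a file object opened via `open("meteorite_landings_raw.csv", newline="")`.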
1. Create a preparation
We will get started by creating a new preparation.
- First, upload the `meteorite_landings_raw.csv` file into Foundry.
- Then, navigate to the `meteorite_landings_raw` dataset, right-click, and choose Clean in Preparation.
This creates a new preparation. You should save your preparation with a meaningful name to make it easier to find again in your files.
- Finally, click Save and choose a name and save location for the preparation.
:::callout{theme="neutral"} Preparations that you create and do not explicitly save are stored by default in Files > .auto-save. :::
2. Clean data
Now, review the dataset and identify and fix any data quality issues you find.
Trim whitespace
- First, click on the name column in the table:

The panels below the table show information about the data in the column, such as statistics and charts:

You can see from the statistics panel that some of the values have been flagged as Needs trim, which means that there is extraneous whitespace at the beginning or end of the value.
- Hover over the pink lightbulb, and click the Trim whitespace button to fix this issue.

After the column statistics refresh, the Needs trim count should be zero, which confirms that the column has been cleaned successfully. The Trim whitespace change is also added to the Dataset Changes list on the right side of the screen.
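For readers who want to reason about this check outside the UI, the following Python sketch mimics the idea: count the values that have leading or trailing whitespace, trim them, and confirm that the count drops to zero. The helper names and sample values are hypothetical, and Preparation's exact definition of whitespace is not documented here, so Python's `str.strip()` is used as an assumption.

```python
def needs_trim(value):
    """True if the value has leading or trailing whitespace."""
    return value is not None and value != value.strip()


def trim_column(values):
    """Strip leading and trailing whitespace, leaving nulls untouched."""
    return [v.strip() if v is not None else None for v in values]


# Hypothetical values for the name column.
names = [" Aachen", "Abee ", "Acapulco", None]

flagged_before = sum(needs_trim(v) for v in names)
cleaned_names = trim_column(names)
flagged_after = sum(needs_trim(v) for v in cleaned_names)

print(flagged_before, flagged_after)
```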

Transform year column to a date
Now, let's move on to the year column. You can see in the table that the data type of the column is Timestamp. However, we want it to just be a Date.
- First, click the Change type button and choose Date (whole days) from the dropdown list.
- Then, click the Change type button.
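Conceptually, this type change drops the time-of-day component and keeps only the calendar day. The sketch below shows an equivalent conversion in Python with the standard `datetime` module. It illustrates the concept only and is not how Preparation implements the change; the `to_whole_day` helper and the sample values are hypothetical.

```python
from datetime import date, datetime


def to_whole_day(timestamp):
    """Convert a timestamp to a calendar date, passing nulls through."""
    return None if timestamp is None else timestamp.date()


# Hypothetical values for the year column, stored as timestamps.
year_timestamps = [
    datetime(1880, 1, 1, 0, 0, 0),
    datetime(1952, 1, 1, 12, 30, 0),
    None,
]

year_dates = [to_whole_day(t) for t in year_timestamps]
print(year_dates)
```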

Set geolocation values to null
Finally, let's look at the GeoLocation column. The histogram shows that a large number of rows have a value of (0.000000, 0.000000), which is not a valid geolocation.

Let's fix these values by setting them to null.
- First, select the (0.000000, 0.000000) value in the histogram.
- Next, click the New value action under Change data (for selected rows).
- Finally, enter `/NULL` in the text box, and click Apply to set these values to `null`.
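The same rule can be expressed in code as "replace any coordinate pair at exactly (0, 0) with null". The Python sketch below is one way to write it, assuming the column holds strings of the form `(lat, long)`. The `null_if_origin` helper, the regular expression, and the sample values are hypothetical; the pattern accepts the pair with or without a space after the comma.

```python
import re

# Matches "(lat, long)" with optional whitespace around the numbers.
_COORD = re.compile(r"^\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)$")


def null_if_origin(value):
    """Return None for (0, 0) coordinates; otherwise return the value unchanged."""
    if value is None:
        return None
    match = _COORD.match(value.strip())
    if match and float(match.group(1)) == 0.0 and float(match.group(2)) == 0.0:
        return None
    return value


# Hypothetical values for the GeoLocation column.
locations = [
    "(50.775000, 6.083330)",
    "(0.000000, 0.000000)",
    "(0.000000,0.000000)",
    None,
]

cleaned_locations = [null_if_origin(v) for v in locations]
print(cleaned_locations)
```

Values that do not match the expected pattern are returned unchanged, so malformed entries stay visible for a separate review rather than being silently nulled.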

3. Save a cleaned version of a dataset
Now that we have cleaned up data quality issues, we can save a new, cleaned version of this dataset.
- First, click the Save as dataset button at the top of the screen.
- Then, choose a name and location for the new cleaned dataset. A pop-up will appear indicating that the new dataset is being built.

A link to the new dataset appears, labeled Output. As you make changes to the preparation, you can update the output dataset using the Update button.
:::callout To try out the results of your cleaning in Contour without having to save a new dataset, click the Analyze button at the top of the screen. :::