Datasets

To create a pipeline, select a data input as a starting point. After adding a data input to a pipeline, the data can be cleaned, transformed, and combined with other data to be deployed for further use across the platform (for example, as part of the Ontology).

Pipeline Builder supports data inputs in the form of structured, semi-structured, and unstructured data.

Structured data typically comes in the form of datasets that consist of files containing tabular data and metadata about the columns in the dataset. The column metadata is stored alongside the corresponding dataset as a schema. Pipeline Builder also supports the input of structured data in the form of manually entered data, uploaded data, data syncs, and virtual tables.
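To make the idea of a schema stored alongside tabular data concrete, here is a minimal sketch in generic Python. It is not Pipeline Builder's API: the `Column` class, the `validate_rows` helper, and the sample columns are all hypothetical names chosen for illustration. The sketch treats a schema as a list of column names and types and checks that each row conforms to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical column metadata: a name and an expected Python type."""
    name: str
    dtype: type


def validate_rows(rows, schema):
    """Return (row_index, column_name) pairs where a row violates the schema."""
    problems = []
    for index, row in enumerate(rows):
        for column in schema:
            # A value is invalid if the column is missing or has the wrong type.
            if column.name not in row or not isinstance(row[column.name], column.dtype):
                problems.append((index, column.name))
    return problems


# Illustrative schema and rows (assumed data, not taken from any real dataset).
schema = [Column("id", int), Column("city", str)]
rows = [{"id": 1, "city": "Oslo"}, {"id": "2", "city": "Lima"}]
problems = validate_rows(rows, schema)
```

In this sketch the second row is flagged because its `id` is a string rather than an integer, which is the kind of mismatch that column metadata makes detectable.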

Semi-structured data refers to a dataset that consists of files without a schema, making the data non-tabular. Thus, semi-structured datasets are sometimes called "schema-less" datasets. Pipeline Builder supports semi-structured data in the form of XML, JSON, and CSV files. You can use parsing transform functions to convert semi-structured files into tabular form and benefit from schema safety. Learn how to transform data in your pipeline.
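The conversion from schema-less files to tabular form can be sketched outside the platform as follows. This is an illustration in plain Python using the standard `json` module, not one of Pipeline Builder's parsing transform functions; the `SCHEMA` mapping, the `parse_json_records` helper, and the sample input are assumptions made for the example.

```python
import json

# Hypothetical target schema: column name -> Python type.
SCHEMA = {"id": int, "name": str, "score": float}


def parse_json_records(raw_text, schema):
    """Parse a JSON array of objects into rows that conform to `schema`."""
    rows = []
    for record in json.loads(raw_text):
        row = {}
        for column, column_type in schema.items():
            value = record.get(column)
            # Missing values stay None; present values are cast to the column type.
            row[column] = None if value is None else column_type(value)
        rows.append(row)
    return rows


# Illustrative input: inconsistent types and a missing field.
raw = '[{"id": "1", "name": "a", "score": "2.5"}, {"id": 2, "name": "b"}]'
rows = parse_json_records(raw, SCHEMA)
```

After parsing, every row has the same columns with consistent types, which is what allows downstream steps to rely on the schema.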

Unstructured data refers to other non-tabular forms of data, including visual media, PDF documents, audio, and more. Pipeline Builder supports unstructured data inputs in the form of media sets.

The first step towards defining a workflow in Pipeline Builder is to add one or more data inputs to your workspace. Learn how to add data or change input computation modes in the following documentation, and learn more about data in Foundry by visiting data integration.
