Pipelines on unstructured data

As discussed in the overview for datasets, unstructured data in Foundry is stored as a collection of files in a dataset, just like tabular data.

The following features work identically for pipelines on structured and unstructured data:

  • Pipelines can be made incremental to optimize compute performance.
  • You can write unit tests against your pipelines.
  • Computing output datasets is done using builds and schedules.
  • Foundry's pipeline security features enable robust, end-to-end security guarantees.

Some differences from pipelines on tabular data include:

  • Most guidance and example code in the documentation focuses on processing DataFrames, which are not the input type used for unstructured data.
  • You must use the lower-level file system APIs to read and write files in unstructured datasets.
  • Because unstructured datasets have no schema, some features focused on validating rows and columns of tabular datasets are unavailable.
  • It is possible to use Spark to process unstructured files in parallel, but the APIs are lower-level and more complex than for dataframe processing.
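
The file-oriented style described above can be illustrated with a minimal, self-contained sketch. It does not use Foundry's transforms API: `LocalFileSystem`, `FileStatus`, `summarize_text_files`, and `demo` are hypothetical names invented for this example, and the class is only a local-directory stand-in for a dataset file system that can list and open files. In a real transform, the file system objects exposed by the input and output datasets would play this role; check the Python transforms documentation for the exact method signatures.

```python
import fnmatch
import tempfile
from pathlib import Path
from typing import Iterator, NamedTuple


class FileStatus(NamedTuple):
    """Hypothetical listing entry: a dataset-relative path and a size in bytes."""
    path: str
    size: int


class LocalFileSystem:
    """Hypothetical stand-in for a dataset file system, backed by a local directory."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def ls(self, glob: str = "*") -> Iterator[FileStatus]:
        # Walk the directory tree and yield files whose relative path matches the glob.
        for p in sorted(self.root.rglob("*")):
            if p.is_file():
                rel = p.relative_to(self.root).as_posix()
                if fnmatch.fnmatch(rel, glob):
                    yield FileStatus(rel, p.stat().st_size)

    def open(self, path: str, mode: str = "r"):
        target = self.root / path
        if "w" in mode:
            # Create parent directories on demand so writes to new paths succeed.
            target.parent.mkdir(parents=True, exist_ok=True)
        return open(target, mode)


def summarize_text_files(source, sink) -> int:
    """Count lines in each *.txt file in `source` and write a CSV summary to `sink`."""
    rows = []
    for status in source.ls(glob="*.txt"):
        # There is no schema to rely on: each file is opened and parsed explicitly.
        with source.open(status.path, "r") as fh:
            line_count = sum(1 for _ in fh)
        rows.append((status.path, status.size, line_count))
    with sink.open("summary.csv", "w") as out:
        out.write("path,size_bytes,line_count\n")
        for path, size, count in rows:
            out.write(f"{path},{size},{count}\n")
    return len(rows)


def demo() -> str:
    """Build a small input directory, run the summary, and return the CSV text."""
    with tempfile.TemporaryDirectory() as tmp:
        src = LocalFileSystem(Path(tmp) / "in")
        dst = LocalFileSystem(Path(tmp) / "out")
        with src.open("a.txt", "w") as fh:
            fh.write("one\ntwo\n")
        with src.open("notes/b.txt", "w") as fh:
            fh.write("three\n")
        with src.open("image.bin", "wb") as fh:
            fh.write(b"\x00\x01")  # skipped by the *.txt filter
        summarize_text_files(src, dst)
        with dst.open("summary.csv") as fh:
            return fh.read()
```

Calling `demo()` returns the text of the summary file, with one row per matching input file. Keeping the per-file logic in a plain function such as `summarize_text_files` also makes it straightforward to exercise in a unit test without running a full build.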

To get started with pipelines on unstructured data, refer to the relevant parts of the documentation for Python and Java transforms.

Once unstructured data has been cleaned and normalized, you can use Code Workbook to analyze unstructured datasets and train machine learning models in Python and R. For details, see the Code Workbook documentation on unstructured data access.

