Pipeline Builder

How can I drop malformed CSV lines in a data pipeline build to avoid errors?

From the dataset preview section, use the Parsing options -> Drop jagged rows feature to drop malformed rows.
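
To illustrate what dropping jagged rows amounts to, here is a minimal Python sketch. It assumes that a "jagged" row is one whose field count differs from the header's. It is not Pipeline Builder's implementation; the function name `drop_jagged_rows` and the sample data are hypothetical.

```python
import csv
import io


def drop_jagged_rows(csv_text):
    """Return (header, rows), keeping only rows with as many fields as the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Assumption: a row is "jagged" when its field count differs from the header's.
    rows = [row for row in reader if len(row) == len(header)]
    return header, rows


# Hypothetical sample: row 2 is too short, row 3 is too long.
sample = "id,name,city\n1,Ada,London\n2,Bob\n3,Cy,Oslo,extra\n4,Dee,Paris\n"
header, rows = drop_jagged_rows(sample)
print(rows)
```

Only the rows with ids 1 and 4 survive in this example, because they are the only ones with exactly three fields.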

Timestamp: March 1, 2024

How can I union multiple datasets in a single transform without unioning different pairs of datasets?

You can connect all the datasets to the union board, which takes an unlimited number of inputs, or drag-select all of the inputs and then click the union transform option.
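
The difference between one multi-input union and a chain of pairwise unions can be sketched in plain Python. This is only an analogy. The helper `union_all` and the sample datasets are hypothetical, and the sketch keeps duplicate rows, which may or may not match the union board's behavior.

```python
from itertools import chain


def union_all(*datasets):
    """Concatenate any number of datasets (lists of row dicts) in a single step."""
    # One call handles every input, so no intermediate pairwise unions are needed.
    return list(chain.from_iterable(datasets))


# Hypothetical inputs that share the same columns.
jan = [{"id": 1, "month": "jan"}]
feb = [{"id": 2, "month": "feb"}]
mar = [{"id": 3, "month": "mar"}]

combined = union_all(jan, feb, mar)
print(len(combined))
```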

Timestamp: April 11, 2024

Is there a batch operation for calling the LLM on the Use LLM board, or is it called per row?

The LLM is called per row, but the operations are parallelized across executors for speed.
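
The pattern of one call per row with calls running in parallel can be sketched as follows. This is an analogy using a local thread pool, not the distributed executors Pipeline Builder uses. `fake_llm` is a hypothetical stand-in for a model call.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_llm(prompt):
    # Hypothetical stand-in for a model call; a real call would be a network request.
    return prompt.upper()


def call_per_row(rows, max_workers=4):
    """Issue one call per row, running several calls concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order even though calls overlap in time.
        return list(pool.map(fake_llm, rows))


results = call_per_row(["summarize a", "summarize b", "summarize c"])
print(results)
```

Each row still gets its own call; the speedup comes only from running many calls at once.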

Timestamp: March 28, 2024

Is there a way to extract text from an image in a PDF using OCR in Pipeline Builder?

Yes. Use the OCR (Optical Character Recognition) extraction method in the PDF text extraction transform to extract text from images in a PDF.

Timestamp: April 10, 2024

What does the "Time bounded drop duplicates" function do when a row arrives later than the configured event time window?

The Time bounded drop duplicates function will drop any row that arrives later than the configured event time window, regardless of whether it is a duplicate or not.
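
The behavior described above can be sketched in Python under some loudly stated assumptions: "late" is taken to mean that a row's event time is more than `window` behind the largest event time seen so far, and duplicates are identified by a key. The function name, the key-based matching, and this definition of lateness are all assumptions for illustration, not the documented implementation.

```python
def time_bounded_drop_duplicates(rows, window):
    """rows: (key, event_time) pairs in arrival order; window: allowed lateness."""
    seen = {}              # key -> event_time of the row that was kept
    max_event_time = None  # assumed watermark: largest event time seen so far
    kept = []
    for key, event_time in rows:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        cutoff = max_event_time - window
        if event_time < cutoff:
            # Late row: dropped whether or not it is a duplicate.
            continue
        # Forget keys that have fallen out of the window.
        seen = {k: t for k, t in seen.items() if t >= cutoff}
        if key in seen:
            continue  # duplicate within the window
        seen[key] = event_time
        kept.append((key, event_time))
    return kept


# Hypothetical stream: ("a", 11) is a duplicate, ("d", 5) is late but unique.
stream = [("a", 10), ("b", 12), ("a", 11), ("c", 30), ("d", 5)]
kept = time_bounded_drop_duplicates(stream, window=10)
print(kept)
```

In this example `("d", 5)` is dropped even though no other row has key `"d"`, which mirrors the point that late rows are discarded regardless of whether they are duplicates.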

Timestamp: March 20, 2024

Can I replace the outputs of pipeline A with new datasets and then make pipeline A's previous output datasets the outputs of a different pipeline (pipeline B), while keeping all pipeline output schemas the same?

Yes. In Pipeline Builder you can overwrite a dataset with a new output; this is a one-time action that changes the ownership of an existing dataset to the new output. You can configure the desired datasets as outputs for pipeline B, provided you have the necessary permissions and follow the required steps. Every pipeline output schema must match the schema of its input transform node to avoid errors and deploy the pipeline successfully.

Timestamp: April 13, 2024

How can I implement a custom User-Defined Function (UDF) to use in Pipeline Builder?

To implement a custom User-Defined Function (UDF) in Pipeline Builder, refer to the documentation on creating and using UDFs and on running arbitrary Java code in Pipeline Builder.

Timestamp: April 19, 2024

How can I add row numbers to a dataset that is built by uploading CSV files?

You can enable Row number via the Edit schema option in the dataset preview.

Timestamp: April 18, 2024

How can I convert struct columns to JSON strings in Pipeline Builder?

The JSON to string expression can be used to convert struct columns to JSON strings.
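
As an illustration of the conversion itself, the following Python sketch serializes a nested record to a JSON string with the standard `json` module. It only shows the idea; the helper name `struct_to_json_string` and the sample row are hypothetical, and the exact output formatting of the Pipeline Builder expression may differ.

```python
import json


def struct_to_json_string(struct):
    """Serialize a struct-like dict to a JSON string; pass null (None) through."""
    if struct is None:
        return None
    # sort_keys gives a stable field order, which is a choice made for this sketch.
    return json.dumps(struct, sort_keys=True)


# Hypothetical row with a struct column named "address".
row = {"id": 1, "address": {"city": "Oslo", "zip": "0150"}}
row["address_json"] = struct_to_json_string(row["address"])
print(row["address_json"])
```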

Timestamp: June 14, 2024

Why is there a discrepancy in row counts between the preview of a deployed dataset in Pipeline Builder and the actual dataset view?

The discrepancy can occur when input sampling strategies are applied in the preview, because the preview then runs on only part of the input. Non-deterministic transformations can also produce different row counts between runs.

Timestamp: June 28, 2024

How do you clean up checkpoint datasets created by a pipeline?

Move the pipeline that created the checkpoint dataset to the trash; the checkpoint dataset should be moved to the trash along with it.

Timestamp: April 24, 2024

How can `null` string values be mapped to a specific string (for example, "no data") in a Pipeline Builder pipeline?

There are two methods to achieve this in a Pipeline Builder pipeline:

1. Use the Coalesce function. For instance, A = coalesce(A, "no data"). If A is null, it will return "no data".
2. Use the Case board.

Both methods allow for the mapping of null values to a specified string.
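
The coalesce semantics can be sketched in Python, where `None` plays the role of null. The `coalesce` helper below is a hypothetical re-implementation for illustration, not the Pipeline Builder function, and the column values are made up.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


# Hypothetical column A containing a null and an empty string.
column_a = ["x", None, "", "y"]

# Coalesce-style mapping.
filled = [coalesce(value, "no data") for value in column_a]

# Case-style mapping: an explicit condition gives the same result.
filled_case = ["no data" if value is None else value for value in column_a]
print(filled)
```

In this sketch an empty string is not null, so it is left unchanged; only the `None` entry becomes "no data".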

Timestamp: July 11, 2024

Is there a method to impute `null` values in a group of columns?

Yes, you can use the Apply To Multiple Columns transform to impute null values across different columns.
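
The idea of applying one null-imputation rule to a chosen group of columns can be sketched in Python. The helper `impute_nulls`, the column names, and the default value of 0 are hypothetical; the sketch only illustrates the concept behind the transform.

```python
def impute_nulls(rows, columns, default):
    """Return new rows with None replaced by `default` in the given columns only."""
    targets = set(columns)
    return [
        {
            name: (default if name in targets and value is None else value)
            for name, value in row.items()
        }
        for row in rows
    ]


# Hypothetical rows: "note" is not in the target group, so its null is kept.
rows = [
    {"id": 1, "price": None, "qty": 3, "note": None},
    {"id": 2, "price": 9.5, "qty": None, "note": "ok"},
]
imputed = impute_nulls(rows, ["price", "qty"], 0)
print(imputed)
```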

Timestamp: April 24, 2024

