
Configuration reference

:::callout{theme="warning" title="Sunset"} HyperAuto V1 is in the sunset phase of development and will be deprecated at a future date. Full support remains available. The creation of new V1 pipelines is discouraged, and users should migrate from HyperAuto V1 to V2 as detailed in the migration documentation. :::

:::callout{theme="warning" title="Warning"} This section describes advanced manual settings that can bring your SDDI pipeline into a broken state if not applied correctly. Always verify changes on a branch before deploying to production. :::

SDDI's pipeline is generated by a fully automated code repository. Cockpit is the default place to interact with those configurations, but you may have to manually amend the configuration files for advanced parameters or to configure non-standard source types.

:::callout{theme="neutral"} To review the steps involved, read about pipeline generation. :::

Configuration is performed in two main files located in the transforms-bellhop/src/config/ folder: SourceConfig.yaml and PipelineConfig.yaml.

SourceConfig.yaml

The following is a notional example of a fully-defined SourceConfig file.

sourceName: MY_SOURCE
sourceRid: ri.magritte..source.abcdefgh-1234-5678-910a-zyxwvut
sapContext:
  type: direct
rawFolderStructure:
  raw: /HyperAuto/source/raw
  dataDictionary: /HyperAuto/source/metadata
cleaningLibraries:
  - convert_all_columns_to_clean_types
deploymentSemanticVersion: 2
metadataSparkProfiles:
  - DRIVER_MEMORY_MEDIUM
languageKey: 'E'
tables:
  - tableName: ABCD
    datasetTransformsConfig:
      datasetName: ABCD
      deduplicationComparisonColumns: []
      batchUnionComponents: []
      tableCleaningLibraries: []
  - tableName: WXYZ
    datasetTransformsConfig:
      datasetName: WXYZ
      deduplicationComparisonColumns:
        - /PALANTIR/TIMESTAMP
        - /PALANTIR/ROWNO
      batchUnionComponents:
        - WXYZ_historical
        - WXYZ_incremental
      tableCleaningLibraries:
        - parse_timestamp_column
      sparkProfiles:
        profiles:
          - EXECUTOR_MEMORY_MEDIUM
          - NUM_EXECUTORS_4

Parameters description

| Parameter | Description |
| --- | --- |
| sourceName | The name to identify a source system. Used to prefix primary and foreign keys. |
| sourceRid | The RID of the source attached to this SDDI instance. |
| sapContext | (Optional) Details of the SAP context. |
| rawFolderStructure | Defines the folders in which the raw data and metadata reside. |
| cleaningLibraries | List of cleaning libraries to apply to all tables. |
| deduplicationConfig | (Optional, default: None) Config used to specify which columns to use for the deduplication logic. |
| metadataSparkProfiles | (Optional, default: None) List of Spark profiles to apply to metadata generation. |
| languageKey | (Optional, default: 'E') Language to use in enrichments. |
| deploymentSemanticVersion | (Optional, default: 0) Semantic version of the pipeline; incrementing it will force a snapshot. |
| tables | List of tables from that source to be processed by SDDI. |

sapContext

(Optional) Details of the SAP context. SAP Explorer will use this to pre-select the context. Each context will need to have its own SourceConfig file.
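Because each context needs its own SourceConfig file, a pipeline that covers two contexts would presumably list both files under sourceConfigFileNames in PipelineConfig.yaml. A minimal sketch, assuming two hypothetical file names (SourceConfig_ContextA.yaml and SourceConfig_ContextB.yaml are placeholders, not names defined by SDDI):

```yaml
# PipelineConfig.yaml (fragment)
sourceConfigFileNames:
  - SourceConfig_ContextA.yaml  # hypothetical file name, first SAP context
  - SourceConfig_ContextB.yaml  # hypothetical file name, second SAP context
```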

rawFolderStructure

Defines the folders in which the raw data and metadata reside.

Fields:

  • raw: Path of folder where raw tables are ingested.
  • dataDictionary: (Optional, default: the value of raw) Path of folder where metadata tables are ingested.
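As an illustration of the default described above, the following fragment sets only raw. Based on the stated default, metadata tables would then be expected in the same folder; the path is the notional one from the example file.

```yaml
rawFolderStructure:
  raw: /HyperAuto/source/raw
  # dataDictionary is omitted, so it should fall back to the raw folder
```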

cleaningLibraries

List of cleaning libraries to apply to all tables. Cleaning functions are defined in transforms-bellhop/src/software_defined_integrations/transforms/cleaned/function_libraries.

Adding or removing a function requires incrementing the deploymentSemanticVersion.
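A minimal sketch of such a change, starting from the notional SourceConfig shown earlier. It assumes that parse_timestamp_column, which appears in that example as a table-level library, can also be listed globally; that is an assumption for illustration only.

```yaml
cleaningLibraries:
  - convert_all_columns_to_clean_types
  - parse_timestamp_column     # newly added (assumed to be valid as a global library)
deploymentSemanticVersion: 3   # incremented from 2 because the library list changed
```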

deduplicationConfig

(Optional, default: None) Config used to specify which columns to use for the deduplication logic. Configuration defined here is applied across all tables.

Fields:

  • comparisonColumns: Columns for which the max value will be taken to determine the uniqueness of primary keys.
  • changeModeColumn: (Optional) If specified, rows having value D in this column will be deleted.
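The notional SourceConfig above does not include this block, so here is a hedged sketch of what it might look like. It assumes comparisonColumns takes a list, mirroring the table-level deduplicationComparisonColumns, and MY_CHANGE_MODE_COLUMN is a hypothetical column name.

```yaml
deduplicationConfig:
  comparisonColumns:            # assumed to be a list, as in the table-level setting
    - /PALANTIR/TIMESTAMP
    - /PALANTIR/ROWNO
  changeModeColumn: MY_CHANGE_MODE_COLUMN  # hypothetical; rows with value D here are deleted
```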

deploymentSemanticVersion

(Optional, default: 0) Semantic version of the pipeline; incrementing it will force a snapshot.

See Incremental Transforms for the effects of deploymentSemanticVersion on incremental and snapshot transforms.

metadataSparkProfiles

(Optional, default: None) List of Spark profiles to apply to metadata dataset generation (objects, fields, links and diffs).

Be sure the profiles are added to the repository before referencing them here.

tables

List of tables from the defined source to be processed by SDDI.

Fields:

  • tableName: Name of the table in metadata.
  • datasetTransformsConfig: Transform settings for this table.
    • datasetName: Foundry dataset name of the raw data.
    • deduplicationComparisonColumns: Table-specific columns to use for the deduplication logic. Applied after the global deduplication fields.
    • changeModeColumn: (Optional) If specified, rows having value D in this column will be deleted. Overrides the global change mode column.
    • batchUnionComponents: List of input dataset names that should be unioned before the cleaning step.
    • sparkProfiles: (Optional) Spark profiles to apply at different stages of the transforms.
      • profiles: Spark profiles; see details for adding them to the repository.
      • stages: (Optional, default: None) Transform stages the profiles should be applied to. Values should be in [CLEANED, DERIVED, ENRICHED, FINAL, RENAMED, RENAMED_CHANGELOG]. If None, profiles are applied at all stages.
    • tableCleaningLibraries: List of cleaning libraries to apply to this table. Cleaning functions are defined in transforms-bellhop/src/software_defined_integrations/transforms/cleaned/function_libraries. Adding or removing a function will require you to increment the deploymentSemanticVersion.
    • enforceUniquePrimaryKeys: (Optional, default: False) If True and deduplicationComparisonColumns are defined, guarantees that only one record per primary key will be kept at the deduplication stage. This may result in non-deterministic behavior.
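The fields that the notional SourceConfig does not exercise can be sketched as follows. This fragment assumes that changeModeColumn and enforceUniquePrimaryKeys sit under datasetTransformsConfig next to the other table settings, and that stages is written as a list; MY_CHANGE_MODE_COLUMN is a hypothetical column name.

```yaml
tables:
  - tableName: WXYZ
    datasetTransformsConfig:
      datasetName: WXYZ
      deduplicationComparisonColumns:
        - /PALANTIR/TIMESTAMP
      changeModeColumn: MY_CHANGE_MODE_COLUMN  # hypothetical; overrides the global setting
      enforceUniquePrimaryKeys: True           # keep one record per primary key
      batchUnionComponents: []
      tableCleaningLibraries: []
      sparkProfiles:
        profiles:
          - EXECUTOR_MEMORY_MEDIUM
        stages:                                # assumed list form; omit to apply at all stages
          - CLEANED
          - FINAL
```

Restricting the profiles to specific stages keeps the larger resource allocation off the stages that do not need it.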

PipelineConfig.yaml

The following is a notional example of a fully-defined PipelineConfig file.

sourceName: HyperAuto
sourceType: SAP_ERP
sourceConfigFileNames:
  - SourceConfig.yaml
outputFolder: /HyperAuto/source/output
workflows:
  my_workflow:
    variables:
      - name: my_variable_name
        value: my_variable_value
    enrichments:
      - my_enrichment_name
tables:
  ABCD:
    displayName: Header Table
    types:
      - OBJECT
  WXYZ:
    displayName: Item Table
    types:
      - OBJECT
      - METADATA
disableForeignKeyGeneration: False
disableEnrichedStage: False
disableRenamedStage: False

Parameters description

| Parameter | Description |
| --- | --- |
| projectName | Project name. Serves as a prefix to Ontology objects. |
| sourceType | Type of sources supported by SDDI. Should be one of [SAP_ERP, SALESFORCE, ORACLE_NETSUITE]. |
| sourceConfigFileNames | List of SourceConfig filenames to include in the pipeline. |
| outputFolder | Defines the folder in which output datasets will be written. |
| workflows | List of workflows to deploy, with configurations. |
| tables | List of tables processed in this SDDI pipeline. |
| disableEnrichedStage | (Optional, default: False) If enabled, no enriched datasets will be produced. Use with caution, as enabling will break workflows. |
| disableRenamedStage | (Optional, default: False) If enabled, no renamed_changelog datasets will be produced. Use with caution, as enabling will break workflows. |
| disableForeignKeyGeneration | If enabled, no foreign key columns will be produced. Use with caution, as enabling will break workflows. |

tables

List of tables processed in this SDDI pipeline:

  • displayName: Human-readable name of the table. The output dataset name will be constructed in the form displayName (technicalName).
  • types: List of data types this table represents (a table can have several).
    • OBJECT: Master data table that constitutes an object in the ontology.
    • METADATA: Metadata table that contains information on objects and on constructing primary keys.
    • CUSTOMIZATION: Enrichment table that is joined to master data tables at the enriched step of the SDDI pipeline.
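The notional PipelineConfig uses only the OBJECT and METADATA types. A sketch of a CUSTOMIZATION entry is shown below; MY_TEXT_TABLE and its display name are hypothetical placeholders, and the comment on the resulting dataset name assumes that technicalName corresponds to the table key.

```yaml
tables:
  ABCD:
    displayName: Header Table
    types:
      - OBJECT
  MY_TEXT_TABLE:                 # hypothetical table name
    displayName: Status Texts    # hypothetical; output would be named "Status Texts (MY_TEXT_TABLE)"
    types:
      - CUSTOMIZATION            # joined to master data tables at the enriched step
```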
