File-based syncs¶
After creating a file-based sync using exploration, you can update the configuration in the Configurations tab of the sync page.
Conceptual file-based ingestion modes¶
While Foundry file-based syncs offer low-level settings for greater flexibility, most use cases follow one of a few known modes. The following sections document each known mode, the low-level settings required to achieve it, and the settings that contradict it.
Batch mirror with SNAPSHOT (default)¶
- Transaction type: SNAPSHOT
- Filters: None
Each run will ingest all files nested in the external system's subdirectory, including files ingested in previous runs, and commit a SNAPSHOT transaction to the output dataset containing exactly those files. The output Foundry dataset view will contain a single SNAPSHOT transaction containing all files.
Contradictory settings¶
- Filters: Exclude files already synced
  - Results in the trailing window ingestion mode.
- Filters: Limit number of files
  - Results in the output Foundry dataset view containing only a non-deterministic subset of the desired files if the limit is lower than the total number of available files.
- Filters: At least N files
  - If there are fewer than N nested files in the specified subfolder of the external system, this setting will yield an empty transaction and result in 0 files being ingested. Otherwise, this setting has no effect.
Incremental mirror with APPEND¶
- Transaction type: APPEND
- Filters: Exclude files already synced
The output dataset view will contain a collection of APPEND transactions, which in aggregate contain all nested files ever present during any job run from the specified subfolder. Each run will ingest all files that have not yet been ingested, keyed by file path name, and commit an APPEND transaction to the output dataset.
Contradictory settings¶
- Filters: Exclude files already synced with the Last modified date or File size option
  - These options would attempt to incorrectly re-ingest existing files, keyed by file path name, in an APPEND transaction when their Last modified date or File size change, respectively. To allow updates to existing files, review the incremental mirror with UPDATE ingestion mode.
Incremental mirror with UPDATE¶
- Transaction type: UPDATE
- Filters: Exclude files already synced
- One or both of:
  - Filters: Exclude files already synced with the Last modified date option
  - Filters: Exclude files already synced with the File size option
The output dataset view will contain a collection of UPDATE transactions, which in aggregate contain the latest version of all nested files ever present during any job run from the specified subfolder. Each run will ingest all files that have not yet been ingested or have since changed, keyed by file path name, and commit an UPDATE transaction to the output dataset.
Caveat¶
Only use this mode if modifications to existing files are a non-negotiable behavior of the external system. While ingestion is incremental in the sense that only files that are new or changed are ingested in a given run, downstream pipelines cannot run incrementally, as the output dataset (input to the downstream pipelines) is not append-only.
Trailing window with SNAPSHOT¶
- Transaction type: SNAPSHOT
- Filters: Exclude files already synced
The output dataset view will contain a single SNAPSHOT transaction containing only files that were never present in any previous job run. Each run will ingest all files that have not yet been ingested, keyed by file path name, and commit a SNAPSHOT transaction to the output dataset, containing exactly those files.
This mode is useful when only "recent" files (files that were created in the external system between the second-to-last and last run) are relevant to downstream pipelines and operations. Files ingested in previous runs will not be visible in the output dataset view.
Contradictory settings¶
- Filters: Limit number of files
  - When the number of files created in the external system during a given window exceeds the specified limit, a non-deterministic subset of those files will be ingested, and the remainder will be deferred to a subsequent window. The backlog of deferred files may grow rapidly over time, destroying the "recency" intended in the output dataset.
:::callout{theme="neutral"}
It is always safe to specify the subfolder and optional regex, in addition to filters that limit the file types desired in the output. Such filters include Last modified after to exclude outdated files or Path does not match to exclude files with a certain file extension, such as .sh executable files.
Only the Exclude files already synced, At least N files, and Limit number of files filters are tightly coupled to the desired sync mode and might interfere with it.
:::
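The four modes above differ only in the transaction type and in whether already-synced files are excluded. The following is a minimal toy model of that behavior, not Foundry's implementation: the `ToySync` class, its field names, and the use of an integer as a stand-in for the last modified date are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToySync:
    """Toy model of a file-based sync (hypothetical, for illustration only)."""
    transaction_type: str                      # "SNAPSHOT", "APPEND", or "UPDATE"
    exclude_already_synced: bool = False
    compare_last_modified: bool = False        # the "Last modified date" option
    seen: dict = field(default_factory=dict)   # path -> last modified at ingestion
    view: dict = field(default_factory=dict)   # path -> last modified in the dataset view

    def run(self, source: dict) -> dict:
        """Ingest from `source` (path -> last modified) and return the new view."""
        if self.exclude_already_synced:
            # Files are keyed by path; a changed file only counts as new
            # when the last modified comparison is enabled.
            batch = {
                path: mtime
                for path, mtime in source.items()
                if path not in self.seen
                or (self.compare_last_modified and self.seen[path] != mtime)
            }
        else:
            batch = dict(source)
        self.seen.update(batch)
        if self.transaction_type == "SNAPSHOT":
            self.view = batch           # a SNAPSHOT replaces the whole view
        else:
            self.view.update(batch)     # APPEND adds files; UPDATE may overwrite them
        return dict(self.view)


first_run = {"a.csv": 1, "b.csv": 1}
second_run = {"a.csv": 2, "b.csv": 1, "c.csv": 2}   # a.csv modified, c.csv new

modes = {
    "batch mirror": ToySync("SNAPSHOT"),
    "incremental APPEND": ToySync("APPEND", exclude_already_synced=True),
    "incremental UPDATE": ToySync(
        "UPDATE", exclude_already_synced=True, compare_last_modified=True
    ),
    "trailing window": ToySync("SNAPSHOT", exclude_already_synced=True),
}
for name, sync in modes.items():
    sync.run(first_run)
    print(name, sync.run(second_run))
```

In this model, only the trailing window view drops `a.csv` and `b.csv` after the second run, and only the UPDATE view picks up the modified `a.csv` while keeping earlier files.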
Sync failure behavior¶
As with all batch syncs, file-based syncs are transactional. If a sync fails at any point, the transaction is aborted and none of the files from that run are committed to the dataset.
For example, if a sync is configured to upload one million files and a network error occurs after uploading all but one file, the entire transaction is aborted. None of the uploaded files are committed and the dataset view is not modified.
Because of this behavior, syncs that process a large number of files in a single run are more vulnerable to transient failures. To reduce the impact of sync failures, consider using incremental APPEND syncs with a file limit filter so that each run processes a smaller batch of files. This way, a failure only requires re-syncing the current batch rather than all files.
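To make the trade-off concrete, here is a small sketch that counts how many uploads are discarded when a single upload fails. It is a simplified model under stated assumptions: files are uploaded in a fixed order, and a file limit splits them into consecutive batches that each commit as their own transaction. The function name `wasted_uploads` is hypothetical.

```python
from typing import Optional


def wasted_uploads(total_files: int, failure_index: int, limit: Optional[int] = None) -> int:
    """Return how many completed uploads are discarded when one upload fails.

    `failure_index` is the 0-based position of the file whose upload fails.
    With no limit, all files share one transaction. With a limit, only the
    batch containing the failure is aborted; earlier batches stay committed.
    """
    if not 0 <= failure_index < total_files:
        raise ValueError("failure_index must refer to one of the files")
    if limit is not None and limit <= 0:
        raise ValueError("limit must be positive")
    batch_size = total_files if limit is None else limit
    batch_start = (failure_index // batch_size) * batch_size
    # Files uploaded in the aborted batch before the failure occurred.
    return failure_index - batch_start


# One transaction of one million files, failing on the very last file.
print(wasted_uploads(1_000_000, 999_999))
# The same failure when each run is limited to 10,000 files.
print(wasted_uploads(1_000_000, 999_999, limit=10_000))
```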
Configure file-based syncs¶
Configuration options for file-based syncs include the following:
| Parameter | Required? | Default | Description |
|---|---|---|---|
| Subfolder | Yes | | Specify the location of files within the connector that will be synced into Foundry. |
| Filters | No | | Apply filters to limit the files synced into Foundry. |
| Transformers | No | | Apply transformers to data before it is synced into Foundry. |
| 🟡 Completion strategies [Legacy] | No | | Enable to delete files and/or empty parent directories after a successful sync. Requires write permission on the source filesystem. Completion strategies are read-only and cannot be configured on new syncs. |
:::callout{theme="warning"} Syncs will include all nested files and folders from the specified subfolder. :::
Filters¶
Filters allow you to filter source files before they are imported into Foundry. The supported filter types are:
- Exclude files already synced: Only sync files that were added, or whose size or last modified date changed, since the last sync.
:::callout{theme="warning"}
Exclude files already synced has known scale limitations, as it requires scanning all files on every sync run. For syncs ingesting a large number of files, we recommend splitting the sync into a SNAPSHOT sync for historical data and an INCREMENTAL sync with Exclude files already synced, so that the filter is only applied to a smaller subset of recent files.
:::
- Path matches: Only sync files with a path (relative to the root of the connector) that matches the regular expression.
- Path does not match: Only sync files with a path (relative to the root of the connector) that does not match the regular expression.
- Last modified after: Only sync files that have been modified after a specified date and time.
- File size is between: Only sync files with a size between the specified minimum and maximum byte value.
- Any file has path matching: If any file has a relative path matching the regular expression, sync all files in the subfolder that are not otherwise filtered.
- At least N files: Sync all filtered files only if there are at least N files remaining.
- Limit number of files: Limit the number of files to keep per transaction. This option can increase the reliability of incremental syncs.
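As an illustration of how several of these filters combine, the sketch below applies them to an in-memory file listing. Everything here is an assumption made for the example rather than Foundry's implementation: the `SourceFile` type, the `apply_filters` function and its parameter names, full-path regular expression matching, inclusive size bounds, and the order in which filters are evaluated. The real Limit number of files filter keeps a non-deterministic subset; this sketch sorts by path so the result is reproducible.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SourceFile:
    path: str           # relative to the root of the connector
    size: int           # bytes
    modified: datetime


def apply_filters(
    files: List[SourceFile],
    path_matches: Optional[str] = None,
    path_does_not_match: Optional[str] = None,
    modified_after: Optional[datetime] = None,
    size_between: Optional[Tuple[int, int]] = None,
    at_least: Optional[int] = None,
    limit: Optional[int] = None,
) -> List[SourceFile]:
    """Return the files that would be synced under the given filters."""
    kept = sorted(files, key=lambda f: f.path)
    # Per-file filters first (assumed order).
    if path_matches is not None:
        kept = [f for f in kept if re.fullmatch(path_matches, f.path)]
    if path_does_not_match is not None:
        kept = [f for f in kept if not re.fullmatch(path_does_not_match, f.path)]
    if modified_after is not None:
        kept = [f for f in kept if f.modified > modified_after]
    if size_between is not None:
        low, high = size_between
        kept = [f for f in kept if low <= f.size <= high]
    # "At least N files": sync nothing unless enough files remain.
    if at_least is not None and len(kept) < at_least:
        return []
    # "Limit number of files": cap the files kept per transaction.
    if limit is not None:
        kept = kept[:limit]
    return kept


files = [
    SourceFile("in/a.csv", 10, datetime(2024, 1, 1)),
    SourceFile("in/b.sh", 5, datetime(2024, 1, 2)),
    SourceFile("in/c.csv", 500, datetime(2024, 1, 3)),
]
# Exclude shell scripts, as in the Path does not match example above.
print([f.path for f in apply_filters(files, path_does_not_match=r".*\.sh")])
```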
Transformers¶
Transformers allow you to perform basic file transformations (compression or decryption, for example) before uploading to Foundry. During a sync, the files chosen for ingest will be modified per the chosen transformer.
:::callout{theme="warning"} Rather than using Data Connection transformers, we recommend performing data transformations in Foundry with Pipeline Builder and Code Repositories to benefit from provenance and branching. :::
The following transformers are supported in Data Connection:
- Compress with Gzip
- Concatenate multiple files
  - Join multiple files into a single file.
- Rename files
  - Replace all occurrences of a given filename substring with a new substring.
  - Drop the directory path from the filename by replacing `^(.*/)` with `/`.
- Decrypt with PGP
  - Decrypt files that have been encrypted with PGP encryption.
  - Requires that the agent system has PGP keys configured.
  - Unavailable for syncs running on a Foundry worker.
- Append timestamp to filenames
  - Add a timestamp in a custom format to the filename of each file ingested.
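The effect of a few of these transformers can be approximated with the Python standard library. This is only a sketch of the equivalent operations, assuming Python regular expression semantics for the rename pattern; the function names are hypothetical and this is not how Data Connection implements its transformers.

```python
import gzip
import re
from typing import Iterable


def drop_directory_path(filename: str) -> str:
    """Rename files: replace everything up to the last slash with a single slash."""
    return re.sub(r"^(.*/)", "/", filename)


def rename_substring(filename: str, old: str, new: str) -> str:
    """Rename files: replace all occurrences of a filename substring."""
    return filename.replace(old, new)


def compress_with_gzip(content: bytes) -> bytes:
    """Compress with Gzip: return the gzip-compressed bytes of a file."""
    return gzip.compress(content)


def concatenate_files(contents: Iterable[bytes]) -> bytes:
    """Concatenate multiple files: join file contents in the given order."""
    return b"".join(contents)


print(drop_directory_path("raw/2024/report.csv"))
print(rename_substring("report_draft.csv", "_draft", ""))
```

Because the `.*` in the rename pattern is greedy, it consumes every directory segment up to the last slash, so only the final path component is kept.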
Completion strategies¶
:::callout{theme="warning"} Completion strategies are in the legacy phase of development, and no additional development is expected. Completion strategies are read-only on existing syncs and cannot be configured on new syncs. We recommend using a downstream external transform as an alternative. :::
Completion strategies are a mechanism to attempt deleting files and empty parent directories after a successful batch sync of those files into a Foundry dataset. Completion strategies are read-only and cannot be configured on new syncs. Existing syncs with completion strategies will continue to function, but the configuration cannot be modified.
Deleting files after a sync may seem useful when data is synced from an intermediate staging area, such as a file storage system, rather than directly from the external system of record. However, this staging workflow is not recommended: Foundry should be connected directly to the external system of record. In cases where Foundry is the system of record, data should be pushed directly to Foundry rather than to an intermediate staging area.
Limitations of completion strategies and alternatives¶
Completion strategies are subject to several important limitations and caveats. These limitations and potential mitigations or alternatives are described below.
Completion strategy support¶
Completion strategies are in the legacy phase of development, and no additional development is expected. Completion strategies are read-only on existing syncs and cannot be configured on new syncs. Completion strategies were only available for sources using an agent worker. We recommend implementing the functionality provided by completion strategies as a downstream transform instead.
As an example, assume you have a direct connection to an S3 bucket containing the files foo.txt and bar.txt. You want to use a file batch sync to copy them to a dataset, and then delete the files from S3. Instead of using completion strategies, you should do the following:
- Configure a batch sync without any completion strategies and schedule it to run.
- Write a downstream external transform job which is scheduled to run when the sync output dataset is updated, taking the synced data as an input.
- In that external transform, write Python transforms code to iterate through the files that have appeared in the synced dataset, and make calls to S3 to delete those files from the bucket.
Note that this approach is retryable if any deletion calls fail, and guarantees that data is successfully committed to Foundry before attempting to perform any deletions. This approach is also compatible with incremental file batch syncs.
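The deletion step of the workflow above could look roughly like the following sketch. It assumes a boto3 S3 client (or any object with a compatible `delete_objects` method) and assumes that the file paths in the synced dataset map directly to S3 object keys, which depends on how the subfolder is configured. The function name `delete_synced_files` is hypothetical, and the Foundry external transform wiring that lists the dataset files and supplies credentials is deliberately omitted.

```python
from typing import Iterable, List


def delete_synced_files(
    s3_client, bucket: str, keys: Iterable[str], batch_size: int = 1000
) -> List[str]:
    """Delete `keys` from `bucket` and return the keys S3 reported as failed.

    `s3_client` is expected to behave like a boto3 S3 client. S3 accepts at
    most 1,000 keys per delete_objects request, hence the batching.
    """
    pending = list(keys)
    failed: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        response = s3_client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in batch], "Quiet": True},
        )
        # Per-object failures are reported in the response, not raised.
        failed.extend(error["Key"] for error in response.get("Errors", []))
    return failed
```

Returning the failed keys rather than raising on the first error lets the transform log them and retry on its next scheduled run, which matches the retryable behavior described above.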
Completion strategies are best effort¶
For existing syncs using completion strategies, note that completion strategies are best effort. This means that they do not guarantee that data will be effectively removed. The following are some situations that may cause completion strategies to fail:
- Completion strategies will not be retried if the agent worker crashes or is restarted after the batch sync commits data to Foundry, but before the completion strategies run.
- If the credentials used to connect do not have write permissions, the batch sync may successfully read data and commit to Foundry, but fail to perform the deletions specified by the completion strategy.
In general, we recommend using an alternative to completion strategies wherever possible. Custom completion strategies are no longer supported.
For directory sources specifically, if you require guaranteed file deletion after ingestion, consider using an external transform to delete files via SSH instead.
Optimize file-based syncs¶
:::callout{theme="warning" title="Warning"} This guide is recommended for users setting up a new sync or troubleshooting a slow or unreliable sync. If your sync is already working reliably, you do not need to take any action. :::
Syncing a large number of files into a single dataset can be slow and vulnerable to transient failures. As described in sync failure behavior, a failure at any point aborts the entire transaction, which means larger syncs risk more lost work.
If the dataset grows over time, the time to sync the data as a SNAPSHOT increases because SNAPSHOT transactions sync all data into Foundry. Instead, use syncs configured with transaction type APPEND to import your data incrementally. Since you will be syncing smaller, discrete chunks of data, you will create an effective checkpoint; a sync failure will result in a minimal amount of duplicated work rather than requiring a complete re-run. Additionally, your dataset syncs will run faster as you no longer need to upload all of your data for every sync.
Configure incremental APPEND syncs¶
APPEND transactions require additional configuration to run successfully.
By default, files synced into Foundry are not filtered. However, APPEND syncs require filters to prevent the same files from being imported repeatedly. We recommend using the Exclude files already synced and Limit number of files filters to control how many files get imported into Foundry in a single sync. Finally, schedule your sync so that the dataset remains up to date with your source system.
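When choosing a file limit and a schedule, it can help to estimate how many runs an incremental APPEND sync needs to work through an existing backlog. The sketch below is a back-of-the-envelope model, assuming a constant number of new files arrives between runs and that each run ingests up to the limit; the function name `runs_to_catch_up` and its parameters are hypothetical.

```python
from typing import Optional


def runs_to_catch_up(backlog: int, new_files_per_run: int, limit: int) -> Optional[int]:
    """Return the number of runs until a run leaves no unsynced files behind.

    Returns None when the backlog can never be cleared because files arrive
    at least as fast as the limit allows them to be ingested.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    runs = 0
    while backlog > 0:
        if backlog > limit and new_files_per_run >= limit:
            return None                      # the backlog never shrinks
        backlog = max(backlog - limit, 0)    # one run ingests up to `limit` files
        runs += 1
        if backlog == 0:
            break
        backlog += new_files_per_run         # files created before the next run
    return runs


# One million existing files, 2,000 new files between runs, 10,000 files per run.
print(runs_to_catch_up(1_000_000, 2_000, 10_000))
```

If the function returns `None`, either raise the limit or schedule the sync more frequently so that fewer files arrive between runs.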