
Set up a streaming sync

A sync is a task that reads specific data from a source and ingests it into Foundry. For example, if you have a relational database source that contains multiple tables, you might configure a sync to ingest a specific table into Foundry.

A streaming sync is similar to a non-streaming (batch or incremental) sync, with one primary difference: a batch or incremental sync runs periodically, while a streaming sync runs continuously to pull data into Foundry with as little latency as possible.

Below, we will discuss the steps required to create a streaming sync:

  1. Define the data to sync from the source.
  2. Define a location in Foundry to send the data.
  3. Configure the streaming sync.
  4. Run the streaming sync.

For this tutorial, we will use a Kafka source to set up the sync.

Part 1. Define data

First, decide which data you would like to sync into Foundry. Select your streaming source in Data Connection, then select the available action in the top right corner:

  • Explore and create syncs: This option appears if your source type supports source exploration, allowing you to explore your data source while creating a sync.
  • Create sync: This option appears if your source type does not support source exploration.

Explore Kafka source

Explore and create syncs

If your source type supports source exploration, you will land on the Explore source page in Data Connection, which shows the data available to sync. The exploration view depends on the source type you are using. For example, a Kafka source exploration allows you to see the topics present on the Kafka broker and preview the data contained in those topics.

In the Kafka exploration view, existing topics are listed on the left side of the page.

Explore Kafka source

Selecting a topic will let you preview a sample of data from that topic.

Preview Kafka topic
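
To make the idea of a topic preview concrete, the following is a minimal sketch of sampling a few records from a topic outside of Foundry. It is not how Data Connection implements source exploration. It assumes message values are UTF-8 encoded JSON and that the third-party `confluent-kafka` Python client is installed; the helper names, the `preview-sketch` consumer group, and any broker address or topic name you pass in are hypothetical.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def decode_record(raw: bytes) -> Dict[str, Any]:
    """Decode one message value, assuming UTF-8 JSON (an assumption)."""
    try:
        return {"ok": True, "value": json.loads(raw.decode("utf-8"))}
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Keep the raw bytes so undecodable records stay visible in the preview.
        return {"ok": False, "value": raw}


def preview(values: Iterable[bytes], limit: int = 5) -> List[Dict[str, Any]]:
    """Return at most `limit` decoded records from an iterable of raw values."""
    return [decode_record(v) for v in islice(values, limit)]


def read_topic_values(bootstrap_servers: str, topic: str,
                      max_polls: int = 20) -> Iterator[bytes]:
    """Yield raw message values from a topic (hypothetical helper).

    Requires the confluent-kafka package, imported lazily so the
    decoding helpers above work without it.
    """
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": "preview-sketch",      # hypothetical consumer group
        "auto.offset.reset": "earliest",   # start from the oldest retained record
        "enable.auto.commit": False,       # a preview should not move offsets
    })
    consumer.subscribe([topic])
    try:
        for _ in range(max_polls):
            msg = consumer.poll(1.0)
            if msg is None or msg.error() or msg.value() is None:
                continue
            yield msg.value()
    finally:
        consumer.close()
```

With a reachable broker, `preview(read_topic_values("localhost:9092", "sensor-readings"))` would return up to five decoded records; both arguments here are placeholders.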

Part 2. Define the sync location

Next, you need to decide where to save your synced dataset in Foundry. The location of your dataset will determine who has permission to access the resulting dataset, based on Project-level permissions.

We recommend saving a synced dataset next to its source in a Project, allowing them to have the same permissions; matching dataset and source permissions are helpful when creating data pipelines. Learn more about the recommended Project structure for data pipelines.

Once you choose your sync location, click Create streaming sync in the upper right corner.

Part 3. Configure the streaming sync

Now, you will land on the Sync creation page in Data Connection where you can define source-specific and core streaming configurations for your sync.

  • Source-specific: Located at the top of the configuration page, these options depend on your source type and configure the parameters passed to the specific source to which you are connecting.
  • Core streaming: Located below the source-specific configuration, these options are common to all streaming syncs. Core configurations include the throughput, schema, and sync destination.

Configure Kafka sync

Next, select the throughput for your stream. The throughput determines the number of partitions that will be created; more partitions allow for higher throughput. The Normal throughput setting allows up to 5 MB/s for the stream.
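
As a rough way to check whether a workload fits within the 5 MB/s allowed by Normal throughput, you can estimate the data rate from the record rate and the average record size. The sketch below is only arithmetic. It assumes 1 MB means 1,000,000 bytes, and the `headroom` factor and function names are illustrative choices rather than Foundry settings.

```python
NORMAL_LIMIT_MB_PER_S = 5.0  # limit stated for Normal throughput


def estimated_mb_per_s(records_per_second: float, avg_record_bytes: float) -> float:
    """Estimate the data rate, assuming 1 MB = 1,000,000 bytes."""
    return records_per_second * avg_record_bytes / 1_000_000


def fits_normal(records_per_second: float, avg_record_bytes: float,
                headroom: float = 0.8) -> bool:
    """Check the estimate against the Normal limit.

    `headroom` is a hypothetical safety margin: 0.8 means the estimate
    must stay at or below 80% of the limit to allow for bursts.
    """
    rate = estimated_mb_per_s(records_per_second, avg_record_bytes)
    return rate <= NORMAL_LIMIT_MB_PER_S * headroom


if __name__ == "__main__":
    # 2,000 records per second at about 1 KB each
    print(estimated_mb_per_s(2000, 1000))
    print(fits_normal(2000, 1000))
```

If the estimate is close to or above the limit, a higher throughput setting is likely the safer choice.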

Then, specify the schema of the input data. By default, the schema is inferred from the source, but you can override it if necessary.

Set stream schema
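
To illustrate what inferring a schema from sample data can look like, and why you might need to override the result, here is a minimal sketch that derives field types from a handful of JSON-like records. It is not Foundry's inference logic. The type names (`long`, `double`, `string`, and so on) and the merge rules are assumptions chosen for the example.

```python
from typing import Any, Dict, Iterable


def type_name(value: Any) -> str:
    """Map a Python value to an illustrative type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"  # strings and anything else fall back to string


def merge_types(a: str, b: str) -> str:
    """Combine two observed types for the same field."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"  # widen integers to floating point
    return "string"      # conflicting types fall back to string


def infer_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Infer a field-to-type mapping from sample records."""
    schema: Dict[str, str] = {}
    for record in records:
        for field, value in record.items():
            observed = type_name(value)
            schema[field] = merge_types(schema[field], observed) if field in schema else observed
    return schema


if __name__ == "__main__":
    sample = [
        {"id": 1, "temp": 20},
        {"id": 2, "temp": 20.5, "site": "A"},
        {"id": 3, "site": None},
    ]
    print(infer_schema(sample))
```

In this sample, `temp` appears as both an integer and a decimal, so the sketch widens it to `double`. An inferred schema depends on the records that happen to be sampled, which is one reason an override can be useful.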

Once you have configured your sync, select Create Sync in the top right corner.

Now that your sync is created, you will be taken to the Overview tab.

Part 4. Run the sync

Now, you are ready to run the sync. Select the Overview tab to view a summary of your new sync, including the output dataset, location, and available actions.

Click Start to begin running the sync of data from the external stream into Foundry.

Kafka sync overview

To view the stream data, navigate to the stream you configured while creating the sync and open its preview page. You should see records from the Kafka topic flowing into the stream.

View stream output

Next steps

Now that you have successfully run a sync, learn how to debug a failing stream, push data into a stream with push-based ingestion, or integrate your stream with the Ontology.

