Google Pub/Sub
Connect Foundry to Google Pub/Sub to read data from a topic into a Foundry stream in real time.
Supported capabilities
| Capability | Status |
|---|---|
| Exploration | 🟢 Generally available |
| Streaming syncs | 🟢 Generally available |
| Streaming exports | 🟢 Generally available |
Data model
The connector does not parse message contents, so data of any type can be synced into Foundry. All content is uploaded, unparsed, into the data column. Use a downstream streaming transform (for example, parse_json in Pipeline Builder) to parse the data. The id column contains the message ID received from Pub/Sub.
| id (string) | data (string) |
|---|---|
| 5986331692832221 | {"firstName": "John", "lastName": "Doe"} |
| 5986326266478130 | test-payload |
When setting up a sync, the schema of the dataset in Foundry must match the above schema.
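As an illustration of the downstream parsing step, the following is a minimal sketch in plain Python. It is not the Pipeline Builder parse_json expression. The `parse_data` helper and the `rows` list are hypothetical names, used only to show how the two example rows above would behave when one payload is JSON and the other is not.

```python
import json
from typing import Any, Optional


def parse_data(data: str) -> Optional[Any]:
    """Decode a raw `data` string as JSON.

    Hypothetical helper: returns None when the payload is not valid JSON.
    Note that the JSON literal `null` also decodes to None.
    """
    try:
        return json.loads(data)
    except json.JSONDecodeError:
        return None


# The two example rows from the table above, as they would arrive unparsed.
rows = [
    {"id": "5986331692832221", "data": '{"firstName": "John", "lastName": "Doe"}'},
    {"id": "5986326266478130", "data": "test-payload"},
]

# Keep the message ID and attach the parsed value (or None) to each row.
parsed = [{"id": row["id"], "parsed": parse_data(row["data"])} for row in rows]
```

Because the connector accepts any payload, a downstream parser should expect rows such as the second one, which are not JSON, and decide explicitly how to handle them.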
Performance and limitations
The connector uses a single consumer thread to sync messages from Pub/Sub. When exporting data from Foundry to Pub/Sub, the connector uses one thread per partition of the Foundry data stream.
:::callout{theme="neutral"} Streaming syncs create a subscription based on the output dataset in Foundry. The credentials must have permissions to create subscriptions. If multiple imports are configured to read from the same topic, they will each create their own subscription and read all the messages on the topic. :::
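The fan-out behavior described in the callout can be pictured with a toy in-memory model. This sketch does not use the Pub/Sub client library or any Foundry API. `ToyTopic` and the subscription names `sync-a` and `sync-b` are hypothetical, and the model assumes both subscriptions exist before any message is published.

```python
from typing import Dict, List


class ToyTopic:
    """In-memory stand-in for a topic; not the Pub/Sub client library."""

    def __init__(self) -> None:
        self._subscriptions: Dict[str, List[str]] = {}

    def create_subscription(self, name: str) -> None:
        # Each sync gets its own, independent message queue.
        self._subscriptions.setdefault(name, [])

    def publish(self, message: str) -> None:
        # Every existing subscription receives its own copy of the message.
        for queue in self._subscriptions.values():
            queue.append(message)

    def pull(self, name: str) -> List[str]:
        # Return and clear the pending messages for one subscription.
        pending, self._subscriptions[name] = self._subscriptions[name], []
        return pending


topic = ToyTopic()
topic.create_subscription("sync-a")  # subscription for the first sync
topic.create_subscription("sync-b")  # subscription for the second sync
topic.publish("m1")
topic.publish("m2")

sync_a_messages = topic.pull("sync-a")
sync_b_messages = topic.pull("sync-b")
```

In this model both syncs see `m1` and `m2`. Two syncs on the same topic therefore duplicate the data rather than splitting the load between them.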
Streaming syncs are meant to be consistent, long-running jobs. Any interruption to a streaming sync can amount to an outage, depending on the latency that downstream consumers expect.
Consider the following before setting up streaming syncs in Foundry:
- Jobs running on a Foundry worker restart at least once every 48 hours. Expected downtime is less than ten minutes, assuming resource availability allows jobs to immediately restart.
- For legacy agent worker sources, jobs restart during agent maintenance windows (typically once a week) to pick up upgrades. Expected downtime is less than five minutes.
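To put the restart figures above in perspective, the arithmetic below estimates the share of time a sync might be unavailable. It is a rough sketch under two assumptions that the source does not guarantee: exactly one restart per interval, and each restart taking the full stated upper bound. The function name is hypothetical.

```python
def downtime_fraction(downtime_minutes: float, interval_hours: float) -> float:
    """Fraction of each interval spent restarting, under the stated assumptions."""
    return downtime_minutes / (interval_hours * 60)


# Foundry worker: up to ten minutes of downtime per 48-hour interval.
foundry_worker = downtime_fraction(10, 48)

# Legacy agent worker: up to five minutes per weekly maintenance window.
agent_worker = downtime_fraction(5, 7 * 24)

print(f"Foundry worker: {foundry_worker:.2%} of the time")
print(f"Agent worker:   {agent_worker:.2%} of the time")
```

Restarts may happen more often than once per interval, and delayed resource availability can lengthen them, so treat these values as illustrative only.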
Setup
- Open the Data Connection application and select + New Source in the upper right corner of the screen.
- Select Pub/Sub from the available connector types.
- Follow the additional configuration prompts to continue the setup of your connector using the information in the sections below.
Learn more about setting up a connector in Foundry.
Authentication
Choose between two available authentication methods:
- GCP instance account: Refer to the Google Cloud documentation ↗ to learn how to set up instance-based authentication. Note that GCP instance authentication only works for connections operating through agents that run on appropriately configured instances in GCP.
- Service account key file: Refer to the Google Cloud documentation ↗ to learn how to set up service account key file authentication. The key file can be provided as JSON or PKCS8 credentials.
Configured credentials must have access to the following:
- For syncs:
  - roles/pubsub.viewer
  - roles/pubsub.subscriber
  - projects.subscriptions.create
- For exports:
  - roles/pubsub.publisher
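The access requirements above can be written down as a small checklist. This is a hypothetical helper for reviewing a list of grants by hand; it does not call Google Cloud IAM, and the names `REQUIRED_ACCESS` and `missing_access` are assumptions made for illustration. The entries are copied exactly as listed above.

```python
from typing import Iterable, List

# Access entries exactly as listed in this section, keyed by capability.
REQUIRED_ACCESS = {
    "sync": {
        "roles/pubsub.viewer",
        "roles/pubsub.subscriber",
        "projects.subscriptions.create",
    },
    "export": {"roles/pubsub.publisher"},
}


def missing_access(granted: Iterable[str], capability: str) -> List[str]:
    """Return the required entries for `capability` that are absent from `granted`."""
    return sorted(REQUIRED_ACCESS[capability] - set(granted))


# Example: credentials that only hold the viewer role cannot run a sync.
gaps = missing_access(["roles/pubsub.viewer"], "sync")
```

An empty result means every listed entry is present for that capability.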
Connection details
The following configuration options are available for the Pub/Sub connector:
| Option | Required? | Description |
|---|---|---|
| Project ID | Yes | The ID of the Project in GCP. |
| Credentials settings | Yes | Configure using the Authentication guidance shown above. |
| Proxy settings | No | Enable to allow a proxy connection to Pub/Sub. |
| GRPC Settings | No* | Advanced settings used to configure GRPC channels. |
Sync data from Pub/Sub
Learn how to set up a sync with Pub/Sub in the Set up a streaming sync tutorial.
When setting up a sync, the schema of the dataset must match the schema described in the data model section above.
Export data to Pub/Sub
The connector supports exporting streams to Pub/Sub through Data Connection.
To export to Pub/Sub, first enable exports for your Pub/Sub connector. Then, create a new export.
Export configuration options
| Option | Required? | Default | Description |
|---|---|---|---|
| Topic | Yes | N/A | The Pub/Sub topic to which you want to export. |
| Value Column | No | N/A | If no value is specified here, the entire contents of the record on the Foundry stream will be written as a string to Pub/Sub. If specified, only the contents of the Value Column will be exported to Pub/Sub. |
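The effect of the Value Column option can be sketched as follows. This is an illustration only, not the connector's implementation. The source states that the whole record is written "as a string" but does not name the format, so the use of JSON here is an assumption. The `export_payload` function and the `body` column are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def export_payload(record: Dict[str, Any], value_column: Optional[str] = None) -> str:
    """Build the string that would be sent for one stream record."""
    if value_column is None:
        # No Value Column: send the entire record (JSON is an assumed format).
        return json.dumps(record)
    # Value Column set: send only that column's contents.
    return str(record[value_column])


record = {"id": "1", "body": "hello"}

whole_record = export_payload(record)
only_value = export_payload(record, value_column="body")
```

With a Value Column set, consumers of the topic receive just the column's contents and none of the other fields on the record.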