Initial setup overview
This guide will walk you through the process of connecting your organization's data to Foundry.
Before starting, note that this first step towards connecting your organizational data to Foundry is fundamentally a networking task. The initial setup is best done by someone who is familiar with network engineering and knows the organization's network topology and configuration, such as its firewall rules.
Conceptual overview
Connecting data to Foundry requires that three components are configured in the following order:
1. Configure networking connectivity
You must first ensure that there is a valid networking path between Foundry and the external system.
For external systems hosted on the same network as Foundry (for cloud-hosted Foundry instances, this typically means systems accessible over the Internet), define direct connection egress policies to route the traffic. Make sure that the external system allows inbound connections from Foundry.
For external systems hosted on a separate network from Foundry's network (for cloud-hosted Foundry instances, this is typically on-premise systems), an agent is required. An agent is Palantir software that runs within your organization’s network and functions as a secure intermediary between your organization’s data sources and your Foundry instance. Make sure the external system allows inbound connections from the agent and that the agent can establish outbound connections to the external system as well as to Foundry.
The agent can then be used to define agent proxy egress policies to route traffic through the agent when using a Foundry worker.
An agent can also be used as a legacy agent worker, in which case capabilities execute on the agent host. New sources should use a Foundry worker; see Foundry worker vs. agent worker.
Agents can be shared by agent worker sources and sources using agent proxy egress policies, though we recommend always having multiple agents assigned to a given source to maximize availability.
Learn more about various architecture patterns.
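Before defining egress policies, it can help to confirm that the required TCP paths are actually open. The sketch below is a generic check built on Python's standard `socket` module. It is not a Foundry or agent feature, and the function names `can_connect` and `check_paths` are illustrative. Run it from the agent host to test the paths to the external system and to Foundry.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_paths(targets: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Check every named (host, port) target and report which ones are reachable."""
    return {name: can_connect(host, port) for name, (host, port) in targets.items()}
```

For example, calling `check_paths({"database": ("db.internal.example", 5432), "foundry": ("foundry.example.com", 443)})` from the agent host would report both outbound paths at once. Both host names are placeholders. A successful TCP handshake only shows that the firewall path is open; it does not validate credentials or TLS configuration.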
2. Source configuration
You must configure a source, or connection, to connect your external system to Foundry before executing any capability. An external system could be, for example, a Postgres database, an S3 bucket, a filesystem on a Linux server, an SAP instance, or a REST API accessible over the Internet.
3. Capability configuration
Once a source is configured to connect to the external system, you must configure the capability to execute on that source. Capabilities include batch syncs of data, streaming syncs, webhooks, exports, and more.
A batch sync, for example, reads specific data from an external system and ingests it into Foundry. If you have a PostgreSQL database that contains multiple tables, you might configure a sync to ingest one specific table into Foundry. Once a sync has successfully run, the result in Foundry will be a dataset to use across all of Foundry's data pipelining, model development, and analytical tools.
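The dependency between the three steps can be modeled as a small data structure. The sketch below is purely illustrative: the class names, fields, and the `validate` helper are hypothetical and do not correspond to any Foundry API. It only encodes the rules stated above: a capability needs a configured source, and a source that routes traffic through an agent needs at least one agent.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class EgressKind(Enum):
    DIRECT = "direct connection"  # external system reachable from Foundry's network
    AGENT_PROXY = "agent proxy"   # traffic routed through an agent in your network


@dataclass
class Source:
    name: str
    egress: EgressKind
    agents: List[str] = field(default_factory=list)


@dataclass
class Capability:
    name: str
    kind: str  # e.g. "batch sync", "streaming sync", "webhook", "export"
    source: Optional[Source] = None


def validate(capability: Capability) -> List[str]:
    """Return a list of problems; an empty list means the setup order is respected."""
    source = capability.source
    if source is None:
        return [f"{capability.name}: configure a source before the capability"]
    problems = []
    if source.egress is EgressKind.AGENT_PROXY and not source.agents:
        problems.append(f"{source.name}: agent proxy egress requires at least one agent")
    return problems
```

A batch sync attached to a source that has an agent proxy policy and two assigned agents would pass this check, while a capability with no source would be reported first.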
Roles and workflows
Most Foundry users will never need to set up a new agent themselves. Agent setup requires an IT-focused skill set, though the same agent can be reused to support multiple sources and syncs. Some organizations can operate long-term with agents set up during the first week of a Foundry deployment. New agents are only needed to access data that your existing agents cannot access (due to network segmentation or data scale, for example) or to set up an additional agent to allow for high availability.
The table below summarizes the configuration frequency and skill set required for maintaining the resources required for connecting to data:
| Resource | Frequency of configuration | Typical user role | Knowledge required |
|---|---|---|---|
| Agent | Rare | IT / Network Engineer | Network and firewall policies; Linux VMs; SSH |
| Source | Occasional | IT / Network Engineer; Data Engineer | Debugging network access; credential management |
| Sync | Frequent | Data Engineer; Data Scientist | Writing SQL queries; managing files |
High availability
We recommend setting up redundant hardware to establish a high availability (HA) architecture. High availability increases resiliency and allows no-downtime maintenance during operating hours.
Foundry offers HA at the source level, meaning that if a source is assigned to multiple agents, Foundry will dispatch ingestions to one of the healthy agents. We strongly recommend configuring agents in a high availability setup when you first create a source; adding extra agents to an existing source requires re-entering the credentials for that source.
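Source-level dispatch can be pictured with a minimal sketch. The `Agent` class and `pick_agent` function below are hypothetical, and choosing the first healthy agent is an assumption made for illustration; this guide does not describe how Foundry actually selects among healthy agents.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Agent:
    name: str
    healthy: bool


def pick_agent(assigned: List[Agent]) -> Optional[Agent]:
    """Return one healthy agent from those assigned to a source, or None.

    Assumption: the first healthy agent in the list wins. The real
    selection policy is not described in this guide.
    """
    return next((agent for agent in assigned if agent.healthy), None)
```

With a single assigned agent, one outage leaves nothing to dispatch to (`pick_agent` returns `None`), which is why assigning at least two agents per source matters.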
The following best practices are recommended when setting up high availability:
- Always install agents in pairs, on similar hardware.
- Give each agent in a pair similar names, such as `agent-1` and `agent-2`.
- Systematically assign both agents in a pair to every source.
- Configure non-overlapping upgrade windows for the two agents in a pair. Upgrade windows should fall on business days and allow sufficient soak time between them. This ensures that any unexpected issue with an update is contained to a single agent and can be detected by operators or administrators.
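The upgrade-window practice above can be checked mechanically. The sketch below assumes a hypothetical representation of a window as a weekday plus a start and end time, and it treats Monday through Friday as business days; neither reflects how Foundry stores upgrade windows.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class UpgradeWindow:
    weekday: int  # 0 = Monday ... 6 = Sunday, as in datetime.date.weekday()
    start: time
    end: time     # exclusive; assumed to be later than start on the same day


def overlaps(a: UpgradeWindow, b: UpgradeWindow) -> bool:
    """Return True if the two windows share any time on the same weekday."""
    return a.weekday == b.weekday and a.start < b.end and b.start < a.end


def on_business_day(window: UpgradeWindow) -> bool:
    """Return True if the window falls on Monday through Friday (an assumption)."""
    return window.weekday < 5
```

For a pair such as `agent-1` and `agent-2`, one option is to place the windows on different weekdays, for example Tuesday and Thursday mornings, so that `overlaps` returns `False` and the first upgrade has time to soak before the second begins.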
Next steps
To get started, move on to setting up a source.