The Rubix substrate

AIP, Foundry, and Apollo all operate within a hardened, autoscaling, highly available implementation of Kubernetes known as Palantir Rubix.

Drawing on Palantir's decades of experience building secure software, Rubix was developed to extend the core benefits of open-source containerization with the features needed to operate in the world’s most demanding environments. These include ephemeral compute nodes with enforced cycle times, secure-by-default networking, dynamic and intelligent autoscaling, and a wide range of features that meet the most stringent accreditation standards, including FedRAMP High, DOD DISA IL-5/IL-6, and CMMC.

The Rubix architecture was initially developed to host Palantir’s own software, but is now used by software companies of all sizes to deploy their products into any environment of their choosing, including the most restrictive and complex environments in the world.

At the heart of Rubix’s design is a set of uncompromising and interlocking assumptions: mission-critical software must be secure by design, highly available, and capable of rapid evolution.

Illustration of Rubix: a hardened, autoscaling, highly available implementation of Kubernetes.

Security

Rubix is designed to mitigate both tactical and advanced persistent threat vectors.

Every workload is securely isolated according to its requirements. This allows operational tasks that need elevated privileges to run safely, and keeps them distinct from application-driven executions, which operate with precisely governed permissions. Encryption is rigorously enforced across every element in the environment, and every interaction between workloads must be authenticated, authorized, and logged in accordance with immutable configurations.
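The requirement that every workload interaction be authenticated, authorized, and logged against an immutable configuration can be illustrated with a small deny-by-default check. This is only a sketch under stated assumptions: the `POLICY` table, the workload names, `AuditRecord`, and `check_call` are hypothetical and are not part of any Rubix API, and authentication is reduced to "the caller identity is `None` if verification failed".

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import List, Optional

# Hypothetical allow-list (caller -> permitted callees), exposed through a
# read-only view so it cannot be modified at runtime.
POLICY = MappingProxyType({
    "ingest": frozenset({"storage"}),
    "api": frozenset({"storage", "ingest"}),
})


@dataclass(frozen=True)
class AuditRecord:
    caller: Optional[str]
    callee: str
    decision: str


def check_call(caller: Optional[str], callee: str, audit_log: List[AuditRecord]) -> bool:
    """Deny by default, and record every decision, allowed or not."""
    if caller is None:
        # Assumption: identity verification happened upstream and failed.
        decision = "deny:unauthenticated"
    elif callee in POLICY.get(caller, frozenset()):
        decision = "allow"
    else:
        decision = "deny:unauthorized"
    audit_log.append(AuditRecord(caller, callee, decision))
    return decision == "allow"
```

In this sketch a call such as `check_call("api", "ingest", log)` is allowed, while the reverse direction is denied because it is absent from the allow-list; both outcomes leave an audit record.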

Palantir built the first versions of Rubix to pioneer this secure, autoscaling paradigm with the Spark compute runtime. Today, the paradigm extends across every runtime and service structure within AIP, Foundry, and Apollo, from managing system connections, to data integration, to model management, to application development, to agent building, through to the developer toolchain.

Illustration of Rubix Security, including a list of core infrastructure security features.

High availability

High availability is interwoven with Rubix’s approach to ephemerality; nodes in Rubix environments cannot live longer than 48 hours. This ensures that every service, whether the user-facing Ontology Manager or a backend service for transforming streams, is designed for disruption and resilient failover. Rubix also reduces the need for manual interventions from infrastructure teams, since outdated instances are automatically replaced, and the logic that drives this replacement contains encoded learnings from global performance.

From a security perspective, aggressive node cycling ensures that compromising a single node is insufficient for an attacker to gain persistent access to an environment. Operationally, this ephemerality works in tandem with a multidimensional node draining and termination pipeline that uses policy-driven node selection to avoid destabilizing the environment.
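To make the idea of enforced node lifetimes with policy-driven selection more concrete, the following sketch picks which nodes to drain next. Only the 48-hour limit comes from the text above; the `Node` type, the `safety_margin` window, the `max_concurrent` budget, and the oldest-first policy are illustrative assumptions and do not describe Rubix's actual pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

MAX_NODE_AGE = timedelta(hours=48)  # lifetime limit stated in the text


@dataclass(frozen=True)
class Node:
    name: str
    created_at: datetime


def select_nodes_to_drain(
    nodes: List[Node],
    now: datetime,
    max_concurrent: int = 1,                        # hypothetical disruption budget
    safety_margin: timedelta = timedelta(hours=4),  # hypothetical early-drain window
) -> List[Node]:
    """Choose nodes to drain, oldest first.

    Nodes at or past the hard limit are always selected. Nodes inside the
    safety margin are drained early only while the budget allows, so that
    replacements are spread out rather than happening all at once.
    """
    by_age = sorted(nodes, key=lambda n: n.created_at)  # oldest first
    overdue = [n for n in by_age if now - n.created_at >= MAX_NODE_AGE]
    approaching = [
        n for n in by_age
        if MAX_NODE_AGE - safety_margin <= now - n.created_at < MAX_NODE_AGE
    ]
    budget = max(0, max_concurrent - len(overdue))
    return overdue + approaching[:budget]
```

The design choice worth noting is the early-drain window: starting replacement before the hard limit leaves room to drain gracefully, while the budget keeps too many nodes from leaving the pool at the same time.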

Rubix as a substrate for Palantir services

Rubix drives efficiency for infrastructure, platform, and customer teams alike. By providing a secure and consistent substrate for Palantir’s core services, it enables infrastructure teams to deploy AIP, Foundry, Apollo, and dependent offerings across AWS, Azure, Google Cloud, Oracle Cloud, or on-premises environments, all with identical operational characteristics.

For Palantir teams shipping new features and services into the managed infrastructure, Rubix provides a reliable and uniform substrate that abstracts away the peculiarities of different environments and providers.

For customer developers looking to securely host custom applications, containerized models, and other Kubernetes-compliant workloads (e.g., through Compute Modules), all of Rubix’s core benefits can be transparently leveraged. This includes intelligent workload distribution, a range of sophisticated demand-sensing algorithms, and other features that drive continuous cost optimization.

Apollo and "Day 2" operations

Rubix works in concert with Apollo to provide a powerful mission control for “Day 2” infrastructure operations.

Among Apollo’s responsibilities is the computation and transmission of plans for installations, upgrades, and rollbacks for each of the hundreds of services in a given Palantir environment.

To ensure zero-downtime upgrades, Apollo requires that every software service is deployed in a multi-node configuration that is designed for "blue/green" rollout strategies. In contrast to all-at-once rollout strategies, the blue/green paradigm first builds a parallel green environment and monitors its performance alongside the existing blue environment. If the green environment operates successfully for a specified period of time, traffic is gradually redirected away from blue nodes to green nodes (and the blue nodes are simply destroyed, per enforced cycling).
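The blue/green sequence described above can be sketched as a small simulation: green is observed while serving no traffic, then traffic shifts to it in steps, and blue is retired at the end. This is a hedged illustration rather than Apollo's implementation. The function name, the `soak_checks` and `step_percent` parameters, and the choice to send traffic back to blue if green becomes unhealthy mid-shift are all assumptions.

```python
from typing import Callable, List, Tuple

Event = Tuple[str, int]  # (phase, percent of traffic served by green)


def blue_green_plan(
    green_healthy: Callable[[int], bool],  # hypothetical health probe, called once per check
    soak_checks: int = 3,                  # hypothetical observation period, in checks
    step_percent: int = 25,                # hypothetical traffic increment per step
) -> List[Event]:
    """Simulate a blue/green rollout and return the sequence of events."""
    events: List[Event] = []

    # Soak phase: green runs alongside blue but receives no traffic yet.
    for check in range(soak_checks):
        if not green_healthy(check):
            events.append(("abort", 0))
            return events
        events.append(("soak", 0))

    # Shift phase: move traffic to green gradually, re-checking before each step.
    percent = 0
    check = soak_checks
    while percent < 100:
        if not green_healthy(check):
            events.append(("rollback", 0))  # assumption: all traffic returns to blue
            return events
        percent = min(100, percent + step_percent)
        events.append(("shift", percent))
        check += 1

    events.append(("retire_blue", 100))
    return events
```

With a probe that always reports healthy, the plan soaks three times, shifts traffic in four 25-percent steps, and then retires blue; a probe that fails during the soak phase ends the rollout before any traffic moves.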

This zero-downtime architecture would not be possible without Rubix’s opinionated API layer, which Apollo leverages to translate complex deployment intentions into resource-level instructions for service creation, status monitoring, product revision management, configuration management, and more.

Build once, run anywhere

Rubix provides the foundation for Palantir’s “write once, ship anywhere” development philosophy. Rubix takes the core benefits of Kubernetes and supercharges them with the security, high availability, and deployability features needed to rapidly release software into the most critical environments in the world.

In addition to enabling Palantir’s developer teams, Rubix now empowers software teams looking to deploy their own end-to-end solutions into regulated environments (enabling them to achieve FedRAMP compliance through Palantir FedStart).

Rubix also enables entire government agencies to securely expedite vendor onboarding and management through the Mission Manager offering. As Palantir’s customers and partners continue to pursue their most critical missions and challenge the outdated orthodoxies that have long governed software infrastructure, Rubix will continue to evolve to support their most pressing needs.

