Environment creation overview¶
:::callout{theme="neutral"} This is an advanced guide that goes into detail about the environment initialization process. It is intended for users who are interested in the technical considerations that can affect initialization performance. For general guidance regarding common environment-related issues, see the Environment troubleshooting guide. :::
Conda is an open-source, language-agnostic package and environment manager. Mamba is an open-source re-implementation of the Conda package manager. Hawk is a Palantir-developed re-implementation of the Conda package manager. Code Repositories uses Hawk to resolve package dependencies when running Checks after a Commit. Hawk offers several advantages over Mamba and Conda in package resolution, most notably faster solves and active maintenance and development.
This page introduces the most important concepts and outlines the environment creation process; for more information, consult the official Conda documentation ↗ and Mamba documentation ↗.
Important terms¶
Package¶
A package is a collection of files that commonly contains metadata, libraries, and/or binaries. Code Repositories provides access to a broad selection of packages (numpy, for example) to supplement the core language features.
A package is versioned, and nearly always has a set of dependencies—other packages that must also be installed for it to function properly. A dependency might be a specific version of a package, a range of acceptable versions, or any version at all.
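The three kinds of dependency described above (an exact version, a range, or any version) can be illustrated with a minimal sketch. The `parse_version` and `satisfies` helpers and the constraint syntax they accept are simplified assumptions made for this example; Conda's real version matching supports many more forms (build strings, wildcards, pre-release ordering) and is not reproduced here.

```python
def parse_version(text):
    """Turn '1.26.0' into (1, 26, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, constraint):
    """Check a version against a simplified constraint string.

    Supported forms (an illustrative subset only):
      ""             any version
      "==1.26.0"     exactly one version
      ">=1.20,<2.0"  a range, written as comma-separated clauses
    """
    candidate = parse_version(version)
    for clause in filter(None, constraint.split(",")):
        # Two-character operators are listed first so ">=" is not read as ">".
        for op in ("==", ">=", "<=", ">", "<"):
            if clause.startswith(op):
                bound = parse_version(clause[len(op):])
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
        ok = {
            "==": candidate == bound,
            ">=": candidate >= bound,
            "<=": candidate <= bound,
            ">": candidate > bound,
            "<": candidate < bound,
        }[op]
        if not ok:
            return False
    return True


print(satisfies("1.26.0", ""))             # any version is acceptable
print(satisfies("1.26.0", "==1.26.0"))     # exact pin
print(satisfies("2.1.0", ">=1.20,<2.0"))   # outside the accepted range
```

The looser the constraint, the more candidate versions the package manager has to consider for that dependency.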
Channel¶
A channel, sometimes called a repository, is any location where packages are stored. One channel might be a directory in the local file system, and another might be a directory hosted on a web server.
Regardless of type, each channel is a directory tree that separates packages by platform architecture. Each platform subdirectory contains a file called repodata.json, which is an index of all packages in that subdirectory.
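As a rough illustration, a channel might contain platform subdirectories such as linux-64 and noarch, each with its own repodata.json. The sketch below parses a heavily trimmed, hand-written stand-in for one such index and groups the entries by package name. The file names, build strings, and the `index_by_name` helper are invented for this example; real index files carry more fields and far more entries.

```python
import json

# Hand-written stand-in for <channel>/linux-64/repodata.json (illustrative only).
REPODATA_TEXT = """
{
  "info": {"subdir": "linux-64"},
  "packages": {
    "numpy-1.26.0-py311_0.tar.bz2": {
      "name": "numpy", "version": "1.26.0", "build": "py311_0",
      "depends": ["python >=3.11,<3.12"]
    },
    "numpy-1.25.2-py311_0.tar.bz2": {
      "name": "numpy", "version": "1.25.2", "build": "py311_0",
      "depends": ["python >=3.11,<3.12"]
    },
    "python-3.11.5-h0_0.tar.bz2": {
      "name": "python", "version": "3.11.5", "build": "h0_0",
      "depends": []
    }
  }
}
"""


def index_by_name(repodata_text):
    """Group index entries by package name: name -> [(version, filename), ...]."""
    repodata = json.loads(repodata_text)
    by_name = {}
    for filename, record in repodata["packages"].items():
        by_name.setdefault(record["name"], []).append((record["version"], filename))
    return by_name


index = index_by_name(REPODATA_TEXT)
for name, builds in sorted(index.items()):
    print(name, [version for version, _ in builds])
```

Each entry lists its own dependencies, which is what allows a package manager to walk the dependency graph without downloading the packages themselves.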
Conda searches a set of pre-configured channels whenever it needs to fetch a package. For more information about channel management in Foundry, see the packages documentation.
Environment¶
A Conda environment is a directory that contains a specific collection of packages. An environment is created for each repository by passing the packages specified in the Environment Configuration panel to Conda. Conda constructs a set of packages that satisfies the configuration and all dependencies, and installs these packages onto the Spark module that backs the repository.
Performance¶
The following explanation of performance draws on this Anaconda blog post ↗, which discusses Conda performance at length; the same considerations apply to Mamba and Hawk. The next two sections summarize this material and outline the performance factors that are most relevant to Code Repositories.
Creating an environment¶
Environment creation comprises two major steps: the solve step and the install step.
Solve step¶
In the solve step, the designated package manager (Conda, Mamba, or Hawk) attempts to find packages and versions that satisfy all transitive dependencies. Transitive dependencies comprise the dependencies of the packages specified in the Environment Configuration panel, those dependencies' dependencies, and so on. This step contains four stages:
- Download and process package indices. The package manager will download relevant repodata.json files from each configured channel, and will convert the index entries into objects in memory.
- Reduce the index. The package manager builds up a set of all packages that could possibly be used for the environment. To do this, the algorithm begins with the provided package specification and recurses through all dependencies. All unneeded packages (primarily those that are not in the dependency graph) are pruned away.
- Express dependency constraints as a Boolean satisfiability (SAT) problem. Package managers prefer certain types of solutions, such as the newest possible versions of packages, and these biases are baked into clause construction.
- Run the SAT solver.
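The reduce and solve stages above can be sketched on a toy index. Everything here is an assumption made for illustration: the `INDEX` data, the `reduce_index` and `solve` helpers, and the integer version numbers are invented, and a brute-force search that tries newer versions first stands in for the SAT encoding and solver. It is not how Conda, Mamba, or Hawk are implemented.

```python
from itertools import product

# Hypothetical toy index: name -> {version: {dependency name: allowed versions}}.
INDEX = {
    "app": {1: {"lib": {1, 2}}},
    "lib": {1: {"core": {1}}, 2: {"core": {3}}},  # lib 2 needs a core that does not exist
    "core": {1: {}, 2: {}},
    "unused": {1: {}, 2: {}},
}


def reduce_index(index, requested):
    """Keep only packages reachable from the requested names."""
    reachable, stack = set(), list(requested)
    while stack:
        name = stack.pop()
        if name in reachable:
            continue
        reachable.add(name)
        for deps in index[name].values():
            stack.extend(deps)  # follow the dependencies of every version
    return {name: index[name] for name in reachable}


def solve(index, requested):
    """Brute-force stand-in for the SAT solve, trying newer versions first."""
    reduced = reduce_index(index, requested)
    names = sorted(reduced)
    # Newest-first ordering mimics the preference for recent versions.
    choices = [sorted(reduced[name], reverse=True) for name in names]
    for combo in product(*choices):
        picked = dict(zip(names, combo))
        if all(
            picked[dep] in allowed
            for name, version in picked.items()
            for dep, allowed in reduced[name][version].items()
        ):
            return picked
    return None  # no combination satisfies every constraint


print(sorted(reduce_index(INDEX, ["app"])))  # "unused" is pruned away
print(solve(INDEX, ["app"]))
```

In this toy case the newest `lib` cannot be satisfied, so the search falls back to older versions. The number of combinations tried grows multiplicatively with the number of packages and versions that survive reduction, which is why aggressive index reduction matters.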
Install step¶
If the solve succeeds, the install step is next. Here, each artifact is retrieved from the proper channel and the package manager uses these artifacts to construct the environment. This step contains three stages:
- Download and extract all packages in the solved environment.
- Verify package contents. Depending on configuration, Conda will either use a checksum or will verify that the file size is correct.
- Link packages into the environment.
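The verification stage can be pictured with a minimal sketch that checks a downloaded artifact against the metadata recorded for it. The `verify_artifact` helper and its `use_checksum` flag are hypothetical names for this example, and the sketch assumes the index record carries `sha256` and `size` fields; it does not reproduce Conda's actual verification code.

```python
import hashlib


def verify_artifact(data, record, use_checksum=True):
    """Return True if the downloaded bytes match the index record.

    use_checksum=True  compares a SHA-256 digest (slower, stricter)
    use_checksum=False compares only the byte count (faster, weaker)
    """
    if use_checksum:
        return hashlib.sha256(data).hexdigest() == record["sha256"]
    return len(data) == record["size"]


payload = b"example package contents"
record = {"sha256": hashlib.sha256(payload).hexdigest(), "size": len(payload)}

corrupted = b"exbmple package contents"  # same length, different bytes
print(verify_artifact(payload, record))                        # intact artifact
print(verify_artifact(corrupted, record))                      # caught by checksum
print(verify_artifact(corrupted, record, use_checksum=False))  # missed by size check
```

The trade-off is that a checksum reads every byte of every package, while a size check is nearly free but cannot detect corruption that preserves length.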
Limitations¶
All of the steps outlined above can be susceptible to slowness in certain situations. Causes of slowness usually fall into one of three categories:
Upstream changes¶
A significant portion of slowness is caused by factors external to Foundry.
- Downloading and processing channel indices scales with the total size of the index files; the more channels that must be considered and the larger those channels are, the longer these steps will take.
- Index reduction also scales with the number of transitive dependencies; the number of transitive dependencies is determined by which dependencies packages choose to declare.
Because these factors are external and opaque, it can be difficult to perform root cause analysis on performance regressions. An environment may suddenly take longer to load because a channel recently increased in size, or because a package declared several new dependencies in its newest release.
Environment specification¶
More commonly, slow initialization is directly tied to the environment specification itself. The solve step scales superlinearly with environment size, so as a general rule of thumb, environments with more packages will take disproportionately longer to initialize.
There are two ways to remediate these situations.
- First, remove unneeded packages from the environment definition. It is much more performant to have small, specialized environments than to have large, general-purpose ones.
- Second, try adding version constraints to some of the packages in the Environment Configuration panel. It is most effective to pin versions for packages with many extant builds like python and for packages with complex dependency graphs like scipy. Doing so will allow Conda to more aggressively reduce the indices, meaning that the SAT solve will not need to account for as many package versions.
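To get a feel for why pinning helps, the sketch below counts how many version combinations remain after a pin. The version lists and the `candidate_combinations` helper are made up for illustration, and the product is only a crude upper bound on the search space; real solvers prune far more cleverly than exhaustive enumeration.

```python
from math import prod


def candidate_combinations(versions_per_package):
    """Upper bound on the version combinations a naive solver could face."""
    return prod(len(versions) for versions in versions_per_package.values())


# Hypothetical candidate versions left after index reduction.
unpinned = {
    "python": ["3.9", "3.10", "3.11", "3.12"],
    "scipy": ["1.10", "1.11", "1.12"],
    "numpy": ["1.24", "1.25", "1.26"],
}
# Pinning python leaves a single candidate for that package.
pinned = dict(unpinned, python=["3.11"])

print(candidate_combinations(unpinned))  # 4 * 3 * 3 = 36
print(candidate_combinations(pinned))    # 1 * 3 * 3 = 9
```

In practice the effect is usually larger than this arithmetic suggests, because pinning one package can also rule out builds of other packages that depend on the excluded versions.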
Package size¶
Package size is typically less problematic than the other factors in this section, but it is still relevant in some cases.
Downloading and extracting even a single package may not be trivial. For example, the pytorch package is about 460 MB in size, and can take 35+ seconds to extract.
Download, extraction, and verification all scale linearly with the size and number of packages in the environment. Due to transitive dependencies, the solved environment typically contains many more packages than were explicitly specified in the environment definition, and the increase in packages may cause slowness.
Remediation in this case is similar to the suggestions for environment specification: try to keep environments as small as possible.