Development best practices¶
This guide is for pipeline developers who are writing transformations. The general best practices may also interest project managers and platform administrators, since they focus on higher-order principles of clean pipelines.
General best practices¶
Pipeline development is software development
Many of the best practices from general software development apply equally to defining the transformations that make up your data pipeline. Below are a few common practices that exemplify this approach:
- Reduce cognitive load: Wherever possible, reduce the amount of thinking necessary. If you need to use complex functions or API calls that are non-obvious, make sure you have clear documentation.
- Be nice to future you: Pipelines are the foundation of your data enterprise. Think long-term and keep the big picture in mind while planning and implementing.
- Don't Repeat Yourself (DRY): Duplicated code or concepts require more maintenance and lead to subtle errors. Instead, refactor frequently at all levels, including building and publishing your own libraries or packages for cross-repository use, to ensure minimum duplication.
- Avoid tech debt: A corollary to long-term thinking. Tech debt is easy to accrue in the face of deadlines or project-specific demands, but it always comes back to bite.
- Conventions matter: Set a precedent and stick with it; this reduces cognitive load and helps legibility between Code Repositories. For instance, column and dataset names are conventionally written in `snake_case`; following this convention means anyone else who wants to reference your awesome dataset knows that it's `awesome_dataset`, not `AwesomeDataset` or `Awesome_Dataset`.
- Less is more: Systems with many smaller units are easier to maintain and reason about than systems with few large units; therefore bias towards:
  - Tightly-scoped Foundry Projects
  - Smaller transforms chained together
  - Short functions and helper functions in transforms
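As a rough illustration of "smaller transforms chained together", the sketch below splits a cleaning step into two short helper functions and composes them. The row shape, column names, and function names are hypothetical, and plain Python lists stand in for datasets; this is not a Foundry API.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def drop_missing_ids(rows: Iterable[Row]) -> List[Row]:
    """Keep only rows that carry a non-empty flight_id (hypothetical column)."""
    return [row for row in rows if row.get("flight_id")]


def normalize_carrier(rows: Iterable[Row]) -> List[Row]:
    """Trim and upper-case the carrier code without mutating the input rows."""
    return [{**row, "carrier": str(row["carrier"]).strip().upper()} for row in rows]


def clean_flights(rows: Iterable[Row]) -> List[Row]:
    """Chain the small steps; each one can be tested and reused on its own."""
    return normalize_carrier(drop_missing_ids(rows))


raw_flights = [
    {"flight_id": "F1", "carrier": " aa "},
    {"flight_id": "", "carrier": "ba"},
]
cleaned_flights = clean_flights(raw_flights)
```

Each helper does one thing, so a change to the carrier rule touches a single short function rather than a large transform.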
Anti-patterns¶
- Overwrite vs. Delete: Unlike files in a conventional file system, Foundry resources and datasets are connected to other artifacts rather than existing as standalone files. Hence, if you are iterating on a pipeline or data, delete only if you fundamentally want that type of data gone. For example, if you've written a transaction with incorrect data to a dataset, don't delete the dataset; instead, write a new `SNAPSHOT` transaction to overwrite the previous one. More challenges are likely to arise from trying to delete a dataset than from creating a new dataset in the same location.
- Don't introduce circular dependencies: If you plan to use the dataset output of your transform as an input to other transforms, make sure you are not introducing any circular dependencies in your code. Foundry's build orchestration layer will attempt to prevent you from configuring any loops on the branch you are currently on, and your Code Repository branch will fail checks if a loop is detected.
:::callout Foundry checks for circular dependencies on the branch under development, but while you are writing code it does not run the check across all branches; for example, it can miss a loop that involves ontology writeback datasets that exist only on the master branch. Foundry will still fail checks if circular dependencies are detected on other branches when you attempt to merge the feature branch with the branch that contains the circular dependencies. :::
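To make the idea of a circular dependency concrete, here is a minimal sketch of loop detection over a dataset dependency map using a depth-first search. It is an illustration only and does not represent how Foundry's own checks work; the dataset names and the `find_cycle` helper are hypothetical.

```python
from typing import Dict, List, Optional


def find_cycle(dependencies: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency loop as a list of dataset names, or None.

    `dependencies` maps each output dataset to the datasets it reads from.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GRAY
        path.append(node)
        for upstream in dependencies.get(node, []):
            if state.get(upstream, WHITE) == GRAY:
                # upstream is already on the current path, so this is a loop
                return path[path.index(upstream):] + [upstream]
            if state.get(upstream, WHITE) == WHITE:
                cycle = visit(upstream)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for dataset in dependencies:
        if state.get(dataset, WHITE) == WHITE:
            cycle = visit(dataset)
            if cycle:
                return cycle
    return None


acyclic = {"clean": ["raw"], "report": ["clean"]}
looped = {"clean": ["raw", "report"], "report": ["clean"]}
```

Here `find_cycle(acyclic)` finds nothing, while `find_cycle(looped)` reports the loop between `clean` and `report`.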
Project folder structure¶
See Recommended Project structure for a description of an overarching model to organize the entire flow of data through multiple Projects.
Best practices¶
- Manage permissions at the Project level: If you anticipate needing to further partition permissions within a project, consider if the project should be further decomposed into smaller pieces. Learn more about the concepts and practices of applying and managing permissions.
- Keep Projects tightly scoped: Avoid “feature creep” and adding tangentially-related resources or use-case specific logic within your Project.
- Use meaningful folder names: Remember that you're designing a Project for consumption outside your team as well as for daily iteration and development. Consider that at minimum you should have a `/Documentation` folder and a `/Output` folder. Different use-case or workflow Projects will have needs for more specific names, but always consider that your Project structure and naming scheme are signposts for visitors.
Anti-patterns¶
- Working in your home folder: Your personal folder is not a place for development; the access controls are strict and sharing is difficult. Consider creating a `/scratch` folder in your Project for experiments or throw-away work rather than building in your home folder.
Naming¶
Best practices¶
- Follow convention: Name columns, datasets, repositories, files, and other resources following the common patterns for your organization or the convention for your Foundry instance.
- Choose descriptive names: Take the time to come up with two- or three-word names that orient a reader to the purpose or content of the resource being named. Consider putting the distinctive portion of the name first; this helps with long names whose full text is cut off in a dropdown, and it helps readers immediately distinguish references when scanning a code base or list of files. Avoid using abbreviations.
Anti-patterns¶
- Cryptic names: Choosing names that simply increment a number (e.g. `dataset1`, `dataset2`, and so on) or names that are a single letter will make it more difficult to read and refactor your code and much more difficult for a new developer to approach your work.
- Names distinguished only by path: In some cases this is acceptable, for instance when you have `/raw/my_important_dataset` and then `/clean/my_important_dataset`. In many cases, however, this naming pattern can create confusion in views where only the dataset name itself is prominently displayed. Remember, provenance is tracked and easily visible, so you don't need to embed this kind of "state" into the names of your datasets.
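The naming guidance above could be enforced with a small lint helper along these lines. This is a heuristic sketch under stated assumptions: the regular expressions and the `naming_problems` function are hypothetical, and the numbered-suffix rule will also flag some legitimate names, so treat its output as a prompt for review rather than a verdict.

```python
import re

# Lower-case words separated by single underscores, e.g. awesome_dataset
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# A single word followed only by digits, e.g. dataset1 or table_2
NUMBERED_SUFFIX = re.compile(r"^[a-z]+_?\d+$")


def naming_problems(name: str) -> list:
    """Return a list of human-readable problems found in a resource name."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("not snake_case")
    if len(name) == 1:
        problems.append("single-letter name")
    if NUMBERED_SUFFIX.match(name):
        problems.append("name only increments a number")
    return problems
```

A check like this fits naturally into a unit test or a pre-merge script, so naming drift is caught before a dataset is referenced elsewhere.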
Data types¶
Best practices¶
Explicitly cast column types: If you are working in a Datasource Project, explicitly cast the column types in the raw → clean transform, even if the schema inference from the data connection has chosen correct values. This will help catch breaking changes from the source system if a column type changes or an invalid value creates an incorrect inference during the sync.
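The sketch below illustrates the principle of explicit casting in plain Python: every column has a declared target type, and a value that no longer fits fails loudly instead of being silently re-inferred. The schema, column names, and `cast_row` helper are hypothetical; in a PySpark transform the same idea is typically expressed with `Column.cast` on each column.

```python
from datetime import date
from typing import Callable, Dict

# Hypothetical target schema for a clean dataset: column name -> caster.
CLEAN_SCHEMA: Dict[str, Callable[[str], object]] = {
    "flight_id": str,
    "passenger_count": int,
    "departure_date": date.fromisoformat,
}


def cast_row(raw_row: Dict[str, str]) -> Dict[str, object]:
    """Cast every column explicitly; fail loudly on missing or invalid values."""
    clean_row = {}
    for column, caster in CLEAN_SCHEMA.items():
        try:
            clean_row[column] = caster(raw_row[column])
        except (KeyError, ValueError) as error:
            raise ValueError(
                f"column {column!r} failed explicit cast: {error!r}"
            ) from error
    return clean_row
```

Because the schema is written down once, a source-side change such as a count arriving as `"12a"` surfaces as an error in the raw → clean step rather than further downstream.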
Patterns¶
Use Timestamps for 'Time Only' data types: Spark doesn't have a time-only data type for fields with values like “10:59:00”. To leverage the time functions that come with Spark's timestamp type, convert the value to seconds since midnight, add -2208988800 (the Unix timestamp of 1900-01-01 00:00:00 UTC), and cast the result to a timestamp; this places every value on the same fixed reference date, 1900-01-01. Alternatively, leave it as a string and let the users parse it as they need to.
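The arithmetic behind this pattern can be checked with a short standard-library sketch. It assumes values in `HH:MM:SS` form and UTC; the helper name is hypothetical. Note that adding -2208988800 to seconds-since-midnight lands on 1900-01-01, because that offset is the number of seconds between 1900-01-01 and the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

# Seconds from the Unix epoch back to 1900-01-01T00:00:00Z (negative offset).
OFFSET_TO_1900 = -2208988800
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_only_to_timestamp(value: str) -> datetime:
    """Turn 'HH:MM:SS' into a UTC timestamp on the reference date 1900-01-01."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    seconds_since_midnight = hours * 3600 + minutes * 60 + seconds
    return UNIX_EPOCH + timedelta(seconds=seconds_since_midnight + OFFSET_TO_1900)
```

Keeping every time-only value on one reference date means comparisons and differences between them behave like comparisons of times of day.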
Anti-patterns¶
- Casting all numeric fields to numeric datatypes: Consider a column of aircraft IDs, like `545`, `972`, `314`. It can be tempting to cast these to an integer column (after all, they look like integers and may even be integers in the source system). However, this has significant drawbacks:
  - If leading zeros are meaningful, e.g. 123 and 0123 are both legal IDs that should be differentiated, this will result in errors.
  - The IDs may be formatted in an undesirable way in the UI (e.g. `545.0`, `1,234`, right justified).
  - If the IDs are too long, you can run into MAX_INT issues.
  - Numeric functions won't ever be applied to these values (e.g. you won't ever add two aircraft IDs together).

  The rule of thumb should be to only cast numeric fields to numeric datatypes if it would be appropriate to perform arithmetic on them; otherwise cast them as strings.
- Storing timestamps in different timezones: Spark timestamps are timezone agnostic. They are stored internally in UTC (aka Zulu, GMT); displaying a timezone is expected to be done by the front end.
  - If you need a particular timezone for display purposes, use PySpark's `from_unixtime()` function ↗ to store a string in the appropriate timezone.
  - If timestamps in a given datasource are in a timezone other than UTC, use PySpark's `to_utc_timestamp()` function ↗ to normalize them to UTC.
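The timezone guidance above amounts to "store UTC, render local time only as a string". The standard-library sketch below shows that round trip outside Spark. It assumes, purely for illustration, a source system that records local time at a fixed UTC-05:00 offset; a real source would usually need a named timezone with daylight-saving rules, and both helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the source system records local time at a fixed UTC-05:00 offset.
SOURCE_OFFSET = timezone(timedelta(hours=-5))


def normalize_to_utc(local_value: str) -> datetime:
    """Read a naive 'YYYY-MM-DD HH:MM:SS' string as source-local time, return UTC."""
    naive = datetime.strptime(local_value, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=SOURCE_OFFSET).astimezone(timezone.utc)


def display_in_offset(utc_value: datetime, offset_hours: int) -> str:
    """Render a UTC timestamp as a display string in another fixed offset."""
    display_zone = timezone(timedelta(hours=offset_hours))
    return utc_value.astimezone(display_zone).strftime("%Y-%m-%d %H:%M:%S")
```

The stored value stays in UTC throughout; only the display string carries a local offset.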
Code comments¶
Anti-patterns¶
- Commenting out code in commits: When making changes, it can be tempting to comment out old code and leave it for reference or in case you need to revert later. You can use comments while iterating, but do not commit code with statements commented out. Doing so builds cruft and reduces legibility. Old code is easy to find in previous commits.
- Comments with authorship details and dates: These are automatically tracked in the commits/git repository. Manually commenting them is prone to not being updated and creates cruft.
- Over-verbose Commenting: Comments should share the rationale behind decisions rather than explain the logic itself. Strive to write “self-documenting” code; if a set of statements is difficult to understand, that is a clear sign to refactor and simplify.
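As a small illustration of a comment that records rationale rather than restating the logic, consider the sketch below. The function and the billing rule it mentions are hypothetical, invented only to show the style.

```python
def billable_minutes(duration_seconds: int) -> int:
    # Rationale (hypothetical business rule): usage is billed per started
    # minute, so any partial minute rounds up instead of to the nearest minute.
    return -(-duration_seconds // 60)  # ceiling division without floats
```

The descriptive function name carries the "what"; the comment is reserved for the "why" that a reader could not recover from the code alone.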
Repository practices¶
Best practices¶
- Protect the `master` branch: If you're developing with a team, or even just working on a long-lived individual project, protect the master branch and practice GitFlow ↗ or your preferred development workflow. The key concepts are simply to ensure that code moving to master is reviewed and tested.
- Write commit messages: Commit messages are the log of all activity in the repository. Take the time to write a useful description of your change:
  - Code reviewers can look at your commits to get a sense of your workflow and what changes were made.
  - If you want to revert a change, it's much easier if you can find the commit you were looking for at a glance.

  Note that clicking the build button will autogenerate a commit message with a timestamp; avoid using this. Click commit first, write a commit message, and then click build.
- Prune your branches: In long-lived repositories, branches can accumulate. If development of a branch is abandoned, especially if a branch is merged into another, keep things tidy by deleting the branch. This helps with legibility of which branches are actively developed.
- Upgrade your Repository: When prompted, follow the steps to upgrade the language bundles in your repository. This process will open a pull request to the active branch containing the upgrades. You should feel free to run a build of your pipeline on the upgrade branch to ensure that none of the version bumps impact your code. However, staying up-to-date with these upgrades often ensures you do not encounter edge cases seen elsewhere, which are patched in the upgraded versions.
- Practice Code Reviews: As you collaborate with teammates to develop transformations, implement some practice of code reviews during the pull request process. We've shared our thoughts on Code Review Best Practices ↗ and many of the concepts will apply equally to reviewing data transformation code.
Patterns¶
Share code between repositories: Repositories in Foundry operate on a Project level for a variety of reasons, but often there is logic that could be reused across pipelines, which has several advantages:
- General code reuse in accordance with DRY.
- Avoiding forked/inconsistent logic across different areas of the data foundation.
- There may be pipelines which would ideally use foundational pipelines but which have much stricter SLAs or performance requirements. In this case, the solution is often to share the logic but not the transforms/datasets so that the critical pipeline can rely on pre-filtered datasets, fewer transforms, have different build schedules, and so on.
- Code Repositories are an excellent way to accomplish this. For example, when working with Python transforms, libraries can publish themselves to a Conda channel and allow other repositories to consume them. See the documentation on Sharing Python Libraries.
- An additional advantage of shared repositories is semantic versioning. The shared repository can tag its commits with versions (e.g. 1.0.0, 1.0.1, 2.0.0) and the consuming libraries can choose how they pick up new versions. For example, a repository might choose to take the latest version (2.0.0 above) or to only take a specific version (say, 1.0.1) and defer picking up new versions until they manually decide to. The latter case is particularly valuable when the pipeline is critical and the pipeline owners would like a chance to opt into/approve changes to the shared repository.
- Releases: Along the same lines, if a team wants to have an explicit release schedule for pipelines, one option (which avoids staging instances or long-lived develop branches) is to factor the logic out into functions in a shared repository and use the semantic version to keep the consuming repository at major releases, such as 1.0.0, 2.0.0, or even .0.0. That way, the developers can continue to iterate on the logic and tag intermediate releases without them going live on master. Moreover, on a branch on the consuming repository, the developers can always pick up intermediate versions as long as they do not merge them to master before the release date.
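The version-selection policies described above can be sketched in a few lines. This is an illustration of the policies only, not of how Conda or Foundry resolves versions; it assumes plain `MAJOR.MINOR.PATCH` tags, and the function names and the `published` list are hypothetical.

```python
from typing import List, Optional, Tuple

Version = Tuple[int, int, int]


def parse_version(tag: str) -> Version:
    """Parse a 'MAJOR.MINOR.PATCH' tag into a tuple that compares numerically."""
    major, minor, patch = (int(part) for part in tag.split("."))
    return (major, minor, patch)


def latest(tags: List[str]) -> str:
    """Policy 1: always follow the newest tagged version."""
    return max(tags, key=parse_version)


def latest_major_release(tags: List[str]) -> Optional[str]:
    """Policy 2: follow only major releases, ignoring intermediate tags."""
    majors = [tag for tag in tags if parse_version(tag)[1:] == (0, 0)]
    return max(majors, key=parse_version) if majors else None


published = ["1.0.0", "1.0.1", "1.10.0", "2.0.0", "2.1.0"]
```

Comparing parsed tuples rather than raw strings matters: as strings, `"1.10.0"` would sort before `"1.9.0"`. Pinning a specific version (the third policy in the text) needs no code beyond naming that version explicitly.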
Unit testing¶
Unit testing is a popular way of improving and maintaining code quality. In unit testing ↗, small and discrete components ("units") of software are tested in an individual, independent, and automated fashion.
- Python unit tests: You can enable `pytest` unit tests as part of the CI checks for your Python transforms repository by following the Python unit test instructions.
- Java unit tests: The steps for configuring unit tests for Java transforms can be found in the Java unit tests documentation.
Best practices¶
Unit tests should:
- Only test one piece of logic at a time;
- Be lightweight (see the language-specific instructions for tips on how to reduce memory usage);
- Not rely on extra inputs to function; and
- Not rely on or call each other.
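A minimal sketch of tests that follow these rules is shown below. The helper under test, `normalize_tail_number`, is hypothetical; the tests use plain `assert` statements and `test_`-prefixed function names, which is the convention `pytest` uses to discover tests.

```python
def normalize_tail_number(raw: str) -> str:
    """Trim whitespace and upper-case an aircraft tail number (hypothetical helper)."""
    return raw.strip().upper()


def test_strips_surrounding_whitespace():
    # One piece of logic per test: whitespace handling only.
    assert normalize_tail_number("  n123ab ") == "N123AB"


def test_keeps_leading_zeros():
    # Independent of the test above; builds its own tiny input inline.
    assert normalize_tail_number("0123") == "0123"
```

Each test creates its own input, checks a single behavior, and does not call or depend on the other.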
Health checks¶
Best practices¶
Check your data health: Often, once a portion of a pipeline is completed, it is easy to set the schedule and put it out of mind. However, even if your logic is sound, the incoming data can change in ways that affect your build, leading to slower performance, increased data scale, or outright build failure. Configuring basic checks on dataset size and build time, even if you do not configure alerts, will provide a view over time of these key metrics so you can observe, for instance, the rate of increase in dataset size or the average build time for the dataset. Read more about the specific health checks available and how to configure them.
Patterns¶
Extend health checks: In most cases, the default health check configurations should be sufficient. If you need further flexibility, however, consider adding one or more derived health check datasets to your pipeline. The transform for this dataset can perform arbitrary logic to determine the validity of its input dataset (the dataset you are validating), and then output data formatted so that a simple health check, like Allowed Value, can report if the dataset is valid.
Set a schedule for this dataset to build whenever there is an update on the input dataset and you will have an extra set of comprehensive health checks.
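The logic of such a derived health check dataset might look like the sketch below, written in plain Python for clarity. The validation rules, column names, and the `VALID`/`INVALID` status values are all hypothetical; the point is only that arbitrary checks collapse into a small, fixed vocabulary that a simple check such as Allowed Value can monitor.

```python
from typing import Dict, Iterable, List


def summarize_validity(rows: Iterable[Dict[str, object]]) -> List[Dict[str, str]]:
    """Collapse arbitrary validation logic into one status row per rule."""
    rows = list(rows)
    negative_counts = sum(1 for row in rows if row["passenger_count"] < 0)
    missing_ids = sum(1 for row in rows if not row["flight_id"])
    return [
        {
            "rule": "no_negative_passenger_counts",
            "status": "VALID" if negative_counts == 0 else "INVALID",
        },
        {
            "rule": "no_missing_flight_ids",
            "status": "VALID" if missing_ids == 0 else "INVALID",
        },
    ]
```

With output shaped like this, the health check itself only needs to assert that the `status` column contains nothing other than `VALID`.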
Scheduling¶
Best practices¶
Review the scheduling best practices.