HDFS¶
Connect Foundry to the Hadoop Distributed File System (HDFS) to read and sync data from HDFS to Foundry datasets.
Supported capabilities¶
| Capability | Status |
|---|---|
| Exploration | 🟢 Generally available |
| Bulk import | 🟢 Generally available |
Data model¶
The connector can transfer files of any type into Foundry datasets. File formats are preserved, and no schemas are applied during or after the transfer. Apply any necessary schema to the output dataset, or write a downstream transformation to access the data.
Performance and limitations¶
There is no limit to the size of transferable files. However, network issues can result in failures of large-scale transfers. In particular, direct cloud syncs that take more than two days to run will be interrupted. To avoid network issues, we recommend using smaller file sizes and limiting the number of files that are ingested in every execution of the sync. Syncs can be scheduled to run frequently.
Setup¶
- Open the Data Connection application and select + New Source in the upper right corner of the screen.
- Select HDFS from the available connector types.
- Follow the additional configuration prompts to continue the setup of your connector using the information in the sections below.
Learn more about setting up a connector in Foundry.
Networking¶
We recommend using the HDFS scheme if it is available, because its RPC performance is faster. Alternatively, WebHDFS is an HTTP REST API that supports the complete FileSystem interface for HDFS. Example URLs for each scheme:
- hdfs://myhost.example.com:1234/path/to/root/directory
- webhdfs://example.com/path
- swebhdfs://example.com/path
:::callout{theme="warning"}
The required network ports will differ based on the selected scheme. For the HDFS scheme, these ports are typically 8020/9000 on the NameNode server and 1019, 50010, and 50020 on the DataNode. For the WebHDFS scheme, the required port is typically 9820.
:::
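As a rough pre-flight check, the URL forms and typical ports above can be combined into a small reachability test run from the host that will execute the sync. This is a minimal sketch, not part of the connector: the function names are hypothetical, the port table only repeats the typical NameNode-side values quoted above, and treating `swebhdfs` like `webhdfs` is an assumption.

```python
import socket
from urllib.parse import urlparse

# Typical NameNode-side ports quoted in the callout above; real clusters may differ.
# ASSUMPTION: swebhdfs is treated like webhdfs here; the text gives no separate port.
TYPICAL_PORTS = {
    "hdfs": [8020, 9000],
    "webhdfs": [9820],
    "swebhdfs": [9820],
}


def candidate_endpoints(url):
    """Return the (host, port) pairs worth checking for a source URL."""
    parsed = urlparse(url)
    if parsed.scheme not in TYPICAL_PORTS:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if parsed.hostname is None:
        raise ValueError(f"no host in URL: {url!r}")
    # An explicit port in the URL wins over the typical defaults.
    ports = [parsed.port] if parsed.port else TYPICAL_PORTS[parsed.scheme]
    return [(parsed.hostname, port) for port in ports]


def is_reachable(host, port, timeout=3.0):
    """Attempt a TCP connection; return True if the port accepts it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_reachable(host, port)` for each pair returned by `candidate_endpoints(...)` shows which ports are open from that host. DataNode ports are not covered, because DataNode hosts are not part of the source URL.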
Certificates and private keys¶
SSL connections validate server certificates. Normally, SSL validations happen through a certificate chain; by default, both agent and Foundry workers trust most industry-standard certificate chains. If the server to which you are connecting has a self-signed certificate, or if there is TLS interception during the validation, the connector must trust the certificate. Learn more about using certificates in Data Connection.
Configuration options¶
The following configuration options are available for the HDFS connector:
| Option | Required? | Description |
|---|---|---|
| URL | Yes | The HDFS URL to the root data directory. |
| Extra properties | No | A map of properties that is passed to the Hadoop Configuration. Each entry is a name and value pair that corresponds to a single property, which avoids the need to specify the configuration on disk via configurationResources. |
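Each entry in the extra properties map plays the same role as one `<property>` element in an on-disk Hadoop configuration file. The sketch below illustrates that correspondence by rendering a name and value map as Hadoop-style XML. The helper name is hypothetical and the two entries are only examples of Hadoop property names; which properties you need depends on your cluster.

```python
import xml.etree.ElementTree as ET

# Illustrative entries only; the right properties depend on your cluster.
extra_properties = {
    "dfs.client.use.datanode.hostname": "true",
    "hadoop.security.authentication": "kerberos",
}


def to_hadoop_xml(properties):
    """Render a name/value map as Hadoop-style configuration XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

Entering the same pairs as extra properties on the source avoids having to keep such a file on disk.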
Advanced options¶
The following advanced options are available for the HDFS connector:
| Option | Required? | Description |
|---|---|---|
| User | No | HDFS user. For legacy agent worker sources, defaults to the currently logged-in user on the agent host. The user parameter overrides Data Connection's global Kerberos settings. Leave the user parameter blank if you are using Kerberos. |
| File change timeout | No | Amount of time, as an ISO-8601 duration, that a file must remain unchanged before it is considered for upload. If possible, use the more efficient lastModifiedBefore processor. |
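To make the file change timeout concrete, the sketch below parses a simple ISO-8601 duration such as `PT5M` and checks whether a file has been unchanged for at least that long. It is an illustration under assumptions: the function names are hypothetical, only day, hour, minute, and second components are handled, and using the modification time as the measure of "unchanged" is a guess about the connector's behavior.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches durations such as PT5M, PT1H30M, or P1DT12H (no years, months, weeks).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Parse a simple ISO-8601 duration into a timedelta."""
    match = _DURATION.match(text)
    # Reject strings with no components, such as "P", "PT", or "P1DT".
    if not match or text == "P" or text.endswith("T"):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    return timedelta(**parts)


def is_stable(last_modified, timeout, now=None):
    """Return True if the file has not changed for at least `timeout`."""
    now = now or datetime.now(timezone.utc)
    return now - last_modified >= timeout
```

With a timeout of `PT5M`, a file modified ten minutes ago would be eligible for upload, while a file modified one minute ago would be skipped until a later run.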
Sync data from HDFS¶
Visit the Explore tab to interactively explore data available in the configured HDFS instance. Select New Sync to regularly pull data from HDFS to a specified dataset in Foundry.
HDFS export task (legacy)¶
:::callout{theme="warning"}
Export tasks are a legacy feature that is not recommended for new implementations. This documentation is provided for users who are still using legacy export tasks.
:::
Basic configuration¶
type: export-hdfs-task
directoryPath: /some/directory
Complete configuration options¶
| Option | Required | Default | Description |
|---|---|---|---|
| directoryPath | Yes | N/A | The directory where files will be written |
| incrementalType | No | snapshot | Use incremental for incremental exports |
| retriesPerFile | No | 0 | Number of retry attempts per file on failure |
| rewritePaths | No | N/A | Map of regex patterns for path rewriting (see common configuration) |
Incremental exports¶
For incremental exports, set incrementalType to incremental. The first export writes all files. Subsequent exports include only new transactions, provided that the previously exported transaction is still present in the dataset.
type: export-hdfs-task
directoryPath: /exports/incremental/
incrementalType: incremental
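The selection rule for incremental exports can be sketched as follows. This is an illustration of the behavior described above, not the export task's implementation: the function and argument names are hypothetical, and the fallback to a full export when the previous transaction is missing is an assumption, since the text does not say what happens in that case.

```python
def transactions_to_export(dataset_transactions, last_exported):
    """Pick the transactions that the next export run should cover.

    dataset_transactions: transaction IDs currently in the dataset, oldest first.
    last_exported: ID of the last exported transaction, or None on the first run.
    """
    if last_exported is None:
        # First export: everything in the dataset.
        return list(dataset_transactions)
    if last_exported in dataset_transactions:
        # Previous transaction still present: export only what came after it.
        index = dataset_transactions.index(last_exported)
        return list(dataset_transactions[index + 1:])
    # ASSUMPTION: if the previous transaction is gone, export everything again.
    return list(dataset_transactions)
```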
Connection retries¶
If you experience connection issues to the HDFS cluster, you can configure retry attempts per file. Setting retriesPerFile: 1 will attempt to export each file twice (one initial attempt plus one retry).
type: export-hdfs-task
directoryPath: /some/directory
retriesPerFile: 1
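The attempt count implied by retriesPerFile can be illustrated with a small retry loop. This is a sketch under assumptions: `export_with_retries` and the `export_file` callable are hypothetical names, and treating `OSError` as the retryable failure is a stand-in for whatever connection errors the export task actually retries on.

```python
def export_with_retries(export_file, path, retries_per_file=0):
    """Call export_file(path) up to retries_per_file + 1 times."""
    attempts = retries_per_file + 1  # one initial attempt plus the retries
    last_error = None
    for _ in range(attempts):
        try:
            return export_file(path)
        except OSError as error:
            # Remember the failure and try again if any attempts remain.
            last_error = error
    raise last_error
```

With `retries_per_file=1`, a file whose first attempt fails with a connection error gets exactly one more attempt before the error is raised.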