S3-compatible API for Foundry datasets¶

The S3-compatible API for Foundry datasets allows you to interact with Foundry datasets as though they are S3 buckets. Learn how datasets behave when accessed through the API, and view the setup guide and examples.

Foundry exposes a subset of the Amazon Simple Storage Service (S3) API ↗, allowing you to interact with Foundry datasets using clients that know how to speak to S3 storage services. Examples include the AWS CLI, AWS SDKs, Hadoop S3 filesystem, and Cyberduck.

The S3 API is not fully implemented as not all S3 concepts map naturally to concepts in Foundry. For example, creation and deletion of buckets (which represent Foundry datasets) is not currently supported; datasets should be created in Foundry ahead of using the API. However, the majority of file read/write/delete workflows are supported, including multipart uploads. See Supported actions for a list of which S3 actions are supported.

Concepts¶

This section outlines how S3 concepts map to Foundry dataset concepts.

S3 buckets correspond to Foundry datasets¶

S3 buckets correspond to Foundry datasets, with the bucket name being the Foundry dataset's unique identifier (for example, ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7).

S3 object keys correspond to the logical paths of files within a Foundry dataset (for example, top-level-file.csv or subfolder/nested-file.csv).

Branches¶

The API supports accessing dataset branches whose names contain only alphanumeric characters (a-z, A-Z, 0-9) and the special characters - and _. To specify a branch, append the branch name to the bucket name, separated by a period: <dataset-rid>.<branch-name>. If no branch is specified, the default branch is used.

For example, to access the mybranch branch of the dataset with RID ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7, use the bucket name ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7.mybranch.

S3 bucket name validation imposes the character restrictions described above. If your branch name contains characters that are not allowed, such as /, you can Base64-encode the branch name and omit any trailing = characters. For example, feature/my-branch is Base64-encoded as ZmVhdHVyZS9teS1icmFuY2g=, so ZmVhdHVyZS9teS1icmFuY2g (without the = character) can be used to reference this branch in the bucket name.

The API does not support branch creation; specified branches must already exist on the dataset.
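The branch naming rules above can be sketched as a small helper. This is a minimal illustration, not part of any Foundry SDK: the function name `bucket_name_for_branch` is hypothetical, and it assumes the standard Base64 alphabet and UTF-8 encoding of the branch name, which matches the feature/my-branch example but is not otherwise specified in this document.

```python
import base64
import re

# Characters the text above says may appear in a branch name as-is.
_PLAIN_BRANCH = re.compile(r"[A-Za-z0-9_-]+")


def bucket_name_for_branch(dataset_rid: str, branch: str) -> str:
    """Return '<dataset-rid>.<branch>', Base64-encoding the branch if needed."""
    if _PLAIN_BRANCH.fullmatch(branch):
        suffix = branch
    else:
        # Assumption: standard Base64 over the UTF-8 bytes of the branch name.
        encoded = base64.b64encode(branch.encode("utf-8")).decode("ascii")
        suffix = encoded.rstrip("=")  # trailing '=' padding is omitted
    return f"{dataset_rid}.{suffix}"


rid = "ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7"
print(bucket_name_for_branch(rid, "mybranch"))
print(bucket_name_for_branch(rid, "feature/my-branch"))
```

The second call prints the dataset RID followed by .ZmVhdHVyZS9teS1icmFuY2g, matching the example above.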

Transactions¶

The API supports accessing dataset transactions in a similar way to branches. This allows users to access historical versions of a dataset. For example, to access the ri.foundry.main.transaction.0cdfe8c9-f595-4859-a194-7daecff9d6fe transaction of the dataset with RID ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7, use the bucket name ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7.ri.foundry.main.transaction.0cdfe8c9-f595-4859-a194-7daecff9d6fe.

Only committed transactions may be accessed in this way. As a result, the bucket will be read-only; it is not possible to put or delete objects when using a bucket name that includes a transaction identifier.
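As a rough sketch of the naming scheme above, the helpers below build a transaction-scoped bucket name and flag it as read-only before a client attempts a write. Both function names are hypothetical, and the `ri.foundry.main.transaction.` prefix check is an assumption inferred from the single example RID shown above.

```python
# Assumption: transaction RIDs start with this prefix, as in the example above.
TRANSACTION_PREFIX = "ri.foundry.main.transaction."


def bucket_name_for_transaction(dataset_rid: str, transaction_rid: str) -> str:
    """Return '<dataset-rid>.<transaction-rid>' for read-only historical access."""
    if not transaction_rid.startswith(TRANSACTION_PREFIX):
        raise ValueError(f"not a transaction RID: {transaction_rid!r}")
    return f"{dataset_rid}.{transaction_rid}"


def is_read_only_bucket(bucket_name: str) -> bool:
    """True if the bucket name embeds a transaction RID (put/delete not possible)."""
    return f".{TRANSACTION_PREFIX}" in bucket_name


bucket = bucket_name_for_transaction(
    "ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7",
    "ri.foundry.main.transaction.0cdfe8c9-f595-4859-a194-7daecff9d6fe",
)
print(bucket, is_read_only_bucket(bucket))
```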

Transaction management¶

S3 does not have the notion of transactions, so Foundry dataset transactions are automatically handled with the following behavior:

Transactions are lazily opened when users write or delete files.
The API supports writing multiple files concurrently or deleting multiple files concurrently. Files will be modified in the same open transaction, with writes using UPDATE transactions and deletes using DELETE transactions.
The API does not support concurrent writes and deletes, given this would require both an UPDATE and DELETE transaction to be active at the same time.
Transactions are automatically committed after a short period of inactivity following a write or delete. This allows you to upload multiple files in the same transaction, avoiding many transactions with very few files. To avoid transactions being kept open indefinitely, for example if files are continuously uploaded such that there is no period of inactivity, a transaction will be committed after a certain amount of time as soon as there are no active uploads. If files are continuously uploaded in parallel, then the transaction will remain open until all uploads are complete.
If a read request is received within a period of inactivity while the transaction is open, the UPDATE and DELETE transactions will be immediately committed where possible. This aims to preserve read-after-write semantics that are generally expected with S3. However, if reads are attempted while there are still active write requests to the open transaction, the read will happen from the latest committed view. To guarantee a transaction commits after all writes or deletes are complete, you can issue a subsequent read request prior to any new write or delete requests.

Given the above behaviors, read-after-write semantics are not guaranteed. However, every effort is made to provide them where possible.
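One way to apply the last point in the list above is to finish all writes and then issue a single read. The sketch below assumes a boto3-style S3 client is passed in; the function name `upload_then_flush` is hypothetical, and using `list_objects_v2` as the read request is an illustrative choice rather than something this document prescribes. As noted above, the commit happens only "where possible", so this does not guarantee read-after-write behavior.

```python
from typing import Any, Mapping


def upload_then_flush(s3: Any, bucket: str, files: Mapping[str, bytes]) -> int:
    """Write every file, then issue one read so the open transaction can commit.

    `s3` is assumed to be a boto3-style S3 client exposing put_object and
    list_objects_v2. Returns the number of objects written.
    """
    for key, body in files.items():
        s3.put_object(Bucket=bucket, Key=key, Body=body)
    # All writes have returned, so this read arrives with no active uploads.
    s3.list_objects_v2(Bucket=bucket, MaxKeys=1)
    return len(files)
```

Keeping the writes sequential and the read last avoids the case described above where a read races with active writes and is served from the latest committed view.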

Authentication¶

Connections through the API are authenticated using access key ID and secret access key credentials.

Static credentials¶

Static credentials are similar to standard AWS access key ID and secret access key credentials. Static credentials are long-lived and, in the Foundry case, are associated with the service user of a third-party application registered in Foundry's Control Panel.

When using a static access key ID and secret access key to connect to datasets through the S3-compatible API, the access level is determined by the access granted to the third-party application's service user. Static credentials must be restricted to individual Projects. These Project restrictions are specified when generating a new set of credentials, and only datasets within the specified Projects are accessible using the generated credentials.

See the setup guide below for guidance on using the /io/s3/api/v2/credentials API endpoint to generate these credentials.

Temporary credentials¶

We recommend using static credentials in any workflow that requires long-lived credentials or where it is beneficial to tie access to a service user. If you prefer authenticating to the API as your regular Foundry user, we support exchanging a user authentication token for temporary S3 credentials. This token could also be obtained through one of the OAuth2 grants for a third-party application.

Temporary credentials are obtained using the standard AssumeRoleWithWebIdentity ↗ Security Token Service (STS) API. The only required request parameter is WebIdentityToken, which must be set to a regular Foundry token as described above. The temporary credentials returned have the same scope as the provided token. The optional DurationSeconds parameter specifies the lifetime of the credentials. Credentials have a maximum lifetime of one hour and never outlive the Foundry token used to obtain them.

The STS API can be accessed at https://<FOUNDRY_URL>/io/s3. If you wish to obtain STS credentials programmatically, this URL should be configured in the endpoint configuration of the standard STS clients or credential providers. For example:

import boto3

endpoint = "https://<FOUNDRY_URL>/io/s3"
token = "<TOKEN>"  # replace with a valid Foundry token

session = boto3.session.Session()
client = session.client(service_name='sts', endpoint_url=endpoint)

# RoleArn and RoleSessionName are required parameters in boto3 despite being unused
credentials = client.assume_role_with_web_identity(
    RoleArn="xxxxxxxxxxxxxxxxxxxx",
    RoleSessionName="xxxxx",
    WebIdentityToken=token
)["Credentials"]
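The example above does not set DurationSeconds. A minimal variation that does is sketched below; it takes an already-constructed STS client as an argument, and the helper name `fetch_temporary_credentials` and the client-side clamp to one hour are illustrative assumptions rather than documented requirements.

```python
from typing import Any

# The text above gives one hour as the maximum credential lifetime.
MAX_LIFETIME_SECONDS = 3600


def fetch_temporary_credentials(
    sts_client: Any, token: str, duration_seconds: int = MAX_LIFETIME_SECONDS
) -> dict:
    """Exchange a Foundry token for temporary S3 credentials.

    `sts_client` is assumed to be a boto3 STS client whose endpoint_url is
    https://<FOUNDRY_URL>/io/s3.
    """
    duration = min(duration_seconds, MAX_LIFETIME_SECONDS)
    response = sts_client.assume_role_with_web_identity(
        RoleArn="xxxxxxxxxxxxxxxxxxxx",  # unused placeholder, required by boto3
        RoleSessionName="xxxxx",  # unused placeholder, required by boto3
        WebIdentityToken=token,
        DurationSeconds=duration,
    )
    return response["Credentials"]
```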

Alternatively, you can access the API directly using cURL or an equivalent tool, as in the example below. Replace <TOKEN> with a valid Foundry token.

curl -X POST \
    "https://<FOUNDRY_URL>/io/s3?Action=AssumeRoleWithWebIdentity&WebIdentityToken=<TOKEN>"

You will receive session credentials in the XML response, as shown below. These credentials should be securely stored.

<?xml version='1.0' encoding='UTF-8'?>
<AssumeRoleWithWebIdentityResponse
    xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
    <AssumeRoleWithWebIdentityResult>
        <Credentials>
            <AccessKeyId>PLTRLZZJE0...</AccessKeyId>
            <SecretAccessKey>2j3hKX4EDP...</SecretAccessKey>
            <SessionToken>eyJwbG50ci...</SessionToken>
            <Expiration>2023-08-30T10:55:08.841403951Z</Expiration>
        </Credentials>
    </AssumeRoleWithWebIdentityResult>
</AssumeRoleWithWebIdentityResponse>
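If you call the endpoint directly rather than through an SDK, the XML response needs to be parsed. The following sketch uses only the Python standard library; the function name `parse_sts_credentials` is hypothetical, and the namespace URI is taken from the sample response above.

```python
import xml.etree.ElementTree as ET

# Namespace as it appears in the sample response above.
STS_NS = {"sts": "https://sts.amazonaws.com/doc/2011-06-15/"}


def parse_sts_credentials(xml_text: str) -> dict:
    """Extract the <Credentials> children of an STS response into a dict."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    creds = root.find(
        "sts:AssumeRoleWithWebIdentityResult/sts:Credentials", STS_NS
    )
    if creds is None:
        raise ValueError("no <Credentials> element in STS response")
    # Strip the '{namespace}' prefix ElementTree adds to each tag name.
    return {child.tag.split("}", 1)[-1]: child.text for child in creds}
```

The returned dictionary has the keys AccessKeyId, SecretAccessKey, SessionToken, and Expiration, mirroring the element names in the response.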

Once you have temporary credentials, navigate to step four in the setup guide below for guidance on configuring S3 clients. You do not need to follow any steps regarding third-party applications. When configuring S3 clients, you must provide the session token, the access key ID, and secret access key.
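As an illustration of the last point, the helper below maps a Credentials mapping (with the AccessKeyId, SecretAccessKey, and SessionToken fields shown in the response above) onto the keyword arguments that `boto3.client` accepts. The helper name `s3_client_kwargs` is hypothetical; the endpoint and region values follow the configuration described in this guide.

```python
def s3_client_kwargs(credentials: dict, foundry_url: str) -> dict:
    """Translate temporary STS credentials into boto3.client('s3', ...) kwargs."""
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        # The session token is mandatory when using temporary credentials.
        "aws_session_token": credentials["SessionToken"],
        "endpoint_url": f"https://{foundry_url}/io/s3",
        "region_name": "foundry",
    }


# Usage, assuming boto3 is installed:
#   s3 = boto3.client("s3", **s3_client_kwargs(credentials, "<FOUNDRY_URL>"))
```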

:::callout{theme="info"} To read or write data via the S3-compatible API, users need s3-proxy:datasets-read and s3-proxy:datasets-write permissions which, by default, are granted to the Viewer and Editor role respectively. When using static credentials, the service user corresponding to the third-party app will need to be granted the relevant role. When using temporary credentials, the user obtaining credentials will need the relevant role. :::

Path-style URL access¶

The API supports only path-style ↗ bucket access. Your bucket URLs will take the format https://<FOUNDRY_URL>/io/s3/<bucket-name>/<key-name>.

In Foundry terms, this means https://<FOUNDRY_URL>/io/s3/<dataset-rid>/<logical-filepath>.

Virtual-hosted-style bucket access (where the bucket name is included in the subdomain of the URL) is currently not supported.
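For reference, the path-style format can be expressed as a short helper. This is only a sketch: the function name `path_style_url` is hypothetical, and percent-encoding the key while leaving "/" intact is an assumption based on common S3 client behavior rather than something stated here.

```python
from urllib.parse import quote


def path_style_url(foundry_url: str, dataset_rid: str, logical_path: str) -> str:
    """Build https://<FOUNDRY_URL>/io/s3/<dataset-rid>/<logical-filepath>."""
    # Keep '/' so nested logical paths stay intact; escape other unsafe characters.
    key = quote(logical_path.lstrip("/"), safe="/")
    return f"https://{foundry_url}/io/s3/{dataset_rid}/{key}"


print(
    path_style_url(
        "<FOUNDRY_URL>",
        "ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7",
        "subfolder/nested-file.csv",
    )
)
```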

Presigned URLs¶

Presigned URLs are supported for the putObject operation.

Setup guide¶

Follow these steps to set up a connection to the API from your S3 client.

Register a third-party application in Foundry
Grant permissions to your third-party application's service user
Generate S3 access key ID and secret access key
Configure the S3 client

View example configuration settings for specific S3 clients.

Step 1: Register a third-party application in Foundry¶

To obtain credentials for the S3-compatible API you will first need to obtain the client ID and secret of a third-party application that has been created in Control Panel in Foundry:

Open Control Panel in Foundry.
Select Third party applications in the sidebar.
Select New application, then complete the setup wizard with the following parameters:
Client type: Choose Confidential client.
Authorization grant types: Enable the Client credentials grant.
Select Register application in the lower right of the Summary page.
On the completion screen, record the Client ID as this will be needed in a future step. Then select Enable application use and use the toggle switch to Enable it.

:::callout{theme="neutral"} Review Concepts: Authentication to understand the requirements and behavior for project restrictions, and the scope of generated credentials. :::

Step 2: Grant permissions to your third-party application's service user¶

:::callout{theme="warning"} Users should use Developer Console to manage their application configuration. The Control Panel view only applies if Developer Console has not been enabled for the user. :::

When you created the third-party application in the previous step, Foundry created a service user automatically. To access datasets via the S3 API, this service user must have sufficient permissions on the relevant Projects and Markings.

To set up permissions for the service user:

Find the name and ID of the service user. You can find these details on your application's "Manage application" page, under Authorization grant types > Client credentials grant > Service user.
Grant permissions to the service user, either by adding the user to Projects and Markings directly, or by adding the user to a group that has been granted those permissions.

Step 3: Generate S3 access key ID and secret access key¶

:::callout{theme="neutral"} To generate credentials you will need to have the User experience administrator role on the Organization in Control Panel. :::

Run the terminal command below (using either curl or PowerShell) to receive an access key ID and secret access key. Replace <TOKEN> with an active token for your user account, and <CLIENT_ID> with the client ID of the third-party application registered in Step 1. Additionally, you must replace <PROJECT_RID> with the RID of a Project that the credentials should be able to access. The projectRestrictions value is a list, so you can add more Project RIDs if necessary; at least one Project must be specified. The Project RID should be of the form ri.compass.main.folder.{RID_VALUE}.

Option 1: curl¶

curl -X POST \
    -H "Authorization: Bearer <TOKEN>" \
    -H "Content-type: application/json" \
    --data '{"clientId":"<CLIENT_ID>","projectRestrictions":["<PROJECT_RID>"]}' \
    https://<FOUNDRY_URL>/io/s3/api/v2/credentials

Option 2: PowerShell (Windows)¶

$headers = @{
    "Authorization" = "Bearer <TOKEN>"
    "Content-type" = "application/json"
}
$body = @{
    "clientId" = "<CLIENT_ID>"
    "projectRestrictions" = @("<PROJECT_RID>")
} | ConvertTo-Json

Invoke-WebRequest -Uri "https://<FOUNDRY_URL>/io/s3/api/v2/credentials" -Method POST -Headers $headers -Body $body

Securely store the access key ID and secret access key you receive in the response to this request. You must configure clients with these credentials, not with the third-party application's client ID and secret.
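The same request can be prepared from Python using only the standard library. The sketch below builds the request without sending it; the function name `build_credentials_request` is hypothetical, and the response format is not described in this document, so parsing it is left to the caller.

```python
import json
import urllib.request


def build_credentials_request(foundry_url, token, client_id, project_rids):
    """Prepare the POST to /io/s3/api/v2/credentials (the request is not sent)."""
    if not project_rids:
        raise ValueError("at least one Project RID must be specified")
    payload = {"clientId": client_id, "projectRestrictions": list(project_rids)}
    return urllib.request.Request(
        url=f"https://{foundry_url}/io/s3/api/v2/credentials",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```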

Specifying an organization¶

By default, the /v2/credentials endpoint assumes the authenticating user is generating credentials for a third-party application in their own Organization. If the third-party application exists in a different Organization, specify the Organization ID as a query parameter in the URL: https://<FOUNDRY_URL>/io/s3/api/v2/credentials?organizationRid=<ORGANIZATION_ID>.

Revoking credentials¶

If you need to revoke an access key and secret access key, call the following endpoint and replace <ACCESS_KEY_ID> with the access key ID that you wish to revoke:

curl -X DELETE \
    -H "Authorization: Bearer <TOKEN>" \
    https://<FOUNDRY_URL>/io/s3/api/v2/credentials/<ACCESS_KEY_ID>

Listing credentials¶

You can use the following endpoint to retrieve a list of active (non-revoked) access keys, including their client ID and project restrictions.

curl -X GET \
    -H "Authorization: Bearer <TOKEN>" \
    https://<FOUNDRY_URL>/io/s3/api/v2/credentials
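For scripted key rotation, the list and revoke calls above could be prepared as follows. This is a sketch with hypothetical function names that only builds the requests; the shape of the list response is not documented here, so no parsing is attempted.

```python
import urllib.request
from typing import Optional


def _credentials_request(
    foundry_url: str, token: str, method: str, access_key_id: Optional[str] = None
) -> urllib.request.Request:
    url = f"https://{foundry_url}/io/s3/api/v2/credentials"
    if access_key_id is not None:
        url = f"{url}/{access_key_id}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}, method=method
    )


def build_list_request(foundry_url: str, token: str) -> urllib.request.Request:
    """GET the active (non-revoked) access keys."""
    return _credentials_request(foundry_url, token, "GET")


def build_revoke_request(
    foundry_url: str, token: str, access_key_id: str
) -> urllib.request.Request:
    """DELETE a single access key by its access key ID."""
    return _credentials_request(foundry_url, token, "DELETE", access_key_id)
```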

Step 4: Configure the S3 client¶

To configure an S3 client you must set the following configuration parameters. See the examples below for details on how these should be configured in common S3 clients.

| Name | Value | Description |
| --- | --- | --- |
| Hostname / Endpoint | `https://<FOUNDRY_URL>/io/s3` | The hostname to which clients should connect (rather than `s3.amazonaws.com` for native S3 buckets hosted in AWS). |
| Region | `foundry` | The region must be set to `foundry` as it is used as part of the V4 signature verification ↗ process. If a client can only use standard AWS regions, use `us-east-1`. |
| Credentials | Access key ID and secret access key, and optionally, session token | Static or temporary credentials generated as described above. |
| Path-style access | `true` | The API only supports path-style ↗ bucket access rather than virtual-hosted-style bucket access. |
| Bucket Name (Optional) | `ri.foundry.main.dataset.<uuid>` | Each Foundry dataset is accessible as a separate S3 bucket, with the bucket name being the dataset's RID. |

Supported actions¶

The following S3 actions ↗ are supported:

Client setup examples¶

As discussed above, you must ensure the client/SDK/connector is configured to use path-style bucket access. If the client does not support path-style bucket access, it is currently not compatible with this API. For example, with the S3A Hadoop client ↗ this can be configured using the fs.s3a.path.style.access flag.

If you are using temporary credentials, be sure to also configure the AWS session token. Consult the relevant AWS client documentation for details. For example, for the AWS CLI ↗ you should set the AWS_SESSION_TOKEN environment variable.

AWS CLI¶

Once you have an access key ID and secret access key, you are ready to configure the AWS CLI ↗. Run the following command, entering the access key ID, secret access key, and region.

$ aws configure --profile foundry
AWS Access Key ID [None]: <ACCESS_KEY_ID>
AWS Secret Access Key [None]: <SECRET_ACCESS_KEY>
Default region name [None]: foundry
Default output format [None]:

You should now be able to run commands for the foundry profile. For example:

aws --profile foundry --endpoint-url https://<FOUNDRY_URL>/io/s3 s3 ls s3://<DATASET_RID>

As of a recent release ↗ of the AWS CLI, it is now possible to configure the endpoint-url as part of the profile configuration. A sample foundry profile as it would be configured in ~/.aws/config is shown below. When configuring a profile with the endpoint_url property, you no longer need to include the --endpoint-url argument when using the aws command. Instead, --profile is sufficient.

[profile foundry]
region = foundry
endpoint_url = https://<FOUNDRY_URL>/io/s3

To use temporary credentials with the AWS CLI, follow the AWS documentation ↗ that explains how to configure the CLI to make the AWS STS AssumeRoleWithWebIdentity call for you. A sample foundry profile as it would be configured in ~/.aws/config is shown below. When using this configuration you do not need to have configured credentials in environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN) or ~/.aws/credentials.

[profile foundry]
region = foundry
endpoint_url = https://<FOUNDRY_URL>/io/s3
web_identity_token_file = ~/.foundry/web-identity.token
role_arn = xxxxxxxxxxxxxxxxxxxx

The example above assumes a valid Foundry token is stored in a file at ~/.foundry/web-identity.token. We only recommend this approach if this file is properly secured and not at risk of being leaked. The role_arn property is not used but must still be provided and be at least 20 characters long due to AWS CLI validations. We use xxxxxxxxxxxxxxxxxxxx as a placeholder in the example. To use this configuration, you must configure the endpoint_url in ~/.aws/config rather than using --endpoint-url, as discussed above.

AWS SDK for Python (Boto3)¶

import boto3
import pandas as pd

s3 = boto3.client(
    's3',
    aws_access_key_id="<ACCESS_KEY_ID>",
    aws_secret_access_key="<SECRET_ACCESS_KEY>",
    # aws_session_token="<SESSION_TOKEN>", only needed when using temporary credentials
    endpoint_url="https://<FOUNDRY_URL>/io/s3",
    region_name="foundry"
)

bucket = 'ri.foundry.main.dataset.<uuid>'
key = 'iris.csv'

obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'])
print(df)

Review the S3 section of Boto3's documentation ↗ for more information on connecting to S3-compatible sources using Boto3.

Spark¶

from pyspark.sql import SparkSession

hostname = "https://<FOUNDRY_URL>/io/s3"
access_key_id = "<ACCESS_KEY_ID>"
secret_access_key = "<SECRET_ACCESS_KEY>"

# ensure dataset RID can be parsed as a valid hostname
dataset_rid = "ri.foundry.main.dataset.<uuid>".replace('.', '-')

spark_session = (
    SparkSession.builder
        .config("fs.s3a.access.key", access_key_id)
        .config("fs.s3a.secret.key", secret_access_key)
        # .config("fs.s3a.session.token", session_token) only needed when using temporary credentials
        .config("fs.s3a.endpoint", hostname)
        .config("fs.s3a.endpoint.region", "foundry")
        .config("fs.s3a.path.style.access", "true")
        .getOrCreate()
)

df = spark_session.read.parquet(f"s3a://{dataset_rid}/*")
df.show()

Review the Spark documentation ↗ for more information on using Spark with S3.

:::callout{theme="warning"} There is a known issue when using the Hadoop AWS client with bucket names that contain '.'. You may encounter an error message such as "bucket is null/empty", which indicates that the dataset RID could not be parsed as a valid hostname. As a workaround, you can substitute '.' in the dataset RID with '-'. :::

Cyberduck¶

Download the following profile: Foundry S3.cyberduckprofile ↗
Double-click the profile file to open and register the profile in Cyberduck.
Set the following connection properties on the default bookmark that Cyberduck created for you:
Server: https://<FOUNDRY_URL>
Access Key ID: <ACCESS_KEY_ID>
Secret Access Key: <SECRET_ACCESS_KEY>
Path: (under More Options) ri.foundry.main.dataset.<uuid>
Close the bookmark settings window.
Double-click the bookmark to open a connection.

Review the Cyberduck documentation ↗ for more information on connecting to S3-compatible sources.

Google Storage Transfer Service¶

Google Cloud's Storage Transfer Service ↗ can treat Foundry as an S3-compatible source ↗. You can transfer data from a Foundry dataset to a Cloud Storage bucket by following the steps below.

Set up an agent pool and transfer agents by following Google Cloud's instructions ↗.
Create a transfer job, and select S3-compatible object storage as the Source type. Then select the agent pool you created in the previous step and ensure the following configuration is set:
Bucket: ri.foundry.main.dataset.<uuid>
Endpoint: https://<FOUNDRY_URL>/io/s3
Signing region: foundry
Signing process: Signature Version 4 (SigV4)
Addressing-style: Path-style requests
Network protocol: HTTPS
Listing API version: ListObjectsV2
Complete the setup to configure the Cloud Storage bucket destination, schedule, and settings of your transfer job.

Apache NiFi¶

You can use Apache NiFi to read and write files inside a Foundry dataset. The following example shows how to configure the PutS3Object processor type for writing:

Object Key: Logical file path in dataset of the form path/to/file.csv
Bucket: Dataset RID, such as ri.foundry.main.dataset.<uuid>
Access Key ID: <ACCESS_KEY_ID>
Secret Access Key: <SECRET_ACCESS_KEY>
Region: "US East (N. Virginia)" which corresponds to us-east-1
Use Path Style Access: true
Endpoint Override URL: https://<FOUNDRY_URL>/io/s3

Refer to the Apache NiFi documentation ↗ for more information on the PutS3Object processor and other processors that support S3-compatible sources.

Airbyte¶

The S3 destination support in Airbyte ↗ can be used to write files to Foundry datasets. Set the following destination settings:

Destination type: S3
S3 Key ID: <ACCESS_KEY_ID>
S3 Access Key: <SECRET_ACCESS_KEY>
S3 Bucket Name: ri.foundry.main.dataset.<uuid>
S3 Bucket Path: Any valid subdirectory path. Airbyte will write files into this subdirectory within the Foundry dataset.
S3 Bucket Region: us-east-1
Output Format: All of Airbyte's output formats are compatible with Foundry datasets.
Endpoint: https://<FOUNDRY_URL>/io/s3

Refer to Airbyte's documentation for S3 destinations ↗ for more information and configuration options.

DuckDB¶

The S3 support in DuckDB ↗ can be used to query Foundry datasets. You can manage credentials using DuckDB secrets and query datasets using s3://-prefixed URLs.

CREATE SECRET foundry_secret (
    TYPE S3,
    KEY_ID '<ACCESS_KEY_ID>',
    SECRET '<SECRET_ACCESS_KEY>',
    REGION 'foundry',
    ENDPOINT '<FOUNDRY_URL>/io/s3',
    URL_STYLE 'path'
);

CREATE TABLE new_tbl AS SELECT * FROM 's3://ri.foundry.main.dataset.<uuid>/spark/*.parquet';

Refer to the DuckDB documentation ↗ for more information.

:::callout{theme="neutral"} In the secret configuration above, the ENDPOINT configuration parameter should not include the https:// scheme. The URL scheme is handled automatically by the USE_SSL parameter, which defaults to true. :::

Polars¶

The S3 support in Polars ↗ can be used to query Foundry datasets.

import polars as pl

hostname = "https://<FOUNDRY_URL>/io/s3"
access_key_id = "<ACCESS_KEY_ID>"
secret_access_key = "<SECRET_ACCESS_KEY>"
dataset_rid = "ri.foundry.main.dataset.<uuid>"

storage_options = {
    "aws_access_key_id": access_key_id,
    "aws_secret_access_key": secret_access_key,
    "aws_region": "foundry",
    "endpoint_url": hostname
}

df = pl.scan_parquet(f"s3://{dataset_rid}/spark/*.parquet", storage_options=storage_options)

Refer to the Polars documentation ↗ for more information.

中文翻译¶

Foundry 数据集的 S3 兼容 API¶

Foundry 数据集的 S3 兼容 API 允许您像操作 S3 存储桶(Bucket)一样与 Foundry 数据集进行交互。了解通过该 API 访问时数据集的行为，并查看设置指南和示例。

Foundry 公开了 Amazon Simple Storage Service (S3) API ↗ 的一个子集，允许您使用能够与 S3 存储服务通信的客户端与 Foundry 数据集进行交互。示例包括 AWS CLI、AWS SDK、Hadoop S3 文件系统和 Cyberduck。

Foundry 数据集的 S3 兼容 API

S3 API 并未完全实现，因为并非所有 S3 概念都能自然地映射到 Foundry 中的概念。例如，目前不支持创建和删除存储桶（代表 Foundry 数据集）；在使用 API 之前，应在 Foundry 中创建数据集。但是，支持大多数文件读取/写入/删除工作流，包括分段上传(Multipart Upload)。有关支持哪些 S3 操作的列表，请参阅支持的操作。

概念¶

本节概述 S3 概念如何映射到 Foundry 数据集概念。

S3 存储桶对应 Foundry 数据集¶

S3 存储桶对应 Foundry 数据集，存储桶名称是 Foundry 数据集的唯一标识符（例如，ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7）。

S3 对象键(Object Key) 对应 Foundry 数据集内文件的逻辑路径（例如，top-level-file.csv 或 subfolder/nested-file.csv）。

分支(Branch)¶

该 API 支持访问具有字母数字名称（仅包含 a-z、A-Z 或 0-9）或特殊字符 - 和 _ 的数据集分支。要指定分支，请通过附加分支名称来修改存储桶名称，并用句点分隔：<dataset-rid>.<branch-name>。如果未指定分支，则使用默认分支。

例如，要访问 RID 为 ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7 的数据集的 mybranch 分支，请使用存储桶名称 ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7.mybranch。

S3 存储桶名称验证施加了前面描述的字符限制。如果您的分支包含不允许的字符，例如 /，您可以使用 Base64 编码对分支名称进行编码。您应省略任何尾随的 = 字符。例如，feature/my-branch 的 Base64 编码为 ZmVhdHVyZS9teS1icmFuY2g=。ZmVhdHVyZS9teS1icmFuY2g（不带 = 字符）可用于在存储桶名称中引用此分支。

该 API 不支持创建分支；指定的分支必须已存在于数据集上。

事务(Transaction)¶

该 API 支持以类似于分支的方式访问数据集事务。这允许用户访问数据集的历史版本。例如，要访问 RID 为 ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7 的数据集的 ri.foundry.main.transaction.0cdfe8c9-f595-4859-a194-7daecff9d6fe 事务，请使用存储桶名称 ri.foundry.main.dataset.bafb7e96-84f1-4d23-a4a5-40b17c6912e7.ri.foundry.main.transaction.0cdfe8c9-f595-4859-a194-7daecff9d6fe。

只有已提交的事务才能以这种方式访问。因此，该存储桶将是只读的；当使用包含事务标识符的存储桶名称时，无法放置或删除对象。

事务管理¶

S3 没有事务的概念，因此 Foundry 数据集事务会自动按以下行为处理：

当用户写入或删除文件时，事务会被延迟打开。
该 API 支持同时写入多个文件或同时删除多个文件。文件将在同一个打开的事务中被修改，写入使用 UPDATE 事务，删除使用 DELETE 事务。
该 API 不支持同时进行写入和删除，因为这需要同时激活 UPDATE 和 DELETE 事务。
在写入或删除之后，经过一段短暂的不活动期，事务会自动提交。这允许您在同一个事务中上传多个文件，避免产生许多包含极少文件的事务。为了避免事务无限期地保持打开状态（例如，如果文件持续上传导致没有不活动期），一旦没有活跃的上传，事务将在一定时间后提交。如果文件持续并行上传，则事务将保持打开状态，直到所有上传完成。
如果在事务打开期间的不活动期内收到读取请求，UPDATE 和 DELETE 事务将在可能的情况下立即提交。这旨在保留使用 S3 时通常期望的写后读一致性(Read-after-write semantics)。但是，如果在仍有对打开事务的活跃写入请求时尝试读取，则读取将基于最新的已提交视图。要确保在所有写入或删除完成后提交事务，您可以在任何新的写入或删除请求之前发出后续的读取请求。

鉴于上述行为，不保证写后读一致性。但是，会尽一切努力在可能的情况下提供此一致性。

身份验证(Authentication)¶

通过 API 的连接使用访问密钥 ID(Access Key ID)和秘密访问密钥(Secret Access Key)凭证进行身份验证。

静态凭证(Static credentials)¶

静态凭证类似于标准的 AWS 访问密钥 ID 和秘密访问密钥凭证。静态凭证是长期有效的，并且在 Foundry 的情况下，与在 Foundry 的控制面板(Control Panel)中注册的第三方应用的服务用户相关联。

当使用静态访问密钥 ID 和秘密访问密钥通过 S3 兼容 API 连接到数据集时，访问级别由授予第三方应用服务用户的访问权限决定。静态凭证必须限制在单个项目(Project)内。生成新凭证集时指定项目限制。只有指定项目内的数据集才能使用生成的凭证进行访问。

请参阅下面的设置指南，了解如何使用 /io/s3/api/v2/credentials API 端点生成这些凭证。

临时凭证(Temporary credentials)¶

我们建议在任何需要长期凭证或将访问权限绑定到服务用户的工作流中使用静态凭证。如果您希望以常规 Foundry 用户的身份向 API 进行身份验证，我们支持将用户身份验证令牌交换为临时 S3 凭证。此令牌也可以通过第三方应用的 OAuth2 授权之一获得。

临时凭证是使用标准的 AssumeRoleWithWebIdentity ↗ 安全令牌服务(Security Token Service, STS) API 获得的。我们只需要存在 WebIdentityToken 请求参数，并将其配置为如上所述的常规 Foundry 令牌。返回的临时凭证将具有与所提供令牌相同的范围。可以提供 DurationSeconds 参数来指定凭证的生命周期。凭证的最长生命周期为一小时，并且永远不会超过用于获取临时凭证的 Foundry 令牌的生命周期。

STS API 可以在 https://<FOUNDRY_URL>/io/s3 访问。如果您希望以编程方式获取 STS 凭证，此 URL 应配置在标准 STS 客户端或凭证提供者的端点配置中。例如：

import boto3

endpoint = "https://<FOUNDRY_URL>/io/s3"

session = boto3.session.Session()
client = session.client(service_name='sts', endpoint_url=endpoint)

# RoleArn 和 RoleSessionName 是 boto3 中的必需参数，尽管未被使用
credentials = client.assume_role_with_web_identity(
    RoleArn="xxxxxxxxxxxxxxxxxxxx",
    RoleSessionName="xxxxx",
    WebIdentityToken=token
)["Credentials"]

或者，您可以使用 cURL 或类似工具直接访问 API，如下例所示。<TOKEN> 应替换为有效的 Foundry 令牌。

curl -X POST \
    "https://<FOUNDRY_URL>/io/s3?Action=AssumeRoleWithWebIdentity&WebIdentityToken=<TOKEN>"

您将在 XML 响应中收到会话凭证，如下所示。这些凭证应安全存储。

<?xml version='1.0' encoding='UTF-8'?>
<AssumeRoleWithWebIdentityResponse
    xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
    <AssumeRoleWithWebIdentityResult>
        <Credentials>
            <AccessKeyId>PLTRLZZJE0...</AccessKeyId>
            <SecretAccessKey>2j3hKX4EDP...</SecretAccessKey>
            <SessionToken>eyJwbG50ci...</SessionToken>
            <Expiration>2023-08-30T10:55:08.841403951Z</Expiration>
        </Credentials>
    </AssumeRoleWithWebIdentityResult>
</AssumeRoleWithWebIdentityResponse>

获得临时凭证后，请转到下面的设置指南中的第四步，了解如何配置 S3 客户端。您无需执行任何与第三方应用相关的步骤。配置 S3 客户端时，您必须提供会话令牌、访问密钥 ID 和秘密访问密钥。

:::callout{theme="info"} 要通过 S3 兼容 API 读取或写入数据，用户需要 s3-proxy:datasets-read 和 s3-proxy:datasets-write 权限，默认情况下，这些权限分别授予 Viewer 和 Editor 角色。使用静态凭证时，需要授予第三方应用对应的服务用户相关角色。使用临时凭证时，获取凭证的用户需要拥有相关角色。 :::

路径样式 URL 访问¶

该 API 仅支持路径样式(Path-style) ↗ 存储桶访问。您的存储桶 URL 将采用 https://<FOUNDRY_URL>/io/s3/<bucket-name>/<key-name> 格式。

用 Foundry 术语来说，这意味着 https://<FOUNDRY_URL>/io/s3/<dataset-rid>/<logical-filepath>。

目前不支持虚拟托管样式(Virtual-hosted-style)存储桶访问（其中存储桶名称包含在 URL 的子域中）。

预签名 URL(Presigned URLs)¶

支持对 putObject 操作使用预签名 URL。

设置指南¶

按照以下步骤从您的 S3 客户端设置到 API 的连接。

在 Foundry 中注册第三方应用
向第三方应用的服务用户授予权限
生成 S3 访问密钥 ID 和秘密访问密钥
配置 S3 客户端

查看特定 S3 客户端的示例配置设置。

步骤 1：在 Foundry 中注册第三方应用¶

要获取 S3 兼容 API 的凭证，您首先需要获取已在 Foundry 控制面板中创建的第三方应用的客户端 ID(Client ID)和密钥(Secret)：

在 Foundry 中打开控制面板。
在侧边栏中选择第三方应用。
选择新建应用，然后使用以下参数完成设置向导：
- 客户端类型： 选择机密客户端(Confidential client)。
- 授权类型： 启用客户端凭证授权(Client credentials grant)。
在摘要页面的右下角选择注册应用。
在完成屏幕上，记录客户端 ID，因为后续步骤中会用到。然后选择启用应用使用，并使用切换开关将其启用。

:::callout{theme="neutral"} 查看概念：身份验证以了解项目限制的要求和行为，以及生成凭证的范围。 :::

步骤 2：向第三方应用的服务用户授予权限¶

:::callout{theme="warning"} 用户应使用开发者控制台(Developer Console)来管理其应用配置。仅当用户未启用开发者控制台时，才适用控制面板视图。 :::

当您在上一步中创建第三方应用时，Foundry 会自动创建一个服务用户。要通过 S3 API 访问数据集，此服务用户必须对相关项目和标记(Markings)拥有足够的权限。

要为服务用户设置权限：

找到服务用户的名称和 ID。您可以在应用的“管理应用”页面上的授权类型 > 客户端凭证授权 > 服务用户下找到这些详细信息。
向服务用户授予权限，可以直接将用户添加到项目和标记中，也可以将用户添加到已获得这些权限的组中。

步骤 3：生成 S3 访问密钥 ID 和秘密访问密钥¶

:::callout{theme="neutral"} 要生成凭证，您需要在控制面板中拥有组织的 User experience administrator 角色。 :::

运行下面的终端命令（使用 curl 或 Powershell）以接收访问密钥 ID 和秘密访问密钥。将 <TOKEN> 替换为您用户账户的有效令牌，将 <CLIENT_ID> 替换为上一步生成的第三方应用的客户端 ID。此外，您必须将 <PROJECT_RID> 替换为凭证有权访问的项目的 RID。projectRestrictions 值可以接受多个项目，允许您在必要时列出更多项目 RID。必须至少指定一个项目。项目 RID 的格式应为 ri.compass.main.folder.{RID_VALUE}。

选项 1：curl¶

curl -X POST \
    -H "Authorization: Bearer <TOKEN>" \
    -H "Content-type: application/json" \
    --data '{"clientId":"<CLIENT_ID>","projectRestrictions":["<PROJECT_RID>"]}' \
    https://<FOUNDRY_URL>/io/s3/api/v2/credentials

选项 2：Powershell (Windows)¶

$headers = @{
    "Authorization" = "Bearer <TOKEN>"
    "Content-type" = "application/json"
}
$body = @{
    "clientId" = "<CLIENT_ID>"
    "projectRestrictions" = @("<PROJECT_RID>")
} | ConvertTo-Json

Invoke-WebRequest -Uri "https://<FOUNDRY_URL>/io/s3/api/v2/credentials" -Method POST -Headers $headers -Body $body

安全地存储在此请求响应中收到的访问密钥和秘密密钥。您必须使用这些凭证配置客户端，而不是 第三方应用的客户端 ID 和密钥。

指定组织¶

默认情况下，/v2/credentials 端点假定身份验证用户正在为其自己的组织(Organization)中的第三方应用生成凭证。如果第三方应用存在于不同的组织中，请在 URL 中将组织 ID 指定为查询参数：https://<FOUNDRY_URL>/io/s3/api/v2/credentials?organizationRid=<ORGANIZATION_ID>。

吊销凭证¶

如果您需要吊销访问密钥和秘密访问密钥，请调用以下端点，并将 <ACCESS_KEY_ID> 替换为您要吊销的访问密钥 ID：

curl -X DELETE \
    -H "Authorization: Bearer <TOKEN>" \
    https://<FOUNDRY_URL>/io/s3/api/v2/credentials/<ACCESS_KEY_ID>

列出凭证¶

您可以使用以下端点检索活跃（未吊销）访问密钥的列表，包括其客户端 ID 和项目限制。

curl -X GET \
    -H "Authorization: Bearer <TOKEN>" \
    https://<FOUNDRY_URL>/io/s3/api/v2/credentials

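Step 4: Configure the S3 client¶

To configure an S3 client, you must set the following configuration parameters. See the examples below for how to set these parameters in common S3 clients.

Name	Value	Description
Hostname / endpoint	`https://<FOUNDRY_URL>/io/s3`	The hostname the client should connect to (instead of `s3.amazonaws.com`, which is used for native S3 buckets hosted in AWS).
Region	`foundry`	The region must be set to `foundry` because it is used as part of the V4 signature verification ↗ process. If the client can only use standard AWS regions, use `us-east-1`.
Credentials	Access key ID and secret access key, plus an optional session token	Static or temporary credentials generated as described above.
Path-style access	`true`	The API supports only path-style ↗ bucket access, not virtual-hosted-style bucket access.
Bucket name (optional)	`ri.foundry.main.dataset.<uuid>`	Each Foundry dataset is accessible as a separate S3 bucket whose name is the dataset's RID.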
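To illustrate what path-style access means for these parameters, the sketch below assembles the URL a path-style client would address for a given file: the bucket (the dataset RID, optionally followed by a period and a branch name) appears as the first path segment after the endpoint rather than as part of the hostname. The helper `path_style_url` is hypothetical and only shows the address layout; real requests must still be signed with Signature Version 4, which S3 clients handle for you.

```python
from urllib.parse import quote


def path_style_url(foundry_url, dataset_rid, key, branch=None):
    """Return the path-style address of a file in a Foundry dataset."""
    # A branch is selected by appending ".<branch-name>" to the bucket name.
    bucket = dataset_rid if branch is None else f"{dataset_rid}.{branch}"
    # quote() keeps "/" so nested logical paths stay readable.
    return f"https://{foundry_url}/io/s3/{bucket}/{quote(key)}"
```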

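Supported actions¶

The following S3 actions ↗ are supported:

Client setup examples¶

As noted above, you must make sure that your client, SDK, or connector is configured to use path-style bucket access. A client that does not support path-style bucket access is currently incompatible with this API. For the S3A Hadoop client ↗, for example, path-style access is configured with the fs.s3a.path.style.access flag.

If you are using temporary credentials, make sure that you also configure the AWS session token. Consult the relevant AWS client documentation for details. For the AWS CLI ↗, for example, set the AWS_SESSION_TOKEN environment variable.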
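When a script launches the AWS CLI as a subprocess, one way to pass static or temporary credentials is through the process environment. The sketch below builds such an environment; `aws_cli_env` is a hypothetical helper, while the variable names are the standard AWS CLI environment variables.

```python
import os


def aws_cli_env(access_key_id, secret_access_key, session_token=None, base_env=None):
    """Return a copy of the environment with Foundry S3 credentials set."""
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_ACCESS_KEY_ID"] = access_key_id
    env["AWS_SECRET_ACCESS_KEY"] = secret_access_key
    env["AWS_DEFAULT_REGION"] = "foundry"
    if session_token:
        # Only temporary credentials come with a session token.
        env["AWS_SESSION_TOKEN"] = session_token
    else:
        # Drop any stale token so static credentials are not mixed with it.
        env.pop("AWS_SESSION_TOKEN", None)
    return env


# Example usage (requires the AWS CLI to be installed):
# import subprocess
# subprocess.run(
#     ["aws", "--endpoint-url", "https://<FOUNDRY_URL>/io/s3", "s3", "ls", "s3://<DATASET_RID>"],
#     env=aws_cli_env("<ACCESS_KEY_ID>", "<SECRET_ACCESS_KEY>", "<SESSION_TOKEN>"),
#     check=True,
# )
```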

AWS CLI¶

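Once you have an access key ID and secret access key, you can configure the AWS CLI ↗. Run the following command and enter the access key ID, secret access key, and region.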

$ aws configure --profile foundry
AWS Access Key ID [None]: <ACCESS_KEY_ID>
AWS Secret Access Key [None]: <SECRET_ACCESS_KEY>
Default region name [None]: foundry
Default output format [None]:

You should now be able to run commands with the foundry profile. For example:

aws --profile foundry --endpoint-url https://<FOUNDRY_URL>/io/s3 s3 ls s3://<DATASET_RID>

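Recent versions of the AWS CLI ↗ allow endpoint_url to be set as part of the profile configuration. An example foundry profile configured in ~/.aws/config is shown below. When the profile sets the endpoint_url property, you no longer need to pass the --endpoint-url argument to aws commands; --profile is sufficient.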

[profile foundry]
region = foundry
endpoint_url = https://<FOUNDRY_URL>/io/s3
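If you provision machines with a script, the profile above can be written programmatically. The sketch below uses Python's `configparser`; the helper `write_foundry_profile` is hypothetical. Note that `configparser` rewrites the whole file and drops any comments in it, so treat this as a starting point rather than a safe editor for an existing, hand-maintained config.

```python
import configparser
from pathlib import Path


def write_foundry_profile(config_path, foundry_url, profile="foundry"):
    """Add or update an AWS CLI profile that points at the Foundry S3 endpoint."""
    config_path = Path(config_path).expanduser()
    # interpolation=None keeps literal "%" characters in existing values intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path)  # a missing file is silently skipped
    section = f"profile {profile}"
    if not parser.has_section(section):
        parser.add_section(section)
    parser[section]["region"] = "foundry"
    parser[section]["endpoint_url"] = f"https://{foundry_url}/io/s3"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    with config_path.open("w") as handle:
        parser.write(handle)
    return config_path
```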

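To use temporary credentials with the AWS CLI, follow the AWS documentation ↗ that explains how to configure the CLI to call AWS STS AssumeRoleWithWebIdentity automatically on your behalf. An example foundry profile configured in ~/.aws/config is shown below. With this configuration, you do not need to configure credentials in environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN) or in ~/.aws/credentials.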

[profile foundry]
region = foundry
endpoint_url = https://<FOUNDRY_URL>/io/s3
web_identity_token_file = ~/.foundry/web-identity.token
role_arn=xxxxxxxxxxxxxxxxxxxx

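The example above assumes that a valid Foundry token is stored in the file ~/.foundry/web-identity.token. We recommend this approach only if the file is appropriately secured and there is no risk of the token leaking. The role_arn property is not used, but AWS CLI validation requires it to be present and at least 20 characters long; the example uses xxxxxxxxxxxxxxxxxxxx as a placeholder. To use this configuration, you must set endpoint_url in ~/.aws/config, as described above, rather than passing --endpoint-url.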
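One way to reduce the risk of the token file leaking on a POSIX system is to create it with owner-only permissions. The following sketch shows that idea; `write_token_file` is a hypothetical helper, and file permissions alone are not a complete security measure.

```python
import os
from pathlib import Path


def write_token_file(token, path="~/.foundry/web-identity.token"):
    """Write a Foundry token to disk, readable and writable by the owner only."""
    token_path = Path(path).expanduser()
    token_path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    # The mode passed to os.open only applies when the file is newly created.
    fd = os.open(token_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token)
    # Tighten permissions as well in case the file already existed.
    os.chmod(token_path, 0o600)
    return token_path
```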

AWS SDK for Python (Boto3)¶

import boto3
import pandas as pd

s3 = boto3.client(
    's3',
    aws_access_key_id="<ACCESS_KEY_ID>",
    aws_secret_access_key="<SECRET_ACCESS_KEY>",
    # aws_session_token="<SESSION_TOKEN>",  # only required when using temporary credentials
    endpoint_url="https://<FOUNDRY_URL>/io/s3",
    region_name="foundry"
)

bucket = 'ri.foundry.main.dataset.<uuid>'
key = 'iris.csv'

obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'])
print(df)

For more information on using Boto3 to connect to S3-compatible sources, see the S3 section of the Boto3 documentation ↗.
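To discover which files a dataset contains before reading one, you can page through the listing. The sketch below accepts any Boto3 S3 client, such as the `s3` client created above, and relies on Boto3's `list_objects_v2` paginator; the helper name `list_dataset_files` is illustrative.

```python
def list_dataset_files(s3_client, dataset_rid, prefix=""):
    """Yield the logical path of every file in a dataset, one page at a time."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=dataset_rid, Prefix=prefix):
        # "Contents" is absent from a page when no keys match.
        for entry in page.get("Contents", []):
            yield entry["Key"]


# Example usage with the client and bucket from the Boto3 example:
# for path in list_dataset_files(s3, bucket):
#     print(path)
```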

Spark¶

from pyspark.sql import SparkSession

hostname = "https://<FOUNDRY_URL>/io/s3"
access_key_id = "<ACCESS_KEY_ID>"
secret_access_key = "<SECRET_ACCESS_KEY>"

# Ensure the dataset RID can be parsed as a valid hostname
dataset_rid = "ri.foundry.main.dataset.<uuid>".replace('.', '-')

spark_session = (
    SparkSession.builder
        .config("fs.s3a.access.key", access_key_id)
        .config("fs.s3a.secret.key", secret_access_key)
        # .config("fs.s3a.session.token", session_token)  # only required when using temporary credentials
        .config("fs.s3a.endpoint", hostname)
        .config("fs.s3a.endpoint.region", "foundry")
        .config("fs.s3a.path.style.access", "true")
        .getOrCreate()
)

df = spark_session.read.parquet(f"s3a://{dataset_rid}/*")
df.show()

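For more information on using Spark with S3, see the Spark documentation ↗.

:::callout{theme="warning"} There is a known issue when the Hadoop AWS client handles bucket names that contain '.'. You may encounter an error message similar to "bucket is null/empty", which means that the dataset RID could not be parsed as a valid hostname. As a workaround, replace the '.' characters in the dataset RID with '-'. :::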

Cyberduck¶

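Download the following profile: Foundry S3.cyberduckprofile ↗
Double-click the profile file to open it and register the profile in Cyberduck.
Set the following connection properties on the default bookmark that Cyberduck creates for you:
- Server: https://<FOUNDRY_URL>
- Access Key ID: <ACCESS_KEY_ID>
- Secret Access Key: <SECRET_ACCESS_KEY>
- Path: (under More Options) ri.foundry.main.dataset.<uuid>
Close the bookmark settings window.
Double-click the bookmark to open the connection.

For more information on connecting to S3-compatible sources, see the Cyberduck documentation ↗.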

Google Storage Transfer Service¶

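Google Cloud's Storage Transfer Service ↗ can treat Foundry as an S3-compatible source ↗. Follow the steps below to transfer data from a Foundry dataset to a Cloud Storage bucket.

Follow Google Cloud's instructions ↗ to set up an agent pool and transfer agents.
Create a transfer job and select S3-compatible object storage as the source type. Then select the agent pool you created in the first step and make sure the following configuration is set:
- Bucket: ri.foundry.main.dataset.<uuid>
- Endpoint: https://<FOUNDRY_URL>/io/s3
- Signing region: foundry
- Signing process: Signature Version 4 (SigV4)
- Addressing style: Path-style requests
- Network protocol: HTTPS
- Listing API version: ListObjectsV2
Complete the setup by configuring the Cloud Storage destination bucket, schedule, and settings for the transfer job.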

Apache NiFi¶

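You can use Apache NiFi to read and write files in Foundry datasets. The following example shows how to configure the PutS3Object processor type for writing:

Object Key: The logical file path in the dataset, in the form path/to/file.csv
Bucket: The dataset RID, for example ri.foundry.main.dataset.<uuid>
Access Key ID: <ACCESS_KEY_ID>
Secret Access Key: <SECRET_ACCESS_KEY>
Region: "US East (N. Virginia)", which corresponds to us-east-1
Use Path Style Access: true
Endpoint Override URL: https://<FOUNDRY_URL>/io/s3

For more information on the PutS3Object processor and other processors that support S3-compatible sources, see the Apache NiFi documentation ↗.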

Airbyte¶

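Airbyte ↗ supports S3 destinations, which can be used to write files to Foundry datasets. Configure the following destination settings:

Destination type: S3
S3 Key ID: <ACCESS_KEY_ID>
S3 Access Key: <SECRET_ACCESS_KEY>
S3 Bucket Name: ri.foundry.main.dataset.<uuid>
S3 Bucket Path: Any valid subdirectory path. Airbyte writes files inside this subdirectory of the Foundry dataset.
S3 Bucket Region: us-east-1
Output Format: All of Airbyte's output formats are compatible with Foundry datasets.
Endpoint: https://<FOUNDRY_URL>/io/s3

For more information and configuration options, see the Airbyte documentation on S3 destinations ↗.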

DuckDB¶

DuckDB ↗ supports S3, which can be used to query Foundry datasets. You can manage credentials with DuckDB secrets and query datasets using s3://-prefixed URLs.

CREATE SECRET foundry_secret (
    TYPE S3,
    KEY_ID '<ACCESS_KEY_ID>',
    SECRET '<SECRET_ACCESS_KEY>',
    REGION 'foundry',
    ENDPOINT '<FOUNDRY_URL>/io/s3',
    URL_STYLE 'path'
);

CREATE TABLE new_tbl AS SELECT * FROM 's3://ri.foundry.main.dataset.<uuid>/spark/*.parquet';

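For more information, see the DuckDB documentation ↗.

:::callout{theme="neutral"} In the secret configuration above, the ENDPOINT parameter should not include the https:// scheme. The URL scheme is handled automatically by the USE_SSL parameter, which defaults to true. :::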
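If your scripts keep the endpoint as a full URL (as the other clients on this page expect), a small helper can derive the scheme-less value that DuckDB needs. The sketch below is one possible approach; `duckdb_endpoint` is a hypothetical name, and the returned flag simply mirrors what you would pass as USE_SSL.

```python
from urllib.parse import urlsplit


def duckdb_endpoint(endpoint_url):
    """Split an endpoint URL into a scheme-less ENDPOINT value and a USE_SSL flag."""
    # Assume HTTPS when the caller already omitted the scheme.
    if "://" not in endpoint_url:
        endpoint_url = f"https://{endpoint_url}"
    parts = urlsplit(endpoint_url)
    endpoint = parts.netloc + parts.path.rstrip("/")
    return endpoint, parts.scheme == "https"
```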

Polars¶

Polars ↗ supports S3, which can be used to query Foundry datasets.

import polars as pl

hostname = "https://<FOUNDRY_URL>/io/s3"
access_key_id = "<ACCESS_KEY_ID>"
secret_access_key = "<SECRET_ACCESS_KEY>"
dataset_rid = "ri.foundry.main.dataset.<uuid>"

storage_options = {
    "aws_access_key_id": access_key_id,
    "aws_secret_access_key": secret_access_key,
    "aws_region": "foundry",
    "endpoint_url": hostname
}

df = pl.scan_parquet(f"s3://{dataset_rid}/spark/*.parquet", storage_options=storage_options)

For more information, see the Polars documentation ↗.