Interact with data(与数据交互)¶
Jupyter® and RStudio® Code Workspaces allow you to read, analyze, transform, and write back to Foundry datasets, Iceberg tables, and virtual tables. They also allow you to read and analyze restricted views and time series properties.
Security¶
Code Workspaces respects Foundry markings applied to datasets, and a workspace will inherit the markings of all datasets loaded into it. This is referred to as the workspace lineage. This means that to access a workspace, you must also have the required permissions for all of the datasets and other inputs contained in the workspace. If you lose access to a single input of the workspace, you will lose access to the entire workspace.
Resource considerations¶
Jupyter® and RStudio® workspaces are designed for interactive, analytical workflows and application development. They run on a single virtual machine and do not process datasets with a distributed Spark environment. As a result, the data you work with must fit into your machine's resources, which default to a maximum of 8 CPUs and 64 GB of memory. Using filters or SQL to reduce the amount of data you need before loading large datasets can help you avoid exceeding these limits. Filtering dataset files is another strategy to stay within your workspace's limits.
To develop large-scale data pipelines, use Python transforms with Spark or Pipeline Builder.
Dataset branching¶
By default, Code Workspaces will load the data of the dataset from the same branch as the workspace itself, and will otherwise fall back to the master branch. For example, a code workspace currently on branch my-branch will try to read the my-branch version of the dataset, and will fall back to the master branch if my-branch does not exist on the dataset.
To pin a dataset branch for a specific imported dataset, select the options ••• icon next to the dataset in the Data tab of the left sidebar. Then, select Pin dataset branch for reads and choose the desired dataset branch for use within your code workspace.
Read data¶
Code Workspaces requires you to select an alias for every Foundry dataset or restricted view that you import into a workspace. The alias acts as a reference that allows you to read from a dataset or restricted view and write to a dataset within your code. When registering a dataset in the Data tab, Code Workspaces creates a mapping between the chosen dataset alias and the Foundry resource's unique identifier in a hidden file located under the /home/user/repo/.foundry folder of the workspace.
Code Workspaces allows you to load tabular datasets, non-tabular datasets, Iceberg tables, virtual tables, and restricted views. You can add a new data source to your workspace by using the Add > Read data button in the Data tab.
After selecting a dataset to add to your workspace, you must define an alias for that resource to serve as a unique identifier and reference in your code. Choosing a read strategy for the dataset (such as pandas DataFrame, Polars LazyFrame, or raw file access) generates a code snippet that loads the dataset as the specified input type.
- You can modify this read strategy and generate a new snippet at any time, even after registering the dataset.
- By default, code workspaces will suggest a dataset alias with the same name as the dataset itself.
Selecting Done will complete the dataset registration process and allow you to use the data in your code with the Run snippet or Copy to clipboard options.
This page provides examples for tabular datasets, non-tabular datasets, and restricted views.
Tabular datasets¶
The following snippets are generated for a Cats tabular dataset with a dataset alias kittens. Notice that Cats is not referenced anywhere in the code snippet; Code Workspaces implicitly registers it under your chosen alias.
In Jupyter®:
from foundry.transforms import Dataset
kittens = Dataset.get("kittens")\
.read_table(format="arrow")\
.to_pandas()
For the full list of methods and properties available on the Dataset class used in Jupyter® workspaces, see the foundry.transforms.Dataset API reference.
Note that the read_table method shown above supports the following arguments:
arrow(recommended): Converts the dataset to an Apache Arrow table on which efficient filtering can be performed. You can then convert this table to a pandas dataframe using.to_pandas().pandasordataframe: Converts the dataset to a pandas dataframe.polars: Converts the dataset to a Polars dataframe. You can then convert it to a pandas dataframe using.to_pandas().lazy-polars: The lazy variant of a Polar dataframe. Filters cannot be executed on lazy polars.path: Outputs the local path under which the dataset is stored.
:::callout{theme="neutral"}
For a given format to be available as part of a read_table operation, the corresponding packages pyarrow, pandas and polars must be present in the environment. These are included automatically in the default Code Workspaces environment, but may need to be manually added to your custom environment.
:::
In RStudio®:
kittens <- datasets.read_table("kittens")
The syntax above loads the dataset and automatically collects the data into an R dataframe.
If the data exceeds the workspace's memory capacity, you can apply push-down filters to only load a subset of rows or columns using the following syntax:
library(dplyr) # should be imported by default in .Rprofile
kittens_df <- Dataset.get("kittens") %>%
# (optional) apply other transformations before collecting
collect()
Query tabular datasets with SQL¶
You may query any tabular dataset, Iceberg table, or virtual table that supports SQL queries using SQL in Python.
The following snippets are generated for a Ducks tabular dataset with a dataset alias ducklings. Notice that Ducks is not referenced anywhere in the code snippet; Code Workspaces implicitly registers it under your chosen alias.
You may only query one dataset per SELECT Spark SQL statement using the approaches below. Queries return the results as a PyArrow table ↗ that you can then convert to another format, such as a pandas DataFrame.
- Note that the default limit for SQL queries is one million rows. If you need to query more than one million rows at a time, you can parallelize separate queries and combine their results. Ensure that your data can fit within your workspace’s memory.
In Jupyter®:
After installing the containers-sql and foundry-platform-sdk packages, you may use the FoundrySdkSqlExecutor to write Spark SQL:
from containers_sql import FoundrySdkSqlExecutor
sql = FoundrySdkSqlExecutor()
query = sql.execute("SELECT * from `ducklings`")
df = query.fetch_results() # optionally specify timeout arg: query.fetch_results(timeout_in_seconds=60)
You may also use a magic command ↗ by first running the following at the top of your notebook:
import containers_sql
%reload_ext containers_sql.foundry_sdk.magics
Then, use either the %sql magic command to run a single-line query or %%sql to run a multi-line query. Using %%sql -o df will assign the result of your SQL query to the python object df as a PyArrow table:
ducklings_df = %sql SELECT hatch_date, bill_color, * from `ducklings`
ducklings_pandas_df = ducklings_df.to_pandas()
%%sql -o ducklings_df
SELECT
hatch_date, bill_color, *
FROM
`ducklings`
LIMIT 10;
In RStudio®:
First install containers-sql, foundry-platform-sdk, pyarrow, and reticulate. Then use the reticulate package with the Python SQL executor described above to query a dataset:
library(reticulate)
FoundrySdkSqlExecutor <- import("containers_sql")$FoundrySdkSqlExecutor
sql <- FoundrySdkSqlExecutor()
query <- sql$execute("SELECT * from `ducklings`")
results_df <- query$fetch_results()
# For smaller data, creating an intermediate R object is acceptable.
r_list <- py_to_r(results_df$to_pydict())
tibble <- as_tibble(r_list)
# For larger data, you may want to write your data to disk
# to avoid exceeding your container's memory limits.
library(arrow)
pa <- import("pyarrow")
pa$feather$write_feather(results_df, "tmp.arrow")
tibble <- read_feather("tmp.arrow") |> as_tibble()
Non-tabular datasets¶
The following snippets are generated for a Dogs non-tabular dataset with dataset alias puppies. Notice that Dogs is not referenced anywhere in the code snippet; Code Workspaces implicitly registers it under your chosen alias. Contrary to reading tabular datasets, this approach gives you access to the files from the dataset in a puppies_files variable instead of inserting values inside a dataframe.
In Jupyter®:
from foundry.transforms import Dataset
puppies = Dataset.get("puppies")
puppies_files = puppies.files().download()
with open(puppies_files["puppy_names.txt"], "rb") as f:
puppy_names_text = f.read().decode("utf-8")
In RStudio®:
puppies_files <- datasets.list_files("puppies")
puppies_local_files <- datasets.download_files("puppies", puppies_files)
puppy_names_text <- readLines(raw_files_local_files[["puppy_names.txt"]], warn = FALSE)
See Filter dataset files for details on how to target a certain subset of files to download into your workspace.
Restricted views¶
:::callout{theme="neutral"} Querying restricted views is only supported in Jupyter® workspaces. :::
You can query restricted views using the SQL approach described above using Python. To do this, import a restricted view into your workspace and query it by its alias as with any other tabular dataset.
Note that importing a restricted view into your code workspace requires enabling restricted outputs mode to enforce a higher level of data security. With restricted outputs mode enabled, you can perform analysis on data contained in a restricted view but cannot publish outputs from your workspace that use the data.
When adding a restricted view to your workspace for the first time, a prompt will instruct you to install specific Python dependencies. You can then define an alias (as with a dataset) and add the restricted view to your workspace. You must enable restricted outputs mode and restart your workspace to query the restricted view.
:::callout{theme="neutral"} A restricted view's policy determines what data a user can query. For example, if you import a restricted view into a workspace and another user opens that workspace, they may see different data than you. If your access changes while a restricted view is loaded in a workspace, you will lose access to that workspace and will need to restart the workspace. :::
Filter dataset files¶
Code Workspaces enables you to download the files backing any dataset, whether they are tabular (i.e., they have a schema) or non-tabular. It is possible to select a subset of files to download, either by name, or by applying filtering logic on the file metadata (path, size in bytes, transaction RID, and updated time).
In Jupyter®:
from foundry.transforms import Dataset
# Download all files in the dataset
downloaded_files = Dataset.get("my_alias").files().download()
local_file = downloaded_files["file.pdf"]
# Download a single file
local_file = Dataset.get("my_alias").files().get("file.pdf").download()
# Download all PDF files of less than 1MB in the dataset
downloaded_files = my_dataset.files()\
.filter(lambda f: f.path.endswith(".pdf") and f.size_bytes < 1024 * 1024)
.download()
When downloading multiple files, you should use the filter syntax rather than downloading files individually by name to leverage parallel downloads.
In RStudio®:
# Download all files in the dataset
all_files <- datasets.list_files("my_alias")
downloaded_files <- datasets.download_files("my_alias", all_files)
local_file <- downloaded_files$`file.pdf`
# Download files by name
downloaded_files <- datasets.download_files("my_alias", c("file1.pdf", "file2.pdf"))
# Download all PDF files of less than 1MB in the dataset
all_files <- datasets.list_files("my_alias")
pdf_files <- all_files[sapply(all_files, function(f) f.endswith(".pdf") && f$sizeBytes < 1024*1024)]
downloaded_files <- datasets.download_files("my_alias", pdf_files)
In both cases, downloaded_files will be a map from file name as defined in the current dataset view (which may contain slashes) to the local path where the file was downloaded. Note that this local path may change, so it is recommended to rely on the map keys.
Filter tabular datasets¶
Code Workspaces enables you to apply filters to datasets prior to loading them into memory. This reduces the memory consumption of the dataframes imported into the workspace, enabling you to focus on the subset of data relevant for your analysis. Code Workspaces provides the flexibility to work with a selection of columns, rows, or both.
When to use filters for tabular datasets¶
To ensure that you stay within your workspace's available resources, we recommend the following:
-
When your uncompressed data fits within your workspace's memory, you can load datasets without filters and apply transformations in-memory for maximum efficiency.
-
When your uncompressed data exceeds your workspace's memory, you can use column and row filters to load a subset of the data into memory. These push-down filters are applied before data is loaded into the workspace, reducing the memory footprint of the imported data. The speed of these filter operations depends on the scale and partioning of the data. You can also use SQL with Python to query tabular datasets of any size as long as the resulting data fits within your machine's memory limits.
-
If this approach does not fit your use case, you can also use a Spark-based application to process your data and produce a smaller dataset that you can work with in Code Workspaces. This often speeds up interactive workflows that depend on using the same subset of data frequently or across multiple workspaces, because Parquet files from datasets are downloaded only once while filtering occurs each time data is loaded into the workspace.
-
For more complex data loading requirements, you can download the files backing a dataset using non-tabular dataset syntax and use native
PythonorRpackages to process the file contents.
Row limit¶
It is possible to only load a limited number of rows from a dataset.
In Jupyter®:
from foundry.transforms import Dataset
dogs_subset = Dataset.get("dogs")\
.limit(1000)\
.read_table(format="pandas")
In RStudio®:
library(dplyr) # should be imported by default in .Rprofile
dogs_subset <- Dataset.get("dogs") %>%
head(1000) %>%
collect()
Column filters¶
All tabular datasets can be loaded into a workspace with a subset of columns. Consider the following dataset as an example:
| name | weight | color | age | breed |
|---|---|---|---|---|
| Bella | 60 | Brown | 4 | Labrador |
| Max | 75 | Black | 7 | German Shepherd |
| Daisy | 30 | White | 2 | Poodle |
You may want to load this dataset only with the breed and age columns using the syntax below, assuming a dogs dataset was correctly registered into the workspace:
In Jupyter®:
from foundry.transforms import Dataset, Column
# Only load the "breed" and "age" columns
columns_to_load = ["breed", "age"]
breed_and_age_only = Dataset.get("dogs")\
.select(*columns_to_load)\
.read_table(format="pandas")
# Only load the "weight" and "color" columns
weight_and_color_only = Dataset.get("dogs")\
.select("weight", "color")\
.read_table(format="pandas")
In RStudio®:
library(dplyr) # should be imported by default in .Rprofile
# Only load the "weight" and "color" columns
weight_and_color_only <- Dataset.get("dogs") %>%
select(weight, color) %>%
collect()
Row filters¶
Tabular datasets can also be loaded into a workspace with a subset of rows that meet certain conditions.
:::callout{theme="neutral"} Row filters are applicable only to datasets in Parquet format. Other formats, such as CSV, allow for column filters but not row filters. With the help of a Foundry transform, most tabular datasets can be easily converted to Parquet format. :::
Recall the dogs dataset mentioned earlier:
| name | weight | color | age | breed |
|---|---|---|---|---|
| Bella | 60 | Brown | 4 | Labrador |
| Max | 75 | Black | 7 | German Shepherd |
| Daisy | 30 | White | 2 | Poodle |
| Buddy | 65 | Yellow | 3 | Labrador |
| Gizmo | 18 | Brown | 1 | Pug |
Row filter syntax in Jupyter®¶
The syntax below can be used to filter datasets in Jupyter® at the row level.
You may only load brown-colored dogs from the dogs dataset using the following syntax:
from foundry.transforms import Dataset, Column
# Only load dogs of color "Brown"
brown_dogs = Dataset.get("dogs")\
.where(Column.get("color") == "Brown")\
.read_table(format="pandas")
Notice the use of .where, select, or .limit to pre-filter the dataset before it gets loaded into the workspace. These statements can be chained to apply several conditions at once:
# Only load dogs of color "Brown" and of breed "Labrador"
golden_dogs = Dataset.get("dogs")\
.where(Column.get("color") == "Brown")\
.where(Column.get("breed") == "Labrador")\
.read_table(format="pandas")
Below, you can find more examples of acceptable row filtering syntax supported in Jupyter® Code Workspaces:
# only retain rows equal to a certain value
.where(Column.get("column_name") == value)
# only retain rows not equal to a certain value
.where(Column.get("column_name") != value)
# inequality using the ~ operator
.where(~(Column.get("column_name") == value))
# only retain rows whose value is comparable to another value
.where(Column.get("column_name") > value)
.where(Column.get("column_name") >= value)
.where(Column.get("column_name") < value)
.where(Column.get("column_name") <= value)
# OR / AND operators
.where((Column.get("column_name") == value1) | (Column.get("column_name") == value2))
.where((Column.get("column_name1") == value1) & (Column.get("column_name2") == value2))
# only retain rows whose value is not null
.where(~Column.get("column_name").isnull())
# only retain rows whose value is part of a given list
.where(Column.get("column_name").isin([value1, value2, value3]))
# only retain rows whose date is between two given inclusive bounds
.where(Column.get("date_column_name").between('lower_bound_incl', 'upper_bound_incl'))
# only retain the first N rows, where N is a number. This will be applied before other filters
.limit(N)
# select a subset of columns
.select("column_name1", "column_name2", "column_name3")
Row filter syntax in RStudio®¶
The syntax below can be used to filter datasets in RStudio® at the row level.
Rstudio filters are implemented through the use of the dplyr library and implement the standard methods filter, select, and head. These filters are pushed down, which means they are applied before the data gets loaded into the memory of the workspace.
You may load only brown-colored dogs from the dogs dataset using the following syntax:
library(dplyr) # should be imported by default in .Rprofile
# Only load dogs of color "Brown"
brown_dogs <- Dataset.get("dogs") %>%
foundry::filter(color == "Brown") %>%
collect()
Notice the use of foundry::filter to pre-filter the dataset before it gets loaded into the workspace. Technically, the foundry:: prefix is not required, but we recommend to use it in order to avoid potential conflicts with other similarly named filter functions from other packages in your environment. These filter statements can be chained to apply several conditions at once using the %>% operator from the dplyr library. This library should be imported by default in the .Rprofile file of your RStudio workspace.
library(dplyr) # should be imported by default in .Rprofile
# Only load dogs of color "Brown" and of breed "Labrador"
brown_labradors <- Dataset.get("dogs") %>%
foundry::filter(color == "Brown") %>%
foundry::filter(breed == "Labrador") %>%
collect()
Below, you can find more examples of acceptable row filtering syntax supported in RStudio® Code Workspaces. Column names must be passed to the foundry::filter function without wrapping them with quotation marks.
# only retain rows equal to a certain value
foundry::filter(column_name == "string_value") %>%
foundry::filter(integer_column_name == 4) %>%
# only retain rows not equal to a certain value
foundry::filter(column_name != value) %>%
# inequality using the ! operator
foundry::filter(!(column_name == value)) %>%
# only retain rows whose value is comparable to another value
foundry::filter(column_name > value) %>%
foundry::filter(column_name >= value) %>%
foundry::filter(column_name < value) %>%
foundry::filter(column_name <= value) %>%
# OR / AND operators
foundry::filter(column_name == value1 | column_name == value2) %>%
foundry::filter(column_name == value1 & column_name == value2) %>%
# only retain rows whose value is part of a given list
foundry::filter(column_name %in% c("value1", "value2")) %>%
# only retain rows whose value is not null
foundry::filter(!is.na(column_name)) %>%
# only retain rows whose value is between two given inclusive bounds
foundry::filter(between(age, 2, 4)) %>%
# select a subset of columns
select(column_name1, column_name2) %>% # if set, must include all columns used in `filter` clauses
# only retain the first N rows, where N is a number. This will be applied before other filters
head(N) %>%
Additionally, you may perform advanced data transformations, such as group_by, by temporarily collecting the data as an Arrow table:
library(dplyr) # should be imported by default in .Rprofile
grouped_dogs <- Dataset.get("alias") %>%
# Simple filters can be pushed down
foundry::filter(age > 2) %>%
collect(as_data_frame = FALSE) %>% # temporarily collect the data as an Arrow table
# Advanced transformations need to be applied on the arrow Table
group_by(breed) %>%
collect()
Column and row filters together¶
Column filters and row filters can be used together in order to load in a dataset that has both a subset of its columns and a subset of its rows. Using the dogs dataset mentioned earlier:
| name | weight | color | age | breed |
|---|---|---|---|---|
| Bella | 60 | Brown | 4 | Labrador |
| Max | 75 | Black | 7 | German Shepherd |
| Daisy | 30 | White | 2 | Poodle |
| Buddy | 65 | Yellow | 3 | Labrador |
| Gizmo | 18 | Brown | 1 | Pug |
The syntax below can be used to get a dataset with the name, breed, and color of brown dogs that exceed a given weight.
In Jupyter®:
# Only load dogs whose color is "Brown" and whose weight is above 62
# Only load the columns "name", "breed", and "color"
heavy_brown_dogs = Dataset.get("dogs")\
.where(Column.get("weight") > 62)\
.where(Column.get("color") == "Brown")\
.select("name", "breed", "color")\
.read_table(format="arrow")\
.to_pandas()
In RStudio®:
library(dplyr) # should be imported by default in .Rprofile
# Only load dogs whose color is "Brown" and whose weight is above 62
# Only load the columns "name", "breed", and "color"
heavy_brown_dogs <- Dataset.get("dogs") %>%
foundry::filter(weight > 62) %>%
foundry::filter(color == "Brown") %>%
select("name", "breed", "color") %>%
collect()
Write data¶
:::callout{theme="success"} To schedule your data transformation, view the data lineage, or write incrementally. You can convert your code to a Jupyter® or RStudio transform. :::
You can interactively write Foundry datasets with Code Workspaces by following the steps below.
- Create a target output Dataset by opening the Data tab and selecting the Save to dataset option, which can be found to the right of Import Dataset.
- Select a name for the output dataset as well as a location to save the dataset.
- Select Save.
- A new dataset will appear in the Data tab. By default, the Save to dataset option will be selected, which should be left as such for output datasets.
-
You will also be prompted to specify a dataset alias, which will become the name of the output dataset within the workspace, similarly to how aliases work when importing data.
-
For tabular output datasets, you will also be prompted to specify the dataframe variable which will populate the dataset.
-
For non-tabular datasets, you need instead to specify a local file or folder path to upload to the dataset.
-
Once the dataset type, the dataset alias, and the dataframe variable are set, select Copy and register dataset to register the dataset in the workspace, which will also save the code snippet to your clipboard.
- Paste the code snippet in your workspace, replacing the variable as necessary, and execute the code to write to the output dataset.
Transaction types¶
When writing back interactively, each SDK function call will correspond to one transaction, by default:
- A
SNAPSHOTtransaction will be created when writing back tabular data (output_dataset_tabular.write_table(df_variable)in Python ordatasets.write_table(df_variable, "output_dataset_tabular")in R). - An
UPDATEtransaction will be created when writing back files (output_dataset_non_tabular.upload_directory(path_to_file_variable)in Python ordatasets.upload_files(path_to_file_variable, "output_dataset_non_tabular")in R).
Once the script has been registered as a transform, interactive calls will start writing to a branch prefixed by code-workspace-sandbox/, while the current branch will be updated when the transform runs. In this case, a single transaction will be created for the full script execution, even if there are multiple SDK function calls:
- By default, the transaction will be of type
SNAPSHOT. - If incremental settings have been configured, transaction will be of type
APPEND.
Example code snippets¶
Following the instructions above, assume that two datasets named output_dataset_tabular and output_dataset_non_tabular were created with variables of the same name, and registered in the workspace. Code Workspaces will generate the following code snippets for each dataset based on your chosen variables:
# tabular snippet
from foundry.transforms import Dataset
output_dataset_tabular = Dataset.get("output_dataset_tabular")
output_dataset_tabular.write_table(df_variable)
# non-tabular snippet
from foundry.transforms import Dataset
output_dataset_non_tabular = Dataset.get("output_dataset_non_tabular")
output_dataset_non_tabular.upload_directory(path_to_file_variable)
And in R:
# tabular snippet
datasets.write_table(df_variable, "output_dataset_tabular")
# non-tabular snippet
datasets.upload_files(path_to_file_variable, "output_dataset_non_tabular")
中文翻译¶
与数据交互¶
Jupyter® 和 RStudio® 代码工作区(Code Workspaces)允许您读取、分析、转换数据,并将结果写回 Foundry 数据集、Iceberg 表(Iceberg tables)和虚拟表(virtual tables)。它们还允许您读取和分析受限视图(restricted views)和时间序列属性(time series properties)。
安全性¶
代码工作区(Code Workspaces)遵循应用于数据集的 Foundry 标记(Foundry markings),工作区将继承加载到其中的所有数据集的标记。这被称为工作区谱系(workspace lineage)。这意味着,要访问某个工作区,您还必须拥有该工作区中包含的所有数据集和其他输入所需的权限。如果您失去了对工作区中任何一个输入的访问权限,您将失去对整个工作区的访问权限。
资源考量¶
Jupyter® 和 RStudio® 工作区专为交互式分析工作流和应用程序开发而设计。它们在单个虚拟机上运行,不会使用分布式 Spark 环境处理数据集。因此,您处理的数据必须能够适配您机器的资源,默认情况下最大为 8 个 CPU 和 64 GB 内存。在加载大型数据集之前,使用过滤器(filters)或 SQL 来减少所需的数据量,可以帮助您避免超出这些限制。过滤数据集文件(filtering dataset files)是另一种保持在您工作区限制范围内的方法。
要开发大规模数据管道,请使用带有 Spark 的 Python 转换(Python transforms with Spark)或管道构建器(Pipeline Builder)。
数据集分支¶
默认情况下,代码工作区(Code Workspaces)会从与工作区本身相同的分支加载数据集的数据,否则将回退到 master 分支。例如,当前位于 my-branch 分支的代码工作区将尝试读取数据集的 my-branch 版本,如果该数据集上不存在 my-branch,则会回退到 master 分支。
要为特定导入的数据集固定分支,请选择左侧边栏数据(Data)选项卡中该数据集旁边的 ••• 图标。然后,选择固定数据集分支用于读取(Pin dataset branch for reads),并选择要在代码工作区中使用的所需数据集分支。
读取数据¶
代码工作区(Code Workspaces)要求您为导入到工作区中的每个 Foundry 数据集或受限视图选择一个别名(alias)。该别名作为一个引用,允许您在代码中从数据集或受限视图读取数据,以及向数据集写入数据。在数据(Data)选项卡中注册数据集时,代码工作区会在工作区的 /home/user/repo/.foundry 文件夹下的一个隐藏文件中,在所选数据集别名和 Foundry 资源的唯一标识符之间创建一个映射。
代码工作区(Code Workspaces)允许您加载表格数据集、非表格数据集、Iceberg 表(Iceberg tables)、虚拟表(virtual tables)和受限视图(restricted views)。您可以通过使用数据(Data)选项卡中的添加 > 读取数据(Add > Read data)按钮,向工作区添加新的数据源。
选择要添加到工作区的数据集后,您必须为该资源定义一个别名,作为代码中的唯一标识符和引用。为数据集选择读取策略(例如 pandas DataFrame、Polars LazyFrame 或原始文件访问)会生成一个代码片段,该片段将数据集加载为指定的输入类型。
- 您可以随时修改此读取策略并生成新的代码片段,即使在注册数据集之后也可以。
- 默认情况下,代码工作区会建议一个与数据集本身同名的数据集别名。
选择完成(Done)将完成数据集注册过程,并允许您使用运行代码片段(Run snippet)或复制到剪贴板(Copy to clipboard)选项在代码中使用数据。
表格数据集¶
以下是为一个名为 Cats 的表格数据集生成的代码片段,其数据集别名为 kittens。请注意,Cats 在代码片段中并未被引用;代码工作区(Code Workspaces)会隐式地将其注册在您选择的别名下。
在 Jupyter® 中:
from foundry.transforms import Dataset
kittens = Dataset.get("kittens")\
.read_table(format="arrow")\
.to_pandas()
有关 Jupyter® 工作区中使用的 Dataset 类的方法和属性的完整列表,请参阅 foundry.transforms.Dataset API 参考。
请注意,上面显示的 read_table 方法支持以下参数:
arrow(推荐): 将数据集转换为 Apache Arrow 表,可以对其进行高效过滤。然后,您可以使用.to_pandas()将此表转换为 pandas 数据框。pandas或dataframe: 将数据集转换为 pandas 数据框。polars: 将数据集转换为 Polars 数据框。然后,您可以使用.to_pandas()将其转换为 pandas 数据框。lazy-polars: Polars 数据框的惰性变体。无法对惰性 polars 执行过滤器。path: 输出存储数据集的本地路径。
:::callout{theme="neutral"}
要使特定格式可作为 read_table 操作的一部分使用,环境中必须存在相应的包 pyarrow、pandas 和 polars。这些包已自动包含在默认的代码工作区(Code Workspaces)环境中,但可能需要手动添加到您的自定义环境中。
:::
在 RStudio® 中:
kittens <- datasets.read_table("kittens")
上面的语法加载数据集并自动将数据收集到 R 数据框中。
如果数据超出工作区的内存容量,您可以使用下推过滤器(push-down filters)仅加载行或列的子集,语法如下:
library(dplyr) # should be imported by default in .Rprofile
kittens_df <- Dataset.get("kittens") %>%
# (optional) apply other transformations before collecting
collect()
使用 SQL 查询表格数据集¶
您可以使用 Python 中的 SQL 查询任何支持 SQL 查询的表格数据集、Iceberg 表(Iceberg table)或虚拟表(virtual table)。
以下是为一个名为 Ducks 的表格数据集生成的代码片段,其数据集别名为 ducklings。请注意,Ducks 在代码片段中并未被引用;代码工作区(Code Workspaces)会隐式地将其注册在您选择的别名下。
使用以下方法,每个 SELECT Spark SQL 语句只能查询一个数据集。查询将结果作为 PyArrow 表(PyArrow table)返回,然后您可以将其转换为其他格式,例如 pandas DataFrame。
- 请注意,SQL 查询的默认限制为一百万行。如果您需要一次查询超过一百万行,可以并行化多个查询并合并其结果。确保您的数据能够适配您工作区的内存。
在 Jupyter® 中:
安装 containers-sql 和 foundry-platform-sdk 包后,您可以使用 FoundrySdkSqlExecutor 来编写 Spark SQL:
from containers_sql import FoundrySdkSqlExecutor
sql = FoundrySdkSqlExecutor()
query = sql.execute("SELECT * from `ducklings`")
df = query.fetch_results() # optionally specify timeout arg: query.fetch_results(timeout_in_seconds=60)
您也可以使用魔法命令(magic command),首先在笔记本顶部运行以下代码:
import containers_sql
%reload_ext containers_sql.foundry_sdk.magics
然后,使用 %sql 魔法命令运行单行查询,或使用 %%sql 运行多行查询。使用 %%sql -o df 会将 SQL 查询的结果作为 PyArrow 表分配给 Python 对象 df:
ducklings_df = %sql SELECT hatch_date, bill_color, * from `ducklings`
ducklings_pandas_df = ducklings_df.to_pandas()
%%sql -o ducklings_df
SELECT
hatch_date, bill_color, *
FROM
`ducklings`
LIMIT 10;
在 RStudio® 中:
首先安装 containers-sql、foundry-platform-sdk、pyarrow 和 reticulate。然后使用 reticulate 包配合上述 Python SQL 执行器来查询数据集:
library(reticulate)
FoundrySdkSqlExecutor <- import("containers_sql")$FoundrySdkSqlExecutor
sql <- FoundrySdkSqlExecutor()
query <- sql$execute("SELECT * from `ducklings`")
results_df <- query$fetch_results()
# For smaller data, creating an intermediate R object is acceptable.
r_list <- py_to_r(results_df$to_pydict())
tibble <- as_tibble(r_list)
# For larger data, you may want to write your data to disk
# to avoid exceeding your container's memory limits.
library(arrow)
pa <- import("pyarrow")
pa$feather$write_feather(results_df, "tmp.arrow")
tibble <- read_feather("tmp.arrow") |> as_tibble()
非表格数据集¶
以下是为一个名为 Dogs 的非表格数据集生成的代码片段,其数据集别名为 puppies。请注意,Dogs 在代码片段中并未被引用;代码工作区(Code Workspaces)会隐式地将其注册在您选择的别名下。与读取表格数据集不同,此方法使您能够通过 puppies_files 变量访问数据集中的文件,而不是将值插入到数据框中。
在 Jupyter® 中:
from foundry.transforms import Dataset
puppies = Dataset.get("puppies")
puppies_files = puppies.files().download()
with open(puppies_files["puppy_names.txt"], "rb") as f:
puppy_names_text = f.read().decode("utf-8")
在 RStudio® 中:
puppies_files <- datasets.list_files("puppies")
puppies_local_files <- datasets.download_files("puppies", puppies_files)
puppy_names_text <- readLines(raw_files_local_files[["puppy_names.txt"]], warn = FALSE)
有关如何定位要下载到工作区中的特定文件子集的详细信息,请参阅过滤数据集文件。
受限视图¶
:::callout{theme="neutral"} 查询受限视图仅在 Jupyter® 工作区中受支持。 :::
您可以使用 Python 通过上述 SQL 方法查询受限视图。为此,请将受限视图导入到您的工作区中,并像处理任何其他表格数据集一样,通过其别名进行查询。
请注意,将受限视图导入到您的代码工作区需要启用受限输出模式(restricted outputs mode),以强制执行更高级别的数据安全性。启用受限输出模式后,您可以对受限视图中包含的数据进行分析,但无法从使用该数据的工作区发布输出。
首次向工作区添加受限视图时,系统会提示您安装特定的 Python 依赖项。然后,您可以定义一个别名(与数据集一样),并将受限视图添加到工作区。您必须启用受限输出模式并重新启动工作区才能查询受限视图。
:::callout{theme="neutral"} 受限视图的策略(policy)决定了用户可以查询哪些数据。例如,如果您将受限视图导入到一个工作区,而另一个用户打开了该工作区,他们可能会看到与您不同的数据。如果在工作区中加载了受限视图时您的访问权限发生变化,您将失去对该工作区的访问权限,并且需要重新启动工作区。 :::
过滤数据集文件¶
代码工作区(Code Workspaces)使您能够下载支持任何数据集的文件,无论它们是表格数据集(即具有模式)还是非表格数据集。您可以选择下载文件的子集,可以按名称选择,也可以对文件元数据(路径、字节大小、事务 RID 和更新时间)应用过滤逻辑。
在 Jupyter® 中:
from foundry.transforms import Dataset
# Download all files in the dataset
downloaded_files = Dataset.get("my_alias").files().download()
local_file = downloaded_files["file.pdf"]
# Download a single file
local_file = Dataset.get("my_alias").files().get("file.pdf").download()
# Download all PDF files of less than 1MB in the dataset
downloaded_files = my_dataset.files()\
.filter(lambda f: f.path.endswith(".pdf") and f.size_bytes < 1024 * 1024)
.download()
下载多个文件时,应使用过滤语法,而不是按名称逐个下载文件,以利用并行下载的优势。
在 RStudio® 中:
# Download all files in the dataset
all_files <- datasets.list_files("my_alias")
downloaded_files <- datasets.download_files("my_alias", all_files)
local_file <- downloaded_files$`file.pdf`
# Download files by name
downloaded_files <- datasets.download_files("my_alias", c("file1.pdf", "file2.pdf"))
# Download all PDF files of less than 1MB in the dataset
all_files <- datasets.list_files("my_alias")
pdf_files <- all_files[sapply(all_files, function(f) f.endswith(".pdf") && f$sizeBytes < 1024*1024)]
downloaded_files <- datasets.download_files("my_alias", pdf_files)
在这两种情况下,downloaded_files 将是一个映射,键是当前数据集视图中定义的文件名(可能包含斜杠),值是文件下载到的本地路径。请注意,此本地路径可能会更改,因此建议依赖映射的键。
过滤表格数据集¶
代码工作区(Code Workspaces)使您能够在将数据集加载到内存之前对其应用过滤器。这减少了导入到工作区中的数据框的内存消耗,使您能够专注于与分析相关的数据子集。代码工作区提供了灵活处理列、行或两者选择的能力。
何时对表格数据集使用过滤器¶
为确保您的工作区不超过其可用资源,我们建议如下:
-
当您的未压缩数据能够适配工作区内存时,您可以在不应用过滤器的情况下加载数据集,并在内存中应用转换以获得最高效率。
-
当您的未压缩数据超出工作区内存时,您可以使用列过滤器和行过滤器将数据的子集加载到内存中。这些下推过滤器(push-down filters)在数据加载到工作区之前应用,从而减少了导入数据的内存占用。这些过滤操作的速度取决于数据的规模和分区方式。您还可以使用带有 Python 的 SQL 来查询任何大小的表格数据集,只要结果数据能够适配您机器的内存限制。
-
如果此方法不适合您的用例,您还可以使用基于 Spark 的应用程序来处理您的数据,并生成一个可以在代码工作区(Code Workspaces)中处理的较小数据集。这通常会加速依赖于频繁使用相同数据子集或在多个工作区之间使用相同数据子集的交互式工作流,因为数据集中的 Parquet 文件只下载一次,而每次将数据加载到工作区时都会进行过滤。
-
对于更复杂的数据加载需求,您可以使用非表格数据集语法下载支持数据集的文件,并使用原生的
Python或R包来处理文件内容。
行数限制¶
可以仅从数据集加载有限数量的行。
在 Jupyter® 中:
from foundry.transforms import Dataset
dogs_subset = Dataset.get("dogs")\
.limit(1000)\
.read_table(format="pandas")
在 RStudio® 中:
library(dplyr) # should be imported by default in .Rprofile
dogs_subset <- Dataset.get("dogs") %>%
head(1000) %>%
collect()
列过滤器¶
所有表格数据集都可以加载到工作区中,并带有列的子集。以下面的数据集为例:
| name | weight | color | age | breed |
|---|---|---|---|---|
| Bella | 60 | Brown | 4 | Labrador |
| Max | 75 | Black | 7 | German Shepherd |
| Daisy | 30 | White | 2 | Poodle |
您可能希望仅使用 breed 和 age 列加载此数据集,使用以下语法(假设 dogs 数据集已正确注册到工作区中):
在 Jupyter® 中:
from foundry.transforms import Dataset, Column
# Only load the "breed" and "age" columns
columns_to_load = ["breed", "age"]
breed_and_age_only = Dataset.get("dogs")\
.select(*columns_to_load)\
.read_table(format="pandas")
# Only load the "weight" and "color" columns
weight_and_color_only = Dataset.get("dogs")\
.select("weight", "color")\
.read_table(format="pandas")
在 RStudio® 中:
library(dplyr) # should be imported by default in .Rprofile
# Only load the "weight" and "color" columns
weight_and_color_only <- Dataset.get("dogs") %>%
select(weight, color) %>%
collect()
行过滤器¶
表格数据集也可以加载到工作区中,并带有满足特定条件的行子集。
:::callout{theme="neutral"} 行过滤器仅适用于 Parquet 格式的数据集。其他格式(如 CSV)允许列过滤器,但不允许行过滤器。借助 Foundry 转换,大多数表格数据集可以轻松转换为 Parquet 格式。 :::
回想前面提到的 dogs 数据集:
| name | weight | color | age | breed |
|---|---|---|---|---|
| Bella | 60 | Brown | 4 | Labrador |
| Max | 75 | Black | 7 | German Shepherd |
| Daisy | 30 | White | 2 | Poodle |
| Buddy | 65 | Yellow | 3 | Labrador |
| Gizmo | 18 | Brown | 1 | Pug |
Jupyter® 中的行过滤语法¶
以下语法可用于在 Jupyter® 中对数据集进行行级过滤。
您可以使用以下语法仅从 dogs 数据集中加载棕色的狗:
from foundry.transforms import Dataset, Column
# Only load dogs of color "Brown"
brown_dogs = Dataset.get("dogs")\
.where(Column.get("color") == "Brown")\
.read_table(format="pandas")
注意使用 .where、select 或 .limit 在数据集加载到工作区之前对其进行预过滤。这些语句可以链接起来以同时应用多个条件:
# Only load dogs of color "Brown" and of breed "Labrador"
golden_dogs = Dataset.get("dogs")\
.where(Column.get("color") == "Brown")\
.where(Column.get("breed") == "Labrador")\
.read_table(format="pandas")
下面,您可以找到 Jupyter® 代码工作区(Code Workspaces)支持的更多可接受的行过滤语法示例:
# only retain rows equal to a certain value
.where(Column.get("column_name") == value)
# only retain rows not equal to a certain value
.where(Column.get("column_name") != value)
# inequality using the ~ operator
.where(~(Column.get("column_name") == value))
# only retain rows whose value is comparable to another value
.where(Column.get("column_name") > value)
.where(Column.get("column_name") >= value)
.where(Column.get("column_name") < value)
.where(Column.get("column_name") <= value)
# OR / AND operators
.where((Column.get("column_name") == value1) | (Column.get("column_name") == value2))
.where((Column.get("column_name1") == value1) & (Column.get("column_name2") == value2))
# only retain rows whose value is not null
.where(~Column.get("column_name").isnull())
# only retain rows whose value is part of a given list
.where(Column.get("column_name").isin([value1, value2, value3]))
# only retain rows whose date is between two given inclusive bounds
.where(Column.get("date_column_name").between('lower_bound_incl', 'upper_bound_incl'))
# only retain the first N rows, where N is a number. This will be applied before other filters
.limit(N)
# select a subset of columns
.select("column_name1", "column_name2", "column_name3")
RStudio® 中的行过滤语法¶
以下语法可用于在 RStudio® 中对数据集进行行级过滤。
RStudio 过滤器通过使用 dplyr 库实现,并实现了标准方法 filter、select 和 head。这些过滤器是下推的(pushed down),这意味着它们在数据加载到工作区内存之前应用。
您可以使用以下语法仅从 dogs 数据集中加载棕色的狗:
library(dplyr) # should be imported by default in .Rprofile
# Only load dogs of color "Brown"
brown_dogs <- Dataset.get("dogs") %>%
foundry::filter(color == "Brown") %>%
collect()
注意使用 foundry::filter 在数据集加载到工作区之前对其进行预过滤。从技术上讲,foundry:: 前缀不是必需的,但我们建议使用它,以避免与环境中其他包中名称相似的 filter 函数发生潜在冲突。这些 filter 语句可以使用 dplyr 库中的 %>% 运算符链接起来,以同时应用多个条件。该库应在您的 RStudio 工作区的 .Rprofile 文件中默认导入。
library(dplyr) # should be imported by default in .Rprofile
# Only load dogs of color "Brown" and of breed "Labrador"
brown_labradors <- Dataset.get("dogs") %>%
foundry::filter(color == "Brown") %>%
foundry::filter(breed == "Labrador") %>%
collect()
下面,您可以找到 RStudio® 代码工作区(Code Workspaces)支持的更多可接受的行过滤语法示例。列名必须传递给 foundry::filter 函数,无需用引号括起来。
# only retain rows equal to a certain value
foundry::filter(column_name == "string_value") %>%
foundry::filter(integer_column_name == 4) %>%
# only retain rows not equal to a certain value
foundry::filter(column_name != value) %>%
# inequality using the ! operator
foundry::filter(!(column_name == value)) %>%
# only retain rows whose value is comparable to another value
foundry::filter(column_name > value) %>%
foundry::filter(column_name >= value) %>%
foundry::filter(column_name < value) %>%
foundry::filter(column_name <= value) %>%
# OR / AND operators
foundry::filter(column_name == value1 | column_name == value2) %>%
foundry::filter(column_name == value1 & column_name == value2) %>%
# only retain rows whose value is part of a given list
foundry::filter(column_name %in% c("value1", "value2")) %>%
# only retain rows whose value is not null
foundry::filter(!is.na(column_name)) %>%
# only retain rows whose value is between two given inclusive bounds
foundry::filter(between(age, 2, 4)) %>%
# select a subset of columns
select(column_name1, column_name2) %>% # if set, must include all columns used in `filter` clauses
# only retain the first N rows, where N is a number. This will be applied before other filters
head(N) %>%
此外,您可以通过临时将数据收集为 Arrow 表来执行高级数据转换,例如 group_by:
library(dplyr) # should be imported by default in .Rprofile
grouped_dogs <- Dataset.get("alias") %>%
# Simple filters can be pushed down
foundry::filter(age > 2) %>%
collect(as_data_frame = FALSE) %>% # temporarily collect the data as an Arrow table
# Advanced transformations need to be applied on the arrow Table
group_by(breed) %>%
collect()
列过滤器和行过滤器结合使用¶
列过滤器和行过滤器可以结合使用,以加载同时具有列子集和行子集的数据集。使用前面提到的 dogs 数据集:
| name | weight | color | age | breed |
|---|---|---|---|---|
| Bella | 60 | Brown | 4 | Labrador |
| Max | 75 | Black | 7 | German Shepherd |
| Daisy | 30 | White | 2 | Poodle |
| Buddy | 65 | Yellow | 3 | Labrador |
| Gizmo | 18 | Brown | 1 | Pug |
以下语法可用于获取一个数据集,其中包含超过给定体重的棕色狗的名字、品种和颜色。
在 Jupyter® 中:
# Only load dogs whose color is "Brown" and whose weight is above 62
# Only load the columns "name", "breed", and "color"
heavy_brown_dogs = Dataset.get("dogs")\
.where(Column.get("weight") > 62)\
.where(Column.get("color") == "Brown")\
.select("name", "breed", "color")\
.read_table(format="arrow")\
.to_pandas()
在 RStudio® 中:
library(dplyr) # should be imported by default in .Rprofile
# Only load dogs whose color is "Brown" and whose weight is above 62
# Only load the columns "name", "breed", and "color"
heavy_brown_dogs <- Dataset.get("dogs") %>%
foundry::filter(weight > 62) %>%
foundry::filter(color == "Brown") %>%
select("name", "breed", "color") %>%
collect()
写入数据¶
:::callout{theme="success"} 要调度您的数据转换、查看数据谱系或增量写入,您可以将代码转换为 Jupyter® 或 RStudio 转换。 :::
您可以按照以下步骤,使用代码工作区(Code Workspaces)交互式地写入 Foundry 数据集。
- 创建一个目标输出数据集:打开数据(Data)选项卡,选择保存到数据集(Save to dataset)选项(位于导入数据集(Import Dataset)的右侧)。
- 为输出数据集选择一个名称以及一个保存位置。
- 选择保存(Save)。
- 一个新的数据集将出现在数据(Data)选项卡中。默认情况下,保存到数据集(Save to dataset)选项将被选中,对于输出数据集应保持此状态。
-
系统还会提示您指定一个数据集别名,该别名将成为工作区内输出数据集的名称,类似于导入数据时别名的使用方式。
-
对于表格输出数据集,系统还会提示您指定将填充数据集的数据框变量。
-
对于非表格数据集,您需要指定要上传到数据集的本地文件或文件夹路径。
-
设置好数据集类型、数据集别名和数据框变量后,选择复制并注册数据集(Copy and register dataset)以在工作区中注册数据集,这也会将代码片段保存到您的剪贴板。
- 将代码片段粘贴到您的工作区中,根据需要更改变量,然后执行代码以写入输出数据集。
事务类型¶
当交互式写回时,每个 SDK 函数调用默认对应一个事务:
- 写回表格数据时,将创建一个
SNAPSHOT事务(Python 中为output_dataset_tabular.write_table(df_variable),R 中为datasets.write_table(df_variable, "output_dataset_tabular"))。 - 写回文件时,将创建一个
UPDATE事务(Python 中为output_dataset_non_tabular.upload_directory(path_to_file_variable),R 中为datasets.upload_files(path_to_file_variable, "output_dataset_non_tabular"))。
一旦脚本被注册为转换,交互式调用将开始写入以 code-workspace-sandbox/ 为前缀的分支,而当前分支将在转换运行时更新。在这种情况下,即使有多个 SDK 函数调用,也会为整个脚本执行创建一个单一事务:
- 默认情况下,事务类型为
SNAPSHOT。 - 如果已配置增量设置,事务类型将为
APPEND。
示例代码片段¶
按照上述说明,假设创建了两个名为 output_dataset_tabular 和 output_dataset_non_tabular 的数据集,并使用了相同名称的变量,且已在工作区中注册。代码工作区(Code Workspaces)将根据您选择的变量为每个数据集生成以下代码片段:
# tabular snippet
from foundry.transforms import Dataset
output_dataset_tabular = Dataset.get("output_dataset_tabular")
output_dataset_tabular.write_table(df_variable)
# non-tabular snippet
from foundry.transforms import Dataset
output_dataset_non_tabular = Dataset.get("output_dataset_non_tabular")
output_dataset_non_tabular.upload_directory(path_to_file_variable)
以及在 R 中:
# tabular snippet
datasets.write_table(df_variable, "output_dataset_tabular")
# non-tabular snippet
datasets.upload_files(path_to_file_variable, "output_dataset_non_tabular")