R Filesystem API

R TransformInput object

The interface for low-level operations on a Foundry dataset.

spark.df()

data.frame()

fileSystem()

  • Returns a FileSystem object for direct FoundryFS access.

R TransformOutput object

The interface for low-level write operations on a Foundry dataset.

write.spark.df(df, partition_cols=NULL, bucket_cols=NULL, bucket_count=NULL, sort_by=NULL)

  • Write the given DataFrame to the output dataset.

    Parameters
    • df (pyspark.sql.DataFrame) – The PySpark dataframe to write.
    • partition_cols (List[str], optional) – The columns by which to partition the data when writing.
    • bucket_cols (List[str], optional) – The columns by which to bucket the data. Must be specified if bucket_count is given.
    • bucket_count (int, optional) – The number of buckets. Must be specified if bucket_cols is given.
    • sort_by (List[str], optional) – The columns by which to sort the bucketed data.

write.data.frame(rdf)

fileSystem()

  • Returns a FileSystem object for direct FoundryFS access.
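As a rough illustration of how the TransformInput and TransformOutput methods above might be combined, the sketch below reads an input into memory, keeps the first few rows, and writes them to an output. The function name copy_first_rows and the argument names input_dataset and output_dataset are hypothetical, and the sketch assumes that data.frame() returns a base R data.frame, which this page does not state explicitly.

```r
## Sketch only: `input_dataset` and `output_dataset` are assumed to be the
## TransformInput and TransformOutput objects passed to a transform.
copy_first_rows <- function(input_dataset, output_dataset, n = 100) {
    ## Read the whole input into memory (assumed to be a base R data.frame)
    rdf <- input_dataset$data.frame()

    ## Keep at most the first n rows
    subset_rdf <- head(rdf, n)

    ## Write the result to the output dataset
    output_dataset$write.data.frame(subset_rdf)

    ## Return the number of rows written, invisibly
    invisible(nrow(subset_rdf))
}
```

Because data.frame() loads everything into memory, this pattern is only likely to suit inputs that fit comfortably in the driver's memory; spark.df() and write.spark.df() are the alternatives listed above for larger data.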

R FileSystem object

ls(glob=NULL, regex='.*', show_hidden=FALSE)

  • List all files matching the given pattern (either glob or regex), relative to the root directory of the dataset.

    Parameters
    • glob (str, optional) – A Unix file-matching pattern. Also supports globstar (**).
    • regex (str, optional) – A regex pattern against which to match filenames.
    • show_hidden (bool, optional) – Include hidden files, those prefixed with ‘.’ or ‘_’.
    Returns An R array of FileStatus named tuples (path, size, modified): the logical path, the file size in bytes, and the modified timestamp (ms since January 1, 1970 UTC)
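The following sketch shows one way ls() might be used to summarize the CSV files at the root of a dataset. It assumes that each element of the returned array exposes path, size, and modified by name (for example as a named list); the function name list_csv_files and the argument name input_dataset are hypothetical.

```r
## Sketch only: assumes each FileStatus element can be accessed by name
## with `$path`, `$size`, and `$modified`.
list_csv_files <- function(input_dataset) {
    fs <- input_dataset$fileSystem()

    ## Match CSV files in the dataset root; hidden files are excluded by default
    files <- fs$ls(glob = "*.csv")

    ## Collect the fields into a data.frame, converting the millisecond
    ## timestamp into a POSIXct value in UTC
    modified_ms <- vapply(files, function(f) as.numeric(f$modified), numeric(1))
    data.frame(
        path = vapply(files, function(f) as.character(f$path), character(1)),
        size_bytes = vapply(files, function(f) as.numeric(f$size), numeric(1)),
        modified = as.POSIXct(modified_ms / 1000, origin = "1970-01-01", tz = "UTC"),
        stringsAsFactors = FALSE
    )
}
```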

open(path, open='r', disk_optimal=FALSE, encoding=default)

  • Open a FoundryFS file in the given mode.

    Parameters
    • path (str) – The logical path of the file in the dataset. (Remote path)
    • open (str) – A description of the mode in which to open the connection.
    • disk_optimal (bool, optional) – Controls how FoundryFileSystem handles file I/O. See "Advanced topic: disk_optimal setting".
    • encoding (str, optional) – Defaults to the R language default (UTF-8).
    Returns An R connection object
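Since open() returns an R connection, it can be handed to base R readers that accept connections. The sketch below reads a CSV file this way; the function name read_csv_from_dataset and its arguments are hypothetical, and it relies on the default disk_optimal = FALSE so that the file is fully downloaded before it is read.

```r
## Sketch only: `input_dataset` is assumed to be a TransformInput object and
## `path` the logical path of a CSV file inside it.
read_csv_from_dataset <- function(input_dataset, path) {
    fs <- input_dataset$fileSystem()

    ## Open the file for reading in text mode
    conn <- fs$open(path, "r")

    ## Make sure the connection is closed even if parsing fails
    on.exit(close(conn))

    ## read.csv() accepts an already-open text-mode connection
    read.csv(conn, stringsAsFactors = FALSE)
}
```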

get_path(path, open='r', disk_optimal=FALSE, encoding=default)

  • For a given FoundryFS (remote) path, returns the local temporary path.

    Parameters
    • path (str) – The logical path of the file in the dataset. (Remote path)
    • open (str) – A description of the mode in which to open the connection.
    • disk_optimal (bool, optional) – Controls how FoundryFileSystem handles file I/O. See "Advanced topic: disk_optimal setting".
    • encoding (str, optional) – Defaults to the R language default (UTF-8).
    Returns str
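get_path() is likely most useful with readers that expect a file path rather than a connection. As a minimal sketch, the function below loads a serialized R object with readRDS(); the function name read_rds_from_dataset, its arguments, and the presence of an .rds file in the dataset are all assumptions made for illustration.

```r
## Sketch only: `remote_path` is assumed to point at a file written with saveRDS().
read_rds_from_dataset <- function(input_dataset, remote_path) {
    fs <- input_dataset$fileSystem()

    ## Resolve the remote path to a local temporary path; with the default
    ## disk_optimal = FALSE the file is downloaded before it is accessed
    local_path <- fs$get_path(remote_path, "r")

    ## Any path-based reader can now be used on the local copy
    readRDS(local_path)
}
```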

upload(local_path, remote_path)

  • Upload the file at the local path to the remote path. Write only.

    Parameters
    • local_path (str) – The local path of the file to upload.
    • remote_path (str) – The logical path of the file in the dataset.
    Returns None
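A common pattern with upload() is to write a file to local temporary storage first and then push it to the dataset. The sketch below does this for a CSV export; the function name upload_data_frame_as_csv and its arguments are hypothetical, and output_dataset is assumed to be a TransformOutput object.

```r
## Sketch only: writes `rdf` to a temporary CSV file and uploads it to
## `remote_path` in the output dataset.
upload_data_frame_as_csv <- function(output_dataset, rdf, remote_path) {
    fs <- output_dataset$fileSystem()

    ## Write the data to a local temporary file and clean it up afterwards
    local_path <- tempfile(fileext = ".csv")
    on.exit(unlink(local_path))
    write.csv(rdf, local_path, row.names = FALSE)

    ## Copy the local file to its logical path in the dataset
    fs$upload(local_path, remote_path)

    invisible(remote_path)
}
```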

Advanced topic: disk_optimal setting

In the FileSystem methods open() and get_path(), the disk_optimal argument controls how file input and output (I/O) is handled.

By default, disk_optimal is set to FALSE in both open() and get_path(). In this mode, files are guaranteed to be downloaded before they are accessed.

If you set disk_optimal to TRUE, files are downloaded while the code executes rather than beforehand. The temporary local path must then be opened via fifo() in order to be read correctly. Note that not all libraries support reading this type of file.

You may choose to set disk_optimal to TRUE when the file you are reading is very large.

For example, suppose you have a very large text file and want to read only the first 10 lines. The code below prints those 10 lines without reading the entire file.

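disk_optimal_example <- function(large_txt_file) {
    fs <- large_txt_file$fileSystem()

    ## Open a connection to the text file, which is titled large_txt_file.txt.
    ## With disk_optimal = TRUE, the file is downloaded while it is being read.
    conn <- fs$open("large_txt_file.txt", "r", disk_optimal = TRUE)

    ## Read and print only the first 10 lines
    first_lines <- readLines(conn, n = 10)
    print(first_lines)

    ## Close the connection once the lines have been read
    close(conn)
    return(NULL)
}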

If you want to use R TransformOutput to write a file and then read it back, disk_optimal must be set to FALSE.
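A minimal sketch of this write-then-read pattern is shown below. It assumes that open() accepts the write mode "w" on an output's FileSystem and that the file can be reopened for reading within the same transform; the function name write_then_read, the file name notes.txt, and the argument names are hypothetical.

```r
## Sketch only: writes a small text file to the output dataset and reads it
## back, keeping disk_optimal = FALSE for both steps.
write_then_read <- function(output_dataset, lines = c("first line", "second line")) {
    fs <- output_dataset$fileSystem()

    ## Write the lines and close the connection so the content is flushed
    conn <- fs$open("notes.txt", "w", disk_optimal = FALSE)
    writeLines(lines, conn)
    close(conn)

    ## Reopen the same logical path for reading, again with disk_optimal = FALSE
    conn <- fs$open("notes.txt", "r", disk_optimal = FALSE)
    on.exit(close(conn))
    readLines(conn)
}
```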

