Extract rows from an Excel file(从 Excel 文件中提取行)¶
Supported in: Batch
Reads a dataset of Microsoft Excel files and parses each file into rows. Supported file formats: .xls, .xlt, .xltm, .xltx, .xlsx, .xlsm.
The processing of individual Excel files is not distributed across multiple Spark executors, so we recommend enabling the usage of local Spark in build settings if the input dataset is expected to have exactly one file.
Particularly large Excel files can require a lot of memory to process, so if you observe builds failing with out-of-memory errors, consider using custom build settings with increased executor memory (or increased driver memory in the case of local Spark). For such large files, it may not be possible to preview the output, but deployment can still succeed given appropriate build settings.
Transform categories: File
Declared arguments¶
- Dataset: Dataset of files to process.
Files - Rows to skip: Number of rows to skip at the start of each sheet. If you do not use the "Treat first row (after skipping) as header" option and a header is present, this value should include the header row.
Literal\ - Schema: An ordered list of column names corresponding to data in the sheets that match the specified pattern.
List\> - Source sheet pattern: Data will be extracted from all sheets whose names contain a substring matching this regular expression. If you specify the empty string (the default value for this parameter), data will be extracted from all sheets. To do a full-string match instead of a substring match, you can add ^ at the start and $ at the end of the string.
Literal\ - optional Output column to put the file path: If present, an output column will be created with this column name, containing the path of the parsed file.
Literal\ - optional Output column to put the row number: If present, an output column will be created with this column name, containing the 1-indexed row number of the parsed row.
Literal\ - optional Output column to put the sheet name: If present, an output column will be created with this column name, containing the sheet name of the parsed row.
Literal\ -
optional Treat first row (after skipping) as header: If true, the first row in each sheet after skipping will be treated as a header, and the order of the fields in that header will be used to determine the mapping between the columns in the sheet and the columns in the schema. The order of the fields can differ between sheets (and files), and it is neither necessary for all fields specified in the schema to be present in all sheets nor for all fields present in the Excel sheets to be included in the schema. The strings present in the cell values of the header row will be mapped to the schema column names case-insensitively after applying the following sanitization and disambiguation steps:
-
If present, remove any string-initial sequence of any combination of the characters
(),;{}\n\t=(the first character in this list is the ASCII space). - Replace all remaining instances of
(),;{}\n\t=with underscores. - Replace all consecutive underscores with one underscore.
- If the string now ends with an underscore, remove that underscore.
- If the string's length is now 0, replace the string with
_untitled_column. - After applying steps 1 through 5 to each string, if the same string value (ignoring case) appears more than once, append an underscore and number suffix to the second instance onward (the first suffix added this way will be
_2).
Literal\
中文翻译¶
从 Excel 文件中提取行¶
支持:批处理(Batch)
读取一个包含 Microsoft Excel 文件的数据集(Dataset),并将每个文件解析为行。支持的文件格式:.xls、.xlt、.xltm、.xltx、.xlsx、.xlsm。
单个 Excel 文件的处理不会分布在多个 Spark 执行器(executor)上,因此,如果输入数据集预计只有一个文件,我们建议在构建设置(build settings)中启用本地 Spark(local Spark)。
特别大的 Excel 文件可能需要大量内存来处理,因此,如果您观察到构建因内存不足错误而失败,请考虑使用自定义构建设置(custom build settings),增加执行器内存(executor memory)(如果使用本地 Spark,则增加驱动内存 driver memory)。对于此类大文件,可能无法预览输出,但只要构建设置得当,部署仍然可以成功。
转换类别:文件(File)
声明的参数¶
- 数据集(Dataset): 要处理的文件数据集。
文件(Files) - 跳过的行数(Rows to skip): 每个工作表开头要跳过的行数。如果您不使用"将(跳过后的)第一行视为标题"选项,并且存在标题行,则该值应包含标题行。
字面量\<整数>(Literal\) - 模式(Schema): 一个有序的列名列表,对应于与指定模式匹配的工作表中的数据。
列表\<字面量\<字符串>>(List\>) - 源工作表模式(Source sheet pattern): 将从所有名称包含与此正则表达式匹配的子字符串的工作表中提取数据。如果您指定空字符串(此参数的默认值),则将提取所有工作表中的数据。要执行全字符串匹配而非子字符串匹配,可以在字符串开头添加 ^,在结尾添加 $。
字面量\<字符串>(Literal\) - 可选 用于存放文件路径的输出列(Output column to put the file path): 如果存在,将创建一个以此列名命名的输出列,其中包含已解析文件的路径。
字面量\<字符串>(Literal\) - 可选 用于存放行号的输出列(Output column to put the row number): 如果存在,将创建一个以此列名命名的输出列,其中包含已解析行的行号(从 1 开始)。
字面量\<字符串>(Literal\) - 可选 用于存放工作表名称的输出列(Output column to put the sheet name): 如果存在,将创建一个以此列名命名的输出列,其中包含已解析行的工作表名称。
字面量\<字符串>(Literal\) -
可选 将(跳过后的)第一行视为标题(Treat first row (after skipping) as header): 如果为 true,则每个工作表中跳过后的第一行将被视为标题行,该标题行中的字段顺序将用于确定工作表中的列与模式(Schema)中列之间的映射关系。字段顺序在不同工作表(和文件)之间可以不同,并且模式中指定的所有字段不必都出现在所有工作表中,Excel 工作表中存在的所有字段也不必都包含在模式中。标题行单元格值中的字符串将按以下清理和消歧步骤处理后,以不区分大小写的方式映射到模式列名:
-
如果存在,则移除字符串开头由字符
(),;{}\n\t=(此列表中的第一个字符是 ASCII 空格)任意组合构成的序列。 - 将所有剩余的
(),;{}\n\t=替换为下划线。 - 将所有连续的下划线替换为一个下划线。
- 如果字符串现在以下划线结尾,则移除该下划线。
- 如果字符串长度现在为 0,则将该字符串替换为
_untitled_column。 - 在对每个字符串应用步骤 1 到 5 后,如果相同的字符串值(忽略大小写)出现多次,则从第二次出现的实例开始,追加一个下划线和数字后缀(以此方式添加的第一个后缀将是
_2)。
字面量\<布尔值>(Literal\