CSV parsing¶
Foundry supports CSV datasets. These are datasets that contain files in the CSV format.
The CSV format can use different delimiters, quote characters, and escape characters. To handle this variation, you can define parameters that control how CSV files are parsed. These parameters are stored in the schema of a dataset. Foundry uses inference to suggest a sensible set of parameters for a given dataset, but you should validate the results and adjust the parameters if necessary.
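To make the effect of these parameters concrete, the sketch below reads the same raw line with two different delimiter and quote settings. It uses Python's standard `csv` module rather than Foundry's parser, and the sample line and the `parse_line` helper are made up for illustration.

```python
import csv
import io


def parse_line(raw, delimiter, quotechar):
    """Parse one line of text with the given delimiter and quote character."""
    reader = csv.reader(io.StringIO(raw), delimiter=delimiter, quotechar=quotechar)
    return next(reader)


# Hypothetical sample: a semicolon-delimited record whose second field
# contains a comma and is wrapped in single quotes.
raw = "1;'Smith, Jane';42"

# Mismatched parameters: the line is split on the comma inside the name.
as_comma = parse_line(raw, delimiter=",", quotechar='"')

# Matching parameters: three clean fields.
as_semicolon = parse_line(raw, delimiter=";", quotechar="'")

print(as_comma)
print(as_semicolon)
```

With mismatched parameters the record comes back as two fields broken in the middle of the name, which is the kind of result worth looking for when validating inferred settings.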
Parsing in Foundry¶
Foundry CSV datasets generally have the TextDataFrameReader defined as their dataFrameReaderClass in the schema. This reader supports a set of custom parameters that help handle messy data. At execution time, it delegates to the Spark CSV DataFrameReader for the best possible performance and reliability.
Configuration¶
In Foundry, you can view the schema on any dataset in the Dataset Preview application by navigating to the Details tab and selecting Schema. For more details on the schema, see the Dataset documentation.
CSV schemas can be edited in the Edit Schema UI, available from Dataset Preview when viewing the Preview tab. The UI helps you visualize the available options and how they affect the output dataset. When CSV files are particularly malformed, you may need to edit the schema manually to get the desired output.
TextDataFrameReader options¶
To manually configure the TextDataFrameReader options in the schema, you can navigate to the schema page in the Details tab of Dataset Preview and select Edit. At the bottom of the schema, there should be a section as follows:
      "dataFrameReaderClass": "com.palantir.foundry.spark.input.TextDataFrameReader",
      "customMetadata": {
        "textParserParams": {
          "parser": "CSV_PARSER",
          "charsetName": "UTF-8",
          "fieldDelimiter": ",",
          "recordDelimiter": "\n",
          "quoteCharacter": "\"",
          "dateFormat": {},
          "skipLines": 1,
          "jaggedRowBehavior": "THROW_EXCEPTION",
          "parseErrorBehavior": "THROW_EXCEPTION",
          "addFilePath": false,
          "addFilePathInsteadOfUri": false,
          "addImportedAt": false,
          "initialReadTimeout": "1 hour"
        }
      }
    }
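When editing this block by hand, it can help to check the values against the accepted values documented for these options before saving. The following is a minimal sketch of such a check. The `check_text_parser_params` helper is hypothetical and not part of Foundry, and it covers only a few of the options.

```python
ACCEPTED_PARSERS = {
    "CSV_PARSER",
    "MULTILINE_CSV_PARSER",
    "SIMPLE_PARSER",
    "SINGLE_COLUMN_PARSER",
}


def check_text_parser_params(params):
    """Return a list of human-readable problems found in textParserParams."""
    problems = []
    if params.get("parser") not in ACCEPTED_PARSERS:
        problems.append("parser must be one of " + ", ".join(sorted(ACCEPTED_PARSERS)))
    # Delimiter and quote settings are documented as one-character strings.
    for key in ("fieldDelimiter", "quoteCharacter"):
        value = params.get(key)
        if value is not None and len(value) != 1:
            problems.append(key + " must be a one-character string")
    record_delimiter = params.get("recordDelimiter")
    if record_delimiter is not None and not record_delimiter.endswith("\n"):
        problems.append("recordDelimiter must end with a newline character")
    skip_lines = params.get("skipLines", 0)
    if not isinstance(skip_lines, int) or skip_lines < 0:
        problems.append("skipLines must be a non-negative number")
    return problems


good = {
    "parser": "CSV_PARSER",
    "fieldDelimiter": ",",
    "recordDelimiter": "\n",
    "quoteCharacter": "\"",
    "skipLines": 1,
}
bad = dict(good, fieldDelimiter="||", skipLines=-1)

print(check_text_parser_params(good))
print(check_text_parser_params(bad))
```

The first call prints an empty list. The second reports the two-character delimiter and the negative skipLines value.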
The following options are available for the TextDataFrameReader:
| Property | Purpose | Accepted values | Required? | Parsers supported |
|---|---|---|---|---|
| parser | The parser type to use. | CSV_PARSER, MULTILINE_CSV_PARSER, SIMPLE_PARSER, SINGLE_COLUMN_PARSER | Yes | N/A |
| nullValues | The values that should be parsed to null. | A list of strings | Yes | all |
| fieldDelimiter | The delimiter character for splitting a record into multiple fields. | A one-character string | No, defaults to , (comma) | CSV_PARSER, MULTILINE_CSV_PARSER, SIMPLE_PARSER |
| recordDelimiter | The end-of-line symbols for splitting a CSV file into multiple records. | A string that ends with a newline character | No, defaults to \n (newline) | CSV_PARSER, MULTILINE_CSV_PARSER |
| quoteCharacter | The quote character for CSV parsing. | A one-character string | No, defaults to " (double quote) | CSV_PARSER, MULTILINE_CSV_PARSER |
| dateFormat | Format strings for date parsing in certain columns. | A map from column names to JodaTime DateTimeFormat patterns | No, defaults to an empty map | CSV_PARSER, MULTILINE_CSV_PARSER, SIMPLE_PARSER |
| skipLines | The number of lines to skip at the start of each file. | A non-negative number | No, defaults to 0 | all |
| jaggedRowBehavior | Behavior when there are more or fewer columns than types specified in the header. | THROW_EXCEPTION, DROP_ROW | No, defaults to THROW_EXCEPTION | N/A |
| parseErrorBehavior | Behavior when a value fails to parse into the requested type specified in the header. | THROW_EXCEPTION, REPLACE_WITH_NULL | No, defaults to THROW_EXCEPTION | N/A |
| addFilePath | Each row is augmented with a file path. | Boolean | No, defaults to false | all |
| addImportedAt | Each row is augmented with the import time. | Boolean | No, defaults to false | all |
| initialReadTimeout | Limits the time the parser will wait to read the first row. | A human-readable duration | No, defaults to 1 hour | all |
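The jaggedRowBehavior and parseErrorBehavior options are easiest to understand side by side. The sketch below imitates them in plain Python under one plausible reading of the option names: DROP_ROW discards a row with the wrong number of fields, and REPLACE_WITH_NULL substitutes a null for a value that cannot be converted. The `read_rows` function, the sample data, and the column types are all assumptions for illustration. This is not Foundry's implementation.

```python
import csv


def read_rows(text, column_types, jagged_row_behavior="THROW_EXCEPTION",
              parse_error_behavior="THROW_EXCEPTION", skip_lines=0):
    """Illustrative reader that mimics the two error-handling options."""
    rows = []
    lines = text.splitlines()[skip_lines:]  # skipLines: ignore leading lines
    for fields in csv.reader(lines):
        if len(fields) != len(column_types):
            if jagged_row_behavior == "DROP_ROW":
                continue  # silently discard the malformed row
            raise ValueError("jagged row: " + repr(fields))
        parsed = []
        for raw, convert in zip(fields, column_types):
            try:
                parsed.append(convert(raw))
            except ValueError:
                if parse_error_behavior == "REPLACE_WITH_NULL":
                    parsed.append(None)  # keep the row, null out the bad value
                else:
                    raise
        rows.append(parsed)
    return rows


# Hypothetical file: a header line, one clean row, one short row,
# and one row whose last value is not an integer.
sample = "id,name,age\n1,Ann,34\n2,Bob\n3,Cy,unknown\n"
types = [int, str, int]

lenient = read_rows(sample, types, jagged_row_behavior="DROP_ROW",
                    parse_error_behavior="REPLACE_WITH_NULL", skip_lines=1)
print(lenient)
```

With the lenient settings the short row disappears and the unparseable age becomes `None`. With the default THROW_EXCEPTION settings the same input raises on the short row, which is the safer choice when silent data loss is unacceptable.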
Spark CSV options¶
If you are already familiar with the Spark CSV DataFrameReader, you can set the dataFrameReaderClass to DataSourceDataFrameReader and set the format to csv in the customMetadata.
See the Spark CSV DataFrameReader documentation for a list of supported options. You can add the options as key-value pairs, as in the following example:
    "dataFrameReaderClass": "com.palantir.foundry.spark.input.DataSourceDataFrameReader",
    "customMetadata": {
      "format": "csv",
      "options": {
        "unescapedQuoteHandling": "STOP_AT_DELIMITER",
        "multiline": true,
        ...
      }
    }
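If you prepare schema fragments in a script before pasting them into the schema editor, the structure above can be assembled as a plain dictionary and serialized. This is a minimal sketch. The `spark_csv_reader_fragment` helper is hypothetical, and the two options shown are the ones from the example above.

```python
import json


def spark_csv_reader_fragment(options):
    """Build the schema fragment that selects the Spark CSV reader."""
    return {
        "dataFrameReaderClass": "com.palantir.foundry.spark.input.DataSourceDataFrameReader",
        "customMetadata": {
            "format": "csv",
            # Copy the options so later changes by the caller do not leak in.
            "options": dict(options),
        },
    }


fragment = spark_csv_reader_fragment({
    "unescapedQuoteHandling": "STOP_AT_DELIMITER",
    "multiline": True,
})
print(json.dumps(fragment, indent=2))
```

Serializing with `json.dumps` turns the Python `True` into the JSON literal `true` and escapes any quote or backslash characters in option values, which is easy to get wrong when typing them by hand.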
:::callout{theme="neutral"}
Note that the schema options listed above apply only to datasets constructed from CSV files.
:::