
Dataset Preview FAQ

The following are some frequently asked questions about Dataset Preview.

For general information, see the Dataset Preview overview.


CSV fails to parse due to unmatched quote character

When you upload a CSV that contains nested double quotes and embedded newline (\n) characters within quoted fields, schema inference fails, and you cannot use the Schema Editor to create a valid schema.
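
To see why such files break line-oriented parsing, the following minimal Python sketch compares a naive split on newlines with a quote-aware parse. It uses only Python's standard csv module and made-up sample data. It illustrates the general problem and is not the parser the platform uses.

```python
import csv
import io

# Hypothetical sample: the second field of the data row contains an
# embedded newline and doubled (escaped) quotes inside a quoted field.
SAMPLE = 'id,comment\n1,"She said ""hi""\nand left"\n'


def naive_rows(text):
    """Split on newlines first, then on commas, ignoring quoting."""
    return [line.split(",") for line in text.splitlines()]


def quote_aware_rows(text):
    """Parse with csv.reader, which keeps a quoted newline inside one field."""
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    # The naive split sees one record too many because it cuts the quoted field.
    print(len(naive_rows(SAMPLE)))
    # The quote-aware parse keeps the header and one complete data record.
    print(quote_aware_rows(SAMPLE))
```

The multiLine option set in the steps below addresses the same situation: a record is allowed to span more than one physical line.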

To troubleshoot, perform the following steps:

  1. Upload the CSV to a dataset and select Edit Schema.
  2. Set the Column and Quotes properties.
  3. Ignore the schema validation errors and select Save without Validating, available in the dropdown menu next to the Save and Validate option. This will create a schema with the correct column definitions.
  4. In the Details tab, open the schema in edit mode.
  5. Change dataFrameReaderClass to com.palantir.foundry.spark.input.DataSourceDataFrameReader.
  6. Add the following to the "customMetadata" object:
  "customMetadata": {
    "format": "csv",
    "options": {
      "header": true,
      "multiLine": true
    }
  }
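
Steps 5 and 6 can also be pictured as a small transformation of the schema JSON. The following Python sketch assumes that the schema is a JSON object with top-level "dataFrameReaderClass" and "customMetadata" keys and that any existing customMetadata content is replaced. The function name with_multiline_csv_reader is hypothetical. The sketch only edits a local copy of the JSON; the result still has to be entered in the schema editor.

```python
import copy
import json

READER_CLASS = "com.palantir.foundry.spark.input.DataSourceDataFrameReader"


def with_multiline_csv_reader(schema):
    """Return a copy of a schema dict with the reader settings from steps 5 and 6.

    Assumption: "dataFrameReaderClass" and "customMetadata" are top-level keys,
    and the previous "customMetadata" content is replaced entirely.
    """
    updated = copy.deepcopy(schema)
    updated["dataFrameReaderClass"] = READER_CLASS
    updated["customMetadata"] = {
        "format": "csv",
        "options": {"header": True, "multiLine": True},
    }
    return updated


if __name__ == "__main__":
    # Hypothetical starting schema; a real schema contains more keys.
    original = {
        "fieldSchemaList": [],
        "dataFrameReaderClass": "placeholder",
        "customMetadata": {},
    }
    print(json.dumps(with_multiline_csv_reader(original), indent=2))
```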



Dataset fails to parse because some underlying CSVs have more columns

When a dataset is composed of multiple CSVs (for example, through a Data Connection APPEND transaction), and some of those CSVs contain more columns than others, schema inference fails. One option is to ignore jagged rows (such as rows that are missing certain columns). To do this, select Edit Schema, expand the Parsing options section, and check Ignore jagged rows.

However, this section applies if you want to keep the jagged rows and specify a standardized schema for the dataset. If the conditions in the Assumptions section below hold for your data, the troubleshooting steps produce a dataset with a user-defined standard schema, in which jagged rows are populated with null for the columns they are missing.

Symptoms of parsing failures:

You may encounter an error message such as the following: "Could not load preview: Encountered an error parsing the input CSV data. Make sure all data is correctly formatted."

You may also encounter the following error message after selecting Edit Schema and then Save and Validate: "Your dataset failed to validate on x rows."

Assumptions:

  1. You can define the desired schema, such as all column names and types that the dataset should possess.

  2. Your schema enforces strict column ordering. For example, if you want the dataset to contain and show columns {a, b, c}, an underlying CSV can have a column structure like:

     • {a}
     • {a, b}

     but cannot have a column structure like:

     • {b, a}

  3. If an underlying CSV has the (n+1)-th column, it must also have all n columns that precede it. For example, if you want the dataset to contain and show columns {a, b, c, d}, an underlying CSV can have a column structure like:

     • {a}
     • {a, b}
     • {a, b, c}

     but cannot have a column structure like:

     • {b, c, d}
     • {a, c, d}
     • {a, b, d}
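
Taken together, the ordering assumptions amount to requiring that each CSV header is a leading prefix of the desired column list. The following Python sketch checks that condition. The function name is_supported_header is hypothetical, and treating an empty header as unsupported is an assumption of the sketch.

```python
def is_supported_header(header, schema_columns):
    """Return True if header is a non-empty leading prefix of schema_columns."""
    header = list(header)
    schema_columns = list(schema_columns)
    if not header or len(header) > len(schema_columns):
        return False
    # Same names, in the same order, starting from the first column.
    return header == schema_columns[: len(header)]


if __name__ == "__main__":
    schema = ["a", "b", "c", "d"]
    for candidate in (["a"], ["a", "b", "c"], ["b", "a"], ["a", "c", "d"]):
        print(candidate, is_supported_header(candidate, schema))
```

Running such a check over the header of every underlying file before editing the schema shows whether the troubleshooting steps apply to the dataset.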

The following example shows a case in which the troubleshooting steps apply:

CSVs are regularly added to a dataset through APPEND transactions. One day, a new column is added and is the new last column in the CSV. In the dataset, rows from all previously-appended CSVs should have the new column, with field values autopopulated with null rather than being considered jagged.
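
The intended result of this example can be pictured with a short Python sketch that pads older rows with None for the columns they lack. It imitates the outcome only and says nothing about how the platform's reader is implemented. The function name pad_rows and the sample data are hypothetical.

```python
import csv
import io


def pad_rows(csv_text, schema_columns):
    """Parse one CSV with a header row and fill the columns it lacks with None."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)
    # Assumption from this section: the header is a leading prefix of the schema.
    if header != schema_columns[: len(header)]:
        raise ValueError(f"unsupported column structure: {header}")
    records = []
    for row in reader:
        # Pad each row up to the schema width; values stay as strings.
        padded = row + [None] * (len(schema_columns) - len(row))
        records.append(dict(zip(schema_columns, padded)))
    return records


if __name__ == "__main__":
    schema = ["a", "b", "c"]
    older_file = "a,b\n1,2\n"        # appended before column c existed
    newer_file = "a,b,c\n3,4,5\n"    # appended after column c was added
    print(pad_rows(older_file, schema) + pad_rows(newer_file, schema))
```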

The troubleshooting steps do not replicate the functionality of the mergeSchema option available for raw Parquet datasets (which are parsed with com.palantir.foundry.spark.input.ParquetDataFrameReader as the dataFrameReaderClass). A user-written transform is required to replicate such functionality on a raw CSV dataset.

To troubleshoot, perform the following steps:

  1. In the Details tab, open the Schema tab, and select Edit.
  2. Modify the fieldSchemaList to ensure it includes all the columns that the dataset should possess. For example, if the dataset should have columns {a, b, c}, all with integer types, the fieldSchemaList may look like the following:
  "fieldSchemaList": [
    {
      "type": "INTEGER",
      "name": "a",
      "nullable": null,
      "userDefinedTypeClass": null,
      "customMetadata": {},
      "arraySubtype": null,
      "precision": null,
      "scale": null,
      "mapKeyType": null,
      "mapValueType": null,
      "subSchemas": null
    },
    {
      "type": "INTEGER",
      "name": "b",
      "nullable": null,
      "userDefinedTypeClass": null,
      "customMetadata": {},
      "arraySubtype": null,
      "precision": null,
      "scale": null,
      "mapKeyType": null,
      "mapValueType": null,
      "subSchemas": null
    },
    {
      "type": "INTEGER",
      "name": "c",
      "nullable": null,
      "userDefinedTypeClass": null,
      "customMetadata": {},
      "arraySubtype": null,
      "precision": null,
      "scale": null,
      "mapKeyType": null,
      "mapValueType": null,
      "subSchemas": null
    }
  ],
  3. Change "dataFrameReaderClass" and the accompanying customMetadata object so that the end of your schema JSON looks like the following:
  "dataFrameReaderClass": "com.palantir.foundry.spark.input.DataSourceDataFrameReader",
  "customMetadata": {
    "format": "csv",
    "options": {
      "multiLine": true,
      "header": true,
      "mode": "PERMISSIVE"
    }
  }
}
  4. Select Save.
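
Writing the fieldSchemaList by hand is repetitive when a dataset has many columns. The following Python sketch generates entries with the same keys as the example in the steps above. The helper names field_schema and build_field_schema_list are hypothetical, and only the INTEGER type name is taken from this page; other type names would need to be checked against the platform documentation.

```python
import json


def field_schema(name, type_name):
    """Build one fieldSchemaList entry with the keys shown in the example."""
    return {
        "type": type_name,
        "name": name,
        "nullable": None,
        "userDefinedTypeClass": None,
        "customMetadata": {},
        "arraySubtype": None,
        "precision": None,
        "scale": None,
        "mapKeyType": None,
        "mapValueType": None,
        "subSchemas": None,
    }


def build_field_schema_list(columns):
    """Turn (name, type) pairs into a fieldSchemaList, preserving column order."""
    return [field_schema(name, type_name) for name, type_name in columns]


if __name__ == "__main__":
    columns = [("a", "INTEGER"), ("b", "INTEGER"), ("c", "INTEGER")]
    # json.dumps renders Python None as JSON null, matching the example.
    print(json.dumps({"fieldSchemaList": build_field_schema_list(columns)}, indent=2))
```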



Why does my downloaded dataset look strange when I open it in Microsoft Excel?

When you export datasets from the platform and open them in Excel, the contents of some files may appear squeezed together. This issue has been observed in certain regions and is caused by the default delimiter that Excel uses. To fix it, split the data on the correct delimiter in Excel:

  1. Open the file in Excel.
  2. Select the Data tab in the ribbon.
  3. Select the Text to Columns option in the Data Tools group.
  4. In the Convert Text to Columns Wizard window, select the Delimited option and then Next.
  5. In the Delimiters section, select the delimiter you want to use (such as comma, semicolon, or tab).
  6. Select Next and then Finish to complete the process.
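
If it is unclear which delimiter the downloaded file uses, a rough check of its first line can help with step 5. The following Python sketch counts a few candidate delimiters and returns the most frequent one. The function name guess_delimiter is hypothetical, and the heuristic ignores quoting, so a header with delimiters inside quoted fields can mislead it.

```python
def guess_delimiter(first_line, candidates=(",", ";", "\t")):
    """Return the candidate that occurs most often in first_line, or None."""
    counts = {candidate: first_line.count(candidate) for candidate in candidates}
    best = max(counts, key=counts.get)
    # If no candidate occurs at all, there is nothing to report.
    return best if counts[best] > 0 else None


if __name__ == "__main__":
    print(repr(guess_delimiter("id;name;city")))
    print(repr(guess_delimiter("id,name,city")))
    print(repr(guess_delimiter("id\tname\tcity")))
```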


