
Data Connection

How do you cleanly uninstall an agent from a system?

To cleanly uninstall an agent, refer to the documentation on reinstalling or upgrading the agent in the user interface. Before deleting the agent's directory, make sure to stop all related processes and copy any local settings, such as proxy configurations. Additionally, clear any cron jobs as outlined in the agent setup documentation.

Timestamp: February 13, 2024

What could be causing 'java.lang.NullPointerException' and 'SchemaColumnConvertNotSupportedException' errors when importing datasets from S3 and applying a schema in Foundry?

The problem could be that a column's datatype is interpreted incorrectly when the schema is applied in Foundry. To debug, download the Parquet files and use Python code to read the data with the applied schema. If the error messages mention column names, you can exclude the problematic column from the schema to see whether the remaining columns load correctly.
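The debugging step above can be sketched as a comparison between the types stored in a downloaded Parquet file and the types in the applied schema. This is a minimal sketch, not a Foundry tool: the function names and the sample column types are hypothetical, and reading the file assumes the third-party `pyarrow` package is installed.

```python
from typing import Dict, List, Tuple


def find_type_mismatches(
    file_types: Dict[str, str], applied_types: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (column, type_in_file, type_in_schema) for every disagreement."""
    mismatches = []
    for column, applied in applied_types.items():
        actual = file_types.get(column, "<missing>")
        if actual != applied:
            mismatches.append((column, actual, applied))
    return mismatches


def parquet_column_types(path: str) -> Dict[str, str]:
    """Read only the Parquet footer and map column name -> Arrow type string."""
    import pyarrow.parquet as pq  # third-party; imported lazily on purpose

    schema = pq.read_schema(path)
    return {name: str(dtype) for name, dtype in zip(schema.names, schema.types)}


# Hypothetical values: in practice file_types would come from
# parquet_column_types("downloaded_file.parquet").
file_types = {"id": "int64", "amount": "string", "created_at": "timestamp[us]"}
applied_types = {"id": "int64", "amount": "double", "created_at": "timestamp[us]"}

for column, actual, applied in find_type_mismatches(file_types, applied_types):
    print(f"{column}: file has {actual}, schema expects {applied}")
```

Any column reported here is a candidate to exclude from the schema while checking whether the remaining columns load.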

Timestamp: February 13, 2024

How can I ensure that an S3 bucket only contains the latest exported data when using Foundry, avoiding appending new files to old ones?

To ensure that an S3 bucket contains only the latest exported data, use external transforms to call the AWS APIs directly and implement custom cleanup or pre/post-processing logic, such as deleting everything in the bucket before exporting, creating new directories, or moving objects around. Alternatively, write a script that deletes the contents of the S3 bucket before each export from Foundry.
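A cleanup step of this kind might look like the sketch below. It assumes a `boto3` S3 client is passed in; the function name `clear_prefix` and the bucket and prefix names in the usage note are hypothetical, and how the client is obtained inside an external transform is not shown.

```python
from typing import Any


def clear_prefix(s3_client: Any, bucket: str, prefix: str = "") -> int:
    """Delete every object under `prefix` in `bucket`; return the number deleted."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if not keys:
            continue
        # Each page holds at most 1000 keys, which is also the delete_objects limit.
        s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
        deleted += len(keys)
    return deleted
```

With a real client this could be called as `clear_prefix(boto3.client("s3"), "my-export-bucket", "exports/")` before the export runs. On a versioned bucket this call only adds delete markers, so older object versions remain until they are removed separately.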

Timestamp: February 13, 2024

How do multiple agents on a single sync handle division of labor? Is there any parallelism between the agents?

There is no parallelism between two agents. Syncs are scheduled on available healthy agents either randomly or based on whichever has fewer syncs in queue, which is configurable. Each agent can run a configurable number of syncs concurrently depending on allocated resources.
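The two assignment modes described above can be illustrated with a toy model. This is only a sketch of the idea that each sync lands on exactly one agent; the function name, the strategy names, and the tie-breaking rule are hypothetical and do not reflect the actual scheduler.

```python
import random
from typing import Dict, Optional


def pick_agent(
    queued: Dict[str, int],
    strategy: str = "fewest_queued",
    rng: Optional[random.Random] = None,
) -> str:
    """Choose one healthy agent for a sync; `queued` maps agent name -> syncs in queue."""
    if not queued:
        raise ValueError("no healthy agents available")
    names = sorted(queued)
    if strategy == "random":
        return (rng or random).choice(names)
    if strategy == "fewest_queued":
        # Ties are broken alphabetically in this sketch.
        return min(names, key=lambda name: queued[name])
    raise ValueError(f"unknown strategy: {strategy}")


print(pick_agent({"agent-a": 3, "agent-b": 1}))
```

Whichever strategy is used, the whole sync runs on the chosen agent; the work is never split between agents.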

Timestamp: February 13, 2024

How can I resolve a SQL Server connection timeout due to an IP change? Can I connect using the computer name instead?

The solution is to reconfigure the source with the new details, which includes using the computer name in the URL.

Timestamp: February 13, 2024

Is it possible to automatically create tickets in a ServiceNow instance using the ServiceNow connector?

The ServiceNow connector currently supports only batch syncs. To write to ServiceNow, for example to create tickets automatically, you can build directly against the ServiceNow API using the REST API source type or external transforms.

Timestamp: February 14, 2024

Do Data Connection sources currently inherit Markings from agents?

No, Data Connection sources do not currently inherit Markings from agents.

Timestamp: February 13, 2024

How can we handle the loss of microsecond precision in Data Connection when importing timestamps for incremental syncs to avoid duplicate entries?

Create an additional string column that is a string value of the timestamp, and perform incremental syncs on that string column instead of the original timestamp column.
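The effect can be reproduced outside Foundry with a small sketch. The sample timestamps and the millisecond truncation are assumptions chosen for illustration; the point is that a truncated cursor re-selects rows that were already ingested, while a fixed-width string cursor does not.

```python
from datetime import datetime

rows = [
    datetime(2024, 2, 20, 12, 0, 0, 123456),
    datetime(2024, 2, 20, 12, 0, 0, 123789),
]

# Cursor stored with millisecond precision: the microsecond digits are dropped.
latest = max(rows)
truncated_cursor = latest.replace(microsecond=(latest.microsecond // 1000) * 1000)
# The next run asks for rows newer than the cursor and gets old rows back.
re_ingested = [ts for ts in rows if ts > truncated_cursor]

# Cursor stored as a fixed-width string keeps every digit.
as_text = [ts.strftime("%Y-%m-%d %H:%M:%S.%f") for ts in rows]
string_cursor = max(as_text)
re_ingested_text = [s for s in as_text if s > string_cursor]

print(len(re_ingested), len(re_ingested_text))  # prints: 2 0
```

The string column must be zero-padded and fixed-width so that lexicographic order matches chronological order.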

Timestamp: February 20, 2024

No additional setup for AWS PrivateLink is required if both the Foundry instance and the customer's AWS VPC are in the same region, as AWS transfers data without exposing it to the Internet.

Timestamp: February 13, 2024

Why is a custom plugin not being recognized when running a job?

The issue could be due to differences in the Java versions of the plugin and the bootvisor.

Timestamp: February 23, 2024

How can I set up a new type of data connection using Globus to enable data transfer between blob storage stores?

You will need to integrate with the Globus Python SDK using Python external transforms.

Timestamp: February 13, 2024

Are there any export options available for JDBC sources from Foundry to Microsoft SQL Server?

For JDBC exports, legacy export tasks using the JDBC connector are the only option available for now.

Timestamp: February 21, 2024

Why am I getting an error stating 'The uploaded Jar was not signed correctly' when trying to upload a jar for an Oracle EBS connection?

You must use only Palantir-signed JARs.

Timestamp: February 13, 2024

How can I export datasets larger than the data-proxy limit of 10M rows?

Convert datasets from Parquet to CSV in Foundry transforms, and then use file-based exports (Data Connection exports) to write the data to a file-based destination like S3 or streaming systems like Kafka.

Timestamp: February 13, 2024

Is it possible to update an out-of-the-box HDFS source type to a custom ABFS source type and maintain syncs intact during a migration?

Yes, it is possible to update the source type while keeping syncs intact. We recommend saving the existing configuration and reverting if something breaks. Additionally, try the update on a test source first before applying the changes to the actual source.

Timestamp: February 13, 2024

Why can't we connect to the ABFS source using a shared access signature and a blob SAS token?

If soft delete is enabled for the ABFS source, you cannot connect to it with a shared access signature and a blob SAS token. This is a limitation of the configurations that Azure allows.

Timestamp: April 16, 2024

Can I use legacy tasks to export data to a tabular datasource since the new export framework does not support tabular destinations?

Yes, if your tabular datasource has a JDBC driver, you can use the JDBC export task to export data.

Timestamp: April 25, 2024

How can I connect to MS OneLake?

You can connect to MS OneLake by using the ABFS connector, or by using external transforms and leveraging the Python client provided by OneLake.

Timestamp: June 25, 2025

Can stored procedures in a database be viewed or accessed on the Foundry side when connected through a data connector?

No, stored procedures on the database cannot be viewed or accessed directly on the Foundry side when connected through a data connector, but they can be executed via the "SQL Query" option when configuring a sync.

Timestamp: April 16, 2024

Why does an agent start downloading additional files after initial installation and start up?

Agents need to download updated versions of bootstrapper / bootvisor / agent binaries and initial or updated versions of managed plugin binaries. Some of these are always downloaded, while others are only downloaded if a source of that type is assigned to the agent.

Timestamp: April 24, 2024

How can I set up an incremental ingest on SQL Server CDC tables using a binary type column $start that is not showing up in the incremental section of the sync UI?

Cast the binary type column $start to varchar(max) to avoid truncation and then use the column in the incremental section of the sync UI.

Timestamp: April 16, 2024

How can I correctly use rewritePaths to rename files when exporting data to Azure, and why is it only exporting one file?

You should use the new export functionality for file-based exports, which does not support rewritePaths. Instead, perform any necessary file renaming or data transformations upstream of the export process. This approach is recommended because legacy export tasks are more difficult to configure and debug.

Timestamp: April 16, 2024

How can I migrate an agent between two hosts?

To migrate an agent between two hosts, first shut it down cleanly on the old host by running ./auto_restart.sh clear; ./init.sh stop, which removes the cron job and stops the bootstrapper. Then copy the entire directory containing the agent to the new host with a tool such as scp; this assumes both hosts are up and can reach each other.

Timestamp: April 18, 2024

How can I import data into Foundry using JDBC when the SQL query changes dynamically based on dataset values?

To import data into Foundry, use syncs (extracts), which are supported for JDBC. When the SQL query must change dynamically, write the custom ingestion logic in external transforms instead of using them to modify the sync configuration. This approach is preferred over Data Connection tasks, which are discouraged because of their limitations.
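The core of such custom logic is building the query from values read at run time. The sketch below uses the standard-library `sqlite3` module as a stand-in for the external database and a plain list as a stand-in for values read from a Foundry dataset; the table, columns, and function name are hypothetical, and the external transforms API itself is not shown.

```python
import sqlite3
from typing import Iterable, List, Tuple


def fetch_for_keys(conn: sqlite3.Connection, keys: Iterable[str]) -> List[Tuple]:
    """Build the WHERE clause from run-time values, using bound parameters."""
    keys = list(keys)
    if not keys:
        return []
    placeholders = ", ".join("?" for _ in keys)
    query = (
        "SELECT id, region, amount FROM orders "
        f"WHERE region IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(query, keys).fetchall()


# Stand-in for the external database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 10.0), (2, "US", 20.0), (3, "APAC", 30.0)],
)

# Stand-in for values read from a Foundry dataset.
regions_from_dataset = ["EU", "APAC"]
print(fetch_for_keys(conn, regions_from_dataset))
```

Only the number of placeholders is interpolated into the SQL text; the values themselves are bound as parameters, which avoids injecting dataset contents into the query string.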

Timestamp: April 16, 2024

What is the recommended approach for ingesting array-type columns from Postgres?

The recommended approach is to ingest the array-type columns as strings and then parse them in Pipeline Builder.
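For reference, the parsing logic for an array ingested as a string can be sketched in Python as follows. This is an illustration of the text format only, not a Pipeline Builder expression; it assumes one-dimensional arrays in the usual `{a,b,c}` text form, and the function name is hypothetical.

```python
import csv
from typing import List, Optional


def parse_pg_array(text: str) -> List[Optional[str]]:
    """Parse a one-dimensional array literal such as '{a,"b,c",NULL}'."""
    text = text.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"not an array literal: {text!r}")
    inner = text[1:-1]
    if inner == "":
        return []
    # Elements are comma-separated; double quotes protect commas and
    # a backslash escapes the next character.
    row = next(csv.reader([inner], quotechar='"', escapechar="\\", doublequote=False))
    # Simplification: a quoted "NULL" string is also mapped to None here.
    return [None if item == "NULL" else item for item in row]


print(parse_pg_array('{a,"b,c",NULL}'))
```

Elements come back as strings, so numeric arrays still need a cast after parsing.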

Timestamp: April 24, 2024

What Project-level permissions are required to create an agent?

Owner permissions on the Project are required to create an agent.

Timestamp: April 16, 2024

What should be done when encountering the 'ExplorationRuntimeReadinessService:ExplorationRuntimeNotReady' error while trying to explore JDBC sources?

The issue might be transient and can be fixed by refreshing the service a few times.

Timestamp: April 16, 2024

What should be the SSL parameter for an Oracle JDBC driver connection?

The SSL parameter needed for an Oracle JDBC driver connection is CONNECTION_PROPERTY_THIN_NET_ENCRYPTION_LEVEL.

Timestamp: May 23, 2024

How can I limit the number of files being ingested in an incremental sync and guarantee the order in which the files are chosen?

The filter:

- type: sortByLastModified
  order: DESCENDING

can be used to limit the number of files being ingested and guarantee the order in which the files are chosen.

Timestamp: April 24, 2024

Can Data Connection agents be installed on AWS Fargate (Serverless ECS or EKS)?

AWS Fargate (Serverless ECS or EKS) is not recommended by Palantir as infrastructure for deploying the Data Connection Agents, primarily due to the lack of default volumes attached to them. Choosing to deploy agents in containers using these services is not officially supported.

Timestamp: September 5, 2024

How can we resolve the "Only single connections are supported" error when trying to connect to Databricks using a code-based external transform?

The error is caused by calling get_https_connection() on a non-REST API source. The solution is either to create a REST API source for the Databricks instance or to construct a custom client for the connection. Storing credentials in a "Generic" source or a REST API source is also a viable option.

Timestamp: December 18, 2024

