Troubleshooting reference¶
This page describes several common sync issues and the steps to debug them:
- PKIX and SSL exceptions
- Egress proxy issues
- Incremental JDBC sync issues
- Intermittent sync failures or hangs
- A sync started failing
- Unexpected data was synced
PKIX and SSL exceptions¶
For certificate and TLS troubleshooting, including PKIX and SSLHandshakeException errors, see Certificates and TLS.
If your sync fails with the error Response 421 received. Server closed connection, you may be connecting with an unsupported SSL protocol and port combination. One example is implicit FTPS over port 991, which is an outdated and unsupported standard. Explicit SSL over port 21 is the preferred method in this case.
Egress proxy issues¶
FTP syncs¶
If your sync is an FTP/S sync, ensure that you are not using an egress proxy load balancer. FTP is a stateful protocol, so using a load balancer can cause the sync to fail if sequential requests don't originate from the same IP.
Note that due to the nature of load balancing, failures will be non-deterministic; syncs and previews may sometimes succeed, even with the load-balancing proxy in place.
Additionally, HTTP proxies do not support active mode transfers for FTP and FTPS. Active mode requires the FTP server to connect back to the client, which is not possible through an HTTP proxy. When using HTTP proxies with FTP/FTPS connections, only passive mode transfers are supported. Ensure your FTP source is configured to use passive mode when connecting through an HTTP proxy.
S3 syncs¶
If your sync or exploration is failing with the error com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: null; S3 Extended Request ID: null), the request is failing on its way through the egress proxy. If you receive this error, check whether any of the following scenarios apply:
- The Region field is empty. This is required when the S3 bucket you are connecting to is in a different region than the proxy.
- STS is unreachable due to not being allowlisted.
- The S3 URL is unreachable due to not being allowlisted.
- The STS credentials are invalid or you are unable to assume the IAM role.
- The sync needs to use VPC instead of proxy; to resolve the S3 endpoint, address(es) must be excluded from the proxy by adding the following configurations:
  - To the S3 source proxyConfiguration block, add:
    host: <address of deployment gateway or egress NLB>
    port: 8888
    protocol: http
    nonProxyHosts: <bucket>.s3.<region>.amazonaws.com,s3.<region>.amazonaws.com,s3-<region>.amazonaws.com
  - For example, allowlisting all VPC buckets would involve a config addition of:
    clientConfiguration:
      proxyConfiguration:
        host: <color>-egress-proxy.palantirfoundry.com
        port: 8888
        protocol: http
        nonProxyHosts: *.s3.<region>.amazonaws.com, s3.<region>.amazonaws.com, s3-<region>.amazonaws.com
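The nonProxyHosts value follows a fixed pattern per region, so it can be generated instead of typed by hand. The helper below is a minimal sketch; the function name and the example region are illustrative assumptions, not part of the product.

```python
def s3_non_proxy_hosts(region: str, bucket: str = "*") -> str:
    """Return a comma-separated nonProxyHosts value for one AWS region.

    bucket="*" matches every bucket in the region; pass a bucket name
    to exclude only that bucket's endpoint from the proxy.
    """
    hosts = [
        f"{bucket}.s3.{region}.amazonaws.com",
        f"s3.{region}.amazonaws.com",
        f"s3-{region}.amazonaws.com",
    ]
    return ",".join(hosts)


# Hypothetical region, used only for illustration.
print(s3_non_proxy_hosts("eu-west-1"))
```

Paste the printed value into the nonProxyHosts field of the proxy configuration.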
If your sync or exploration is failing with the error Could not list objects in bucket, check whether both the region and signing region are configured. Setting both values can cause this error; configure only the signing region to resolve the issue.
Incremental JDBC sync issues¶
To see the exact query that ran against your source system, refer to _data-ingestion.log.
If your sync is an incremental sync, ensure you have provided a monotonically increasing column (e.g. timestamp or id) and an initial value for this column.
Once you've chosen the incremental column, make sure you have added the ? placeholder to the SQL query in the sync configuration page. The ? is replaced with the incremental value, and only a single ? may be used. For example, SELECT * FROM table WHERE id > ?.
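The mechanism can be reproduced locally with SQLite, which happens to use the same ? placeholder syntax. This is only a simulation under assumed behavior: the table, the column names, and the way the cursor advances to the largest value seen are illustrative, not a description of Data Connection internals.

```python
import sqlite3


def run_incremental_sync(conn: sqlite3.Connection, cursor_value: int):
    """Run one incremental pass and return (rows, new_cursor_value)."""
    # The single ? is bound to the current incremental value.
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (cursor_value,),
    ).fetchall()
    # Advance the cursor to the largest id seen; keep it if nothing new arrived.
    new_cursor = max((row[0] for row in rows), default=cursor_value)
    return rows, new_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# First run starts from the configured initial value (0 here).
first_rows, cursor_value = run_incremental_sync(conn, 0)

# A new row arrives; the second run only picks up ids above the cursor.
conn.execute("INSERT INTO events VALUES (4, 'd')")
second_rows, cursor_value = run_incremental_sync(conn, cursor_value)
```

After the second run, second_rows contains only the row with id 4 and the cursor sits at 4.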
Missing rows and rows not being updated¶
If you believe there are rows missing from your synced dataset or that previously synced rows aren't being properly updated, check the following:
- If new rows are added to the source and the value stored in the incremental column for the new row(s) is less than the current cursor, the new rows will not be synced.
  - For example, if you're using ID as your monotonically increasing column and the last ID value synced in the last sync was 10, and then you add a row with ID 5, that row with ID 5 won't be synced.
- If you have rows that have not been synced due to the above, you can still sync these rows by either:
  - Performing a snapshot sync rather than an incremental sync, or
  - Adjusting your query to target the missing rows, and running it as a one-off.
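The ID example above can be traced with a few lines of plain Python. This is a sketch that models the source as a list of IDs and assumes the cursor is simply the largest ID synced so far.

```python
def incremental_pass(source_ids, cursor_value):
    """Return (ids synced in this pass, new cursor value)."""
    synced = sorted(i for i in source_ids if i > cursor_value)
    return synced, max(synced, default=cursor_value)


source_ids = [8, 9, 10]
synced_first, cursor_value = incremental_pass(source_ids, 0)  # cursor becomes 10

# ID 5 arrives late (below the cursor); ID 11 arrives normally.
source_ids += [5, 11]
synced_second, cursor_value = incremental_pass(source_ids, cursor_value)

# Anything in the source that no pass picked up is a candidate for a one-off query.
missing = sorted(set(source_ids) - set(synced_first) - set(synced_second))
print(missing)
```

The late row with ID 5 ends up in missing, which is the set a one-off query (or a snapshot sync) would need to recover.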
Duplicate rows¶
If you believe existing rows are being re-synced, check the following:
- If existing rows in the source database are updated and the incremental column for those row(s) is changed such that it becomes greater than the current cursor, those rows will be re-synced, and thus duplicated. For example, if the column you are using as the incremental column is a timestamp, representing when the row was inserted or last updated, and you update a row between dataset syncs, that updated row will be re-synced.
- If you have duplicate rows present due to the above, you can remove them by either:
  - Performing a snapshot sync rather than an incremental sync, or
  - Using a downstream transform that removes duplicate rows.
- If the data type of your chosen incremental column is a timestamp and it uses sub-millisecond precision, rows will be re-synced and duplicated. Incremental JDBC syncs currently serialize timestamp values only up to millisecond precision, so the stored incremental value is always rounded down to the nearest millisecond. A row whose timestamp carries a microsecond or nanosecond component therefore compares as greater than the (rounded-down) incremental value derived from it, and is re-synced on the next run.
- If you have duplicate rows present due to the above, you will need to cast the incremental column to either a LONG or a STRING (in ISO-8601 format).
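The rounding effect can be seen with Python's datetime, which has microsecond precision. The sketch below assumes the cursor is the maximum synced timestamp truncated to milliseconds; the sample values are made up.

```python
from datetime import datetime


def truncate_to_millis(ts: datetime) -> datetime:
    """Drop the sub-millisecond part, mimicking a cursor stored at ms precision."""
    return ts.replace(microsecond=(ts.microsecond // 1000) * 1000)


rows = [
    datetime(2024, 1, 1, 12, 0, 0, 123000),  # exactly on a millisecond
    datetime(2024, 1, 1, 12, 0, 0, 123456),  # 456 microseconds past it
]

# Cursor after the first sync: 12:00:00.123 (the 456 microseconds are lost).
cursor_value = truncate_to_millis(max(rows))

# On the next run, the second row still compares as newer than the cursor.
resynced = [ts for ts in rows if ts > cursor_value]

# Comparing full-precision ISO-8601 strings avoids the problem.
iso_cursor = max(ts.isoformat(timespec="microseconds") for ts in rows)
resynced_iso = [ts for ts in rows if ts.isoformat(timespec="microseconds") > iso_cursor]
```

Here resynced holds the row with the microsecond component, while resynced_iso is empty because the string cursor keeps the full value.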
NullPointerException thrown on incremental sync¶
If a NullPointerException is thrown on your incremental sync, this may indicate that the SQL query is retrieving rows from the database that would cause the incremental column to contain null values.
- For example, take the query SELECT * FROM table WHERE col > ? OR timestamp > 1, where col is the incremental column being used for the sync. The use of OR means that the query does not guarantee that col only contains non-null values. If a null value for col is synced for any row, then the sync will fail upon Data Connection attempting to update the incremental state for the sync since the current state will be compared with the synced null value and throw an error.
- To remediate situations like these, either choose a different incremental column or ensure that no null values can be synced for the current incremental column. For the query above, we could avoid the errors with a rewrite like SELECT * FROM table WHERE (col > ? OR timestamp > 1) AND col IS NOT NULL.
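The difference between the two queries can be checked against an in-memory SQLite table. This is an illustrative sketch: the table name t and the sample rows are assumptions, and SQLite stands in for whatever database the source actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER, timestamp INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 0), (7, 5), (None, 9)],  # the last row has a NULL incremental value
)

risky = "SELECT col FROM t WHERE col > ? OR timestamp > 1"
safe = "SELECT col FROM t WHERE (col > ? OR timestamp > 1) AND col IS NOT NULL"

# With the OR clause alone, the NULL row is returned because timestamp > 1 is true.
risky_values = [row[0] for row in conn.execute(risky, (5,))]

# The added IS NOT NULL guard filters it out.
safe_values = [row[0] for row in conn.execute(safe, (5,))]
```

risky_values contains None, which is the kind of value that cannot be compared against the incremental state; safe_values does not.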
:::callout{theme="neutral"} If you wish to change the incremental column used for your sync, we recommend that you create a new sync. :::
Intermittent sync failures or hangs¶
Check if your agent is running out of resources¶
- On the agent host, in the <bootvisor-directory>/var/data/processes directory, run ls -lrt to find the most recently created bootstrapper~<uuid> directory.
- cd into that directory and navigate to /var/log/.
- Review the contents of magritte-agent-output.log.
If you see the error OutOfMemory Exception, it means that the agent cannot handle the workload being assigned to it.
- To fix this, you may need to increase the "agent heap size" parameter, which can be done in the agent overview page. However, before doing so we recommend you read the instructions for Tuning Heap Sizes in the agent troubleshooting reference guide.
- If you cannot increase the agent's heap size, you may need to reduce the "Maximum concurrent syncs" parameter. This can also be done in the agent overview page.
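The log check above can also be scripted. The sketch below uses only the directory and file names mentioned in the steps; the function names are hypothetical, and it picks the newest directory by modification time, which is what ls -lrt sorts on.

```python
from pathlib import Path
from typing import List, Optional


def latest_bootstrapper_dir(processes_dir: Path) -> Optional[Path]:
    """Return the most recently modified bootstrapper~<uuid> directory, if any."""
    candidates = [p for p in processes_dir.glob("bootstrapper~*") if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def out_of_memory_lines(log_path: Path) -> List[str]:
    """Return the log lines that mention OutOfMemory."""
    with log_path.open(errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if "OutOfMemory" in line]


# Hypothetical usage; substitute your own bootvisor directory for the placeholder:
# latest = latest_bootstrapper_dir(Path("<bootvisor-directory>/var/data/processes"))
# if latest is not None:
#     print(out_of_memory_lines(latest / "var" / "log" / "magritte-agent-output.log"))
```

An empty result means the log contains no OutOfMemory entries, in which case the heap size is less likely to be the cause.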
Hanging syncs¶
Below are some common causes of hanging syncs and their associated fixes:
All syncs: Hanging during the fetching stage
If your sync is hanging during the fetching stage, check if the source is both available and operational:
- To check whether the source is available, try to connect and interact with the source (without using your agent or other Foundry products). If you are able to connect successfully and queries run as expected, contact your Palantir representative for further assistance.
- If you find that you are unable to connect to the source or that responses to queries sent to the source are slow, it is likely the source is either experiencing higher than normal volumes of traffic or is down. To mitigate the impact of busy sources, we suggest doing the following:
- Break up your sync into smaller syncs.
- Use incremental syncs (if applicable).
- Schedule your syncs at times when you know the source won't be busy.
JDBC syncs: Hanging during the fetching stage
If your sync is taking longer than expected to complete the fetching stage, it could be because the agent is making a large number of network and database calls. In order to tune the number of network and database calls made during a sync, you can alter the Fetch Size parameter:
- The Fetch Size parameter is located within the "advanced options" section of the source configuration and defines the number of rows fetched during each database round trip for a given query. Therefore:
  - Decreasing the Fetch Size parameter will result in fewer rows being returned per call to the database, and more calls will be required. However, this means the agent will use less memory as fewer rows will be stored in the agent's heap at a given time.
  - Increasing the Fetch Size parameter will result in more rows being returned per call to the database, and fewer calls will be required. However, this means the agent will use more memory as a larger number of rows will be stored in the agent's heap at a given time.
- We recommend starting with Fetch Size: 500 and tuning accordingly.
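As a rough guide to the trade-off, the number of database round trips is the row count divided by the fetch size, rounded up. The sketch below is a back-of-the-envelope estimate only; the one-million-row table is an assumed example, and real drivers may batch differently.

```python
import math


def round_trips(total_rows: int, fetch_size: int) -> int:
    """Estimate the number of database round trips needed to fetch all rows."""
    return math.ceil(total_rows / fetch_size)


# Assumed example: a query returning one million rows.
for fetch_size in (100, 500, 5000):
    print(fetch_size, round_trips(1_000_000, fetch_size))
```

Larger fetch sizes cut the number of calls, at the cost of holding more rows in the agent's heap per call.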
JDBC syncs: Hanging during the upload stage
If your sync is taking a long time to upload files or fails during the upload stage, you could be overloading a network link. In this case we suggest tuning the Max file size parameter:
- The Max file size parameter is located within the "advanced options" section of the source configuration and defines the maximum size (in bytes or rows) of the output files which are uploaded to Foundry. Therefore:
  - Decreasing the Max file size parameter can increase pressure on the network as smaller files are uploaded more frequently; if a file upload fails, the cost of re-uploading is less.
  - Increasing the Max file size parameter will require less total bandwidth, but such uploads are more likely to fail.
- We recommend Max file size: 120mb.
FTP / SFTP / Directory syncs: Hanging during the fetching stage
The most common reason file-based syncs hang during the fetching stage is that the agent is crawling a large file system.
- In order to avoid long crawl times, ensure you have specified the subfolder to crawl within the source configuration page.
- Note: Any regex filters will run on the path of the file relative to the source’s root directory.
- If a subfolder is not specified, the sync will crawl the source root.
:::callout{theme="neutral"} Syncs that crawl a filesystem will do two complete crawls of the filesystem (unless configured otherwise). This is to ensure the sync does not upload files which are currently being written to or altered in any way. :::
Large file-based syncs are vulnerable to transient failures¶
File-based syncs are transactional. If a sync fails at any point, the transaction is aborted and none of the files from that run are committed to the dataset. For example, if a sync is configured to upload one million files and a network error occurs after uploading all but one file, the entire transaction is aborted. None of the uploaded files are committed and the dataset view is not modified.
Due to this transactional behavior, syncs that process a large number of files in a single run are more vulnerable to transient network issues or source system problems. To reduce the impact of sync failures:
- Use incremental APPEND syncs instead of SNAPSHOT syncs.
- Apply a file limit filter so that each run processes a smaller batch of files.
This way, a failure only requires re-syncing the current batch rather than all files.
For more details, see sync failure behavior and optimize file-based syncs.
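The effect of batching on the one-million-file example can be estimated with simple arithmetic. The sketch below assumes each run commits its own transaction and that later runs pick up the remaining files; the function name and the 10,000-file limit are illustrative.

```python
from typing import Optional


def files_to_reupload(total_files: int, failed_at: int, file_limit: Optional[int]) -> int:
    """Count the files to upload again when upload number failed_at (1-based) fails."""
    if file_limit is None:
        # One transaction covers every file, so the whole run is rolled back.
        return total_files
    # Earlier batches were committed by earlier runs; only the current batch is lost.
    batch_start = ((failed_at - 1) // file_limit) * file_limit
    return min(file_limit, total_files - batch_start)


# The failure happens on the very last file of one million.
print(files_to_reupload(1_000_000, 1_000_000, None))    # no file limit
print(files_to_reupload(1_000_000, 1_000_000, 10_000))  # 10,000 files per run
```

Without a limit, all one million files must be uploaded again; with a limit of 10,000 files per run, only the final batch is repeated.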
If your sync fails with the REQUEST_ENTITY_TOO_LARGE error¶
Downloading, processing, and uploading large files is error-prone and slow. REQUEST_ENTITY_TOO_LARGE service exceptions occur if an individual file exceeds the maximum size configured for the agent's upload destination. For the data-proxy upload strategy, this is set to 100GB by default.
Overriding the limit is not recommended; if possible, find a way to access this data as a collection of smaller files. However, if you wish to override this limit as a temporary workaround, use the following steps:
- Within Data Connection, navigate to your agent and select the Advanced configuration tab.
- Select the "Agent" tab.
- Under the destinations block, include the following to increase the limit to 150GB:
uploadStrategy:
  type: data-proxy
  maximumUploadedFileSizeBytes: 161061273600
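The byte value in the configuration above corresponds to 150 multiplied by 1024 three times (binary gigabytes). A small helper, shown here as a sketch with a hypothetical name, makes it easy to compute the value for a different limit.

```python
BYTES_PER_GIB = 1024 ** 3


def gib_to_bytes(gib: int) -> int:
    """Convert binary gigabytes to bytes for maximumUploadedFileSizeBytes."""
    return gib * BYTES_PER_GIB


print(gib_to_bytes(150))  # 161061273600
```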
A sync started failing¶
If your sync fails with the BadPaddingException error¶
BadPaddingException exceptions occur because the source credential encryption key stored within the agent is not what was expected. This commonly happens when an agent manager is manually upgraded, but the old /var/data directory is not copied to the new install location.
The easiest way to resolve this is to re-enter the credentials for each of the sources using the affected agent.
Unexpected data was synced¶
If timestamp columns in your JDBC sync show up as Long columns in Foundry¶
When rows are synced from a JDBC source and they contain timestamp columns, those timestamp columns will be cast to long columns in Foundry. This behavior exists for backwards compatibility reasons.
To fix the data type for these columns, we recommend using a Python Transform environment to perform this cleaning. Here is an example code snippet that casts the column "mytimestamp" back into timestamp form:
from pyspark.sql import functions as F
df = df.withColumn("mytimestamp", (F.col("mytimestamp") / 1000).cast("timestamp"))
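Dividing by 1000 before the cast implies that the long values are epoch milliseconds. To sanity-check a single value outside Spark, the same conversion can be done with the standard library. This is a sketch under that assumption, and the sample value is arbitrary.

```python
from datetime import datetime, timezone


def epoch_millis_to_utc(value: int) -> datetime:
    """Interpret a long as milliseconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(value / 1000, tz=timezone.utc)


print(epoch_millis_to_utc(1_700_000_000_000))  # 2023-11-14 22:13:20+00:00
```

If the result is implausible for your data (for example, a date decades off), the column may use a different unit and the divisor would need adjusting.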