Create geospatial transforms
With Pipeline Builder, you can load, transform, and work with geospatial data. If your geospatial workflow is not yet supported by Pipeline Builder's current capabilities, consult Foundry’s legacy geospatial documentation to transform your data in Code Repositories.
Modeling geospatial data
Logical types
Pipeline Builder models geospatial data internally using the concept of a logical type, which is a base type (string, integer, boolean, array, struct) with additional constraints on the data represented. For example, the Geometry type is defined as a string which must be valid GeoJSON, while a GeoPoint must be a struct of longitude between -180 and 180 and latitude between -90 and 90, both inclusive. A full list of supported types can be found below.
All logical types in Pipeline Builder are inheritors of their base types; for instance, a geometry can be used as input to an expression which expects an input of type string, but not vice versa. To cast from a base type to a particular logical type which extends that base type, you can use the “Logical Type Cast” expression, which will apply the constraints associated with that logical type to the data and null any values which fail this validation. The ability for expressions to specify logical types as input and output ensures that when a geospatial-specific expression expects a GeoJSON string, a GeoJSON string will be received.
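The cast-then-null behavior described above can be illustrated with a minimal Python sketch. This is not Pipeline Builder's implementation; the function name `cast_to_geopoint` and the dictionary representation of the struct are assumptions made for illustration only.

```python
from typing import Optional, TypedDict


class GeoPoint(TypedDict):
    longitude: float
    latitude: float


def cast_to_geopoint(value: object) -> Optional[GeoPoint]:
    """Hypothetical cast: return a GeoPoint if the constraints hold, else None."""
    if not isinstance(value, dict):
        return None  # wrong base type, so the value is nulled
    lon, lat = value.get("longitude"), value.get("latitude")
    for coord in (lon, lat):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(coord, bool) or not isinstance(coord, (int, float)):
            return None
    # Both ranges are inclusive, as stated in the text above.
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        return None
    return {"longitude": float(lon), "latitude": float(lat)}


rows = [
    {"longitude": -73.98, "latitude": 40.75},  # valid
    {"longitude": 200.0, "latitude": 10.0},    # longitude out of range
    {"longitude": 180.0, "latitude": -90.0},   # boundary values are allowed
    "not a struct",                            # not a struct at all
]
cast_rows = [cast_to_geopoint(r) for r in rows]
```

Values that fail validation become `None` rather than raising an error, which mirrors the "null any values which fail this validation" behavior.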
Supported geospatial types
Pipeline Builder currently supports the following geospatial types:
- GeoPoint: A struct of `longitude` and `latitude`, where `longitude` is a double between `-180` and `180`, and `latitude` is a double between `-90` and `90`, both inclusive. A GeoPoint must be a valid (x, y) coordinate according to the WGS:84 or EPSG:4326 coordinate reference system (CRS).
- Geometry: A stringified JSON blob adhering to the GeoJSON specification. Individual coordinates are expected to be in WGS:84/EPSG:4326 format, like the GeoPoint type.
- H3Index: A string which represents a valid H3 hexagonal index.
- LatLonBoundingBox: A bounding box, represented by a struct of `minLat`, `minLon`, `maxLat`, `maxLon`, where each entry is a valid GeoPoint and where `maxLat > minLat` and `maxLon > minLon`.
- Ontology GeoPoint: A string compatible with the Ontology's GeoPoint property type, fulfilling the format `{lat},{lon}`, where `-90 <= lat <= 90` and `-180 <= lon <= 180`.
- MGRS: A string which represents a valid MGRS (Military Grid Reference System) coordinate.
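As a rough illustration of the LatLonBoundingBox constraints listed above, the following Python sketch checks a candidate struct. It assumes each entry is a plain number within the latitude or longitude bounds used by GeoPoint; the function name `is_valid_bounding_box` and the dictionary representation are hypothetical.

```python
def is_valid_bounding_box(box: dict) -> bool:
    """Hypothetical check of the LatLonBoundingBox constraints."""
    try:
        min_lat, min_lon = float(box["minLat"]), float(box["minLon"])
        max_lat, max_lon = float(box["maxLat"]), float(box["maxLon"])
    except (KeyError, TypeError, ValueError):
        return False  # a missing or non-numeric entry fails validation
    # Strict inequalities: maxLat > minLat and maxLon > minLon.
    lats_ok = -90.0 <= min_lat < max_lat <= 90.0
    lons_ok = -180.0 <= min_lon < max_lon <= 180.0
    return lats_ok and lons_ok


valid = is_valid_bounding_box(
    {"minLat": 40.0, "minLon": -75.0, "maxLat": 41.0, "maxLon": -73.0}
)
```

Note that the strict `maxLon > minLon` requirement means a box written this way cannot wrap across the antimeridian.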
Loading geospatial data
Pipeline Builder supports a variety of transforms and expressions for loading geospatial data.
- GeoPoint:
  - Construct GeoPoint column: Takes a `lat`, `lon` pair, validates the bounds outlined above, and converts it into GeoPoint representation.
  - Create GeoPoint from Coordinate System (CRS): Takes an `x`, `y` pair and a coordinate reference system, projects that `(x, y)` into WGS:84, then constructs a GeoPoint representation. Supports conversion from most coordinate systems in the EPSG database, including all UTM zones.
- Geometry:
  - Parse well-known text (WKT): Converts a well-known text (WKT) string to the geometry logical type. Optionally, supply a source coordinate system identifier to convert from the source CRS to WGS:84 if the WKT is not in WGS:84 already.
  - Normalize Geometry: Given a GeoJSON string in WGS:84 format, normalizes the following attributes: proper order (right-hand rule), closed rings, duplicate removal, and constant dimensions of points.
  - Extract rows from Shapefile: Given a dataset of raw shapefiles, parses each shapefile into rows containing each entry’s geometry and properties. The output dataset will have a geometry column as well as a column for each property listed by the user. Coordinate reference systems can be specified for non-WGS:84 datasets.
  - Extract rows from GeoJSON: Given a dataset of raw GeoJSON files, parses each GeoJSON file into rows containing each entry’s geometry and properties. The output dataset will have a geometry column as well as a column for each property listed by the user. Coordinate reference systems can be specified for non-WGS:84 datasets.
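Additional expressions exist to translate between the above two types, as well as to convert them to H3 indices, MGRS, bounding boxes, and the Ontology GeoPoint format. Learn more about geospatial data formats in the coordinate reference systems and projections documentation.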
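To make the normalization steps listed for "Normalize Geometry" more concrete, here is a minimal Python sketch for a single GeoJSON Polygon. It is not the expression's actual implementation. It assumes the right-hand rule as defined in the GeoJSON specification (exterior rings counterclockwise, holes clockwise) and interprets "constant dimensions" as truncating every position to two dimensions; the function names are hypothetical.

```python
import json


def signed_area(ring):
    """Shoelace formula on a closed ring; positive when counterclockwise."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def normalize_ring(ring, exterior=True):
    # Keep only x and y so every position has the same dimension (assumption).
    points = [tuple(p[:2]) for p in ring]
    # Drop consecutive duplicate positions.
    deduped = [points[0]]
    for p in points[1:]:
        if p != deduped[-1]:
            deduped.append(p)
    # Close the ring if the last position differs from the first.
    if deduped[0] != deduped[-1]:
        deduped.append(deduped[0])
    # Right-hand rule: exterior counterclockwise, interior clockwise.
    is_ccw = signed_area(deduped) > 0
    if is_ccw != exterior:
        deduped.reverse()
    return [list(p) for p in deduped]


def normalize_polygon(geojson_str):
    geom = json.loads(geojson_str)
    if geom.get("type") != "Polygon":
        raise ValueError("this sketch only handles Polygon geometries")
    geom["coordinates"] = [
        normalize_ring(ring, exterior=(i == 0))
        for i, ring in enumerate(geom["coordinates"])
    ]
    return json.dumps(geom)


# A clockwise, unclosed square with a repeated vertex and one 3D position.
raw = '{"type": "Polygon", "coordinates": [[[0, 0], [0, 1], [0, 1], [1, 1, 5], [1, 0]]]}'
normalized = normalize_polygon(raw)
```

After normalization the ring is closed, has no repeated vertex, uses two-dimensional positions throughout, and runs counterclockwise.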
Transforming geospatial data
Once you have columns that use Pipeline Builder’s geospatial types, you can take advantage of transforms that operate specifically on geospatial data. Most transforms (except for geo-joins) are currently supported in both streaming and batch workflows. Some highlights are listed below.
Geometry comparisons
- Intersection
- Difference
- Symmetric difference
- Union (column-wise and aggregate)
Spherical geometry
- Haversine/great circle distance between two points
- Inverse haversine distance (given a starting point, distance, and bearing angle, calculate the end point)
- Area/centroid/length of a geometry
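For readers unfamiliar with the haversine and inverse haversine calculations named above, the following Python sketch shows the standard spherical formulas. It assumes a spherical Earth with a mean radius of 6,371,008.8 meters; the function names are illustrative and are not Pipeline Builder expression names.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (spherical assumption)


def haversine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_M):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))


def destination_point(lat, lon, distance, bearing_deg, radius=EARTH_RADIUS_M):
    """End point reached from (lat, lon) after `distance` along `bearing_deg`."""
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance / radius  # angular distance in radians
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(theta))
    lam2 = lam1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2),
    )
    # Wrap the longitude back into the [-180, 180) range.
    lon2 = (math.degrees(lam2) + 540) % 360 - 180
    return math.degrees(phi2), lon2


end_lat, end_lon = destination_point(40.0, -74.0, 100_000, 45.0)
round_trip = haversine_distance(40.0, -74.0, end_lat, end_lon)
```

The two functions are inverses of each other: travelling 100 km from a start point and then measuring the distance back to it should return approximately 100 km.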
H3
- Get neighbors of H3 hexagon at a certain resolution
- Cover a polygon with H3 hexagons at a certain resolution
Complex shape approximation
- Ellipse/Circle
- Range fan (annulus sector)
- Convex Hull of a given geometry
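As an illustration of the convex hull operation listed above, here is a short Python sketch using Andrew's monotone chain algorithm. It treats longitude and latitude as planar x and y coordinates, which is an assumption that ignores the antimeridian and the curvature of the Earth; it is not necessarily the method Pipeline Builder uses.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counterclockwise order (monotone chain)."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Pop while the last turn is clockwise or collinear.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


hull = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)])
```

Interior points such as `(1, 1)` and collinear edge points such as `(1, 0)` are dropped, leaving only the four corners of the square.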
Geospatial joins
Pipeline Builder supports the following geospatial joins:
Geometry intersection joins
Pipeline Builder's geometry intersection join requires two datasets, each of which must have a geometry-typed column. The geometry intersection join does not accept Ontology GeoPoint or GeoPoint as an input type. Before applying the join, we recommend normalizing the geometry column and explicitly filtering out null values if they are not needed in the output. If there is non-determinism or another join in the pipeline, we recommend adding a checkpoint prior to the geospatial join.
:::callout{theme="neutral"}
Pipeline Builder can join datasets of medium-sized geometries (approximately up to 34 points) with a scale of up to 1 million rows on either side, assuming a twofold increase in the number of output rows. For skewed data, the join can support up to 250 million rows on one side against 1,600 rows on the other. Stability may degrade as the size of the geometries increases. The join can consistently support joining a dataset with one massive geometry (on the order of 40,000 points) against up to 500,000 rows. Any larger scale may succeed intermittently but is not officially supported.

Geometry intersection joins that produce a number of output rows comparable to that of a cross join can cause stability degradation in the join.
:::
As an alternative to the geometry intersection join, the cross join configured with the “Geometries have intersection” filter may provide more stable memory usage. However, this approach could lead to a sharp increase in build times.
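The trade-off described above can be seen in a small Python sketch of a cross join followed by a filter. Every pair of rows is compared, which keeps memory usage predictable but makes the work grow with the product of the two row counts. The filter here is an envelope (bounding box) overlap test used as a simplified stand-in for a true geometry intersection test; all names are hypothetical.

```python
from itertools import product


def envelope(coords):
    """Bounding box (min_x, min_y, max_x, max_y) of a list of (x, y) pairs."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def envelopes_intersect(a, b):
    a_min_x, a_min_y, a_max_x, a_max_y = a
    b_min_x, b_min_y, b_max_x, b_max_y = b
    return (a_min_x <= b_max_x and b_min_x <= a_max_x
            and a_min_y <= b_max_y and b_min_y <= a_max_y)


def cross_join_with_filter(left, right):
    """left/right: lists of (row_id, coords). Compares every pair of rows."""
    return [
        (l_id, r_id)
        for (l_id, l_coords), (r_id, r_coords) in product(left, right)
        if envelopes_intersect(envelope(l_coords), envelope(r_coords))
    ]


left = [("a", [(0, 0), (2, 2)]), ("b", [(10, 10), (12, 12)])]
right = [("x", [(1, 1), (3, 3)]), ("y", [(20, 20), (21, 21)])]
pairs = cross_join_with_filter(left, right)
```

With two rows on each side, four comparisons are made and only the `("a", "x")` pair survives the filter.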
Geometry distance joins
Pipeline Builder's geometry distance join requires two datasets, each with a geometry-typed column, along with a distance value greater than zero and a coordinate reference system string that determines the units of the distance provided. For example, if "epsg:4326" is provided for the coordinate reference system, then the distance will be assumed to be in units of degrees. As with the intersection join, we recommend normalizing the geometry column and explicitly filtering out null values if they are not needed in the output. If there is another join or non-determinism in the pipeline, add a checkpoint prior to the join.
:::callout{theme="neutral"}
Pipeline Builder can join datasets of small geometries (approximately up to 8 points each) with a scale of up to 1 million rows on either side, assuming a twofold increase in the number of rows as a result of the join. When the number of rows output is comparable to that of a cross join, stability may degrade.

As an alternative to the geometry distance join, a cross join configured with a geometry buffer and "Geometries have intersection" filter may provide more stable memory usage when the increase in row count is large. However, this approach could sharply increase build times in most cases.
:::
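To illustrate how the coordinate reference system determines the distance units, the following Python sketch joins two small sets of points by planar distance measured in the units of their coordinates. With EPSG:4326 coordinates, that means the threshold is expressed in degrees. The sketch uses points only and an inclusive threshold; both are assumptions, and the function name is hypothetical.

```python
import math


def distance_join(left, right, max_distance):
    """left/right: lists of (row_id, (x, y)). Distance is in coordinate units."""
    if max_distance <= 0:
        raise ValueError("distance must be greater than zero")
    matches = []
    for l_id, (lx, ly) in left:
        for r_id, (rx, ry) in right:
            # Planar distance in the CRS units (degrees for EPSG:4326).
            if math.hypot(rx - lx, ry - ly) <= max_distance:
                matches.append((l_id, r_id))
    return matches


left = [("a", (0.0, 0.0))]
right = [("x", (0.3, 0.4)), ("y", (1.0, 1.0))]
matches = distance_join(left, right, max_distance=0.6)  # 0.6 degrees
```

Because a degree of longitude covers less ground at higher latitudes, a threshold in degrees does not correspond to a fixed ground distance; a projected CRS with metric units avoids this.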
Geometry k-nearest neighbors (KNN) joins
Pipeline Builder's geometry nearest neighbors join requires two datasets: a `base` dataset of geometries and a `neighbors` dataset of points. The `k` integer parameter configures the number of nearest neighbors to find for each base geometry. A coordinate reference system is required to determine how distances between base geometries and neighbor points are calculated and compared. The result will be the set of combined rows, each of which contains a GeoPoint that is one of the `k` closest points to the base geometry. Ties are broken arbitrarily, and results are returned in no particular order.
:::callout{theme="neutral"}
Note that this join has two requirements:

- All GeoPoints in the `neighbors` dataset must be able to fit inside executor and driver memory. This is currently a hard requirement and limits the scalability of the join. Contact your Palantir representative if your use case requires distributing the `neighbors` dataset.
- Foundry currently only accepts the GeoPoint logical type in the `neighbors` dataset to limit memory consumption. Contact your Palantir representative if non-point geometries are required on the `neighbors` side of the join.
:::
:::callout{theme="neutral"}
In practice, Pipeline Builder supports modest values of k (< 5) with up to a few hundred thousand rows in the neighbors dataset and 1 million geometries in the base dataset. When both datasets have a few hundred thousand rows, Pipeline Builder can support much larger values of k. Finding up to several hundred nearest neighbors should finish quickly in such cases. Increasing the scale of the inputs beyond this point may succeed intermittently, but is not currently supported in general.
:::
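The shape of a k-nearest neighbors join can be sketched in a few lines of Python. This brute-force version keeps the whole `neighbors` list in memory, which loosely mirrors the memory requirement noted above, and it simplifies the `base` side to points with planar distances. Those simplifications and the function name `knn_join` are assumptions; this is not how Pipeline Builder implements the join.

```python
import heapq
import math


def knn_join(base, neighbors, k):
    """base/neighbors: lists of (row_id, (x, y)). Returns (base_id, neighbor_id) pairs."""
    results = []
    for b_id, (bx, by) in base:
        # Select the k neighbors with the smallest planar distance to this row.
        nearest = heapq.nsmallest(
            k, neighbors, key=lambda n: math.hypot(n[1][0] - bx, n[1][1] - by)
        )
        results.extend((b_id, n_id) for n_id, _ in nearest)
    return results


base = [("b1", (0, 0))]
neighbors = [("n1", (1, 0)), ("n2", (5, 5)), ("n3", (0, 2)), ("n4", (9, 9))]
joined = knn_join(base, neighbors, k=2)
```

Each base row produces at most `k` output rows, so the output size grows with `k` times the number of base rows.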
Troubleshooting
If your join is encountering stability issues, use the following steps to remediate:
- Drop unnecessary columns prior to the join.
- Simplify the input geometries; for example, use a coarser grain for large geometries.
- Scale vertically; manually select a compute profile with more memory for the driver and executors.
- Split the largest input dataset into sets of about 25 million rows, then union the results together in a separate build.
- Reduce the number of rows in the output (that is, the number of intersections between left and right geometries).
Preview transform results
Once you have finished transforming your data in Pipeline Builder, you can validate the results of these transforms visually on a map. In the regular preview pane, select the cells you would like to preview on a map (the cells must be from columns of one of the geospatial types mentioned above). Right-click and select Open Geo Preview.

A new preview tab will appear, displaying the selected cells plotted on a map.

Using geospatial data with the Ontology
Pipeline Builder’s geospatial capabilities are designed to integrate seamlessly with downstream data across the platform.
- Ontology
  - Builder’s geometry column type is compatible with the Ontology’s geoshape type, but make sure to apply the "Normalize geometry" expression before mapping your column into an object in Builder. This ensures that the geoshape data will pass validations performed while indexing data into the Ontology.
  - While the current GeoPoint logical type cannot be used directly in the Ontology, points can be converted to an "Ontology GeoPoint" type (a string of the format `{lat},{lon}`) prior to indexing.
- Datasets
  - Geospatial type data is persisted in the output datasets of Builder pipelines, so a downstream Builder pipeline created from such a dataset preserves the correct logical/geospatial types.
- Geotemporal series syncs
  - You can map GeoPoint and geometry columns to a geotemporal series sync output for rendering points and geometries in downstream applications, such as the Map application. All geotemporal series sync outputs must have a GeoPoint column indicating the position of the observation corresponding to each row. Consult the geotemporal series output documentation for more details on how to configure geospatial properties in your integrations.
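The GeoPoint-to-Ontology-GeoPoint conversion mentioned in the list above amounts to reordering the coordinates into a `{lat},{lon}` string. The Python sketch below shows the idea; the function name and the dictionary representation of the GeoPoint struct are hypothetical, and in Pipeline Builder this conversion is done with a built-in expression rather than custom code.

```python
def to_ontology_geopoint(point):
    """Format a GeoPoint struct as '{lat},{lon}', or None if out of bounds."""
    lat, lon = point["latitude"], point["longitude"]
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    # Latitude comes first in the Ontology GeoPoint string.
    return f"{lat},{lon}"


ontology_value = to_ontology_geopoint({"longitude": -73.98, "latitude": 40.75})
```

Note the order swap: the GeoPoint struct is described as (longitude, latitude), while the Ontology string places latitude first.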