As its name implies, TopoJSON is a topological geospatial data format. In contrast to a geometry, where each shape is encoded with separate (and often redundant) arrays of coordinates, a topology encodes sequences of coordinates in line fragments called arcs that can be shared. For example, the border between California and Nevada is an arc that is shared by both polygons. The various arcs of the U.S. state borders are shown above.
The main benefit of a topology is that it improves shape simplification by avoiding artifacts that would be caused by simplifying shapes independently. It also enables applications like map coloring and selective meshing, and makes the format more compact by removing redundant coordinates.
Yet because most geospatial data formats (other than E00) are non-topological, you must typically create TopoJSON by inferring the topology from geometry. This article documents my implementation, partly for general interest and partly to help you debug conversion problems when using the topojson
command-line tool. This can also serve as a guide should you be interested in implementing TopoJSON encoding in another language.
The algorithm has four steps:
Each step is implemented in a source file of the same name, should you wish to follow along.
The minimum information needed to infer the topology is the set of all lines and rings extracted from the input geometry. This extraction simplifies the downstream logic; for example, the topology doesn’t care whether two polygons are represented as two Polygon geometries or one MultiPolygon geometry, as both are simply two rings.
Extracting lines and rings is a straightforward decomposition of the input geometries: a LineString is converted to one line, a MultiLineString is converted to zero or more lines, a Polygon is converted to one or more rings, and so on. The algorithm must also record which lines and rings are associated with which input geometries, so that the geometry can be recomposed from the final topology when the conversion is complete.
This animation demonstrates extraction on U.S. state borders:
The order of rings is arbitrary, though note that rings for each state are extracted in sequence, since a state with multiple polygons in this example is represented as a MultiPolygon.
Why differentiate between lines and rings? For topologies, rings are a special form of closed lines that can be rotated: the start of the ring can be changed to any point along the ring without affecting the geometry. This special property can be used later to reduce the number of arcs that need to be cut, optimizing the topology.
The next step is to identify junctions: shared points where two or more lines (or rings) intersect. For example, consider the line ABC, where A, B and C are distinct points. If there is another line DBE, then B is a junction. Similarly, if there are two lines ABC and ABD, then because the lines diverge at point B, B is a junction. A few simple examples:
(Note: if two line segments intersect, but don’t share a point, this intersection is ignored. The topology only cares how points are connected, and whether they are distinct, not their positions.)
Junctions are irrespective to order. So, given two lines ABC and CBA, then B is not a junction — unless there is some other line that intersects B. For the purposes of topology construction, we also consider the start and end point of every line to be a junction. This simplifies the code slightly, and has no detrimental effect because it is not necessary to cut a line at a given junction if the line already terminates at that junction.
The junctions for the U.S. state borders can be seen here as black dots:
As you expect, the border intersections in the contiguous United States are correctly identified. You can see a few additional junctions on the coasts, such as on the Gulf of Mexico and Puget Sound. These junctions are caused by coastal features that are too small to see in the current map, such as adjacent islands that share a single point. Independent rings, such as the Hawaiian islands, have no junctions; as before, this is because rings can be freely rotated.
The process of identifying junctions requires iterating over each line and ring while recording the neighbors of each visited point. If a point is subsequently revisited and has a different set of neighbors than before, then the point is a junction. For example, given the two lines ABC and ABD, the first time we visit B, the neighbors are {A, C}. The second time we visit B, the new neighbors are {A, D}, which is different than before, and thus B is a junction.
With all junctions identified, the next step is to cut the lines and rings so that they terminate at the junctions. For example, if B is a junction, then the line ABC is split into two lines AB and BC, while the line ABD is split into AB and BD. Splitting lines at junctions guarantees that all shared point sequences (like AB and AB) will be correctly deduped, and can be performed by simply iterating over the lines and rings while querying the set of previously-computed junctions.
This animation demonstrates cutting on U.S. state borders:
Red indicates the section of the ring (or line) before the cut, while the blue indicates the section after the cut. The blue part gets successively smaller as the ring is cut at each junction.
For the first junction encountered on a ring, we can rotate the ring so that it starts on the junction, eliminating a cut and reducing the number of arcs in the generated topology. For example, given the ring ABCA, if B is a junction, then we can rotate the ring to the equivalent sequence BCAB. This produces a smaller topology than cutting the ring into AB and BCA. This treatment could be applied as well to closed lines, but since the start and end point of a closed line may be meaningful — say, Magellan’s circumnavigation of the globe — it’s safer to restrict this optimization to rings. Ring rotation is not shown explicitly in the above animation, but notice that the first red cut on the ring always occurs on a junction, not at the ring’s original start point.
With the lines and rings cut into arcs, the last step is to deduplicate arcs so that each shared sequence of points is encoded only once in the final topology. Again, this requires iterating over the arcs that comprise each line or ring, and comparing the current arc to the arcs that were visited previously. If the current arc is different from all previously-visited arcs, it is assigned a new unique identifier (0, 1, 2… etc.). However if it is the same, it is given the identifier from the identical previous arc, and thus deduplicated.
This animation demonstrates deduplicating cut arcs on U.S. state borders:
The first time an arc is seen, it flashes red. If it is revisited (and thus duplicated), it flashes blue.
Detecting whether two arcs are the same is mostly a matter of iterating over the points in the arc and seeing whether they are equal, similar to string comparison. However, since arcs can be reversed, it’s necessary to perform a backwards traversal if the forwards traversal fails.
Comparing rings is further complicated in that two rings may be rotated but still equivalent. This is resolved by picking a consistent rotation-invariant starting point on both rings before traversing, such as the point with the minimum x-coordinate, or the minimum y-coordinate if there are multiple points at the minimum x-coordinate.
The implementation described in this article is new for TopoJSON 1.4.0. (It differs substantially from the old implementation; after this rewrite, I’m amazed the old implementation worked at all, though I’m glad there’s a respectable test suite!)
An issue of particular importance to topology inference is messy inputs. For example, if you have two points [-122.416712304917, 37.783305712306] and [-122.416712304916, 37.783305712306], then in regards to the topology they are as distinct as San Francisco and New York. The TopoJSON reference implementation eliminates many of these errors by quantizing inputs, rounding them to lower precision. Quantization gives excellent results when constructing topologies from multiple layers, such as county, state and nation borders. However, overly-quantized inputs can introduce excessive arc self-intersection, resulting in a big topology. So user control over quantization, and perhaps more heuristics, are still needed for dealing with “real-world” inputs. Or we could all start using topological formats…
If you have questions or comments, please find me on Twitter.