Mike Bostock

Let’s Make a Bubble Map

My previous Let’s Make a Map tutorial describes how to make a basic map with D3 and TopoJSON; now it’s time to cover thematic mapping in the form of a proportional symbol map. The simplest symbol is a circle, or bubble, whose area is proportional to the associated data. In this tutorial, we’ll make a bubble map of population by U.S. county.

Source: American Community Survey, 2012 5-Year Estimate

This tutorial necessarily covers a lot of ground. The main tasks for any visualization are:

acquiring data from primary sources,
transforming data into a display-appropriate representation, and
displaying data in a suitable visual encoding.

There are many different ways to perform these tasks, but this tutorial will focus on my preferred workflow. After acquiring cartographic boundaries and population estimates from the U.S. Census Bureau, we’ll transform this data to TopoJSON and display it using D3. Lastly, I’ll briefly comment on effective design for visual communication.

#Initializing

At a minimum, you’ll need Node and a basic web server for making maps. I covered this previously, so I won’t repeat myself here.

Although not essential, I also recommend Git to keep a history of your changes, allowing you to revert mistakes (such as accidentally deleting hours of work). Create a new folder for this project, go to that folder in the terminal, and run the following command:

git init

I use NPM to define local dependencies. The benefit of this approach is that you can have multiple versions of software packages installed simultaneously, and you don’t have to worry about things breaking when you upgrade because each project is isolated. A minimal package definition is:

{
  "name": "anonymous",
  "private": true,
  "version": "0.0.1",
  "dependencies": {
    "topojson": "1"
  }
}

Save this to a file called package.json, and run:

npm install

You should now see a node_modules folder, containing the installed topojson package.

If you’re using Git, you should also create a local .gitignore file so that you don’t accidentally check in generated files to the repository. It should look something like this:

.DS_Store
build
node_modules

The build directory is where we’ll store our generated files. Because those files are generated, they don’t need to be saved in the Git repository — they can be rebuilt at any time.

#Finding Boundaries

The U.S. Census Bureau publishes simplified cartographic boundaries as shapefiles for thematic mapping. The Census Bureau also publishes TIGER/Line shapefiles that are higher resolution and more up-to-date; however, for the small scale map we are making, that extra resolution is not needed. County boundaries also don’t change very frequently, so it’s usually acceptable to use the decennial census rather than the most recent release.

We’ll be using the lowest-resolution shapefile, at “20m” or 1:20,000,000 scale. Rather than download the file and check it in to our git repo, we’ll use Make to document where this file is located and download it. Create a Makefile with the following contents:

build/gz_2010_us_050_00_20m.zip:
	mkdir -p $(dir $@)
	curl -o $@ http://www2.census.gov/geo/tiger/GENZ2010/$(notdir $@)

Here $@ is the path to the target file, $(dir $@) is the directory containing the target file, and $(notdir $@) is the target file name. These abbreviations are faster to read and help avoid typos, but use long names if you find them too cryptic.

Next, run:

make build/gz_2010_us_050_00_20m.zip

This will download the zipfile from the Census Bureau and save it in the build directory.

#Converting Boundaries

The zipfile by itself isn’t very useful. We need to unzip its contents and convert the contained shapefile into TopoJSON for web delivery. We could do this by hand, but we’ll again use Make so that our process is documented and repeatable. Add the following to the Makefile:

build/gz_2010_us_050_00_20m.shp: build/gz_2010_us_050_00_20m.zip
	unzip -od $(dir $@) $<
	touch $@

This rule unzips the previously-downloaded file, giving us shapefiles. But don’t run it yet — we can combine it with another rule to convert the shapefiles to TopoJSON:

build/counties.json: build/gz_2010_us_050_00_20m.shp
	node_modules/.bin/topojson \
		-o $@ \
		--projection='width = 960, height = 600, d3.geo.albersUsa() \
			.scale(1280) \
			.translate([width / 2, height / 2])' \
		--simplify=.5 \
		-- counties=$<

Now run this new command:

make build/counties.json

In fact, this is not just converting the shapefile to TopoJSON, but also quantizing, projecting to the Albers USA projection and simplifying. Together, these changes save quite a bit of space! The resulting file is 496KB, while the original shapefile was 1.7MB.

#Displaying Boundaries

Enough terminal. Time to get something on the screen. Create an index.html:

<!DOCTYPE html>
<meta charset="utf-8">
<style>

path {
  fill: none;
  stroke: #000;
  stroke-linejoin: round;
  stroke-linecap: round;
}

</style>
<body>
<script src="//d3js.org/d3.v3.min.js" charset="utf-8"></script>
<script src="//d3js.org/topojson.v1.min.js"></script>
<script>

var width = 960,
    height = 600;

var path = d3.geo.path()
    .projection(null);

var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height);

d3.json("build/counties.json", function(error, us) {
  if (error) return console.error(error);

  svg.append("path")
      .datum(topojson.mesh(us))
      .attr("d", path);
});

</script>

Launch your local web server, and then visit your page. It should look something like this:

Two things to note at this stage. First, the d3.geo.path instance has a null projection; that’s because our TopoJSON is already projected, so we can display it as-is. This greatly improves rendering performance. Second, we’re just displaying the county boundaries so far (using topojson.mesh). We still have a bit of work to do before we can draw population bubbles.

#Finding Data

The next task is to fetch the data we want to visualize: population estimates by county. Sometimes you may find that data conveniently baked into your shapefile, but here we’ll need to return to the U.S. Census Bureau and gather the requisite table from the American Community Survey (ACS) using the American FactFinder.

Here are the approximately twenty steps required to download a CSV:

Go to factfinder2.census.gov.
Find where it says “American Community Survey” and click “get data »”.
Click the blue “Geographies” button on the left.
In the pop-up, select “..... County - 050” in the “geographic type” menu.
Select “All Counties within United States” in the “geographic areas” box.
Click the “ADD TO YOUR SELECTIONS” button.
Click “CLOSE” to dismiss the pop-up.
Click the blue “Topics” button on the left.
In the pop-up, expand the “People” submenu.
Expand the “Basic Count/Estimate” submenu.
Click “Population Total”.
Click “CLOSE” to dismiss the pop-up.
In the table, click on the most recent ACS 5-year estimate named “TOTAL POPULATION”.
On the next page, click the “Download” link under “Actions”.
In the pop-up, click “OK”.
Wait for it to “build” your file.
When it’s ready, click “DOWNLOAD”.
Finally, expand the downloaded zip file (ACS_12_5YR_B01003.zip).


An eminently more usable alternative to FactFinder is censusreporter.org, a Knight News Challenge-funded project with a convenient autocomplete interface and a robust API. Here is a direct link to download the latest ACS total population estimate by county. Note, however, that the column headers for this CSV are slightly different than the ones from FactFinder: you must edit either the file or the Makefile rules accordingly.

If you want to experience the FactFinder vicariously, you may instead download my copy. However, I recommend that you prefer data from primary sources whenever possible, as this helps ensure the data’s accuracy.

#Merging Data

The downloaded ACS_12_5YR_B01003_with_ann.csv is slightly unusual in that it contains two header lines. Normally, a CSV file contains at most one header line defining the names of the columns; this is the format that d3.csv (and TopoJSON) expects. Open the downloaded CSV in your text editor and delete the first of the two header lines. The first few lines should look like this:

Id,Id2,Geography,Estimate; Total,Margin of Error; Total
0500000US01001,01001,"Autauga County, Alabama",54590,*****
0500000US01003,01003,"Baldwin County, Alabama",183226,*****
0500000US01005,01005,"Barbour County, Alabama",27469,*****
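If you would rather not edit the file by hand, the same step can be scripted. The following is a minimal sketch that assumes the whole file fits in memory as a string; the helper name dropFirstLine is hypothetical, and the first line of the sample is only a placeholder for the extra header, not the real FactFinder header.

```javascript
// Hypothetical helper: remove the first line of a CSV string so that
// only one header line remains. Works with \n and \r\n line endings.
function dropFirstLine(text) {
  var i = text.indexOf("\n");
  return i < 0 ? "" : text.slice(i + 1);
}

// Placeholder first line, followed by rows shown in this tutorial.
var sample =
    "placeholder,for,the,extra,header\n" +
    "Id,Id2,Geography,Estimate; Total,Margin of Error; Total\n" +
    "0500000US01001,01001,\"Autauga County, Alabama\",54590,*****\n";

var cleaned = dropFirstLine(sample);
console.log(cleaned.split("\n")[0]); // the remaining header line
```

In Node, you could read the downloaded file into a string, pass it through this function, and write the result back out before running the Makefile rules.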

Now we can use TopoJSON’s --external-properties feature to join the shapefile of counties with the CSV of population estimates, making additional properties available in the output TopoJSON. This flag works similarly to a join in a relational database. Using the ID property as a primary key, we assign each row in the CSV file to the corresponding feature in the shapefile.

One frequent complication is that the external properties do not use the same ID property name as the shapefile. Here the CSV file uses the name Id2, while the shapefile uses STATE and COUNTY. (We could use the longer Id and GEO_ID properties, but we’d prefer to use the shorter identifier here, without the redundant leading 0500000US.)

To address these inconsistencies, the --id-property argument accepts a comma-separated list of JavaScript expressions to specify how the ID property should be computed. For the shapefile, we’ll use the expression STATE+COUNTY to concatenate those two properties, while for the CSV, we’ll use Id2.

We can also use JavaScript expressions to define the properties we want to include in the generated TopoJSON. Here we’ll map the Geography column from the CSV to the name property, and the Estimate; Total column to the population property. The latter requires special syntax because the column name isn’t a valid JavaScript identifier. Also, we want it to be a number.

Modifying our Makefile slightly:

build/counties.json: build/gz_2010_us_050_00_20m.shp ACS_12_5YR_B01003_with_ann.csv
	node_modules/.bin/topojson \
		-o $@ \
		--id-property='STATE+COUNTY,Id2' \
		--external-properties=ACS_12_5YR_B01003_with_ann.csv \
		--properties='name=Geography' \
		--properties='population=+d.properties["Estimate; Total"]' \
		--projection='width = 960, height = 600, d3.geo.albersUsa() \
			.scale(1280) \
			.translate([width / 2, height / 2])' \
		--simplify=.5 \
		-- counties=$<
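To make the join concrete, here is a rough sketch of what the flags above accomplish, written in plain JavaScript. It is an illustration of the idea only, not TopoJSON’s implementation; the function name joinPopulation and the simplified feature objects are assumptions made for this example.

```javascript
// Illustrative sketch of the join; not how TopoJSON is implemented.
// "features" mimic shapefile records; "rows" mimic parsed CSV records.
function joinPopulation(features, rows) {
  // Index the CSV rows by their key column (Id2).
  var rowById = {};
  rows.forEach(function(row) { rowById[row.Id2] = row; });

  return features.map(function(feature) {
    // Shapefile key: STATE and COUNTY concatenated, e.g. "01" + "001".
    var id = feature.properties.STATE + feature.properties.COUNTY,
        row = rowById[id];
    return {
      id: id,
      properties: row ? {
        name: row.Geography,
        // Unary + coerces the CSV string to a number.
        population: +row["Estimate; Total"]
      } : {}
    };
  });
}

var features = [
  {properties: {STATE: "01", COUNTY: "001"}},
  {properties: {STATE: "01", COUNTY: "003"}}
];

var rows = [
  {Id2: "01001", Geography: "Autauga County, Alabama", "Estimate; Total": "54590"},
  {Id2: "01003", Geography: "Baldwin County, Alabama", "Estimate; Total": "183226"}
];

var joined = joinPopulation(features, rows);
console.log(joined[0].id, joined[0].properties);
```

The bracket syntax row["Estimate; Total"] is needed for the same reason as in the Makefile: the column name contains a semicolon and a space, so it cannot be written as a bare identifier.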

#Merging Boundaries

One subtle detail you may not have noticed in the final bubble map is that it displays state boundaries rather than county boundaries. This reduces visual noise; each county has a corresponding bubble, while the state boundary lines provide additional geographic context.

We can compute the state boundaries without downloading another shapefile because TopoJSON is a topological format. The following rule merges (or “dissolves”) counties within the same state, producing a new states layer in the output TopoJSON file:

build/states.json: build/counties.json
	node_modules/.bin/topojson-merge \
		-o $@ \
		--in-object=counties \
		--out-object=states \
		--key='d.id.substring(0, 2)' \
		-- $<
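The --key expression decides which counties end up in the same state. As a rough sketch of only that grouping step (the real merge also dissolves the boundaries shared by counties in a group), the following plain JavaScript groups county IDs by their two-digit state prefix. The helper name groupByKey and the three sample IDs are assumptions for illustration.

```javascript
// Illustrative only: groups objects by a computed key. The actual
// merge additionally removes the borders shared within each group.
function groupByKey(objects, key) {
  var groups = {};
  objects.forEach(function(d) {
    var k = key(d);
    (groups[k] || (groups[k] = [])).push(d);
  });
  return groups;
}

// Sample county IDs: state code (two digits) + county code (three digits).
var counties = [{id: "01001"}, {id: "01003"}, {id: "17031"}];

// Same expression as the --key argument in the Makefile rule.
var byState = groupByKey(counties, function(d) { return d.id.substring(0, 2); });

console.log(Object.keys(byState)); // one key per state prefix
```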

The resulting state mesh:

A similar rule can compute the national boundary by merging states:

us.json: build/states.json
	node_modules/.bin/topojson-merge \
		-o $@ \
		--in-object=states \
		--out-object=nation \
		-- $<

To run these new rules:

make us.json

The topojson.merge function is part of the client API, so we could do this step in the client rather than baking it into the TopoJSON file. However, it’s slightly faster to precompute the merged areas, and sometimes it’s nice to have fewer moving parts.

Don’t forget to load the new file in index.html, replacing the old counties-only file:

d3.json("us.json", function(error, us) {
  if (error) return console.error(error);

  // Append to svg here.
});

#Displaying Data

First, let’s finish the base map that will appear underneath the bubbles.

The relevant code for the base map is:

svg.append("path")
    .datum(topojson.feature(us, us.objects.nation))
    .attr("class", "land")
    .attr("d", path);

svg.append("path")
    .datum(topojson.mesh(us, us.objects.states, function(a, b) { return a !== b; }))
    .attr("class", "border border--state")
    .attr("d", path);

The land is drawn as a single feature, with the state borders drawn as white lines on top. The filter function passed to topojson.mesh specifies that only internal state borders should be drawn; the coastlines are not stroked so as to retain detail around small islands and inlets.
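To see why the filter a !== b keeps only interior borders, it may help to model the idea in plain JavaScript. The filter is called with the two geometries that share an arc; when an arc is used by only one geometry, as on a coastline, both arguments are the same object. The arc objects and the filterArcs helper below are a simplified model invented for this sketch, not TopoJSON’s data structures.

```javascript
// Simplified model: each arc lists the geometries that use it.
// An interior border is shared by two states; a coastline by one.
function filterArcs(arcs, filter) {
  return arcs.filter(function(arc) {
    var a = arc.geometries[0],
        b = arc.geometries[arc.geometries.length - 1]; // same as a if unshared
    return filter(a, b);
  });
}

var stateA = {id: "A"},
    stateB = {id: "B"};

var arcs = [
  {name: "shared border", geometries: [stateA, stateB]},
  {name: "coastline of A", geometries: [stateA]},
  {name: "coastline of B", geometries: [stateB]}
];

// The same filter used for the state borders above.
var interior = filterArcs(arcs, function(a, b) { return a !== b; });

console.log(interior.map(function(arc) { return arc.name; }));
```

Flipping the comparison to a === b would select only the exterior arcs instead.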

We’ll need these new styles, as well, replacing the old ones:

.land {
  fill: #ddd;
}

.border {
  fill: none;
  stroke: #fff;
  stroke-linejoin: round;
  stroke-linecap: round;
}

Now to place bubbles at each county centroid:

svg.append("g")
    .attr("class", "bubble")
  .selectAll("circle")
    .data(topojson.feature(us, us.objects.counties).features)
  .enter().append("circle")
    .attr("transform", function(d) { return "translate(" + path.centroid(d) + ")"; })
    .attr("r", 1.5);
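For intuition about what a centroid is, here is a minimal sketch of the standard area-weighted centroid formula for a single closed ring of planar coordinates. This is not D3’s implementation of path.centroid, which also deals with holes, multiple polygons and other geometry types; the function name ringCentroid is an assumption for this example.

```javascript
// Area-weighted centroid of one closed ring (first point repeated last).
// A sketch of the textbook formula, not D3's path.centroid.
function ringCentroid(ring) {
  var area2 = 0, cx = 0, cy = 0;
  for (var i = 0, n = ring.length - 1; i < n; ++i) {
    var x0 = ring[i][0], y0 = ring[i][1],
        x1 = ring[i + 1][0], y1 = ring[i + 1][1],
        cross = x0 * y1 - x1 * y0;
    area2 += cross;            // accumulates twice the signed area
    cx += (x0 + x1) * cross;
    cy += (y0 + y1) * cross;
  }
  // Divide by 6 * area, which is 3 * area2.
  return [cx / (3 * area2), cy / (3 * area2)];
}

var square = [[0, 0], [2, 0], [2, 2], [0, 2], [0, 0]];
var c = ringCentroid(square);

// Concatenating an array with a string joins its elements with a comma,
// which is why "translate(" + path.centroid(d) + ")" works above.
console.log("translate(" + c + ")");
```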

To size the bubbles, create a d3.scale.sqrt so that the area of the circle is proportional to the associated population; the radius of the circle is proportional to the square root of the population. (Alternatively, you could use d3.svg.symbol for other proportional symbols.) We could compute the domain of the scale from the data, but since we know the approximate distribution of the data beforehand, we can simply hard-code reasonable values:

var radius = d3.scale.sqrt()
    .domain([0, 1e6])
    .range([0, 15]);
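The arithmetic behind this scale is simple enough to check by hand. The sketch below is a plain-JavaScript stand-in for the scale with the same domain and range, written out as an assumption of how a square-root scale maps values; it shows that quadrupling the population doubles the radius and therefore quadruples the area.

```javascript
// Stand-in for a sqrt scale with domain [0, 1e6] and range [0, 15]:
// the radius grows with the square root of the population.
function radius(population) {
  return 15 * Math.sqrt(population / 1e6);
}

function circleArea(r) {
  return Math.PI * r * r;
}

var small = radius(250000),  // a quarter of a million people
    large = radius(1000000); // one million people

console.log(small, large);                          // radii in pixels
console.log(circleArea(large) / circleArea(small)); // ratio of areas
```

Had we used a linear scale for the radius instead, the area would grow with the square of the population, greatly exaggerating the largest counties.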

This version of the map suffers greatly from occlusion: larger circles, such as Cook County in Illinois and Los Angeles County in California, obscure smaller bubbles underneath. Occlusion can be mitigated by making the bubbles smaller, but this makes it harder to see less-populated counties and emphasizes dense urban areas.

Another way to reduce occlusion is to sort bubbles by descending size, so that smaller bubbles are drawn on top of larger bubbles. The bubbles still overlap, but the smaller bubbles are now visible.

svg.append("g")
    .attr("class", "bubble")
  .selectAll("circle")
    .data(topojson.feature(us, us.objects.counties).features
      .sort(function(a, b) { return b.properties.population - a.properties.population; }))
  .enter().append("circle")
    .attr("transform", function(d) { return "translate(" + path.centroid(d) + ")"; })
    .attr("r", function(d) { return radius(d.properties.population); });

A bit of transparency and a thin white stroke also help.

.bubble {
  fill-opacity: .5;
  stroke: #fff;
  stroke-width: .5px;
}

Boom! A bubble map. But now that our map is legible, it’s a good time to consider its validity: often our source data is not as clean and regular as we expect, and data-cleanliness issues may not be apparent in the visualization. It’s critical to spot-check data and verify that it’s correct. You should run sanity checks on the data, such as whether any counties are duplicated or missing data.
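As one way to run such checks, the following sketch scans a list of county features for duplicate IDs and for missing or non-numeric populations. The function name checkCounties and the sample features (including the deliberately broken ones) are made up for illustration; in the page itself you would pass it the features array computed from the counties object.

```javascript
// Report duplicate ids and features lacking a numeric population.
function checkCounties(features) {
  var seen = {}, duplicates = [], missing = [];
  features.forEach(function(d) {
    if (seen[d.id]) duplicates.push(d.id);
    seen[d.id] = true;
    var p = d.properties.population;
    if (typeof p !== "number" || isNaN(p)) missing.push(d.id);
  });
  return {duplicates: duplicates, missing: missing};
}

// Made-up sample: one duplicated id and one feature with no population.
var sampleFeatures = [
  {id: "01001", properties: {population: 54590}},
  {id: "01001", properties: {population: 54590}},
  {id: "01003", properties: {}}
];

var report = checkCounties(sampleFeatures);
console.log(report.duplicates, report.missing);
```

An empty result from both lists does not prove the data is right, but a non-empty one is a clear signal to look more closely before publishing.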

For example, an earlier version of this tutorial used county boundaries from a different source, and the shapefile specified separate features for each of a county’s discontiguous areas. (Honolulu County in Hawaii consists not only of Oahu, but also of the tiny Ford and Sand islands.) To avoid duplicate bubbles and misleading readers, you would need to group features by county! The shapefile from the U.S. Census Bureau is already grouped, so we can skip this step.

#Communicating

To make this map communicate rather than simply look pretty, we need a few administrative touches. Adding a basic tooltip using SVG’s title element is a reasonable improvement, but we really need a legend to make the meaning of the area encoding apparent. Here is a basic legend that displays three circles and their associated population sizes:

var legend = svg.append("g")
    .attr("class", "legend")
    .attr("transform", "translate(" + (width - 50) + "," + (height - 20) + ")")
  .selectAll("g")
    .data([1e6, 3e6, 6e6])
  .enter().append("g");

legend.append("circle")
    .attr("cy", function(d) { return -radius(d); })
    .attr("r", radius);

legend.append("text")
    .attr("y", function(d) { return -2 * radius(d); })
    .attr("dy", "1.3em")
    .text(d3.format(".1s"));

And the corresponding styles:

.legend circle {
  fill: none;
  stroke: #ccc;
}

.legend text {
  fill: #777;
  font: 10px sans-serif;
  text-anchor: middle;
}
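The legend’s positioning may be easier to follow with the numbers written out. This sketch reuses a plain-JavaScript stand-in for the radius scale (an assumption, matching the domain and range chosen earlier) and computes where each legend circle sits: because cy is set to the negative radius, every circle rests on the same baseline, and each label is placed relative to the top of its circle.

```javascript
// Stand-in for the sqrt scale with domain [0, 1e6] and range [0, 15].
function radius(population) {
  return 15 * Math.sqrt(population / 1e6);
}

// For each legend value, compute the circle's center, bottom and top
// in the legend group's local coordinates (y grows downward in SVG).
var layout = [1e6, 3e6, 6e6].map(function(d) {
  var r = radius(d);
  return {
    value: d,
    r: r,
    cy: -r,          // center is one radius above the baseline
    bottom: -r + r,  // bottom edge sits on the baseline (y = 0)
    top: -2 * r      // label's y attribute: the top of the circle
  };
});

layout.forEach(function(l) {
  console.log(l.value, l.r, l.cy, l.top);
});
```

Nesting the circles on a shared baseline keeps the legend compact while still letting the reader compare the three sizes directly.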

An alternative to the explicit legend is to annotate a few circles with their exact value — say, Los Angeles, Miami-Dade, and Cook. These values can then serve as comparison points for the other values, rather than needing additional visual elements.

Lastly, a wide variety of interactive improvements could be made, such as a custom tooltip that displays additional information and the county outline, or panning and zooming to allow the viewer to dive in for more detail. You might also consider a Voronoi overlay to make the counties with small populations easier to hover. This tutorial merely provides a basic starting point for an interactive graduated symbol map.