library(redist) #> Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame' #> when loading 'dplyr' library(igraph) #> #> Attaching package: 'igraph' #> The following objects are masked from 'package:stats': #> #> decompose, spectrum #> The following object is masked from 'package:base': #> #> union library(spdep) #> Loading required package: sp #> Loading required package: spData #> To access larger datasets in this package, install the spDataLarge #> package with: `install.packages('spDataLarge', #> repos='https://nowosad.github.io/drat/', type='source')` #> Loading required package: sf #> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1 library(coda) set.seed(1)
The redist
package is designed to allow for replicable redistricting simulations. The package comes loaded with data for simple testing and with functions to simulate redistricting and to run diagnostics on the redistricting plans created. These data form the basis for small-scale validations of sampling methods in Automated Redistricting Simulation Using Markov Chain Monte Carlo. For larger scale validation, see The Essential Role of Empirical Validation in Legislative Redistricting Simulation, which has additional methods which will be added to this package later in 2020.
redist can be installed either from CRAN, for the stable version with:
install.packages('redist')
or from GitHub, which is updated more often:
devtools::install_github(repo = 'kosukeimai/redist', ref = 'master')
This package is often used with a set of other packages: sf
and sp
are useful for working with shapefiles and can be loaded with:
For additional functions for working with the shapefiles, using spdep
is recommended. This is also used with creating lists of which precincts are adjacent to other precincts. It can be loaded with:
For plotting maps and adjacency graphs, ggplot2
and igraph
are useful. The former allows for making ggplot maps, when used with sf
. These may be loaded with:
Additionally useful data manipulation tools for working with objects from the package are dplyr
and magrittr
.
library(dplyr) #> #> Attaching package: 'dplyr' #> The following objects are masked from 'package:igraph': #> #> as_data_frame, groups, union #> The following objects are masked from 'package:stats': #> #> filter, lag #> The following objects are masked from 'package:base': #> #> intersect, setdiff, setequal, union library(magrittr)
The package contains four datasets. The first three are algdat.p10
, algdat.p20
, and algdat.pfull
which contain data for 25 continuous precincts within Florida when partitioned into three districts. Respectively, these are those plans which fall within 10% population parity, 20% population parity, and all possible partitions. These are loaded in with the package and can be loaded as follows:
Each algdat object is a list with five objects:
names(algdat.p10) #> [1] "adjlist" "cdmat" "precinct.data" #> [4] "segregation.index" "distancemat" names(algdat.p20) #> [1] "adjlist" "cdmat" "precinct.data" #> [4] "segregation.index" "distancemat" names(algdat.pfull) #> [1] "adjlist" "cdmat" "precinct.data" #> [4] "segregation.index" "distancemat"
adjlist - contains 25 adjacency lists as nb objects from package spdep
cdmat - contains all possible congressional districts under the population constraint for that object as a matrix
precinct.data - data frame with a row for each precinct with five columns: pop, demvote, repvote, blackpop, hispanicpop
segregation.index - the dissimilarity index of segregation for each plan (The Dimensions of Residential Segregation, Massey & Denton 1987).
distancemat - a symmetric matrix with the squared distance between two precincts as its entries
The fourth data set is the sf dataframe, containing the shapes themselves for creating these objects. It can be called with:
data("fl25")
From the head of the data, we can see this is a typical dataframe with a geometry column appended. See the website for sf for more information about this type of object.
head(fl25) #> Simple feature collection with 6 features and 11 fields #> geometry type: POLYGON #> dimension: XY #> bbox: xmin: -81.85582 ymin: 26.4483 xmax: -81.56292 ymax: 26.7171 #> CRS: NA #> geoid10 pop vap obama mccain TotPop BlackPop HispPop VAP BlackVAP #> 2942 2519.0_0 15993 12379 647 1522 15993 863 1980 12379 582 #> 2561 2402.0_0 7799 6084 266 226 7799 2417 1600 6084 1601 #> 2581 2520.0_0 6347 5698 758 1846 6347 97 304 5698 78 #> 2594 2532.0_0 3808 3134 424 828 3808 41 180 3134 27 #> 2687 2438.0_0 4982 4173 668 842 4982 316 837 4173 199 #> 2732 2521.0_0 8907 7587 644 935 8907 195 1147 7587 153 #> HispVAP geometry #> 2942 1292 POLYGON ((-81.78102 26.6274... #> 2561 1133 POLYGON ((-81.80102 26.6341... #> 2581 240 POLYGON ((-81.79772 26.5093... #> 2594 134 POLYGON ((-81.79842 26.5474... #> 2687 545 POLYGON ((-81.75042 26.7038... #> 2732 860 POLYGON ((-81.66622 26.4657...
Each column contains useful information for typical redistricting questions: * geoid10
is a unique identifier. * pop
is the population of the precinct. * vap
is the voter age population. * obama
is the vote count for Obama in the 2012 presidential election. * mccain
is the vote count for McCain in the 2012 presidential election. * TotPop
is identical to pop
. * BlackPop
is the number of black residents within the precinct. * HispPop
is the number of Hispanic residents within the precinct. * VAP
is identical to vap
. * BlackVAP
is the number of voter age black residents within the precinct. * HispVAP
is the number of voter age hispanic residents within the precinct. * geometry
contains the sf geometry.
We begin with a simple example of thinking about adjacency-based redistricting, using a small piece of Florida, named fl25. It can be loaded with the data command as follows. For more information on the data, see the data section
data("fl25")
This Florida subset contains the shapefiles for the districts, which can be plotted with sf.
In this map, the color represents the population, where darker colors are smaller populations. The 25 shapes outlined above are the precincts that we work with. A difficulty in redistricting is that the area of the precinct does not necessarily tell us anything about the population of the precinct. As such, we work with adjacency lists for the precincts.
If we number the districts from 1 to 25, we say that two districts are connected if they share a side, which is referred to as rook contiguity.
If we arbitrarily number the districts from the plot above, we can pick a small subset of them.
fl25$id <- 1:25 fl25[c(12,13,1,14,15),] %>% ggplot() + geom_sf() + geom_sf_label(aes(label = id))
Then, the 15th precinct by this numbering is connected to 1, 13, and 14. So, the adjacency list would be 1,13,14. 12, for example, would not be on this list, as it does not share a portion of a side with precinct 25.
To make an adjacency list, we may use the packages spdep
with function poly2nb
. We use this function with queen = FALSE
, as we want the rooks adjacency, not the queens adjacency, which would allow for if only corners touch, like on a chess board.
adjlist <- poly2nb(pl = fl25, queen = FALSE)
We can further verify the above comments by looking at the 25th element of the adjacency list, which corresponds to the 25th district.
adjlist[[25]] #> [1] 5 22 24
Using igraph, we can easily plot the whole set of adjacencies:
plot(graph_from_adj_list(adjlist, mode = 'total'))
While these are numbered in 1:25, the back end in C++ requires 0 indexing for efficiency, so we sink the adjacency list to be 0:24.
for(i in 1:25){ adjlist[[i]] <- adjlist[[i]]-1 }
Thus, everything is the same up to naming.
adjlist[[25]] #> [1] 4 21 23
Each algorithm within the package maintains geographically contiguous districts. The MCMC
and enumerate
algorithms consider the underlying graphical structure of the districts, where contiguity is represented by the edges in the above graph.
To begin running the MCMC algorithm, we have to provide some basic data. We provide the adjacency list to adjobj
, a vector of populations to popvec
, the number of districts as 3 to ndists
, and choose to run 10000 simulations with nsims
. We can also save the output with savename
.
alg_mcmc <- redist.mcmc(adjobj = algdat.pfull$adjlist, popvec = algdat.pfull$precinct.data$pop, ndists = 3, nsims = 10000, savename = "redist.mcmc") #> #> ==================== #> redist.mcmc(): Automated Redistricting Simulation Using #> Markov Chain Monte Carlo #> #> Preprocessing data. #> #> #> ==================== #> Using redist.rsg() to generate starting values. #> #> 10 percent done. #> Metropolis acceptance ratio: 0.965966 #> #> 20 percent done. #> Metropolis acceptance ratio: 0.967984 #> #> 30 percent done. #> Metropolis acceptance ratio: 0.966656 #> #> 40 percent done. #> Metropolis acceptance ratio: 0.965741 #> #> 50 percent done. #> Metropolis acceptance ratio: 0.964993 #> #> 60 percent done. #> Metropolis acceptance ratio: 0.965328 #> #> 70 percent done. #> Metropolis acceptance ratio: 0.964852 #> #> 80 percent done. #> Metropolis acceptance ratio: 0.964996 #> #> 90 percent done. #> Metropolis acceptance ratio: 0.964329 #> #> 100 percent done. #> Metropolis acceptance ratio: 0.964796
Note that we do not need to specify ndists
when we supply a different argument, initcds
, which is the set of districts to initialize to. If we do not specific initcds
, the RSG function is run to initialize districts. When tuning parameters, the initcds
argument may be useful for ensuring diverse starting positions for different chains, though this is not typically necessary.
We can specify initcds
, for example, as the first column of the full enumeration in algdat.pfull
.
initcds <- algdat.pfull$cdmat[,1]
alg_mcmc <- redist.mcmc(adjobj = algdat.pfull$adjlist, popvec = algdat.pfull$precinct.data$pop, initcds = initcds, nsims = 10000, savename = "redist.mcmc") #> #> ==================== #> redist.mcmc(): Automated Redistricting Simulation Using #> Markov Chain Monte Carlo #> #> Preprocessing data. #> #> 10 percent done. #> Metropolis acceptance ratio: 0.95996 #> #> 20 percent done. #> Metropolis acceptance ratio: 0.962981 #> #> 30 percent done. #> Metropolis acceptance ratio: 0.959987 #> #> 40 percent done. #> Metropolis acceptance ratio: 0.96174 #> #> 50 percent done. #> Metropolis acceptance ratio: 0.960792 #> #> 60 percent done. #> Metropolis acceptance ratio: 0.96216 #> #> 70 percent done. #> Metropolis acceptance ratio: 0.961423 #> #> 80 percent done. #> Metropolis acceptance ratio: 0.960745 #> #> 90 percent done. #> Metropolis acceptance ratio: 0.961996 #> #> 100 percent done. #> Metropolis acceptance ratio: 0.961796
Once we’ve run the algorithm, the output is of class redist
. As savename
was provided, there is an Rdata file created in the working directory with a copy of the output.
class(alg_mcmc) #> [1] "redist" names(alg_mcmc) #> [1] "partitions" "distance_parity" "distance_original" #> [4] "mhdecisions" "mhprob" "pparam" #> [7] "beta_sequence" "energy_psi" "constraint_pop" #> [10] "constraint_compact" "constraint_vra" "constraint_similar" #> [13] "constraint_countysplit" "boundary_partitions" "boundaryratio"
For simple runs of the algorithm, the most important pieces of the output are partitions
and distance_parity
. The partitions
object is a matrix with ndist
rows and nsims
columns. Each column is one redistricting plan, where the numbers go from 0 to n-1, with each number represents the district assignment.
alg_mcmc$partitions[,1] #> [1] 0 1 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
The distance_parity
output provides the population parity of the districts as an array with nsims
entries.
alg_mcmc$distance_parity[1] #> [1] 1.305439
The additional outputs are useful for implementing the algorithm with constraints.
The running of this function is the same as in redist.mcmc, but it requires Rmpi installed.
library(Rmpi) redist.mcmc.mpi(adjobj = algdat.pfull$adjlist, popvec = algdat.pfull$precinct.data$pop, nsims = 10000, ndists = 3, savename = "redist.mcmc.mpi")
Outputs and usage are the same as in the MCMC Section, but MPI will allow for much faster computation.
When running larger redistricting analyses, one important step is to run multiple chains of the MCMC algorithm. This will also allow us to diagnose convergence better, using the Gelman-Rubin plot, as seen in the section on Diagnostic Plots.
On Windows and in smaller capacities, it is useful to run the algorithm within an lapply loop. First, we set up the seed for replicability and decide on the number of chains and simulations.
mcmc_chains <- lapply(1:nchains, function(x){ redist.mcmc(adjobj = algdat.pfull$adjlist, popvec = algdat.pfull$precinct.data$pop, nsims = nsims, ndists = 3) }) #> #> ==================== #> redist.mcmc(): Automated Redistricting Simulation Using #> Markov Chain Monte Carlo #> #> Preprocessing data. #> #> #> ==================== #> Using redist.rsg() to generate starting values. #> #> 10 percent done. #> Metropolis acceptance ratio: 0.964965 #> #> 20 percent done. #> Metropolis acceptance ratio: 0.961981 #> #> 30 percent done. #> Metropolis acceptance ratio: 0.963988 #> #> 40 percent done. #> Metropolis acceptance ratio: 0.95974 #> #> 50 percent done. #> Metropolis acceptance ratio: 0.961392 #> #> 60 percent done. #> Metropolis acceptance ratio: 0.96066 #> #> 70 percent done. #> Metropolis acceptance ratio: 0.961566 #> #> 80 percent done. #> Metropolis acceptance ratio: 0.96087 #> #> 90 percent done. #> Metropolis acceptance ratio: 0.960662 #> #> 100 percent done. #> Metropolis acceptance ratio: 0.960596 #> #> #> ==================== #> redist.mcmc(): Automated Redistricting Simulation Using #> Markov Chain Monte Carlo #> #> Preprocessing data. #> #> #> ==================== #> Using redist.rsg() to generate starting values. #> #> 10 percent done. #> Metropolis acceptance ratio: 0.955956 #> #> 20 percent done. #> Metropolis acceptance ratio: 0.96048 #> #> 30 percent done. #> Metropolis acceptance ratio: 0.962654 #> #> 40 percent done. #> Metropolis acceptance ratio: 0.96049 #> #> 50 percent done. #> Metropolis acceptance ratio: 0.962593 #> #> 60 percent done. #> Metropolis acceptance ratio: 0.962827 #> #> 70 percent done. #> Metropolis acceptance ratio: 0.963566 #> #> 80 percent done. #> Metropolis acceptance ratio: 0.964371 #> #> 90 percent done. #> Metropolis acceptance ratio: 0.963663 #> #> 100 percent done. #> Metropolis acceptance ratio: 0.963696 #> #> #> ==================== #> redist.mcmc(): Automated Redistricting Simulation Using #> Markov Chain Monte Carlo #> #> Preprocessing data. #> #> #> ==================== #> Using redist.rsg() to generate starting values. #> #> 10 percent done. #> Metropolis acceptance ratio: 0.958959 #> #> 20 percent done. #> Metropolis acceptance ratio: 0.961481 #> #> 30 percent done. #> Metropolis acceptance ratio: 0.964321 #> #> 40 percent done. #> Metropolis acceptance ratio: 0.966992 #> #> 50 percent done. #> Metropolis acceptance ratio: 0.967193 #> #> 60 percent done. #> Metropolis acceptance ratio: 0.966994 #> #> 70 percent done. #> Metropolis acceptance ratio: 0.966995 #> #> 80 percent done. #> Metropolis acceptance ratio: 0.966371 #> #> 90 percent done. #> Metropolis acceptance ratio: 0.965996 #> #> 100 percent done. #> Metropolis acceptance ratio: 0.965597 #> #> #> ==================== #> redist.mcmc(): Automated Redistricting Simulation Using #> Markov Chain Monte Carlo #> #> Preprocessing data. #> #> #> ==================== #> Using redist.rsg() to generate starting values. #> #> 10 percent done. #> Metropolis acceptance ratio: 0.968969 #> #> 20 percent done. #> Metropolis acceptance ratio: 0.965483 #> #> 30 percent done. #> Metropolis acceptance ratio: 0.964655 #> #> 40 percent done. #> Metropolis acceptance ratio: 0.963491 #> #> 50 percent done. #> Metropolis acceptance ratio: 0.963993 #> #> 60 percent done. #> Metropolis acceptance ratio: 0.962494 #> #> 70 percent done. #> Metropolis acceptance ratio: 0.96328 #> #> 80 percent done. #> Metropolis acceptance ratio: 0.96387 #> #> 90 percent done. #> Metropolis acceptance ratio: 0.964218 #> #> 100 percent done. #> Metropolis acceptance ratio: 0.964896
In unix-based systems, this can be run considerably faster by running this in parallel.
mcmc_chains <- parallel::mclapply(1:nchains, function(x){ redist.mcmc(adjobj = algdat.pfull$adjlist, popvec = algdat.pfull$precinct.data$pop, nsims = nsims, ndists = 3) }, mc.set.seed = 1, mc.cores = parallel::detectCores())
This package also contains an implementation of Chen and Rodden’s non-compact Random Seed and Grow redistricting algorithm from their paper Unintentional Gerrymandering: Political Geography and Electoral Biases in Legislatures.
The adjacency list, via adj.list
, is created as in Adjacency-Based Redistricting, population
is the population of the districts, ndists
is the number of districts, and thresh
is the allowed population parity. A thresh
of 0.05
allows for 5% population parity. In easier redistricting questions, a small maxiter
is typically sufficient, but in large maps, values may be closer to 50,000
to ensure that a solution is found.
rsg <- redist.rsg(adj.list = algdat.pfull$adjlist, population = algdat.pfull$precinct.data$pop, ndists = 3, thresh = 0.05, maxiter = 5000) #> #> ==================== #> redist.rsg(): Automated Redistricting Starts #> #> #> 3 districts built using 25 precincts in 0 seconds...
The output of redist.rsg
is a list with three objects. district_membership
provides a numeric array with indices for which district each precinct belongs to.
rsg$district_membership #> [1] 1 1 2 2 0 2 0 0 2 1 1 1 1 2 2 1 0 1 0 0 2 0 2 0 0
district_list
will provide an alternate formulation of this with ndists
arrays, indicating which precincts are in each district.
rsg$district_list #> [[1]] #> [1] 19 7 6 23 4 21 24 18 16 #> #> [[2]] #> [1] 0 12 11 1 10 17 15 9 #> #> [[3]] #> [1] 5 22 2 8 3 13 14 20
Corresponding to this, district_pop
will give the district population for each of the ndists
districts.
rsg$district_pop #> [1] 59125 56965 58953
To evaluate redistricting plans, the redist
package comes with the redist.segcalc
function. This computes the segregation index introduced in Massey and Denton 1987.
It takes three arguments, algout
, which is a redist
object from the various mcmc
function or rsg
function, grouppop
which is the population of the group being used for comparison, and fullpop
is the total population of the precinct.
seg <- redist.segcalc(algout = alg_mcmc, grouppop = algdat.pfull$precinct.data$blackpop, fullpop = algdat.pfull$precinct.data$pop)
This returns a numeric vector with the an entry for each district provided.
When using the MCMC algorithms, there are various useful diagnostic plots. The redist.diagplot
function creates familiar plots by converting numeric entries into mcmc
objects to use with coda
.
The first four plots take a single index, such as one created by redist.segcalc
. * Trace Plot
redist.diagplot(seg, plot = "trace")
redist.diagplot(seg, plot = "autocorr")
redist.diagplot(seg, plot = "densplot")
redist.diagplot(seg, plot = "mean")
The final plot requires at least two chains.
seg_chains <- lapply(1:nchains, function(i){redist.segcalc(algout = mcmc_chains[[i]], grouppop = algdat.pfull$precinct.data$blackpop, fullpop = algdat.pfull$precinct.data$pop)}) redist.diagplot(seg_chains, plot = 'gelmanrubin')
Beware that this diagnostic will often fail to converge if there are insufficient iterations. If an error is thrown and no plot is created, try increasing
nsims
.