By the end of this lesson, you should be familiar with:
Basic tools for selecting and subsetting data
Strategies for describing data
Getting Set Up
The Data
So far, we’ve worked with “tiny” bits of data - mainly things that we input manually. It’s time to start working with some real world data!
We’re going to work with data on those census tracts that were designated as Opportunity Zones as part of the federal Tax Cuts and Jobs Act of 2017. These incentives are designed to spur investment in low-income and undercapitalized cities by providing investors with tax incentives to invest capital in these locations. Each state had the opportunity to designate specific census tracts as Opportunity Zones. Practitioners and researchers have many questions about the efficacy of the program and the designations made by governors.
Take a look here to see a map of where designated opportunity zones are located. The orange geometries reflected on the map are census tracts, which we often use as a proxy for neighborhoods, especially in urban areas. Find a place you know, and take a look at which areas are designated.
The specific data we’ll work with comes from the Urban Institute - they have joined the IRS’s list of Opportunity Zones designated by the Tax Cuts and Jobs Act to a series of indicators focused on investment potential. A copy of the Urban Institute’s dataset is available here for download.
You’ll need to authenticate and log in to UIUC Box to access this file. You can also download the data directly from Urban Institute’s Opportunity Zone landing page.
Extending Base R With Packages
We’ve been working exclusively in “base” R as we are getting familiar with the R language and RStudio interface. This means we’ve been working with the commands and functions that come with every new installation of R.
There’s a whole world of other functions that we can use to extend R’s functionality - these packages can help us do things like load data with specific formats, make visualizations, and run specific types of statistical models. There are more than 20,000 packages that users have created to extend R’s functionality - you can find a list here. We’ll introduce common packages which help with neighborhood analysis tasks, but your favorite search engine is probably the easiest way to find packages or functions that may help you read specific data or perform certain types of tasks.
Readxl
Our first exploration with packages will help us read in our Opportunity Zone data. We will use a package called readxl which is designed to, well, read Microsoft Excel files such as the Opportunity Zone data.
Installing Readxl
We will use the readxl package to help us read our Opportunity Zone data from an Excel workbook into RStudio so we can view it as a data frame. The first time you want to make use of a package you have not used before, you’ll need to install it, which essentially means R needs to download it from a repository on the internet and prepare it to be accessed on your computer.
We use the install.packages() command to automatically download and prepare the package for use:
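The command looks like this - note that you only need to run it once per machine:

```r
# Download readxl from CRAN and prepare it for use on this computer
install.packages("readxl")
```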
Note that the name of the package is in quotes here. Any guesses as to why?
Note that we only need to install the package one time. It will remain available on your machine for use in the future.
Loading Readxl
In our previous step, we installed the readxl package, which involved downloading it from an internet repository and preparing it for use. There’s an additional step - now we need to load the package, which tells R that we want to make the package’s functions available for use in our current R session. The library() command loads packages for use in our current R session. While we only need to install a package one time, each time you start a new R session, you’ll need to load the packages you want to use for a particular kind of additional functionality.
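Loading the package looks like this:

```r
# Make readxl's functions available in the current R session
library(readxl)
```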
Note that the name of the package is not in quotes here. Any guesses as to why?
Now that the readxl package is loaded, we can make use of it to read our Opportunity Zone data into our R session.
Takeaways
In general, you will only need to use install.packages() once (once a package is installed, you can use it for all R sessions moving forward).
To load packages, we use the library() command. This will load an already installed package and make it accessible within our current R session. You will need to load already installed packages at the beginning of each new R session. Typically, it is a good practice to load the packages you’ll use in a script at the beginning of the script.
Note that to install the package, you need to treat the package name as a character vector ("readxl"), but when you load it in your R session, it does not need to be quoted (readxl).
Read in Data
Now that we’ve loaded the readxl package, we can import the Excel file containing data on tracts designated as opportunity zones in the United States. To learn about the functions in the package, you can either do a Google search for readxl, or you can use R’s built-in documentation by typing ?readxl:
?readxl
Note that running this command pops open documentation for readxl in the Help tab of the auxiliary pane. As the documentation states, readxl imports Excel files. Looking at the documentation, the read_excel() command will read a single Excel sheet, or we can optionally select a sheet by name or number from a workbook with multiple sheets. In this case, the Urban Institute data is in a workbook with a single sheet, so we just need to tell R where the file is located in order to load it.
The dataset we want to load is called “urbaninstitute_tractlevelozanalysis_update01142021.xlsx” (oof! - that’s a descriptive but way too long file name!). We can point R to the correct location. Since our R project file sets the relative path for all of the work within, the path to the data is: "data/urbaninstitute_tractlevelozanalysis_update01142021.xlsx". Wrapped into the command to read the Excel file, it looks like this:
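As a sketch - assuming the workbook sits in a data folder within the project directory:

```r
library(readxl)

# Read the single sheet from the Urban Institute workbook
read_excel("data/urbaninstitute_tractlevelozanalysis_update01142021.xlsx")
```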
R read the data and displayed it to us. But there’s one problem - we can look at the data, but it’s not really available for our continued use. We typically want to read data and store it as an object so that we can refer back to it and even modify it.
Let’s go ahead and read the Excel data in again, but this time, we’ll assign it to an object called “ozs”:
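Using the assignment operator, the sketch becomes (again assuming the file sits in the project's data folder):

```r
library(readxl)

# Read the workbook and store the result as a data frame named ozs
ozs <- read_excel("data/urbaninstitute_tractlevelozanalysis_update01142021.xlsx")
```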
Look at your Environment window (top right quadrant of RStudio) - a data frame containing information on opportunity zones should be loaded in an object called “ozs”.
The environment window tells us that the object ozs contains 42,178 observations (rows) and 27 variables (columns).
If we type the name of the object, we can view its contents:
ozs
# A tibble: 42,178 × 27
geoid state DesignatedOZ county Type dec_score SE_Flag Population
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 01001020200 Alabama NA Autauga … Low-… 4 NA 2196
2 01001020300 Alabama NA Autauga … Non-… 6 NA 3136
3 01001020700 Alabama 1 Autauga … Low-… 8 NA 3047
4 01001020802 Alabama NA Autauga … Non-… 10 NA 10743
5 01001021000 Alabama NA Autauga … Non-… 5 NA 2899
6 01001021100 Alabama NA Autauga … Low-… 6 NA 3247
7 01003010100 Alabama NA Baldwin … Non-… 6 NA 4013
8 01003010200 Alabama 1 Baldwin … Low-… 9 NA 3067
9 01003010300 Alabama NA Baldwin … Non-… 10 NA 8079
10 01003010400 Alabama 1 Baldwin … Non-… 9 NA 4578
# ℹ 42,168 more rows
# ℹ 19 more variables: medhhincome <dbl>, PovertyRate <dbl>, unemprate <dbl>,
# medvalue <dbl>, medrent <dbl>, pctown <dbl>, severerentburden <dbl>,
# vacancyrate <dbl>, pctwhite <dbl>, pctBlack <dbl>, pctHispanic <dbl>,
# pctAAPIalone <dbl>, pctunder18 <dbl>, pctover64 <dbl>, HSorlower <dbl>,
# BAorhigher <dbl>, Metro <dbl>, Micro <dbl>, NoCBSAType <dbl>
You could also inspect the dataset using the View() command, which will allow us to look at the data in a tabular format. The str() command instead summarizes the structure of the data:
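Assuming the data is loaded as ozs, the structure summary shown below comes from str():

```r
# Open the data in RStudio's spreadsheet-style viewer
View(ozs)

# Summarize each column's name, type, and first few values
str(ozs)
```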
tibble [42,178 × 27] (S3: tbl_df/tbl/data.frame)
$ geoid : chr [1:42178] "01001020200" "01001020300" "01001020700" "01001020802" ...
$ state : chr [1:42178] "Alabama" "Alabama" "Alabama" "Alabama" ...
$ DesignatedOZ : num [1:42178] NA NA 1 NA NA NA NA 1 NA 1 ...
$ county : chr [1:42178] "Autauga County" "Autauga County" "Autauga County" "Autauga County" ...
$ Type : chr [1:42178] "Low-Income Community" "Non-LIC Contiguous" "Low-Income Community" "Non-LIC Contiguous" ...
$ dec_score : num [1:42178] 4 6 8 10 5 6 6 9 10 9 ...
$ SE_Flag : num [1:42178] NA NA NA NA NA NA NA NA NA NA ...
$ Population : num [1:42178] 2196 3136 3047 10743 2899 ...
$ medhhincome : num [1:42178] 41107 51250 45234 61242 49567 ...
$ PovertyRate : num [1:42178] 0.24 0.107 0.19 0.153 0.151 ...
$ unemprate : num [1:42178] 0.0775 0.051 0.1407 0.0459 0.0289 ...
$ medvalue : num [1:42178] 95300 113800 93500 160400 102900 ...
$ medrent : num [1:42178] 743 817 695 1018 546 ...
$ pctown : num [1:42178] 0.628 0.703 0.711 0.823 0.83 ...
$ severerentburden: num [1:42178] 0.3269 0.3223 0.3887 0.1994 0.0994 ...
$ vacancyrate : num [1:42178] 0.0584 0.1399 0.0619 0.0609 0.2182 ...
$ pctwhite : num [1:42178] 0.439 0.671 0.833 0.814 0.726 ...
$ pctBlack : num [1:42178] 0.5187 0.205 0.0922 0.1572 0.2456 ...
$ pctHispanic : num [1:42178] 0.01275 0.0727 0.0338 0.01368 0.00448 ...
$ pctAAPIalone : num [1:42178] 0.01093 0.01052 0 0.00959 0 ...
$ pctunder18 : num [1:42178] 0.218 0.224 0.249 0.27 0.245 ...
$ pctover64 : num [1:42178] 0.124 0.175 0.149 0.122 0.156 ...
$ HSorlower : num [1:42178] 0.581 0.464 0.544 0.45 0.621 ...
$ BAorhigher : num [1:42178] 0.162 0.219 0.113 0.229 0.136 ...
$ Metro : num [1:42178] 1 1 1 1 1 1 1 1 1 1 ...
$ Micro : num [1:42178] NA NA NA NA NA NA NA NA NA NA ...
$ NoCBSAType : num [1:42178] NA NA NA NA NA NA NA NA NA NA ...
We get a list of the columns in the data, along with their types (in this case character or numeric), and then we see the values associated with the first few observations.
A few things to note after your preliminary inspection:
These data are at the census tract level and include geographic identifiers including geoid, the combined state-county-tract FIPS code; state, the state name; and county, the county name.
These data include a field named DesignatedOZ which is 1 when an eligible tract was designated as an opportunity zone, and NA where the tract was not designated.
The dataset also includes some other tract-level demographic measures, as well as additional geographic flags.
Describing the Data
When we load a new dataset that we’re not familiar with, it’s a good idea to spend a few minutes describing the data. This allows us to understand a bit more about its structure and may also help us ask some basic questions about the validity and reliability of the data (at least compared to what we are expecting to see). R has several functions for determining the structure of data frames and tibbles. See below:
Size
dim(ozs): returns a vector with the number of rows as the first element, and the number of columns as the second element (the dimensions of the object)
head(ozs) and tail(ozs): return the first six or last six rows of the data, respectively
str(ozs): displays the structure of the data, including each column’s name, type, and first few values
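For example, dim() confirms what the Environment window told us:

```r
# Dimensions of the Opportunity Zone data: rows, then columns
dim(ozs)
# [1] 42178    27
```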
tail(ozs)
# A tibble: 6 × 27
geoid state DesignatedOZ county Type dec_score SE_Flag Population medhhincome
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 7803… <NA> NA <NA> Non-… NA NA NA NA
2 7803… Virg… 1 <NA> Non-… NA NA NA NA
3 7803… Virg… 1 <NA> Low-… NA NA NA NA
4 7803… Virg… 1 <NA> Low-… NA NA NA NA
5 7803… Virg… 1 <NA> Low-… NA NA NA NA
6 7803… Virg… 1 <NA> Low-… NA NA NA NA
# ℹ 18 more variables: PovertyRate <dbl>, unemprate <dbl>, medvalue <dbl>,
# medrent <dbl>, pctown <dbl>, severerentburden <dbl>, vacancyrate <dbl>,
# pctwhite <dbl>, pctBlack <dbl>, pctHispanic <dbl>, pctAAPIalone <dbl>,
# pctunder18 <dbl>, pctover64 <dbl>, HSorlower <dbl>, BAorhigher <dbl>,
# Metro <dbl>, Micro <dbl>, NoCBSAType <dbl>
Practice your querying skills - how would we return only those records for census tracts with a median household income of $100,000 or more per year? (Keep in mind that where medhhincome is missing, the comparison returns NA rather than TRUE or FALSE, so base R returns an NA-filled row for each of these tracts.)
ozs[ozs$medhhincome>=100000,]
# A tibble: 399 × 27
geoid state DesignatedOZ county Type dec_score SE_Flag Population
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 <NA> <NA> NA <NA> <NA> NA NA NA
2 <NA> <NA> NA <NA> <NA> NA NA NA
3 <NA> <NA> NA <NA> <NA> NA NA NA
4 <NA> <NA> NA <NA> <NA> NA NA NA
5 04013815600 Arizona NA Maricopa… Non-… 10 NA 5762
6 04013816800 Arizona NA Maricopa… Non-… 10 1 3727
7 <NA> <NA> NA <NA> <NA> NA NA NA
8 <NA> <NA> NA <NA> <NA> NA NA NA
9 <NA> <NA> NA <NA> <NA> NA NA NA
10 <NA> <NA> NA <NA> <NA> NA NA NA
# ℹ 389 more rows
# ℹ 19 more variables: medhhincome <dbl>, PovertyRate <dbl>, unemprate <dbl>,
# medvalue <dbl>, medrent <dbl>, pctown <dbl>, severerentburden <dbl>,
# vacancyrate <dbl>, pctwhite <dbl>, pctBlack <dbl>, pctHispanic <dbl>,
# pctAAPIalone <dbl>, pctunder18 <dbl>, pctover64 <dbl>, HSorlower <dbl>,
# BAorhigher <dbl>, Metro <dbl>, Micro <dbl>, NoCBSAType <dbl>
Tracts in Illinois
Oftentimes, we’ll want to query out a subset of observations based upon their geographic location. Let’s try selecting all tracts in Illinois so we can examine their designation status.
How would we query out tracts in Illinois? Experiment in your own script.
ozs[ozs$state=="Illinois",]
# A tibble: 1,682 × 27
geoid state DesignatedOZ county Type dec_score SE_Flag Population
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 17001000201 Illinois NA Adams C… Non-… 7 NA 1937
2 17001000202 Illinois NA Adams C… Low-… 1 NA 2563
3 17001000400 Illinois NA Adams C… Low-… 1 NA 3403
4 17001000500 Illinois NA Adams C… Low-… 1 NA 2298
5 17001000700 Illinois NA Adams C… Low-… 1 NA 1259
6 17001000800 Illinois 1 Adams C… Low-… 1 NA 2700
7 17001000900 Illinois NA Adams C… Low-… 5 NA 2671
8 17001010100 Illinois NA Adams C… Non-… 2 NA 4323
9 17001010200 Illinois NA Adams C… Low-… 2 NA 3436
10 17001010300 Illinois NA Adams C… Non-… 8 NA 6038
# ℹ 1,672 more rows
# ℹ 19 more variables: medhhincome <dbl>, PovertyRate <dbl>, unemprate <dbl>,
# medvalue <dbl>, medrent <dbl>, pctown <dbl>, severerentburden <dbl>,
# vacancyrate <dbl>, pctwhite <dbl>, pctBlack <dbl>, pctHispanic <dbl>,
# pctAAPIalone <dbl>, pctunder18 <dbl>, pctover64 <dbl>, HSorlower <dbl>,
# BAorhigher <dbl>, Metro <dbl>, Micro <dbl>, NoCBSAType <dbl>
We can see in our table output that there are 1,682 eligible or designated tracts in Illinois. We could also use the nrow() command to count the number of rows.
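The same count with nrow(), as a sketch:

```r
# Number of rows (tracts) in the Illinois subset
nrow(ozs[ozs$state == "Illinois", ])
# [1] 1682
```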
We might also want to calculate statistics like averages for subsets. mean() will calculate the mean of a list or column. What’s the average income for tracts with a vacancy rate above 20 percent?
You might need to check out the documentation for mean() in order to return an answer here. R will not calculate the mean if any NA values are present in the vector for which you’ve requested the mean - a good safety feature if you’re expecting all values to be present.
With imperfect data like what we’re dealing with, you can instruct R to remove those NAs and calculate the mean of the remaining values. You may also want to count the number of NA values so you know what proportion of your data the mean actually represents.
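A sketch of that calculation, with na.rm = TRUE telling mean() to drop missing values first (the high_vacancy_incomes name here is ours, chosen for illustration):

```r
# Median household incomes for tracts with vacancy rates above 20 percent
high_vacancy_incomes <- ozs$medhhincome[ozs$vacancyrate > 0.20]

# How many of these values are missing?
sum(is.na(high_vacancy_incomes))

# Average income across the remaining tracts, dropping NA values
mean(high_vacancy_incomes, na.rm = TRUE)
```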
Subsetting
We might also be interested in combining query criteria. R can make use of logical statements. & is equivalent to AND and | is equivalent to OR. Now give it a go!
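For instance, combining two conditions with & returns designated tracts in Illinois (note that rows where DesignatedOZ is NA will again come back as NA-filled rows):

```r
# Tracts that are in Illinois AND are designated Opportunity Zones
ozs[ozs$state == "Illinois" & ozs$DesignatedOZ == 1, ]
```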
You’ll note that there are several flag variables in the data for which values are either 1 or NA. We have flags for whether a tract was designated as an Opportunity Zone, and whether it is located in a Metropolitan, Micropolitan, or Non-Core-Based Statistical Area.
We can use logical tests in R to identify those values that are NA. We can use is.na() to test whether a value is NA (TRUE) or is not NA (FALSE). We could also use the negation sign ! to determine whether a value is not NA (!is.na()).
The following code returns a vector of logical values (TRUE / FALSE) indicating whether the value of the DesignatedOZ column is NA or not.
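A minimal sketch, using head() to show just the first few values:

```r
# TRUE where DesignatedOZ is NA (undesignated), FALSE where it is 1
head(is.na(ozs$DesignatedOZ))
# [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
```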
For logical values, R treats TRUE as 1 and FALSE as 0, meaning that if we wanted to count the number of undesignated tracts, we could ask for the sum of the values for which the logical test is TRUE (the sum of the values that are NA):
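In code, the count looks like this (a sketch):

```r
# Each TRUE counts as 1, so the sum is the number of NA (undesignated) tracts
sum(is.na(ozs$DesignatedOZ))
```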
We might also want to recode those NA values to something else. We can use assignment and subset notation to replace NA values with something else. Let’s replace those NAs in the DesignatedOZ column with 0.
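A sketch of the recode, followed by printing the table to check the result:

```r
# Where DesignatedOZ is NA, assign 0 in place
ozs$DesignatedOZ[is.na(ozs$DesignatedOZ)] <- 0

ozs
```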
# A tibble: 42,178 × 27
geoid state DesignatedOZ county Type dec_score SE_Flag Population
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 01001020200 Alabama 0 Autauga … Low-… 4 NA 2196
2 01001020300 Alabama 0 Autauga … Non-… 6 NA 3136
3 01001020700 Alabama 1 Autauga … Low-… 8 NA 3047
4 01001020802 Alabama 0 Autauga … Non-… 10 NA 10743
5 01001021000 Alabama 0 Autauga … Non-… 5 NA 2899
6 01001021100 Alabama 0 Autauga … Low-… 6 NA 3247
7 01003010100 Alabama 0 Baldwin … Non-… 6 NA 4013
8 01003010200 Alabama 1 Baldwin … Low-… 9 NA 3067
9 01003010300 Alabama 0 Baldwin … Non-… 10 NA 8079
10 01003010400 Alabama 1 Baldwin … Non-… 9 NA 4578
# ℹ 42,168 more rows
# ℹ 19 more variables: medhhincome <dbl>, PovertyRate <dbl>, unemprate <dbl>,
# medvalue <dbl>, medrent <dbl>, pctown <dbl>, severerentburden <dbl>,
# vacancyrate <dbl>, pctwhite <dbl>, pctBlack <dbl>, pctHispanic <dbl>,
# pctAAPIalone <dbl>, pctunder18 <dbl>, pctover64 <dbl>, HSorlower <dbl>,
# BAorhigher <dbl>, Metro <dbl>, Micro <dbl>, NoCBSAType <dbl>
Can you inspect the table here to see what happened? In plain language, we told R “for those values of the column named DesignatedOZ in the ozs data table where the values are NA, assign a new value of 0.”
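One way to verify the recode is to tabulate the column - table() counts the occurrences of each unique value and drops NA values by default, so after recoding, the counts of 0 and 1 should add up to all 42,178 tracts (a sketch):

```r
# Count tracts by designation status after recoding
table(ozs$DesignatedOZ)
```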