By the end of this lesson, you should be familiar with:
Basic tools for selecting and subsetting data
Strategies for describing data
Getting Set Up
The Data
So far, we’ve worked with “tiny” bits of data - mainly things that we input manually. It’s time to start working with some real world data!
We’re going to work with data on those census tracts that were designated as Opportunity Zones as part of the federal Tax Cuts and Jobs Act of 2017. These incentives are designed to spur investment in low-income and undercapitalized cities by providing investors with tax incentives to invest capital in these locations. Each state had the opportunity to designate specific census tracts as Opportunity Zones. Practitioners and researchers have many questions about the efficacy of the program and the designations made by governors.
Take a look here to see a map of where designated opportunity zones are located. The orange geometries reflected on the map are census tracts, which we often use as a proxy for neighborhoods, especially in urban areas. Find a place you know, and take a look at which areas are designated.
The specific data we’ll work with comes from the Urban Institute - they have joined the IRS’s list of Opportunity Zones designated by the Tax Cuts and Jobs Act to a series of indicators focused on investment potential. A copy of the Urban Institute’s dataset is available here for download.
You’ll need to authenticate and log in to UIUC Box to access this file. You can also download the data directly from Urban Institute’s Opportunity Zone landing page.
Extending Base R With Packages
We’ve been working exclusively in “base” R as we are getting familiar with the R language and RStudio interface. This means we’ve been working with the commands and functions that come with every new installation of R.
There’s a whole world of other functions that we can use to extend R’s functionality - these packages can help us do things like load data with specific formats, make visualizations, and run specific types of statistical models. There are more than 20,000 packages that users have created to extend R’s functionality - you can find a list here. We’ll introduce common packages which help with neighborhood analysis tasks, but your favorite search engine is probably the easiest way to find packages or functions that may help you read specific data or perform certain types of tasks.
Readxl
Our first exploration with packages will help us read in our Opportunity Zone data. We will use a package called readxl which is designed to, well, read Microsoft Excel files such as the Opportunity Zone data.
Installing Readxl
We will use the readxl package to help us read our Opportunity Zone data from an Excel workbook into RStudio so we can view it as a data frame. The first time you want to make use of a package you have not used before, you’ll need to install it, which essentially means R needs to download it from a repository on the internet and prepare it to be accessed on your computer.
We use the install.packages() command to automatically download and prepare the package for use:
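The command looks like this - note that you only need to run it once per machine:

```r
# Download readxl from CRAN and prepare it for use on this computer
install.packages("readxl")
```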
Note that the name of the package is in quotes here. Any guesses as to why?
Note that we only need to install the package one time. It will remain available on your machine for use in the future.
Loading Readxl
In our previous step, we installed the readxl package, which involved downloading it from an internet repository and preparing it for use. There’s an additional step - now we need to load the package, which tells R that we want to make the package’s functions available for use in our current R session. The library() command loads packages for use in our current R session. While we only need to install a package one time, each time you start a new R session, you’ll need to load the packages you want to use for a particular kind of additional functionality.
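Loading the package looks like this:

```r
# Make readxl's functions available in the current R session
library(readxl)
```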
Note that the name of the package is not in quotes here. Any guesses as to why?
Now that the readxl package is loaded, we can make use of it to read our Opportunity Zone data into our R session.
Takeaways
In general, you will only need to use install.packages() once (once a package is installed, you can use it for all R sessions moving forward).
To load packages, we use the library() command. This will load an already installed package and make it accessible within our current R session. You will need to load already installed packages at the beginning of each new R session. Typically, it is a good practice to load the packages you’ll use in a script at the beginning of the script.
Note that to install the package, you need to treat the package name as a character vector ("readxl"), but when you load it in your R session, it does not need to be quoted (readxl).
Read in Data
Now that we’ve loaded the readxl package, we can import the Excel file containing data on tracts designated as opportunity zones in the United States. To learn about the functions in the package, you can either do a Google search for readxl, or you can use R’s built-in documentation by typing ?readxl:
?readxl
Note that running this command pops open documentation for readxl in the Help tab of the auxiliary pane. As the documentation states, readxl imports Excel files. Looking at the documentation, the read_excel() command will read a single Excel sheet, or we can optionally select a sheet by name or number from a workbook with multiple sheets. In this case, the Urban Institute data is in a workbook with a single sheet, so we just need to tell R where the file is located in order to load it.
The dataset we want to load is called “urbaninstitute_tractlevelozanalysis_update01142021.xlsx” (oof! - that’s a descriptive but way too long file name!). We can point R to the correct location. Since our R project file sets the relative path for all of the work within, the path to the data is: "data/urbaninstitute_tractlevelozanalysis_update01142021.xlsx". Wrapped into the command to read the Excel file, it looks like this:
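As a sketch - assuming the workbook sits in a data folder within the project directory:

```r
library(readxl)

# Read the single sheet from the Urban Institute workbook
read_excel("data/urbaninstitute_tractlevelozanalysis_update01142021.xlsx")
```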
R read the data and displayed it to us. But there’s one problem - we can look at the data, but it’s not really available for our continued use. We typically want to read data and store it as an object so that we can refer back to it and even modify it.
Let’s go ahead and read the Excel data in again, but this time, we’ll assign it to an object called “ozs”:
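Using the assignment operator, the sketch becomes (again assuming the file sits in the project's data folder):

```r
library(readxl)

# Read the workbook and store the result as a data frame named ozs
ozs <- read_excel("data/urbaninstitute_tractlevelozanalysis_update01142021.xlsx")
```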
Look at your Environment window (top right quadrant of RStudio) - a data frame containing information on opportunity zones should be loaded in an object called “ozs”.
The environment window tells us that the object ozs contains 42,178 observations (rows) and 27 variables (columns).
If we type the name of the object, we can view its contents:
ozs
# A tibble: 42,178 × 27
geoid state DesignatedOZ county Type dec_score SE_Flag Population
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 01001020200 Alabama NA Autauga … Low-… 4 NA 2196
2 01001020300 Alabama NA Autauga … Non-… 6 NA 3136
3 01001020700 Alabama 1 Autauga … Low-… 8 NA 3047
4 01001020802 Alabama NA Autauga … Non-… 10 NA 10743
5 01001021000 Alabama NA Autauga … Non-… 5 NA 2899
6 01001021100 Alabama NA Autauga … Low-… 6 NA 3247
7 01003010100 Alabama NA Baldwin … Non-… 6 NA 4013
8 01003010200 Alabama 1 Baldwin … Low-… 9 NA 3067
9 01003010300 Alabama NA Baldwin … Non-… 10 NA 8079
10 01003010400 Alabama 1 Baldwin … Non-… 9 NA 4578
# ℹ 42,168 more rows
# ℹ 19 more variables: medhhincome <dbl>, PovertyRate <dbl>, unemprate <dbl>,
# medvalue <dbl>, medrent <dbl>, pctown <dbl>, severerentburden <dbl>,
# vacancyrate <dbl>, pctwhite <dbl>, pctBlack <dbl>, pctHispanic <dbl>,
# pctAAPIalone <dbl>, pctunder18 <dbl>, pctover64 <dbl>, HSorlower <dbl>,
# BAorhigher <dbl>, Metro <dbl>, Micro <dbl>, NoCBSAType <dbl>
You could also inspect the dataset using the View() command, which will allow us to look at the data in a tabular format. The str() command instead summarizes the structure of the data:
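Assuming the data is loaded as ozs, the structure summary shown below comes from str():

```r
# Open the data in RStudio's spreadsheet-style viewer
View(ozs)

# Summarize each column's name, type, and first few values
str(ozs)
```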
tibble [42,178 × 27] (S3: tbl_df/tbl/data.frame)
$ geoid : chr [1:42178] "01001020200" "01001020300" "01001020700" "01001020802" ...
$ state : chr [1:42178] "Alabama" "Alabama" "Alabama" "Alabama" ...
$ DesignatedOZ : num [1:42178] NA NA 1 NA NA NA NA 1 NA 1 ...
$ county : chr [1:42178] "Autauga County" "Autauga County" "Autauga County" "Autauga County" ...
$ Type : chr [1:42178] "Low-Income Community" "Non-LIC Contiguous" "Low-Income Community" "Non-LIC Contiguous" ...
$ dec_score : num [1:42178] 4 6 8 10 5 6 6 9 10 9 ...
$ SE_Flag : num [1:42178] NA NA NA NA NA NA NA NA NA NA ...
$ Population : num [1:42178] 2196 3136 3047 10743 2899 ...
$ medhhincome : num [1:42178] 41107 51250 45234 61242 49567 ...
$ PovertyRate : num [1:42178] 0.24 0.107 0.19 0.153 0.151 ...
$ unemprate : num [1:42178] 0.0775 0.051 0.1407 0.0459 0.0289 ...
$ medvalue : num [1:42178] 95300 113800 93500 160400 102900 ...
$ medrent : num [1:42178] 743 817 695 1018 546 ...
$ pctown : num [1:42178] 0.628 0.703 0.711 0.823 0.83 ...
$ severerentburden: num [1:42178] 0.3269 0.3223 0.3887 0.1994 0.0994 ...
$ vacancyrate : num [1:42178] 0.0584 0.1399 0.0619 0.0609 0.2182 ...
$ pctwhite : num [1:42178] 0.439 0.671 0.833 0.814 0.726 ...
$ pctBlack : num [1:42178] 0.5187 0.205 0.0922 0.1572 0.2456 ...
$ pctHispanic : num [1:42178] 0.01275 0.0727 0.0338 0.01368 0.00448 ...
$ pctAAPIalone : num [1:42178] 0.01093 0.01052 0 0.00959 0 ...
$ pctunder18 : num [1:42178] 0.218 0.224 0.249 0.27 0.245 ...
$ pctover64 : num [1:42178] 0.124 0.175 0.149 0.122 0.156 ...
$ HSorlower : num [1:42178] 0.581 0.464 0.544 0.45 0.621 ...
$ BAorhigher : num [1:42178] 0.162 0.219 0.113 0.229 0.136 ...
$ Metro : num [1:42178] 1 1 1 1 1 1 1 1 1 1 ...
$ Micro : num [1:42178] NA NA NA NA NA NA NA NA NA NA ...
$ NoCBSAType : num [1:42178] NA NA NA NA NA NA NA NA NA NA ...
We get a list of the columns in the data, along with their types (in this case character or numeric), and then we see the values associated with the first few observations.
A few things to note after your preliminary inspection:
These data are at the census tract level and include geographic identifiers including geoid, the combined state-county-tract FIPS code; state, the state name; and county, the county name.
These data include a field named DesignatedOZ which is 1 when an eligible tract was designated as an opportunity zone, and NA where the tract was not designated.
The dataset also includes some other tract-level demographic measures, as well as additional geographic flags.
Describing the Data
When we load a new dataset that we’re not familiar with, it’s a good idea to spend a few minutes describing the data. This allows us to understand a bit more about its structure and may also help us ask some basic questions about the validity and reliability of the data (at least compared to what we are expecting to see). R has several functions for determining the structure of data frames and tibbles. See below:
Size
dim(ozs): returns a vector with the number of rows as the first element, and the number of columns as the second element (the dimensions of the object)
head(ozs) and tail(ozs): return the first six or last six rows of the data, respectively
str(ozs): displays the structure of the data, including each column’s name, type, and first few values
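For example, dim() confirms what the Environment window told us:

```r
# Dimensions of the Opportunity Zone data: rows, then columns
dim(ozs)
# [1] 42178    27
```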
tail(ozs)
# A tibble: 6 × 27
geoid state DesignatedOZ county Type dec_score SE_Flag Population medhhincome
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 7803… <NA> NA <NA> Non-… NA NA NA NA
2 7803… Virg… 1 <NA> Non-… NA NA NA NA
3 7803… Virg… 1 <NA> Low-… NA NA NA NA
4 7803… Virg… 1 <NA> Low-… NA NA NA NA
5 7803… Virg… 1 <NA> Low-… NA NA NA NA
6 7803… Virg… 1 <NA> Low-… NA NA NA NA
# ℹ 18 more variables: PovertyRate <dbl>, unemprate <dbl>, medvalue <dbl>,
# medrent <dbl>, pctown <dbl>, severerentburden <dbl>, vacancyrate <dbl>,
# pctwhite <dbl>, pctBlack <dbl>, pctHispanic <dbl>, pctAAPIalone <dbl>,
# pctunder18 <dbl>, pctover64 <dbl>, HSorlower <dbl>, BAorhigher <dbl>,
# Metro <dbl>, Micro <dbl>, NoCBSAType <dbl>
Practice your querying skills - how would we return only those records for census tracts with a median household income of $100,000 or more per year? (Keep in mind that where medhhincome is missing, the comparison returns NA rather than TRUE or FALSE, so base R returns an NA-filled row for each of these tracts.)
ozs[ozs$medhhincome>=100000,]
# A tibble: 399 × 27
geoid state DesignatedOZ county Type dec_score SE_Flag Population
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 <NA> <NA> NA <NA> <NA> NA NA NA
2 <NA> <NA> NA <NA> <NA> NA NA NA
3 <NA> <NA> NA <NA> <NA> NA NA NA
4 <NA> <NA> NA <NA> <NA> NA NA NA
5 04013815600 Arizona NA Maricopa… Non-… 10 NA 5762
6 04013816800 Arizona NA Maricopa… Non-… 10 1 3727
7 <NA> <NA> NA <NA> <NA> NA NA NA
8 <NA> <NA> NA <NA> <NA> NA NA NA
9 <NA> <NA> NA <NA> <NA> NA NA NA
10 <NA> <NA> NA <NA> <NA> NA NA NA
# ℹ 389 more rows
# ℹ 19 more variables: medhhincome <dbl>, PovertyRate <dbl>, unemprate <dbl>,
# medvalue <dbl>, medrent <dbl>, pctown <dbl>, severerentburden <dbl>,
# vacancyrate <dbl>, pctwhite <dbl>, pctBlack <dbl>, pctHispanic <dbl>,
# pctAAPIalone <dbl>, pctunder18 <dbl>, pctover64 <dbl>, HSorlower <dbl>,
# BAorhigher <dbl>, Metro <dbl>, Micro <dbl>, NoCBSAType <dbl>
Tracts in Illinois
Oftentimes, we’ll want to query out a subset of observations based upon their geographic location. Let’s try selecting all tracts in Illinois so we can examine their designation status.
How would we query out tracts in Illinois? Experiment in your own script.
ozs[ozs$state=="Illinois",]
# A tibble: 1,682 × 27
geoid state DesignatedOZ county Type dec_score SE_Flag Population
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 17001000201 Illinois NA Adams C… Non-… 7 NA 1937
2 17001000202 Illinois NA Adams C… Low-… 1 NA 2563
3 17001000400 Illinois NA Adams C… Low-… 1 NA 3403
4 17001000500 Illinois NA Adams C… Low-… 1 NA 2298
5 17001000700 Illinois NA Adams C… Low-… 1 NA 1259
6 17001000800 Illinois 1 Adams C… Low-… 1 NA 2700
7 17001000900 Illinois NA Adams C… Low-… 5 NA 2671
8 17001010100 Illinois NA Adams C… Non-… 2 NA 4323
9 17001010200 Illinois NA Adams C… Low-… 2 NA 3436
10 17001010300 Illinois NA Adams C… Non-… 8 NA 6038
# ℹ 1,672 more rows
# ℹ 19 more variables: medhhincome <dbl>, PovertyRate <dbl>, unemprate <dbl>,
# medvalue <dbl>, medrent <dbl>, pctown <dbl>, severerentburden <dbl>,
# vacancyrate <dbl>, pctwhite <dbl>, pctBlack <dbl>, pctHispanic <dbl>,
# pctAAPIalone <dbl>, pctunder18 <dbl>, pctover64 <dbl>, HSorlower <dbl>,
# BAorhigher <dbl>, Metro <dbl>, Micro <dbl>, NoCBSAType <dbl>
We can see in our table output that there are 1,682 eligible or designated tracts in Illinois. We could also use the nrow() command to count the number of rows.
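The same count with nrow(), as a sketch:

```r
# Number of rows (tracts) in the Illinois subset
nrow(ozs[ozs$state == "Illinois", ])
# [1] 1682
```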
We might also want to calculate statistics like averages for subsets. mean() will calculate the mean of a list or column. What’s the average income for tracts with a vacancy rate above 20 percent?
You might need to check out the documentation for mean() in order to return an answer here. R will not calculate the mean if any NA values are present in the vector for which you’ve requested the mean - a good safety feature if you’re expecting all values to be present.
With imperfect data like what we’re dealing with, you can instruct R to remove those NAs and calculate the mean of the remaining values. You may also want to count the number of NA values so you know what proportion of your data the mean actually represents.
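A sketch of that calculation, with na.rm = TRUE telling mean() to drop missing values first (the high_vacancy_incomes name here is ours, chosen for illustration):

```r
# Median household incomes for tracts with vacancy rates above 20 percent
high_vacancy_incomes <- ozs$medhhincome[ozs$vacancyrate > 0.20]

# How many of these values are missing?
sum(is.na(high_vacancy_incomes))

# Average income across the remaining tracts, dropping NA values
mean(high_vacancy_incomes, na.rm = TRUE)
```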
Subsetting
We might also be interested in combining query criteria. R can make use of logical statements. & is equivalent to AND and | is equivalent to OR. Now give it a go!
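For instance, combining two conditions with & returns designated tracts in Illinois (note that rows where DesignatedOZ is NA will again come back as NA-filled rows):

```r
# Tracts that are in Illinois AND are designated Opportunity Zones
ozs[ozs$state == "Illinois" & ozs$DesignatedOZ == 1, ]
```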
You’ll note that there are several flag variables in the data for which values are either 1 or NA. We have flags for whether a tract was designated as an Opportunity Zone, and whether it is located in a Metropolitan, Micropolitan, or Non-Core-Based Statistical Area.
We can use logical tests in R to identify those values that are NA. We can use is.na() to test whether a value is NA (TRUE) or is not NA (FALSE). We could also use the negation sign ! to determine whether a value is not NA (!is.na()).
The following code returns a vector of logical values (TRUE / FALSE) indicating whether the value of the DesignatedOZ column is NA or not.
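A minimal sketch, using head() to show just the first few values:

```r
# TRUE where DesignatedOZ is NA (undesignated), FALSE where it is 1
head(is.na(ozs$DesignatedOZ))
# [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
```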
For logical values, R treats TRUE as 1 and FALSE as 0, meaning that if we wanted to count the number of undesignated tracts, we could ask for the sum of the values for which the logical test is TRUE (the sum of the values that are NA):
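In code, the count looks like this (a sketch):

```r
# Each TRUE counts as 1, so the sum is the number of NA (undesignated) tracts
sum(is.na(ozs$DesignatedOZ))
```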
We might also want to recode those NA values to something else. We can use assignment and subset notation to replace NA values with something else. Let’s replace those NAs in the DesignatedOZ column with 0.
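A sketch of the recode, followed by printing the table to check the result:

```r
# Where DesignatedOZ is NA, assign 0 in place
ozs$DesignatedOZ[is.na(ozs$DesignatedOZ)] <- 0

ozs
```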
# A tibble: 42,178 × 27
geoid state DesignatedOZ county Type dec_score SE_Flag Population
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 01001020200 Alabama 0 Autauga … Low-… 4 NA 2196
2 01001020300 Alabama 0 Autauga … Non-… 6 NA 3136
3 01001020700 Alabama 1 Autauga … Low-… 8 NA 3047
4 01001020802 Alabama 0 Autauga … Non-… 10 NA 10743
5 01001021000 Alabama 0 Autauga … Non-… 5 NA 2899
6 01001021100 Alabama 0 Autauga … Low-… 6 NA 3247
7 01003010100 Alabama 0 Baldwin … Non-… 6 NA 4013
8 01003010200 Alabama 1 Baldwin … Low-… 9 NA 3067
9 01003010300 Alabama 0 Baldwin … Non-… 10 NA 8079
10 01003010400 Alabama 1 Baldwin … Non-… 9 NA 4578
# ℹ 42,168 more rows
# ℹ 19 more variables: medhhincome <dbl>, PovertyRate <dbl>, unemprate <dbl>,
# medvalue <dbl>, medrent <dbl>, pctown <dbl>, severerentburden <dbl>,
# vacancyrate <dbl>, pctwhite <dbl>, pctBlack <dbl>, pctHispanic <dbl>,
# pctAAPIalone <dbl>, pctunder18 <dbl>, pctover64 <dbl>, HSorlower <dbl>,
# BAorhigher <dbl>, Metro <dbl>, Micro <dbl>, NoCBSAType <dbl>
Can you inspect the table here to see what happened? In plain language, we told R “for those values of the column named DesignatedOZ in the ozs data table where the values are NA, assign a new value of 0.”
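One way to verify the recode is to tabulate the column - table() counts the occurrences of each unique value and drops NA values by default, so after recoding, the counts of 0 and 1 should add up to all 42,178 tracts (a sketch):

```r
# Count tracts by designation status after recoding
table(ozs$DesignatedOZ)
```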