We’ll tackle this in a loop, one page at a time.
But first, let’s do a test run on a single page of results. Notice how we can manipulate the URL: I’ll set the start_row parameter equal to 1 and change the orderby parameter to sort descending by date.
download.file( "https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?securitysys=on&start_row=1&max_rows=25&orderby=noticeDateSort%20DESC&choice=1",
"warn_listing/page001.html" )
If there’s no error message, we should be in the clear. We can now write a loop to download all 17 pages of data, pausing for 1 second between each request.
On each iteration of the code, we need only change the start_row; everything else in the URL can stay the same. The code below shows how that works, saving each page to a folder called warn_listing:
url <- "https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?securitysys=on&start_row=1&max_rows=25&orderby=noticeDateSort%20DESC&choice=1"
# Get the pages, starting at page 2 since we saved the first page in the last step:
for (p in 2:17) {
# set new start row
new_start_row <- ( (p-1) * 25 + 1 )
# build the url by using str_replace to adjust the start_row
new_url <- str_replace( url,
"(start_row=)(1)", # regex pattern
paste0( "\\1", new_start_row ) ) # replacement
download.file (new_url,
paste0 ( "warn_listing/page",
sprintf( "%03d", p ),
".html" ),
method="auto" )
Sys.sleep( 1 )
}
The next task is to parse the data on every saved html file using the rvest package. Here’s how that might look on a single page, step by step:
# read in a single page
my_html <- read_html( "warn_listing/page001.html" )
# convert html page to a series of tree-like nodes
all_rows <- html_nodes( my_html, xpath='//div[2]/table/tr' )
Look at the ‘children’ of the html nodes, row by row, to get a feel for how the data is organized. Rows one and two didn’t return any table elements, but on row three we see the first sign of data.
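One way to pull those children is with rvest’s html_children( ) - a minimal sketch, since the exact command behind the output below isn’t shown here:
# inspect the children (the <td> cells) of the third row
html_children( all_rows[[ 3 ]] )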
## {xml_nodeset (5)}
## [1] <td class="cfPadLeft cfAlternate " valign="top" align="left" colspan ...
## [2] <td class=" cfAlternate " valign="top" align="left" colspan="1">\n<s ...
## [3] <td class=" cfAlternate " valign="top" align="left" colspan="1">\n<s ...
## [4] <td class=" cfAlternate " valign="top" align="left" colspan="1">\n<s ...
## [5] <td class=" cfAlternate cfPadRight" valign="top" align="left" colspa ...
We can start looking for data one column at a time.
n = length( all_rows )
data_rows <- all_rows[ 3:n ]
# use xpath to access the first column of data <td>
html_node( data_rows[ 1 ], xpath = 'td[1]' ) %>% html_text( )
## [1] "Flying Food Group, LLC "
Note: html_table( ) is another neat command in rvest that converts html tables into dataframes. But since we need to get the URL in each row manually, it doesn’t save us any time - we may as well parse every column.
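For reference, here is roughly what that shortcut would look like on the page we just read (a sketch only; which of the page’s tables holds the listing is an assumption, and the result would still need cleanup):
# parse every <table> on the page into a list of data frames
tables <- html_table( html_nodes( my_html, "table" ) )
length( tables ) # then pick out the one holding the WARN listing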
We can automate the parsing now that we understand how to access the data in each column.
Begin by creating a table to hold our results:
# define parent url
parent_url <- "https://www.azjobconnection.gov/ada/"
# define data frame
row_data <- tibble(
page_num = character( ),
row_num = numeric( ),
full_link = character( ),
company = character( ),
co_hq_city = character( ),
co_hq_zip = character( ),
labor_area = character( ),
notice_date_txt = character( )
)
Now begin the loop.
# iterate through 17 html pages
for ( p in 1:17 ) {
# parse page items
my_html = read_html( paste0 ( "warn_listing/page", sprintf( "%03d", p ), ".html" ) )
all_rows <- html_nodes( my_html, xpath = '//div[2]/table/tr' )
n = length( all_rows )
data_rows = all_rows[ 3:n ]
new_start_row <- ( (p-1) * 25 + 1 )
num_data_rows <- length( data_rows )
# parse data_rows - begins at a multiple of 25 (plus 1)
for ( i in 1:num_data_rows ) {
j <- new_start_row + i - 1
row_data[ j, 1 ] <- paste0 ("page", sprintf( "%03d", p ), ".html") # col 1 = page_num
row_data[ j, 2 ] <- j # col 2 = row_num
# quick two-step URL extraction
row_url <- html_node( data_rows[ i ], xpath = "td[1]/span/a/@href" ) %>% html_text( )
row_data[ j, 3 ] <- paste0( parent_url, str_extract( row_url, "^.{16}(id=)\\d{1,4}" ) ) # col 3 = full_link
row_data[ j, 4 ] <- html_node( data_rows[ i ], xpath = "td[1]" ) %>% html_text( ) # col 4 = company
row_data[ j, 5 ] <- html_node( data_rows[ i ], xpath = "td[2]" ) %>% html_text( ) # col 5 = co_hq_city
row_data[ j, 6 ] <- html_node( data_rows[ i ], xpath = "td[3]" ) %>% html_text( ) # col 6 = co_hq_zip
row_data[ j, 7 ] <- html_node( data_rows[ i ], xpath = "td[4]" ) %>% html_text( ) # col 7 = labor_area
row_data[ j, 8 ] <- html_node( data_rows[ i ], xpath="td[5]" ) %>% html_text( ) # col 8 = notice_date_txt
}
# ta-da!
}
Note: The loop range is hard-coded as 1:17 here based on a visual check of the website. In a future iteration of this code, I would automate that step.
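One hypothetical way to do that for the parsing loop would be to count the pages we already saved, rather than typing 17 (a sketch, assuming every listing page landed in warn_listing):
# count the saved listing pages instead of hard-coding the total
num_pages <- length( list.files( "warn_listing", pattern = "\\.html$" ) )
# the loop above could then run as: for ( p in 1:num_pages ) { ... }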
A little bit of clean-up:
# add an id column + make the date column a datetime datatype
warn_data <- row_data %>%
mutate( id = str_extract( full_link, "\\d{1,4}"),
notice_date = mdy( notice_date_txt ) ) %>%
select( -notice_date_txt )
And don’t forget to save the results!
# save without overwriting past results
date_curr = today( )
dir.create( paste0(
"output/",
date_curr ) )
save_location = paste0(
"output/",
date_curr,
"/warn_data.csv"
)
write_csv( warn_data, save_location )
Half our work is done. Now let’s prep for the next step: scraping data from the detailed WARN notices. These are conveniently saved in our warn_data table, in the full_link column.
First, create a table with all the hyperlinks we saved.
Similar to before, we can write a loop to download these webpages. We’ll iterate through our list of links and save the page with a filepath that reflects the WARN ID. Here, since we have 420 requests instead of 17, I’ve shortened the ‘sleeping’ window to 0.75 seconds.
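Neither step is spelled out here, so below is a sketch of both, assuming the detail pages go in a folder called company_urls/warn and that each file name carries the WARN ID (the naming scheme is an assumption, chosen to match how the files are read back in later):
# build a small table of ids and links from the results we already scraped
link_table <- warn_data %>%
  select( id, full_link )
# download each detailed notice, naming the file after its WARN ID
dir.create( "company_urls/warn", recursive = TRUE, showWarnings = FALSE )
for ( i in 1:nrow( link_table ) ) {
  download.file( link_table$full_link[ i ],
                 paste0( "company_urls/warn/warn_", link_table$id[ i ], ".html" ) )
  Sys.sleep( 0.75 )
}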
On this task, we don’t need to do any manual parsing - so html_table( ) will be a big help. All we have to do is pivot the parsed data and add an id column.
One additional challenge is that we can’t walk through the pages in sequential order as we did before, since the pages are saved according to their WARN ID. We can get around that using list.files( ), which lets us iterate through all the html files in the folder.
Just as before, let’s create a table to hold the data:
# define data frame
warn_detail <- tibble(
page_name = character( ),
row_num = numeric( ),
id = character( ),
company = character( ),
co_st_address = character( ),
co_hq_city = character( ),
co_hq_state = character( ),
co_hq_zip = character( ),
num_employees_affected = numeric( ),
other_notes = character( )
)
Now begin the loop to parse the data. We’ll store it in a table called warn_detail.
# create a list of files
file_list <- list.files( path= "company_urls/warn" )
n <- length( file_list )
# loop through every scraped file to parse the data
for (i in 1:n ) {
file_name <- file_list[ i ] %>% as.character( )
my_html <- read_html( paste0( "company_urls/warn/", file_name ) )
my_table <- html_table( html_nodes( my_html, "table" )[[2]])
table_length <- nrow( my_table )
# account for layoff notices that don't have "Other Notes"
if( table_length < 7 ) {
r = tribble(
~X1, ~X2,
"Other Notes", "" )
my_table <- bind_rows( my_table, r )
}
warn_row <- my_table %>%
pivot_wider( names_from = X1, values_from = X2 ) %>%
mutate(
id = str_extract(file_name, "\\d{1,4}"),
page_name = file_name,
row_num = i,
num_employees_affected = `Number of employees affected` %>% as.numeric( ),
other_notes = `Other Notes`
) %>%
# re-order and rename columns to match data-frame
select(
page_name,
row_num,
id,
company = `Company name`,
co_st_address = `Street address`,
co_hq_city = `City`,
co_hq_state = `State`,
co_hq_zip = `Zip code`,
num_employees_affected,
other_notes
)
warn_detail <- bind_rows( warn_detail, warn_row )
}
# done!
Save the results as a csv:
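That step isn’t shown here, but following the same pattern as before it might look like this (the file name is an assumption):
# save the detailed records alongside the earlier output
write_csv( warn_detail, paste0( "output/", date_curr, "/warn_detail.csv" ) )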
Let’s combine the two sets of data - warn_data and warn_detail - into one table about layoffs.
warn_data_complete <- warn_data %>%
left_join( warn_detail, by = c( "id" = "id" ) ) %>%
select(
id,
results_page = page_num, # page number of results on WARN website
page_name, # id number within URL for detailed WARN
full_link,
company = company.x,
co_hq_city = co_hq_city.x,
co_hq_state,
co_hq_zip = co_hq_zip.x,
labor_area,
notice_date,
num_employees_affected,
other_notes
)
# save
save_location = paste0(
"output/",
date_curr,
"/warn_data_complete.csv"
)
write_csv( warn_data_complete, save_location )
We’re done. Now, we have over 400 records of layoffs dating back to 2010, which is when the first notice was posted online. We can group them together by month and year.
Below is a list of monthly layoffs, ranked by the number of distinct companies issuing notices. March 2020 tops the list by far; April is second, though the month is only halfway through. Putting the two months together, we can estimate that about 9,000 people have been impacted by layoffs or furloughs in Arizona since covid-19 sparked a pandemic. It’s a sobering record, and one that still leaves out entrepreneurs, small businesses and government employees on furlough.
warn_data_complete %>%
mutate( m = month( notice_date ),
y = year( notice_date ),
mo_yr = paste(m, y, sep = "-" ) ) %>%
group_by( mo_yr ) %>%
summarise(
notices = n( ),
distinct_companies = n_distinct( company ),
employees_affected = sum( num_employees_affected )
) %>%
arrange( desc( distinct_companies ) ) %>%
head( )
## # A tibble: 6 x 4
## mo_yr notices distinct_companies employees_affected
## <chr> <int> <int> <dbl>
## 1 3-2020 44 38 6288
## 2 4-2020 27 26 2821
## 3 10-2019 10 10 1528
## 4 8-2015 12 8 1298
## 5 10-2014 7 7 2636
## 6 12-2019 7 7 1074
Which companies have issued the most layoffs due to coronavirus?
warn_data_complete %>%
filter( str_detect( other_notes, "covid|coronavirus|Covid|Coronavirus|COVID" ) ) %>%
select( company, co_hq_city, co_hq_state, notice_date, num_employees_affected ) %>%
arrange( desc( num_employees_affected ) ) %>%
head( 10 )
## # A tibble: 10 x 5
## company co_hq_city co_hq_state notice_date num_employees_affe…
## <chr> <chr> <chr> <date> <dbl>
## 1 Sam Levits Tucson AZ 2020-03-25 628
## 2 Sub-Zero Group… Goodyear AZ 2020-03-20 446
## 3 Four Seasons R… Scottsdale AZ 2020-03-20 422
## 4 Taylor farms Salinas CA 2020-04-07 407
## 5 Taylor Farms Yuma AZ 2020-03-27 407
## 6 Yelp Scottsdale AZ 2020-04-09 375
## 7 Zip Recruiter Santa Monic… CA 2020-03-27 314
## 8 Great Wolf Lod… Scottsdale AZ 2020-03-30 312
## 9 Cinemark Tucson AZ 2020-04-01 292
## 10 Go Rentals Newport CA 2020-03-23 275
There’s so much more to analyze - but for now, we’ll adjourn.
Sarah Cohen generously got me started on this by sharing code that took me through most of Part 1. I used what I learned from reading her code to write Part 2 and assemble this tutorial.
The Arizona Department of Economic Security also has an online listing of “non-WARN” notices: companies not bound by the WARN Act that choose to submit layoff notices to the state anyway. They are issued much less frequently, and this analysis excludes them.
It’s worth noting, too, that while March and April are record-setting months for layoffs in our dataset, we don’t have online records going back to the housing crash of 2008. With this data alone, we can’t tell how these numbers compare to the last great recession.
Finally, this code-through is written as a one-time exercise, but it would be helpful to have another script that scrapes only the most recent layoffs and discards any duplicates already stored. Or, one that scrapes automatically when it detects a website change. That would save me and the server some time.
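As a rough illustration of the dedupe idea, the check could be as simple as filtering on WARN IDs we already have (a sketch only; new_scrape stands in for the result of a hypothetical future re-run of the scraper):
# stand-in: pretend `new_scrape` came from a fresh run of the scraper
new_scrape <- warn_data_complete
# ids we've already stored
previous_ids <- read_csv( save_location )$id %>% as.character( )
# keep only notices we haven't seen before, then append them to the stored csv
fresh_rows <- new_scrape %>% filter( !( id %in% previous_ids ) )
write_csv( fresh_rows, save_location, append = TRUE )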