Scraping in the time of coronavirus

By late March, two weeks into the official pandemic, requests for unemployment assistance were setting records in Arizona. Gov. Doug Ducey issued orders to “stay home, stay healthy, stay connected” on March 30, and most restaurants and stores had closed down. Still, it was unclear which industries and employers were laying off workers due to covid-19.

This walkthrough gathers the data to help answer that question via a scraping exercise in R. We’ll go from a clunky government website to a single data table that shows the business impact, still ongoing, of coronavirus over time.

WARN listings

Companies with at least 100 employees are required by law to submit layoff notices to the state under the Worker Adjustment and Retraining Notification (WARN) Act. There are other specifics under the law, but the 100-employee rule broadly holds - and some companies with fewer than 100 employees issue notices, too. WARN does not apply to government agencies.

Here’s how the state organizes listings online:

The WARN homepage in Arizona

The website makes it difficult to get data. Every row links to a new page with details about how many people were laid off at the company, but there’s no bulk download button. To collect data on a large scale, we’d have to undertake a lot of time-consuming, error-prone clicking around.

If I click on Yelp, I see the detailed notice - 375 layoffs.

But before we jump into scraping, let’s examine the URL and the underlying data.

https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm?securitysys=on&start_row=26&max_rows=25&orderby=employer&choice=1

When I page through the site, I see that a maximum of 25 layoff notices can be displayed at one time. The URL makes that clear: max_rows is set to 25. On some websites, typing over the 25 with a larger number will trick the system into displaying all the data, but it doesn’t work here. The website appears to be one giant table divided into 25-row sections: page 2 shows rows 26-50, page 3 shows rows 51-75 and so on.

As of April 13, 2020, there are 420 rows - 420 layoff notices - spread over 17 pages.

Though we can’t trick the website into displaying all the data at once, we can manipulate other keys in the URL.

Note the different parameters, each marked by an equals sign, in the website address. Isolated, they are:

  • securitysys=on
  • start_row=26
  • max_rows=25
  • orderby=employer
  • choice=1

Right now, the data is sorted by employer name. I’ll click the Notice Date column header twice to make sure the most recent layoffs are shown first. This changes the orderby field to noticeDateSort%20DESC, which will be useful later on. When we come back to this exercise, for example, we might scrape only the most recent results.

Click the column header twice to arrange the most recent results first

Now, with a better understanding of the website’s structure, we can get ready to scrape.

Plan of attack

We’ll scrape the site in two parts:

  • First, we’ll scrape the data on the outer ‘shell’ of the website - including the hyperlinks that take us to details on the layoffs
  • Second, we’ll scrape the detailed listings for extra information - such as how many employees are affected

Part 1: Scraping the first layer of data

Downloading

We’ll tackle this in a loop, one page at a time.

But first, let’s do a test run on a single page of results. Notice how we can manipulate the URL: I’ll set start_row to 1 and change the orderby parameter to sort descending by date.
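A minimal sketch of that test request - download.file() and the file name here are my own choices, not necessarily the original approach:

# Test: download one page of results, sorted by notice date (newest first).
dir.create("warn_listing", showWarnings = FALSE)

test_url <- paste0(
  "https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm",
  "?securitysys=on&start_row=1&max_rows=25",
  "&orderby=noticeDateSort%20DESC&choice=1"
)

download.file(test_url, destfile = "warn_listing/test_page.html", quiet = TRUE)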

If there’s no error message, we should be in the clear. We can now write a loop to download all 17 pages of data, pausing for 1 second between each request.

On each iteration of the code, we need only to change the start_row; everything else in the URL can stay the same. The code below shows how that works, saving each page to a folder called warn_listing:
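A sketch of that loop, again using download.file() and saving the pages as page_1.html through page_17.html (the file names are my own choices):

# Download all 17 pages, 25 rows at a time, pausing 1 second between requests.
dir.create("warn_listing", showWarnings = FALSE)

for (page_num in 1:17) {
  start_row <- (page_num - 1) * 25 + 1
  page_url <- paste0(
    "https://www.azjobconnection.gov/ada/mn_warn_dsp.cfm",
    "?securitysys=on&start_row=", start_row,
    "&max_rows=25&orderby=noticeDateSort%20DESC&choice=1"
  )
  download.file(page_url, destfile = paste0("warn_listing/page_", page_num, ".html"), quiet = TRUE)
  Sys.sleep(1)
}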

Parsing the data - setup

The next task is to parse the data on every saved html file using the rvest package. Here’s how that might look on a single page, step by step:
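To begin, read one of the saved pages into R and collect its table rows (the tr selector is an assumption about the page’s markup):

# Read the first saved page and collect its table rows.
library(rvest)

page <- read_html("warn_listing/page_1.html")
rows <- html_nodes(page, "tr")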

Look at the ‘children’ of the html nodes, row by row, to get a feel for how the data is organized. Rows one and two didn’t return any table elements, but on row three we see the first sign of data:
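In code, that check might look something like this, using the rows collected above:

html_children(rows[[3]])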

## {xml_nodeset (5)}
## [1] <td class="cfPadLeft cfAlternate " valign="top" align="left" colspan ...
## [2] <td class=" cfAlternate " valign="top" align="left" colspan="1">\n<s ...
## [3] <td class=" cfAlternate " valign="top" align="left" colspan="1">\n<s ...
## [4] <td class=" cfAlternate " valign="top" align="left" colspan="1">\n<s ...
## [5] <td class=" cfAlternate cfPadRight" valign="top" align="left" colspa ...

We can start looking for data one column at a time.
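Continuing the sketch, the employer name appears to sit in the first cell - the cell position is an assumption based on the output that follows:

# Pull the text of the first cell in the first data row.
cells <- html_children(rows[[3]])
html_text(cells[[1]])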

## [1] "Flying Food Group, LLC  "

Note: html_table() is another neat command in rvest that converts html tables into dataframes. But since we need to get the URL in each row manually, it doesn’t save us any time - we may as well parse every column.

Parsing the data in a loop

We can automate the parsing now that we understand how to access the data in each column.

Begin by creating a table to hold our results:
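A sketch of that empty table; apart from full_link, which the rest of the walkthrough relies on, the column names are my own choices:

# One row per WARN notice will be added to this table.
library(tibble)

warn_data <- tibble(
  company     = character(),
  notice_date = character(),
  full_link   = character()
)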

Now begin the loop.

Note: The loop range is hard coded as 1:17 here based on my visual analysis of the website. In a future iteration of this code, I would automate that process.
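A sketch of the parsing loop. The cell positions and link handling here are guesses and would need to be checked against the real pages:

library(rvest)
library(tibble)

for (page_num in 1:17) {
  page <- read_html(paste0("warn_listing/page_", page_num, ".html"))
  rows <- html_nodes(page, "tr")

  for (i in seq_along(rows)) {
    cells <- html_children(rows[[i]])
    if (length(cells) != 5) next  # skip rows that don't match the five-column layout

    # Each data row links to its detail page; the link is assumed to sit in the first cell.
    link <- html_attr(html_node(cells[[1]], "a"), "href")
    if (is.na(link)) next  # header rows carry no detail link

    warn_data <- add_row(
      warn_data,
      company     = html_text(cells[[1]]),
      notice_date = html_text(cells[[4]]),  # column position is a guess
      full_link   = xml2::url_absolute(link, "https://www.azjobconnection.gov/ada/")
    )
  }
}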

A little bit of clean-up:
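For example - the exact steps depend on the raw text, and the date format is an assumption:

# Trim stray whitespace and turn the notice date into a real date.
library(dplyr)
library(lubridate)

warn_data <- warn_data %>%
  mutate(
    company     = trimws(company),
    notice_date = mdy(notice_date)  # assumes a month/day/year format on the site
  )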

And don’t forget to save the results!
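Something like this, with the file name being an arbitrary choice:

library(readr)
write_csv(warn_data, "warn_data.csv")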

Half our work is done. Now let’s prep for the next step: scraping data from the detailed WARN notices. These are conveniently saved in our warn_data table in the full_link column.

Part 2: Scraping the detailed layoff data

Downloading

First, create a table with all the hyperlinks we saved.
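A sketch of that table; pulling the WARN ID off the end of each URL is an assumption about how the links are structured:

# One row per detail page, keyed by WARN ID.
library(dplyr)
library(stringr)

warn_links <- warn_data %>%
  distinct(full_link) %>%
  mutate(id = str_extract(full_link, "[0-9]+$"))  # assumes the WARN ID ends the URL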

Similar to before, we can write a loop to download these webpages. We’ll iterate through our list of links and save the page with a filepath that reflects the WARN ID. Here, since we have 420 requests instead of 17, I’ve shortened the ‘sleeping’ window to 0.75 seconds.
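A sketch of that download loop, using the warn_links table above and a folder name of my own choosing:

# Save each detail page under its WARN ID.
dir.create("warn_detail_pages", showWarnings = FALSE)

for (i in seq_len(nrow(warn_links))) {
  download.file(
    warn_links$full_link[i],
    destfile = paste0("warn_detail_pages/", warn_links$id[i], ".html"),
    quiet = TRUE
  )
  Sys.sleep(0.75)  # a shorter pause, since there are 420 requests instead of 17
}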

Parsing the data in a loop

On this task, we don’t need to do any manual parsing - so html_table() will be a big help. All we have to do is pivot the parsed data and add an id column.

One additional challenge is that we can’t walk through the pages in sequential order as we did before, since the pages are saved according to their WARN ID. We can get around that using list.files(), which lets us iterate through all the html files in the folder.

Just as before, let’s create a table to hold the data:
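One way to set that up is an empty tibble, letting the columns come from the pivoted detail pages:

library(tibble)
warn_detail <- tibble()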

Now begin the loop to parse the data. We’ll store it in a table called warn_detail.
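A sketch of the loop. It assumes the detail table is the first table on each saved page and that it comes as two columns of field names and values:

library(rvest)
library(dplyr)
library(tidyr)

detail_files <- list.files("warn_detail_pages", pattern = "\\.html$", full.names = TRUE)

for (file in detail_files) {
  page <- read_html(file)

  fields <- html_table(page)[[1]]       # assumes the first table on the page holds the details
  names(fields) <- c("field", "value")  # assumes a two-column field/value layout

  row <- fields %>%
    pivot_wider(names_from = field, values_from = value) %>%
    mutate(across(everything(), as.character),
           id = gsub("\\.html$", "", basename(file)))  # the WARN ID comes from the file name

  warn_detail <- bind_rows(warn_detail, row)
}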

Save the results as a csv:
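As before, the file name is an arbitrary choice:

library(readr)
write_csv(warn_detail, "warn_detail.csv")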

Review

Let’s combine the two sets of data - warn_data and warn_detail - into one table about layoffs.
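A sketch of the join. It assumes both tables share the WARN ID and that the detail fields carry tidy column names such as num_employees_affected - names taken from the printed results below, which would really come from a renaming step after parsing:

library(dplyr)
library(stringr)

layoffs <- warn_data %>%
  mutate(id = str_extract(full_link, "[0-9]+$")) %>%     # same ID used to name the saved files
  left_join(warn_detail, by = "id") %>%
  mutate(num_employees_affected = as.numeric(num_employees_affected))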

We’re done. Now, we have over 400 records of layoffs dating back to 2010, which is when the first notice was posted online. We can group them together by month and year.
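A sketch of that roll-up, with the summary column names taken from the printed results:

library(dplyr)
library(lubridate)

monthly <- layoffs %>%
  mutate(mo_yr = paste0(month(notice_date), "-", year(notice_date))) %>%
  group_by(mo_yr) %>%
  summarize(
    notices            = n(),
    distinct_companies = n_distinct(company),
    employees_affected = sum(num_employees_affected, na.rm = TRUE)
  )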

Below are the months with the most layoff activity. March 2020 tops the list by far; April is second, even though the month is only half over. Putting the two together, we can estimate that about 9,000 people have been impacted by layoffs or furloughs in Arizona since covid-19 sparked a pandemic. It’s a sobering record - and one that still leaves out entrepreneurs, small businesses and government employees on furlough.

## # A tibble: 6 x 4
##   mo_yr   notices distinct_companies employees_affected
##   <chr>     <int>              <int>              <dbl>
## 1 3-2020       44                 38               6288
## 2 4-2020       27                 26               2821
## 3 10-2019      10                 10               1528
## 4 8-2015       12                  8               1298
## 5 10-2014       7                  7               2636
## 6 12-2019       7                  7               1074

Which companies have issued the most layoffs due to coronavirus?
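One way to pull that list - filtering to notices filed since March 1, 2020 and sorting by the number of employees affected (a sketch, assuming the column names from the join above):

library(dplyr)

layoffs %>%
  filter(notice_date >= as.Date("2020-03-01")) %>%
  arrange(desc(num_employees_affected)) %>%
  select(company, co_hq_city, co_hq_state, notice_date, num_employees_affected) %>%
  head(10)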

## # A tibble: 10 x 5
##    company         co_hq_city   co_hq_state notice_date num_employees_affe…
##    <chr>           <chr>        <chr>       <date>                    <dbl>
##  1 Sam Levits      Tucson       AZ          2020-03-25                  628
##  2 Sub-Zero Group… Goodyear     AZ          2020-03-20                  446
##  3 Four Seasons R… Scottsdale   AZ          2020-03-20                  422
##  4 Taylor farms    Salinas      CA          2020-04-07                  407
##  5 Taylor Farms    Yuma         AZ          2020-03-27                  407
##  6 Yelp            Scottsdale   AZ          2020-04-09                  375
##  7 Zip Recruiter   Santa Monic… CA          2020-03-27                  314
##  8 Great Wolf Lod… Scottsdale   AZ          2020-03-30                  312
##  9 Cinemark        Tucson       AZ          2020-04-01                  292
## 10 Go Rentals      Newport      CA          2020-03-23                  275

There’s so much more to analyze - but for now, we’ll adjourn.

Credits, caveats and next steps

Sarah Cohen generously got me started on this by sharing code that took me through most of Part 1. I used what I learned from reading her code to write Part 2 and assemble this tutorial.

The Arizona Department of Economic Security also has an online listing of “non-WARN” notices - notices from companies that aren’t bound by the WARN Act but choose to submit them to the state anyway. They are filed much less frequently, and this analysis excludes them.

It’s worth noting, too, that while March and April are record-setting months for layoffs in our dataset, we don’t have online records going back to the housing crash of 2008. With this data alone, we can’t tell how these numbers compare to the last great recession.

Finally, this code-through is written as a one-time exercise, but it would be helpful to have another script that scrapes only the most recent layoffs and discards any duplicates already stored. Or, one that scrapes automatically when it detects a website change. That would save me and the server some time.