Projects

FCC Data Scrapers

written by Gati Aher | 12 min read
tags: Olin Satellite + Spectrum Technology & Policy Group (Undergraduate Research)

The Federal Communications Commission (FCC) keeps public records of all satellite actions and license permissions in MyIBFS, the central filing system of the International Bureau (IB). Data analysis of these records may provide insights into trends in satellite usage over time and reveal areas of missing or contradictory information. However, these records are not available in a friendly format.

Thus, we built an application to extract information from HTML data tables, individual web pages, and public notice PDFs. It also performs preliminary cleaning, consolidation, and data validation. The final tabular and .json data files will be fed to an interactive R data application for exploration and analysis.

Set-Up Environment

These instructions are guaranteed to work on Ubuntu 20.04. They should work on newer versions, but you may have to figure out package dependencies yourself. No packages here are unmaintained or too far off the beaten path, so it should be fine.

Step 1. Download the Google Chrome driver to interface with Selenium

This boils down to downloading the correct driver for your version of the Chrome browser and putting it in the correct location (e.g. on Linux, put it into /usr/local/bin so it is automatically on the PATH, the first place Selenium will look).

If you do not have Google Chrome, download the latest version. Then find out your version of Chrome (open Chrome, click on the 3 dots in the upper right corner, hover cursor over “Help”, click “About Google Chrome”).

Next, go to https://sites.google.com/chromium.org/driver/ and download the chromedriver_linux64.zip of the Google Chrome driver that is compatible with your Google Chrome browser version.

Now extract the driver by running

unzip ~/Downloads/chromedriver_linux64.zip

You should see an executable file named chromedriver in your Downloads folder. Since the Downloads folder is not on the PATH, move this file to /usr/local/bin, the customary place for user-installed, system-wide executables that will not interfere with automatically updated packages.

sudo mv ~/Downloads/chromedriver /usr/local/bin

You may read the full selenium installation instructions for troubleshooting advice.
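
To confirm the driver is picked up correctly, you can run a minimal smoke test like the one below (a sketch; it runs Chrome headless and just checks that a page loads).

# smoke_test.py -- verify that Selenium can find chromedriver and drive Chrome
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run without opening a browser window

# with chromedriver in /usr/local/bin, Selenium finds it on the PATH automatically
driver = webdriver.Chrome(options=options)
driver.get("https://www.fcc.gov")
print("Page title:", driver.title)
driver.quit()

If this prints a page title without raising a WebDriverException, the driver and browser versions are compatible.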

Step 2. Install necessary Python libraries

Data Formatting
Installing pandas and the correct versions of NumPy and SciPy can be difficult. The official pandas installation instructions recommend installing these as part of the Miniconda distribution.

Web Scraper

  • conda install -c conda-forge selenium
  • conda install -c anaconda requests
  • conda install -c anaconda beautifulsoup4

PDF Text Extraction

  • pip install pdfminer3
  • sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev (system dependencies for pdftotext)
  • pip install pdftotext

Utilities

  • For time estimates on long for-loops: conda install -c conda-forge tqdm
  • For easily running scripts with arguments from the terminal: conda install fire -c conda-forge
  • For setting up Jupyter Notebook, follow this article
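
After installing, a quick sanity check that every dependency imports cleanly (a small sketch; the list matches the packages installed above) can save debugging time later:

# check_imports.py -- confirm that every dependency is importable
import importlib

packages = ["pandas", "numpy", "scipy", "selenium", "requests", "bs4",
            "pdfminer3", "pdftotext", "tqdm", "fire"]

for name in packages:
    try:
        module = importlib.import_module(name)
        print(f"OK   {name} {getattr(module, '__version__', '')}")
    except ImportError as err:
        print(f"FAIL {name}: {err}")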

Create a new folder called dataset in the current directory. This is where data files will be stored. The .gitignore knows to ignore files in dataset/* when syncing with GitHub.

mkdir dataset

Create a new folder called runs in the current directory. This is where log files will be stored. The .gitignore knows to ignore log_* files when syncing with GitHub.

mkdir runs

Create a new folder called save_pdfs in the directory. This is where public notice pdf files will be stored. The .gitignore knows to ignore files in save_pdfs/* when syncing with GitHub.

mkdir save_pdfs

Test: Run a Python script on a single year

Run the python script

python3 scrape_webpages.py --start_date="01/01/1980" --end_date="12/31/1980" --save_string_prefix="dataset/RUN_1980" &> runs/log_1980.txt

This runs the script and saves the logging output to a plain-text log file.

The script saves data files with the following naming scheme: f"{save_string_prefix}_START_{start_date_string}_END_{end_date_string}_{table_name}"

The script creates the following tables:

  • _table.csv – filing records scraped from the generated table, aggregated by file number
  • _listing.csv – information scraped from each filing's listing page
  • save_pdfs/ – downloaded public notice pdfs
  • _notice.json – public notice description text extracted from the pdfs
  • _notice.csv – properties extracted from the public notice description text
  • _gsongso.csv – GSO/NGSO flags from the advanced search
  • _merge.csv – merge _table.csv, _listing.csv, _notice.csv, _gsongso.csv on filename
  • _gso.csv – subsection of _merge.csv with is_GSO == 1
  • _ngso.csv – subsection of _merge.csv with is_NGSO == 1
  • _neither_gso_ngso.csv – subsection of _merge.csv with is_GSO == 0 & is_NGSO == 0
  • _both_gso_ngso.csv – subsection of _merge.csv with is_GSO == 1 & is_NGSO == 1

Useful commands:

To see logging in real-time from a different terminal

tail -f runs/log_1980.txt

To read the whole log file

less runs/log_1980.txt

To search for words / ERRORs in the log file

grep "ERROR" runs/log_1980.txt

Run: Bash script calling Python script for multiple years

A bash script can be run overnight to scrape data from all years in the range 1980-2021:

sh all_scrape.sh &> runs/log_all_scrape.txt

Each run spans one year of records. All subsequent data files are kept isolated, so that if a run fails, it is easy to redo without affecting other runs. For each step of information extraction, data files are kept separate, so that if steps of a run fail, it is easy to resume with the canned data files from previous steps.
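
For reference, the sketch below shows the same overnight loop written in Python with subprocess instead of a shell script (the year range and per-year command follow the single-year run shown above; all_scrape.sh itself is the real driver):

# all_scrape.py -- hypothetical Python equivalent of the all_scrape.sh loop
import subprocess

for year in range(1980, 2022):  # 1980 through 2021
    cmd = [
        "python3", "scrape_webpages.py",
        f"--start_date=01/01/{year}",
        f"--end_date=12/31/{year}",
        f"--save_string_prefix=dataset/RUN_{year}",
    ]
    with open(f"runs/log_{year}.txt", "w") as log_file:
        # each year is an isolated run; a failure here does not affect other years
        subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)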

Overview of Functionality

Main Function

The web scraper’s main function is given a time range and a save location. It calls the other functions to automatically perform the web scraping steps.

def main(start_date="01/01/2016", end_date="12/31/2016", save_string_prefix="dataset/RUN_2016"):
    # for filenames, replace / with - so file names are valid
    start_date_string = start_date.replace("/", "-")
    end_date_string = end_date.replace("/", "-")
    print(f"SAVE TO: {save_string_prefix}_START_{start_date_string}_END_{end_date_string}")

    # name save locations
    save_table = f"{save_string_prefix}_START_{start_date_string}_END_{end_date_string}_table.csv"
    save_listing = f"{save_string_prefix}_START_{start_date_string}_END_{end_date_string}_listing.csv"
    save_pdfs = "save_pdfs/"
    save_notice_json = f"{save_string_prefix}_START_{start_date_string}_END_{end_date_string}_notice.json" 
    save_notice = f"{save_string_prefix}_START_{start_date_string}_END_{end_date_string}_notice.csv"
    save_gsongso = f"{save_string_prefix}_START_{start_date_string}_END_{end_date_string}_gsongso.csv"
    save_merge = f"{save_string_prefix}_START_{start_date_string}_END_{end_date_string}_merge.csv"
    save_gso = f"{save_string_prefix}_START_{start_date_string}_END_{end_date_string}_gso.csv"
    save_ngso = f"{save_string_prefix}_START_{start_date_string}_END_{end_date_string}_ngso.csv"
    save_neither_gso_ngso = f"{save_string_prefix}_START_{start_date_string}_END_{end_date_string}_neither_gso_ngso.csv"
    save_both_gso_ngso = f"{save_string_prefix}_START_{start_date_string}_END_{end_date_string}_both_gso_ngso.csv"

Scrape Table

Go to the Date Selection form, generate the table of filings acted upon between the given start and end dates, scrape info from the table, aggregate records by file number, and save the table to a .csv file (Link to FCC Date Selection Form)

# scrape table
scrape_table(start_date, end_date, save_table)

Date Selection Form

Example Table
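
A rough sketch of what this step looks like with Selenium and pandas is given below. The form URL, element names, and column name are illustrative placeholders, not the actual IBFS identifiers.

# sketch of scrape_table: fill the date form, submit, parse the results table
# FORM_URL and the element names below are placeholders, not the real IBFS markup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

FORM_URL = "https://licensing.fcc.gov/..."  # actual Date Selection form URL goes here

def scrape_table_sketch(start_date, end_date, save_table):
    driver = webdriver.Chrome()
    driver.get(FORM_URL)
    driver.find_element(By.NAME, "action_from_date").send_keys(start_date)  # placeholder name
    driver.find_element(By.NAME, "action_to_date").send_keys(end_date)      # placeholder name
    driver.find_element(By.NAME, "submit").click()                          # placeholder name

    # pandas can parse the generated HTML results table directly (needs lxml or html5lib)
    df = pd.read_html(driver.page_source)[0]
    driver.quit()

    # aggregate records that share a file number, then save
    df = df.groupby("File Number", as_index=False).first()  # column name assumed
    df.to_csv(save_table, index=False)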

Scrape Listing

For each file number + listing link in the given list, follow the listing link, scrape information, and save the results to a .csv (example)

# get file numbers and listing links
list_of_filenumbers, list_of_listing_links = get_list_of_listing_links(save_table)
# scrape listing
scrape_listing(list_of_filenumbers, list_of_listing_links, save_listing)

Example Listing
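
A minimal sketch of this step with requests and BeautifulSoup (the two-cell label/value layout assumed here is an illustration, not the real IBFS markup):

# sketch of scrape_listing: fetch each listing page and pull out label/value pairs
import csv
import requests
from bs4 import BeautifulSoup

def scrape_listing_sketch(list_of_filenumbers, list_of_listing_links, save_listing):
    rows = []
    for filenumber, link in zip(list_of_filenumbers, list_of_listing_links):
        response = requests.get(link, timeout=30)
        soup = BeautifulSoup(response.text, "html.parser")
        record = {"filenumber": filenumber}
        for tr in soup.find_all("tr"):
            cells = tr.find_all("td")
            if len(cells) == 2:  # treat two-cell rows as label/value pairs
                record[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
        rows.append(record)

    # take the union of all keys so records with extra fields still fit in one CSV
    fieldnames = sorted({key for row in rows for key in row})
    with open(save_listing, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)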

Scrape Public Notices

For each file number + public notice list link in the given list, follow the public notice list link, scrape information from the table, and save it as a .json (example).

# get file numbers and public notice list links
list_of_filenumbers, list_of_public_notice_list_links = get_list_of_public_notice_list_links(save_listing)

# scrape notice
scrape_public_notice_list_a(list_of_filenumbers, list_of_public_notice_list_links, save_pdfs, save_notice_json)
scrape_public_notice_list_b(list_of_filenumbers, list_of_public_notice_list_links, save_pdfs, save_notice_json)
# extract info from saved json
extract_content_from_public_notice(list_of_filenumbers, save_notice_json, save_notice)

Example Notice List

For each file number + public notice list link in the given list, download and open the public notice pdfs from the report link, extract the description text, and save it as .json

Example Notice
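
A sketch of the text-extraction step using the two PDF tools installed earlier, following the usual pdfminer3 usage pattern with pdftotext as the fallback (mirroring the error messages described in the error-handling section below):

# sketch: extract plain text from a downloaded public notice pdf,
# trying pdfminer3 first and falling back to pdftotext
import io

from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage
import pdftotext

def extract_pdf_text(pdf_path):
    try:
        resource_manager = PDFResourceManager()
        out = io.StringIO()
        converter = TextConverter(resource_manager, out, laparams=LAParams())
        interpreter = PDFPageInterpreter(resource_manager, converter)
        with open(pdf_path, "rb") as fh:
            for page in PDFPage.get_pages(fh, check_extractable=True):
                interpreter.process_page(page)
        converter.close()
        return out.getvalue()
    except Exception as err:
        print(f"ERROR: pdfminer3 could not open pdf {pdf_path}: {err}")

    try:
        with open(pdf_path, "rb") as fh:
            return "\n".join(pdftotext.PDF(fh))
    except Exception as err:
        print(f"ERROR: pdftotext could not open pdf {pdf_path}: {err}")
    return None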

Extract certain segments of the PDF text. This is natural-language text, so regular expressions (RegEx) are used to select text that matches a given pattern. Open the .json with the public notice description text, create a data frame with the extracted properties, and save it to .csv.

Properties extracted from the public notice description text (a small regex sketch follows this list):

  • redirects
  • outside redirects (redirects to file numbers outside of list_of_filenumbers, i.e. filings not acted upon within the operating date range)
  • frequency band
  • frequency band with descriptor (Earth-to-Space, Space-to-Earth)
  • orbit location
  • waivers
  • callsigns (TODO)
  • policy implications (TODO)
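
For illustration, the sketch below shows the style of RegEx used for the frequency band and orbit location properties. The example text and both patterns are simplified stand-ins, not the script's actual expressions.

# illustrative RegEx patterns for two of the extracted properties
import re

text = ("... in the 17.8-18.6 GHz (space-to-Earth) and 28.35-29.1 GHz "
        "(Earth-to-space) bands from the 97.65 deg W.L. orbital location ...")

# frequency bands, optionally followed by a direction descriptor
freq_pattern = re.compile(
    r"(\d+(?:\.\d+)?-\d+(?:\.\d+)?\s*[GM]Hz)"        # e.g. "17.8-18.6 GHz"
    r"(?:\s*\((Earth-to-space|space-to-Earth)\))?",  # optional descriptor
    re.IGNORECASE,
)

# orbital locations written like "97.65 deg W.L."
orbit_pattern = re.compile(r"(\d+(?:\.\d+)?)\s*deg\.?\s*([EW])\.?L\.?", re.IGNORECASE)

print(freq_pattern.findall(text))
# [('17.8-18.6 GHz', 'space-to-Earth'), ('28.35-29.1 GHz', 'Earth-to-space')]
print(orbit_pattern.findall(text))
# [('97.65', 'W')]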

Scrape GSO and NGSO Label

Mark GSO and NGSO types by checking each callsign against the Advanced Search form, and save the data frames to .csv files (Link to Advanced Search Form)

# get file numbers and callsigns
list_of_filenumbers, list_of_callsigns = get_list_of_callsigns(save_table)
# scrape gso ngso
scrape_gso_ngso(list_of_filenumbers, list_of_callsigns, save_gsongso)

Advanced Search Form
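
The advanced-search check itself is wrapped in retry logic; the skeleton below shows the shape of it (run_advanced_search is a placeholder for the Selenium interaction with the form above, not a function from the real script):

# skeleton of the GSO/NGSO check with retries for the unstable FCC site
# run_advanced_search() is a placeholder callable, not part of the real script
import time

def check_gso_ngso_sketch(callsign, run_advanced_search, max_tries=3):
    flags = {}
    for label in ("GSO", "NGSO"):
        for attempt in range(max_tries):
            try:
                # True if the callsign is found under this search constraint
                flags[f"is_{label}"] = int(run_advanced_search(callsign, label))
                break
            except TimeoutError:
                time.sleep(5)  # give the FCC site a moment before retrying
        else:
            # repeated timeouts: report it and let the run stop
            print(f"time out happened {max_tries}x on {label} check on {callsign}")
            raise RuntimeError(f"{label} check failed for {callsign}")
    return flags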

Merge Tables

Merge the TABLE, LISTING, NOTICE, and GSONGSO data frames using filename as the key. Flag any discrepancies:

  • Flag if Call Sign does not exist
  • Flag if Call Sign exists in data table and not in pdf-description or vice-versa
  • Flag if frequency exists in data table and not in pdf-description or vice-versa
  • Flag if orbital location exists in data table and not in pdf-description or vice-versa

# merge tables
merge_tables(list_of_filenumbers, save_table, save_listing, save_notice, save_gsongso, save_merge)
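
A rough pandas sketch of this step (the key and column names such as callsign_table / callsign_notice are assumptions about the intermediate tables, not the script's actual column names):

# sketch of merge_tables: join the four intermediate tables and flag disagreements
import pandas as pd

def merge_tables_sketch(save_table, save_listing, save_notice, save_gsongso, save_merge):
    table = pd.read_csv(save_table)
    listing = pd.read_csv(save_listing)
    notice = pd.read_csv(save_notice)
    gsongso = pd.read_csv(save_gsongso)

    merged = (table.merge(listing, on="filenumber", how="left")  # key column name assumed
                   .merge(notice, on="filenumber", how="left")
                   .merge(gsongso, on="filenumber", how="left"))

    # flag discrepancies between the data table and the pdf description text
    merged["flag_no_callsign"] = merged["callsign_table"].isna()
    merged["flag_callsign_mismatch"] = merged["callsign_table"].isna() != merged["callsign_notice"].isna()
    merged["flag_frequency_mismatch"] = merged["frequency_table"].isna() != merged["frequency_notice"].isna()
    merged["flag_orbit_mismatch"] = merged["orbit_table"].isna() != merged["orbit_notice"].isna()

    merged.to_csv(save_merge, index=False)
    return merged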

Save Separate Tables For Review

Split the merged table into GSO and NGSO data frames. Save separate tables for “marked as GSO”, “marked as NGSO”, “marked as GSO and NGSO”, and “marked as neither”.

# split into gso, ngso
split_gso_ngso(save_merge, save_gso, save_ngso, save_neither_gso_ngso, save_both_gso_ngso)
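
The split itself is a handful of boolean masks over the merged table; a minimal sketch, assuming is_GSO and is_NGSO are 0/1 columns as described above:

# sketch of split_gso_ngso: save the four GSO/NGSO views of the merged table
import pandas as pd

def split_gso_ngso_sketch(save_merge, save_gso, save_ngso,
                          save_neither_gso_ngso, save_both_gso_ngso):
    df = pd.read_csv(save_merge)
    df[df["is_GSO"] == 1].to_csv(save_gso, index=False)
    df[df["is_NGSO"] == 1].to_csv(save_ngso, index=False)
    df[(df["is_GSO"] == 0) & (df["is_NGSO"] == 0)].to_csv(save_neither_gso_ngso, index=False)
    df[(df["is_GSO"] == 1) & (df["is_NGSO"] == 1)].to_csv(save_both_gso_ngso, index=False)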

Known Errors & Solutions

Sometimes information is not scraped because the FCC website is temporarily down. The recommended solution is to log instances of bad links and, at a later time, rerun a partial web scrape for just the missing data or manually retrieve the information.

There are two types of errors:

  • un-handled errors stop the script; their error logs are recorded in runs/log_all_scrape.txt.
  • handled errors do not stop the script; their error logs are recorded in runs/log_{year}.txt for that specific call to the Python script.

Dealing with un-handled errors

Ideally, all errors would be handled, but scripts are not perfect and the FCC website is unstable. While there is error-handling code in place, in some cases it makes more sense to stop the run and re-try the web scrape later when the FCC website is more stable. Since the data files are canned as checkpoints, commenting out the steps in the Python script that run before the error will save time on re-runs. In the final saved data set, there are no unresolved un-handled errors.

Error Type 1

A public notice PDF was not fetched because the FCC website was so unstable that it did not even return the error message for being unable to fetch a public notice PDF resource. Solution: manually download the public notice pdf and re-run the Python script from the call to scrape_public_notice_list_b.

Error Type 2

GSO/NGSO flags were not recorded because the FCC website was so unstable that it did not load a found/not-found message for a constrained advanced search even after 3 repeat tries. Solution: re-run the Python script from the call to scrape_gso_ngso.

Working towards a solution

Find interrupted runs by seeing which runs/log_{year}.txt logs terminated unexpectedly:

# view last five lines for runs in 1980-1999
tail -n 5 runs/log_1*
# view last five lines for runs in 2000-2021
tail -n 5 runs/log_2*

Successful run output (examples: 2001, 2002):

==> runs/log_2001.txt <==
merge table saved to dataset/RUN_2001_START_01-01-2001_END_12-31-2001_merge.csv | length: (192, 51)
gso saved to dataset/RUN_2001_START_01-01-2001_END_12-31-2001_gso.csv | length: (80, 51)
ngso saved to dataset/RUN_2001_START_01-01-2001_END_12-31-2001_ngso.csv | length: (20, 51)
neither saved to dataset/RUN_2001_START_01-01-2001_END_12-31-2001_neither_gso_ngso.csv | length: (92, 51)
both saved to dataset/RUN_2001_START_01-01-2001_END_12-31-2001_both_gso_ngso.csv | length: (0, 51)

==> runs/log_2002.txt <==
merge table saved to dataset/RUN_2002_START_01-01-2002_END_12-31-2002_merge.csv | length: (178, 51)
gso saved to dataset/RUN_2002_START_01-01-2002_END_12-31-2002_gso.csv | length: (88, 51)
ngso saved to dataset/RUN_2002_START_01-01-2002_END_12-31-2002_ngso.csv | length: (16, 51)
neither saved to dataset/RUN_2002_START_01-01-2002_END_12-31-2002_neither_gso_ngso.csv | length: (74, 51)
both saved to dataset/RUN_2002_START_01-01-2002_END_12-31-2002_both_gso_ngso.csv | length: (0, 51)

Interrupted run output (examples: 2003, 2004):

==> runs/log_2003.txt <==
* pn 1/2
* Report No. SAT00167 Report Link https://licensing.fcc.gov/ibfsweb/ib.page.FetchPN?report_key=335209

- - - - - - - - -


==> runs/log_2004.txt <==
312 SAT-T/C-20040910-00173 nan
313 SAT-T/C-20040924-00190 nan
314 SAT-T/C-20041216-00222 S2367
315 SAT-WAV-19980803-00061 nan
316 SAT-WAV-20010302-00018 AMSC-1

Find which step was interrupted by searching the interrupted run’s log runs/log_{year}.txt (depends on keyword search)

Find unhandled error messages by searching for the interrupted run’s logs within runs/log_all_scrape.txt (depends on keyword search)

Dealing with handled errors

Handled errors flag when the script skipped past scraping a part of the unstable website. This flagging is helpful for deciding whether to (1) re-run or (2) stick with manual correction. Here is how to interpret and handle the handled error messages:

Error Type 1

Invalid / broken public notice pdf download link. The script asked the FCC to fetch a public notice PDF, but the FCC returned an HTML error template instead. The script tried and failed to open the downloaded file with two pdf-opening tools (pdfminer3, pdftotext). Recommendation: try to manually download the public notice pdf (i.e. replace save_pdfs/SAT00001.pdf with a real downloaded public notice pdf) and re-run the script.

  • ERROR: pdfminer3 could not open pdf save_pdfs/SAT00001.pdf because the FCC pdf download link is broken.
  • ERROR: pdftotext could not open pdf save_pdfs/SAT00001.pdf because the FCC pdf download link is broken.
  • ERROR! Could not find SAT-MOD-19970130-00012!

There are 10 public notice links that are suspected to be permanently broken:

  • SAT00001
  • SAT00019
  • SAT00832
  • SAT01525
  • SAT01530
  • SAT01524
  • SAT01531
  • SAT01523
  • SAT01589
  • SAT01526

Go to the Advanced Search Form, search the file number, go to the listing page, go to the public notice list, and check if the link is still broken. If it is not broken, you can download the pdf.

year file number public notice
1998 SAT-MOD-19970130-00012 SAT00001
2004 SAT-STA-19980812-00064 SAT00001
2004 SAT-WAV-19980803-00061 SAT00001
1999 SAT-ASG-19990527-00059 SAT00019
1999 SAT-STA-19990525-00056 SAT00019
2000 SAT-AMD-19990601-00060 SAT00019
2000 SAT-LOA-19990601-00061 SAT00019
2001 SAT-AMD-19990526-00057 SAT00019
2001 SAT-AMD-19990526-00058 SAT00019
2009 SAT-MOD-19990603-00062 SAT00019
2012 SAT-T/C-20100112-00008 SAT00832
2012 SAT-T/C-20100112-00008 SAT00832
2021 SAT-MOD-20201222-00150 SAT01523
2021 SAT-STA-20201218-00147 SAT01523
2021 SAT-MOD-20200805-00091 SAT01524
2021 SAT-MOD-20200805-00091 SAT01524
2021 SAT-MPL-20201231-00155 SAT01524
2021 SAT-MPL-20201231-00156 SAT01524
2021 SAT-MPL-20201231-00158 SAT01524
2021 SAT-MPL-20201231-00159 SAT01524
2021 SAT-STA-20201112-00135 SAT01524
2021 SAT-STA-20201211-00143 SAT01524
2021 SAT-T/C-20201224-00151 SAT01524
2021 SAT-LOA-20200907-00105 SAT01525
2021 SAT-STA-20210106-00003 SAT01525
2021 SAT-STA-20210111-00006 SAT01526
2021 SAT-MOD-20200526-00057 SAT01530
2021 SAT-STA-20201210-00140 SAT01530
2021 SAT-STA-20201210-00141 SAT01530
2021 SAT-STA-20210127-00015 SAT01530
2021 SAT-MOD-20201201-00138 SAT01531
2021 SAT-STA-20210115-00011 SAT01531
2021 SAT-MOD-20210618-00082 SAT01589
2021 SAT-STA-20210802-00093 SAT01589
2021 SAT-STA-20211018-00131 SAT01589
2021 SAT-STA-20211022-00132 SAT01589
2021 SAT-T/C-20210817-00104 SAT01589

Error Type 2

The public notice pdf was successfully downloaded, but strange formatting issues made the automatic extraction code fail. Recommendation: perform manual extraction of the public notice pdf description text. This affected 6 public notices (8 file numbers).

year file number public notice
2004 SAT-MOD-20031118-00333 SPB200
2006 SAT-LOA-20051221-00267 SAT00335
2006 SAT-RPL-20051118-00233 SAT00335
2010 SAT-MOD-20100212-00027 SAT00667
2013 SAT-PDR-20070129-00024 SAT00422
2018 SAT-PDR-20161115-00112 SAT01231
2020 SAT-PDR-20191017-00115 SAT01433
2020 SAT-PDR-20191017-00116 SAT01433

Error Type 3

Three attempts at a GSO/NGSO advanced search timed out. This turns into an un-handled error because it indicates a period of unusual FCC website instability. Solution: re-run the script at a later time, when the FCC website is more stable.

  • time out happened 3x on NGSO check on SAT-PPL-20210108-00005

Find the errors:

# view flagged errors in runs 1980-1999
grep -i -n "error" runs/log_1*
# view flagged errors in runs 2000-2021
grep -i -n "error" runs/log_2*

Jupyter Notebook Prototype

We used a Python Jupyter Notebook for easy prototyping and quick interactive debugging.

jupyter notebook Interactive.ipynb

To scrape a different date range, change the variables in the first cell.

before_date=""
after_date=""
start_date="1/1/2016"
end_date="12/30/2016"

Credits

Made by Gati Aher and Philip Post for Olin Satellite & Spectrum Technology Policy (OSSTP).