Political Science Academic Job Market

An NLP Analysis of APSA eJobs Listings

Author

Knowledge Mining Workshop

Published

April 9, 2026

Executive Summary

This report analyzes the political science academic job market using all active listings published in APSA Political Science Jobs — the official monthly eJobs journal of the American Political Science Association. Subfield counts are sourced directly from each issue’s Table of Contents (page 2) for accuracy, while individual job records are scraped from body text and verified against those ground-truth counts.

0.1 Dataset at a Glance

Issues Processed: 113
Unique Job Listings: 10,168
Unique Institutions: 5,576
Years Covered: 2016–2026

0.2 Geographic Overview

1 Data Collection & Parsing

1.1 Methodology

The data pipeline operates in three stages:

1. PDF Extraction: pdftools::pdf_text() reads raw text from each monthly issue.
2. Page 2 TOC Parsing: A regex pipeline matches the "(N listings)" pattern in the Table of Contents and extracts the official, ground-truth count per subfield per issue.
3. Body Text Scraping: The parser walks the document line by line, tracks section headers to assign subfields, and flushes a record each time it detects an eJobs ID: marker.
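
Stages 2 and 3 can be illustrated with a minimal base-R sketch. It is not the report's actual parser: the sample lines, the header set, the helper names parse_toc_counts() and scrape_body(), and the assumption that the eJobs ID: marker is the last line of each listing are all hypothetical.

```r
# Hypothetical page-2 TOC lines; the real layout may differ.
toc_lines <- c(
  "Table of Contents",
  "American Government and Politics (12 listings)",
  "Comparative Politics (1 listing)"
)

# Stage 2: pull "<subfield> (N listings)" pairs out of the TOC text.
parse_toc_counts <- function(lines) {
  pattern <- "^\\s*(.+?)\\s*\\((\\d+)\\s+listings?\\)\\s*$"
  hits <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  hits <- hits[lengths(hits) == 3]  # keep only lines that matched
  data.frame(
    subfield  = vapply(hits, `[`, character(1), 2),
    toc_count = as.integer(vapply(hits, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

# Stage 3: walk body lines, remember the current section header, and
# flush one record whenever an "eJobs ID:" line appears (assumed here
# to close each listing).
scrape_body <- function(lines, headers) {
  current_subfield <- NA_character_
  buffer  <- character(0)
  records <- list()
  for (ln in trimws(lines)) {
    if (ln %in% headers) {
      current_subfield <- ln
      buffer <- character(0)
    } else if (grepl("eJobs ID:", ln, fixed = TRUE)) {
      records[[length(records) + 1]] <- data.frame(
        ejobs_id = trimws(sub(".*eJobs ID:", "", ln)),
        subfield = current_subfield,
        text     = paste(buffer, collapse = " "),
        stringsAsFactors = FALSE
      )
      buffer <- character(0)
    } else if (nzchar(ln)) {
      buffer <- c(buffer, ln)
    }
  }
  do.call(rbind, records)
}

# Made-up body text with two sections and one listing each.
body_lines <- c(
  "Comparative Politics",
  "Assistant Professor, Example University",
  "eJobs ID: 1001",
  "International Relations",
  "Lecturer, Sample College",
  "eJobs ID: 1002"
)

toc_counts <- parse_toc_counts(toc_lines)
scraped    <- scrape_body(body_lines,
                          headers = c("Comparative Politics",
                                      "International Relations"))
print(toc_counts)
print(scraped)
```

Keeping the TOC parser separate from the body scraper is what makes the later verification step meaningful: the two counts come from independent passes over the same issue.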

1.2 Verification Report

2 Subfield Analysis

2.1 Summary Statistics by Subfield

Summary Statistics by Subfield
Subfield	N Jobs	N w/ Salary	Median Salary	Mean Salary	% TT	% Visiting	% Teaching Trk	% Postdoc
Methods	1496	191	$65,000	$72,254	15.4%	3.9%	4.3%	6.5%
CP	1490	172	$65,000	$67,282	16.4%	2.9%	3.0%	7.0%
IR	1461	179	$60,000	$66,753	14.6%	2.7%	3.1%	8.5%
AP	946	105	$65,000	$68,731	16.4%	4.2%	5.5%	9.6%
Other	946	94	$65,000	$67,877	16.1%	3.1%	3.7%	6.0%
PT	942	78	$65,000	$64,371	16.2%	3.8%	3.9%	5.1%
PL	634	60	$75,000	$74,696	23.0%	4.1%	2.8%	4.1%
PP	574	61	$72,500	$78,408	15.9%	2.4%	3.5%	7.1%
Open	525	64	$65,000	$73,694	12.6%	1.5%	4.2%	7.4%
Non-Academic	501	54	$65,000	$72,906	12.4%	2.2%	3.4%	4.6%
Admin	494	55	$65,000	$76,880	14.4%	3.6%	4.0%	3.8%
PAdmin	159	15	$65,000	$67,880	17.6%	1.3%	5.0%	1.3%

2.2 Trends Over Time

toc_year %>%
  filter(!is.na(year), !is.na(subfield)) %>%
  ggplot(aes(x = year, y = toc_count, colour = subfield)) +
  geom_line(linewidth = 0.9) +
  geom_point(size = 1.8) +
  scale_colour_brewer(palette = "Paired") +
  scale_x_continuous(breaks = pretty_breaks(n = 8)) +
  labs(title    = "Job Listings by Subfield Over Time",
       subtitle = "Counts sourced from the Table of Contents of each issue",
       x = NULL, y = "Listings", colour = "Subfield") +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 9, family = PAL))

annual_totals %>%
  filter(!is.na(year)) %>%
  ggplot(aes(x = year, y = total, fill = total)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = comma(total)), vjust = -0.4,
            size = 3.2, family = PAL) +
  scale_fill_viridis_c(option = "D", direction = -1) +
  scale_x_continuous(breaks = pretty_breaks(n = 8)) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, .1))) +
  labs(title = "Total Political Science Job Listings Per Year",
       x = NULL, y = "Total Listings")

toc_year %>%
  filter(!is.na(year), !is.na(subfield)) %>%
  ggplot(aes(x = year,
             y = reorder(subfield, toc_count),
             fill = toc_count)) +
  geom_tile(colour = "white", linewidth = 0.4) +
  geom_text(aes(label = toc_count), size = 2.6,
            colour = "white", family = PAL) +
  scale_fill_viridis_c(option = "C", name = "Listings") +
  scale_x_continuous(breaks = pretty_breaks(n = 8)) +
  labs(title = "Subfield × Year Heatmap", x = NULL, y = NULL) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, family = PAL),
        legend.position = "bottom",
        legend.text = element_text(size = 9, family = PAL))

3 Rank Analysis

3.1 Distribution of Rank Categories

rank_order <- jobs %>%
  count(rank_category) %>% arrange(n) %>% pull(rank_category)

jobs %>%
  count(rank_category) %>%
  mutate(rank_category = factor(rank_category, levels = rank_order)) %>%
  ggplot(aes(x = n, y = rank_category, fill = n)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = comma(n)), hjust = -0.2,
            size = 3.4, family = PAL) +
  scale_fill_viridis_c(option = "E", direction = -1) +
  scale_x_continuous(expand = expansion(mult = c(0, .18))) +
  labs(title = "Job Listings by Rank Category",
       x = "Number of Listings", y = NULL)

3.2 Rank Composition by Subfield

jobs %>%
  filter(rank_category %in% tt_ranks, !is.na(subfield)) %>%
  count(subfield, rank_category) %>%
  group_by(subfield) %>%
  mutate(pct = n / sum(n)) %>% ungroup() %>%
  ggplot(aes(x = reorder(subfield, -n, sum), y = pct,
             fill = factor(rank_category, levels = rev(tt_ranks)))) +
  geom_col() +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_brewer(palette = "Spectral", name = "Rank") +
  labs(title = "Rank Composition by Subfield",
       x = NULL, y = "Share of Listings") +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, family = PAL),
        legend.position = "bottom",
        legend.text = element_text(size = 9, family = PAL))

3.3 Rank Trends Over Time

jobs %>%
  filter(rank_category %in% tt_ranks, !is.na(year)) %>%
  count(year, rank_category) %>%
  ggplot(aes(x = year, y = n, colour = rank_category)) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 1.5) +
  scale_colour_brewer(palette = "Dark2", name = "Rank") +
  scale_x_continuous(breaks = pretty_breaks(n = 8)) +
  labs(title = "Rank Category Trends Over Time",
       x = NULL, y = "Listings") +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 9, family = PAL))

Tip

Key observation: Tenure-track assistant professor positions (Asst Prof (TT)) typically dominate the market. Watch the Visiting Professor and Teaching Track trend lines — a rising share may signal structural shifts in how departments staff their courses.
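
Because the trend chart above plots raw counts, a share has to be computed separately. The following is a minimal base-R sketch under stated assumptions: jobs_demo is a made-up stand-in for the report's jobs table, and off_track is a hypothetical name for the combined visiting and teaching-track share.

```r
# Toy stand-in for the report's `jobs` table (values are invented).
jobs_demo <- data.frame(
  year = c(2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017),
  rank_category = c("Asst Prof (TT)", "Asst Prof (TT)",
                    "Asst Prof (TT)", "Visiting Professor",
                    "Asst Prof (TT)", "Asst Prof (TT)",
                    "Visiting Professor", "Teaching Track"),
  stringsAsFactors = FALSE
)

# Row-wise proportions: each year's rank shares sum to 1.
shares <- prop.table(table(jobs_demo$year, jobs_demo$rank_category),
                     margin = 1)

# Combined share of visiting and teaching-track listings per year.
off_track <- rowSums(
  shares[, c("Visiting Professor", "Teaching Track"), drop = FALSE]
)
print(round(off_track, 2))
```

Normalizing within each year separates a change in staffing mix from a change in the overall number of listings.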

4 Geographic Distribution

4.1 Listings by US Region

jobs %>%
  count(region) %>%
  arrange(desc(n)) %>%
  ggplot(aes(x = reorder(region, n), y = n, fill = region)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = comma(n)), hjust = -0.2,
            size = 3.5, family = PAL) +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(expand = expansion(mult = c(0, .15))) +
  coord_flip() +
  labs(title = "Job Listings by US Region",
       x = NULL, y = "Listings")

4.2 Top 20 States

jobs %>%
  filter(!is.na(state_raw)) %>%
  count(state_raw, sort = TRUE) %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = n, y = reorder(state_raw, n), fill = n)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = comma(n)), hjust = -0.2,
            size = 3.3, family = PAL) +
  scale_fill_viridis_c(option = "D", direction = -1) +
  scale_x_continuous(expand = expansion(mult = c(0, .15))) +
  labs(title = "Top 20 States by Job Listings",
       x = "Listings", y = NULL)

4.3 State Choropleth (Detail)

5 Salary Analysis

Warning

Only listings with an explicit numeric salary are included here. Most listings say “Competitive” or “Commensurate with experience” and are excluded: in the subfield summary table, only 1,128 of 10,168 listings (about 11%) report a salary. Interpret figures with caution.
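
The exclusion rule can be sketched in base R as follows. This paraphrases the parse_salary() helper in the report's setup chunk rather than reproducing it; the function name parse_salary_sketch() and the sample strings are illustrative assumptions.

```r
# Return a numeric salary estimate, or NA when the text gives none.
parse_salary_sketch <- function(sal) {
  if (is.na(sal)) return(NA_real_)
  s <- tolower(sal)
  # Placeholder wording means no usable figure.
  if (s == "" ||
      grepl("competitive|commensurate|negotiable|varies|tbd", s)) {
    return(NA_real_)
  }
  # Pull digit runs such as "65,000" and strip thousands separators.
  tokens <- regmatches(s, gregexpr("[0-9][0-9,]*", s))[[1]]
  nums   <- as.numeric(gsub(",", "", tokens))
  # Keep only plausible annual salaries.
  nums <- nums[nums > 10000 & nums < 500000]
  if (length(nums) == 0) return(NA_real_)
  mean(nums)  # a range collapses to its midpoint
}

examples <- c("$65,000 - $75,000",
              "Competitive",
              "$60,000, commensurate with experience",
              "9-month appointment")
estimates <- vapply(examples, parse_salary_sketch, numeric(1))
print(estimates)
```

Note that under this rule a listing that states a figure but also contains a word such as "commensurate" is dropped, so the salary sample may be smaller than the number of listings that mention a dollar amount.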

5.1 Salary by Rank and Subfield

sal_df <- jobs %>% filter(!is.na(salary_est), salary_est > 10000)

p_sal_rank <- sal_df %>%
  filter(rank_category %in% tt_ranks) %>%
  ggplot(aes(x = reorder(rank_category, salary_est, median),
             y = salary_est, fill = rank_category)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 1.5, show.legend = FALSE) +
  scale_y_continuous(labels = dollar_format()) +
  scale_fill_brewer(palette = "Spectral") +
  coord_flip() +
  labs(title = "Salary Distribution by Rank",
       subtitle = "Listings with explicit numeric salary only",
       x = NULL, y = "Estimated Annual Salary")

p_sal_sf <- sal_df %>%
  filter(!is.na(subfield)) %>%
  ggplot(aes(x = reorder(subfield, salary_est, median),
             y = salary_est, fill = subfield)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 1.5, show.legend = FALSE) +
  scale_y_continuous(labels = dollar_format()) +
  scale_fill_brewer(palette = "Paired") +
  coord_flip() +
  labs(title = "Salary Distribution by Subfield",
       x = NULL, y = "Estimated Annual Salary")

p_sal_rank / p_sal_sf

5.2 Salary Trend Over Time

sal_df %>%
  filter(!is.na(year)) %>%
  group_by(year) %>%
  summarise(median_sal = median(salary_est),
            mean_sal   = mean(salary_est),
            n = n(), .groups = "drop") %>%
  ggplot(aes(x = year)) +
  geom_ribbon(aes(ymin = median_sal, ymax = mean_sal),
              alpha = 0.18, fill = "steelblue") +
  geom_line(aes(y = median_sal, colour = "Median"), linewidth = 1) +
  geom_line(aes(y = mean_sal,   colour = "Mean"),
            linewidth = 1, linetype = "dashed") +
  scale_y_continuous(labels = dollar_format()) +
  scale_x_continuous(breaks = pretty_breaks(n = 8)) +
  scale_colour_manual(values = c(Median = "steelblue", Mean = "tomato"),
                      name = NULL) +
  labs(title    = "Salary Trend Over Time",
       subtitle = paste0("n = ", comma(nrow(sal_df)),
                         " listings with explicit numeric salary"),
       x = NULL, y = "Annual Salary") +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 9, family = PAL))

6 Text Mining

6.1 Top 30 Terms

tidy_words %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 30) %>%
  ggplot(aes(x = n, y = reorder(word, n), fill = n)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = comma(n)), hjust = -0.2,
            size = 3.3, family = PAL) +
  scale_fill_viridis_c(option = "D", direction = -1) +
  scale_x_continuous(labels = comma,
                     expand = expansion(mult = c(0, .15))) +
  labs(title    = "Top 30 Terms Across All Job Listings",
       subtitle = "After removing stopwords; based on rank, unit, and org_name fields",
       x = "Frequency", y = NULL)

6.2 TF-IDF: Distinctive Terms by Subfield

TF-IDF (Term Frequency–Inverse Document Frequency) surfaces words that are unusually common in one subfield relative to all others — revealing the distinctive vocabulary of each field.
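
As a worked illustration, the score can be computed by hand on a toy corpus. The counts below are invented, and the sketch assumes the usual definition (term frequency within a subfield, multiplied by the natural log of the number of subfields divided by the number of subfields containing the word), which to our understanding is what tidytext::bind_tf_idf() computes.

```r
# Invented word counts for two subfields.
counts <- data.frame(
  subfield = c("IR", "IR", "CP", "CP"),
  word     = c("security", "policy", "democratization", "policy"),
  n        = c(3, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Term frequency: a word's share of all words in its subfield.
doc_totals <- tapply(counts$n, counts$subfield, sum)
counts$tf  <- counts$n / as.numeric(doc_totals[counts$subfield])

# Inverse document frequency: log(#subfields / #subfields with the word).
n_docs   <- length(unique(counts$subfield))
doc_freq <- tapply(counts$subfield, counts$word,
                   function(x) length(unique(x)))
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

In this toy corpus "policy" appears in both subfields, so its idf is log(2/2) = 0 and it scores zero everywhere, while "security" keeps a positive score because it occurs only in IR.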

tidy_words %>%
  filter(!is.na(subfield)) %>%
  count(subfield, word) %>%
  bind_tf_idf(word, subfield, n) %>%
  group_by(subfield) %>%
  slice_max(tf_idf, n = 8) %>% ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, subfield)) %>%
  ggplot(aes(x = tf_idf, y = word, fill = subfield)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ subfield, scales = "free_y", ncol = 3) +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Paired") +
  labs(title    = "Most Distinctive Terms by Subfield (TF-IDF)",
       subtitle = "Words disproportionately associated with each subfield",
       x = "TF-IDF Score", y = NULL) +
  theme(axis.text.y = element_text(size = 8, family = PAL))

6.3 TF-IDF: Distinctive Terms by Rank

tidy_words %>%
  filter(rank_category %in% tt_ranks) %>%
  count(rank_category, word) %>%
  bind_tf_idf(word, rank_category, n) %>%
  group_by(rank_category) %>%
  slice_max(tf_idf, n = 8) %>% ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, rank_category)) %>%
  ggplot(aes(x = tf_idf, y = word, fill = rank_category)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ rank_category, scales = "free_y", ncol = 3) +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Spectral") +
  labs(title = "Most Distinctive Terms by Rank (TF-IDF)",
       x = "TF-IDF Score", y = NULL) +
  theme(axis.text.y = element_text(size = 8, family = PAL))

6.4 Top Bigrams

jobs %>%
  unnest_tokens(bigram, full_text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("w1","w2"), sep = " ") %>%
  filter(!w1 %in% stop_words$word, !w2 %in% stop_words$word,
         !w1 %in% ps_stopwords$word, !w2 %in% ps_stopwords$word,
         str_length(w1) > 2, str_length(w2) > 2,
         !str_detect(w1,"^\\d+$"), !str_detect(w2,"^\\d+$")) %>%
  unite(bigram, w1, w2, sep = " ") %>%
  count(bigram, sort = TRUE) %>%
  slice_head(n = 25) %>%
  ggplot(aes(x = n, y = reorder(bigram, n), fill = n)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = comma(n)), hjust = -0.2,
            size = 3.3, family = PAL) +
  scale_fill_viridis_c(option = "C", direction = -1) +
  scale_x_continuous(labels = comma,
                     expand = expansion(mult = c(0, .15))) +
  labs(title    = "Top 25 Bigrams in Job Listings",
       subtitle = "Common two-word phrases after stopword removal",
       x = "Frequency", y = NULL)

6.5 Word Clouds by Subfield

7 Browse All Jobs

8 Appendix

8.1 R Session Info

sessionInfo()

R version 4.5.3 (2026-03-11)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggtext_0.1.2       plotly_4.11.0      DT_0.33            kableExtra_1.4.0  
 [5] knitr_1.50         patchwork_1.3.2    maps_3.4.3         ggridges_0.5.6    
 [9] viridis_0.6.5      viridisLite_0.4.2  wordcloud_2.6      RColorBrewer_1.1-3
[13] SnowballC_0.7.1    tidytext_0.4.2     lubridate_1.9.5    scales_1.4.0      
[17] ggplot2_4.0.0      tibble_3.3.0       stringr_1.6.0      tidyr_1.3.2       
[21] dplyr_1.2.0       

loaded via a namespace (and not attached):
 [1] janeaustenr_1.0.0 sass_0.4.10       generics_0.1.4    xml2_1.5.1       
 [5] stringi_1.8.7     lattice_0.22-9    digest_0.6.39     magrittr_2.0.4   
 [9] evaluate_1.0.5    grid_4.5.3        timechange_0.4.0  fastmap_1.2.0    
[13] jsonlite_2.0.0    Matrix_1.7-4      gridExtra_2.3     httr_1.4.7       
[17] purrr_1.2.1       crosstalk_1.2.1   jquerylib_0.1.4   codetools_0.2-20 
[21] lazyeval_0.2.2    textshaping_1.0.1 cli_3.6.5         rlang_1.1.7      
[25] tokenizers_0.3.0  cachem_1.1.0      withr_3.0.2       yaml_2.3.10      
[29] tools_4.5.3       vctrs_0.7.2       R6_2.6.1          lifecycle_1.0.5  
[33] htmlwidgets_1.6.4 pkgconfig_2.0.3   bslib_0.9.0       pillar_1.11.1    
[37] gtable_0.3.6      data.table_1.17.2 glue_1.8.0        Rcpp_1.1.0       
[41] systemfonts_1.2.3 xfun_0.54         tidyselect_1.2.1  rstudioapi_0.17.1
[45] farver_2.1.2      htmltools_0.5.8.1 labeling_0.4.3    svglite_2.2.1    
[49] rmarkdown_2.30    compiler_4.5.3    S7_0.2.0          gridtext_0.1.5

8.2 Data Pipeline Summary

Stage	Tool	Output
PDF text extraction	`pdftools::pdf_text()`	Raw character vectors
TOC count parsing	`stringr` regex on pages 1–3	`ps_jobs_toc_counts.csv`
Body text scraping	Line-by-line section tracker + eJobs ID flush	`ps_jobs_all_raw.csv`
Deduplication	`dplyr::distinct(ejobs_id)`	`ps_jobs_all_unique.csv`
Verification	TOC count vs scraped count per issue × subfield	`ps_jobs_verification.csv`
Analytics & Report	`ggplot2`, `tidytext`, `maps`, Quarto	This document
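
The deduplication and verification rows can be sketched in base R as below. The toy tables are invented, and the sign convention for diff (scraped count minus TOC count, so a negative value is labelled UNDER) is an assumption; only the column names and the OK / UNDER / OVER labels are taken from the report's verification table.

```r
# Invented scraped records: listing "1" recurs in two issues.
raw <- data.frame(
  source_file = c("A", "A", "A", "A", "B", "B"),
  subfield    = c("CP", "CP", "IR", "IR", "CP", "CP"),
  ejobs_id    = c("1", "2", "3", "4", "1", "5"),
  stringsAsFactors = FALSE
)

# Invented TOC counts for the same issues.
toc <- data.frame(
  source_file = c("A", "A", "B"),
  subfield    = c("CP", "IR", "CP"),
  toc_count   = c(3L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Count scraped records per issue x subfield (before deduplication).
scraped <- aggregate(ejobs_id ~ source_file + subfield,
                     data = raw, FUN = length)
names(scraped)[3] <- "scraped_count"

# Compare against the TOC; cells with no scraped records count as zero.
verify <- merge(toc, scraped,
                by = c("source_file", "subfield"), all.x = TRUE)
verify$scraped_count[is.na(verify$scraped_count)] <- 0L
verify$diff   <- verify$scraped_count - verify$toc_count
verify$status <- ifelse(verify$diff == 0, "OK",
                        ifelse(verify$diff < 0, "UNDER", "OVER"))

# Deduplicate across issues: keep the first row for each eJobs ID.
unique_jobs <- raw[!duplicated(raw$ejobs_id), ]

print(verify)
print(nrow(unique_jobs))
```

Verification runs on the raw per-issue records because a listing that is re-posted in a later issue still counts toward that issue's TOC total; deduplication only applies to the analysis table.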

8.3 Subfield Code Reference

Code	Full Name
AP	American Government and Politics
CP	Comparative Politics
IR	International Relations
Methods	Methodology
PT	Political Theory
PL	Public Law
PP	Public Policy
PAdmin	Public Administration
Admin	Administration
Non-Academic	Non-Academic Positions
Open	Open Subfield
Other	Other

--- title: "Political Science Academic Job Market" subtitle: "An NLP Analysis of APSA eJobs Listings" author: "Knowledge Mining Workshop" date: today format: html: theme: cosmo toc: true toc-depth: 3 toc-location: left toc-title: "Contents" number-sections: true code-fold: true code-tools: true code-summary: "Show Code" smooth-scroll: true fig-width: 10 fig-height: 6 fig-align: center cap-location: bottom df-print: paged embed-resources: true execute: echo: true warning: false message: false cache: true --- ```{r setup, include=FALSE} library(dplyr); library(tidyr); library(stringr); library(tibble) library(ggplot2); library(scales); library(lubridate) library(tidytext); library(SnowballC); library(wordcloud) library(RColorBrewer); library(viridis); library(ggridges) library(maps); library(patchwork); library(knitr); library(kableExtra) library(DT); library(plotly); library(ggtext); library(showtext) # ── Palatino via showtext (cross-platform) ──────────────────────────────── font_add("Palatino", regular = "/Library/Fonts/pala.ttf") # Windows path; adjust if needed # macOS alternative: font_add("Palatino", regular = "Palatino.ttc") # Linux alternative: font_add("Palatino", regular = "TeX Gyre Pagella Regular.otf") showtext_auto() showtext_opts(dpi = 150) PAL <- "Palatino" # ── Global ggplot2 theme ────────────────────────────────────────────────── theme_ps <- function(...) 
{ theme_minimal(base_size = 13, base_family = PAL) %+replace% theme( plot.title = element_text(face = "bold", size = 15, family = PAL, hjust = 0.5), plot.subtitle = element_text(colour = "grey45", size = 11, family = PAL, hjust = 0.5), plot.caption = element_text(colour = "grey60", size = 9, family = PAL, hjust = 0.5), axis.text = element_text(family = PAL), axis.title = element_text(family = PAL), legend.text = element_text(size = 9, family = PAL), legend.title = element_text(size = 10, family = PAL), legend.position = "bottom", panel.grid.minor = element_blank(), strip.text = element_text(face = "bold", family = PAL), ... ) } theme_set(theme_ps()) # ── Load data ───────────────────────────────────────────────────────────── jobs <- read.csv("ps_jobs_all_unique.csv", stringsAsFactors = FALSE) toc <- read.csv("ps_jobs_toc_counts.csv", stringsAsFactors = FALSE) summary_stats <- read.csv("ps_jobs_summary_by_subfield.csv", stringsAsFactors = FALSE) verify <- read.csv("ps_jobs_verification.csv", stringsAsFactors = FALSE) # ── Derived columns ─────────────────────────────────────────────────────── jobs <- jobs %>% mutate( year = as.integer(str_extract(source_file, "(?<=PSJobs)\\d{4}")), month = as.integer(str_sub(str_extract(source_file, "\\d{6}"), 5, 6)), issue_date = ymd(paste0(year, "-", str_pad(month, 2, pad = "0"), "-01")) ) classify_rank <- function(r) { r <- str_to_lower(coalesce(r, "")) case_when( str_detect(r, "full professor|professor \$tenured\$|associate.*full") ~ "Full Professor", str_detect(r, "associate professor") & !str_detect(r, "visiting|lecturer") ~ "Associate Professor", str_detect(r, "assistant professor") & !str_detect(r, "visiting|instruction|teaching|practice") ~ "Asst Prof (TT)", str_detect(r, "visiting assistant|visiting associate|visiting professor") ~ "Visiting Professor", str_detect(r, "professor of (instruction|teaching|practice)|teaching (track|professor)|lecturer") ~ "Teaching Track", str_detect(r, "postdoc|post-doc|post doc") ~ 
"Postdoc", str_detect(r, "instructor") ~ "Instructor", str_detect(r, "director|chair|dean") ~ "Admin/Director", str_detect(r, "open|all ranks|any rank|multiple") ~ "Open Rank", TRUE ~ "Other/NEC" ) } parse_salary <- function(sal) { s <- str_to_lower(coalesce(sal, "")) if (str_detect(s, "competitive|commensurate|negotiable|varies|tbd") | s == "") return(NA_real_) nums <- as.numeric(str_remove_all(str_extract_all(s, "\\d[\\d,]*")[], ","))[1] nums <- nums[nums > 10000 & nums < 500000] if (length(nums) == 0) return(NA_real_) mean(nums) } us_states <- c(state.name, "District of Columbia") state_abbr_map <- setNames(c(state.name, "District of Columbia"), c(state.abb, "DC")) extract_state <- function(text) { if (is.na(text)) return(NA_character_) for (s in us_states) if (str_detect(text, regex(paste0("\\b", s, "\\b"), ignore_case = TRUE))) return(s) for (ab in names(state_abbr_map)) if (str_detect(text, paste0("\\b", ab, "\\b"))) return(state_abbr_map[[ab]]) return(NA_character_) } jobs <- jobs %>% mutate( rank_category = classify_rank(rank), salary_est = sapply(salary, parse_salary), state_raw = mapply(function(o, u) { s <- extract_state(o); if (is.na(s)) extract_state(u) else s }, org_name, unit), region = case_when( state_raw %in% c("Maine","New Hampshire","Vermont","Massachusetts","Rhode Island", "Connecticut","New York","Pennsylvania","New Jersey") ~ "Northeast", state_raw %in% c("Ohio","Indiana","Illinois","Michigan","Wisconsin","Minnesota", "Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas") ~ "Midwest", state_raw %in% c("Delaware","Maryland","District of Columbia","Virginia", "West Virginia","North Carolina","South Carolina","Georgia", "Florida","Kentucky","Tennessee","Alabama","Mississippi", "Arkansas","Louisiana","Oklahoma","Texas") ~ "South", state_raw %in% c("Montana","Idaho","Wyoming","Colorado","New Mexico","Arizona", "Utah","Nevada","Washington","Oregon","California","Alaska","Hawaii") ~ "West", TRUE ~ "Non-US / Unknown" ), full_text = 
paste(coalesce(rank, ""), coalesce(unit, ""), coalesce(org_name, ""), sep = " ") ) toc_year <- toc %>% mutate(year = as.integer(str_extract(source_file, "(?<=PSJobs)\\d{4}"))) %>% group_by(year, subfield) %>% summarise(toc_count = sum(toc_count, na.rm = TRUE), .groups = "drop") annual_totals <- toc_year %>% group_by(year) %>% summarise(total = sum(toc_count), .groups = "drop") tt_ranks <- c("Asst Prof (TT)","Associate Professor","Full Professor", "Open Rank","Visiting Professor","Teaching Track", "Postdoc","Admin/Director") ps_stopwords <- tibble(word = c( "university","college","department","political","science","sciences", "position","applications","candidates","faculty","professor","research", "teaching","please","review","apply","submit","materials","letters", "letter","equal","opportunity","employer","affirmative","action", "diversity","equity","inclusion","phd","degree","tenure","track", "students","graduate","undergraduate","cv","cover","school","institute", "program","required","applicants","appointment","employment","search", "committee","competitive","salary","begin","academic","year" )) tidy_words <- jobs %>% select(ejobs_id, subfield, rank_category, year, full_text) %>% unnest_tokens(word, full_text) %>% anti_join(stop_words, by = "word") %>% anti_join(ps_stopwords, by = "word") %>% filter(!str_detect(word, "^\\d+$"), str_length(word) > 2) %>% mutate(word_stem = wordStem(word)) # ── KPI values ──────────────────────────────────────────────────────────── n_issues <- n_distinct(jobs$source_file) n_jobs <- nrow(jobs) n_orgs <- n_distinct(jobs$org_name, na.rm = TRUE) year_min <- min(jobs$year, na.rm = TRUE) year_max <- max(jobs$year, na.rm = TRUE) # ── Choropleth (built once, reused in two places) ───────────────────────── state_counts <- jobs %>% filter(!is.na(state_raw), state_raw != "District of Columbia") %>% count(state_raw, name = "n_jobs") %>% mutate(region = str_to_lower(state_raw)) us_map_data <- map_data("state") %>% left_join(state_counts, by = 
"region") choropleth <- ggplot(us_map_data, aes(long, lat, group = group, fill = n_jobs)) + geom_polygon(colour = "white", linewidth = 0.25) + coord_fixed(1.3) + scale_fill_viridis_c(option = "A", direction = -1, na.value = "grey90", name = "Listings", labels = comma) + labs( title = "Political Science Job Listings by State", subtitle = "Institution location extracted from org_name / unit fields", x = NULL, y = NULL ) + theme_void(base_size = 13) + theme( text = element_text(family = PAL), plot.title = element_text(face = "bold", size = 14, family = PAL, hjust = 0.5), plot.subtitle = element_text(colour = "grey45", size = 10, family = PAL, hjust = 0.5), legend.text = element_text(size = 9, family = PAL), legend.title = element_text(size = 10, family = PAL), legend.position = "bottom" ) ``` ```{css, echo=FALSE} /* ── Google Fonts: Palatino web fallback (EB Garamond is closest web-safe) */ @import url('https://fonts.googleapis.com/css2?family=EB+Garamond:ital,wght@0,400;0,700;1,400&display=swap'); body, h1, h2, h3, h4, p, li, .quarto-title, .description { font-family: "Palatino Linotype", "Palatino", "Book Antiqua", "EB Garamond", Georgia, serif !important; } .metric-box { background: #f8f9fa; border-left: 4px solid #2c6e91; border-radius: 6px; padding: 14px 18px; margin: 8px 4px; text-align: center; font-family: "Palatino Linotype", Palatino, serif; } .metric-box .metric-number { font-size: 2rem; font-weight: 700; color: #2c6e91; line-height: 1.2; font-family: "Palatino Linotype", Palatino, serif; } .metric-box .metric-label { font-size: 0.82rem; color: #6c757d; text-transform: uppercase; letter-spacing: 0.06em; font-family: "Palatino Linotype", Palatino, serif; } .callout { border-radius: 6px; } .panel-tabset .nav-link { font-size: 0.9rem; font-family: Palatino, serif; } h1, h2, h3 { text-align: center; } .quarto-title h1.title { text-align: center; } .quarto-title .description { text-align: center; } ``` # Executive Summary {.unnumbered} ::: {.callout-note 
appearance="simple"} This report analyzes the political science academic job market using all active listings published in **APSA *Political Science Jobs*** — the official monthly eJobs journal of the American Political Science Association. Subfield counts are sourced directly from each issue's **Table of Contents** (page 2) for accuracy, while individual job records are scraped from body text and verified against those ground-truth counts. ::: ## Dataset at a Glance ```{r kpis, echo=FALSE} htmltools::tags$div( style = "display:grid; grid-template-columns: repeat(4,1fr); gap:12px; margin:16px 0;", htmltools::tags$div(class = "metric-box", htmltools::tags$div(class = "metric-number", scales::comma(n_issues)), htmltools::tags$div(class = "metric-label", "Issues Processed")), htmltools::tags$div(class = "metric-box", htmltools::tags$div(class = "metric-number", scales::comma(n_jobs)), htmltools::tags$div(class = "metric-label", "Unique Job Listings")), htmltools::tags$div(class = "metric-box", htmltools::tags$div(class = "metric-number", scales::comma(n_orgs)), htmltools::tags$div(class = "metric-label", "Unique Institutions")), htmltools::tags$div(class = "metric-box", htmltools::tags$div(class = "metric-number", paste0(year_min, "–", year_max)), htmltools::tags$div(class = "metric-label", "Years Covered")) ) ``` ## Geographic Overview ```{r choropleth-summary, echo=FALSE, fig.height=5.5} choropleth ``` --- # Data Collection & Parsing ## Methodology The data pipeline operates in three stages: 1. **PDF Extraction** — `pdftools::pdf_text()` reads raw text from each monthly issue. 2. **Page 2 TOC Parsing** — A regex pipeline targeting `(N listings)` on the Table of Contents extracts the **official, ground-truth count** per subfield per issue. 3. **Body Text Scraping** — The parser walks the document line-by-line, tracking section headers to assign subfields, flushing a record each time it detects an `eJobs ID:` marker. 
## Verification Report ```{r verification-table, echo=FALSE} verify %>% filter(status != "OK") %>% arrange(desc(abs(diff))) %>% select(source_file, subfield, toc_count, scraped_count, diff, status) %>% datatable( caption = "Discrepancies: Scraped Count vs TOC Count", filter = "top", rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE) ) %>% formatStyle("status", backgroundColor = styleEqual(c("UNDER","OVER"), c("#fff3cd","#f8d7da"))) ``` --- # Subfield Analysis ## Summary Statistics by Subfield ```{r subfield-summary-table, echo=FALSE} summary_stats %>% kbl( caption = "Summary Statistics by Subfield", align = c("l","r","r","r","r","r","r","r","r"), col.names = c("Subfield","N Jobs","N w/ Salary","Median Salary", "Mean Salary","% TT","% Visiting","% Teaching Trk","% Postdoc") ) %>% kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE, font_size = 13) %>% row_spec(0, bold = TRUE, background = "#e9ecef") %>% column_spec(1, bold = TRUE) ``` ## Trends Over Time ::: {.panel-tabset} ### By Subfield ```{r p-trend-subfield, fig.height=6} toc_year %>% filter(!is.na(year), !is.na(subfield)) %>% ggplot(aes(x = year, y = toc_count, colour = subfield)) + geom_line(linewidth = 0.9) + geom_point(size = 1.8) + scale_colour_brewer(palette = "Paired") + scale_x_continuous(breaks = pretty_breaks(n = 8)) + labs(title = "Job Listings by Subfield Over Time", subtitle = "Counts sourced from the Table of Contents of each issue", x = NULL, y = "Listings", colour = "Subfield") + theme(legend.position = "bottom", legend.text = element_text(size = 9, family = PAL)) ``` ### Annual Total ```{r p-annual-bar, fig.height=5} annual_totals %>% filter(!is.na(year)) %>% ggplot(aes(x = year, y = total, fill = total)) + geom_col(show.legend = FALSE) + geom_text(aes(label = comma(total)), vjust = -0.4, size = 3.2, family = PAL) + scale_fill_viridis_c(option = "D", direction = -1) + scale_x_continuous(breaks = pretty_breaks(n = 8)) + scale_y_continuous(labels = 
comma, expand = expansion(mult = c(0, .1))) + labs(title = "Total Political Science Job Listings Per Year", x = NULL, y = "Total Listings") ``` ### Heatmap ```{r p-heatmap, fig.height=6} toc_year %>% filter(!is.na(year), !is.na(subfield)) %>% ggplot(aes(x = year, y = reorder(subfield, toc_count), fill = toc_count)) + geom_tile(colour = "white", linewidth = 0.4) + geom_text(aes(label = toc_count), size = 2.6, colour = "white", family = PAL) + scale_fill_viridis_c(option = "C", name = "Listings") + scale_x_continuous(breaks = pretty_breaks(n = 8)) + labs(title = "Subfield × Year Heatmap", x = NULL, y = NULL) + theme(axis.text.x = element_text(angle = 45, hjust = 1, family = PAL), legend.position = "bottom", legend.text = element_text(size = 9, family = PAL)) ``` ::: --- # Rank Analysis ## Distribution of Rank Categories ```{r p-rank-bar, fig.height=6} rank_order <- jobs %>% count(rank_category) %>% arrange(n) %>% pull(rank_category) jobs %>% count(rank_category) %>% mutate(rank_category = factor(rank_category, levels = rank_order)) %>% ggplot(aes(x = n, y = rank_category, fill = n)) + geom_col(show.legend = FALSE) + geom_text(aes(label = comma(n)), hjust = -0.2, size = 3.4, family = PAL) + scale_fill_viridis_c(option = "E", direction = -1) + scale_x_continuous(expand = expansion(mult = c(0, .18))) + labs(title = "Job Listings by Rank Category", x = "Number of Listings", y = NULL) ``` ## Rank Composition by Subfield ```{r p-rank-subfield, fig.height=6} jobs %>% filter(rank_category %in% tt_ranks, !is.na(subfield)) %>% count(subfield, rank_category) %>% group_by(subfield) %>% mutate(pct = n / sum(n)) %>% ungroup() %>% ggplot(aes(x = reorder(subfield, -n, sum), y = pct, fill = factor(rank_category, levels = rev(tt_ranks)))) + geom_col() + scale_y_continuous(labels = percent_format()) + scale_fill_brewer(palette = "Spectral", name = "Rank") + labs(title = "Rank Composition by Subfield", x = NULL, y = "Share of Listings") + theme(axis.text.x = element_text(angle = 35, 
hjust = 1, family = PAL), legend.position = "bottom", legend.text = element_text(size = 9, family = PAL)) ``` ## Rank Trends Over Time ```{r p-rank-trend, fig.height=6} jobs %>% filter(rank_category %in% tt_ranks, !is.na(year)) %>% count(year, rank_category) %>% ggplot(aes(x = year, y = n, colour = rank_category)) + geom_line(linewidth = 0.8) + geom_point(size = 1.5) + scale_colour_brewer(palette = "Dark2", name = "Rank") + scale_x_continuous(breaks = pretty_breaks(n = 8)) + labs(title = "Rank Category Trends Over Time", x = NULL, y = "Listings") + theme(legend.position = "bottom", legend.text = element_text(size = 9, family = PAL)) ``` ::: {.callout-tip} **Key observation:** Tenure-track assistant professor positions (`Asst Prof (TT)`) typically dominate the market. Watch the **Visiting Professor** and **Teaching Track** trend lines — a rising share may signal structural shifts in how departments staff their courses. ::: --- # Geographic Distribution ## Listings by US Region ```{r p-region, fig.height=5} jobs %>% count(region) %>% arrange(desc(n)) %>% ggplot(aes(x = reorder(region, n), y = n, fill = region)) + geom_col(show.legend = FALSE) + geom_text(aes(label = comma(n)), hjust = -0.2, size = 3.5, family = PAL) + scale_fill_brewer(palette = "Set2") + scale_y_continuous(expand = expansion(mult = c(0, .15))) + coord_flip() + labs(title = "Job Listings by US Region", x = NULL, y = "Listings") ``` ## Top 20 States ```{r p-top-states, fig.height=7} jobs %>% filter(!is.na(state_raw)) %>% count(state_raw, sort = TRUE) %>% slice_head(n = 20) %>% ggplot(aes(x = n, y = reorder(state_raw, n), fill = n)) + geom_col(show.legend = FALSE) + geom_text(aes(label = comma(n)), hjust = -0.2, size = 3.3, family = PAL) + scale_fill_viridis_c(option = "D", direction = -1) + scale_x_continuous(expand = expansion(mult = c(0, .15))) + labs(title = "Top 20 States by Job Listings", x = "Listings", y = NULL) ``` ## State Choropleth (Detail) ```{r choropleth-detail, echo=FALSE, fig.height=6} 
choropleth
```

---

# Salary Analysis

::: {.callout-warning}
Only listings with an **explicit numeric salary** are included here; the code also drops salary estimates of \$10,000 or less. Most listings say "Competitive" or "Commensurate with experience" and are excluded. In the subfield summary table, 1,128 of 10,168 listings (about 11%) carry a salary figure, so interpret these plots with caution.
:::

## Salary by Rank and Subfield

```{r p-salary-box, fig.height=11}
sal_df <- jobs %>%
  filter(!is.na(salary_est), salary_est > 10000)

p_sal_rank <- sal_df %>%
  filter(rank_category %in% tt_ranks) %>%
  ggplot(aes(x = reorder(rank_category, salary_est, median),
             y = salary_est, fill = rank_category)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 1.5, show.legend = FALSE) +
  scale_y_continuous(labels = dollar_format()) +
  scale_fill_brewer(palette = "Spectral") +
  coord_flip() +
  labs(title = "Salary Distribution by Rank",
       subtitle = "Listings with explicit numeric salary only",
       x = NULL, y = "Estimated Annual Salary")

p_sal_sf <- sal_df %>%
  filter(!is.na(subfield)) %>%
  ggplot(aes(x = reorder(subfield, salary_est, median),
             y = salary_est, fill = subfield)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 1.5, show.legend = FALSE) +
  scale_y_continuous(labels = dollar_format()) +
  scale_fill_brewer(palette = "Paired") +
  coord_flip() +
  labs(title = "Salary Distribution by Subfield",
       x = NULL, y = "Estimated Annual Salary")

# Stack the two panels vertically (patchwork syntax)
p_sal_rank / p_sal_sf
```

## Salary Trend Over Time

```{r p-salary-trend, fig.height=5}
sal_df %>%
  filter(!is.na(year)) %>%
  group_by(year) %>%
  summarise(median_sal = median(salary_est),
            mean_sal   = mean(salary_est),
            n          = n(),
            .groups    = "drop") %>%
  ggplot(aes(x = year)) +
  # Shaded band marks the gap between median and mean in each year
  geom_ribbon(aes(ymin = median_sal, ymax = mean_sal),
              alpha = 0.18, fill = "steelblue") +
  geom_line(aes(y = median_sal, colour = "Median"), linewidth = 1) +
  geom_line(aes(y = mean_sal, colour = "Mean"),
            linewidth = 1, linetype = "dashed") +
  scale_y_continuous(labels = dollar_format()) +
  scale_x_continuous(breaks = pretty_breaks(n = 8)) +
  scale_colour_manual(values = c(Median = "steelblue", Mean = "tomato"),
                      name = NULL) +
  labs(title = "Salary Trend Over Time",
       subtitle = paste0("n = ", comma(nrow(sal_df)),
                         " listings with explicit numeric salary"),
       x = NULL, y = "Annual Salary") +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 9, family = PAL))
```

---

# Text Mining

## Top 30 Terms

```{r p-top-terms, fig.height=8}
tidy_words %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 30) %>%
  ggplot(aes(x = n, y = reorder(word, n), fill = n)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = comma(n)), hjust = -0.2, size = 3.3, family = PAL) +
  scale_fill_viridis_c(option = "D", direction = -1) +
  scale_x_continuous(labels = comma, expand = expansion(mult = c(0, .15))) +
  labs(title = "Top 30 Terms Across All Job Listings",
       subtitle = "After removing stopwords; based on rank + unit fields",
       x = "Frequency", y = NULL)
```

## TF-IDF: Distinctive Terms by Subfield

> **TF-IDF** (Term Frequency–Inverse Document Frequency) surfaces words that
> are *unusually common* in one subfield relative to all others, revealing
> the distinctive vocabulary of each field. In the charts below, each subfield
> (or rank) is treated as a single document.
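
As a rough illustration of the scoring, the sketch below computes TF-IDF by hand in base R on a toy table. The data are invented for illustration and are not drawn from the eJobs corpus. The formulas (term share within a document, times the natural log of documents over documents containing the term) are, to my understanding, the ones `tidytext::bind_tf_idf()` applies, but treat that equivalence as an assumption and not as a description of this report's pipeline.

```r
# Toy word counts per subfield (hypothetical data, not from the eJobs corpus).
toy <- data.frame(
  subfield = c("AP", "AP", "AP", "IR", "IR", "IR", "PT", "PT"),
  word     = c("congress", "politics", "elections",
               "security", "politics", "conflict",
               "theory", "politics"),
  n        = c(4, 6, 2, 5, 5, 2, 7, 3),
  stringsAsFactors = FALSE
)

# Term frequency: a word's share of all word occurrences within its subfield.
toy$tf <- toy$n / ave(toy$n, toy$subfield, FUN = sum)

# Inverse document frequency: ln(number of subfields / subfields containing the word).
# Each (subfield, word) pair appears once, so counting rows per word gives
# the number of subfields that contain it.
n_docs   <- length(unique(toy$subfield))
doc_freq <- ave(rep(1, nrow(toy)), toy$word, FUN = sum)
toy$idf  <- log(n_docs / doc_freq)

toy$tf_idf <- toy$tf * toy$idf

# Show the highest-scoring rows first.
toy[order(-toy$tf_idf), c("subfield", "word", "tf", "idf", "tf_idf")]
```

In this toy table "politics" appears in every subfield, so its IDF is zero and it scores zero however often it occurs. This is why generic vocabulary drops out of the TF-IDF charts even when it tops the raw frequency chart.
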
```{r p-tfidf-subfield, fig.height=10}
tidy_words %>%
  filter(!is.na(subfield)) %>%
  count(subfield, word) %>%
  bind_tf_idf(word, subfield, n) %>%
  group_by(subfield) %>%
  slice_max(tf_idf, n = 8) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, subfield)) %>%
  ggplot(aes(x = tf_idf, y = word, fill = subfield)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ subfield, scales = "free_y", ncol = 3) +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Most Distinctive Terms by Subfield (TF-IDF)",
       subtitle = "Words disproportionately frequent in each subfield",
       x = "TF-IDF Score", y = NULL) +
  theme(axis.text.y = element_text(size = 8, family = PAL))
```

## TF-IDF: Distinctive Terms by Rank

```{r p-tfidf-rank, fig.height=10}
tidy_words %>%
  filter(rank_category %in% tt_ranks) %>%
  count(rank_category, word) %>%
  bind_tf_idf(word, rank_category, n) %>%
  group_by(rank_category) %>%
  slice_max(tf_idf, n = 8) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, rank_category)) %>%
  ggplot(aes(x = tf_idf, y = word, fill = rank_category)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ rank_category, scales = "free_y", ncol = 3) +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Spectral") +
  labs(title = "Most Distinctive Terms by Rank (TF-IDF)",
       x = "TF-IDF Score", y = NULL) +
  theme(axis.text.y = element_text(size = 8, family = PAL))
```

## Top Bigrams

```{r p-bigrams, fig.height=8}
jobs %>%
  unnest_tokens(bigram, full_text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  # Drop bigrams where either word is a stopword, very short, or purely numeric
  filter(!w1 %in% stop_words$word,    !w2 %in% stop_words$word,
         !w1 %in% ps_stopwords$word,  !w2 %in% ps_stopwords$word,
         str_length(w1) > 2,          str_length(w2) > 2,
         !str_detect(w1, "^\\d+$"),   !str_detect(w2, "^\\d+$")) %>%
  unite(bigram, w1, w2, sep = " ") %>%
  count(bigram, sort = TRUE) %>%
  slice_head(n = 25) %>%
  ggplot(aes(x = n, y = reorder(bigram, n), fill = n)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = comma(n)), hjust = -0.2, size = 3.3, family = PAL) +
  scale_fill_viridis_c(option = "C", direction = -1) +
  scale_x_continuous(labels = comma, expand = expansion(mult = c(0, .15))) +
  labs(title = "Top 25 Bigrams in Job Listings",
       subtitle = "Common two-word phrases after stopword removal",
       x = "Frequency", y = NULL)
```

## Word Clouds by Subfield

::: {.panel-tabset}

### American Politics

```{r wc-ap, echo=FALSE, fig.height=5, fig.width=7}
wc <- tidy_words %>%
  filter(subfield == "AP") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 100)

# Add the title only when a cloud was drawn; title() fails without an open plot.
if (nrow(wc) >= 5) {
  wordcloud(wc$word, wc$n, min.freq = 2, max.words = 80,
            random.order = FALSE, rot.per = 0.25,
            colors = brewer.pal(8, "Dark2"), scale = c(3.5, 0.5),
            family = PAL)
  title(main = "American Politics", cex.main = 1.2, family = PAL)
}
```

### Comparative Politics

```{r wc-cp, echo=FALSE, fig.height=5, fig.width=7}
wc <- tidy_words %>%
  filter(subfield == "CP") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 100)

if (nrow(wc) >= 5) {
  wordcloud(wc$word, wc$n, min.freq = 2, max.words = 80,
            random.order = FALSE, rot.per = 0.25,
            colors = brewer.pal(8, "Dark2"), scale = c(3.5, 0.5),
            family = PAL)
  title(main = "Comparative Politics", cex.main = 1.2, family = PAL)
}
```

### International Relations

```{r wc-ir, echo=FALSE, fig.height=5, fig.width=7}
wc <- tidy_words %>%
  filter(subfield == "IR") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 100)

if (nrow(wc) >= 5) {
  wordcloud(wc$word, wc$n, min.freq = 2, max.words = 80,
            random.order = FALSE, rot.per = 0.25,
            colors = brewer.pal(8, "Dark2"), scale = c(3.5, 0.5),
            family = PAL)
  title(main = "International Relations", cex.main = 1.2, family = PAL)
}
```

### Methodology

```{r wc-methods, echo=FALSE, fig.height=5, fig.width=7}
wc <- tidy_words %>%
  filter(subfield == "Methods") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 100)

if (nrow(wc) >= 5) {
  wordcloud(wc$word, wc$n, min.freq = 2, max.words = 80,
            random.order = FALSE, rot.per = 0.25,
            colors = brewer.pal(8, "Dark2"), scale = c(3.5, 0.5),
            family = PAL)
  title(main = "Methodology", cex.main = 1.2, family = PAL)
}
```

### Political Theory

```{r wc-pt, echo=FALSE, fig.height=5, fig.width=7}
wc <- tidy_words %>%
  filter(subfield == "PT") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 100)

if (nrow(wc) >= 5) {
  wordcloud(wc$word, wc$n, min.freq = 2, max.words = 80,
            random.order = FALSE, rot.per = 0.25,
            colors = brewer.pal(8, "Dark2"), scale = c(3.5, 0.5),
            family = PAL)
  title(main = "Political Theory", cex.main = 1.2, family = PAL)
}
```

### Public Law

```{r wc-pl, echo=FALSE, fig.height=5, fig.width=7}
wc <- tidy_words %>%
  filter(subfield == "PL") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 100)

if (nrow(wc) >= 5) {
  wordcloud(wc$word, wc$n, min.freq = 2, max.words = 80,
            random.order = FALSE, rot.per = 0.25,
            colors = brewer.pal(8, "Dark2"), scale = c(3.5, 0.5),
            family = PAL)
  title(main = "Public Law", cex.main = 1.2, family = PAL)
}
```

### Public Policy

```{r wc-pp, echo=FALSE, fig.height=5, fig.width=7}
wc <- tidy_words %>%
  filter(subfield == "PP") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 100)

if (nrow(wc) >= 5) {
  wordcloud(wc$word, wc$n, min.freq = 2, max.words = 80,
            random.order = FALSE, rot.per = 0.25,
            colors = brewer.pal(8, "Dark2"), scale = c(3.5, 0.5),
            family = PAL)
  title(main = "Public Policy", cex.main = 1.2, family = PAL)
}
```

:::

---

# Browse All Jobs {#browse}

```{r datatable-all, echo=FALSE}
jobs %>%
  select(ejobs_id, year, subfield, org_name, rank, rank_category,
         unit, start_date, date_posted, deadline, salary,
         region, state_raw) %>%
  arrange(desc(year), subfield) %>%
  datatable(
    caption  = "All Unique eJobs Listings",
    filter   = "top",
    rownames = FALSE,
    options  = list(
      pageLength = 15,
      scrollX    = TRUE,
      # Zero-based column indices: 3 = org_name, 6 = unit
      columnDefs = list(list(width = "200px", targets = c(3, 6)))
    )
  )
```

---

# Appendix

## R Session Info

```{r session-info}
sessionInfo()
```

## Data Pipeline Summary

| Stage | Tool | Output |
|-------|------|--------|
| PDF text extraction | `pdftools::pdf_text()` | Raw character vectors |
| TOC count parsing | `stringr` regex on pages 1–3 | `ps_jobs_toc_counts.csv` |
| Body text scraping | Line-by-line section tracker + eJobs ID flush | `ps_jobs_all_raw.csv` |
| Deduplication | `dplyr::distinct(ejobs_id, .keep_all = TRUE)` | `ps_jobs_all_unique.csv` |
| Verification | TOC count vs scraped count per issue × subfield | `ps_jobs_verification.csv` |
| Analytics & Report | `ggplot2`, `tidytext`, `maps`, Quarto | This document |

## Subfield Code Reference

| Code | Full Name |
|------|-----------|
| AP | American Government and Politics |
| CP | Comparative Politics |
| IR | International Relations |
| Methods | Methodology |
| PT | Political Theory |
| PL | Public Law |
| PP | Public Policy |
| PAdmin | Public Administration |
| Admin | Administration |
| Non-Academic | Non-Academic Positions |
| Open | Open Subfield |
| Other | Other |
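
The deduplication and verification stages listed in the Data Pipeline Summary can be sketched on toy data as follows. This is a minimal base R illustration, not the report's actual pipeline code: the column names `issue`, `toc_count`, and `scraped_count`, and all values, are assumptions made up for the example. `!duplicated()` stands in for the keep-first-row behaviour of a `distinct()` call on the ID column.

```r
# Hypothetical scraped records: one row per listing per issue.
raw <- data.frame(
  ejobs_id = c(101, 102, 103, 101, 104),
  issue    = c("2024-09", "2024-09", "2024-09", "2024-10", "2024-10"),
  subfield = c("AP", "AP", "IR", "AP", "IR"),
  stringsAsFactors = FALSE
)

# Hypothetical TOC counts: the "(N listings)" figure per issue and subfield.
toc <- data.frame(
  issue     = c("2024-09", "2024-09", "2024-10", "2024-10"),
  subfield  = c("AP", "IR", "AP", "IR"),
  toc_count = c(2, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Deduplication: keep the first row seen for each eJobs ID, with all columns.
unique_jobs <- raw[!duplicated(raw$ejobs_id), ]

# Verification: count scraped rows per issue x subfield and compare with the TOC.
scraped <- aggregate(ejobs_id ~ issue + subfield, data = raw, FUN = length)
names(scraped)[names(scraped) == "ejobs_id"] <- "scraped_count"

check <- merge(toc, scraped, by = c("issue", "subfield"), all.x = TRUE)
check$scraped_count[is.na(check$scraped_count)] <- 0
check$diff <- check$scraped_count - check$toc_count

# Rows where the scraper disagrees with the TOC.
check[check$diff != 0, ]
```

Verification here runs on the raw per-issue records, not the deduplicated set, because a listing that is re-run in a later issue still counts toward that issue's TOC total. A negative `diff` flags an issue and subfield where the scraper found fewer listings than the TOC reports.
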