Primary and Secondary Syphilis Growth Rates
To estimate the growth rates for Primary and Secondary Syphilis we fit a Bayesian growth curve model to the data.
We modelled the yearly number of primary and secondary syphilis cases by county (\(i\)) and year (\(t\)) using a negative binomial distribution and an offset to account for the population size of the county (\(pop_i\)).
\[
Cases_i \sim NB(mu_i, phi)
\]
To account for the spatial and temporal autocorrelation we used:
- A fixed overall year effect (\(\beta_{year}\)) to capture the overall temporal trend
- A smooth temporal trend (\(s(t)\)) using a second-order random walk (RW2) to capture non-linear secular trends
- County-specific random intercepts (\(\beta_{0i}\)) and random slopes (\(\beta_{year,i}\)) to allow each county to have its own baseline and growth rate
This allows us to account for the secular trends in the data, while examining the county-specific trends in growth rates.
\[\begin{align*}
Cases_{it} &\sim \mathrm{NB}(\mu_{it}, \phi) \\
\mu_{it} &= \exp(\beta_0 + \beta_{year} \cdot t + s(t) + \beta_{0i} + \beta_{year,i} \cdot t + \log(pop_i)) \\
\beta_{0i} &\sim \mathcal{N}(0, \sigma_{county}) \\
\beta_{year,i} &\sim \mathcal{N}(0, \sigma_{slope}) \\
s(t) &\sim \text{RW2}(\tau)
\end{align*}\]
where \(t\) is the centered year, \(pop_i\) is the population of county \(i\), and the county-specific year coefficients (\(\beta_{year,i}\)) represent the deviation from the overall year trend for each county. The smooth temporal trend \(s(t)\) follows a second-order random walk (RW2), which penalizes second-order differences \(\Delta^2 s_t = s_t - 2s_{t-1} + s_{t-2}\) to encourage smoothness over time. The RW2 prior is specified via a precision matrix \(Q\) such that \(s \sim \mathcal{N}(0, (\tau Q)^{-1})\), where \(\tau\) controls the smoothness of the trend.
We specified the following weakly informative priors:
- Fixed effects (\(\beta_0\), \(\beta_{year}\)): \(\mathcal{N}(0, 0.5)\)
- Random effect standard deviations (\(\sigma_{county}\), \(\sigma_{slope}\)): \(\text{Exponential}(2)\)
- RW2 precision parameter (\(\tau\)): \(\text{Exponential}(2)\)
- Negative binomial dispersion parameter (\(\phi\)): \(\text{Exponential}(2)\)
The model was run using NumPyro/JAX with NUTS sampling using 4 chains, 1000 warmup iterations, and 2000 sampling iterations per chain. We then extracted the county-specific year coefficients from the model, which represent the total year effect for each county (combining the fixed overall year effect and the county-specific random slope).
Cross-correlation of Primary and Secondary Syphilis and Congenital Syphilis
To examine the relationship between primary and secondary syphilis and congenital syphilis, we calculated cross-correlations between the two time series of rates and cases(per 100,000 population) at multiple lags.
Primary and secondary syphilis cases were aggregated to state-level totals by year. Cross-correlation was then calculated separately for cases and rates (per 100,000 population) to examine the relationship between primary and secondary syphilis (leading variable) and congenital syphilis (lagged variable) at lags of 0 to 5 years.
Cross-correlation measures the similarity between two time series as a function of the lag between them. For each lag \(k\), we calculated the Pearson correlation coefficient:
\[
r_k = \frac{\text{Cov}(PS_{t}, CS_{t+k})}{\sigma_{PS} \cdot \sigma_{CS}}
\]
where \(PS_t\) represents primary and secondary syphilis at time \(t\), \(CS_{t+k}\) represents congenital syphilis at time \(t+k\), and positive lags indicate that primary and secondary syphilis leads congenital syphilis.
We calculated cross-correlations for the Southeast region overall and separately for each state. The lag with the highest correlation coefficient was identified for each analysis.
Multivariate Changepoint Analysis
To identify counties experiencing significant shifts in STI trends across multiple pathogens simultaneously, we performed multivariate changepoint detection. This analysis examines whether counties show coordinated changes in HIV prevalence, Chlamydia (CT) rates, Gonorrhea (GC) rates, and Primary and Secondary Syphilis rates over time.
For each county, we constructed a multivariate time series matrix where each row represents a year and each column represents one of the four STI indicators. The data were standardized (mean-centered and scaled by standard deviation) to ensure all variables contribute equally to the changepoint detection, regardless of their different scales.
We used the PELT (Pruned Exact Linear Time) algorithm with an L2 cost function to detect changepoints in the multivariate time series. The PELT algorithm efficiently identifies the optimal segmentation of the time series by minimizing a cost function that balances model fit with a penalty for the number of changepoints:
\[
\text{Cost}(y_{1:n}) = \min_{\tau} \left[ \sum_{i=1}^{m+1} \mathcal{C}(y_{\tau_{i-1}+1:\tau_i}) + \beta \cdot m \right]
\]
where \(\tau = \{\tau_1, \tau_2, \ldots, \tau_m\}\) represents the changepoint locations, \(\mathcal{C}\) is the cost function for each segment, \(m\) is the number of changepoints, and \(\beta\) is a penalty parameter (set to 10.0) that controls the trade-off between model fit and model complexity.
The L2 cost function measures the sum of squared deviations from the segment mean:
\[
\mathcal{C}(y_{a:b}) = \sum_{t=a}^{b} \|y_t - \bar{y}_{a:b}\|^2
\]
where \(\bar{y}_{a:b}\) is the mean vector for the segment from time \(a\) to \(b\), and \(\|\cdot\|\) denotes the Euclidean norm.
After detecting changepoints, we identified which specific pathogens changed at each changepoint by comparing the mean values before and after the changepoint. For each variable, we calculated a standardized difference (effect size):
\[
d_j = \frac{|\bar{y}_{j,\text{after}} - \bar{y}_{j,\text{before}}|}{\sigma_{j,\text{pooled}}}
\]
where \(\bar{y}_{j,\text{before}}\) and \(\bar{y}_{j,\text{after}}\) are the mean values of variable \(j\) before and after the changepoint, and \(\sigma_{j,\text{pooled}}\) is the pooled standard deviation. Variables with \(d_j > 0.5\) were considered to have changed significantly at that changepoint.
Counties were then classified based on the direction of changes:
-
Increasing: At least one changepoint with increasing trends in one or more pathogens
-
Decreasing: Changepoint(s) with only decreasing trends (no increasing trends)
-
Mixed: Both increasing and decreasing changepoints detected
-
No change: No significant changepoints detected
This multivariate approach allows us to identify counties experiencing syndemic shifts—coordinated changes across multiple STIs—which may indicate changes in sexual networks, prevention programs, or other population-level factors affecting STI transmission.
Results are visualized in the Multivariate Changepoint Analysis page.
Local Indicators of Spatial Association (LISA)
To identify spatial clusters and outliers in STI case rates, we calculated Local Indicators of Spatial Association (LISA) using Local Moran’s I. LISA provides a local measure of spatial autocorrelation, identifying where similar values cluster together in space and where outliers occur.
For each county \(i\), Local Moran’s I is calculated as:
\[
I_i = \frac{(y_i - \bar{y})}{S^2} \sum_{j=1}^{n} w_{ij}(y_j - \bar{y})
\]
where \(y_i\) is the case rate for county \(i\), \(\bar{y}\) is the mean case rate across all counties, \(S^2\) is the variance of case rates, \(w_{ij}\) are the elements of a spatial weights matrix (1 if counties \(i\) and \(j\) are neighbors, 0 otherwise), and \(n\) is the total number of counties.
We used a Queen contiguity spatial weights matrix, where two counties are considered neighbors if they share a border or a vertex. The weights matrix was row-standardized so that each row sums to 1.
Counties were classified into five categories based on their Local Moran’s I value and statistical significance (p < 0.05):
-
High-High (HH): High-rate counties surrounded by high-rate counties (hot spots)
-
Low-Low (LL): Low-rate counties surrounded by low-rate counties (cold spots)
-
High-Low (HL): High-rate counties surrounded by low-rate counties (spatial outliers)
-
Low-High (LH): Low-rate counties surrounded by high-rate counties (spatial outliers)
-
Not Significant: Counties that do not show significant spatial clustering
Statistical significance was assessed using simulation-based p-values (999 permutations) to account for multiple testing.
LISA clusters help identify:
-
Spatial hot spots (HH clusters): Areas where high case rates cluster together, potentially indicating shared transmission networks or common risk factors
-
Spatial cold spots (LL clusters): Areas with consistently low rates, potentially indicating effective prevention programs or lower-risk populations
-
Spatial outliers (HL or LH): Counties that differ significantly from their neighbors, which may indicate unique local factors or data quality issues
Results are presented in the Spatial Point Process Analysis page.
Spatial Point Process Analysis
To understand whether the spatial distribution of STI cases can be explained by population size alone, or if there are additional spatial patterns indicating transmission networks or shared risk factors, we performed spatial point process analysis using spatial regression models.
We fit a Spatial Lag model (also known as a spatial autoregressive model) to predict case rates based on population size while accounting for spatial autocorrelation:
\[
y = \rho W y + X\beta + \epsilon
\]
where:
-
\(y\) is the vector of case rates per 100,000 population
-
\(W\) is the spatial weights matrix (Queen contiguity, row-standardized)
-
\(\rho\) is the spatial autoregressive coefficient
-
\(X\) is a matrix of independent variables (log-transformed population)
-
\(\beta\) is a vector of regression coefficients
-
\(\epsilon\) is a vector of error terms
The spatial lag term \(\rho W y\) captures the influence of neighboring counties’ case rates. If \(\rho\) is significantly different from zero, it indicates that spatial autocorrelation is present—meaning that neighboring counties influence each other’s case rates beyond what would be expected from population size alone.
The model was estimated using maximum likelihood estimation. We then calculated: 1. Predicted rates: Expected case rates based on the model (accounting for population and spatial autocorrelation) 2. Residuals: Observed rates minus predicted rates, which identify counties that are over-performing (positive residuals) or under-performing (negative residuals) relative to model expectations
The spatial regression model allows us to test the hypothesis that spatial distribution is solely a function of population size. If spatial autocorrelation is significant, it suggests that transmission patterns, shared risk factors, or healthcare access create spatial dependence that cannot be explained by population alone.
Results, including maps of observed rates, predicted rates, residuals, and LISA clusters, are presented in the Spatial Point Process Analysis page.
Small Area Estimation and Hotspot Analysis
We employ three Bayesian spatial models:
ICAR (Intrinsic Conditional Autoregressive) - Also known as the Besag model, assumes spatial effects follow an improper normal distribution with precision matrix proportional to the graph Laplacian (D - A, where D is diagonal matrix of neighbor counts and A is adjacency matrix).
CAR (Conditional Autoregressive) - Extends ICAR with a spatial correlation parameter α (\(0 ≤ \alpha < 1\)). When \(\alpha = 0\), effects are independent; as \(\alpha\) approaches 1, the model approaches ICAR.
BYM (Besag-York-Mollié) - Combines ICAR spatial component with independent random effects, providing flexibility to model both structured and unstructured spatial variation.
All models use Poisson likelihoods with case counts and population denominators. Hotspots are identified as counties where spatial random effects exceed the 90th percentile.
For more details on areal data modeling, see this tutorial.