Decoding nonlinear signals in large observational datasets

We are excited to share our new Towards Data Science post that pulls together three recent publications from our group. It is a simple overview describing how we built a robust precipitation dataset, what linear methods say about it, and what we can also learn from nonlinear approaches. Read it here: https://towardsdatascience.com/decoding-nonlinear-signals-in-large-observational-datasets/

What we set out to do

Modern observing systems produce a lot of data. If we want better precipitation science and improved global retrievals, we need to understand the structures within that data. This series walks through three steps for examining these structures:

1) Curation of a robust multidimensional dataset
2) Examining linear embeddings with PCA
3) Exploring nonlinear features with UMAP

Below is a quick summary of each step and what the results mean in practice.

1) The microphysical dataset

Paper: Earth and Space Science (Data Paper) — https://doi.org/10.1029/2024EA003538

We assembled more than one million minutes of particle observations across ten sites and matched them with surface meteorology. Using the NASA PIP instrument, we extracted particle size distributions, fall speeds, and effective densities, then ran strict quality checks and published everything in CF-compliant NetCDF.

2) Linear embeddings with PCA

Paper: Journal of the Atmospheric Sciences — https://doi.org/10.1175/JAS-D-24-0076.1

We focused first on snowfall cases and applied PCA to six PIP variables over 5-minute windows. Three principal components captured about 95 percent of the variance.

PC1 behaved like an intensity embedding
PC2 was linked to fall speed and to particle density (wetness or temperature)
PC3 related to the size and regime of the particles
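
To make the PCA step concrete, here is a minimal sketch of the general technique using NumPy. It is not the exact pipeline from the paper: the data are synthetic (six made-up variables driven by three hidden factors), and the function name pca and all parameter choices are assumptions for illustration only.

```python
import numpy as np

def pca(X, n_components=3):
    """Return (scores, components, explained_variance_ratio) using an SVD."""
    # Standardize each variable so that differing units do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes; squared singular values are
    # proportional to the variance captured along each axis.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    var_ratio = S**2 / np.sum(S**2)
    # Project the standardized data onto the leading axes.
    scores = Z @ Vt[:n_components].T
    return scores, Vt[:n_components], var_ratio

# Synthetic stand-in for the real observations (assumption): 1000 samples
# of six variables generated from three latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))
mixing = rng.normal(size=(3, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 6))

scores, components, var_ratio = pca(X)
print("Fraction of variance in the first three PCs:", var_ratio[:3].sum())
```

Each row of components holds the loadings of one principal component on the six input variables. Inspecting those loadings is one common way to attach physical interpretations like the ones listed above.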

3) Nonlinear features with UMAP + HDBSCAN

Paper: Science Advances — https://doi.org/10.1126/sciadv.adu0162

We expanded to rain, mixed phase, and snow using twelve input variables. UMAP exposed a smooth 3D manifold of precipitation processes. HDBSCAN then produced nine clear clusters plus an ambiguous group.

The primary latent embedding now tracked particle phase
The second captured intensity
The third reflected size and shape of the falling particles

Case studies displayed smooth transitions across the manifold (for example, rain to mixed phase to snow as surface temperature dropped), and attributions aligned well with independent radar observations.

What our results say

Good data matters. Careful data curation pays off more than any single algorithm choice (remember the 80/20 rule for these types of projects).
Linear structures are real and can be quite useful. You can describe a lot with a few axes and an assumption of linearity.
Nonlinear structure is needed to separate complex, mixed phase particles and to recover their evolutionary pathways between rain and snow.

If you want to explore the data and figures yourself, the post links to an interactive notebook and viewer.

Stay tuned for more updates!