STEM-LM: Spatio-Temporal Ecological Modeling via Masked Language Model for Joint Species Distribution
Li, J. K.; Lim, W.; Callahan, F. M.; Raskin, L. Y.; Lemmon-Kishi, M.; Nielsen, R.
Abstract
Joint species distribution models (JSDMs) are central to biodiversity forecasting and conservation decision-making. As ecological datasets grow in size, dimensionality, and spatio-temporal resolution, there is a need for flexible yet scalable JSDMs tailored to large-scale species observation data. Recent advances in masked language modeling for text and genomics suggest a natural alternative: by treating each species' presence or absence as a token, and a site's species assemblage together with its spatio-temporal and ecological covariates as a sentence, we can learn joint co-occurrence structure by reconstructing masked species from their neighboring sites. We propose STEM-LM, a Transformer-based JSDM that frames joint species distribution modeling as masked language modeling. By varying the masking rate during training, a single trained model supports both purely spatio-temporal/ecological prediction and conditioning on arbitrary subsets of observed species for joint co-occurrence inference at a given site. On a North American butterfly dataset and a global plant distribution dataset, STEM-LM performs better than or on par with other statistical and deep-learning-based methods in terms of discriminative ranking, while producing substantially better rank-calibrated occurrence probabilities. Using partial species observations at the same site greatly improves prediction performance.
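The variable-rate masking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `MASK` sentinel, the function names, and the uniform distribution over masking rates are all assumptions made for illustration. The sketch shows only how one masking routine covers both extremes: a rate of 1.0 hides every species (prediction from covariates alone), while a lower rate leaves some observed species visible as conditioning context.

```python
import random

MASK = -1  # hypothetical sentinel standing in for a masked species token


def mask_assemblage(presence, mask_rate, rng):
    """Mask each species token independently with probability mask_rate.

    Returns the masked copy of the assemblage and the indices that were
    masked (the positions a model would be asked to reconstruct).
    """
    masked = list(presence)
    targets = []
    for i in range(len(presence)):
        if rng.random() < mask_rate:
            masked[i] = MASK
            targets.append(i)
    return masked, targets


def sample_training_example(presence, rng):
    """Draw a fresh masking rate per example.

    Sampling the rate uniformly from [0, 1] is an assumption; the abstract
    states only that the masking rate is varied during training.
    """
    rate = rng.uniform(0.0, 1.0)
    return mask_assemblage(presence, rate, rng)


rng = random.Random(0)
site = [1, 0, 0, 1, 1, 0]  # toy presence/absence vector for six species

# Rate 1.0: every species is hidden, so only covariates would inform prediction.
fully_masked, all_idx = mask_assemblage(site, 1.0, rng)

# Rate 0.5: the unmasked species remain available as context.
partly_masked, some_idx = mask_assemblage(site, 0.5, rng)
```

At inference time, the same routine could be replaced by a deterministic mask chosen by the user, for example hiding only the species whose occurrence is unknown at a site.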