Abstract: Learning and representing low-dimensional structures from noisy, high-dimensional
data is a cornerstone of modern biomedical data science. Stochastic neighbor embedding
algorithms, a family of nonlinear dimensionality reduction and data visualization methods, with
t-SNE and UMAP as two leading examples, have become especially influential in recent years,
particularly in single-cell analysis. Yet despite their popularity, these methods remain
debated because of limited theoretical understanding, ambiguous interpretation, and
sensitivity to tuning parameters. In this talk, I will present our recent efforts to decipher,
demystify, and improve these nonlinear embedding approaches. Our key results include a
rigorous theoretical framework that uncovers the intrinsic mechanisms, large-sample limits, and
fundamental principles underlying these algorithms; a set of theory-informed practical
guidelines for their principled use in trustworthy biological discovery; and a collection of new
algorithms that address current limitations and improve performance in areas such as bias
reduction and stability. Throughout the talk, I will highlight how these advances not only
deepen our statistical understanding but also open new avenues for biological insight.