Supporting interactive queries and analytics over large data is a critical requirement in many data-driven applications. The classic external memory model based on IO optimizations no longer works well in the era of big data due to its high latency. Instead, newer systems (e.g., Spark, Impala) rely on in-memory computing over a cluster of commodity machines to offer scale-out interactive data analytics.
In the context of large spatio-temporal data, this talk presents the Simba system that offers scalable and efficient in-memory analytics over a cluster. Simba extends the Spark SQL engine to support rich query and analytical semantics through both SQL and DataFrame API (e.g., spatial join, knn join, trajectories).
An effective query optimizer leveraging its indexing support and geometry-aware query optimization is designed. Furthermore, the system is able to provide online analytics that explores the accuracy-efficiency tradeoff through novel online aggregation techniques that support complex multi-way join queries and random sampling over joins. Lastly, we will also present ongoing extensions to Simba that explores spatio-temporal learning and sentiment analysis over large data.