Research Seminars @ Illinois


Tailored for undergraduate researchers, this calendar is a curated list of research seminars at the University of Illinois. Explore the diverse world of research and expand your knowledge through engaging sessions designed to inspire and enlighten.

To have your events added or removed from this calendar, please contact OUR at ugresearch@illinois.edu

PhD Final Defense - Saipraneeth Devunuri

Event Type: Seminar/Symposium

Sponsor: Civil and Environmental Engineering

Location: Newmark 1311

Date: Feb 21, 2025 10:00 am

Originating Calendar: CEE Seminars and Conferences

Towards Transit Data Accessibility: Large Language Models and Software Tools for GTFS

Advisor: Lewis Lehe

Abstract

In an era characterized by data-driven decision-making, the General Transit Feed Specification (GTFS) has emerged as a global standard for publishing public transit data, enabling unprecedented transparency and accessibility. Despite its widespread adoption, extracting and analyzing transit data from GTFS remains challenging due to its complexity, optional components, and varying agency adherence to the standard. This dissertation addresses these challenges by proposing new tools and methods that make transit data more accessible using software and large language model (LLM)-based techniques.

The dissertation begins with a systematic survey of errors in GTFS data across 632 U.S. transit feeds. Approximately 21% of the feeds contain at least one error. The analysis identifies the most common issues, with errors related to the optional shape_dist_traveled field accounting for the majority, and fare-related discrepancies forming a secondary cluster. The analysis also demonstrates the limits of identifying errors programmatically, showing that manual inspection is necessary to catch some of the most severe errors.
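As an illustration of the kind of check that can be automated, the sketch below flags trips whose shape_dist_traveled values decrease along stop_sequence in stop_times.txt. This is only one plausible check and is not necessarily one used in the dissertation; the sample rows and the function name are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stop_times.txt excerpt; trip "T2" has a decreasing distance.
STOP_TIMES = """trip_id,stop_sequence,stop_id,shape_dist_traveled
T1,1,A,0.0
T1,2,B,410.5
T1,3,C,820.0
T2,1,A,0.0
T2,2,B,500.0
T2,3,C,450.0
"""


def trips_with_decreasing_distance(stop_times_csv):
    """Return sorted trip_ids whose shape_dist_traveled decreases along stop_sequence."""
    by_trip = defaultdict(list)
    for row in csv.DictReader(io.StringIO(stop_times_csv)):
        if row["shape_dist_traveled"] == "":
            continue  # the field is optional, so blank values are skipped
        by_trip[row["trip_id"]].append(
            (int(row["stop_sequence"]), float(row["shape_dist_traveled"]))
        )

    flagged = []
    for trip_id, rows in by_trip.items():
        rows.sort()  # order by stop_sequence
        dists = [dist for _, dist in rows]
        if any(later < earlier for earlier, later in zip(dists, dists[1:])):
            flagged.append(trip_id)
    return sorted(flagged)


flagged = trips_with_decreasing_distance(STOP_TIMES)
print(flagged)
```

A check like this catches only internally inconsistent values; it cannot tell whether the distances are correct for the actual route, which is in line with the point that some errors need manual inspection.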

Subsequently, this dissertation addresses the absence of tools for calculating bus stop spacings from GTFS feeds by introducing gtfs-segments, a Python package that computes summary statistics and visualizes spacing distributions. In addition, it establishes terminology and various weighting schemes for calculating stop spacing statistics. Using gtfs-segments, stop spacings were computed for 539 U.S. transit providers and 83 Canadian providers, while detailed statistics were produced for 30 U.S. providers, 10 Canadian providers, and a sample of 38 international providers. The analysis shows that different weighting schemes yield distinct “average” spacing values on both a hypothetical sample network and actual transit networks. Notably, the weighted spacings in the U.S. and Canada are narrower than those observed in other regions, yet remain broader than the anecdotal evidence cited in the literature suggests.
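To see why the choice of weights matters, consider a minimal sketch on a made-up network of three segments. The abstract does not list the weighting schemes, so the two used here (equal weight per segment, and weight by daily bus traversals) are assumptions, as are the numbers. The code does not use the gtfs-segments API.

```python
# Hypothetical segments: (length in meters, bus traversals per day).
SEGMENTS = [
    (200.0, 100),
    (400.0, 50),
    (800.0, 10),
]


def weighted_mean(values, weights):
    """Weighted arithmetic mean of values."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


lengths = [length for length, _ in SEGMENTS]
traversals = [count for _, count in SEGMENTS]

# Every segment counts once.
unweighted = weighted_mean(lengths, [1] * len(lengths))

# Segments served by more buses count more.
by_traversals = weighted_mean(lengths, traversals)

print(round(unweighted, 1), by_traversals)
```

Here the short segments carry most of the service, so the traversal-weighted mean (300 m) is well below the per-segment mean (about 467 m), even though both describe the same network.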

GTFS data is intricate, comprising over 20 interlinked files with 250+ attributes, each having a description, presence condition, and data type. This dissertation investigates the potential of LLMs in extracting information from GTFS feeds by introducing the ‘GTFS Semantics’ and ‘GTFS Retrieval’ benchmarks to evaluate their comprehension and retrieval capabilities. Benchmarking ChatGPT (GPT-3.5 Turbo and GPT-4) reveals that LLMs exhibit a reasonable understanding of GTFS semantics and can perform ‘simple’ extraction tasks by generating Python code. However, they are prone to hallucinations, particularly in distinguishing attribute-file associations and enumerated attribute types. Furthermore, this leads to poor performance on ‘complex’ tasks that involve multiple files and attributes.
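For a sense of what a single-file extraction task might look like, the sketch below counts routes by mode. It depends on knowing both that route_type lives in routes.txt (an attribute-file association) and what its enumerated codes mean, for example 3 for bus and 1 for subway or metro. The task itself, the sample rows, and the function name are hypothetical and are not taken from the benchmarks.

```python
import csv
import io
from collections import Counter

# Subset of the GTFS route_type enumeration.
ROUTE_TYPE_NAMES = {
    "0": "Tram/Light rail",
    "1": "Subway/Metro",
    "2": "Rail",
    "3": "Bus",
    "4": "Ferry",
}

# Hypothetical routes.txt excerpt.
ROUTES = """route_id,route_short_name,route_type
10,10,3
22,22,3
RED,Red,1
"""


def count_routes_by_mode(routes_csv):
    """Count routes per mode, decoding the enumerated route_type field."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(routes_csv)):
        counts[ROUTE_TYPE_NAMES.get(row["route_type"], "Other")] += 1
    return dict(counts)


mode_counts = count_routes_by_mode(ROUTES)
print(mode_counts)
```

Looking for route_type in the wrong file, or decoding 3 as something other than bus, would produce code that fails or silently returns the wrong answer, which is the kind of mistake the paragraph above describes.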

The culmination of this dissertation is the creation of TransitGPT, a chatbot that leverages LLMs to answer natural language queries about GTFS data such as “What is the longest bus route in Chicago?”. TransitGPT helps guide the LLM to generate Python code that extracts and manipulates relevant transit data, which is then executed on a server hosting the GTFS feeds. This framework supports a wide range of tasks, including data retrieval, calculations, and interactive visualizations, without requiring users to have extensive knowledge of GTFS or programming. The LLMs are guided entirely by prompts (through prompt engineering techniques) without the need for fine-tuning or direct access to the feeds, allowing any LLM to serve as a drop-in replacement. Evaluations using GPT-4o and Claude-3.5-Sonnet on a benchmark dataset of 100 tasks demonstrate that TransitGPT significantly enhances the accessibility and usability of transit data, empowering planners, researchers, and the public with an intuitive interface for complex data analysis.
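The generate-then-execute loop described above can be sketched roughly as follows. This is a toy illustration, not TransitGPT's actual implementation: fake_llm is a stand-in that returns canned code instead of calling a model, the feed is a hypothetical in-memory dictionary, and length_km is an invented field rather than a GTFS attribute.

```python
def fake_llm(prompt):
    """Stand-in for a real LLM call (assumption); returns canned Python code."""
    return "result = max(feed['routes'], key=lambda r: r['length_km'])['route_id']"


def answer(question, feed):
    """Ask the (fake) LLM for code, then run it next to the data."""
    # The model sees only the prompt, never the feed itself.
    prompt = (
        "Write Python that stores the answer in `result`. "
        "The data is available as `feed`.\n"
        f"Question: {question}"
    )
    code = fake_llm(prompt)

    # The generated code runs where the data lives.
    # A real deployment would need sandboxing; bare exec is unsafe.
    namespace = {"feed": feed}
    exec(code, namespace)
    return namespace["result"]


# Hypothetical data; real feeds do not carry a precomputed route length.
FEED = {
    "routes": [
        {"route_id": "9", "length_km": 31.2},
        {"route_id": "4", "length_km": 12.5},
    ]
}

longest = answer("What is the longest bus route?", FEED)
print(longest)
```

Because the model only ever receives the prompt and returns code, swapping in a different LLM means replacing fake_llm alone, which mirrors the drop-in property mentioned in the abstract.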
