Towards Transit Data Accessibility: Large Language Models and Software Tools for GTFS
Advisor: Lewis Lehe
Abstract
In an era characterized by data-driven decision-making, the General Transit Feed Specification
(GTFS) has emerged as a global standard for publishing public transit data, enabling unprecedented
transparency and accessibility. Despite its widespread adoption, extracting and analyzing transit data
from GTFS remains challenging due to its complexity, optional components, and varying agency
adherence to the standard. This dissertation addresses these challenges by proposing new tools and
methods that make transit data more accessible using software and large language model (LLM)-
based techniques.
The dissertation begins with a systematic survey of errors in GTFS data across 632 U.S. transit
feeds. Approximately 21% of the feeds contain at least one error. The analysis identifies the most
common issues, with errors related to the optional shape_dist_traveled field accounting for the
majority, and fare-related discrepancies forming a secondary cluster. The analysis also demonstrates
the limits of identifying errors programmatically, showing that manual inspection is necessary to
catch some of the most severe errors.
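As an illustration of the kind of check that can be automated, the sketch below flags trips whose shape_dist_traveled value decreases between consecutive stops; the GTFS reference requires this field to increase along with stop_sequence. This is a minimal example under stated assumptions, not the survey procedure used in the dissertation: the inline feed data and the function name find_decreasing_shape_dist are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Toy stop_times.txt content (made-up values, for illustration only).
STOP_TIMES = """trip_id,stop_sequence,stop_id,shape_dist_traveled
t1,1,A,0.0
t1,2,B,1.2
t1,3,C,0.9
t2,1,A,0.0
t2,2,C,2.5
"""


def find_decreasing_shape_dist(stop_times_csv):
    """Return (trip_id, stop_sequence) pairs where shape_dist_traveled drops."""
    rows_by_trip = defaultdict(list)
    for row in csv.DictReader(io.StringIO(stop_times_csv)):
        if row["shape_dist_traveled"] == "":
            continue  # the field is optional, so blank values are skipped
        rows_by_trip[row["trip_id"]].append(
            (int(row["stop_sequence"]), float(row["shape_dist_traveled"]))
        )

    problems = []
    for trip_id, rows in rows_by_trip.items():
        rows.sort()  # order each trip's rows by stop_sequence
        for (_, prev_dist), (seq, dist) in zip(rows, rows[1:]):
            if dist < prev_dist:
                problems.append((trip_id, seq))
    return problems


problems = find_decreasing_shape_dist(STOP_TIMES)
print(problems)  # [('t1', 3)]
```

A check like this catches only rule violations that can be stated mechanically; errors that depend on local knowledge of the network would still require manual inspection.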
Subsequently, this dissertation addresses the absence of tools for calculating bus stop spacings
from GTFS feeds by introducing gtfs-segments, a Python package that computes summary statistics
and visualizes spacing distributions. In addition, it establishes terminology and various weighting
schemes for calculating stop spacing statistics. Using gtfs-segments, stop spacings were computed
for 539 U.S. transit providers and 83 Canadian providers, while detailed statistics were produced for
30 U.S. providers, 10 Canadian providers, and a sample of 38 international providers. The analysis
shows that different weighting schemes yield distinct “average” spacing values on both a hypothetical
sample network and actual transit networks. Notably, the weighted spacings in the U.S. and Canada
are narrower than those observed in other regions, yet remain wider than the values that the
literature suggests on the basis of anecdotal evidence.
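To see why the choice of weights matters, consider the toy calculation below. It is a sketch under stated assumptions rather than the gtfs-segments API: the four segments, their traversal counts, and the three weighting schemes shown (equal weights, weights by daily traversals, weights by segment length) are hypothetical and chosen only to show that the same network can yield different averages.

```python
# Hypothetical segments: (length in metres, daily bus traversals).
segments = [
    (200.0, 100),
    (300.0, 100),
    (400.0, 20),
    (1100.0, 10),
]


def weighted_mean(values, weights):
    """Weighted arithmetic mean of values."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


lengths = [length for length, _ in segments]
traversals = [count for _, count in segments]

# Every segment counts once, regardless of how often it is served.
unweighted = weighted_mean(lengths, [1] * len(segments))
# Segments served by more trips count proportionally more.
by_traversals = weighted_mean(lengths, traversals)
# Longer segments count proportionally more.
by_length = weighted_mean(lengths, lengths)

print(unweighted)     # 500.0
print(by_traversals)  # 300.0
print(by_length)      # 750.0
```

In this example the short segments carry most of the service, so weighting by traversals pulls the average down, while weighting by length pulls it up.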
GTFS data is intricate, comprising over 20 interlinked files with 250+ attributes, each having a
description, presence condition, and data type. This dissertation investigates the potential of LLMs
in extracting information from GTFS feeds by introducing the ‘GTFS Semantics’ and ‘GTFS Retrieval’
benchmarks to evaluate their comprehension and retrieval capabilities. Benchmarking ChatGPT
(GPT-3.5 Turbo and GPT-4) reveals that LLMs exhibit a reasonable understanding of GTFS
semantics and can perform ‘simple’ extraction tasks by generating Python code. However, they are
prone to hallucinations, particularly when associating attributes with the correct files and when
handling enumerated attribute types. These errors in turn lead to poor performance on ‘complex’
tasks that involve multiple files and attributes.
The culmination of this dissertation is the creation of TransitGPT, a chatbot that leverages LLMs
to answer natural language queries about GTFS data such as “What is the longest bus route in
Chicago?”. TransitGPT guides the LLM to generate Python code that extracts and manipulates
relevant transit data, which is then executed on a server hosting the GTFS feeds. This framework
supports a wide range of tasks, including data retrieval, calculations, and interactive visualizations,
without requiring users to have extensive knowledge of GTFS or programming. The LLMs are
guided entirely by prompts (through prompt engineering techniques) without the need for fine-tuning
or direct access to the feeds, allowing any LLM to serve as a drop-in replacement. Evaluations
using GPT-4o and Claude-3.5-Sonnet on a benchmark dataset of 100 tasks demonstrate that TransitGPT
significantly enhances the accessibility and usability of transit data, empowering planners,
researchers, and the public with an intuitive interface for complex data analysis.