GTFS arrivals.

I created a macOS status bar app for public transit arrival times last year. It’s mostly just a fun demo app, but I do use it occasionally to check train times before my commute. The first iteration was a simple UI over the /StopPoint/:id/Arrivals endpoint of the Transport for London Unified API, meaning it was only useful if you happened to live in London.

On a trip to New York in November, I discovered the MTA also publishes real-time data for the NYC Subway. I went down the rabbit hole.

Next 3 arrivals at Hoyt-Schermerhorn

GTFS

NYC Subway data is published using an open standard called GTFS: General Transit Feed Specification. This format was originally developed at Google back in 2006 to get public transit data into early versions of Google Maps. It has two components:

  • GTFS Schedule: a ZIP file of CSVs with schedule data that changes infrequently (like the names and locations of stops)
  • GTFS Realtime (GTFS-RT): a Protocol Buffers feed of real-time network status (like current vehicle positions)

Let’s adapt this app to support GTFS. The first step is to fetch the real-time feed and see what data structures we’re working with. Since this is a Protobuf response, we need a library like Wire to decode the raw body bytes with the GTFS-RT schema.

This turns out to be a wildly different approach from the TfL API, where a single endpoint (one of many) returns a JSON response with a list of arrival predictions for a given station. All that’s required in that case is to filter the list by direction or platform and then add some formatting logic.

[{
    "$type": "Tfl.Api.Presentation.Entities.Prediction, Tfl.Api.Presentation.Entities",
    "id": "1800681618",
    "operationType": 1,
    "vehicleId": "001",
    "naptanId": "940GZZLUOVL",
    "stationName": "Oval Underground Station",
    "lineId": "northern",
    "lineName": "Northern",
    "platformName": "Northbound - Platform 1",
    "direction": "outbound",
    "bearing": "",
    "destinationNaptanId": "940GZZLUEGW",
    "destinationName": "Edgware Underground Station",
    "timestamp": "2025-01-11T22:38:30.0937667Z",
    "timeToStation": 494,
    "currentLocation": "Approaching Clapham South",
    "towards": "Edgware via Bank",
    "expectedArrival": "2025-01-11T22:46:44Z",
    "timeToLive": "2025-01-11T22:46:44Z",
    "modeName": "tube"
}, ... ]
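
As a rough sketch of that filter-and-format step, assuming predictions shaped like the response above: the function names, the "Due" label, and the second and third sample entries are my own illustration, not part of the TfL API or the app.

```python
import json


def next_arrivals(predictions, platform_name, limit=3):
    """Return the soonest `limit` predictions for one platform."""
    matching = [p for p in predictions if p["platformName"] == platform_name]
    matching.sort(key=lambda p: p["timeToStation"])
    return matching[:limit]


def format_countdown(seconds):
    """Format a countdown in seconds as whole minutes, e.g. 494 -> '8 min'."""
    minutes = seconds // 60
    # Assumption: anything under a minute is shown as "Due"
    return "Due" if minutes < 1 else f"{minutes} min"


# Trimmed-down sample using only fields seen in the response above.
# The first entry mirrors the real prediction; the other two are made up.
body = json.loads("""[
  {"platformName": "Northbound - Platform 1", "towards": "Edgware via Bank", "timeToStation": 494},
  {"platformName": "Southbound - Platform 2", "towards": "Morden", "timeToStation": 120},
  {"platformName": "Northbound - Platform 1", "towards": "High Barnet via Bank", "timeToStation": 61}
]""")

for prediction in next_arrivals(body, "Northbound - Platform 1"):
    print(prediction["towards"], format_countdown(prediction["timeToStation"]))
```

Sorting on timeToStation keeps the logic independent of the order the API happens to return predictions in.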

GTFS-RT gives us the status of the whole transit system in one go. The MTA actually splits NYC Subway feeds into groups: G, L, ACE, 1-7, and so on. Still, the 1-7 lines alone cover four of the five boroughs with 349 active station platforms at the time of writing. We’ll need to sift through hundreds of TripUpdate objects. Each of these updates represents a journey that one train is making and the predicted arrival times for each remaining stop along the line.

TripUpdate {
    trip=TripDescriptor {
        trip_id=101100_A..N34X001,
        route_id=A,
        start_time=16:57:08,
        start_date=20250111
    },
    stop_time_update=[
        StopTimeUpdate {
            stop_id=A33N,
            arrival=StopTimeEvent { time=1736637142 },
            departure=StopTimeEvent { time=1736637142 },
            departure_occupancy_status=NO_DATA_AVAILABLE
        }, ...
    ]
}
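
The time fields in a StopTimeEvent are POSIX timestamps (seconds since 1 January 1970 UTC). A quick sketch of turning the arrival time above into something readable; the fixed UTC-5 offset is an assumption that only holds for New York in winter, and a real app would use a proper time zone database.

```python
from datetime import datetime, timedelta, timezone

arrival_time = 1736637142  # arrival time from the StopTimeUpdate above

utc = datetime.fromtimestamp(arrival_time, tz=timezone.utc)
# Assumption: fixed UTC-5 offset (Eastern Standard Time in January)
eastern = utc.astimezone(timezone(timedelta(hours=-5)))

print(utc.isoformat())      # 2025-01-11T23:12:22+00:00
print(eastern.isoformat())  # 2025-01-11T18:12:22-05:00
```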

This firehose of data is super powerful. It’s enabled projects like real-time maps of the whole transit system with live train locations. However, for a simple arrival times board we’ll use less than 1% of the data.

Filtering

Now that the response is decoded, we need to transform data about every train into the next three arrival predictions for one specific stop. Here’s the basic algorithm:

  1. Get every trip update containing a stop time for our stop ID
  2. Find the ID of the last stop on that trip (to display the destination)
  3. Calculate seconds to the stop (arrival time - now)
  4. Sort the matching stop times by arrival time and take the first 3
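
The four steps above can be sketched roughly as follows. This is not the app's actual code: the dataclasses are a simplified stand-in for the decoded Protobuf messages, and the "END…" stop IDs in the sample are invented placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class StopTimeUpdate:
    stop_id: str
    arrival_time: int  # Unix seconds


@dataclass
class TripUpdate:
    trip_id: str
    route_id: str
    stop_time_updates: list = field(default_factory=list)


@dataclass
class Arrival:
    route_id: str
    destination_stop_id: str
    seconds_to_stop: int


def next_arrivals(trip_updates, stop_id, now, limit=3):
    arrivals = []
    for trip in trip_updates:
        # Step 1: keep only trips that still call at our stop
        match = next((s for s in trip.stop_time_updates if s.stop_id == stop_id), None)
        if match is None:
            continue
        # Step 2: the last remaining stop stands in for the destination
        destination = trip.stop_time_updates[-1].stop_id
        # Step 3: countdown in seconds (arrival time - now)
        arrivals.append(Arrival(trip.route_id, destination, match.arrival_time - now))
    # Step 4: soonest first, then take the first few
    arrivals.sort(key=lambda a: a.seconds_to_stop)
    return arrivals[:limit]


now = 1736637000  # made-up "current time" in Unix seconds
trips = [
    TripUpdate("t1", "A", [StopTimeUpdate("A33N", now + 142), StopTimeUpdate("END1N", now + 2000)]),
    TripUpdate("t2", "C", [StopTimeUpdate("END2N", now + 60)]),  # never calls at A33N
    TripUpdate("t3", "E", [StopTimeUpdate("A33N", now + 45), StopTimeUpdate("END3N", now + 1800)]),
]

for arrival in next_arrivals(trips, "A33N", now):
    print(arrival.route_id, arrival.destination_stop_id, arrival.seconds_to_stop)
```

With a few hundred trip updates per feed, a linear scan like this is plenty fast for a status bar app that refreshes every few seconds.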

We can use the arrival timestamp from the StopTimeUpdate to calculate countdown times that get formatted like “5 min” in the UI. However, the arrival timestamp is missing for platforms at the end of a line – 8th Av on the L or Court Square on the G – so we need to fall back to departure timestamps in those cases. (This actually makes sense: a train starting its journey at a terminal only departs. In contrast, the TfL arrivals endpoint is strictly arrivals only and returns no data at all for terminal stations.)

Sometimes trip updates contain stop times in the past. Anecdotally, this seems to happen more often when there are delays on the line. Simply ignoring any events in the past gives us a set of arrival predictions exactly matching the MTA website.
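
Both rules (fall back to the departure time, drop anything in the past) fit in a couple of small helpers. A minimal sketch, assuming each stop time update has been flattened into a plain dict with optional "arrival" and "departure" Unix timestamps; that dict shape and the function names are illustrative, not the decoded message types.

```python
def event_time(stop_time_update):
    """Prefer the arrival timestamp; fall back to departure (terminal platforms)."""
    arrival = stop_time_update.get("arrival")
    if arrival is not None:
        return arrival
    return stop_time_update.get("departure")


def seconds_until(stop_time_update, now):
    """Seconds until the train is due, or None if there is no usable future time."""
    timestamp = event_time(stop_time_update)
    if timestamp is None or timestamp < now:
        return None  # missing or already in the past: skip it
    return timestamp - now


now = 1736637000  # made-up "current time" in Unix seconds
updates = [
    {"stop_id": "A33N", "arrival": now + 142, "departure": now + 142},
    {"stop_id": "A33N", "arrival": None, "departure": now + 300},     # no arrival, as at a terminal
    {"stop_id": "A33N", "arrival": now - 60, "departure": now - 60},  # stale, in the past
]

countdowns = [s for s in (seconds_until(u, now) for u in updates) if s is not None]
print(countdowns)  # [142, 300]
```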

Quirks

If you look closely at the TripUpdate example above, you might notice that everything is an ID or a timestamp. Protobuf schemas optimise for sending as few bytes as possible over the wire. Given the nature of GTFS-RT (frequent updates to a very broad data set) it makes sense that repetitive details like station names would be excluded. This means we also need to download and cache the GTFS schedules ZIP so we can use the stops.txt CSV to map a stop ID like “A33N” to the northbound platform at Spring Street.
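
Building that lookup is mostly plumbing. A minimal sketch using only the Python standard library: stop_id and stop_name are real columns in stops.txt, but the function name and the in-memory sample ZIP (with a single stand-in row) are assumptions for illustration.

```python
import csv
import io
import zipfile


def load_stop_names(zip_bytes):
    """Map stop_id -> stop_name from the stops.txt inside a GTFS schedule ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open("stops.txt") as raw:
            # utf-8-sig tolerates a byte order mark at the start of the file
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8-sig"))
            return {row["stop_id"]: row["stop_name"] for row in reader}


# Build a tiny stand-in ZIP in memory instead of downloading the real one
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("stops.txt", "stop_id,stop_name\nA33N,Spring St\n")

stop_names = load_stop_names(buffer.getvalue())
print(stop_names["A33N"])  # Spring St
```

Since the schedule data changes infrequently, the resulting dict can be cached on disk and refreshed only when a new ZIP is published.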

Settings menu to select a stop on the G

At this point, all the basic building blocks are in place and we can post triumphantly to social media. But software is never done. I ran into several mapping errors when I was building the settings UI. Not every stop ID in the real-time feed has a corresponding entry with name and location in the CSV. This is the fun facts portion of the post. Thanks to this blog I learned that there are several ghost stations in the real-time Subway feeds that represent places where lines merge, tracks enter/exit tunnels, or shuttle services turn around.
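
One way to cope with those unmapped IDs is to detect them up front and give the UI a fallback label. A small sketch, assuming a stop_id to name dict built from stops.txt; "X99N" is an invented ghost stop ID and the fallback wording is my own.

```python
def unmapped_stop_ids(feed_stop_ids, stop_names):
    """Stop IDs seen in the real-time feed that have no entry in stops.txt."""
    return sorted(set(feed_stop_ids) - set(stop_names))


def display_name(stop_id, stop_names):
    """Human-readable name, falling back to the raw ID for unknown stops."""
    return stop_names.get(stop_id, f"Unknown stop ({stop_id})")


stop_names = {"A33N": "Spring St"}  # stand-in for the parsed stops.txt
feed_stop_ids = ["A33N", "X99N", "A33N"]  # "X99N" is a made-up ghost stop

print(unmapped_stop_ids(feed_stop_ids, stop_names))  # ['X99N']
print(display_name("X99N", stop_names))              # Unknown stop (X99N)
```

Hiding unmapped IDs from the settings menu entirely is another reasonable choice, since nobody can board a train at a track merge.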

100+ transit systems

Almost none of this solution is specific to the NYC Subway. This is the power of open standards. In theory, this little app now supports hundreds of different public transit systems across the world (though mostly in North America) that publish GTFS data. Mobility Database lists over 800 GTFS-RT feeds. Some of these transit systems do require registering for an API key, though. That’s unsupported for now.

Summary

So we’ve seen two polar opposite approaches for making public transit data accessible. Neither is necessarily superior. TfL publishes dozens of API endpoints, allowing data for specific use cases to be queried with precision. The MTA adopts an open standard that delivers much of the same data through just two kinds of endpoint (the schedule ZIP and the real-time feeds) but requires the consumer to do significantly more data processing. The TfL approach is like ordering takeout; the MTA approach is like going to the supermarket.

That being said, I like the idea behind GTFS. Trying to convince thousands of independent transit systems with different technology stacks to implement a broad API spec would be near impossible. Getting them to implement two endpoints with schemas that can be validated automatically… that might work.