Glasgow academics have improved the way a popular AI tool understands geography – helping data scientists make faster, more accurate predictions.

Researchers from the University of Glasgow and Florida State University have found a way to overcome a key limitation of ‘TabPFN’, one of a class of AI tools known as foundation models.

TabPFN is used by AI professionals in fields ranging from transport and healthcare to financial services, insurance, manufacturing and utilities.

However, whilst it is useful for analysing tabulated data, such as spreadsheets or databases, and predicting outcomes from it, it has not until now been able to automatically recognise how data points are connected by their locations.

The researchers’ tweak, called ‘Geospatial Sparse Attention’, gives the existing AI model a sense of place – with no expensive rebuild required.

Dr Mingshu Wang, of the University of Glasgow’s School of Geographical & Earth Sciences, was one of the study leads.

He said: “The first law of geography is that ‘everything is related to everything else, but near things are more related than distant things’. In geospatial data, that means that we can scrutinise how closely data points are related to each other in space in order to find connections and draw conclusions.

“General-purpose tabular models can be very powerful, but they are trained to treat rows as independent observations – they don’t automatically understand the principles of geospatial data. That’s why we set out to expand TabPFN’s ability to make the connections between tabulated geospatial data instead of trying to build and train a new model from scratch.”

The researchers set the modified model, TabPFN-GSA, to work on four real-world datasets spanning environmental and socioeconomic topics: air-pollution readings, county-level results from the 2020 US presidential election, housing prices, and neighbourhood-level poverty across the continental United States. The datasets vary widely in size, from just over a thousand records to roughly 70,000.

They found that TabPFN-GSA generally produced more accurate and robust predictions than the standard model, and reduced the memory failures that prevented the original from running on the largest datasets. Notably, it was able to complete predictions on the 70,000-row poverty dataset, which the unmodified model could not handle.

It is expected that the open-source tool will be useful to data science researchers in a wide range of settings, from academia to local councils, national agencies and data-analytics companies.