
Conflating Overture Places Using DuckDB, Ollama, Embeddings, and More


An NYC restaurant grade certificate being hung.

An Intro to Matching Place Data on Your Laptop

One of the trickiest problems in geospatial work is conflation: combining and integrating multiple data sources that describe the same real-world features.

It sounds simple enough, but datasets describe features inconsistently and are often riddled with errors. Conflation processes are complicated affairs with many stages, conditionals, and comparison methods. Even then, humans might be needed to review and solve the most stubborn joins.

Today we’re going to demonstrate a few different conflation methods, to illustrate the problem. The following is written for people with data processing or analysis experience, but little geospatial exposure. We’re currently experiencing a bit of a golden age in tooling and data for the geo-curious. It’s now possible to quickly assemble geo data and analyze it on your laptop, using freely available and easy-to-use tools. No complicated staging, no specialized databases, and (at least today) no map projection wrangling.

Further, the current boom in LLMs has delivered open embedding models – which let us evaluate the contextual similarity of text, images, and more. (Last year I used embeddings to search through thousands of bathroom faucets for our remodel.) Embeddings are a relatively new tool in our conflation toolkit, and (as we’ll see below) deliver promising results with relatively little effort.

We’re going to attempt to join restaurant inspection data from Alameda County with places data from the Overture Maps Foundation, enabling us to visualize and sort restaurants by their current inspection score.

And – inspired by the Small Data conference I attended last week – we’ll do this all on our local machine, using DuckDB, Ollama, and a bit of Python.

Staging the Data

First up, getting the data and staging it where we can work on it. Alameda County has its own data website where it hosts restaurant inspection records. We can see when it was updated (September 23rd, last week) and download the CSV. Go ahead and do that.

We need to get this CSV into a form where we can interrogate it and compare it to other datasets. For that, we’re going to use DuckDB. Let’s set up our database and load our inspections into a table:

import duckdb
# Create the database we'll save our work to and load the extensions we'll need
con = duckdb.connect("conflation_demonstration.ddb")
con.sql("install spatial")
con.sql("install httpfs")
con.sql("load spatial")
con.sql("load httpfs")
# Load the CSV downloaded from the Alameda County site
con.sql("CREATE TABLE inspections AS SELECT * FROM read_csv('inspections_092324.csv', ignore_errors=True)")

Because we’ll need it later, I’m going to present all these examples in Python. But you could do most of this without ever leaving DuckDB.

Now we need to get Overture’s Places data. We don’t want the entire global dataset, so we’ll compute a bounding box in DuckDB that contains our inspection records to filter our request. We’ll take advantage of the confidence score to filter out some lower-quality data points. Finally, we’ll transform the data to closely match our inspection records:

# Download the Overture Places data. There's a lot going on here, but what we're doing is...
# 1. Create a bounding box around all the Alameda County records
# 2. Get all the places from Overture in that bounding box, with a confidence score > 0.5
# 3. Finally transform these results into a format that matches the Alameda County Data
con.sql("""
CREATE TABLE IF NOT EXISTS places AS
WITH bounding_box AS (
SELECT max(Latitude) as max_lat, min(Latitude) as min_lat, max(Longitude) as max_lon, min(Longitude) as min_lon
FROM inspections
)
SELECT
id,
upper(names['primary']) as Facility_Name,
upper(addresses[1]['freeform']) as Address,
upper(addresses[1]['locality']) as City,
upper(addresses[1]['region']) as State,
left(addresses[1]['postcode'], 5) as Zip,
geometry,
ST_X(geometry) as Longitude,
ST_Y(geometry) as Latitude,
categories
FROM (
SELECT *
FROM read_parquet('s3://overturemaps-us-west-2/release/2024-09-18.0/theme=places/type=place/*', filename=true, hive_partitioning=1),
bounding_box
WHERE addresses[1] IS NOT NULL AND
bbox.xmin BETWEEN bounding_box.min_lon AND bounding_box.max_lon AND
bbox.ymin BETWEEN bounding_box.min_lat AND bounding_box.max_lat AND
confidence > 0.5
);
""")

We now have 56,007 places from Overture and inspection reports for 2,954 foodservice venues.
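If you want to sanity-check those counts yourself, a quick query does it (a sketch; the DISTINCT is there because the inspections table holds multiple reports per facility):

# Sanity-check the table sizes (a sketch)
print(con.sql("SELECT count(*) FROM places").fetchone()[0])
print(con.sql("SELECT count(DISTINCT Facility_ID) FROM inspections").fetchone()[0])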

All the restaurants and places in our database.

Looks good, but the blue points (the Overture data) go much further east than our inspection venues. Our bounding box is rectangular, but our county isn’t. If we were concerned with performance we could grab the county polygon from Overture’s Divisions theme, but for our purposes this is fine.
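If you did want that tighter cut, it would look something like this (a sketch only – the Divisions filter values, like subtype = 'county' and the 'ALAMEDA COUNTY' name string, are assumptions to verify against the Overture docs):

# Grab the county polygon from Overture's Divisions theme (a sketch; the column
# values below are assumptions -- check the Divisions schema docs). Note this
# scans the full divisions dataset, so it's slow without a bbox filter.
con.sql("""
CREATE TABLE IF NOT EXISTS county AS
SELECT geometry
FROM read_parquet('s3://overturemaps-us-west-2/release/2024-09-18.0/theme=divisions/type=division_area/*', filename=true, hive_partitioning=1)
WHERE subtype = 'county'
    AND region = 'US-CA'
    AND upper(names['primary']) = 'ALAMEDA COUNTY'
""")
# Drop the places that fall outside the county polygon
con.sql("""
DELETE FROM places
WHERE NOT ST_Contains((SELECT geometry FROM county LIMIT 1), geometry)
""")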

Generating H3 Tiles

In each conflation method, the first thing we’ll do is collect match candidates that are near our input record. We don’t need to compare an Alameda restaurant to a listing miles away in Hayward. We could use DuckDB’s spatial extension to calculate distances and filter venues, but it’s much easier to use a tile-based index system to pre-group places by region. Tile indexes aren’t perfect – for example, a restaurant could sit on the edge of a tile – but for most matches they’ll work great.

We’re going to use H3 to tile our venues, because it’s fast and it has a DuckDB community extension so we don’t even have to bounce back up to Python. (Also, I ran into Isaac Brodsky, H3’s creator, at the Small Data conference. So it’s on theme!)

con.sql("INSTALL h3 FROM community")
con.sql("LOAD h3")
# Add H3 indexes to each table
con.sql("ALTER TABLE places ADD COLUMN IF NOT EXISTS h3 uint64")
con.sql("ALTER TABLE inspections ADD COLUMN IF NOT EXISTS h3 uint64")
con.sql("UPDATE places SET h3 = h3_latlng_to_cell(Latitude, Longitude, 7)")
con.sql("UPDATE inspections SET h3 = h3_latlng_to_cell(Latitude, Longitude, 7)")

We’re now ready to try some matching. To sum up:

  1. We have two tables: inspections and places.
  2. Each has a Facility_Name, Address, City, Zip, Longitude, Latitude, and h3 column. These columns have been converted to uppercase to facilitate comparisons.
  3. The Overture places table also has a categories column with primary and alternate categories. It might come in handy.
  4. The inspections table has a Grade column for each inspection, containing one of three values: G for green, Y for yellow, and R for red. Meaning ‘all good’, ‘needs fixes’, and ‘shut it down’ – respectively.

Let’s step through several different matching methods to illustrate them and better understand our data.

Method 1: Exact Name Matching

We’ll start simple. We’ll walk through all the facilities in the inspection data and find places within the same H3 tile with exactly matching names:

exact_name_match_df = con.sql("""
SELECT
i.Facility_ID as fid, p.id as gers, i.Facility_Name as i_name, p.Facility_Name as p_name, i.Address, p.Address
FROM (
SELECT DISTINCT Facility_Name, Facility_ID, Address, h3
FROM inspections
) i
JOIN places p
ON i.h3 = p.h3
AND i.Facility_Name = p.Facility_Name
""").df()

Inspecting the results, we matched 930 facilities to places. That’s about 31% – not bad!

We included the address columns so that when we scan these matches (and you always should, when building conflation routines), we can see if the addresses agree. Out of our 930 matched records, 248 have disagreeing addresses – or 26%. Taking a look, we see two genres of disagreement:

  1. The places table doesn’t capture the unit number in the address field. For example, BOB'S DISCOUNT LIQUOR is listed as having an address of 7000 JARVIS AVE, but has an address of 7000 JARVIS AVE #A in the inspection data.

This isn’t a problem for our use case. We’re still confident these records refer to the same business, despite the address mismatch.

  2. Chain restaurants – or other companies with multiple, nearby outlets – will incorrectly match with another location close enough to occur in the same H3 tile. For example, a “SUBWAY” at 20848 MISSION BLVD is matched with a different SUBWAY in the Overture data at 791 A ST.

This is a problem. We used a resolution of 7 when generating our H3 tiles, each of which has an average area of ~5 km². Sure, we could try using a smaller tile, but I can tell you from experience that’s not a perfect solution. There are a shocking number of Subways (over 20,000 in the US alone!), and plenty of other culprits will spoil your fun1.

We’ll need to solve this problem another way, without throwing out all the correctly matched venues from point 1 above.
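Before we move on, here’s how I’d pull that address-disagreement count – the same join as above, plus an inequality on the addresses (a sketch):

# Count exact-name matches whose addresses disagree (a sketch of the check
# behind the 248 figure above)
con.sql("""
SELECT count(*) AS address_mismatches
FROM (
    SELECT DISTINCT Facility_Name, Facility_ID, Address, h3
    FROM inspections
) i
JOIN places p
    ON i.h3 = p.h3
    AND i.Facility_Name = p.Facility_Name
WHERE i.Address <> p.Address
""").show()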

Method 2: String Similarity

To find very similar addresses, we’ll use string distance functions, which quantify the similarity of two strings. There are plenty of different distance functions, but we’re going to use Jaro-Winkler distance because it weights matches at the beginning of strings more heavily – which suits our missing-address-unit situation well. And hey, it’s built into DuckDB!
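To see why that prefix weighting helps, here’s a quick illustration using the examples from this post (a sketch):

# The shared prefix keeps the score high for the missing-unit case, while a
# genuinely different address scores low
con.sql("""
SELECT
    jaro_winkler_similarity('7000 JARVIS AVE', '7000 JARVIS AVE #A') AS missing_unit,
    jaro_winkler_similarity('20848 MISSION BLVD', '791 A ST') AS different_address
""").show()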

String distance functions produce a score, from 0 to 1, so we’ll have to figure out a good cut-off value. Let’s plot the distribution of the scores to get an idea:

import matplotlib.pyplot as plt
df = con.sql("""
SELECT
i.Facility_ID as fid, p.id as gers, i.Facility_Name as i_name, p.Facility_Name as p_name, i.Address, p.Address, jaro_winkler_similarity(i.Address, p.Address) as similarity
FROM (
SELECT DISTINCT Facility_Name, Facility_ID, Address, h3
FROM inspections
) i
JOIN places p
ON i.h3 = p.h3
AND i.Facility_Name = p.Facility_Name
""").df()
# Visualize the distribution of Jaro-Winkler Similarity (jws) scores
plt.figure(figsize=(10, 4))
plt.hist(df['similarity'], bins=40, color='blue', edgecolor='black')
plt.title('Distribution of Jaro-Winkler Similarity Scores')
plt.xlabel('Jaro-Winkler Similarity of Addresses')
plt.ylabel('Frequency')
plt.show()

Which produces:

The distribution of similarity scores for addresses among venues whose names match exactly

This looks super promising. Cracking open the data, we can see the low scores are catching the nearby chain pairs: Subway, McDonald’s, Peet’s Coffee, etc. There are a couple of false negatives, but we can solve those with another mechanism. Everything above a score of 0.75 is perfect.
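Here’s roughly how I’d scan that low end before settling on a threshold (a sketch reusing the scored df from the plot above):

# Eyeball the lowest-scoring address pairs (a sketch)
df[df['similarity'] < 0.75].sort_values('similarity')[['i_name', 'p_name', 'similarity']].head(20)

So our query looks like: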

exact_name_match_df = con.sql("""
SELECT
    i.Facility_ID as fid, p.id as gers, i.Facility_Name as i_name, p.Facility_Name as p_name, i.Address, p.Address
FROM (
    SELECT DISTINCT Facility_Name, Facility_ID, Address, h3
    FROM inspections
) i
JOIN places p
    ON i.h3 = p.h3
    AND i.Facility_Name = p.Facility_Name
    AND jaro_winkler_similarity(i.Address, p.Address) > 0.75
""").df()

Which produces 903 matches we trust.

But what if we use Jaro-Winkler (JW) string distance to find very similar venue names, not just exact matches? Looking at the scores for similar names (where the venues share the same H3 tile and have addresses with a similarity score above 0.75), we see the algorithm spotting the same venues with slightly different names. For example, the inspections dataset often adds a store number to chain restaurants, like CHIPTLE MEXICAN GRILL #2389. JW matches this strongly with CHIPOTLE MEXICAN GRILL. Everything above a score of 0.89 is solid:

jws_matches_df = con.sql("""
WITH ranked_matches AS (
    SELECT
        i.Facility_ID as fid, p.id as gers,
        i.Facility_Name as i_name, p.Facility_Name as p_name,
        i.Address, p.Address,
        jaro_winkler_similarity(i.Facility_Name, p.Facility_Name) as name_similarity,
        jaro_winkler_similarity(i.Address, p.Address) as address_similarity,
        ROW_NUMBER() OVER (
            PARTITION BY i.Facility_ID
            ORDER BY jaro_winkler_similarity(i.Facility_Name, p.Facility_Name) DESC
        ) as rank
    FROM (
        SELECT DISTINCT Facility_Name, Facility_ID, Address, h3
        FROM inspections
    ) i
    JOIN places p
        ON i.h3 = p.h3
        AND jaro_winkler_similarity(i.Facility_Name, p.Facility_Name) > 0.89
        AND jaro_winkler_similarity(i.Address, p.Address) > 0.75
)
SELECT
    fid, gers, i_name, p_name, Address, address_similarity, name_similarity
FROM ranked_matches
WHERE rank = 1
""").df()

This produces 1,623 confident matches, or 55%.

But looking at our potential matches, there are very few false positives if we lower our JW name-score threshold to 0.8, which would get us a 70% match rate. The problem is that the remaining errors are very tricky ones. For example, look at the match candidates below:

Inspection Name   Overture Name        Name JW Score   Inspection Address   Overture Address   Address JW Score
BERNAL ARCO       BERNAL DENTAL CARE   0.87            3121 BERNAL AVE      3283 BERNAL AVE    0.92
BERNAL ARCO       ARCO                 0.39            3121 BERNAL AVE      3121 BERNAL AVE    1

Jaro-Winkler scoring is poorly suited to these two names, missing the correct match because of the BERNAL prefix. Meanwhile, the addresses match close enough – despite being obviously different – allowing the first record to outrank the correct pair in our query above.

How might we solve this? There are a few approaches:

  1. Gradually Zoom Out: Use escalating sizes of H3 tiles to try finding matches very close to the input venue, before broadening our area of evaluation. (Though this would fail for the above example – they’re ~150 meters apart.)
  2. Pre-Filter Categories: Use the venue categories present in the Overture data to filter out places not relevant to our data. We could add “convenience_store” and others to an allow list, filtering out BERNAL DENTAL CARE. (But this tactic would remove plenty of unexpected or edge case places that serve food, like an arena, school, or liquor store.)
  3. Add conditionals: For example, if an address matches exactly it outweighs the highest name match.

For fun2, we’ll write up the last method. It works pretty well, allowing us to lower our JW score threshold to 0.83, and delivers 2,035 matches – a ~68% match rate. But those extra 13% of matches come at the cost of some very nasty SQL!
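For a flavor of what that looks like, here’s a simplified sketch of the conditional approach (my own simplification, not the exact query behind the 2,035 figure): an exact address match outranks the best name-similarity match, per approach 3 above.

# A simplified sketch of the conditional approach: exact address matches win,
# then the best name similarity breaks ties
conditional_matches_df = con.sql("""
WITH ranked_matches AS (
    SELECT
        i.Facility_ID as fid, p.id as gers,
        i.Facility_Name as i_name, p.Facility_Name as p_name,
        jaro_winkler_similarity(i.Facility_Name, p.Facility_Name) as name_similarity,
        ROW_NUMBER() OVER (
            PARTITION BY i.Facility_ID
            ORDER BY
                (i.Address = p.Address) DESC,
                jaro_winkler_similarity(i.Facility_Name, p.Facility_Name) DESC
        ) as rank
    FROM (
        SELECT DISTINCT Facility_Name, Facility_ID, Address, h3
        FROM inspections
    ) i
    JOIN places p
        ON i.h3 = p.h3
        AND (
            i.Address = p.Address
            OR (jaro_winkler_similarity(i.Facility_Name, p.Facility_Name) > 0.83
                AND jaro_winkler_similarity(i.Address, p.Address) > 0.75)
        )
)
SELECT fid, gers, i_name, p_name, name_similarity
FROM ranked_matches
WHERE rank = 1
""").df()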

This is why most conflation is done with multistage pipelines. Breaking our logic into cascading match conditions of increasing complexity achieves the same result and is much more legible and maintainable. Further, it allows us to reserve our most expensive matching methods for the most stubborn records. Which brings us to our next method…

Method 3: Embeddings

Also at the Small Data conference was Ollama, a framework for easily running LLMs on your local machine. We’re going to use it today to generate embeddings for our venues. To follow along, you’ll need to:

  1. Download and install Ollama
  2. Run ollama pull mxbai-embed-large in your terminal to download the model we’ll be using
  3. Install the Ollama python library with: pip install ollama.
  4. Finally, run ollama serve in your terminal to start the server.

The Python code we’ll write will hit the running Ollama instance and get our results.

Vicki Boykis wrote my favorite embedding explanation, but it’s a bit long for the moment. What we need to know here is that embeddings measure how contextually similar things are to each other, based on all the input data used to create the model generating the embedding. We’re using the mxbai-embed-large model here to generate embedding vectors for our places, which are large arrays of numbers that we can compare.

To embed our places we need to format each of them as single strings. We’ll concatenate their names and address information, then feed this “description” string into Ollama.

import ollama
def get_embedding(text):
    return ollama.embeddings(
        model='mxbai-embed-large',
        prompt=text
    )['embedding']
# Build a single description string per venue, keeping the H3 tile so we can
# group match candidates later
inspections_df = con.sql("""
SELECT Facility_ID as fid, h3, concat(Facility_Name, ',', Address, ',', City, ',', Zip) as description
FROM inspections
GROUP BY description, fid, h3
""").df()
places_df = con.sql("""
SELECT id as gers, h3, concat(Facility_Name, ',', Address, ',', City, ',', Zip) as description
FROM places
GROUP BY description, gers, h3
""").df()
# Compute the embeddings (one Ollama call per row, so this is the slow part)
inspections_df['embedding'] = inspections_df['description'].apply(get_embedding)
places_df['embedding'] = places_df['description'].apply(get_embedding)

We could store the generated embeddings in our DuckDB database (DuckDB has a vector similarity search extension, btw), but for this demo we’re going to stay in Python, using in-memory dataframes.

The code here is simple enough. We create our description strings as a column in DuckDB then generate embedding values using our get_embedding function, which calls out to Ollama.

But this code is slow. On my 64GB Mac Studio, calculating embeddings for ~3,000 inspection strings takes over a minute. The performance remains consistent when we throw ~56,000 place strings at Ollama – taking just shy of 20 minutes. Our most complicated DuckDB query above took only 0.4 seconds.

(An optimized conflation pipeline would only compute embeddings for the Overture place strings it needs during comparison, skipping any that already exist – saving us some time. But 20 minutes isn’t unpalatable for this demo. You can always optimize later…)

Comparing embeddings is much faster, taking only a few minutes (and this could be faster in DuckDB, but we’ll skip that since a feature we’d need here is not production-ready). Without using DuckDB VSS, we’ll need to load a few libraries. But that’s easy enough:

from sentence_transformers.util import cos_sim
import pandas as pd
results_df = pd.DataFrame(columns=['i_description', 'p_description', 'fid', 'gers', 'h3', 'similarity_score'])
for index, row in inspections_df.iterrows():
    # Only compare against places in the same H3 tile
    candidate_places = places_df[places_df['h3'] == row['h3']]
    if candidate_places.empty:
        continue
    # Score the inspection venue's embedding against every candidate
    sims = cos_sim(row['embedding'], candidate_places['embedding'].tolist())
    # Find the highest-ranking score and the associated row
    max_sim_index = sims.argmax().item()
    max_sim_score = sims[0][max_sim_index].item()
    highest_ranking_row = candidate_places.iloc[max_sim_index]
    # Add the best match to the results DataFrame
    new_row = pd.DataFrame({
        'i_description': row['description'],
        'p_description': highest_ranking_row['description'],
        'fid': row['fid'],
        'gers': highest_ranking_row['gers'],
        'h3': row['h3'],
        'similarity_score': max_sim_score
    }, index=[index])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
results_df

Scrolling through the results, it’s quite impressive. We can set our similarity-score threshold to anything greater than 0.87 and get matches with no errors for 71% of inspection venues. Compare that to the 68% we obtained with our gnarly SQL query. The big appeal of embeddings is the simplicity of the pipeline. It’s dramatically slower, but we achieve similar match performance with a single rule.
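In code, that cut-off is a one-liner (a sketch, using the dataframes from above):

# Keep only the high-confidence embedding matches (threshold eyeballed from the results)
embedding_matches_df = results_df[results_df['similarity_score'] > 0.87]
print(len(embedding_matches_df) / len(inspections_df))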

And it’s pulling out some impressive matches:

  • RESTAURANT LOS ARCOS,3359 FOOTHILL BLVD,OAKLAND,94601 matched with LOS ARCOS TAQUERIA,3359 FOOTHILL BLVD,OAKLAND,94601
  • SOI 4 RESTAURANT,5421 COLLEGE AVE,OAKLAND,94618 matched with SOI 4 BANGKOK EATERY,5421 COLLEGE AVE,OAKLAND,94618
  • HARMANS KFC #189,17630 HESPERIAN BLVD,SAN LORENZO,94580 matched with KFC,17630 HESPERIAN BLVD,SAN LORENZO,94580

It correctly matched our BERNAL ARCO from above with a score of 0.96.

The only issues we caught among the high-scoring results were with sub-venues – venues that exist within a larger property, like a hot dog stand in a baseball stadium or a restaurant in a zoo. But for our use case, we’ll let these skate by.

Bringing It Together

String distance matches were fast, but embedding matches were easier. But like I said before: conflation jobs are nearly always multistep pipelines. We don’t have to choose.

Bringing it all together, our pipeline would be:

  1. DuckDB string similarity scores and some conditional rules: Matching 2,254 out of 2,954. Leaving 800 to match with…
  2. Embedding generation and distance scores: Matching 81 out of the remaining 800.
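In code, the hand-off between the two stages is just an anti-join – only the venues the string-similarity pass missed get embedded (a sketch; jws_matches_df stands in for whatever your stage-one output is, and inspections_df/get_embedding come from Method 3):

# Stage 2 only sees the facilities stage 1 couldn't match (a sketch)
matched_fids = set(jws_matches_df['fid'])
leftovers_df = inspections_df[~inspections_df['fid'].isin(matched_fids)].copy()
leftovers_df['embedding'] = leftovers_df['description'].apply(get_embedding)
# ...then run the Method 3 comparison loop over leftovers_df instead of inspections_df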

With these two steps, we confidently matched 80% of our input venues – with a pretty generic pipeline! Further, by running our string similarity as our first pass, we’ve greatly cut down on our time spent generating embeddings.

Which in many cases is good enough! If you’re more comfortable with false positives, tweak some of the thresholds above and get above 90%, easily. Or spend a day coming up with more conditional rules for specific edge cases. But many of these venues simply don’t have matches. This small Chinese restaurant in Oakland, for example, doesn’t exist in the Overture Places data. No conflation method will fix that.


Our matched Overture places, colored by their most recent inspection rating. Mostly green, thankfully!

I was pleasantly surprised by the ease of embedding matching. There’s a lot to like:

  • It’s easy to stand up: All you have to know about your data is which columns you’ll combine to generate the input strings. No testing and checking to figure out SQL condition statements. For small, unfamiliar datasets, I’ll definitely be starting with embeddings. Especially if I just need a quick, rough join.
  • It matched 10% of the records missed by our complex SQL: Even when we spent the time to get to know the data and create a few conditionals, embedding distances caught a significant share of our misses. I can see it always being a final step for the stubborn remainders.
  • The hardest part is first-time set-up: Most of our time spent generating and comparing embeddings (besides compute time) was getting Ollama set up and finding a model. But Ollama makes that pretty easy and it’s a one-time task.
  • There’s certainly room for improvement with embeddings: I didn’t put much thought into picking an embedding model; I just grabbed the largest model in Ollama’s directory with ‘embedding’ in the name. Perhaps a smaller model, like nomic-embed-text, performs just as well as mxbai-embed-large in a fraction of the time. And these are both general models! Fine-tuning a model to match place names and their addresses could reduce our error rate significantly. And hey: get DuckDB VSS up and running and comparisons could get even faster…

Best of all: I ran this all on a local machine. The staging, exploration, and conflation were done with DuckDB, Ollama, H3, Rowboat3, Kepler (for visualization), and some Python. Aside from H3 and generating our bounding box for downloading our Overture subset, we didn’t touch any geospatial functions, reducing our complexity significantly.


North Korea’s trash balloons explained


Arc was supposed to be a key to The Washington Post’s future. It became a problem instead.


Back to Basics – PAUL BRADLEY CARR


A couple of weeks ago, I quit my semi-regular column at the Mike Moritz-funded SF Standard.

I still love the writers and editors there and still think they’re doing a better job covering the whole city than the Chronicle. Also, they paid me astonishingly generously by online publication standards and in return I wrote (so they tell me) some of the most read opinion pieces they ever published. These are all very good things.

And yet. After the recent “pushback” (read: pathetic tantrum) from Ben and Felicia Horowitz over the Standard’s excellent reporting on their recent MAGA-conversion, I got the distinct feeling that the publication is feeling gun-shy when it comes to criticizing tech billionaires. Particularly tech billionaires who have links – for good or ill – to Mike Moritz. 

A column I wrote about billionaire midlife crises, that mentioned Horowitz? Spiked. Another I pitched about tech dudes (including Moritz and Marc Andreessen) trying to set up their own city? Rejected, on the basis that the publication has to pick its battles. 

It isn’t just me. Since the Horowitz kerfuffle, the Standard hasn’t published a single op-ed critical of big tech or tech founders. Right now its main front page story is a “bare all” (ew) interview with Sam Altman in which AI’s leading sociopath boldly (and wrongly) claims that Blink 182 is “not a good band.”

To be clear: Nobody at the Standard specifically told me to “leave Mike and his frenemies alone” and there’s nothing inherently wrong with puffy lifestyle features involving high-profile techies.

But for goodness sake. Not a single critical word about tech billionaires since the Horowitz-Moritz shit fit?

I’ve worked at (and founded!) my fair share of billionaire-funded publications and I’ve always had a firm rule: You have to be more critical of the people writing the checks (and their cronies) than you are of anyone else. It’s the only way to offset the inherent bias of taking their money.

What you definitely cannot do is have people who are off-limits or generally-to-be-avoided in fear of killing the golden goose. Most publications I’ve worked for/founded understood that perfectly well. See my pieces about Peter Thiel at Pando, or Tony Hsieh at NSFWCORP, or my resignation from TechCrunch. 

In the good old days, the billionaires understood the church and state separation too, however much they hated it. Remarkably, Thiel never once complained to anyone at Pando about our coverage (he knew what we’d say if he did). Tony Hsieh and I remained good friends until the end. Sure, sometimes we got threatened with a lawsuit (IIRC our biggest was a threat of an $850m suit from a private equity weirdo with ties to Chris Christie) but when that happened, we fought back.

But those were the good old days, and these are the bad new days. Two of the three publications I just mentioned don’t exist any more, and the one that does is unrecognizable. Tech billionaires own everything now – from the SF Standard to the Washington Post – and every editor and journalist is just waiting for the next round of layoffs. Like a Crichton T-rex, Mike Moritz’s vision is based on movement, so it’s always safer to keep still than to flail around with a flashlight.

Still, it’s healthy to be reminded every so often why I quit journalism to write and sell books. And why so many others are fleeing to Substack, and its less Nazi-enabling alternatives. (Welcome back to my WordPress blog!)

The marketplace of ideas is always, always stronger when the ultimate paying customers are readers rather than billionaires or advertisers.


‘The data on extreme human ageing is rotten from the inside out’ – Ig Nobel winner Saul Justin Newman


From the swimming habits of dead trout to the revelation that some mammals can breathe through their backsides, a group of leading leftfield scientists have been taking their bows at the Massachusetts Institute of Technology for the 34th annual Ig Nobel Prize ceremony. Not to be confused with the actual Nobel prizes, the Ig Nobels recognise scientific discoveries that “make people laugh, then think”.

We caught up with one of this year’s winners, Saul Justin Newman, a senior research fellow at the University College London Centre for Longitudinal Studies. His research finds that most of the claims about people living over 105 are wrong.

How did you find out about your award?

I picked up the phone after slogging through traffic and rain to a bloke from Cambridge in the UK. He told me about this prize and the first thing I thought of was the lady who collected snot off of whales and the levitating frog. I said, “absolutely I want to be in this club”.

What was the ceremony like?

The ceremony was wonderful. It’s a bit of fun in a big fancy hall. It’s like you take the most serious ceremony possible and make fun of every aspect of it.

But your work is actually incredibly serious?

I started getting interested in this topic when I debunked a couple of papers in Nature and Science about extreme ageing in the 2010s. In general, the claims about how long people are living mostly don’t stack up. I’ve tracked down 80% of the people aged over 110 in the world (the other 20% are from countries you can’t meaningfully analyse). Of those, almost none have a birth certificate. In the US there are over 500 of these people; seven have a birth certificate. Even worse, only about 10% have a death certificate.

The epitome of this is blue zones, which are regions where people supposedly reach age 100 at a remarkable rate. For almost 20 years, they have been marketed to the public. They’re the subject of tons of scientific work, a popular Netflix documentary, tons of cookbooks about things like the Mediterranean diet, and so on.

Okinawa in Japan is one of these zones. There was a Japanese government review in 2010, which found that 82% of the people aged over 100 in Japan turned out to be dead. The secret to living to 110 was, don’t register your death.

The Japanese government has run one of the largest nutritional surveys in the world, dating back to 1975. From then until now, Okinawa has had the worst health in Japan. They’ve eaten the least vegetables; they’ve been extremely heavy drinkers.

What about other places?

The same goes for all the other blue zones. Eurostat keeps track of life expectancy in Sardinia, the Italian blue zone, and Ikaria in Greece. When the agency first started keeping records in 1990, Sardinia had the 51st highest old-age life expectancy in Europe out of 128 regions, and Ikaria was 109th. It’s amazing the cognitive dissonance going on. With the Greeks, by my estimates at least 72% of centenarians were dead, missing or essentially pension-fraud cases.

What do you think explains most of the faulty data?

It varies. In Okinawa, the best predictor of where the centenarians are is where the halls of records were bombed by the Americans during the war. That’s for two reasons. If the person dies, they stay on the books of some other national registry, which hasn’t confirmed their death. Or if they live, they go to an occupying government that doesn’t speak their language, works on a different calendar and screws up their age.

According to the Greek minister that hands out the pensions, over 9,000 people over the age of 100 are dead and collecting a pension at the same time. In Italy, some 30,000 “living” pension recipients were found to be dead in 1997.

Regions where people most often reach 100-110 years old are the ones where there’s the most pressure to commit pension fraud, and they also have the worst records. For example, the best place to reach 105 in England is Tower Hamlets. It has more 105-year-olds than all of the rich places in England put together. It’s closely followed by downtown Manchester, Liverpool and Hull. Yet these places have the lowest frequency of 90-year-olds and are rated by the UK as the worst places to be an old person.

The oldest man in the world, John Tinniswood, supposedly aged 112, is from a very rough part of Liverpool. The easiest explanation is that someone has written down his age wrong at some point.

But most people don’t lose count of their age…

You would be amazed. Looking at the UK Biobank data, even people in mid-life routinely don’t remember how old they are, or how old they were when they had their children. There are similar stats from the US.

What does this all mean for human longevity?

The question is so obscured by fraud and error and wishful thinking that we just do not know. The clear way out of this is to involve physicists to develop a measure of human age that doesn’t depend on documents. We can then use that to build metrics that help us measure human ages.

Longevity data are used for projections of future lifespans, and those are used to set everyone’s pension rate. You’re talking about trillions of dollars of pension money. If the data is junk then so are those projections. It also means we’re allocating the wrong amounts of money to plan hospitals to take care of old people in the future. Your insurance premiums are based on this stuff.

What’s your best guess about true human longevity?

Longevity is very likely tied to wealth. Rich people do lots of exercise, have low stress and eat well. I just put out a preprint analysing the last 72 years of UN data on mortality. The places consistently reaching 100 at the highest rates according to the UN are Thailand, Malawi, Western Sahara (which doesn’t have a government) and Puerto Rico, where birth certificates were cancelled completely as a legal document in 2010 because they were so full of pension fraud. This data is just rotten from the inside out.

Do you think the Ig Nobel will get your science taken more seriously?

I hope so. But even if not, at least the general public will laugh and think about it, even if the scientific community is still a bit prickly and defensive. If they don’t acknowledge their errors in my lifetime, I guess I’ll just get someone to pretend I’m still alive until that changes.


When is a Minicar as Dangerous as a 3-Ton Truck?


Theft Prevention is a Safety Feature

Last week I got nerd-sniped by a Hacker News link to an IIHS post on death rates by vehicle model. The IIHS report, from 2023, featured a table ranking cars by their rate of ‘other-driver deaths’, or how often a specific model kills the driver of a vehicle they crash into.

Mixed in with the usual giant trucks were a smattering of tiny Kias:

IIHS reported other-driver death rates

The size difference between these cars is massive – a Kia Rio is less than half the unladen weight of a Ram 2500 Crew Cab, nearly 2 tons lighter! How are these compact cars killing others at a similar rate to our largest trucks?

To dig in, I grabbed the IIHS data and the weights of each model and plotted them:

Data Source: IIHS, Auto Evolution

Deaths per million registered vehicle years, 2020 & equivalent earlier models, 2018-21. Bubble size represents curb weight.

With the data visualized, three key narratives emerged:

  1. Big trucks kill others. The blue points above show large trucks are much more likely to kill the passengers in the cars they hit while keeping their occupants relatively safe. Size is the dominant factor here, with smaller trucks producing figures in line with overall averages.
  2. Dodge Chargers are a unique danger to their drivers and others: The headline of the original IIHS piece is a bit misleading. “American muscle cars with high horsepower and a hot rod image rank among the deadliest vehicles on the road,” should perhaps have been written to focus on the Dodge Charger specifically (and, arguably, the Dodge Challenger). Mustangs and Camaros both rank in the top 20 for driver deaths, but neither ranks for other-driver deaths. Chargers are unique outliers as a threat to themselves and others.
  3. Kias break the rules. They’re not big, they’re not fast, but the tiny Optima, Rio, and Forte cluster alongside the Chargers. The Rio and Forte rank in the top 20 for driver and other-driver death rates – with the Optima ranking 21st and 5th for driver and other-driver deaths, respectively.

The issue here isn’t the structural safety of the car. The 2020 Kia Optima has a 5-star crash rating for every metric measured by the NHTSA, matching the similarly classed Honda Accord and Toyota Camry.

No, the issue here is how easy it is to steal Kias, thanks to the company’s omission of immobilizers in models manufactured between 2011 and 2021.

Immobilizers are security devices that prevent a car from starting without a transponder or smart key. A 2016 study found that immobilisers “lowered the overall rate of car theft by about 40% between 1995 and 2008.”

But then Kia left them out:

Aaron Gordon, writing for Vice in 2023, spelling out the impact in Chicago:

The scale of the Kia and Hyundai theft problem is astounding. In Chicago, during the “old normal” days prior to the summer of 2022, six to eight percent of all stolen cars were Kias or Hyundais, according to data obtained by Motherboard. This was in line with how many Kias and Hyundais were on Chicago’s roads, according to the lawsuit Chicago filed against Kia and Hyundai. Then, in June 2022, the percentage of stolen cars that were Kias and Hyundais edged up to 11 percent. In July, it more than doubled to 25 percent. By November, it had almost doubled again, to 48 percent. Through August 2023, the most recent month for which Motherboard has data, 35 percent of the 19,448 stolen cars in Chicago have been Kias or Hyundais.

The red line in the chart above is Milwaukee, WI, where bored kids first discovered the cars could be stolen with a screwdriver and USB cable. Eventually, they started posting their joyrides to TikTok, YouTube, and Snap. Right around May 2022 – when Tommy G posted his viral documentary about “Kia Boys” on YouTube – the phenomenon bubbled up above the local Milwaukee subculture and spread to other markets.

A software update shipped in 2023 cut theft rates by more than half – among those cars that had the update applied. My city’s police department even mailed letters to local Kia owners, encouraging them to take action:

The Alameda Police Department's letter to Kia and Hyundai owners

By July 2024, Kia reported ~60% of eligible vehicles had been addressed.

So when is a minicar as dangerous as a 3-ton truck? When it’s easy to steal. Kia’s omission of immobilizers was a poor design choice with unexpected, catastrophic downstream effects. We can see the story clearly in the data: from crash test scores to theft rates and, ultimately, fatality figures.

Hyundai and Kia have already settled a national class action lawsuit for $200 million, but new suits filed by cities continue to roll in.
