There’s never a useful data catalog
I was thinking about using Age as a feature in a regression model for churn prediction. I looked at the range of values in a box plot across categories and it was nicely differentiated, centered around different median values.

The weird thing was some of the values in Age were as high as 124.
Talking to the database owner I found out that the Age field is a run time calculation based on today’s date.
SELECT DATEDIFF(YEAR, birthday,
GETDATE()) –
CASE
WHEN (MONTH(birthday) > MONTH(GETDATE())) OR
(MONTH(birthday) = MONTH(GETDATE()) AND DAY(birthday) > DAY(GETDATE()))
THEN 1
ELSE 0
END AS Age
FROM people;
The way the field is calculated matters.
Someone’s age at the time a data record is captured is not necessarily the same as their age would be at the time of analysis.
When I’ve seen data dictionaries they are often inadequate or incomplete, missing definitions like this one which someone considered too obvious to annotate.
If you don’t have visibility to a full technical lineage it’s worth it as a data scientist to talk through your approach and feature selections with the data engineers.
If the maximum Age has been less than 100 when we did our exploratory data analysis we likely would not even have noticed the incompatibility of the calculation with our use case, we might have baked bad data right into the pipeline.

Leave a comment