When we have too many records...
Having too many records can be the sign of flaws in our processes.
Antoine Bello · April 29, 2024
Following the import of large data files, we are now dangerously close to 100% coverage of the US population. Why is that bad news, you may ask? Well, we know that in reality we are still pretty far from that milestone, if only because we have few files pertaining to children and teenagers. This in itself should keep us somewhere below 80%. The question is then: what are those excess records?
We’ve made lots of progress recently in identifying duplicate profiles. We now know with greater certainty when to merge two similar records and when to let them live their own separate lives. We can still refine our algorithms, but the crux of the explanation probably lies elsewhere.
We suspect the primary reason to be maiden names. Women account for 55% of our US records, vs 51.1% in official statistics. A difference that wide probably stems from the fact that most women change names when they get married, which exposes them to being captured twice in our database. Few US organizations ever mention maiden names, unlike their counterparts in Catholic countries.
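To get a feel for the size of that gap, here is a rough back-of-the-envelope sketch. It rests on assumptions that are ours for illustration only: that male records contain no duplicates, that men and women are otherwise covered equally well, and that every surplus female record is a duplicate. The function name `excess_female_share` is hypothetical.

```python
def excess_female_share(observed_female: float, true_female: float) -> float:
    """Estimate the share of ALL records that are surplus female records.

    Assumption (illustrative only): male records are duplicate-free, so the
    number of male records tells us how many female records we should expect.
    """
    observed_male = 1.0 - observed_female
    # Expected female share, scaled from the male share by the official sex ratio.
    expected_female = observed_male * true_female / (1.0 - true_female)
    return observed_female - expected_female


# Figures from the post: 55% of our records vs 51.1% in official statistics.
excess = excess_female_share(0.55, 0.511)
print(f"{excess:.1%} of all records")  # prints "8.0% of all records"
```

Under those assumptions, roughly 8 records out of every 100 would be surplus female records. The real figure depends on how well each assumption holds, so this is an order of magnitude rather than a measurement.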
Other factors come into play, namely human errors: ours and those committed by the official bodies providing our data. Names on PDF files are truncated (“Jacqueline” becomes “Jacqueli” or “Jacquel”), dates are transposed (“10/12/83” becomes “12/10/83”), and typos appear here and there, making the comparison of new records with existing ones delicate if not outright impossible.
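The two error patterns above can be tolerated with fairly simple checks. The sketch below is not our production matching logic, only an illustration: the function names, the `min_prefix` threshold and the assumption that dates arrive as three slash-separated numbers are all hypothetical.

```python
def names_match(a: str, b: str, min_prefix: int = 5) -> bool:
    """True if the names are equal or one looks like a truncation of the other.

    min_prefix is an arbitrary illustrative threshold: shorter fragments
    ("Jo" vs "John") are too ambiguous to treat as a truncation.
    """
    a, b = a.strip().lower(), b.strip().lower()
    short, long_ = sorted((a, b), key=len)
    if not short:
        return False
    return len(short) >= min(min_prefix, len(long_)) and long_.startswith(short)


def parse_date(s: str) -> tuple[int, int, int]:
    """Split 'NN/NN/YY' into three integers without deciding which is the day."""
    first, second, year = (int(part) for part in s.split("/"))
    return first, second, year


def dates_match(a: str, b: str) -> bool:
    """True if the dates are identical or differ only by a day/month swap."""
    a1, a2, a_year = parse_date(a)
    b1, b2, b_year = parse_date(b)
    if a_year != b_year:
        return False
    return (a1, a2) == (b1, b2) or (a1, a2) == (b2, b1)


print(names_match("Jacqueline", "Jacqueli"))  # True
print(names_match("Jacqueline", "Jacquel"))   # True
print(dates_match("10/12/83", "12/10/83"))    # True
print(dates_match("10/12/83", "12/10/84"))    # False
```

A match on these checks is only a hint that two records might describe the same person. Deciding whether to merge them would still call for more evidence, such as the last name or an address.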
We make mistakes too. The first one is failing to catch all the errors that happened before the data reached us. And sadly, we sometimes corrupt perfectly good files ourselves. Labeling a column “Death year” instead of “Birth year” can result in the creation of half a million erroneous profiles. This has happened before and we’re humble enough to know it will always remain a possibility.
One thing gives us hope though. AI and machine learning will make it easier to detect patterns within huge volumes of data. Acquiring fresh data will remain key but tending to it will be just as important. Feel free to contact us if you have special talents in that department and would like to help.