We developed the open-source software package BlockDates using the Julia programming language to allow the extraction of fuzzy-matched dates from a block of text. The tool leverages contextual information and draws on external date data to find the best date matches. For each identified date, multiple interpretations are proposed and scored to find the best fit. The output includes several record-level variables that help explain the result and prioritize error detection.
The date is often a critical piece of information for safety data analysis. It provides context and is necessary for measurement of event frequency and time-based trends. In some data sources, such as narrative information about an event or subject, the date is provided in various non-standardized formats. The Bureau of Transportation Statistics uses data provided in narrative, free-text format to validate and supplement reported safety event data.
We developed the open-source software package BlockDates using the Julia programming language to allow the extraction of fuzzy-matched dates from a block of text. The tool leverages contextual information and draws on external date data to find the best date matches. For each identified date, multiple interpretations are proposed and scored to find the best fit. The output includes several record-level variables that help explain the result and prioritize error detection.
In a sample of 59,314 narrative records that include dates, the tool returned positive scores for 96.5% of records, meaning high confidence the selected date is valid. Of those with no matching date, 77.9% were recognized correctly as having no viable match.