Digitizing Directories: Lessons Learned from Digitizing Historical Directories of Physicians

Sean Morey Smith

Medical directories are rich sources of information about the historical state of the medical profession. However, their availability only as printed text has limited their usefulness to historians of medicine, who could delve into their contents more readily in a digital format. Consisting of lists of physicians, usually along with their addresses and their professional and specialty affiliations, these directories have been used by historians to explore the consolidation of the medical profession and the emergence of specializations. Because researchers have been limited by the print form of the directories, however, their work has been based on relatively small samples of entries. This presentation explores the lessons learned from an attempt to digitize one historical “American Medical Directory” (“AMD”) and so make its contents more available to historians of medicine in a database format. It will explain the digitization process, which involves first scanning the AMD’s print version, then running optical character recognition (OCR) on the resulting scans, and finally running custom-built parsing tools on the OCR text to create a digital database. The presentation will discuss where errors occur in the process and the diminishing returns on labor, measured by the number of directory entries successfully parsed into the database. It will also speculate about uses for these digitized directories outside the history of medicine and about applying similar techniques to other sorts of directories.