Record linkage is the process of joining separately recorded pieces of information about a particular individual from one or more sources. A reliable computer-based approach makes record linkage practical. In genealogical research, computerized record linkage is useful for combining information about an individual across multiple censuses.

In creating a computerized method for linking census records, it must be determined whether weights calculated from one geographical area can be used to link records from another geographical area. Research performed by Marcie Francis calculated field weights using census records from 1910 and 1920 for Ascension Parish, Louisiana. These weights are re-calculated to take into account population changes over the time period and then applied to five data sets from different geographical locations to determine their robustness.

HeritageQuest provided indexed census records for four states in addition to Louisiana: California, Connecticut, Illinois, and Michigan. Because the California data set was large and at least five data sets were desired for comparison, this state was split into two groups based on geographical location.

Weights for Louisiana were re-calculated to take into consideration Visual Basic code modifications for the fields "Place of Origin", "Age", and "Location" (enumeration district). The validity of these weights was a concern because of the low number of known matches present in the Louisiana data set.

Thus, to see how weights calculated from a data source with a larger number of known matches would perform, weights were calculated for Michigan census records. Error rates obtained using weights calculated from the Michigan data set were lower than those obtained using the Louisiana weights.

To determine weight robustness, weights for Southern California were also calculated to allow for comparison between two samples. Error rates obtained using the Southern California weights were much lower than either of the previously calculated error rates. This led to the decision to calculate weights for each of the data sets, average them, and use the averaged weights to link each data set, so as to account for population fluctuations between geographical locations.
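The averaging step described above can be illustrated with a minimal sketch. It assumes each data set's weights are stored as a mapping from field name to a single numeric weight, and that the average is a simple unweighted mean per field; the abstract does not state either detail. The function name `average_weights`, the data set labels, and all numbers are hypothetical and are not results from this project.

```python
from statistics import mean


def average_weights(weights_by_dataset):
    """Average each field's weight across data sets (unweighted mean).

    weights_by_dataset maps a data set label to a {field: weight} dict.
    Only fields present in every data set are averaged (an assumption).
    """
    per_dataset = list(weights_by_dataset.values())
    # Keep only the fields that every data set has a weight for.
    shared_fields = set.intersection(*(set(w) for w in per_dataset))
    return {
        field: mean(w[field] for w in per_dataset)
        for field in sorted(shared_fields)
    }


# Hypothetical placeholder weights, for illustration only.
example = {
    "dataset_a": {"Age": 2.0, "Location": 4.0},
    "dataset_b": {"Age": 4.0, "Location": 6.0},
}
averaged = average_weights(example)
```

The same averaged mapping would then be applied when linking each data set, in place of weights calculated from any single location.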

Error rates obtained with the averaged weights showed that these weights are robust enough to use in any of the geographical areas sampled. The weights obtained in this project can be used when linking any census records from 1910 and 1920. When linking census records from other decades, new weights must be calculated to account for fluctuations specific to that time period.



Physical and Mathematical Sciences; Statistics



Genealogy, Record Linkage, Robustness, Census Records