*******************************************************. *** Data Management through e-Social Science ** ** www.dames.org.uk ** ** ** ESRC DAMES Research Node, data management training programme: ** ** ** Workshop of 24-25 August 2009 ** Data Management for Social Survey Research ** ** ** LAB 2: ADVANCED DATA MANAGEMENT WITH STATA ** SUBFILE: FILE_MATCHING_EXTENSIONS.DO ** ** ** www.dames.org.uk ** Paul Lambert / Vernon Gayle, 24 August 2009 *******************************************************. *******************************************************. *******************************************************. ****** LAB 2 SUB-FILE: FILE_MATCHING_EXTENSIONS.DO *******************************************************. *******************************************************. *** We argue that fluency in matching files is the single biggest hurdle to * effective data management **** We gave examples of 5 principle types of 'matching files' operations in * the example file 'lab0.do': it is well worth reviewing these and extending * them to further data examples **** Here we add a few short extensions on matching files. ************************************ *** 1) Using the 'mmerge' routine *** 2) Panelwhiz *** 3) Common pitfalls ************************************ *******************************************************. ** NOTIFICATION OF FILE LOCATIONS / DIRECTORIES AND STATA SETUP ** ** ** ** i) File location declarations: *** For the commands below to work, you should begin by running the following ** macros, which tell Stata where to look for the relevant data files (mentioned ** above) on your machine : . global path1 "d:\dames09\work\" * (the location of your working directory - where you will save * newly created data files and output) . global path3 "d:\dames09\data\bhps\w1to15\" * (the location of a folder where you have saved the BHPS microdata - * Stata format files from UKDA Study number 5151) . global path8 "d:\dames09\macros\" * (a folder for storing macros use or created during this lab: global path8b "d:\dames09\macros\occs\" * a subfolder of macros, featuring tools used when analysing occupations: * iscoisei.ado * casoc_isco.do * gb91soc90.dta * gb91isco88.dta * * (all available for download from: www.dames.org.uk/workshops) global path9 "d:\dames09\temp\" * (a location of a temporary folder where you can save intermediate files) . clear set mem 200m ** For the Stirling 2009 workshop: the data and files all ought to be in the locations as above ** ** If you are running these sessions elsewhere: you need to carefully substitute your own path ** locations as appropriate into the commands above . ********************** ********************** ************************************************ *** 1) Using the 'mmerge' routine ************************************************ ** 'mmerge' is a user-contributed routine written by Jeorone Weesie (Utrecht Univ.) ** It is designed to make file matching easier.. net set ado $path8 adopath + $path8 findit mmerge net describe mmerge net install mmerge.pkg, replace man mmerge ** Creating four related data files for use as mmerge demos: use pid asex aage using $path3\aindresp.dta, clear gen wave=1991 label data "BHPS individuals, 1991" sort pid sav $path9\m1.dta, replace use pid qsex qage using $path3\qindresp.dta, clear gen wave=2007 sort pid sav $path9\m2.dta, replace label data "BHPS individuals, 2007" use pid doby using $path3\xwavedat.dta, clear label data "BHPS individuals, any year" sort pid sav $path9\m3.dta, replace use $path9\m1.dta, clear append using $path9\m2.dta summarize label data "BHPS, panel data long format (1991+2007)" sort pid sav $path9\m4.dta, replace dir $path9\m1.dta dir $path9\m2.dta dir $path9\m3.dta dir $path9\m4.dta *** Mmerge works by forcing the user to specify the type of merging they are trying to * perform, and displaying diagnostics on the success of that merge: ** Match waves 1 and 2 use $path9\m1.dta, clear mmerge pid using $path9\m2.dta, type(1:1) unmatched(both) * Correct! ** Some illustrative variants of the above: use $path9\m1.dta, clear mmerge pid using $path9\m2.dta, type(1:1) unmatched(both) return list use $path9\m1.dta, clear mmerge pid using $path9\m2.dta, type(auto) * Here mmerge checks data and performs a 1:1 link by default help mmerge /* Various other useful features */ ** Some more tests.. ** Match wave 1 and the cross-wave file use $path9\m1.dta, clear mmerge pid using $path9\m3.dta, type(1:1) unmatched(both) * Correct! ** Match the panel data (multiple rows per person) with the cross-wave file (one row per person) use $path9\m4.dta, clear mmerge pid using $path9\m3.dta, type(n:1) unmatched(both) * Correct! ** Wrongly try to merge the panel and cross wave files as if they suited a one-to-one link use $path9\m4.dta, clear mmerge pid using $path9\m3.dta, type(1:1) unmatched(both) * Shows some errors, and lists all mismatching cases. ************************************************ ************************************************ ************************************************ *** 2) Panelwhiz ************************************************ ** Panelwhiz - www.panelwhiz.eu - is a Stata based software written by John Haisken-DeNew * which is designed to facilitate atching of complex survey data files * (it performs suitable matches behind the scenes, and returns combined files). * ** It is prepared for bespoke data collections (panel studies - hence the name) and * can be used to search for variables and merge variables between waves. * ** It is well worth checking this software if you work with panel survey data. * ** ************************************************ ************************************************ ************************************************ *** 3) Common pitfalls when matching files ************************************************ ****************** ** a) Forgetting about the automatically generated _merge variable use pid asex ajbsoc using $path3\aindresp.dta, clear sav $path9\m1.dta, replace use pid bsex bjbsoc using $path3\bindresp.dta, clear sav $path9\m2.dta, replace use pid csex cjbsoc using $path3\cindresp.dta, clear sav $path9\m3.dta, replace ** The below won't work *use $path9\m1.dta, clear *merge pid using $path9\m2.dta, sort *merge pid using $path9\m3.dta, sort /* Might expect this to work, but it would trip on _merge */ ** This will work though: use $path9\m1.dta, clear merge pid using $path9\m2.dta, sort drop _merge merge pid using $path9\m3.dta, sort summarize ** As will this (preferable) use $path9\m1.dta, clear merge pid using $path9\m2.dta, sort _merge(a_b) merge pid using $path9\m3.dta, sort _merge(ab_c) summarize ****************** ** b) Forgetting about unwanted cases use pid asex ajbsoc using $path3\aindresp.dta, clear sav $path9\m1.dta, replace use pid bsex bjbsoc using $path3\bindresp.dta, clear sav $path9\m2.dta, replace use $path9\m1.dta, clear merge pid using $path9\m2.dta, sort tab _merge * 1294 in master and not in donor (ie in m1 and not in m2) * 875 in donor and not in master (ie in m2 and not in m1) * 8970 in both master and donor * => A common selection would be only 1 or 2 of the three groups, e.g. keep if _merge==1 | _merge==3 drop _merge summarize * This limits data to the 10264 who were present in wave A, * some but not all of whom were also present in wave B ** Often, errors arise if we forget to implement a suitable 'drop' or 'keep' * command on such cases. ****************** ** c) Inadvert 'manys' on a one-to-many link use $path8b\gb91soc90.dta, clear tab soc90 describe sort soc90 sav $path9\m1.dta, replace * (This is a 'CAMSIS' database of information on occupations in soc90) * (from www.camsis.stir.ac.uk) use qjbsoc qjbcssm using $path3\qindresp.dta, clear tab qjbsoc * This is a survey data file with records which have a soc90: * we want to do a 'one-to-many' merge the CAMSIS file with this record rename qjbsoc soc90 sort soc90 merge soc90 using $path9\m1.dta tab _merge * This HASN'T worked properly * Generates two warning messages: observations were not unique in either file * This is a problem: it's fine if observations per key variable are not unique in one file * (that is a 'one-to-many' link), but not if they are not unique on either file (this is a * 'many-to-many' link but with no control over how links are made). * The problem arises because the CAMSIS file actually has more than one case per soc90 * (the multiple cases reflect multiple possible employment status's) * => Solution: use $path8b\gb91soc90.dta, clear tab soc90 numlabel _all, add tab ukempst keep if ukempst==0 /* Key change is this line */ tab soc90 /* Now we only have one record per soc90 */ sort soc90 sav $path9\m1.dta, replace * (This is a 'CAMSIS' database of information on occupations in soc90) * (from www.camsis.stir.ac.uk) use qjbsoc qjbcssm using $path3\qindresp.dta, clear tab qjbsoc * This is a survey data file with records which have a soc90: * we want to do a 'one-to-many' merge the CAMSIS file with this record rename qjbsoc soc90 sort soc90 merge soc90 using $path9\m1.dta tab _merge keep if _merge==1 | _merge==3 * This HAS worked properly summarize list soc90 qjbcssm mcamsis ns_sec in -100/-1 ****************** ** d) Inadvertent format mismatches on the key linking variable/variables ** It's not uncommon for key linking variables to have different formats on different data * files, e.g. use qjbisco qjbcssm using $path3\qindresp.dta, clear tab qjbisco describe use $path8b\gb91isco88.dta, clear keep if ukempst==0 /* Reduces this data file to one case per isco88 */ tab isco88 describe ** Here, the variables 'qjbisco' and 'isco88' represent the same measure and could be linked, * but (see the 'describe' output) they are stored in different formats on the two files. ** A match on the raw variables won't work: use $path8b\gb91isco88.dta, clear keep if ukempst==0 tab isco88 describe sort isco88 sav $path9\m1.dta, replace use qjbisco qjbcssm using $path3\qindresp.dta, clear tab qjbisco rename qjbisco isco88 sort isco88 merge isco88 using $path9\m1.dta * This generates an error, because the formats of the 'isco88' in the two files differ ** Solution: force the format to be numeric in both files: use $path8b\gb91isco88.dta, clear keep if ukempst==0 tab isco88 describe sort isco88 sav $path9\m1.dta, replace use qjbisco qjbcssm using $path3\qindresp.dta, clear tab qjbisco destring qjbisco, generate(isco88) sort isco88 merge isco88 using $path9\m1.dta tab _merge keep if _merge==1 | _merge==3 /* Keep all the survey microdata */ list in -100/-1 * This works fine ************************************************ ************************************************ ** EOF