*******************************************************.
*** Data Management through e-Social Science
**   www.dames.org.uk
**
** ESRC DAMES Research Node, data management training programme:
**
** Workshop of 24-25 August 2009
** Data Management for Social Survey Research
**
** LAB 2: ADVANCED DATA MANAGEMENT WITH STATA
**   SUBFILE: DOCUMENTATION_FOR_REPLICATION.DO
**
** www.dames.org.uk
** Paul Lambert / Vernon Gayle, 24 August 2009
*******************************************************.

*******************************************************.
****** LAB 2 SUB-FILE: DOCUMENTATION_FOR_REPLICATION.DO
*******************************************************.

**************************************************
*** Long (2009) emphasises the following key components to documenting
*   research using Stata in a manner that is efficient and replicable:
*
*   1) Constructing an overall 'workflow' through a 'master' file and related subfiles
*   2) Planning the design of the workflow
*   3) Maximising automation through programming
*
*   Long, J. S. (2009). The Workflow of Data Analysis Using Stata.
*      Boca Raton: CRC Press.
***
*** In this example file we briefly illustrate what we think are some of the most
*   effective steps towards those goals and the ultimate aim of documentation for
*   replication
**************************************************

*******************************************************.
** NOTIFICATION OF FILE LOCATIONS / DIRECTORIES AND STATA SETUP
**
** i) File location declarations:
*** For the commands below to work, you should begin by running the following
**   macros, which tell Stata where to look for the relevant data files (mentioned
**   above) on your machine:

global path1 "d:\dames09\work\"
* (the location of your working directory - where you will save
*   newly created data files and output)
global path3 "d:\dames09\data\bhps\w1to15\"
* (the location of a folder where you have saved the BHPS microdata -
*   Stata format files from UKDA Study number 5151)

global path8 "d:\dames09\macros\"
* (a folder for storing macros used or created during this lab)

global path8b "d:\dames09\macros\occs\"
* (a subfolder of macros, featuring tools used when analysing occupations:
*     iscoisei.ado
*     casoc_isco.do
*   all available for download from: www.dames.org.uk/workshops)

global path9 "d:\dames09\temp\"
* (the location of a temporary folder where you can save intermediate files)

clear
set mem 200m

** For the Stirling 2009 workshop: the data and files all ought to be in these locations
*   already
**
** If you are running these sessions elsewhere: you need to carefully substitute your own path
**   locations as appropriate into the commands above

**********************

***********************************************
* 1) Constructing an overall 'workflow' through a 'master' file and related subfiles
***********************************************

**** Here is an example of using master and subfiles:
** Analysis: BHPS data from 17 waves on educational
*   qualifications and preliminary analysis of the relation between
*   views on moving home and educational level over time

******* First, for information, here is the analysis in a single set of commands:

*** Comment: When constructing a series of subfiles, I generally begin by
*   writing the files in one long working file, then, when components of them
*   are complete, I parcel them off into sub-files. This is usually a bit
*   easier than trying to specify and edit your sub-files in advance...
*******

*******************************
** File locations:
** Location of command files and sub-files:
global do_files "d:\dames09\work\bhps_educ\"
** Location of BHPS data (waves 1-17 microdata)
global bhps_data "d:\dames09\data\bhps\w1to15\"
** Outputs: log files from analyses and preparation
global logs "d:\dames09\work\bhps_educ\logs\"
** Outputs: graphs:
global graphs "d:\dames09\work\bhps_educ\graphs\"
** Outputs: data:
global data "d:\dames09\work\bhps_educ\data\"
** Temporary directory
global path9 "c:\temp\"
******************************

******************************
* Data preparation: pooling data across BHPS waves
*
use pid asex aage aqfedhi alkmove ajbstat ajbsoc axrwght using $bhps_data\aindresp.dta, clear
renpfix a z
gen wave=1991
sav $path9\w_a.dta, replace
*
use pid bsex bage bqfedhi blkmove bjbstat bjbsoc bxrwght using $bhps_data\bindresp.dta, clear
renpfix b z
gen wave=1992
sav $path9\w_b.dta, replace
*
use pid csex cage cqfedhi clkmove cjbstat cjbsoc cxrwght using $bhps_data\cindresp.dta, clear
renpfix c z
gen wave=1993
sav $path9\w_c.dta, replace
*
use pid dsex dage dqfedhi dlkmove djbstat djbsoc dxrwght using $bhps_data\dindresp.dta, clear
renpfix d z
gen wave=1994
sav $path9\w_d.dta, replace
*
use pid esex eage eqfedhi elkmove ejbstat ejbsoc exrwght using $bhps_data\eindresp.dta, clear
renpfix e z
gen wave=1995
sav $path9\w_e.dta, replace
*
use pid fsex fage fqfedhi flkmove fjbstat fjbsoc fxrwght using $bhps_data\findresp.dta, clear
renpfix f z
gen wave=1996
sav $path9\w_f.dta, replace
*
use pid gsex gage gqfedhi glkmove gjbstat gjbsoc gxrwght using $bhps_data\gindresp.dta, clear
renpfix g z
gen wave=1997
sav $path9\w_g.dta, replace
*
use pid hsex hage hqfedhi hlkmove hjbstat hjbsoc hxrwght using $bhps_data\hindresp.dta, clear
renpfix h z
gen wave=1998
sav $path9\w_h.dta, replace
*
use pid isex iage iqfedhi ilkmove ijbstat ijbsoc ixrwght using $bhps_data\iindresp.dta, clear
renpfix i z
gen wave=1999
sav $path9\w_i.dta, replace
*
use pid ///
    jsex jage jqfedhi jlkmove jjbstat jjbsoc jxrwght using $bhps_data\jindresp.dta, clear
renpfix j z
gen wave=2000
sav $path9\w_j.dta, replace
*
use pid ksex kage kqfedhi klkmove kjbstat kjbsoc kxrwght using $bhps_data\kindresp.dta, clear
renpfix k z
gen wave=2001
sav $path9\w_k.dta, replace
*
use pid lsex lage lqfedhi llkmove ljbstat ljbsoc lxrwght using $bhps_data\lindresp.dta, clear
renpfix l z
gen wave=2002
sav $path9\w_l.dta, replace
*
use pid msex mage mqfedhi mlkmove mjbstat mjbsoc mxrwght using $bhps_data\mindresp.dta, clear
renpfix m z
gen wave=2003
sav $path9\w_m.dta, replace
*
use pid nsex nage nqfedhi nlkmove njbstat njbsoc nxrwght using $bhps_data\nindresp.dta, clear
renpfix n z
gen wave=2004
sav $path9\w_n.dta, replace
*
use pid osex oage oqfedhi olkmove ojbstat ojbsoc oxrwght using $bhps_data\oindresp.dta, clear
renpfix o z
gen wave=2005
sav $path9\w_o.dta, replace
*
use pid psex page pqfedhi plkmove pjbstat pjbsoc pxrwght using $bhps_data\pindresp.dta, clear
renpfix p z
rename zid pid /* A wee danger with automation! (renpfix p z also renamed pid, which begins with p, to zid)
*/
gen wave=2006
sav $path9\w_p.dta, replace
*
use pid qsex qage qqfedhi qlkmove qjbstat qjbsoc qxrwght using $bhps_data\qindresp.dta, clear
renpfix q z
gen wave=2007
sav $path9\w_q.dta, replace
*
use $path9\w_a.dta, clear
append using $path9\w_b.dta
append using $path9\w_c.dta
append using $path9\w_d.dta
append using $path9\w_e.dta
append using $path9\w_f.dta
append using $path9\w_g.dta
append using $path9\w_h.dta
append using $path9\w_i.dta
append using $path9\w_j.dta
append using $path9\w_k.dta
append using $path9\w_l.dta
append using $path9\w_m.dta
append using $path9\w_n.dta
append using $path9\w_o.dta
append using $path9\w_p.dta
append using $path9\w_q.dta
tab wave
numlabel _all, add
tab zsex
keep if zsex==1 | zsex==2
xtdes, i(pid) t(wave)
* 4126 balanced panel full interviews waves 1-17
notes drop _dta
notes: PL, 23Aug09: BHPS waves 1-17 pooled, for highest educational qualifications data
sav $data\bh_xw1.dta, replace
dir $data\bh_xw1.dta
************************

*******************
** Data preparation: recoding variables for analytical purposes
use $data\bh_xw1.dta, clear
datasignature
notes
*
tab zqfedhi
recode zqfedhi -9=.m -7=.p 13=.s
tab zqfedhi
gen educ1=(zqfedhi==1 | zqfedhi==2 | zqfedhi==3 | zqfedhi==4) if !missing(zqfedhi)
gen educ2=(zqfedhi==1 | zqfedhi==2 ) if !missing(zqfedhi)
label variable educ1 "Degree/diploma: Highest educational qualification categorisation"
label variable educ2 "Degree: Highest educational qualification categorisation"
tab1 educ1 educ2
*
tab zlkmove
recode zlkmove -9=.m -7=.p -2=.r
tab zlkmove
gen loc1 = (zlkmove==2) if !missing(zlkmove)
label variable loc1 "Prefer to move: Attitude to moving home"
tab loc1
tab zsex
gen fem=(zsex==2)
tab zage
clonevar midage=zage
recode midage -9=.m 15/29=.y 51/150=.o
tab midage
* (Because we suspect that educational qualifications are influenced by age,
*   this gives a conservative measure, limiting the analysis to a given age range)
sav $path9\bh_temp.dta, replace
*******************
**
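*******************
** Optional check on the derived variables (a sketch that is not part of the
*   original lab). It assumes the data just saved as bh_temp.dta are still in
*   memory. 'assert' stops the do file with an error if any case breaks the stated
*   condition, so a failure here would point to a code that the recodes above
*   did not anticipate.
assert inlist(educ1, 0, 1) | missing(educ1)
assert educ2 <= educ1 if !missing(educ1)
assert inrange(midage, 30, 50) if !missing(midage)
assert inlist(fem, 0, 1)
* Keeping checks like these in the do file documents what each derived variable
*   is expected to contain, which helps anyone replicating the analysis later.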
** Descriptive analysis:
use $path9\bh_temp.dta, clear
table wave , c(mean educ1 mean educ2 n educ1)
table wave if !missing(midage) , c(mean educ1 mean educ2 n educ1)
* Year-to-year, proportions with higher education increase;
*   age matters to these proportions => restrict to ages 30-50 in the below.
table wave [aweight=zxrwght] if !missing(midage) ///
     , c(mean educ1 mean educ2 n educ1)
* Weighting makes a bit of a difference (not surprising given differences in samples,
*   but actually the magnitude of difference isn't huge)

** Does education impact upon desire to move?
pwcorr loc1 educ1 educ2, star(05) sig
bysort wave: pwcorr loc1 educ1 educ2, star(05)
table wave loc1 [aweight=zxrwght] if !missing(midage) ///
     , c(mean educ1 mean educ2 n educ1)
* By and large, some weak correlations: people with higher levels of education
*   are slightly more likely to want to move home

summarize loc1 midage fem educ1 educ2
xtlogit loc1 if !missing(educ1), i(pid)
est store nulla
xtlogit loc1 if !missing(educ1) & !missing(midage) & !missing(fem), i(pid)
est store nullb
xtlogit loc1 educ1 , i(pid)
est store mod1a
xtlogit loc1 educ2 , i(pid)
est store mod2a
xtlogit loc1 midage fem educ1 , i(pid)
est store mod1b
xtlogit loc1 midage fem educ2 , i(pid)
est store mod2b
est table nulla mod1a mod2a nullb mod1b mod2b, stats(N ll bic rho) star b(%9.3g)

*******************
** Graphical summary:
use $path9\bh_temp.dta, clear
collapse (mean) loc1 (count) ncase=loc1 [aweight=zxrwght] ///
     if !missing(midage) & !missing(educ1), by(wave educ1)
rename educ1 educ
gen edtype=1
summarize
sav $path9\bit1.dta, replace
use $path9\bh_temp.dta, clear
collapse (mean) loc1 (count) ncase=loc1 [aweight=zxrwght] ///
     if !missing(midage) & !missing(educ2), by(wave educ2)
rename educ2 educ
gen edtype=2
summarize
sav $path9\bit2.dta, replace
use $path9\bit1.dta, clear
append using $path9\bit2.dta
gen loc1e1=loc1 if educ==0 & edtype==1
gen loc1e2=loc1 if educ==1 & edtype==1
replace wave=wave+0.2 if educ==1 & edtype==1
gen ///
    loc1e4=loc1 if educ==1 & edtype==2
replace wave=wave+0.4 if educ==1 & edtype==2
graph twoway (bar loc1e1 wave, bcolor(gs8) barwidth(0.3) ) ///
     (bar loc1e2 wave, bcolor(gs12) barwidth(0.3) ) ///
     (bar loc1e4 wave, bcolor(gs14) barwidth(0.3) ) , ///
     title("Prefers to move home, by educational qualifications and year", span) ///
     note("Source: BHPS waves 1-17, ages 30-50 only, weighted with [xrwght].") ///
     legend(label(1 "No degree or diploma") label(2 "Degree or diploma") ///
     label(3 "Degree") )
graph export $graphs\moving_home_views1.emf, as(emf) replace
******************************************************************
******************************************************************

*** Now, here is the same analysis in a master file format :

*******************************
** File locations:
* Location of command files and sub-files:
global do_files "d:\dames09\work\bhps_educ\do_files\"
* Location of BHPS data (waves 1-17 microdata)
global bhps_data "d:\dames09\data\bhps\w1to15\"
* Outputs: log files from analyses and preparation
global logs "d:\dames09\work\bhps_educ\logs\"
* Outputs: graphs:
global graphs "d:\dames09\work\bhps_educ\graphs\"
* Outputs: data:
global data "d:\dames09\work\bhps_educ\data\"
* Temporary directory
global path9 "c:\temp\"
******************************
dir $do_files\*.do
do $do_files\pooling_data1.do
do $do_files\pre_analysis1.do
do $do_files\data_review1.do
do $do_files\graphs_moving_educ.do
******************************************************************

******************************************************************
** Finally, we can streamline further by writing a separate master file and invoking that:
do $path1\bhps_educ\do_files\bhps_educ_master.do
******************************************************************
******************************************************************
*****
***** Comment: A principal attraction of the master/sub-file arrangement is that
*   it is much easier to make edits to isolated components of the workflow
*   without altering other features. For example, we could redo the analysis
*   with some different classifications of educational qualifications very
*   easily by making a quick edit to the file 'pre_analysis1.do'

**** Comment: More generally, this format supports extensibility in
*   programming - it is easier to share components, or to draw on
*   external resources, when things are in neat and well-documented packages

**** Comment: One good question in this arrangement is where in the file
*   structure you wish to define the path locations of files - at the top of
*   each respective file (as we've done in our lab exercises),
*   or in a separate file (the above example).
*   In our opinion the latter is generally preferable, but there are often operational
*   reasons, such as collaborative work, where the former is more sensible.
*****

***********************************************
* 2) Planning the design of the workflow
***********************************************

*** There are some useful extended comments on designing Stata workflows in
*   Long (2009). It's unlikely that anyone manages to keep to _all_ of the
*   recommendations, though, as they are quite stringent.
*   In our experience plans are easily blown off course by organisational factors
*   such as working in collaborative groups when not all can use the same structure,
*   or when computing facilities or versions vary between locations.

** We'd recommend the following priorities in planning your workflows:

***** a) Use a nested folder structure for distinct projects with sensible folder names (no spaces!)
dir $path1\bhps_educ\
dir $path1\bhps_educ\do_files\
dir $path1\bhps_educ\logs\

***** b) Always use path aliases rather than including full paths within files
**   ..You don't need to be told this by this stage surely?

***** c) Choose your names for files, folders, variables etc in a consistent standard across projects
**   Long (2009) is good on this:
*   - first, things like variable names should be short and concise, since labels are
*     often truncated in analytical outputs
*   - second, if you tend to use the same conventions across projects, it will be easier
*     in the future to resume work on an old study (and debug any errors)

***** d) Keep different projects separate, and keep generic resources separate again
capture mkdir $path1\proj1\
capture mkdir $path1\proj1\logs\
global proj1 "$path1\proj1\"
capture mkdir $path1\proj2\
capture mkdir $path1\proj2\logs\
global proj2 "$path1\proj2\"
global external_data "d:\dames09\data\bhps\w1to15\"
* Project 1:
capture log close
log using $proj1\logs\results.txt, replace text
use pid aivfio using $external_data\aindresp.dta, clear
codebook, compact
numlabel _all, add
tab aivfio, missing
capture log close
* Project 2:
capture log close
log using $proj2\logs\results.txt, replace text
use pid qregion qsex using $external_data\qindresp.dta, clear
codebook, compact
numlabel _all, add
tab qregion qsex, missing
capture log close
*******************************

***** e) Remember version control on Stata, especially for complex/emerging commands
*** Something that uses an extension routine may need the macros to be re-installed on a local machine
** Example of tabplot: ideally include this at the top of any file which will later use tabplot:
net set ado $path8
adopath + $path8
net from http://fmwww.bc.edu/RePEc/bocode/t
net describe tabplot
net install tabplot
man tabplot
*** Many commands have changed over versions - the best defence is to name the version in all command files.
version /* Reveals the version of Stata currently being used */
version 10 /* States that commands from now on should be read as Version 10 format */
* (relevant if you are using (say) version 11 but your commands were written in v10)
version 8
version
version 10.1
version

***********************************************
* 3) Maximising automation through programming
***********************************************
*
*** The topic of programming in Stata is a huge one with numerous extension issues and permutations
**   (as is true of most database facilities).
*** However, a particularly nice feature of Stata is that even a relatively novice user can
*   soon start exploiting basic programming approaches in order to improve the efficiency
*   of their work practices
*** Some issues that are particularly pertinent to social survey research include:
***

*********************
**** i) Using macros
* We've already used many examples of macros, in which the global command defines a name
*   that, when prefixed with a $ sign, is replaced by the stored text. This approach is often
*   helpful to summarise a group of variables within a longer sequence of commands that will
*   be repeated, for example:
**
** Wave 1 analysis:
global varlist "ajbsat ajbcssm ajbhrs "
use $varlist using $path3\aindresp.dta, clear
summarize $varlist
mvdecode $varlist, mv(-9/0)
summarize $varlist
pwcorr $varlist, obs star(05)
regress $varlist
* Wave 17 replication:
global varlist "qjbsat qjbcssm qjbhrs "
use $varlist using $path3\qindresp.dta, clear
summarize $varlist
mvdecode $varlist, mv(-9/0)
summarize $varlist
pwcorr $varlist, obs star(05)
regress $varlist
** Note how we only need to make a couple of edits to the opening 2 lines to repeat in w17.
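** A possible alternative (a sketch that is not part of the original lab; it uses
*   Stata's built-in 'auto' example data, and the macro name 'carvars' is our own):
*   the 'local' command defines a macro that is referred to as `name' rather than
*   $name, and that exists only while the do file (or programme) defining it is
*   running. This avoids a leftover global such as $varlist being picked up by
*   mistake in a later analysis, but the lines below must be run together, since
*   the local is forgotten once the run ends.
sysuse auto, clear
local carvars "price mpg weight"
summarize `carvars'
pwcorr `carvars', obs star(.05)
regress `carvars'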
*********************

*********************
**** ii) Defining simple programmes :
** The simplest programmes merely repeat a segment of code in a convenient way
use qsex qregion using $path3\qindresp.dta, clear
numlabel _all, add
tab qregion
** Programme which generates three region indicator variables:
capture program drop regrec1 /* If the program already exists, trying to define it gives an error */
program define regrec1
   gen reg4=zregion
   recode reg4 1 2 3 4 5 = 1 6/16=2 17=3 18=4 19=5 *=.m
   gen scot=(zregion==18) if !missing(reg4)
   gen wales=(zregion==17) if !missing(reg4)
   capture label drop reg4l
   label define reg4l 1 "Southern England" 2 "Northern England" 3 "Wales" 4 "Scotland" 5 "NI"
   label values reg4 reg4l
   tab1 zregion reg4 scot wales
end
* To invoke this year by year:
* e.g. wave 1
use asex aregion using $path3\aindresp.dta, clear
numlabel _all, add
renpfix a z
regrec1
* e.g. wave 17
use qsex qregion using $path3\qindresp.dta, clear
numlabel _all, add
renpfix q z
regrec1
*********************

*********************
**** iii) We can also use do files as if they were simple programmes:
**** Simple programmes can be run as do files rather than by formally defining them
*   as a programme.
** This can be useful in several instances for re-running oft-repeated commands.
*   One example is in defining value labels, see e.g.
*   the programme 'soc90_labels.do' used in the first lab
*   (inspect its contents to see how it works):
use qsex qjbsoc using $path3\qindresp.dta, clear
tab qjbsoc
do $path8\soc90_labels.do
tab qjbsoc
label values qjbsoc soc90l
tab qjbsoc
tab qjbsoc if qsex==2
*************************

*************************
**** iii) Defining programmes with arguments:
** Slightly more complex files arise when arguments are used in the programme specification.
* (This doesn't mean the sort of argument that makes your face turn red)
*********
** a) Arguments (variable terms) can be included in programmes by specifying them during the
*   definition
*********
** b) Sometimes, an effective way to allow arguments is to combine a simple programme with
*   macros, and change the macros using 'global':
capture program drop dataextr
program define dataextr
   use pid *sex *age *jbsoc using $path3\$file.dta, clear
   sort pid
   sav "$path9\$file.dta", replace
end
global file "aindresp"
dataextr
global file "bindresp"
dataextr
global file "cindresp"
dataextr
dir $path9\*indresp.dta
use $path9\aindresp.dta, clear
merge pid using $path9\bindresp.dta, sort _merge(wb)
merge pid using $path9\cindresp.dta, sort _merge(wc)
summarize
tab wb wc, missing
* 8170 balanced panel cases here
*************************

*************************
**** iv) Using 'foreach' to loop:
** The foreach command allows substitution of a specified sequence of text values into
*   a generic command. This is useful for anything requiring finite replications
*   (a very common requirement in survey research): the replications are known as 'loops'.
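** A minimal sketch (not part of the original lab) combining point a) above with the
*   foreach command just described. The programme name 'centre1' and the 'c_' prefix
*   are our own illustrative choices, and the data are Stata's built-in 'auto' example
*   file. 'args' names the programme's first argument, which is then available inside
*   the programme as the local macro `v'.
sysuse auto, clear
capture program drop centre1
program define centre1
   args v
   quietly summarize `v'
   generate c_`v' = `v' - r(mean)
   label variable c_`v' "`v', centred on its mean"
end
* Call the programme once for each variable name in the list:
foreach x in price mpg weight {
   centre1 `x'
}
summarize c_price c_mpg c_weight
* (each centred variable should have a mean of approximately zero)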
foreach z in a b c cat dog mouse house {
   display "`z'"
}
****************
* Foreach Example (1): Perform 17 repeated extracts of BHPS survey data
foreach z in a b c d e f g h i j k l m n o p q {
   use pid `z'sex `z'tenure `z'mastat using $path3\`z'indresp.dta, clear
   renpfix `z' z
   capture rename zid pid
   sort pid
   gen wave="`z'"
   * (wave is held as a string letter here, which recode cannot handle; it is
   *   converted to a numeric year with encode after the files are pooled, below)
   sav "$path9\`z'_foreachdemo.dta", replace
}
** (Note how succinct this is compared to the comparable derivations shown above)
dir $path9\*_foreachdemo.dta
use $path9\a_foreachdemo.dta, clear
append using $path9\b_foreachdemo.dta
append using $path9\c_foreachdemo.dta
append using $path9\d_foreachdemo.dta
append using $path9\e_foreachdemo.dta
append using $path9\f_foreachdemo.dta
append using $path9\g_foreachdemo.dta
append using $path9\h_foreachdemo.dta
append using $path9\i_foreachdemo.dta
append using $path9\j_foreachdemo.dta
append using $path9\k_foreachdemo.dta
append using $path9\l_foreachdemo.dta
append using $path9\m_foreachdemo.dta
append using $path9\n_foreachdemo.dta
append using $path9\o_foreachdemo.dta
append using $path9\p_foreachdemo.dta
append using $path9\q_foreachdemo.dta
tab wave
encode wave, generate(year)
replace year=year+1990
tab year
** => Used carefully, 'foreach' is a very helpful shortcut
** However, for tasks such as the above, it can lack flexibility, e.g. in catering for
*   wave-specific adjustments
****************
****** Foreach example (2) : deriving interaction terms (cf. Long 2009, p97)
use asex ajbsat ajbcssm ajbhrs using $path3\aindresp.dta, clear
summarize
mvdecode a*, mv(-9/-1)
summarize
gen fem=(asex==2)
foreach var in ajbsat ajbcssm ajbhrs {
   capture drop fem`var'
   gen fem`var'=fem*`var' if !missing(`var')
}
summarize
***********************************************
***********************************************
** EOF