*******************************************************. *** Data Management through e-Social Science ** ** www.dames.org.uk ** ** ** ESRC DAMES Research Node, data management training programme: ** ** ** Workshop of 24-25 August 2009 ** Data Management for Social Survey Research ** ** ** LAB 1: CORE DATA MANAGEMENT WITH STATA ** SUBFILE: LAB1_HIGHLIGHTS.DO ** ** ** www.dames.org.uk ** Paul Lambert / Vernon Gayle, 24 August 2009 *******************************************************. *******************************************************. *******************************************************. ****** LAB 1 SUB-FILE: LAB1_HIGHLIGHTS.DO *******************************************************. *******************************************************. **********************************************************. *****************************************************. ** ** *** This file illustrates a small selection of functions relevant to good data management ** in Stata which we have highlighted because they are both useful ** and because they are not always as widely known as they might be ** ** *** Most of these functions already feature somewhere in the file 'lab0.do', ** but here we add a few extensions ** ** The examples given below cover: ** ** 1) Understanding complex data - Missing data ** 2) Understanding complex data - Being assertive ** 3) Understanding complex data - Graphics options ** 4) Variables: Deriving variables with egen ** 5) Variables: Value label extensions ** 6) Variables: Reviewing measures using Statsby ** 7) Metadata: Labelling files ** ** **********************************************************. **************************************************************. *******************************************************. *******************************************************. *******************************************************. ** NOTIFICATION OF FILE LOCATIONS / DIRECTORIES AND STATA SETUP ** ** ** ** i) File location declarations: *** For the commands below to work, you should begin by running the following ** macros, which tell Stata where to look for the relevant data files (mentioned ** above) on your machine : . global path1 "d:\dames09\work\" * (the location of your working directory - where you will save * newly created data files and output) . global path3 "d:\dames09\data\bhps\w1to15\" * (the location of a folder where you have saved the BHPS data * file(s) described more fully in lab0.do) . global path8 "d:\dames09\macros\" * (the location of your copy of the demonstration Stata macro 'seglabelsv1.do' * downloadable from the website, and the file 'soc90labels.do', available from * http://www.camsis.stir.ac.uk/occunits/do_files/soc90_labels.do ) global path9 "d:\dames09\temp\" * (a location of a temporary folder where you can save intermediate files) . ** For the Stirling 2009 workshop: the data and files all ought to be in the locations as above ** ** If you are running these sessions elsewhere: you need to carefully substitute your own path ** locations as appropriate into the commands above . ********************** ********************** *******************************************************. *******************************************************. *******************************************************. *******************************************************. ** Log file: capture log close log using $path1\lab1_highlights_log.txt, replace text *******************************************************. *******************************************************. *******************************************************. ** 1) Understanding complex data - Missing data *******************************************************. ********* ** i) Using multiple missing value categories ********* use pid ksex kjbrgsc kjbsoc using $path3\kindresp.dta, clear numlabel _all, add tab kjbrgsc ** Here we'll use 'Clonevar' to create a duplicate new variable: clonevar kjbrgsc2 = kjbrgsc tab1 kjbrgsc kjbrgsc2 * (note how 'clonevar' keeps variable metadata, whereas 'gen' would not) tab kjbrgsc * Here there are two numeric codes assigned to cases that are essentially missing ** The most common approach is to blanket classify all missing value codes to a single * missing category, denoted '.' in Stata: mvdecode kjbrgsc, mv(-9/-1) tab kjbrgsc, missing * The information differentiating codes is now lost - e.g., mvencode kjbrgsc, mv(-999) tab kjbrgsc, missing /* Suboptimal */ * In Stata, we could however reassign missing categories to several different missing codes, * this is supported by using .a .b .c .. .z as valid codes which are automatically treated * as missing: tab kjbrgsc2 recode kjbrgsc2 -9 = .m -8=.i tab kjbrgsc2, missing /* An improvement! */ ********* ** ii) Reviewing and exploring missing data patterns ********* ** Most experts on missing data agree that reviewing missing data patterns across * variables of interest is the most important step in dealing with missing data ** In Stata, 'egen' has some convenient routines for checking missing data use pid kage ksex kvote kmastat kjbrgsc kjbsoc using $path3\kindresp.dta, clear numlabel _all, add ** Say our analysis was how voting was influenced by occupational class, age, gender, and * marital status codebook kage ksex kvote kmastat kjbrgsc , compact tab1 kage ksex kvote kmastat kjbrgsc recode kage -9=.m 15/17=.u recode kvote -9=.m -7=.p 11=.u 10=.a recode kmastat -1=.n 0=.u recode kjbrgsc -9=.m -8=.i * (Note how I've tried to use indicative alphabetic codes, * e.g. m for missing, u for under age, a for apathetic etc) tab1 kage ksex kvote kmastat kjbrgsc , missing codebook kage ksex kvote kmastat kjbrgsc , compact ** Here are some typical ways to derive missing data indicators (exploiting 'egen') tab kvote, missing capture drop misdep egen misdep=rmiss(kvote) tab1 kage ksex kmastat kjbrgsc capture drop misexpl egen misexpl=rmiss(kage ksex kmastat kjbrgsc ) tab1 misdep misexpl capture drop misexpl2 gen misexpl2 = (misexpl >= 1) tab misdep misexpl2 *** Examples of exploring missingness patterns: table misdep misexpl2, c(mean kage n kage) format(%9.3g) * People with missing voting data are generally younger; * people with missing explanatory variables are generally older xi:logit misdep kage i.ksex i.kmastat i.kjbrgsc * Age and class are most clearly related to missing data on voting xi:logit misexpl i.kvote * Particular voting preferences are also disproportionately linked to * missing data on class, age or marital status ** We could (and probably should) go much further, but the pattern emerging is * - we won't be able to claim the data is 'missing completely at random' * - the 'missing at random' assumption could prove reasonable, pending further analysis ** (Comment: Stata does have several missing data analysis/imputation routines, * but we don't review them here) *******************************************************. *******************************************************. *******************************************************. ** 2) Understanding complex data - Being assertive *******************************************************. use pid kage ksex kvote kmastat kjbstat kjbrgsc kjbsoc using $path3\kindresp.dta, clear numlabel _all, add codebook, compact tab1 kjbrgsc kjbstat ** The assert command is a particularly concise way to check if data such as conditional * responses are behaving as expected assert (kjbrgsc >= 1 & kjbrgsc <= 7) if kjbstat==2 assert (kjbstat >= 1 & kjbstat <=2) if (kjbrgsc >= 1 & kjbrgsc <= 7) ** Here, we see that there are quite a few people assigned a class on the basis of * their own occupation who are nevertheless not classified as currently working * on the jbstat variable * (this is actually ok, the current job could be a part-time or secondary status) ** There are also people who say they are in work but can't be assigned a job-based class * (this indicates missing data) ** After an 'assert' test, it is useful to follow up patterns for the contradictions: tab kvote assert kvote==11 if kage < 18 tab kvote if kage < 18 * The 'tab' makes it clear that plenty of under 18's still record a party supported assert kage < 18 if kvote==11 ** In this case, the contradictions are a puzzle and it might be interesting to find * out more about the 40 'can't vote' people who are aged 18 and over: list kage ksex kjbstat kjbrgsc kmastat if (kage >= 18 & kvote==11) sort pid sav $path9\m1.dta, replace use pid plbornc using $path3\xwavedat.dta, clear sort pid merge pid using $path9\m1.dta tab _merge keep if _merge==2 | _merge==3 list kage ksex kjbstat kjbrgsc kmastat plbornc if (kage >= 18 & kvote==11) * => after a bit of detection work, we see the people who can't vote are mostly overseas born ** The 'assert' command is a useful shorthand, but is easily replicated by * simpler tables and the 'if' or 'by' conditions tab kjbrgsc kjbstat, missing table kvote if kage >=16, c(min kage max kage mean kage) ** J Scott Long (2009) comments that some users swear by Stata's 'assert' command, * and other's hardly exploit it at all. * Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press. * * One relevant consideration is that if an 'assert' command finds any contradications, * Stata responds as if it were an error and stops processing a file - so using 'assert' * can trip up a do file workflow. ** *******************************************************. *******************************************************. *******************************************************. ** 3) Understanding complex data - Graphics options *******************************************************. ********************** *** i) Using tabplot to visualise complex categorical data ********************** use pid ksex kjbrgsc kjbsoc using $path3\kindresp.dta, clear numlabel _all, add tab1 ksex kjbrgsc kjbsoc ** [Installation: might not be necessary] net set ado $path8 adopath + $path8 net from http://fmwww.bc.edu/RePEc/bocode/t net describe tabplot net install tabplot man tabplot ** [Tabplot is a user-generated extension command so needs to be installed onto * the particular version of Stata being used - see also a description in lab0.do] ** Conventionally tabplot is useful for reviewing a 2-way cross-tabulation: mvdecode kjbrgsc , mv(-9/-1) codebook kjbrgsc, compact tabplot kjbrgsc ksex ** For a single detailed variable, tabplot can also be manipulated to be quite helpful: tab kjbsoc capture drop soc90maj gen soc90maj=floor(kjbsoc/100) label variable soc90maj "SOC90 Major group" capture drop soc90res gen soc90res=kjbsoc - 100*(floor(kjbsoc/100)) label variable soc90res "SOC90 trailing 2 digits" mvdecode soc90maj, mv(-9/-1) tab1 soc90maj soc90res tab kjbsoc if ksex==2 tabplot soc90res soc90maj if ksex==2 , height(5) /// title("Occupational distribution of women, BHPS wave 11") ** This shows in detail where (female) jobs are spread in the SOC90 variable * ('height(5)' is used to make the 'clumpiness' of the female occupational * distribution more apparent) ** ********************** *** ii) Using 'graph save' and graph combine to compare multiple related measures ********************** use pid qsex qage qfrna qfrnb qfrnc qregion /// using $path3\qindresp.dta, clear codebook, compact numlabel _all, add tab1 qfrn* mvdecode qage qfrn*, mv(-9/-1) codebook, compact histogram qfrna, frequency normal title("Talks to neighbours") graph save g1, replace histogram qfrnb, frequency normal title("Frequency of meeting people") graph save g2, replace histogram qfrnc, frequency normal title("Talks to anyone at all") xlabel("") graph save g3, replace graph combine g1.gph g2.gph g3.gph, cols(3) ycommon title("BHPS 2007") graph save c1, replace histogram qfrna if qregion==18, frequency normal title("Talks to neighbours") graph save g4, replace histogram qfrnb if qregion==18, frequency normal title("Frequency of meeting people") graph save g5, replace histogram qfrnc if qregion==18, frequency normal title("Talks to anyone at all") xlabel("") graph save g6, replace graph combine g1.gph g2.gph g3.gph, cols(3) ycommon title("BHPS 2007, Scotland only") graph save c2, replace histogram qfrna if qage >= 60, frequency normal title("Talks to neighbours") graph save g7, replace histogram qfrnb if qage >= 60, frequency normal title("Frequency of meeting people") graph save g8, replace histogram qfrnc if qage >= 60, frequency normal title("Talks to anyone at all") xlabel("") graph save g9, replace graph combine g1.gph g2.gph g3.gph, cols(3) ycommon title("BHPS 2007, Aged 60+ only") graph save c3, replace histogram qage, title("BHPS respondent's age") color(gs14) graph save g10, replace graph hbar (mean) qage, title("Frequency of talking to neighbours", span) /// over(qsex) over(qfrna) bar(1, color(gs9)) graph save g11, replace graph combine g1.gph g2.gph g3.gph g4.gph g5.gph g6.gph g7.gph g8.gph g9.gph /// g10.gph g11.gph , cols(3) graph combine c1.gph c2.gph c3.gph, cols(1) graph export $path9\social_graph1.emf, as(emf) replace graph combine g11.gph g10.gph , cols(1) title("Age and social contacts") note("Souce: BHPS 2007.") graph export $path9\social_graph2.emf, as(emf) replace ** These are small scale examples of how Stata can hold graphs in memory then add them * into combined plots ** This is useful in many data management operations, typically when you are holding * in mind quite a lot of candidate alternative measures of a similar thing. *******************************************************. *******************************************************. *******************************************************. ** 4) Variables: Deriving variables with egen *******************************************************. ** The 'egen' command ('extensions to generate') supports numerous * useful variable derivations, the most important of which are: * - derivation of summary statistics across subgroups of another variable * - derivation of statsitics summarising values across responses on several variables ** Some examples: use pid kage ksex kvote kvote3 kvote8 kjbstat kjbrgsc using $path3\kindresp.dta, clear numlabel _all, add codebook, compact ** Summary statistic across groups: tab kage recode kage -9=.m table kvote, c(mean kage) egen voteage=mean(kage) if kvote > 0, by(kvote) gen relvage=kage - voteage graph hbar (mean) relvage, over(kjbrgsc) over(ksex) /// title("Own age relative to average age of voters in favoured party", span) ** Summaries of responses on several variables: tab1 kvote kvote3 kvote8 egen vnat=eqany(kvote kvote3 kvote8), values(4 5 16) label variable vnat "Any of voting preferences are for nationalist party" tab kvote3 vnat *******************************************************. *******************************************************. *******************************************************. ** 5) Variables: Value label extensions *******************************************************. use pid kage ksex kjbrgsc kjbsoc using $path3\kindresp.dta, clear codebook, compact ***** *** i) Using an external do file to define value labels tab kjbsoc ** This measure is the UK's 'Standard Occupational Group 1990' ** Value label definitions are available in a macro which is also available online: ** http://www.camsis.stir.ac.uk/occunits/do_files/soc90_labels.do do $path8\soc90_labels.do label values kjbsoc soc90l numlabel _all, add tab kjbsoc ***** ***** *** ii) Techniques to transfer value labels between variables * The 'clonevar' command suits this purpose tab kjbsoc gen soc2=kjbsoc clonevar soc3=kjbsoc tab1 soc2 soc3 * We can manually identify the variable label item and add it to soc3: describe kjbsoc label values soc2 soc90l /* Because we saw the 'soc90l' item gives the labels */ tab soc2 *******************************************************. *******************************************************. *******************************************************. ** 6) Variables: Reviewing measures using Statsby *******************************************************. ** We often have several candidate measures of the same concept which we * want to evaluate; Stata is generally well suited to this sort of operation, * and in particular the 'statsby' command can be usefully used for this purpose ****** *** i) Stata being generally very efficient for testing several related variables: ****** use pid qsex qage qvote* qjbrgsc qjbgold qjbsoc qregion using $path3\qindresp.dta, clear numlabel _all, add tab qregion keep if qregion >= 1 & qregion <= 18 ** Research question: How is voting influenced by current occupational class category in Britain ** To approach this, we can try out different voting outcomes and different class measures ** Derive 3 voting measures tab qvote3 recode qvote4 -9=.m -8=.i -7=.p -2=.r -1=.d 11/17=.e tab qvote4 gen convot=(qvote4==1) if !missing(qvote4) gen labvot=(qvote4==2) if !missing(qvote4) gen leftvot=(qvote4==2 | qvote4==3 | qvote4==6) if !missing(qvote4) summarize *vot ** Derive 5 class measures (excluding the armed forces in this example) tab1 qjbrgsc qjbgold qjbsoc recode qjbrgsc -9=.m -8=.i -1=.d 7=.a gen rgsc2=(qjbrgsc==1 | qjbrgsc==2) if !missing(qjbrgsc) recode qjbgold -9=.m -8=.i -1=.d replace qjbgold=.a if qjbrgsc==.a gen gold2=(qjbgold==1 | qjbgold==2) if !missing(qjbgold) recode qjbsoc -9=.m -8=.i -1=.d replace qjbsoc=.a if qjbrgsc==.a gen socm=floor(qjbsoc/100) if !missing(qjbsoc) tab1 qjbrgsc rgsc2 qjbgold gold2 socm ** Other measures: tab qsex tab qage recode qsex -7=.p recode qage -9=.m ** Analytical file summarize convot labvot leftvot qage qsex qjbrgsc rgsc2 qjbgold gold2 socm egen nmiss=rmiss(convot labvot leftvot qage qsex qjbrgsc rgsc2 qjbgold gold2 socm) tab nmiss keep if nmiss==0 summarize convot labvot leftvot qage qsex qjbrgsc rgsc2 qjbgold gold2 socm *** Analysis taking advantage of Stata's 'est store': xi:logit convot qage i.qsex i.qjbrgsc est store con1 xi:logit convot qage i.qsex i.rgsc2 est store con2 xi:logit convot qage i.qsex i.qjbgold est store con3 xi:logit convot qage i.qsex i.gold2 est store con4 xi:logit convot qage i.qsex i.socm est store con5 xi:logit labvot qage i.qsex i.qjbrgsc est store lab1 xi:logit labvot qage i.qsex i.rgsc2 est store lab2 xi:logit labvot qage i.qsex i.qjbgold est store lab3 xi:logit labvot qage i.qsex i.gold2 est store lab4 xi:logit labvot qage i.qsex i.socm est store lab5 xi:logit leftvot qage i.qsex i.qjbrgsc est store left1 xi:logit leftvot qage i.qsex i.rgsc2 est store left2 xi:logit leftvot qage i.qsex i.qjbgold est store left3 xi:logit leftvot qage i.qsex i.gold2 est store left4 xi:logit leftvot qage i.qsex i.socm est store left5 est table con1 con2 con3 con4 con5, star stats(r2_p ll bic N) b(%9.3g) est table lab1 lab2 lab3 lab4 lab5, star stats(r2_p ll bic N) b(%9.3g) est table left1 left2 left3 left4 left5, star stats(r2_p ll bic N) b(%9.3g) ** Description: soc90 major groups give the best explanation for tory and leftist voting; * goldthorpe scheme gives the best explanation for labour voting. ** Note above: we claimed that Stata is generally very efficient for this type * of comparison. Although the above is quite extended, imagine how long the same * evaluation could take to perform in some other packages. ****** *** ii) Using 'statsby' for the same style of comparison ****** ** Using the same data, we can use statsby (which performs the same analysis * across different groups of data) to achieve the same outcome by merging/combining data: sav $path9\votes.dta, replace ** We'll use some macros to reduce the typing a little: capture program drop mac1 program define mac1 use $path9\votes.dta, clear keep $dep $expl qage qsex rename $dep outcome rename $expl expl gen measure=$numb sav $path9\vote$numb.dta, replace end * global dep "convot" global expl "qjbrgsc" global numb "11" mac1 * global dep "convot" global expl "rgsc2" global numb "12" mac1 * global dep "convot" global expl "qjbgold" global numb "13" mac1 * global dep "convot" global expl "gold2" global numb "14" mac1 * global dep "convot" global expl "socm" global numb "15" mac1 ** * global dep "labvot" global expl "qjbrgsc" global numb "21" mac1 * global dep "labvot" global expl "rgsc2" global numb "22" mac1 * global dep "labvot" global expl "qjbgold" global numb "23" mac1 * global dep "labvot" global expl "gold2" global numb "24" mac1 * global dep "labvot" global expl "socm" global numb "25" mac1 ** * global dep "leftvot" global expl "qjbrgsc" global numb "31" mac1 * global dep "leftvot" global expl "rgsc2" global numb "32" mac1 * global dep "leftvot" global expl "qjbgold" global numb "33" mac1 * global dep "leftvot" global expl "gold2" global numb "34" mac1 * global dep "leftvot" global expl "socm" global numb "35" mac1 ** dir $path9\vote*.dta ** use $path9\vote11.dta, clear append using $path9\vote12.dta append using $path9\vote13.dta append using $path9\vote14.dta append using $path9\vote15.dta append using $path9\vote21.dta append using $path9\vote22.dta append using $path9\vote23.dta append using $path9\vote24.dta append using $path9\vote25.dta append using $path9\vote31.dta append using $path9\vote32.dta append using $path9\vote33.dta append using $path9\vote34.dta append using $path9\vote35.dta summarize tab measure * i.e., we have 15 replications of the data with different explanatory variables xi: statsby rsquare=e(r2_p) ll=e(ll) llnull=e(ll_0) numb=e(N) , /// saving($path9\results.dta, replace) by(measure): /// logit outcome qage i.qsex i.expl use $path9\results.dta, clear label define measurel /// 11 "Convot-RGSC" 12 "Convot-RGSC2" 13 "Convot-GOLD" 14 "Convot-GOLD2" 15 "Convot-SOCM" /// 21 "Labvot-RGSC" 22 "Labvot-RGSC2" 23 "Labvot-GOLD" 24 "Labvot-GOLD2" 25 "Labvot-SOCM" /// 31 "Leftvot-RGSC" 32 "Leftvot-RGSC2" 33 "Leftvot-GOLD" 34 "Leftvot-GOLD2" 35 "Leftvot-SOCM" label values measure measurel graph hbar (mean) rsquare, over(measure) gen llred=ll-llnull graph hbar (mean) llred, over(measure) title("Reduction in LL") ** Comments: Again, at first sight this is a long process, but in practice it is * getting a great deal done very quickly. * The graphs make it clear that there is a loss of information from dichotomising * class measures. *******************************************************. *******************************************************. *******************************************************. ** 7) Metadata: Labelling files *******************************************************. ** There are several features to metadata on Stata format files * datasets that are not widely exploited **** Labelling files: use pid aage asex ajbstat ajbsoc aqfedhi using $path3\aindresp.dta, clear describe notes datasignature gen fem=(asex==2) drop asex order pid fem aage aqfedhi ajbsoc ajbstat /* Changes file order of data */ summarize label data "PL: Subset from BHPS wave 1 'indresp' file, 24 Aug 2009" notes: Famously 10264 cases, but that includes 352 proxies notes: Must remember to include weights next time notes ajbsoc: Standard Occupational Classification 3-digit (371 units) sav $path9\temp.dta, replace use $path9\temp.dta describe notes codebook ajbsoc, notes datasignature * Comment: Notice how 'describe' highlights that there are some notes on this file. * The 'label data' option is useful as it's label is often displayed * Adding notes to variables can be helpful way to add details with them * cluttering up the variable label * The 'datasignature' is potentially important for sorting out a mess * (any two files with even a slight difference (e.g. variable names, values * number of cases, etc, won't share the same datasignature) * There is an extended section on the value of labelling data like this * in Long (2009: c5). *******************************************************. *******************************************************. *******************************************************. *******************************************************. ** End log file: capture log close *******************************************************. *******************************************************. *******************************************************. * EOF