*******************************************************. *** Data Management through e-Social Science ** ** www.dames.org.uk ** ** ** ESRC DAMES Research Node, data management training programme: ** ** ** Workshop of 24-25 August 2009 ** Data Management for Social Survey Research ** ** ** LAB 2: ADVANCED DATA MANAGEMENT WITH STATA ** SUBFILE: VARIABLE_OPERATIONALISATIONS.DO ** ** ** www.dames.org.uk ** Paul Lambert / Vernon Gayle, 24 August 2009 *******************************************************. *******************************************************. *******************************************************. ****** LAB 2 SUB-FILE: VARIABLE_OPERATIONALISATIONS.DO *******************************************************. *******************************************************. **********************************************************. *****************************************************. ** ** *** This file illustrates a small selection of functions relevant to good data management ** in Stata which we have highlighted because they are both useful ** and because they are not always as widely known as they might be ** ** *** Most of these functions already feature somewhere in the file 'lab0.do', ** but here we add a few extensions ** ** The examples given below cover: ** ** 1) Standard categories: Education and occupations ** 2) Handling variables: More on names and labels ** 3) Scaling categories: Effect proportional scaling ** 4) Standardisation: Examples for metrics and categories ** 5) Interpreting coefficients: Standard errors and Quasi-Variance ** ** **********************************************************. *******************************************************. *******************************************************. *******************************************************. ** NOTIFICATION OF FILE LOCATIONS / DIRECTORIES AND STATA SETUP ** ** ** ** i) File location declarations: *** For the commands below to work, you should begin by running the following ** macros, which tell Stata where to look for the relevant data files (mentioned ** above) on your machine : . global path1 "d:\dames09\work\" * (the location of your working directory - where you will save * newly created data files and output) . global path3 "d:\dames09\data\bhps\w1to15\" * (the location of a folder where you have saved the BHPS microdata - * Stata format files from UKDA Study number 5151) . global path8 "d:\dames09\macros\" * (a folder for storing macros use or created during this lab: global path8b "d:\dames09\macros\occs\" * a subfolder of macros, featuring tools used when analysing occupations: * iscoisei.ado * casoc_isco.do * gb91soc90.dta * gb91isco88.dta * * (all available for download from: www.dames.org.uk/workshops) global path9 "d:\dames09\temp\" * (a location of a temporary folder where you can save intermediate files) . clear set mem 200m ** For the Stirling 2009 workshop: the data and files all ought to be in the locations as above ** ** If you are running these sessions elsewhere: you need to carefully substitute your own path ** locations as appropriate into the commands above . ********************** ********************** ************************************************************** ************************************************************** ************************************************************** ************************************************************** *******************************************************. ** 1) Standard categories: Education and occupations *******************************************************. **************************** *** a) Occupational categories **************************** **** a1) Linking SOC2000 and employment status with several popular occupation-based social classifications ** SOC2000 is the current UK standard classification for occupational titles, see e.g.: ** http://www.geode.stir.ac.uk/ougs.html#soc2000 use pid qsex qage qjbsoc00 qjbsemp using $path3\qindresp.dta, clear tab qjbsoc00 * This is a typical survey with some cases with soc2000 codes ** At the CAMSIS website (and also on the GEODE server www.geode.stir.ac.uk) it is possible to * download a data file with information on soc2000 units cross-classified by employment status. use $path8b\gb91soc2000.dta, clear summarize tab1 soc2000 ukempst * These will be the key linking variables, we'll need data in both their codes: ** http://www.geode.stir.ac.uk/ougs.html#soc2000 ** http://www.geode.stir.ac.uk/ougs.html#ukempst sort soc2000 ukempst sav $path9\m1.dta, replace ** BHPS survey data with SOC2000 on it and other data relevant to employment status: use pid qsex qage qjbsoc00 qjbsemp qjbmngr qjsboss qjssize using $path3\qindresp.dta, clear numlabel _all, add * we can use qjbsemp qjbmngr qjsboss qjssize to get employment status measure tab1 qjbsemp qjbmngr qjsboss qjssize gen ukempst=0 replace ukempst=7 if (qjbsemp==1 & qjbmngr==3) replace ukempst=6 if (qjbsemp==1 & qjbmngr==2) replace ukempst=5 if (qjbsemp==1 & qjbmngr==1) replace ukempst=3 if (qjbsemp==2 & qjsboss==2) replace ukempst=2 if (qjbsemp==2 & qjsboss==1 /// & (qjssize==1 | qjssize==2 | qjssize==3 | qjssize==10) ) replace ukempst=1 if (qjbsemp==2 & qjsboss==1 & (qjssize >= 4 & qjssize ~= 10) ) tab ukempst ** Note that we could do the link on the soc2000 variable if we pretended that we * didn't know the current employment status (ie ukempst=0); * however, it is preferable to derive the variable from other information on the BHPS ** Now we have our key linking measures: summarize qjbsoc00 ukempst gen soc2000 = qjbsoc00 sort soc2000 ukempst merge soc200 ukempst using $path9\m1.dta tab _merge keep if _merge==1 | _merge==3 /* Keeps the survey microdata only */ summarize histogram mcamsis tab1 rgsc ns_sec ** Summary: * These are new occupation-based social classifications we've added to our data by * exploiting the soc2000 and ukempst standard categories ** ********* ********* **** a2) Coding ISCO88 into ISEI ** ISCO88 is a widely used standard classification for occupational titles, see e.g.: ** http://www.geode.stir.ac.uk/ougs.html#isco88 ** A popular scale of occupational advantage is called 'ISEI': ** International Socio-Economic Status Index *** It has been generated and linked to ISCO88 units by: * ** Ganzeboom, H. B. G. (2008). Tools for deriving status measures from ISKO-88 and ISCO-68. * Retrieved 1 March, 2008, from http://home.fsw.vu.nl/~ganzeboom/PISA/ * * and stata format translation files for this scale are published by: * * Hendrickx, J. (2002). ISKO: Stata module to recode 4 digit ISCO-88 occupational codes. Boston * College Department of Economics: Statistical Software Components S425802, revised 20 Oct 2004. * ** ** To use these resources to code your own ISCO data into ISEI, use for example: use pid qsex qjbisco qjbcssm using $path3\qindresp.dta, clear summarize * In the BHPS, ISCO is a string format, but is readily converted to numeric: destring qjbisco, replace summarize keep if (qsex==1 | qsex==2) & qjbisco >= 1 & qjbcssm >= 1 /* Listwise deletion for convenience */ summarize capture do $path8b\iskoisei.ado /* Serves to define the iskoisei macro */ summarize qjbisco iskoisei qjbisei, isko(qjbisco) /* Syntax required by the iskoisei macro */ summarize graph bar (mean) qjbisei qjbcssm , asyvars over(qsex) * * (The measures aren't on the same scale so it is good to standardise - see later section below) **************************** *** b) Educational categories **************************** ************** ** Question: How do we recode educational qualifications data * ** There are various publications where schemas for educational qualifications categories * are proposed or documented * ** Here's an example from the BHPS, which uses data on age at which completed school, and * highest educational qualification recorded at point of interview: ************** ** Retrieve age-left-school data from cross-wave file: use pid scend using $path3\xwavedat.dta, clear sort pid sav $path9\m1.dta, replace ** Retrieve highest qualifications data (and job info) from individual file (at wave q) use pid qage qsex qqfedhi qqfachi qqfvoc qjbcssm using $path3\qindresp.dta, clear sort pid ** Merge the files and save the combined data merge pid using $path9\m1.dta tab _merge keep if _merge==1 | _merge==3 drop _merge numlabel _all, add tab1 qqfedhi qqfachi qqfvoc label data "BHPS qualifications extract from wave 17 (2007)" sav $path9\bhps_educ_extract.dta, replace ************ Recoding illustration: Derived educational qualifications measures use $path9\bhps_educ_extract.dta, clear codebook, compact * (These are the easy-to-access educational measures from the BHPS; * other information on qualifications is available but may need linkage between waves) ***** 1) An approximation to ISCED (International Standard Classification of Education): ** This uses Table 2 from: * * Brynin, M. (2003). Using CASMIN: The Effect of Education on Wages in Britain and Germany. * In J. H. P. Hoffmeyer-Zlotnik & C. Wolf (Eds.), Advances in Cross-National Comparison: * A European Working Book for Demographic and Socio-Economic Variables (pp. 327-344). * New York: Kluwer Academic. * capture label drop iscedl label define iscedl 11 "1a Incomplete" 12 "1b Elementary" 13 "1c Basic vocational" /// 21 "2a Intermediate vocational (+ intermediate general)" /// 22 "2b Intermediate general" 23 "2c General: General maturity certificate" /// 24 "2d Vocational: Vocational maturity (+with general maturity)" /// 31 "3a Lower tertiary" 32 "3b Higher tertiary" tab1 qqfedhi qqfachi qqfvoc scend tab qqfedhi qqfachi, missing capture drop isced gen isced=-9 replace isced=11 if qqfedhi==12 & scend <= 11 replace isced=12 if qqfedhi==12 & ((scend >= 12 & scend <= 16) | scend==. | (scend >= 17 & scend <= 19)) /// | (qqfedhi==11 & qqfachi==7) replace isced=13 if qqfedhi==8 | (qqfvoc==1 & (qqfachi==6 | qqfachi==7)) | (qqfedhi==4 | qqfedhi==5) /// | qqfedhi==9 | qqfachi==6 | (qqfedhi==11 & qqfachi==7 & qqfvoc==1) replace isced=22 if (qqfedhi==7 | qqfachi==5 ) replace isced=21 if qqfedhi==10 | (qqfedhi==7 & qqfvoc==1) | (qqfvoc==1 & (qqfedhi==4 | qqfedhi==5)) replace isced=23 if qqfachi==4 | qqfedhi==6 | (qqfedhi==4 & qqfachi==4) | (qqfedhi==4 & qqfachi==5) replace isced=24 if (qqfvoc==1 & (qqfedhi==6 | (qqfedhi==4 & qqfachi==4) | (qqfedhi==4 & qqfachi==5))) /// | ((qqfedhi==4 | qqfedhi==5) & qqfachi==4) replace isced=31 if qqfachi==3 replace isced=32 if qqfedhi==1 | qqfedhi==2 | qqfachi==2 label values isced iscedl tab isced tab isced qqfedhi bysort isced: tab qqfedhi qqfachi, missing mvdecode isced, mv(-9) tab isced, missing mvdecode scend, mv(-9/-1) summarize scend table isced, c(mean scend n scend) ****** 2) A convenient BHPS oriented classification which reduces age correlation * [This scheme as advocated by Paul Lambert, Univ. Stirling, August 2009] * gen educ4=qqfedhi recode educ4 1/2=1 3/5=2 6 7 8 10 11=3 9 12=4 *=-9 label variable educ4 "BHPS 4-fold educational level classification" label define educ4l 1 "Degree" 2 "Diploma" /// 3 "Vocational or higher school level" 4 "Low school level or below" label values educ4 educ4l numlabel _all, add mvdecode educ4, mv(-9) tab qqfedhi educ4 tab educ4 *********************************************** * Review analysis: tab educ4 isced, V * char _dta[omit] "prevalent" * Technical precursor: ensures the reference category on xi below is the largest group **check on age cohort patterns mvdecode qage qqfedhi, mv(-9/-1) summarize qage qqfedhi educ4 isced xi:regress qage i.qqfedhi if qage >= 22 est store fedhi xi:regress qage i.educ4 if qage >= 22 est store educ4 * (age correlation has halved from fedhi to educ4) xi:regress qage i.isced if qage >= 22 est store isced * (age correlation for isced is nearly as high as fedhi) ** Construct validity test - correlation to occupational advantage mvdecode qjbcssm, mv(-9/-1) summarize qjbcssm qage qqfedhi educ4 isced xi:regress qjbcssm qage i.qqfedhi if qage >= 22 est store fedhio xi:regress qjbcssm qage i.educ4 if qage >= 22 est store educ4o * (age correlation has halved from fedhi to educ4) xi:regress qjbcssm qage i.isced if qage >= 22 est store iscedo est table fedhi educ4 isced fedhio educ4o iscedo, stats(N r2 ll bic) star b(%8.3g) ** Summary: * ISCED is useful because it is a recognised standard used in other countries * educ4 is attractive because it it simple and not as strongly correlated to age, * whilst having a comparable level of explanatory power on occupations as others. ************* ** Comment on education measures: * ** In due course, in DAMES we are generating an extended databank of information on * ways of treating educational qualifications... ** See www.dames.org.uk ************* ************************************************************** ************************************************************** ************************************************************** ************************************************************** *******************************************************. ** 2) Handling variables: More on names and labels *******************************************************. *** Long (2009) has an extended section on the naming and labelling of variables ** He argues that this is often neglected in projects, but variable names and labels * provide critical data on which the efficiency and replication of most projects hinges. *************************** ** i) Improving metadata on variable labels *************************** use pid aage asex ajbrgsc ajbsoc using $path3\aindresp.dta, clear tab aage asex if aage < 25 codebook, compact ** The full label is a bit long - it'll be truncated in some outputs, so it's pref. to shorten it ** Also, the full label doesn't note that this is at Wave A label variable aage "Age:1991 intv" histogram aage tab aage asex if aage < 25 note aage: "Age at date of interview, BHPS wave A (1991)" codebook, compact codebook aage, notes *************************** ** ii) Checking on suitable/truncated value labels *************************** use pid aage asex ajbrgsc ajbmix ahlprb1 aophla ajbsoc ajlsoc using $path3\aindresp.dta, clear codebook, problems * This higlights three variables where some categories have value labels and others don't ** In many outputs, long value labels will be truncated, and in some instances * you may loose differentiation by categories on this basis labelbook * Note the check 'unique at..' * If either of these say 'no', the labels are sub-optimal * (The underlined codes are the first digits that the label is often abbreviated to - * if some of these are not unique, then detail will be lost) * see the example of the variable ajbmix labelbook ajbmix * The labels for categories 1 and 5 are suboptimal, e.g. mvdecode ajbrgsc ajbmix, mv(-9/-1) tab ajbrgsc ajbmix *************************** ** iii) Adding variable names to variable labels *************************** ** A common complaint about Stata outputs is that there is no way to * attach the variable name to the variable label (this is a problem because * many outputs only list the variable label, when it could be helpful to know the name) use pid aage asex ajbrgsc ajbsoc using $path3\aindresp.dta, clear sort pid sav $path9\m1.dta, replace use pid qage qsex qjbrgsc qjbsoc using $path3\qindresp.dta, clear sort pid merge pid using $path9\m1.dta tab _merge numlabel _all, add keep if _merge==3 tab1 *rgsc recode ajbrgsc -9=.m -8=.i recode qjbrgsc -9=.m -8=.i tab1 *rgsc tab ajbrgsc qjbrgsc * Note - it's not clear which variable is for 1991 and which one for 2007 on this table ** Because this is an FAQ, user workarounds have been published, such as the macro * called below, which is taken from p158-9 of Long (2009) do $path8\add_varnames_to_labels.do tab ajbrgsc qjbrgsc describe * For all variable in the current file, this macro has added variable names to the * corresponding variable label *************************** *************************** ** iv) Choosing variable names carefully *************************** **** Stata's * 'wildcard' indicator is useful in variable lists, but only * if the variable names are chosen well use aage aregion using $path3\aindresp.dta, clear numlabel _all, add tab aregion gen ilond=aregion==1 gen olond=aregion==2 gen mersey=aregion==10 * ..etc ** These dummy variables are mneumonic but will have to be typed in full to summarize them, e.g.: pwcorr aage ilond olond mersey if aage >= 16, star(0.05) ** If we choose structured names the analysis will be easier: gen r_ilond=aregion==1 gen r_olond=aregion==2 gen r_mersey=aregion==10 pwcorr aage r_* if aage >= 16, star(0.05) ** In fact, the generate command for dummy indicators is convenient here: tab aregion, generate(rd_) pwcorr aage rd_* if aage >= 16, star(0.05) *************************** ** v) Exporting variable metadata *************************** ** Labelbook gives a full listing of variable and value labels, which can be * logged in a log file or do file ** This is very important for documentation and transferability..! use pid aage asex ajbrgsc ajbsoc aregion using $path3\aindresp.dta, clear do $path8\soc90_labels.do label values ajbsoc soc90l labelbook, detail ** If data needs to be exported in plain text format (loosing Stata's metadata), * these logs can be attached to provide the suitable information use pid aage asex ajbrgsc ajbsoc aregion using $path3\aindresp.dta, clear do $path8\soc90_labels.do label values ajbsoc soc90l capture log close log using $path9\plain_codes.txt, replace text labelbook, detail log close outsheet using $path9\plain_data.dat, nolabel replace dir $path9\plain* ************************************************************** ************************************************************** *******************************************************. ** 3) Scaling categories: Effect proportional scaling *******************************************************. ***** 'Effect proportional scaling' is a convenient term to describe assigning * scores to categories on the basis of some other relevant information **** Example: Analysing the effects of educational qualifciations on health outcomes; use pid qage qsex qqfedhi qqfachi qqfvoc qjbcssm qhlstat qsmoker qtenure /// using $path3\qindresp.dta, clear sort pid numlabel _all, add tab qqfedhi recode qqfedhi -9=.m -7=.p 13=.y ** Educational data is a little complex, there may be other useful measures * Example: a simple categorical recode gen educ4=qqfedhi recode educ4 1/2=1 3/5=2 6 7 8 10 11=3 9 12=4 *=-9 label variable educ4 "BHPS 4-fold educational level classification" label define educ4l 1 "Degree" 2 "Diploma" /// 3 "Vocational or higher school level" 4 "Low school level or below" label values educ4 educ4l numlabel _all, add mvdecode educ4, mv(-9) tab qqfedhi educ4 ** Example: Effect proportional scaling using occupational advantage tab qqfedhi summarize qjbcssm mvdecode qjbcssm, mv(-9/-1) table qqfedhi, c(mean qjbcssm n qjbcssm) table qqfedhi if qage >= 20, c(mean qjbcssm n qjbcssm) capture drop educ_cs egen educ_cs=mean(qjbcssm) if qqfedhi >= 1 & qqfedhi <= 12, by(qqfedhi) label variable educ_cs "Educ scale:occs " notes educ_cs: Scaling using average Cambridge scale by holders of qualifications summarize qqfedhi educ4 educ_cs graph twoway (scatter educ_cs educ4 , mlabel(qqfedhi)) , xscale(range(1 6)) ** Example: Effect proportional scaling using several factors tab qqfedhi tab qtenure gen owner=qtenure==1 | qtenure==2 if qtenure >= 1 tab qsex gen fem=qsex==2 if qsex >= 1 tab qage mvdecode qage, mv(-9/-1) summarize qage fem qjbcssm owner * Uses 'stereotyped ordered regression' to obtain category scores for qualifications slogit qqfedhi qage fem qjbcssm owner ** The below uses codes from Slogit results that we transcribed earlier capture drop educ_sl gen educ_sl=qqfedhi recode educ_sl 1=100 2=89 3=70 4=49 5=37 6=49 7=37 8=36 9=23 10=1 11=39 12=0 *=. tab educ_sl label variable educ_sl "Educ scale: slogit" note educ_sl: Scaling of eduational qualifications using occupations, tenure, age and gender summarize qqfedhi educ4 educ_cs educ_sl graph twoway (scatter educ_cs educ4 , mlabel(qqfedhi)) /// (scatter educ_sl educ4 , mlabel(qqfedhi) mlcolor(purple) ) , xscale(range(1 6)) correlate educ_cs educ_sl * Comment: scaling we've used differs, but the correlations of the two scores are very high.. tab qhlstat recode qhlstat -9=.m -1=.n summarize qhlstat qage fem qqfedhi educ4 educ_cs educ_sl xi:regress qhlstat qage fem i.qqfedhi est store fedhi xi:regress qhlstat qage fem i.educ4 est store educ4 regress qhlstat qage fem educ_cs est store educ_cs regress qhlstat qage fem educ_sl est store educ_sl est table fedhi educ4 educ_cs educ_sl, stats(r2 ll bic N) star b(%8.3g) ** Comment - the effect proportional scales are favoured by the BIC statistic ************************************************************** ************************************************************** *******************************************************. ** 4) Standardisation: Examples for metrics and categories *******************************************************. ****** This note features comments on standarising metric variables; on * standardising categorical variables; and on using 'offsets' **************** ******** a) Metric stadardisation **************** *** We assert three things are true about metric standardisations: ** - they're very easy to do ** - they're usually necessary for comparative research ** - they're usually not implemented nevertheless, except by economists! ** Example of standardising: * Two alternative stratification scales, from 17 years apart): use pid qsex qjbisco qjbcssm qjbsat using $path3\qindresp.dta, clear summarize renpfix qjb jb rename qsex sex * In the BHPS, ISCO is a string format, but is readily converted to numeric: destring jbisco, replace summarize keep if (sex==1 | sex==2) & jbisco >= 1 & jbcssm >= 1 /* Listwise deletion for convenience */ summarize capture do $path8b\iskoisei.ado /* Serves to define the iskoisei macro */ summarize jbisco iskoisei jbisei, isko(jbisco) /* Syntax required by the iskoisei macro */ graph bar (mean) jbisei jbcssm , asyvars over(sex) gen year=2007 summarize sav $path9\m1.dta, replace * use pid asex ajbisco ajbcssm ajbsat using $path3\aindresp.dta, clear summarize renpfix ajb jb rename asex sex * In wave A, to cover ISCO to numeric format requires a further macro: do $path8b\casoc_isco.do iscoca, isco(jbisco) destring jbisco, replace summarize keep if (sex==1 | sex==2) & jbisco >= 1 & jbcssm >= 1 /* Listwise deletion for convenience */ summarize capture do $path8b\iskoisei.ado /* Serves to define the iskoisei macro */ summarize jbisco iskoisei jbisei, isko(jbisco) /* Syntax required by the iskoisei macro */ graph bar (mean) jbisei jbcssm , asyvars over(sex) gen year=1991 summarize sav $path9\m2.dta, replace use $path9\m1.dta, clear append using $path9\m2.dta tab year graph bar (mean) jbisei jbcssm , asyvars over(sex) over(year) * When we produced these graphs earlier, * we noted that within years the measures aren't on the same scale so it is better to standardise; * Now, with data combined across years, we see in any case increases in occupation-based rankings * over time which we would want to control for by standardisation ** Calculate standardised scores for national populations: capture drop mean1 capture drop sd1 egen mean1=mean(jbisei), by(year) egen sd1=sd(jbisei), by(year) capture drop zisei gen zisei= 50 + 15*((jbisei - mean1)/sd1) label variable zisei "ISEI (standardised to mean 50, sd 15 within years)" capture drop mean1 capture drop sd1 egen mean1=mean(jbcssm), by(year) egen sd1=sd(jbcssm), by(year) capture drop zcssm gen zcssm= 50 + 15*((jbcssm - mean1)/sd1) label variable zcssm "CSSM (standardised to mean 50, sd 15 within years)" graph bar (mean) zisei zcssm , asyvars over(sex) over(year) * This is a much fairer comparison by gender (CAMSIS scores higher for women; ISEI higher for men); * and it will give more meaningful analytical results in other comparisons over time numlabel _all, add tab jbsat recode jbsat -9 -7 -1=.m -8 0=.i gen y2007=(year==2007) gen fem=(sex==2) summarize jbsat fem y2007 jbisei jbcssm zisei zcssm program define jobmodel capture drop femjob gen femjob=fem*$jobvar regress jbsat fem y2007 $jobvar femjob end global jobvar "jbisei" jobmodel est store mod1 global jobvar "jbcssm" jobmodel est store mod2 global jobvar "zisei" jobmodel est store mod3 global jobvar "zcssm" jobmodel est store mod4 est table mod1 mod2 mod3 mod4, stats(r2 ll bic N) star b(%9.3g) **************** ******** b) Categorical standardisation **************** ** Standardisation of categories is usually approached 'by fiat' * (i.e., categories are felt to be the same if they have the same name) *** a) Harmonisation of categories should be approached by accessing documentation and * standard category definitions - see above, section 1. *** b) Sometimes however harmonised category definitions are impossible or unrealistic; * in those instances, effect proportional scaling of categories (see above) is * a more attractive means for standardising categories. ** For more on this, see * Lambert, P. S., Gayle, V., Bowes, A. M., Blum, J. M., Jones, S. B., * Sinnott, R. O., et al. (2009). Standards setting when standardizing categorical data. * Cologne, 24-26 June 2009: Paper presented to the Fifth International Conference on Social * Science Methodology, organised by GESIS and the National Centre for e-Social Science, * and http://www.dames.org.uk/publications.html. ** **************** ******** c) Using 'offsets' to assist standardisation **************** *for comparisons between regressions, it is sometimes suitable to force the coefficients of * some variables (e.g. controls) to have a certain fixed value across comparisons * * This can be done by constraining them to have 'offset' values ** Example: use pid aage asex ajbcssm atenure ahlstat using $path3\aindresp.dta, clear renpfix a z gen year=1991 sort pid sav $path9\m1.dta, replace use pid hage hsex hjbcssm htenure hhlstat using $path3\hindresp.dta, clear renpfix h z gen year=1998 sort pid sav $path9\m2.dta, replace use pid qage qsex qjbcssm qtenure qhlstat using $path3\qindresp.dta, clear renpfix q z gen year=2007 sort pid sav $path9\m3.dta, replace use $path9\m1.dta, clear append using $path9\m2.dta append using $path9\m3.dta numlabel _all, add summarize tab zsex gen fem=zsex==2 if zsex >= 1 tab ztenure gen owner=ztenure==1 | ztenure==2 if ztenure >= 1 tab zhlstat recode zhlstat -9=.m -1=.n summarize zage keep if zage >= 18 & zage <= 80 mvdecode zjbcssm, mv(-9/-1) summarize zhlstat fem owner zage zjbcssm tab year regress zhlstat fem owner zage zjbcssm if year==1991 est store y91 regress zhlstat fem owner zage zjbcssm if year==1998 est store y98 regress zhlstat fem owner zage zjbcssm if year==2007 est store y07 est table y91 y98 y07, stats(r2 ll bic N) star b(%9.3g) * It seems that over time, age is more important to self-rated health ** But, this comparison is difficult becuase the effects of gender, tenure, and job * are all being allowed to vary between waves as well; we can fix this with cnsreg: regress zhlstat fem owner zage zjbcssm est store xwave matrix define xwavmod=e(b) matrix list xwavmod scalar fem_coef=xwavmod[1,1] scalar owner_coef=xwavmod[1,2] scalar csm_coef=xwavmod[1,4] constraint def 1 fem=fem_coef constraint def 2 owner=owner_coef constraint def 3 zjbcssm=csm_coef cnsreg zhlstat fem owner zage zjbcssm if year==1991 , constraints(1 2 3) est store y91c cnsreg zhlstat fem owner zage zjbcssm if year==1998 , constraints(1 2 3) est store y98c cnsreg zhlstat fem owner zage zjbcssm if year==2007 , constraints(1 2 3) est store y07c est table xwave y91 y98 y07 y91c y98c y07c, stats(r2 ll bic N) star b(%9.3g) * In this instance, the offsets models seems to confirm the evidence of a * trend of growing influence of age on self-reported health over time ************************************************************** ************************************************************** *******************************************************. ** 5) Interpreting coefficients: Standard errors and Quasi-Variance *******************************************************. ** For this example we'll use the analysis of voting patterns which featured in the * file "lab_highlights.do" do $path1\voting\data_prep\voting_example_data.do summarize sav $path9\temp.dta, replace ** (This repeats the data preparation section from that file and saves to a temp location) ** Repeat the 15 models of interest: use $path9\temp.dta, clear xi:logit convot qage i.qsex i.qjbrgsc est store con1 xi:logit convot qage i.qsex i.rgsc2 est store con2 xi:logit convot qage i.qsex i.qjbgold est store con3 xi:logit convot qage i.qsex i.gold2 est store con4 xi:logit convot qage i.qsex i.socm est store con5 xi:logit labvot qage i.qsex i.qjbrgsc est store lab1 xi:logit labvot qage i.qsex i.rgsc2 est store lab2 xi:logit labvot qage i.qsex i.qjbgold est store lab3 xi:logit labvot qage i.qsex i.gold2 est store lab4 xi:logit labvot qage i.qsex i.socm est store lab5 xi:logit leftvot qage i.qsex i.qjbrgsc est store left1 xi:logit leftvot qage i.qsex i.rgsc2 est store left2 xi:logit leftvot qage i.qsex i.qjbgold est store left3 xi:logit leftvot qage i.qsex i.gold2 est store left4 xi:logit leftvot qage i.qsex i.socm est store left5 ** ************************************************************** *** Exploring these coefficients (i): ** In the below, our argument is that facility in handling and manipulating * results is important because: * (a) Different emphases give slightly different conclusions (...obviously) * (b) When you have facility in exploring results like this that * you start to feel in control of your analysis in a way that many * people may not otherwise experience (...a bit less obvious) * Standard errors of the coefficients: est table con1 con2 con3 con4 con5, se b(%9.4g) estimates stats * 'Esttab' is an extension routine which is very useful for summarising model coefficients: adopath + $path8 adopath net search esttab ** package st0085_1 from http://www.stata-journal.com/software/sj7-2 net describe st0085_1 net install st0085_1 man esttab esttab left1 left2 left5 using "$path1\voting\results\left_voting_model_1.rtf", /// b(%9.4f) se(%9.3f) pr2(%9.3f) bic(%9.0f) /// starlevels(* .10 ** .05 *** .01) stardetach /// label mtitles("QJBRGSC" "RGSC2" "SOCM") /// nogaps replace * For the techi types we would also have esttab left1 left2 left5 using "$path1\voting\results\left_voting_model_1.tex", /// b(%9.4f) se(%9.3f) pr2(%9.3f) bic(%9.0f) /// starlevels(* .10 ** .05 *** .01) stardetach /// label mtitles("QJBRGSC" "RGSC2" "SOCM") /// nogaps replace esttab left1 left2 left5 using "$path1\voting\results\left_voting_model_1.html", /// b(%9.4f) se(%9.3f) pr2(%9.3f) bic(%9.0f) /// starlevels(* .10 ** .05 *** .01) stardetach /// label mtitles("QJBRGSC" "RGSC2" "SOCM") /// nogaps replace ** Some further options for interpretation: esttab left1 left2 , abs brackets nostar * absolute values of t-stats, in [] rather than (), no signif stars esttab left1 left2, wide * wide rather than long format esttab left1 left2, starlevels(+ 0.05 * 0.01) * different star levels and symbols esttab left1 left2, starlevels(+ 0.05 * 0.01) stardetach staraux * stars in separate column; stars by the SE/t rather than by coeff esttab left1 left2, label * labels for regressors (if labels exist) esttab left1 left2, mtitles("Model 1" "Model 2") * bespoke column labels for models esttab left1 left2, nogaps * no blank lines between each coeff/se esttab left1 left2, nolines * no added lines in the table esttab left1 left2, nonotes esttab left1 left2, addnotes("-esttab- is a wonderful program") esttab left1 left2, nonotes addnotes("-esttab- is a wonderful program") * no notes at bottom, or custom note esttab left1 left2, r2 ar2 pr2 aic bic scalars(ll) esttab left1 left2, pr2(%9.5f) scalars(ll) sfmt(%9.0f) obslast * additional summary statistics, optional # d.p., optionally put # obs last **** Acknowledgment: Thanks to Stephen Jenkins, University of Essex, for * sharing some esttab exemplars with us ************************************************************** ************************************************************** *** Exploring these coefficients (ii): * Lets say we're really interested in the difference in propensity to vote conservative * between different categories of the RGSC scheme ** First a bivariate comparison: use $path9\temp.dta, clear table qjbrgsc, c(mean convot n convot semean convot) graph bar (mean) convot, over(qjbrgsc, label(angle(45)) ) * In a bivariate analysis we could do a simplified plot of standard errors around the mean * (when convot is a dichotomy, Standard error of proportion = sqrt((p*(1-p)/n) capture drop p_c capture drop n_a capture drop se_c capture drop lower capture drop upper egen p_c=mean(convot), by(qjbrgsc) egen n_a=count(convot) gen se_c=sqrt( (p_c*(1-p_c))/n_a ) gen lower = p_c - (1.96*se_c) gen upper = p_c + (1.96*se_c) graph twoway (scatter p_c qjbrgsc) (rspike lower upper qjbrgsc), subtitle("Proportion Conservative") /// xlabel(, valuelabel angle(30) labsize(medsmall)) xscale(range(0 6.5)) legend(off) xtitle("") /// note("Source: BHPS 2007, Unweighted" /// "Bars show 95% confidence intervals around proportion") graph save coef1, replace ** This is interesting, but the question is whether the same differences persist after controlling for other factors xi:logit convot qage i.qsex i.qjbrgsc est store con1 est table con1 , se b(%9.4g) stats(r2_p bic ll N) est table con1 , star b(%9.4g) stats(r2_p bic ll N) ** How do we examine whether different classes are significantly different to each other? ** ..please read this!... ** Gayle, V., & Lambert, P. S. (2007). Using Quasi-Variance to Communicate Sociological Results from * Statistical Models. Sociology, 41(6), 1191-1206. *** In summary: * a) It's wrong to just look at overlapping CI's for the dummy coefficients ('reference category problem') * b) In Stata, we have easy facility to test any two coefficients are different * c) It'd be nice to graph the coefficients similar to the bivariate graph: we can do this with Quasi-Variance ** Demo of (b) test _Iqjbrgsc_2==0 * Classes 1 and 2 are not significantly different in the model (as in the bivariate analysis) test _Iqjbrgsc_4== _Iqjbrgsc_5 * Classes 4 and 5 are significantly different in the model (as in the bivariate analysis) test _Iqjbrgsc_4== _Iqjbrgsc_6 * Classes 4 and 6 are not significantly different in the model (different from the bivariate analysis) ** This is a good way to proceed, but the many permutations of contrasts are difficult to communicate ** Demo of (c) ** For detailed instructions see: http://www.longitudinal.stir.ac.uk/qv/ xi:logit convot qage i.qsex i.qjbrgsc vce matrix mod1=get(VCE) matrix list mod1 matrix mod1_q=mod1[3..7,3..7] matrix rownames mod1_q= c2 c3 c4 c5 c6 matrix colnames mod1_q= c2 c3 c4 c5 c6 matrix list mod1_q ** To undertake the QV calculation, we export this matrix (eg using Excel) * to the webpage: * http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic/firth/software/qvcalc/ * then run the online calculation. ** This produces a database of coefficients with upper and lower quasi-variance standard errors clear input rgsc coef se qse qvar 1 0 0 0.141 0.01989 2 0.060788 0.152 0.0539 0.00291 3 0.1441189 0.164 0.0829 0.00687 4 -0.3564942 0.171 0.1003 0.01005 5 -0.7105281 0.186 0.1217 0.0148 6 -0.4879115 0.3 0.265 0.07023 end summarize label variable coef "Parameter estimates" gen upper=coef + (1.96*qse) gen lower=coef - (1.96*qse) label define rgscl 1 "Professional" 2 "Managerial and technical" 3 "Skilled non-manual" /// 4 "Skilled manual" 5 "Partly skilled occ" 6 "Unskilled occ" label values rgsc rgscl tab rgsc summarize graph twoway scatter coef rgsc /// || rspike upper lower rgsc, /// xlabel(1 2 3 4 5 6 , valuelabel angle(30) ) xscale(range(0 7)) xtitle("") legend(off) /// subtitle("Probability of Supporting Conservative") /// note("Source: BHPS 2007, Unweighted." /// "Coefficients from logistic regression (controls for age & gender)," /// " and 95% Quasi-variance comparison intervals" , span) graph save coef2, replace graph combine coef1.gph coef2.gph, cols(2) ** Comment: The graph highlights in shorthand how controlling for age and gender reduces the extent of * difference between classes in voting preferences ************************************************************** ************************************************************** ************************************************************** *** EOF