STOP
*******************************************************.
*** Longitudinal Data Analysis for Social Science Researchers
**
** ESRC Researcher Development Initiative training programme:
**
** Training materials lab 0:
**   INTRODUCTORY DATA ANALYSIS AND DATA MANAGEMENT IN STATA .
**
** www.longitudinal.stir.ac.uk
** Paul Lambert / Vernon Gayle, 1 August 2008
**
**  with minor updates for the DAMES training workshop of 24/25 August 2009,
**  see http://www.dames.org.uk/workshops/
**
*******************************************************.
**********************************************************.
*****************************************************.
** The file below comes in three sections:
**
*** Part 1: Introductory data management techniques
**    1.1) Data entry and saving files
**    1.2) Handling variables
**    1.3) Subsample operations
**    1.4) Weighting data
**    1.5) Including survey data design effects
**    1.6) Handling results of models / calculations
**    1.7) Processing do and ado files
**    1.8) Some extensibility features
**
*** Part 2: Introductory data analysis techniques
**    2.1) Selected univariate techniques
**    2.2) Selected bivariate techniques
**    2.3) Selected multivariate techniques
**
*** Part 3: More data management: introducing file matching techniques
**
**********************************************************.
** SYNTAX REMINDERS: * indicates a comment;
*    any line not beginning with a * is an active command
*  A new command ordinarily begins on a new line, but the same syntax command
*    may be spread over more than one line, by using the line continuation symbol ///
**
*******************************************************.
** GENERAL INSTRUCTIONS ON THESE FILES
**
** Work through this file in the interactive do-file editor, replicating
**  the Stata do-file commands. Further help on working with Stata is
**  available from the LDA web site.
**
*** This file assumes you have a number of files downloaded to your
**   machine. You will need the following:
*
** Downloadable from the LDA site :
**   Data files (from http://www.longitudinal.stir.ac.uk/workshop_materials.html):
*       -wemp_s2.dat
*       -div3s2004.dat
*       -ghs95.dta
**   Macros :
*       -seglabelsv1.do (downloadable from the LDA site)
** Downloadable from the UK Data Archive:
*  - ssa02.dta, ssa01.dta, ssa00.dta and ssa99.dta
*     (Scottish social attitudes 2002, 2001, 2000, 1999,
*      Stata datasets for study numbers 4808, 4804, 4503, 4346, Stata format files)
*  - ssa02.por (Scottish social attitudes 2002, SPSS format file, study number 4808)
*
*  -All BHPS Waves 1-15 component files in Stata format (UK Data Archive Study number
*     5151 (June 2007 release) (extracted from the zip file 5151Stata8.ZIP)
*    +warning - this is a large volume of files, 152 different files, ~ 700MB;
*      most of the exercise below does not use the BHPS data, so it could be more effective
*      to proceed without downloading this study at this point )
*
**************.
**
** These examples are written for Stata version 10.0 for Windows (2008)
**  - most commands should be compatible with earlier versions of Stata and other
**    operating systems
**  - for Stata version 8 or earlier, note that this do file may be too long
**    for the Stata do file editor, in which case the do file should be
**    split into two or three components
**
*******************************************************.
** .
** Getting started command:
version
version 10.0
** These two commands are not essential to the lab.
** 'version' on its own asks Stata to confirm what version of Stata is being run
** 'version 10.0' is for posterity: it tells your Stata session that from now on,
**   commands should be interpreted as from version 10.0, even if (in the future) you may
**   be operating version 11 or later.
**   (if these commands cause you any trouble, just omit them)
*******************************************************.
** NOTIFICATION OF FILE LOCATIONS / DIRECTORIES AND STATA SETUP
**
**
** i) File location declarations:
*** For the commands below to work, you should begin by running the following
**   macros, which tell Stata where to look for the relevant data files (mentioned
**   above) on your machine : .
global path1 "d:\dames09\work\"
* (the location of your working directory - where you will save
*    newly created data files and output) .
global path2 "d:\dames09\data\lda\"
* (the location of a folder where you have saved the
*    open access online data files mentioned above -
*    see http://www.longitudinal.stir.ac.uk/workshop_materials.html)
global path3 "d:\dames09\data\bhps\w1to15\"
* (the location of a folder where you have saved the BHPS data
*    file(s) mentioned above) .
global path4 "d:\dames09\data\ssa\"
* (the location of your copies of the SSA data files mentioned above)
global path8 "d:\dames09\macros\"
* (the location of your copy of the demonstration Stata macro 'seglabelsv1.do'
*    downloadable from the website )
global path9 "d:\dames09\temp\"
* (a location of a temporary folder where you can save intermediate files) .
** For the Stirling 2009 workshop: the data and files all ought to be in the locations as above
**
** If you are running these sessions elsewhere: you need to carefully substitute your own path
**   locations as appropriate into the commands above .
**********************
**********************
** ii) Stata session management:
*** The commands below are used to set some general preferences within Stata,
**   it is usually good advice to run these but it is not essential
clear
* (clears any other data within Stata)
set more off, perm
* (switches off the default setting whereby output is shown one page at a time only)
* (the 'perm' option should keep this as an ongoing preference for your machine)
* (you may need to enter this manually via the command window for it to stay persistent)
set memory 64M
* (expands the memory allocated to the Stata session, usually necessary if using large files)
capture log close
capture log using $path1\log_lab0.txt, replace text
* (These commands close any previous log file that may be open, then they set a new log file
*   for this lab, a plain text file where basic output is saved, located in the directory defined
*   as 'path1').
**
**
**
*******************************************************.
**********************************************************.
***********************************************************.
*** SOME COMMENTS ON USING Stata INTERACTIVE DO FILES :
*
*
* Comment: This is an interactive command file ('do file'), you can run
*    it from the do editor window in Stata (open with 'ctrl-8'),
*    by highlighting the command line or lines which you wish to invoke
*    and using either 'ctrl-d' or the 'do current file' icon
*
* Comment: (Some Stata users run their command files from their own text editor software)
*
* Comment: On an interactive command file, you write out (or paste) the textual Stata commands,
*    then invoke them (typically in groups of a few sequential commands at a time).
*
* Comment: Gradually learning the textual Stata commands is the valuable skill in learning
*    Stata. The best way is to begin working through example files such as this one.
*    In time you will start to remember a few common commands, but additionally, you
*    should learn how and where to find information on the necessary Stata commands
*    - i.e. from the Stata manuals, the online Stata pages, or the help system
*
* Comment: Notes on the example files below give some explanation of what each command is
*    doing, but they don't cover everything. It is best to try to work out for yourself
*    what each line is doing, but ask the instructors for clarification if needed.
*
* Comment: On an interactive command file, a new command is indicated by a line break
*    (i.e. line breaks are the 'command line delimiter'). A comment can be added to
*    the command file by beginning the line with a *, which means that Stata will ignore
*    the rest of the line - comments are useful for making notes for yourself.
*    Any text on a new line which doesn't start with a * is interpreted by Stata
*    as an attempt to run the Stata command given by the text.
*    If you wish to spread the text of a single command
*    over more than one line, you can use the /// symbol
*    to indicate a line continuation
*
*   [advanced: if you run a whole file at once, as a 'batch file', it is also possible
*      to specify another symbol such as a ; as the delimiter]
*
*   [advanced: the reason 'STOP' is written at the top of this file is to prevent
*      you from inadvertently running the whole of this file in one go by pressing 'ctrl-r'
*      rather than 'ctrl-d' : there is no valid Stata command called STOP, so Stata will
*      immediately trip up on this command and won't carry on]
*
* Comment: If you run a group of command lines in one go in Stata and any of your
*    command lines includes an error, Stata will immediately stop running
*    the lines from that point onwards (this is different from the way SPSS
*    processes command syntax). One useful way of averting this is to use the
*    'capture' prefix before any given command (eg used above on the 'log' example). This
*    handy prefix tells Stata to run the command if it is working fine, and to ignore
*    it if it has an error, but carry on running the next line
*
* Comment: Windows management. It is worth spending some time arranging your Stata windows
*    into a layout that suits you. You can change the font size and colour scheme
*    of the output window (right-click) and you can change the layout and colour schemes
*    for other windows. If like me you use 'Alt-tab' to toggle between windows, you will
*    notice that some poor programming by Stata means you need to double-tab to leave
*    the Stata output window for any other window.
*
* Comment: For most social scientists, processing Stata through the do file editor is
*    appropriate for most purposes. The next step up is learning to program in Stata,
*    which concerns writing programmes (SPSS = macros) and batch files (whole files
*    processed in one go), to run complex tasks. Stata's own programming language is
*    the one used in 'ado' files; Stata also includes a separate matrix programming
*    language called 'Mata'.
*
*
*  SEE ALSO : http://www.longitudinal.stir.ac.uk/Stata_support.html
*
*
**************************************************************
**************************************************************
** ..finally, here is the lab exercise...
************************************************************
**********************************************************.
*** PART 1: PRELIMINARY DATA MANAGEMENT TECHNIQUES
**********************************************************.
** Comment - In part 1 we open and close a number of different data files
**  - if this leads to some error messages, best advice is to start again from the top,
**    or at least from the last 'use' command ...
*********************************************************.
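** Illustration (an addition to the original lab): a minimal sketch of the 'capture'
*   prefix and the /// line continuation described in the comments above.
*   Assumptions: it uses Stata's built-in example data ('sysuse auto') rather than
*   the lab files, and the variable name 'wt2' is purely illustrative.
sysuse auto, clear
capture drop wt2
display _rc
* (no variable 'wt2' exists yet, so 'drop' fails; 'capture' suppresses the error and
*   _rc holds the return code of the captured command - non-zero means it failed)
gen wt2 = weight^2
summarize mpg weight wt2 ///
    if (foreign==1)
* (the two lines above are read as one command because of the ///)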
***** 1.1) DATA ENTRY AND SAVING FILES:

**** I) READ IN DATA FROM A PLAIN TEXT FILE:
infile case femp mune time und1 und5 age using $path2\wemp_s2.dat, clear
summarize
describe
list in 1/10
* note the absence of file information such as value labels

**** II) READ IN DATA FROM A PRE-PREPARED STATA FORMAT FILE:
use $path4\ssa02.dta, clear
summarize
describe
tab rural
list in 1/3
* Stata isn't especially good for interactive data reviewing,
*   particularly if the file has quite a few variables or cases
*   - there's no variable view popup or window as in SPSS
*   (Tip: with large Stata files, I often keep an SPSS version open
*     at the same time, solely for the purpose of examining the data)
use serial rsex rage marstat miles using $path4\ssa02.dta , clear
summarize
list in 1/3
* dealing with only a few variables is much easier
*****************

**** III) READ IN DATA FROM SPSS AND OTHER FILE FORMATS:
*** A) Converting files from SPSS
** ONLINE RESOURCE: STATA CONVERSIONS TO SPSS AND SAS :
**   http://www.ats.ucla.edu/stat/Stata/faq/convert_pkg.htm
** From SPSS v16 onwards :
** Exporting from SPSS to Stata format can be done with 'save translate':
*** For example (run the below in SPSS syntax, changing paths as necessary): .
**  import file="d:\dames09\data\ssa\ssa02.por".
**  save translate outfile="d:\dames09\temp\ssa02_test.dta" /type=stata /version=8 /replace .
**
** Then in Stata:
**  use $path9\ssa02_test.dta
** SPSS can also read Stata format data with 'get stata file':
** For example:.
**  get stata file="d:\dames09\data\lda\ghs95.dta".
** (But: at v16, SPSS requires the Stata file to be in version 9 format or earlier,
*    so it may be necessary to export from Stata into version 9 format).
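** Illustration (an addition to the original lab): a minimal sketch of writing an
*   older-format Stata file for SPSS to read, using the 'saveold' command.
*   Assumptions: under Stata 10, 'saveold' writes the Stata 8/9 file format; in later
*   Stata releases the format written by 'saveold' differs, so check 'help saveold'.
*   The output filename 'ghs95_old.dta' is purely illustrative.
use $path2\ghs95.dta, clear
saveold $path9\ghs95_old.dta, replace
* (the file in 'path9' could then be opened in SPSS with 'get stata file')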
*** Historical note: With earlier versions of SPSS, there was no direct means
**   to convert to Stata, common solutions were to:
**   - Export from SPSS into plain text format and read this into Stata
**      (the cost was to lose file metadata such as value labels)
**   - Use the freeware SPSS script 'spss2Stata.sbs' (downloadable from
*      http://www2.jura.uni-hamburg.de/instkrim/kriminologie/Mitarbeiter/Enzmann/Software/Enzmann_Software.html )
*      which runs on an open SPSS file and converts it into a format that Stata can read, whilst
*      also writing out a text file in 'do' file format, which contains metadata such as value
*      labels, and which can be run as a single command to achieve the conversion. This script
*      isn't compatible with SPSS v16 and above.

*** B) Converting between various formats: Use stat-transfer
*   This is a proprietary software package for transferring between different
*    data analysis software formats
*    http://www.stattransfer.com/

*** C) More on converting file formats
** - To convert with other packages it is often necessary to read or write in
**    plain text formats (csv), which is easily done in Stata and most other packages
**    Some examples below in IV and V
** - Common problems with exporting in and out of Stata format are related to the
**    manner in which Stata stores missing values, which is different to some other
**    software conventions
** - A practical solution is to always convert missing data into numeric format
**    prior to data transfer - in most circumstances this is adequate and effective
** Example of converting missing data:
use serial rsex rage marstat miles using $path4\ssa02.dta , clear
numlabel _all, add
summarize
tab miles
mvdecode miles, mv(-1)
tab miles
summarize
list in 1/100
* Note the '.' format for missing data (e.g for 'miles' on row 8)
* Conventional plain text export:
outfile using $path9\temp_out.dat , nolabel replace
* If you open this file up in a simple editor, you'll see that
*   the missing data is stored with a '-' symbol (e.g. end of the 8th row)
mvencode miles, mv(-999)
tab miles
list in 1/100
outfile using $path9\temp_out.dat , nolabel replace
* There are other options, but this is one simple way to be clear about
*   the missing data
** (more on missing data commands shortly below)
************************

**** IV) ENTERING DATA TO STATA BY HAND : .
****    ..ie some more on data entry options:
*** a) Within an interactive syntax file :
** It's seldom worth it except for the simplest of data,
*    eg a small number of summary statistics :
clear
input england wales scotland nirel
49.14 2.90 5.06 1.69
end
graph bar (asis) england wales scotland nirel, title("UK population 2001")

*** b) Using 'insheet' to read in plain text data :
** Even if you are creating a dataset by hand, it is
**   usually preferable to write the data into a different, simple text file,
**   and then use the 'insheet' command to read it, eg:
insheet using $path2\div3s2004.dat, clear
* (tip - look at the contents of the file in a plain text editor to see what it features)
summarize
list team points gscore glost
corr gscore glost points
graph bar (asis) points , ///
    over(team, sort(points) descending label(alternate) ) title("Division 3, 2003/4")
graph bar (asis) points gscore glost, ///
    over(team, sort(points) descending label(alternate) ) title("Division 3, 2003/4")
drop if (team=="East Stirling")
corr gscore glost points
* Defences, not forwards, win you 3rd division titles!
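** Illustration (an addition to the original lab): a minimal sketch of entering the
*   same population figures from IV(a) in 'long' format, with one row per country.
*   A text variable needs a string type (here 'str10') declared on the 'input' line.
*   Assumption: the variable names 'country' and 'popm' are purely illustrative.
clear
input str10 country popm
"England"   49.14
"Wales"      2.90
"Scotland"   5.06
"N Ireland"  1.69
end
list
graph bar (asis) popm, over(country, sort(popm) descending) title("UK population 2001")
* (the long layout lets 'over()' order and label the bars, as in the football example)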
*** c) Entering data Interactively :
clear
** - You can use the data editor window
** - Tap in values to cells; add labels etc by syntax as below
** - Cumbersome and best avoided - better to enter in SPSS or other package
****************

**** V) SAVING DATA OUT FROM STATA:
** Example of saving in Stata format:
insheet using $path2\div3s2004.dat, clear
summarize
list team points gscore glost
save $path9\div3s2004.dta , replace
** Example of saving in plain text format:
summarize
outfile team points using $path9\div3s2004_2.dat , nolabel replace
** Example of differences between plain text output formats:
use serial rsex rage marstat miles using $path4\ssa02.dta , clear
numlabel _all, add
summarize
tab miles
list in 1/100
outfile using $path9\temp_out2.dat , replace
* If you open this file up in a plain text editor, you'll see that the text value labels
*   are saved on the data file - you're unlikely to want this.
outfile using $path9\temp_out3.dat , nolabel replace
* This saves as numeric format
outsheet using $path9\temp_out4.dat, nolabel replace
* This also gives csv numeric format, and also by default includes variable names
************************************************************
************************************************************
***** 1.2) HANDLING VARIABLES :
************************************************************
infile case femp mune time und1 und5 age using $path2\wemp_s2.dat, clear
list in 1/10
* this data has no labels at present
**
**** i) To add variable labels :
tab femp mune
label variable femp "Wife's employment status"
label variable mune "Husband's employment status"
tab femp mune
**
**** ii) To add value labels :
label define fempl 0 "Not working" 1 "Employed"
label define munel 0 "Employed" 1 "Unemployed"
label values femp fempl
label values mune munel
tab femp mune
**
**** iii) Looking at categorical data :
tab femp
* that's no use - want to know what values are assigned
tab femp , nolabel
* that's rubbish too - now there are no labels
numlabel fempl , add
tab femp
numlabel fempl, remove
tab femp
* numlabel works, though complex
* A shortcut is to put numeric labels on everything:
tab femp mune
numlabel _all, add
tab femp mune
* (This is best. However, numlabel _all might not work for
*   large files with some string or complex variables).
**
** Multiple tables:
tab femp
tab femp mune
* Tab on its own gives a 2-way crosstab
tab1 femp mune und1
* Tab1 is a convenient alternative option to show a series of single tabs

**** iv) Computing new variables
infile case femp mune time und1 und5 age using $path2\wemp_s2.dat, clear
summarize
gen age2=(age^2)
gen age3=(age^3)
graph matrix age age2 age3
* Now for a deliberate error:
gen age3=(age*3)
* this won't work if age3 is already in the data
*  - you can't overwrite an existing variable without dropping it first
summarize ag*
drop age3
summarize ag*
gen age3=(age*3)
graph matrix age age2 age3
* because of the above, it often makes sense to use 'capture drop' before a gen,
*   regardless of whether the variable was created before
capture drop age4
gen age4=(age^4)
graph twoway scatter age4 age
capture drop age4
gen age4=(age^4)/ 10000
graph twoway scatter age4 age
*** The 'gen' command is for basic functions on one or more variables within a case
*** The 'egen' command is 'extensions to generate', it has various additional options
***   but the most useful is the ability to summarise within groups
*** Here are some (slightly harder) examples using 'egen':
infile case femp mune time und1 und5 age using $path2\wemp_s2.dat, clear
sort case
list case age time femp in 1/50
* (in this data, case is the person marker, and there are multiple records per
*   person, observed at different times. femp indicates whether they were working each time)
egen pctwk=mean(femp), by(case)
egen minage=min(age), by(case)
list case age time femp pctwk minage in 1/50
label variable pctwk "Proportion of records in which case was employed"
label variable minage "Youngest age at which sampled"
histogram pctwk
graph twoway scatter pctwk minage
* Though this is messy, it does make sense!
gen minage2=(floor(minage/5)) * 5
list case age minage minage2 in 1/50
graph bar (mean) pctwk, over(minage2) title("Mean proportion of times working, by age")
* This might be a little easier to follow
* This dataset also shows if the subjects' husbands were unemployed at any point
egen husun=max(mune), by(case)
capture label drop husunl
label define husunl 0 "Husb. never unemployed" 1 "Husb. unemployed 1+ year"
label values husun husunl
gen one=1 /* Just so that we can use 'one' as layering variable in the graph */
graph bar (mean) pctwk, over(husun) over(minage2) over(one) ///
    title("Mean proportion of times working, by age")

***** v) Recoding values
infile case femp mune time und1 und5 age using $path2\wemp_s2.dat, clear
histogram age
gen age4=age
recode age4 18/30=1 31/40=2 41/50=3 51/max =4
label define age4l 1 "18-30 yrs" 2 "31-40 yrs" ///
    3 "41-50 yrs" 4 "51+ yrs"
label values age4 age4l
tab age4
* Tip - with complex data recoding (eg occupational unit codes), you can recode via
*   a macro in a separate do file (or an 'ado' file)
**
**** vi) Dealing with missing values
use serial rsex rage marstat miles using $path4\ssa02.dta , clear
numlabel _all, add
summarize
tab miles
* codes for -1, 6 and 8 are all possible missing values
*** a) Keep the data but code it as missing
tab miles
mvdecode miles, mv(-1,6,8)
tab miles
mvencode miles, mv(-999)
tab miles
* note - you lose differentiation between values -1, 6 and 8
*** b) Drop the missing data
histogram miles
histogram miles if (miles >= 1 & miles <= 5)
summarize
drop if (miles==-999)
summarize
*** c) What about a specific case - say we want to drop serial 40007
sort serial
list serial in 1/10
drop if (serial == 40007)
list serial in 1/10
summarize
** Tip - 1) Because it is so easy to use subsets (see 1.3 below) I don't very often
**          drop cases from the data like this
** Tip - 2) don't save data to the same filename after manipulating data like this!!
***********************************************************
************************************************************
***** 1.3) SUBSAMPLE OPERATIONS :
************************************************************
* Stata has a very concise syntax for dealing with subsamples:
use pid lxrwght lxewght llrwght llewght lregion lsex lage lvote ///
    using $path3\lindresp.dta , clear
summarize
numlabel _all, add
tab lregion
graph pie, over(lregion) line(lcolor(gs1))

**** i) Analysis only on certain conditions :
tab lvote if lregion==18
tab lvote lregion if ((lregion==17 | lregion==18) & lvote >= 1 & lvote <=10), col
tab lvote if (lregion==17 & lsex==1 & lage <= 40)
graph pie if lregion==18 & lvote >= 1 & lvote <=10, over(lvote) title("Voting pref. in Scotland in 2002")
**
**** ii) Analysis split by another variable :
sort lsex
by lsex: tab lvote
by lsex: tab lvote if (lregion==18 & lage >= 50)
by lsex: tab lvote if (lregion==18 & lage >= 50) [aweight=lxrwght]

**** iii) Analysis only for a certain range of data:
tab lvote
tab lvote if (lvote >= 1 & lvote <= 3)
sort lsex
by lsex: tab lvote ///
    if (lregion==18 & lage >= 50 & (lvote >= 1 & lvote <= 3)) [aweight=lxrwght]
mvdecode lvote, mv(-9,-7,8,10,11)
tab lvote lsex , col chi V
* (an insignificant significance?)
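** Illustration (an addition to the original lab): a minimal sketch of 'bysort', which
*   combines the 'sort' and 'by' steps used above into a single prefix.
*   Assumptions: it uses Stata's built-in example data ('sysuse auto') so that it runs
*   without the BHPS files, and the variable name 'mpgmean' is purely illustrative.
sysuse auto, clear
bysort foreign: summarize mpg
bysort foreign: summarize mpg if (price < 6000)
* The same prefix can be combined with 'egen' to store a within-group summary:
bysort foreign: egen mpgmean = mean(mpg)
tab foreign, summarize(mpgmean)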
*******************************************************
************************************************************
***** 1.4) WEIGHTING DATA :
************************************************************
use pid lxrwght lxewght llrwght llewght lxrwtuk1 lxrwtuk2 lregion lsex lage ///
    using $path3\lindresp.dta , clear
summarize
numlabel _all, add
tab lregion
** This is the BHPS wave 12 (2002), with only some variables extracted
summarize lxrwght lxewght llrwght llewght lxrwtuk1 lxrwtuk2
* These are 6 of the main BHPS weighting variables
* Lxrwght is the main weight, for adjusting a given wave to the British population:
tab lregion , summarize(lxrwght) mean obs
tab lregion [aweight=lxrwght]
* Graph this (using a scaling factor to ease interpretation)
gen lxr2=lxrwght*1000
graph bar (sum) lxrwght (count) lxrwght (mean) lxr2 , over(lregion, label(angle(70) labsize(small) ) ) ///
    legend(order(1 2 3) label(1 "Weighted sample") label(2 "Unweighted cases") label(3 "Weights assigned") ) ///
    title("BHPS 2002: Relation between total sample & British weights")
* For a cross-sectional weight including cases from Northern Ireland, use lxrwtuk1
tab lregion, summarize(lxrwtuk1) mean obs
tab lregion [aweight=lxrwtuk1]
gen luk2=lxrwtuk1*1000
graph bar (sum) lxrwtuk1 (count) lxrwtuk1 (mean) luk2 , over(lregion, label(angle(70) labsize(small) ) ) ///
    legend(order(1 2 3) label(1 "Weighted sample") label(2 "Unweighted cases") label(3 "Weights assigned") ) ///
    title("BHPS 2002: Relation between total sample & British weights")
* The uk2 cross-sectional weights are for allowing representative analysis within
*   countries, ignoring cross-country divisions:
tab lregion, summarize(lxrwtuk2) mean obs
tab lregion [aweight=lxrwtuk2]
* Other weights involve considering other longitudinal structures
table lsex, c(mean lxrwght n lxrwght mean lxewght n lxewght)
* note this shows: m lower initial non-response than f, but
*   m/f _dropout_ rates are more similar
** To use weights: with most commands, add an 'aweight' :
tab lsex
tab lsex [aweight=lxrwght]
* The [aweight=..] works in most situations
* comment: Stata has several types of weighting formula.
*   aweight is the normal sampling weight for population fractions
** BUT aweight sometimes can't be used in certain commands
** AND other weight types can, but they might not accept noninteger weights
tab lsex [fweight=lxrwght]
* ..see!
* For a useful short note on weighting for complex survey data:
*   http://sru.soc.surrey.ac.uk/SRU43.html
************************************************************
***** 1.5) INCLUDING SURVEY DATA DESIGN EFFECTS :
************************************************************
* Stata has a very useful set of options for estimating survey data
*   design effects (it's well ahead of alternative general purpose packages)
** Example : BHPS wave 12 .
** The BHPS has a multistage cluster design, some possible survey
**   effects :
**     wstrata (in whhsamp): initial sample stratifying factor
**     whhac (in whhsamp): household location sampling point (w1 only)
**     wpsu (in whhsamp and others): local region of household
**     whid (in windresp and others): household identifier
** Stata allows for two cluster effects ('strata' and 'psu')
**  -> If studying households, it would be best to use wstrata and whhac
**  -> If studying people, household clustering may be important, so
**     better to use wpsu and whid :
use pid lxrwght lxewght llrwght llewght lregion lsex lage lvote ///
    lhid using $path3\lindresp.dta , clear
summarize lregion lhid
numlabel _all, add
tab lregion
** Stata's 'svy' commands allow survey structure to be recognised in analysis
svydes
*(this provokes an error message - nothing is defined yet)
mvdecode lvote , mv(-9,-7,11)
table lvote , c(mean lage n lage)
sort lvote
ci lage [aweight=llrwght], by(lvote)
* Now add the survey definitions:
svyset [pweight=llrwght], strata(lregion) psu(lhid)
svydes
mean lage
mean lage [pweight=llrwght]
* (the weighted estimate uses individual weights only)
svy: mean lage
* (the svy estimate takes account of both weighting and clustering)
estat effects, deff deft meff meft
* (deff, deft, meff and meft are diagnostics to tell us about the survey design
*   impact - see 'help svy_estat' )
mean lage, over(lvote)
svy: mean lage, over(lvote)
** Comment: point estimates don't change much, but CIs have widened
**  Good practice is always to check for this sort of effect
**  (..and then decide to ignore it...)
** Comment: lregion isn't the best strata variable to use on the BHPS, but was the
**  most convenient for this illustrative data
**********************************************************
************************************************************
************************************************************
****************************************************.
* 1.6) HANDLING RESULTS OF MODELS / CALCULATIONS - see also section 2.3(iii)
** After any given statistical model, Stata temporarily stores information on the model
*   in memory, which the user may extract and store elsewhere. The information
*   is stored as the model 'estimates' and coefficient matrices:
estimates clear
*[prepare data]
use $path2\ghs95.dta, clear
numlabel _all, add
tab soclase
gen class56=(soclase==5 | soclase==6)
tab class56
tab sex
gen fem=(sex==2)
tab fem
gen class56f=(fem==1 & class56==1)
tab class56f
gen age2=age^2
summarize soclase class56 age age2 workhrs fem class56f
* [now ready to model these variables:]
logit class56 age age2 workhrs fem if (soclase > 1 & soclase <=6)
ereturn list
* (this gives everything saved for the model)
matrix logcoef=e(b)
est store logit1
matrix list logcoef
matrix agecoef=logcoef[1,1..2]
matrix list agecoef
est stats
est table , star stats(n r2_p bic)
* Compare three models for working hours, with different explanatory variables
est clear
summarize workhrs age age2 fem class56 class56f if (soclase >= 1 & soclase <=6)
regress workhrs age age2 if (soclase >= 1 & soclase <=6)
est store hrsreg1
regress workhrs age age2 fem class56 if (soclase >= 1 & soclase <=6)
est store hrsreg2
regress workhrs age age2 fem class56 class56f if (soclase >= 1 & soclase <=6)
est store hrsreg3
est stats
est table hrsreg1 hrsreg2 hrsreg3 , stats(N r2 ll) b(%9.3f) stfmt(%12.3f) star
* Compare three models for occupational attainment where functional form differs
est clear
logit class56 age age2 workhrs fem if (soclase >= 1 & soclase <=6)
est store logit1
ologit soclase age age2 workhrs fem if (soclase >= 1 & soclase <=6)
est store ologit1
regress soclase age age2 workhrs fem if (soclase >= 1 & soclase <=6)
est store regress1
est stats
est table logit1 ologit1 regress1 , b(%9.3f) stats(N r2 r2_p ll ) star
************************************************************
************************************************************
************************************************************
****************************************************.
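** Illustration (an addition to the original lab): a minimal sketch of pulling
*   individual stored results out of a model, alongside the e(b) matrix used above.
*   Assumptions: it uses Stata's built-in example data ('sysuse auto') so that it runs
*   without the GHS file; the scalar name 'b_wt' and matrix name 'bmat' are illustrative.
sysuse auto, clear
regress mpg weight foreign
display "N = " e(N) "   R-squared = " %5.3f e(r2)
display "coefficient on weight = " _b[weight] "   s.e. = " _se[weight]
* Store a single coefficient in a scalar for later use:
scalar b_wt = _b[weight]
* The same value is the first element of the coefficient matrix:
matrix bmat = e(b)
display "from the matrix: " bmat[1,1]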
* 1.7) PROCESSING DO AND ADO FILES ** It is often helpful to call upon a longer piece of Stata programme ** within the course of your do-file editor session. ** This can be done by calling a 'do' or an 'ado' file within your session * (do and ado files are equivalent in their treatment by Stata, but the * latter term is often used to differentiate them from interactive files) ** Example: use $path2\ghs95.dta, clear numlabel _all, add tab segead ** This is the variable 'SEG' - a coding for the job classification of any * working adult in the household gen seg2=segead tab seg2 * Note that seg2 doesn't preserve the value labels, which need to be added in * some way. ** The most orthodox way of adding the data would be to write out all the labels: label define seg2l 1 "Employers in large establishments (25+ employees)" /// 2 "Managers in large establishments (25+ employees" /// 3 "Employers in small establishments (lt 25 employees)" /// 4 "Managers in small establishments (lt 25 employees)" /// 5 "Professional workers - self-employed" /// 6 "Professional workers - employees" /// 7 "Non manual workers, ancillary to professions" /// 8 "Foremen and supervisors of non-manual workers" /// 9 "Junior non-manual" /// 10 "Personal services workers" /// 11 "Foremen and supervisors of manual workers" /// 12 "Skilled manual" /// 13 "Semi-skilled manual" /// 14 "Unskiled manual" /// 15 "Own account workers, non-professional" /// 16 "Farmers - employers and managers" /// 17 "Farmers - own account workers" /// 18 "Agricultural workers" /// 19 "Armed forces" /// 20 "Full time student" /// 21 "Never worked" /// 22 "Inadequately described occupation" label values seg2 seg2l numlabel _all, add tab seg2 ** However, writing out all of these labels is cumbersome within the * single do file. 
* By running a separate file as a command, we could put these details * elsewhere and invoke them with a single command - eg: gen seg3=segead tab seg3 do $path8\seglabelsv1.do label values seg3 segl numlabel _all, add tab seg3 ** In practice do and ado files are often invoked within programmes to run * quite extended and complex processes ************************************************************ ************************************************************ ************************************************************ ** 1.8) SOME EXTENSIBILITY FEATURES ************************************************************ ** A very attractive feature of Stata is its extensibility - features ** are regularly added to Stata by users themeselves ('user contributed') ** and then are themselves made available to other researchers. Over the years ** this had led to many innovations in the Stata package that have ** greatly enhanced the program (and made it especially good at developing ** user-driven features) **************************** ** i) Reading online files *** A major contribution to Stata's extensibility is its ability to read and process ** data and commands directly from online sources. ** For example, many of the help manuals take you through examples which use ** online datasets. Here is an example from the v8 Stata manual on 'regress' : clear use http://www.stata-press.com/data/r8/auto.dta generate weightsq=weight^2 regress mpg weight weightsq foreign ** Stop and think... ** ...this is pretty useful, and opens all sorts of avenues..! ** In fact, the same can be done with do and ado files as well - try this example: do http://www.longitudinal.stir.ac.uk/labs/extras/uk_unis.do ** **************************** ** ii) Extension programs *** Stata has a vast library of user-contributed extension functions that ** are available by downloading and installing as packages. 
** It is complicated to illustrate but there is a full description here:
**   http://www.stata.com/help.cgi?net

************************************************
** One example (there are many to choose from) is a program called 'tabplot'
*   which is not part of the main Stata software, but can be installed as an extension

use $path2\ghs95.dta, clear
numlabel _all, add
tab soclase sex

net search tabplot
* This tells us that 'tabplot' is a package available from the repec website:

net set ado $path8
adopath + $path8
* This is complicated and isn't always necessary.
* What is happening is that when you try to load the 'tabplot' program, there will need
*   to be somewhere where Stata can save it on your machine. What Stata will save is an 'ado'
*   file, so the convention is that there must be an 'adopath' where Stata can save this new file.
* In most sessions, you will already have a writable adopath associated with your work. However
*   you might not (say if you are running a networked version of Stata), so it makes sense to
*   declare an additional location on your own machine ('path8') where you know that Stata definitely
*   will be able to save the requisite ado file. The two lines above do this.

net from http://fmwww.bc.edu/RePEc/bocode/t
* Next, this tells us that we'll use the repec site to access tabplot

net describe tabplot
* We ask for information about tabplot that comes from the repec site

net install tabplot
* Now we've installed tabplot in our own copy of Stata
*  (if the extension package was already installed then nothing happens here)

tabplot soclase sex
* We now use the newly installed tabplot command:
*   it gives a graphic representation of the crosstabulation above.
* This wasn't in the package before, and is only possible because the author
*   of tabplot (Nicholas Cox) wrote a programme to generate this graph, then
*   distributed it through the Stata system

******************************************
** A second good example is the extension command called 'outreg2', which helps with
**   processing the results of multiple regression equations (cf. 'est table' above)
*  (see eg http://www.ats.ucla.edu/stat/stata/faq/outreg.htm)
*  (http://econpapers.repec.org/software/bocbocode/s456416.htm)

net from http://fmwww.bc.edu/repec/bocode/o
net install outreg2
* This tells Stata where outreg2 will be accessed from,
*   then installs it to Stata's known adopath (specified just above)

***************************************************
** Example regression from GHS file
use $path2\ghs95.dta, clear
numlabel _all, add
gen longill2=(longill==1)
gen fem=(sex==2)
gen age2=age^2
gen femage=fem*age
gen soch1=(hohscle==1 | hohscle==2)
gen soch3=(hohscle==5 | hohscle==6)
tab hohscle soch1
tab hohscle soch3
summarize longill2 fem age age2 femage soch1 soch3
logit longill2 age age2 soch1 soch3
logit longill2 age soch1 soch3
logit longill2 fem age femage soch1 soch3

** Exporting these results with outreg2:
logit longill2 age age2
outreg2 using $path9\eg_outreg, nolabel replace word
logit longill2 age soch1 soch3
outreg2 using $path9\eg_outreg, nolabel append word
logit longill2 fem age femage soch1 soch3
outreg2 using $path9\eg_outreg, nolabel append word

** Exercise: open up the file eg_outreg.rtf in Word, and see what it contains
** Comment: there are lots of variations available in formatting via outreg2.
**************************************************

****************************
** iii) Stata upgrades

*** The core Stata files are also regularly updated with new content.
** We won't do so here, but the starting point is the command: update query

****************************
** iv) Extension packages

** Not only are there specialist extension programs for particular routines
**   within the Stata framework - see (ii) above - but there are also some examples
**   of entire software packages being developed, as freeware, that are deliberately
**   designed to integrate with Stata sessions

** We won't give a practical illustration, but there are two such examples
**   which are relevant to longitudinal data analysis applications:
**     GLLAMM: http://www.gllamm.org/
**     SabreStata: http://sabre.lancs.ac.uk/sabreStatause_intro.html

************************************************************
************************************************************
************************************************************
************************************************************


**********************************************************.
*****************************************************.
** Part 2: INTRODUCTORY DATA ANALYSIS TECHNIQUES
**********************************************************.
***********************************************************.

* Comment: Part 2 uses just the single data file called
*   'ghs95.dta', which should be saved on your machine and located
*   within the folder you've defined as 'path2'
**************************************************************


************************************************************.
************************************************************.
** 2.1) SELECTED UNIVARIATE TECHNIQUES
************************************************************.
use $path2\ghs95.dta, clear
numlabel _all, add

** Categorical data :
tab sex
tab typaccm
list sex typaccm in 1/40

** Some graphs:
gen one=1
label variable one "Respondents"
graph bar (count) one, over(typaccm)
graph pie one, over(typaccm)

** Comment: there are numerous specifications which may be added to Stata graph
*   outputs to adjust their appearance, eg
graph bar (count) one, over(typaccm, label(alternate labsize(vsmall))) ///
    ytitle(" ") title("UK Housing tenure 1995") ///
    note(" " "Source: General Household Survey 1995, n=4633")

** Metric data :.
summarize workhrs
list workhrs in 1/40
histogram workhrs
graph box workhrs
ci workhrs

***********************************************************.
***********************************************************.
** 2.2) SELECTED BIVARIATE TECHNIQUES
***********************************************************.

** Two categorical variables
tab soclase genhlth
tab soclase genhlth, row V
tab soclase genhlth, col chi V gamma

** Two metric variables
summarize age earnings
gen lnearn=ln(earnings) if (earnings >= 10 & earnings <= 5000)
summarize age earnings lnearn
graph twoway scatter lnearn age
correlate age lnearn
regress lnearn age
gen age2=age^2
regress lnearn age age2
* (Aside - what is the shape of the effect of age?)
matrix coefs=e(b)
matrix list coefs
gen ageef=age*coefs[1,1] + age2*coefs[1,2]
graph twoway (scatter ageef age)

** One categorical variable, one metric
table soclase, c(mean lnearn sd lnearn n lnearn)
sort soclase
ci lnearn, by(soclase)
graph box lnearn, over(soclase, label(alternate))

***********************************************************.
***********************************************************.
** 2.3) SELECTED MULTIVARIATE TECHNIQUES
***********************************************************.

*****************************************************.
*** i) Multivariate comparisons
** (Stata is very good indeed for multivariate comparisons)

summarize lnearn workhrs age age2
tab sex
tab soclase
tab genhlth

** If no variables are metric:.
sort sex
by sex: tab soclase genhlth, col chi V gamma

** If one variable only is metric : .
table soclase sex if (soclase >= 1 & soclase <=6), ///
    c(mean lnearn sd lnearn n lnearn)
sort sex
by sex: table soclase if (soclase >= 1 & soclase <=6), ///
    c(mean lnearn sd lnearn n lnearn)
graph bar (mean) lnearn, over(sex) over(soclase, label(alternate)) asyvars
graph bar (mean) lnearn, over(sex) over(soclase, label(alternate)) asyvars stack
graph box lnearn, over(sex) over(soclase, label(alternate)) asyvars

** If 2+ variables are metric
graph twoway scatter lnearn age
correlate age lnearn workhrs
graph matrix age lnearn workhrs

******************************************************.
*** ii) Multivariate modelling
** (Stata has a huge range of modelling options)

summarize lnearn workhrs age age2
regress lnearn workhrs age age2

* To include some categorical variables, you can create some dummy variables
*   (see section 1 for variable constructions)
tab sex
gen fem=(sex==2)
tab soclase
gen class1=(soclase==1)
gen class2=(soclase==2)
gen class3=(soclase==3)
gen class4=(soclase==4)
gen class5=(soclase==5)
gen class6=(soclase==6)
gen class7=(soclase==7)
gen class12=(soclase==1 | soclase==2)
* (the above is the manual way to create dummy variable indicators)
tab region
gen england=(region >= 1 & region <= 15)
gen scotland=(region >= 18 & region <= 22)
gen wales=(region==16 | region==17)

regress lnearn age age2 workhrs fem
regress lnearn age age2 workhrs fem class1 class2 class4 class5 class6 class7

* Categorical variables can alternatively be handled using 'xi' regressions
xi: regress lnearn age age2 workhrs i.sex
xi: regress lnearn age age2 workhrs i.sex i.soclase
list sex soclase _Is* in 20/50
* (this illustrates how xi has created some new dummy indicators automatically).
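
* A further option, sketched here as an illustration rather than as part of the
*   original lab: 'tabulate' with its 'generate()' option builds one dummy
*   variable per category in a single step. The stub name 'cls' is an arbitrary
*   choice made for this sketch.
tab soclase, gen(cls)
describe cls*
* (this creates one 0/1 indicator per category of soclase, named cls1, cls2, ...
*   in category order, so the numeric suffix need not equal the soclase code)
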
* Interaction terms can also be entered either by creating new variables, or by using xi:
gen agefem=age*fem
regress lnearn age age2 workhrs fem agefem
xi: regress lnearn age age2 workhrs i.sex i.sex*age

* Repeated regressions across subgroups:
sort sex
by sex: regress lnearn age age2 workhrs class1 class2 class4 class5 class6 class7

** Regressions with categorical outcomes:
tab class12
logistic class12 age age2 workhrs fem

tab soclase
mlogit soclase age age2 workhrs fem if (soclase > 0 & soclase <=6), baseoutcome(1)
ologit soclase age age2 workhrs fem if (soclase > 0 & soclase <=6)
** Comment: use 'baseoutcome' on mlogit to specify the reference category.
**   In older versions of Stata, 'basecategory' was used.

mlogit soclase fem if (soclase > 0 & soclase <=2), baseoutcome(1)
ologit soclase fem if (soclase > 0 & soclase <=2)
** Comment : compare these ologit/mlogit results for 2 categories
*   with the 6 category versions above, where ordinality mattered more

** Comment: note how regression by default uses listwise deletion of missing values
**   Good practice can be to make this explicit in your own programming:
summarize lnearn workhrs age age2
regress lnearn workhrs age age2
regress lnearn workhrs age age2 if (lnearn >= 1 & lnearn <= 10 ///
    & age >= 16 & age <= 100 & workhrs >= 1 & workhrs <= 100)
* (The latter format takes longer to write out, but is ultimately more reliable)

**********************.
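
* A related sketch (an addition for illustration, not part of the original
*   lab): after any estimation command, the function e(sample) marks the cases
*   that were actually used, so the estimation sample can be stored and reused.
*   The variable name 'insamp' is an arbitrary choice made for this sketch.
regress lnearn workhrs age age2
gen insamp=e(sample)
tab insamp
summarize lnearn workhrs age age2 if insamp==1
* (the summary statistics now describe exactly the cases behind the regression)
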
***** iii) Summarising model outputs:

** An impressive feature of Stata is its support for analysing and storing the
*   results of one or more regression models (see also section 1.6 above):

regress lnearn age age2 workhrs fem scotland wales

** i) testing coefficient differences
test scotland=wales
* (for the last active statistical model, tests whether the coefficient for
*   Scotland is significantly different to that for Wales - here it isn't)

** ii) impacts of coefficient effects
display 35^2
lincom _cons + 35*age + 1225*age2 + 40*workhrs + 0*fem + 1*scotland
lincom _cons + 35*age + 1225*age2 + 40*workhrs + 1*fem
** Lincom gives you the effects of the coefficients for combinations of values:
*   Predicted log income in 1995 for a 35 year old Scottish male working 40 hours
*   a week is 5.62 (276pw); for a 35 year old English female working 40 hours it
*   is 5.36 (213pw)

** iii) comparing different model results
est clear
regress lnearn
est store null
regress lnearn age age2 workhrs
est store mod1
regress lnearn age age2 workhrs fem
est store mod2
regress lnearn age age2 workhrs fem scotland wales
est store mod3
xi: regress lnearn age age2 workhrs fem i.fem*scotland i.fem*wales
est store mod4
est stats
est table null mod1 mod2 mod3 mod4, stats(N r2 bic)
est table mod1 mod2 mod3 mod4, stats(N r2 bic) star
est for mod1: adjust, by(age fem)
* (predicted values for combinations of age and gender)

****************************************************.
****************************************************.
****************************************************.


**********************************************************.
*****************************************************.
** Part 3: More data management: introducing file matching techniques
**********************************************************.
*** COMBINING DATASETS

*******************************************************
*******************************************************
*** 3.1) Adding Files (Eg, Repeated Cross-Sections; Panels)
*******************************************************

**** EXAMPLE : REPEATED CROSS-SECTION : POOLING DIFFERENT YEARS
****    FROM THE SCOTTISH SOCIAL ATTITUDES SURVEY

** The commands below open up 4 years of the Scottish Social
**   Attitudes survey, save a selection of variables, then add
**   them all together

use serial rsex rage marstat scotpar2 ukintnat wtfactor using $path4\ssa02.dta , clear
gen year=2002
save $path9\m1.dta, replace

use serial rsex rage marstat scotpar2 ukintnat wtfactor using $path4\ssa01.dta , clear
gen year=2001
save $path9\m2.dta, replace

use serial rsex rage marstat scotpar2 ukintnat wtfactor using $path4\ssa00.dta , clear
gen year=2000
save $path9\m3.dta, replace

use serialno rsex rage marstat scotpar2 ukintnat wtfactor using $path4\ssa99.dta , clear
gen year=1999
gen serial=serialno
drop serialno
save $path9\m4.dta, replace

* Comment: all these variables are collected and equivalent between
*   the various survey years; beware: it usually takes quite a
*   lot of effort to pick out appropriate variables like this

** Add them all together :
use $path9\m1.dta, clear
append using $path9\m2.dta
append using $path9\m3.dta
append using $path9\m4.dta
tab year
graph box scotpar2, over(rsex) over(year, label(alternate)) asyvars ///
    title("Opposition to independent / devolved Scotland")
summarize
save $path9\ssa1.dta , replace

** We've pooled data on variables that are equivalent in each ssa
*   sweep - there aren't many choices though - a lot of variables are
*   not harmonised between even just these 4 years of surveys
** Often, an analyst would compute their own harmonised variables
*   using the same variable names across years.
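
* A minimal sketch of a harmonisation check (an addition for illustration, not
*   part of the original lab): before analysing the pooled file, confirm that
*   each variable has a comparable range and coding in every survey year.
*   It assumes the pooled file ssa1.dta saved above is available in 'path9'.
use $path9\ssa1.dta, clear
sort year
by year: summarize rage scotpar2 ukintnat wtfactor
tab marstat year, missing
tab rsex year, missing
* (a category that appears in only some years, or a shifted range, suggests
*   that the variable is not yet harmonised across sweeps)
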
***************************************************************
*******************************************************
*** 3.2) One-to-One Matching (Eg, Studying transitions)
***************************************************************

*************************************************
**** EXAMPLE: MATCHING TWO BITS OF INFO FROM DIFFERENT TIME POINTS
**     TO THE SAME SUBJECTS

**** Link data from the 1994 BHPS youth samples (11-15yrs)
****   with data from the 2005 adult response (22-26 yrs) .

** 1994 youth records.
use pid dypcomp dypsex using $path3\dyouth.dta , clear
sort pid
summarize
save $path9\mtch1.dta, replace

use pid oqfedhi osex oage using $path3\oindresp.dta, clear
sort pid
merge pid using $path9\mtch1.dta
tab _merge
** The merge data here shows :
**   15,180 adults in wave o (2005) but not in d youth
**   326 kids in d youth but not in 2005 adult sample
**   447 are kids in d youth and also present in 2005 adult

tab dypsex osex
keep if (_merge==3)
tab oage
numlabel _all, add
tab oqfedhi
tab dypcomp
mvdecode oqfedhi , mv(-9,-7)
mvdecode dypcomp, mv(-9)
tab oqfedhi dypcomp, col
* The quals profile of comp users seems higher
gen educ3=oqfedhi
recode educ3 1/4=1 6=2 7/13=3 *=-9
tab educ3
mvdecode educ3, mv(-9)
tab educ3 dypcomp, col V

*******************************************************
*******************************************************
*** 3.3) One-to-Many Matching (Eg, Fixed in time information)
*******************************************************

*** Egs : - regional average distributed to all people in region
***       - household info distributed to all household members
***       - Individual fixed data distributed to multiple person records

** The BHPS has multiple records per person, but some of its data
**   is fixed in time, and thus could be distributed to several
**   different cases at once

** Example : household level details distributed to individual level data
**   from the BHPS

use $path3\ohhresp.dta, clear
summarize ohid ocd8use
numlabel _all, add
tab ocd8use
* This is one record per BHPS household
sort ohid
keep ohid ocd8use
save $path9\mtch1.dta, replace

use pid ohid osex oage using $path3\oindresp.dta, clear
summarize
** This is one record per BHPS person

* Do a one-to-many match :
sort ohid
merge ohid using $path9\mtch1.dta
tab _merge
* 1 = BHPS records who had no valid household level info (no valid cases)
* 2 = 6 cases = records with household info but no individual record
* 3 = 15627 = records with both household and individual info successfully linked

summarize
numlabel _all, add
tab ocd8use
table ocd8use if (ocd8use >= 0 & oage >=16), c(mean oage n oage)
graph box oage if (ocd8use >= 0 & oage >= 16) , over(ocd8use)

*******************************************************
*******************************************************
*** 3.4) Many-to-One Matching (Eg, Creating time series)
**     (aka: Aggregating)
*******************************************************

**** Examples : compute average income per sampling area and link back
****          : find modal vote per sampling area and link back
****          : find highest educational level in household and link back

** BHPS Illustration: highest education level in household, BHPS wave 12

use $path3\lindresp.dta, clear
numlabel _all, add
tab lqfedhi
keep if (lqfedhi > 0 & lqfedhi < 13)
gen qual5=lqfedhi
recode qual5 1=1 2=2 3/5=3 6/11=4 12=5
label define qual5l 1 "Higher degree" 2 "Degree" 3 "Diploma" ///
    4 "School or vocational" 5 "No qual"
label values qual5 qual5l
numlabel qual5l, add
tab qual5
summarize pid lhid qual5
* lhid is within wave household identifier, 138832 records

collapse (min) qual5, by(lhid)
summarize
* This has created a dataset where each record is a household per
*   year, and qual5 is the highest qualification held by anyone in
*   the household (the lowest code of qual5 is the highest qualification,
*   which is why the minimum is taken)
sort lhid
rename qual5 hhqual5
label variable hhqual5 "Highest qual held in household"
save $path9\mtch1.dta, replace

** Now match that information back to the panel file :
use $path3\lindresp.dta, clear
numlabel _all, add
tab lqfedhi
keep if (lqfedhi > 0 & lqfedhi < 13)
gen qual5=lqfedhi
recode qual5 1=1 2=2 3/5=3 6/11=4 12=5
label define qual5l 1 "Higher degree" 2 "Degree" 3 "Diploma" ///
    4 "School or vocational" 5 "No qual"
label values qual5 qual5l
numlabel qual5l, add
tab qual5
sort lhid
merge lhid using $path9\mtch1.dta
tab _merge
* Because we treated it beforehand, we get a perfect fit - all
*   cases get a household value linked to them (_merge==3)

tab qual5 hhqual5
* Q: How well is voting explained by own and by household education?
tab lvote
keep if (lvote==1 | lvote==2)
tab lvote qual5, row V
tab lvote hhqual5, row V
* (only for those who are different to hhld value)
tab lvote qual5 if (qual5 ~=hhqual5), row V
tab lvote hhqual5 if (qual5 ~=hhqual5), row V

*******************************************************
*** 3.5) Relationships-between-cases Matching

**** Example : - BHPS, Match a spouse's job to own case
****   (some other BHPS matching examples are found in lab 2)

*** Link spouse's job status to an individual record .
use ljbstat lsppid using $path3\lindresp.dta , clear
rename lsppid pid
rename ljbstat spjstat
label variable spjstat "Spouse's job status"
sort pid
summarize
save $path9\mtch1.dta , replace

use pid ljbstat lsppid lmastat lsex using $path3\lindresp.dta, clear
sort pid
merge pid using $path9\mtch1.dta
tab _merge
** _merge=1 : 6825 cases on the current file without spouses (the same 6825)
** _merge=2 : 6825 incoming cases without spouses
** _merge=3 : 9774 cases with spouse's data matched

tab lmastat _merge
* (ie lmastat is only valid for merge=1 or 3)
tab ljbstat spjstat if (_merge==3)
sort lsex
by lsex: tab ljbstat spjstat if (_merge==3)
by lsex: tab spjstat ljbstat if (_merge==3 & (ljbstat==2 | ljbstat==3)), col
** These tables show the relation between own job status and spouse's

*************************************************
************************************************
*************************************************
************************************************

capture log close
* closes the log file

*************************************************
************************************************
** FURTHER RESOURCES: INTRODUCING STATA

** THERE ARE NUMEROUS INTERNET RESOURCES COVERING WORKING WITH STATA, SEE:
**    http://www.longitudinal.stir.ac.uk/Stata_support.html
** FOR AN INTRODUCTION TO MANY OF THEM

** WE ALSO HAVE SOME FURTHER EXAMPLE FILES FROM A 'SCOTTISH SOCIAL SURVEY
**   NETWORK' MASTER CLASS ON STATA, WITH EXAMPLE DO FILES COVERING:
**
**   - handling coefficients (things to do with model estimates and outputs)
**   - sample selected data (introductory examples of selection models)
**   - multilevel data and analysis (introductory examples of multilevel models)
**
** SEE THE THREE DO FILES AVAILABLE FROM:
**
**   http://www.longitudinal.stir.ac.uk/surveynetwork/master_classes.html#stata
**
*************************************************
************************************************

**** EOF