*******************************************************************.
***** Census Programme Workshop on Spatial and Social Classification .
***** LEEDS, 8 June 2010 .
***** Practical session on the GESDE services .
***** Prepared by Paul Lambert, University of Stirling.
***** For further information, see the supplementary practical session handout.
***** (handout and command files are also available from:
***** http://www.dames.org.uk/workshops/ ).
*********************************************************************.
*********************************************************************.
*********************************************************************.
*********************************************************************.
*** Part 1: Using specialist data resources to obtain and exploit sociological classifications.
*********************************************************************.
************************************************************************************************************.
************************************************************************************************************.
************* PRELIMINARIES: DEFINE FILE LOCATIONS .
** The next lines of the file set the paths that will be used in the part1 exercises
* (you should edit the paths to appropriate locations on your own machine).
define !path1 () 'C:\dames\gemde\workshops\leeds\demo\data\' !enddefine. /* Where the derived extract files are saved */
define !path2 () 'C:\dames\gemde\workshops\leeds\demo\outputs\' !enddefine. /* Location of your working file for outputs and your own syntax files */
define !path5a () 'C:\dames\gemde\workshops\leeds\demo\occ_info\' !enddefine. /* Where the specialist occupational information data is stored (see Exercise 1.1) */
define !path5b () 'C:\dames\gemde\workshops\leeds\demo\educ_info\' !enddefine.
/* Where the specialist educational information data is stored (see Exercise 1.2) */
define !path5c () 'C:\dames\gemde\workshops\leeds\demo\eth_info\' !enddefine. /* Where the specialist data on ethnicity is stored (see Exercise 1.3) */
define !path9 () 'c:\temp\' !enddefine. /* A location where temporary files can be saved */.
* (In the Leeds lab session, we ought to be able to tell you new locations for each of these paths at the start of the lab).
****************************************************.
** In the first three examples in Part 1, we'll use data extracts from four secondary survey datasets which are
* freely available to academic researchers (either from the UK Data Archive, or from the European Social Survey).
** To speed up the lab we have used derived extracts from these surveys, rather than leave you to access the original data.
* For replication purposes, however, you can generate the derived extracts from the original data by using the
* following subfile (i.e., uncomment the line below and run it if you want to run the extract process from the start).
*include file=!path1+"gesde_workshop_data_setup_8jun2010.sps".
* (If you're doing this exercise outwith the Leeds workshop session, you'll need to do this in advance).
**************************************************
** Additionally, in example 1.4, we'll use a unique dataset prepared for academic research which is currently
* only available on request from its authors.
************************************************************************************************************.
************************************************************************************************************.
*********************************************************************.
*********************************************************************.
*** Example 1.1: Handling Occupational data .
*********************************************************************.
***********************************************************************************.
*** i) DERIVING OCCUPATION-BASED SOCIAL CLASSIFICATIONS IN THE UK, FROM A SURVEY
* DATASET WITH OCCUPATIONAL TITLES:.
** Open an illustrative microdata extract
* (generated via the file 'gesde_workshop_data_setup_8jun2010.sps' above).
get file=!path1+"lfs_2002extract.sav".
fre var=soc2km ukempst sex .
** soc2km : SOC-2000 4-digit unit group occupations
* (http://www.geode.stir.ac.uk/ougs.html#soc2000).
** ukempst : Employment status using UK standard measure
* (http://www.geode.stir.ac.uk/ougs.html#ukempst).
** sex : Gender, 0 for male, 1 for female .
** (The data is an extract from the Labour Force Survey 2002 teaching dataset,
* see http://www.data-archive.ac.uk/findingData/snDescription.asp?sn=4736 ).
** The SOC2000 codes are typical of the starting data you may have with occupational data
* (e.g. if you collected the survey yourself, and used CASCOT to code occupational descriptions
* into SOC2000 units, see http://www2.warwick.ac.uk/fac/soc/ier/publications/software/cascot/ ).
*** TASK: Add some standard occupation-based social classifications to the survey data .
** SOLUTION: There are a number of internet sites where you can access translation codes and
* instructions to do this for different occupation-based measures. The conventional way to do this is
* to go and find one of those files, then write a programme to exploit the relevant file.
* The example below illustrates the process by using the CAMSIS data files for UK occupational data,
* which are available at: http://www.camsis.stir.ac.uk/versions.html#Britain .
** i) Match male and female CAMSIS scores and other classifications, using 4-digit SOC and ukempst .
get file=!path5a+"gb91soc2000.sav" .
rename variables (soc2000=soc2km).
sort cases by soc2km ukempst.
sav out=!path9+"mtch1.sav" /keep=soc2km ukempst mcamsis fcamsis rgsc ns_sec.
get file=!path1+"lfs_2002extract.sav".
sort cases by soc2km ukempst .
match files file=* /table=!path9+"mtch1.sav" /by=soc2km ukempst . descriptives var=sex soc2km ukempst mcamsis fcamsis rgsc sex ns_sec. ** Output: We've enhanced our LFS data file by adding in some new occupation-based social * classifications as additional variables in the data (scroll right on the 'data view' to see this). ** (Here we've added measures only for people currently working, but there are various * strategies to match the same classifications for the non-working population as well). ** Adapt CAMSIS by generating a cross-gender scale, using male scores for men, female scores for women :. compute camsis=-999. if (sex=0) camsis=mcamsis. if (sex=1) camsis=fcamsis. missing values camsis (-999). variable label mcamsis "CAMSIS for males (using 4-digit SOC2000 + ES)". variable label fcamsis "CAMSIS for females (using 4-digit SOC2000 + ES)". variable label camsis "CAMSIS (using 4-digit SOC2000 + ES)". correlate mcamsis fcamsis camsis. ** Demo - an output :. fre var= levqual . missing values levqual (lo thru 0). graph /bar=mean(camsis) by sex by levqual. * Save this data file - we'll use it again shortly. sav out=!path9+"file1.sav". *****************************. ** Some extensions/variations on that exercise :. ** ii) Matching the same classifications but using 4-digit SOC only (not using employment status) . get file=!path5a+"gb91soc2000.sav" . rename variables (soc2000=soc2km). select if (ukempst=0). sort cases by soc2km . sav out=!path9+"mtch2.sav" /keep=soc2km mcamsis fcamsis rgsc /rename(mcamsis fcamsis rgsc=mcamsis2 fcamsis2 rgsc2) . get file=!path9+"file1.sav". sort cases by soc2km . match files file=* /table=!path9+"mtch2.sav" /by=soc2km . descriptives var=sex soc2km mcamsis fcamsis rgsc mcamsis2 fcamsis2 rgsc2 . compute camsis2=-999. if (sex=0) camsis2=mcamsis2. if (sex=1) camsis2=fcamsis2. missing values camsis2 (-999). variable label mcamsis2 "CAMSIS for males (using 4-digit SOC2000)". 
variable label fcamsis2 "CAMSIS for females (using 4-digit SOC2000)".
variable label camsis2 "CAMSIS (using 4-digit SOC2000)".
correlate mcamsis2 fcamsis2 camsis2.
*.
sav out=!path9+"file2.sav".
** iii) Matching using 1-digit SOC and ukempst (i.e., imagine you didn't have more precise occupational data).
get file=!path1+"lfs_2002extract.sav".
fre var=sc2kmmj.
get file=!path5a+"gb91soc2000.sav" .
rename variables (soc2000=sc2kmmj).
sort cases by sc2kmmj ukempst .
sav out=!path9+"mtch3.sav" /keep=sc2kmmj ukempst mcamsis fcamsis rgsc /rename(mcamsis fcamsis rgsc = mcamsis3 fcamsis3 rgsc3) .
get file=!path9+"file2.sav".
fre var=sc2kmmj.
sort cases by sc2kmmj ukempst .
match files file=* /table=!path9+"mtch3.sav" /by=sc2kmmj ukempst .
descriptives var=sex soc2km sc2kmmj ukempst mcamsis fcamsis rgsc mcamsis2 fcamsis2 rgsc2 mcamsis3 fcamsis3 rgsc3.
compute camsis3=-999.
if (sex=0) camsis3=mcamsis3.
if (sex=1) camsis3=fcamsis3.
missing values camsis3 (-999).
variable label mcamsis3 "CAMSIS for males (using 1-digit SOC2000 + ES)".
variable label fcamsis3 "CAMSIS for females (using 1-digit SOC2000 + ES)".
variable label camsis3 "CAMSIS (using 1-digit SOC2000 + ES)".
correlate mcamsis3 fcamsis3 camsis3.
*** How do the measures at different levels of aggregation compare? .
correlate variables=mcamsis fcamsis rgsc mcamsis2 fcamsis2 rgsc2 mcamsis3 fcamsis3 rgsc3 .
*** How do some of the different possible measures compare ?.
** Outcome variable: whether gross weekly pay exceeds 400 pounds .
graph /histogram=grsswk.
compute hi_inc=(grsswk > 400).
fre var=hi_inc .
** Explanatory variables : age, gender and occupation-based social classification .
fre var=sex age.
fre var=ns_sec.
compute ns_service=(ns_sec >= 1 & ns_sec <= 2).
fre var=ns_service.
descriptives var=hi_inc sex age camsis camsis2 camsis3 rgsc rgsc2 ns_sec ns_service .
logistic regression var=hi_inc /method=enter sex age camsis . /* Nag R2 = 0.389 */.
logistic regression var=hi_inc /method=enter sex age camsis2 .
/* Nag R2 = 0.387 */ . logistic regression var=hi_inc /method=enter sex age camsis3 . /* Nag R2 = 0.405 */ . logistic regression var=hi_inc /categorical=rgsc /method=enter sex age rgsc . /* Nag R2 = 0.379 */. logistic regression var=hi_inc /categorical=ns_sec /method=enter sex age ns_sec . /* Nag R2 = 0.399 */. logistic regression var=hi_inc /method=enter sex age ns_service . /* Nag R2 = 0.377 */. ** Comment: All the different variable operationalisations tell marginally different stories * (both about their own main effects, and in influencing the other variables' main effects). * We'd argue that looking at these patterns is especially important when: * - you're interested in relatively precise influences and/or using small sample data where power is lower * - considering interaction effects (as the functional form matters more, and influence on other vars is greater) ******************************************************. ******************************************************. ******************************************************. ***********************************************************************************. *** ii) DERIVING OCCUPATION-BASED SOCIAL CLASSIFICATIONS IN COMPARATIVE RESEARCH * - AN EXAMPLE FROM THE EUROPEAN SOCIAL SURVEY . ** Starting with the original ESS data file . get file=!path1+"ess_2001extract.sav". fre var=cntry . * (we limit attention to a subset of cases from nine different countries). ** Occupation data : . fre var=gndr emplrel jbspv njbspv iscoco . sort cases by cntry. split files by cntry. descriptives var= emplrel jbspv njbspv iscoco . split files off. *** TASK: Add some standard occupation-based social classifications to the survey data . ** SOLUTION: There are a number of internet sites where you can access translation codes and * instructions to do this for different occupation-based measures across the countries. 
* In this example we'll illustrate a couple of deliberately international resources which can be used
* for this purpose - the CAMSIS scale files for different countries, and the ISEI cross-national scales.
*.
***************************************.
** ISEI: Requires the file iskoisei.sps, downloaded from : http://home.fsw.vu.nl/~ganzeboom/pisa/ .
define @isko () iscoco. !enddefine.
define @isei () isei. !enddefine.
include file=!path5a+"iskoisei.sps".
descriptives var=iscoco isei.
missing values isei (-999).
descriptives var=iscoco isei.
graph /bar=mean(isei) by cntry by gndr .
* Note - ISEI match is fairly straightforward because it only needs
* one index variable (iscoco) and is the same across countries.
*****************************************.
*** CAMSIS :.
* Note - CAMSIS match is more difficult because it needs two
* index variables (iscoco and employment status), and because it works
* in different ways for the nine different countries .
fre var=cntry.
**
* i) preliminary - need ISCO and employment status variables standardised across countries.
* (the ES variable was calculated in the extract file).
fre var=iscoco stdempst.
compute idno=$casenum .
sort cases by cntry idno .
sav out=!path9+"temp.sav".
***************************.
** Locate occupational information one cntry at a time, adjusting any values
* to coincide with ESS data, and merging with the ESS sample for that cntry.
** Britain :.
get file=!path5a+"gb91isco88.sav". /* Available at http://www.camsis.stir.ac.uk/Data/Britain91.html */.
fre var=stdempst .
* no 5 is covered so recode 5 to make 0.
sort cases by isco88 stdempst.
sav out=!path9+"mtch1.sav" /keep=isco88 stdempst mcamsis fcamsis /rename(isco88=iscoco).
get file=!path9+"temp.sav" /keep=cntry idno iscoco stdempst.
select if (cntry="GB").
recode stdempst (5=0).
sort cases by iscoco stdempst.
match files file=* /table=!path9+"mtch1.sav" /by=iscoco stdempst.
descriptives var=iscoco mcamsis fcamsis.
sort cases by cntry idno.
sav out=!path9+"ess1.sav". ** Czech Republic . get file=!path5a+"cz94isco88.sav". /* http://www.camsis.stir.ac.uk/Data/CzechRepublic.html */. sort cases by isco88 stdempst. fre var=stdempst. * no 2 or 5 so merge them to 1. sav out=!path9+"mtch1.sav" /keep=isco88 stdempst mcamsis fcamsis /rename(isco88=iscoco) . get file=!path9+"temp.sav" /keep=cntry idno iscoco stdempst. select if (cntry="CZ"). recode stdempst (2,5=1). sort cases by iscoco stdempst. match files file=* /table=!path9+"mtch1.sav" /by=iscoco stdempst. descriptives var=iscoco mcamsis fcamsis. sort cases by cntry idno. sav out=!path9+"ess2.sav". ** Hungary . get file=!path5a+"hu96isco88.sav". /* http://www.camsis.stir.ac.uk/Data/Hungary96.html */. sort cases by isco88 stdempst. fre var=stdempst. * No 5, so recode it to 0. sav out=!path9+"mtch1.sav" /keep=isco88 stdempst mcam fcam /rename(isco88=iscoco) (mcam fcam = mcamsis fcamsis). get file=!path9+"temp.sav" /keep=cntry idno iscoco stdempst. select if (cntry="HU"). recode stdempst (5=0). sort cases by iscoco stdempst. match files file=* /table=!path9+"mtch1.sav" /by=iscoco stdempst. descriptives var=iscoco mcamsis fcamsis. sort cases by cntry idno. sav out=!path9+"ess3.sav". ** Ireland . get file=!path5a+"ie96isco88.sav". /* http://www.camsis.stir.ac.uk/Data/Ireland96.html */. sort cases by isco88 stdempst. fre var=stdempst. * only zero so code all to zero. * (remove duplicate cases ). compute first=1. if ( (lag(isco88)=isco88) & (lag(stdempst)=stdempst) ) first=0. fre var=first. select if (first=1). sav out=!path9+"mtch1.sav" /keep=isco88 stdempst mcamsis fcamsis /rename(isco88=iscoco) . get file=!path9+"temp.sav" /keep=cntry idno iscoco stdempst. select if (cntry="IE"). recode stdempst (2,5,6=0). sort cases by iscoco stdempst. match files file=* /table=!path9+"mtch1.sav" /by=iscoco stdempst. descriptives var=iscoco mcamsis fcamsis. sort cases by cntry idno. sav out=!path9+"ess4.sav". ** Poland :. get file=!path5a+"plcherisco88.sav". 
/* http://www.camsis.stir.ac.uk/Data/Poland.html */.
sort cases by isco88 .
compute stdempst=0.
* (no info so must code all to 0).
sav out=!path9+"mtch1.sav" /keep=isco88 stdempst mcamsis fcamsis /rename(isco88=iscoco2) .
get file=!path9+"temp.sav" /keep=cntry idno iscoco stdempst.
select if (cntry="PL").
sort cases by iscoco stdempst.
compute iscoco2=trunc(iscoco/100).
recode stdempst (2,5,6=0).
sort cases by iscoco2 stdempst.
match files file=* /table=!path9+"mtch1.sav" /by=iscoco2 stdempst.
descriptives var=iscoco mcamsis fcamsis.
sort cases by cntry idno.
sav out=!path9+"ess5.sav" /drop=iscoco2.
** Portugal :.
get file=!path5a+"ptcherisco88.sav". /* http://www.camsis.stir.ac.uk/Data/Portugal.html */.
sort cases by isco88 .
compute stdempst=0.
* (no info so must code all to 0).
sav out=!path9+"mtch1.sav" /keep=isco88 stdempst mcamsis fcamsis /rename(isco88=iscoco2) .
get file=!path9+"temp.sav" /keep=cntry idno iscoco stdempst.
select if (cntry="PT").
sort cases by iscoco stdempst.
compute iscoco2=trunc(iscoco/100).
recode stdempst (2,5,6=0).
sort cases by iscoco2 stdempst.
match files file=* /table=!path9+"mtch1.sav" /by=iscoco2 stdempst.
descriptives var=iscoco mcamsis fcamsis.
sort cases by cntry idno.
sav out=!path9+"ess6.sav" /drop=iscoco2.
** Slovenia :.
get file=!path5a+"sv94isco88.sav". /* http://www.camsis.stir.ac.uk/Data/Slovenia.html */.
sort cases by isco88 stdempst. /* Sorted on both match keys, as MATCH FILES /TABLE requires */.
fre var=stdempst.
* (Just a 1 and 6 so recode 2 and 5 to 1).
sav out=!path9+"mtch1.sav" /keep=isco88 stdempst mcamsis fcamsis /rename(isco88=iscoco) .
get file=!path9+"temp.sav" /keep=cntry idno iscoco stdempst.
select if (cntry="SI").
recode stdempst (2,5=1).
sort cases by iscoco stdempst.
match files file=* /table=!path9+"mtch1.sav" /by=iscoco stdempst.
descriptives var=iscoco mcamsis fcamsis.
sort cases by cntry idno.
sav out=!path9+"ess7.sav" .
** Sweden :.
get file=!path5a+"se90isco88.sav". /* http://www.camsis.stir.ac.uk/versions.html#Sweden */.
sort cases by isco stdempst. /* Sorted on both match keys, as MATCH FILES /TABLE requires */.
fre var=stdempst.
* just 2 and 6 so recode the 5 to 0.
sav out=!path9+"mtch1.sav" /keep=isco stdempst mcamsis fcamsis /rename(isco=iscoco) .
get file=!path9+"temp.sav" /keep=cntry idno iscoco stdempst.
select if (cntry="SE").
recode stdempst (5=0).
sort cases by iscoco stdempst.
match files file=* /table=!path9+"mtch1.sav" /by=iscoco stdempst.
descriptives var=iscoco mcamsis fcamsis.
sort cases by cntry idno.
sav out=!path9+"ess8.sav" .
** Switzerland :.
get file=!path5a+"ch90isco88.sav". /* http://www.camsis.stir.ac.uk/Data/Switzerland90.html */.
sort cases by isco88 stdempst. /* Sorted on both match keys, as MATCH FILES /TABLE requires */.
fre var=stdempst.
* 2, 5 and 6 all present so no adjustment needed.
sav out=!path9+"mtch1.sav" /keep=isco88 stdempst mcamsis fcamsis /rename(isco88=iscoco) .
get file=!path9+"temp.sav" /keep=cntry idno iscoco stdempst.
select if (cntry="CH").
sort cases by iscoco stdempst.
match files file=* /table=!path9+"mtch1.sav" /by=iscoco stdempst.
descriptives var=iscoco mcamsis fcamsis.
sort cases by cntry idno.
sav out=!path9+"ess9.sav" .
***** Combine all CAMSIS data from the 9 countries :.
add files file=!path9+"ess1.sav" /file=!path9+"ess2.sav" /file=!path9+"ess3.sav"
   /file=!path9+"ess4.sav" /file=!path9+"ess5.sav" /file=!path9+"ess6.sav"
   /file=!path9+"ess7.sav" /file=!path9+"ess8.sav" /file=!path9+"ess9.sav"
   /by=cntry idno.
descriptives var=all.
sort cases by cntry idno.
** Link it to the original data file:.
match files file=!path9+"temp.sav" /file=* /by=cntry idno.
compute cs=-999.
if (gndr=1 & mcamsis gt 0) cs=mcamsis.
if (gndr=2 & fcamsis gt 0) cs=fcamsis.
missing values cs (-999).
descriptives var=iscoco mcamsis fcamsis cs.
****************************************.
** End of matching .
descriptives var= isei cs mcamsis fcamsis .
missing values isei cs mcamsis fcamsis (-999).
descriptives var=isei cs mcamsis fcamsis.
fre var=gndr.
sort cases by gndr.
split files by gndr.
graph /bar=mean(isei) mean(cs) by cntry.
split files off.
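** Sketch (an addition to the workshop file, not part of the original exercise): before reading
* too much into the graph above, it may help to check how many cases received a CAMSIS score
* in each country. This assumes the variable 'cs' created above; the flag 'cs_ok' is a new,
* purely illustrative name.
compute cs_ok = 1 - missing(cs).
variable label cs_ok "CAMSIS score available after matching (illustrative flag)".
crosstabs /tables=cntry by cs_ok /cells=count row.
* Note that cs_ok=0 covers people with no occupational data as well as failed matches, so a low
* rate in one country is only a prompt to re-check its ISCO / employment status recodes.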
* Comment - ISEI national differences should reflect genuine national differences in occupational structure (e.g. more
* professionalised jobs in Switzerland, UK). CAMSIS scores are meant to be standardised within countries so
* do not have the same interpretation: national differences in their profiles may represent sampling differences.
**Some patterns of associations : .
fre var=hincfel.
correlations variables=isei cs mcamsis fcamsis hincfel .
** cntry specific.
sort cases by cntry.
split files by cntry.
correlations variables=isei cs mcamsis fcamsis hincfel .
split files off.
*****************************************************************.
******************************************************************.
*********************************************************************.
*********************************************************************.
*********************************************************************.
*********************************************************************.
*** Example 1.2: Handling Educational data .
*********************************************************************.
*********************************************************************.
**.
* 1) RECODE EXAMPLE - RECODING EDUCATIONAL DATA .
**************.
** Question: How do we recode educational qualifications data?
*
** There are various publications where schemas for educational qualifications categories
* are proposed or documented
*
** Here's an example from the BHPS, which uses data on the age at which school was completed, and
* highest educational qualification recorded at point of interview.
**************.
get file=!path1+"bhps_educ_extract.sav". /* An extract produced by the file gesde_workshop_data_setup_8jun2010.sps */.
descriptives var=all.
* (These are the easy-to-access educational measures from the BHPS;
* other information on qualifications is available but may need linkage between waves).
***** 1) An approximation to ISCED (International Standard Classification of Education): .
** This uses Table 2 from: * * Brynin, M. (2003). Using CASMIN: The Effect of Education on Wages in Britain and Germany. * In J. H. P. Hoffmeyer-Zlotnik & C. Wolf (Eds.), Advances in Cross-National Comparison: * A European Working Book for Demographic and Socio-Economic Variables (pp. 327-344). * New York: Kluwer Academic. *. fre var=qqfedhi qqfachi qqfvoc scend. cro table=qqfedhi by qqfachi /missing = include. compute isced=-9. if (qqfedhi=12 & scend <= 11) isced=11 . if (qqfedhi=12 & ((scend >= 12 & scend <= 16) | scend < 0 | (scend >= 17 & scend <= 19)) ) | (qqfedhi=11 & qqfachi=7) isced=12 . if qqfedhi=8 | (qqfvoc=1 & (qqfachi=6 | qqfachi=7)) | (qqfedhi=4 | qqfedhi=5) | qqfedhi=9 | qqfachi=6 | (qqfedhi=11 & qqfachi=7 & qqfvoc=1) isced=13 . if (qqfedhi=7 | qqfachi=5 ) isced=22. if (qqfedhi=10 | (qqfedhi=7 & qqfvoc=1) | (qqfvoc=1 & (qqfedhi=4 | qqfedhi=5)) ) isced=21 . if (qqfachi=4 | qqfedhi=6 | (qqfedhi=4 & qqfachi=4) | (qqfedhi=4 & qqfachi=5) ) isced=23 . if (qqfvoc=1 & (qqfedhi=6 | (qqfedhi=4 & qqfachi=4) | (qqfedhi=4 & qqfachi=5))) | ((qqfedhi=4 | qqfedhi=5) & qqfachi=4) isced=24 . if (qqfachi=3) isced=31 . if (qqfedhi=1 | qqfedhi=2 | qqfachi=2) isced=32 . add value labels isced 11 "1a Incomplete" 12 "1b Elementary" 13 "1c Basic vocational" 21 "2a Intermediate vocational (+ intermediate general)" 22 "2b Intermediate general" 23 "2c General: General maturity certificate" 24 "2d Vocational: Vocational maturity (+with general maturity)" 31 "3a Lower tertiary" 32 "3b Higher tertiary" . fre var=isced. cro table=isced by qqfedhi. sort cases by isced. split files by isced. cro table=qqfedhi by qqfachi /missing=include. split files off. missing values isced (-9). fre var=isced. missing values scend (-9 thru -1). means tables scend by isced /cells=mean count . ****** Another BHPS oriented classification which seeks to reduce age correlation . * [This scheme as advocated by Paul Lambert, Univ. Stirling, August 2009]. compute educ4=qqfedhi. 
recode educ4 (1, 2=1) (3, 4, 5=2) (6, 7, 8, 10, 11=3) (9, 12=4) (else=-9).
variable label educ4 "BHPS 4-fold educational level classification".
add value labels educ4 1 "Degree" 2 "Diploma" 3 "Vocational or higher school level" 4 "Low school level or below" .
missing values educ4 (-9).
cro table = qqfedhi by educ4 .
fre var=educ4.
*** Some review analysis: .
cro table= educ4 by isced /statistics=phi.
**check on age cohort patterns.
missing values qage qqfedhi (-9 thru -1) .
descriptives var=qage qqfedhi educ4 isced .
temp.
select if (qage >= 22).
means tables=qage by qqfedhi /statistics=anova. /* eta-squared = 0.25 */.
temp.
select if (qage >= 22).
means tables=qage by educ4 /statistics=anova. /* eta-squared = 0.13 */.
temp.
select if (qage >= 22).
means tables=qage by isced /statistics=anova. /* eta-squared = 0.20 */.
* Comment: Age correlation for isced is less than the original codes (qqfedhi), but is still high.
** Construct validity test - correlation to occupational advantage for those in work.
missing values qjbcssm (-9 thru -1).
descriptives var=qjbcssm qage qqfedhi educ4 isced .
temp.
select if (qage >= 22).
glm qjbcssm by qqfedhi with qage /print=parameter. /* r2 = 0.248 */.
temp.
select if (qage >= 22).
glm qjbcssm by educ4 with qage /print=parameter. /* r2 = 0.233 */.
temp.
select if (qage >= 22).
glm qjbcssm by isced with qage /print=parameter. /* r2 = 0.258 */.
** Summary:
* ISCED is useful because it is a recognised standard used in other countries, and performs well for explanatory power.
* educ4 is attractive because it is simple to describe and not as strongly correlated to age as some alternatives,
* whilst not being dramatically worse for explanatory power .
*********************************************************************.
*********************************************************************.
*********************************************************************.
**ii) SCALING EXAMPLE - SCALING EDUCATIONAL DATA .
** Scaling educational qualifications categories is often helpful when dealing with a comparative
* dataset which spans a long time period or very different countries.
get file=!path1+"ghs7204_extract.sav".
fre var=pedfull.
means tables=year by pedfull /cells=mean count .
* The fact that the means are not equal in all categories suggests that adjustment for time period or age may be sensible.
fre var=degree.
missing values degree (-5).
compute degree2=-1.
if (page >= 25 & page <= 35) degree2=degree.
missing values degree2 (-5, -1).
variable label degree2 "Highest qualification is a degree (Ages 25-35 only)".
graph /bar=mean(degree) mean(degree2) by year /missing=variablewise /title="Proportion of adults with a degree".
* Some strong patterns of change here:
* Proportions with degree level qualifications change over time;
* The change is non-linear, with a particular jump in the mid 1990s.
*** A plausible scaling strategy:
* - scale attainment categories according to a plausible advantage indicator;
* - within birth cohorts, standardise that fixed scale around the birth cohort distribution of qualifications .
* i) Derive a scale ranking for the whole sample: .
fre var=pgenhlth pcigsmk .
missing values pgenhlth pcigsmk (-5).
temp.
select if (page > 25 & page <= 50). /* Not too young, to allow for finished education; not too old, to reduce the age-health correlation */.
means tables = pgenhlth pcigsmk by pedfull / cells=mean count.
** We can use the well known health-education link as a rough and ready means of scaling qualifications in general.
sort cases by pedfull.
sav out = !path9+"m1.sav".
select if (page > 25).
compute pgenhlth2=pgenhlth.
recode pgenhlth2 (1=3) (2=2) (3=1).
fre var=pgenhlth2.
aggregate outfile=!path9+"m2.sav" /break=pedfull /escore_1=mean(pgenhlth2) /escore_2=mean(pcigsmk) .
get file=!path9+"m1.sav".
sort cases by pedfull.
match files file=* /table=!path9+"m2.sav" /by=pedfull.
descriptives var=all.
graph /bar=mean(escore_1) mean(escore_2) by pedfull.
graph /errorbar=escore_1 by pedfull by psex.
graph /errorbar (ci 95) =escore_2 by pedfull by psex .
* Both are reasonable indicators but both feature a couple of counter-intuitive orders, possibly due to age/health relations.
* => consider using a combined scale that sums the standardised versions of these two measures: .
means tables=escore_1 escore_2 by pedfull.
compute escore = ((escore_1 - 2.51) / 0.136 ) + ((escore_2 - 4.20) / 0.457). /* Sum of standardised scores*/.
graph /errorbar=escore by pedfull by psex.
* It's debatable - but this seems a fair scale ranking of educational qualifications .
if (pedfull=-5) escore=-5.
missing values escore (-5).
variable label escore "Scale of relative value of educational attainment (using health patterns)".
graph /bar=mean(escore) by pedfull .
* ii) Now re-standardise that ranking according to relative distribution in that birth cohort:.
descriptives var=escore page year.
compute bcoh= year - page.
graph /histogram=bcoh.
recode bcoh (lo thru 1910=1910) (1980 thru hi=1980).
graph /histogram=bcoh.
graph /bar=mean(escore) by bcoh.
temp.
select if (page >= 25).
graph /bar=mean(escore) by bcoh.
sort cases by bcoh.
sav out=!path9+"m3.sav".
select if (page >= 25).
aggregate outfile=!path9+"m4.sav" /break=bcoh /emean=mean(escore) /esd=sd(escore).
get file=!path9+"m3.sav".
sort cases by bcoh.
match files file=* /table=!path9+"m4.sav" /by=bcoh.
compute zescore=(escore - emean) / esd .
variable label zescore "Relative educational advantage, by birth cohort".
graph /bar=mean(escore) mean(zescore) by bcoh.
graph /bar=mean(zescore) by bcoh by degree.
** Comment: We'd argue that zescore is a much better indicator of relative educational attainment.
* E.g. the last graph suggests that the relative educational advantage of having a degree has halved
* over the range of birth cohorts represented in the GHS dataset.
*****************************************************.
*****************************************************.
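** Sketch (an addition, not part of the original exercise): the combined scale 'escore' above
* relies on hard-coded means and standard deviations (2.51, 0.136, 4.20, 0.457). If the extract
* changed, one possible alternative is to let SPSS derive case-level z-scores with
* DESCRIPTIVES /SAVE. The names 'zesc1', 'zesc2' and 'escore_alt' are illustrative only, and
* the result need not equal 'escore' exactly, since the constants above may have been
* calculated on a different basis.
descriptives variables=escore_1 (zesc1) escore_2 (zesc2) /save.
compute escore_alt = zesc1 + zesc2.
if (pedfull=-5) escore_alt=-5.
missing values escore_alt (-5).
variable label escore_alt "Illustrative alternative to escore (sum of saved z-scores)".
correlations variables=escore escore_alt.
* A correlation close to 1 would suggest the two constructions rank the qualifications alike.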
*********************************************************************. *********************************************************************. *********************************************************************. *********************************************************************. *** Example 1.3: Handling data on ethnicity . *********************************************************************. ********** GEMDE INTRODUCTORY EXERCISE ********** . ******* ILLUSTRATIVE EXAMPLE OF DATA FOR MUGS AND MIRS: BHPS WAVE Q DATA ** ** **** GEMDE is a product of the ESRC funded DAMES Research Node, www.dames.org.uk, hosted at **** University of Stirling and National e-Science Centre, University of Glasgow ****************************************************. *******************************************************. *** 1) Open the data and explore the information on ethnicity and religion. *** . get file=!path1+"gemde_bhps_extract.sav". fre var= race racel. * 14909 people in this extract: most but not all have data for 'race' (close to the 1991 census question) * and most but not all for 'racel' (close to the 2001 census question). sort cases by race. split files by race. fre var=racel. split files off. * Some have data on one but not the other; some have conflicting values between questions. *******************************************************. *******************************************************. *** 2) Define a new MUG, and document it: . * For analytical purposes, we'd typically recode ethnicity measures before analysis, * combining together sparse categories. **** i) Class led example: . * Exploring the data for patterns / possible recodes. fre var= racel memorig. descriptives var=qage . means tables=qage by racel /cells=mean count. compute eth2=racel. recode eth2 (1 thru 4=1) (6 thru 17=2) (5, 18=3) (else=-9). /* Recodes to three groups */. if (racel=2 & memorig ~= 7) eth2=3 . /* Makes 'white irish' a minority only if not in NI sample */. 
add value labels eth2 1 "White UK" 2 "Black or Asian" 3 "Other white/other".
fre var=eth2.
cro table=racel by eth2.
* => We've created a new, 3 category MUG which is intended to differentiate ethnic groups in Britain .
* with an emphasis on typical age profiles .
** This was the MUG: 1 "White UK" 2 "Black or Asian" 3 "Other white/other" .
*******************************************************
*******************************************************
*** 3) Exploit an existing MIR to link the two MUGs 'race' and 'racel'
** => At GEMDE, I've deposited an SPSS macro for 'harmonising' the 'race' and 'racel' measures
** (It can be found on GEMDE, or amongst the files for the workshop session: bhps_ethnicity_combined.sps).
fre var= race racel.
** Tasks of the macro: follow the advice recommended by ONS on harmonising ethnicity,
* plus take advantage of the household clustering in BHPS to impute race for those with
* missing values on both variables but with valid values for some household sharers .
include file=!path5c+"bhps_ethnicity_combined.sps".
file handle tempdir /name='c:\temp' .
xethbhps race=race racel=racel hid=qhid xeth=xeth xethh=xethh.
** => We've now created two new variables 'xeth' and 'xethh' which are 'harmonised' measures.
descriptives var=xeth xethh.
fre var=xeth xethh.
*******************************************************
*******************************************************
*** 4) Create a new MIR summarising aggregate statistical data on minority groups
*** Example 4.1: An analysis of average personal income, controlling for age, gender and education,
* by ethnic group.
graph /histogram=qfimn.
compute lninc = ln(qfimn).
if (qfimn <= 50 ) lninc = -9.
missing values lninc (-9).
variable label lninc "Log of monthly income (if over 50p/w)".
graph /histogram=lninc.
compute qage2=qage**2.
descriptives var=qage qage2.
compute fem=(sex=2).
fre var=sex fem.
fre var=degree diploma lowlev .
descriptives var= lninc qage qage2 fem degree diploma lowlev xeth .
* Remove any earlier copy of 'incpred' so that the model can be re-run (on a first run there is nothing to delete).
delete variables incpred.
glm lninc with qage qage2 fem degree diploma lowlev /print=parameters /save=pred (incpred).
compute incres = exp(lninc) - exp(incpred).
variable label incpred "Predicted log monthly income by age, gender and education".
variable label incres "Actual minus predicted monthly income".
means tables = incpred incres by xeth /cells=mean count .
graph /bar=mean(incres) by xeth by sex /title="Ethnic penalty / premium (BHPS 2007)".
** Export this result - it's a MIR we could later upload as an exercise.
aggregate outfile=!path2+"bhps_ethnic_penalites.sav" /break=xeth sex /inc_pen=mean(incres) .
**.
get file=!path2+ "bhps_ethnic_penalites.sav" .
descriptives var=all.
*******************************************************
*******************************************************
*******************************************************
*******************************************************.
*********************************************************************.
*********************************************************************.
*** Example 1.4: A further example combining geography, history and sociology .
*********************************************************************.
get file=!path1+"fhs_extract_multi_gen.sav" .
descriptives var=all.
** Some explanation:
* This is a genealogical database. Each row refers to a single genealogical record which is based on
* a marriage which took place in the year indicated by variable 'gyear'.
graph /histogram=gyear.
* The extract covers selected occupations of men linked to that marriage event. These are indicated by the
* variable names: g = groom; gf = groom's father; gl = groom's father-in-law; gff = groom's father's father, etc.
* This extract stops at a maximum of four generations (though further detail is available in the FHS).
** The data on these men cover their occupation and their geographical location.
* For every record, we have their occupation at the time of their marriage (for the groom)
* or at the time of their children's marriage (all others).
fre var=g gf .
* For selected records we also have detailed geographical locations: we have northings and eastings for
* the events of: groom's marriage (g_m_*); groom's birth (g_b_*); groom's father's marriage (gf_m_*)
* and groom's father's birth (gf_b_*).
graph /scatter=g_m_e with g_m_n.
** There are also some other indicators in the data, but these are of less importance to this exercise.
**** Some examples of exploiting the data:
** Q: Is there a relationship between geographical mobility and occupational type? .
*** => Add value labels to the HISCO codes and see how they relate to geographical mobility .
** The data on occupations is coded to the HISCO scheme (see http://historyofwork.iisg.nl/).
** A code file for adding value labels is available on GEODE (it's been placed in the 'macros' directory).
* Running it makes the occupational data easier to interpret.
fre var=g gf gff.
include file=!path5a+ "hisco_labels9.sps".
hislab occ={g gf gff}.
fre var=g gf gff.
* Now calculate distance moved (the straight-line distance between the two sets of grid coordinates).
descriptives var=gf_m_n g_m_n gf_m_e g_m_e.
missing values gf_m_n g_m_n (-999 thru -900).
missing values gf_m_e g_m_e (900 thru 999).
descriptives var=gf_m_n g_m_n gf_m_e g_m_e.
graph /scatter= g_m_e with g_m_n .
compute mardist = sqrt( (gf_m_n - g_m_n)**2 + (gf_m_e - g_m_e)**2 ).
variable label mardist "Distance between location of groom's marriage and father's marriage (km)".
graph /histogram=mardist.
* Look at distances moved against particular occupations.
means tables=mardist by g /cells=mean count .
*****.
*** Q) Are the occupations with greater mobility characterised by advantage or disadvantage? .
** A number of occupation-based social classifications can be linked to HISCO.
* Translation matrices / codes are available at various websites, and at GEODE.
* As one example, use the HIS-CAM codes for the UK, at http://www.camsis.stir.ac.uk/hiscam/ .
** .
* Save the working data, sorted by the groom's HISCO code.
sort cases by g.
sav out=!path9+"m1.sav".
* Prepare two copies of the HIS-CAM lookup table: one keyed on 'g' (grooms), one keyed on 'gf' (grooms' fathers).
get file=!path5a+"hiscam_gb.sav".
descriptives var=all.
sort cases by hisco .
rename variables (hisco hiscam = g g_hc) .
sav out=!path9+"m2.sav".
rename variables (g g_hc = gf gf_hc).
sav out=!path9+"m3.sav".
* Attach the groom's HIS-CAM score; 'one' flags the records that come from the working data.
get file=!path9+"m1.sav".
sort cases by g.
match files file=* /in=one /table=!path9+"m2.sav" /by=g.
fre var=one.
select if (one=1).
variable label g_hc "Groom's job HIS-CAM".
descriptives var=all.
delete variables one.
* Repeat the match for the groom's father's HIS-CAM score.
sort cases by gf.
match files file=* /in=one /table=!path9+"m3.sav" /by=gf.
fre var=one.
select if (one=1).
variable label gf_hc "Groom's father's job HIS-CAM".
descriptives var=all.
correlate var=g_hc gf_hc mardist.
regression /var=mardist g_hc gf_hc /dependent = mardist /method=enter .
** Comment: the results suggest that holding a more advantaged job, and having a more advantaged father,
* are both independently associated with greater geographical mobility.
*********************************************************************.
*********************************************************************.
*********************************************************************.
*********************************************************************.
** EOF.