Help and standards of good practice in survey data management

On this page we try to give a succinct overview of good practice in the methodology of data management in social survey research, with links to our own resources and to relevant external sites. We think that this topic is unduly neglected in methodological discussion in the social sciences, even though data management choices and activities are a major component of research workload and can substantially influence the results of any analysis.

Comments are given covering:


The content of this page is mostly written by Paul Lambert, with a few contributions from other researchers from the DAMES Node.







Quick start - help me out!

I want to link together some data files! Try our 'matching files' section

I want to change the categories of some variables! Try our 'recoding variables' section

I want to learn more about using syntax! Good! Our software guide handout introduces the idea of using syntax and includes further links. There are many examples of SPSS and Stata syntax/do files in the downloadable materials from our workshop programme, see especially 'Documentation and workflows for social survey research'. There are also materials further down this page, starting with our section on 'documentation for replication'.

All this stuff is too much for me! Try some introductory materials on survey data analysis and data support, or a training workshop on these topics. The ESDS sites have many useful links, resources and recommendations (with a UK orientation).










Defining 'data management' and a methodology of data management.















Documentation for replication.

Shortcut! Here's our extended pdf handout on this general topic (produced for a workshop)

As described above, we are concerned with a methodology of 'data management' where we are interested in activities which manipulate or enhance data resources for the purpose of research analysis. The documentation of such processes can usefully be thought of as concerning the 'paper trail' which tells us adequately about tasks which have been undertaken. That trail is commonly in the form of software specific command files (e.g. 'do files' in Stata, or 'syntax files' in SPSS), but it could conceivably be in another format - for example, the hand-written laboratory notebook is a classic form of comparable documentation, whilst contemporary lab books often come in the form of a collection of related paper and electronic documents.

The most effective form of documentation can be described as 'documentation for replication'. This refers to the idea that the documentation must be sufficiently clear and detailed that other researchers (or the analyst themselves, at a later date) ought to be able to use the documentation to exactly replicate the data management and analysis undertaken. Dale (2006) and Freese (2007) have both written persuasively on the possibilities and desirability of documentation for replication in the social sciences, and we would argue that it is an essential prerequisite to adopting a scientific approach to social research.

Unfortunately, we'd claim, most social survey research projects do not achieve documentation for replication. It is rare to see published results accompanied by immediate access to the 'paper trail' which documents what generated them, and indeed replication analyses themselves are very rarely published in social science domains. Part of the problem, we'd assert, lies with the inattention to replication analysis within professional standards in the social sciences (see also below). Most of the problem, though, we think, reflects little more than a lack of awareness and training amongst many researchers over how to undertake their work in a manner that readily generates good documentation.

Actually, in most circumstances, it is reasonably easy to undertake analysis whilst generating documentation of a suitable standard for replication. For most of us working with survey data, the trick is to use a syntactical programming language which, with just a little extra effort in tidying up the file, can readily provide a replicable log of the tasks undertaken. Popular examples include the use of SPSS 'syntax' files or Stata 'do' files (illustrated below). If you're already familiar with such approaches to syntactical programming you won't need to be convinced of their benefits; if you are not, however, suffice to say that whilst such languages may initially seem a little intimidating, they are in fact quite easy to pick up, and are hugely beneficial in the longer term.




The figures above depict respectively a researcher using SPSS and Stata syntax files to run their analysis. We have lots more examples of using syntax (plus onward links to external resources) in our workshop materials and software guide document. If you're not already familiar with the format of syntax files being used for documentation-for-replication, we'd suggest there are two important features to recognise:

So, the practical work of documentation for replication in social survey research usually centres on using software effectively in order to achieve an adequately replicable 'paper trail'. The figures we illustrate above depict examples of working in Stata and SPSS through syntax files. In most situations this will support adequate levels of documentation, and Long (2009), for example, gives a very thorough introduction to using Stata effectively for this purpose.

As an aside, the task of learning how to use software effectively, for most of us, transmutes into the task of learning to be a good programmer in the relevant software language. Most successful data analysts in the social sciences are therefore also successful programmers in the language of their choice. At the time of writing, we'd argue that there are few sophisticated activities, in terms of data management and data analysis, that can be achieved without some degree of programming proficiency. One interesting contemporary development here can be seen in the ongoing ESRC-funded 'E-Stat' Node, which is trying to build a software system to better support the combined inputs of three groups of specialists - statisticians, programmers and social scientists - for the benefit of social science research. Amongst other contributions, this project is working on an 'ebook' tool which is intended to provide researchers with a dynamic log book featuring electronic records of all stages of the research undertaken in a coherent and clearly documented manner. The ebook being developed has the capacity to record automatically generated syntax code, and so does not necessarily rely upon the researcher learning the relevant programming language in order to record a consistent log.

New resource! We've written an extended guide on how to use data analysis software effectively for the purposes of good standards of documentation and data management more generally (originally distributed as part of our workshop of 24/5 November 2010). We recommend it for detailed instructions on how to organise your work with data analysis software packages more effectively. It also has accompanying example command files in different software packages (Stata, SPSS, R, MLwiN and lEM), accessible from the website for our November 2010 workshop.

Relevant workshop! We held a 2-day training workshop on the general topic of documentation for replication in social survey research in November 2010. The materials for that workshop (e.g. presentation slides, lab session handouts and example files, including detailed guides to good practice in preparing documentation) are all available online, and we'd encourage you to look into them.

Other sources: There are many other places where it is possible to find more introductory material and worked examples of using software effectively for documentation for replication. The UCLA statistical computing pages are particularly widely used across the social sciences, and in the UK, the ESDS training sites include many software-oriented resources. For our part, we have prepared many example command files, as well as collating further external links, which can be found on the 'Stata support' and 'SPSS support' pages of our earlier project on 'Longitudinal Data Analysis for Social Science Researchers' (these were written in 2008).












Matching files

By 'matching files' we are referring to linking together different electronic datasets in a structured way. In survey data analysis, the files involved are the characteristic 'variable-by-case' matrix which summarises the survey data, and some other relevant data. Matching files then refers to the process of linking components of different (but intentionally related) variable-by-case matrices.











This figure tries to illustrate a typical file matching operation. File A is a social survey microdata file, with a measure of the respondents' occupations ('bjbiscon'). File B is separate aggregate data with information (namely the ISEI codes taken from Ganzeboom and Treiman, 1996) on occupations ('isco88', which is in the same format as 'bjbiscon'). After a file-matching routine is run, File C is generated, which in this example can be thought of as an augmented version of File A (it now has an extra variable, namely 'isei').

Various software packages have routines to support matching files, but their techniques are not widely taught in social science training courses, and in our experience many researchers are not confident in how to link files together. This is unfortunate, because a great many useful data enhancements require some sort of file matching exercise to be undertaken.

(Our discussion here focusses upon approaches to what is sometimes called 'deterministic' file matching, which means linking together data files according to shared and known identifier characteristics. This is in contrast to 'probabilistic file matching', which generally refers to using some statistical algorithm to impute values on one dataset (recipient) on the basis of statistical patterns in another related dataset (donor). There are quite a few current methodological projects in the UK involved in developing probabilistic matching, or 'data fusion', techniques - for instance ADMIN, BIAS, NeISS, and our own project theme on social care data).

Common forms of file matching include: the 'one-to-many' link (whereby each record in one file is distributed across a number of related records in a second file, such as when distributing aggregate-level summaries to individual cases); the 'one-to-one' link (whereby records from different files are linked on a shared identifier value, such as when linking individual responses from different years of a longitudinal study); and appending data files (whereby cases from different datasets are stacked within the same file, but individual cases are not explicitly linked between the datasets).

In a recent workshop we gave extended examples of matching files using Stata (see the materials from our training workshop of August 2009), and in an earlier project we developed illustrative examples in both SPSS and Stata covering a range of file matching operations (see the online materials for the project on 'Longitudinal Data Analysis for Social Science Researchers').

Since file matching is such an important data management practice, we give below illustrations of some of the most important mechanisms for matching data across three software packages: Stata, SPSS and R. We do not know of many other online guides to file matching using these packages, although the ATS Statistical Computing (UCLA) webpages are one notable exception (see their guides for SPSS, Stata and R). The examples below are brief and rather superficial, however - take a look at our workshop programme (see above) for more extended examples of matching files!



Example 1: A one-to-many match merge operation using one shared variable.

In this example, we have a microdata file from a survey (fileA) and we want to link in aggregate data about occupations from an external file (isco88_isei). The merge below links in cases from the isco88_isei file whenever there is suitable data on fileA to allow that. After the match all records from fileA are retained, but the procedure discards any remaining data from isco88_isei which does not successfully match to the microdata file.


Example files:
fileA.sav; fileA.dta; fileA.dat; isco88_isei.sav; isco88_isei.dta; isco88_isei.dat (the three 'fileA' files are small extracts from a UK survey; the three 'isco88_isei' files are available on GEODE and are derived from Harry Ganzeboom's ISCO-88 to ISEI coding tool)

SPSS syntax example:

   get file="isco88_isei.sav".
   rename variables isco88=bjbiscon. 
   sort cases by bjbiscon. 
   sav out="temp.sav". 
   get file="fileA.sav". 
   sort cases by bjbiscon.
   match files file=* /table="temp.sav" /by=bjbiscon. 
   sav out="fileC.sav". 


Image representing one-to-many matching:

Stata syntax example:

   use isco88_isei.dta, clear
   rename isco88 BJBISCON 
   sort BJBISCON 
   sav temp.dta, replace 
   use fileA.dta, clear 
   sort BJBISCON
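   /* old-style merge syntax; in Stata 11 or later the equivalent is: merge m:1 BJBISCON using temp.dta */
   merge BJBISCON using temp.dta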
   tab _merge  /* (_merge was Auto-generated) */
   keep if _merge==1 | _merge==3
   drop _merge /* _merge used to keep fileA cases only */
   sav fileC.dta, replace

R syntax example:
   fileA <- read.table("fileA.dat", header=T)
   isco88_isei <- read.table("isco88_isei.dat", header=T)
   fileC <- merge(fileA, isco88_isei, by.x="BJBISCON",  by.y="isco88",
               all.x=T, all.y=F, sort=F, suffixes = c(".x",".y") )
   write.table(fileC, file="fileC.dat", col.names=T, row.names=F) 
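For readers who find it easier to see the logic outside any particular package, the one-to-many match above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the records, the identifier values and the 'isei' scores are all made up (they are not taken from the example files or from the real ISEI scale), and the function name one_to_many_match is our own hypothetical helper rather than part of any library.

```python
# Made-up stand-in for fileA: one row per survey respondent.
file_a = [
    {"id": 1, "bjbiscon": 2310},
    {"id": 2, "bjbiscon": 5220},
    {"id": 3, "bjbiscon": 2310},
    {"id": 4, "bjbiscon": 9999},   # occupation code with no entry in the lookup
]

# Made-up stand-in for isco88_isei: one row per occupation code.
# The 'isei' values are invented for illustration only.
isco88_isei = [
    {"isco88": 2310, "isei": 70},
    {"isco88": 5220, "isei": 40},
    {"isco88": 7130, "isei": 30},  # not used by any respondent, so it is dropped
]


def one_to_many_match(cases, lookup, case_key, lookup_key, value):
    """Attach one lookup value to every case that shares the key.

    All cases are kept (unmatched cases get None); lookup rows that
    match no case are discarded, as in the examples above.
    """
    table = {row[lookup_key]: row[value] for row in lookup}
    return [{**case, value: table.get(case[case_key])} for case in cases]


file_c = one_to_many_match(file_a, isco88_isei, "bjbiscon", "isco88", "isei")
for row in file_c:
    print(row)
```

The point of the sketch is the asymmetry of the link: the occupation file acts as a lookup table, so two respondents with the same occupation receive the same score, while the respondent with an unrecognised code is retained with a missing value.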




Example 2: A one-to-one match merge operation using one shared variable.

The example shown here is of merging cases from two different data files which pertain (potentially) to the same respondents. Respondents are uniquely indicated by the 'id' variable, and the combined file shows, side-by-side, respondents' values on the two different files. In practice, not all respondents are on both files. In the illustrative figure, the _merge variable indicates coverage: _merge=3 means that the respondent does have values linked from both datasets; _merge=1 or _merge=2 means that they only have values from the A or O dataset respectively.


Example files:
waveAsubset.sav; waveAsubset.dta; waveAsubset.dat; waveOsubset.sav; waveOsubset.dta; waveOsubset.dat (the three 'waveAsubset' files are small extracts from a UK survey, the BHPS (University of Essex, 2010), conducted in 1991; the three 'waveOsubset' files are small extracts from the same survey as it was conducted in 2005). The waveA file has 350 cases in it and the waveO file has 559 cases. When they are merged using the identifier variable 'id', there are 178 cases who answered the survey in both years, whilst 172 and 381 cases respectively answered in only one year and not the other.

SPSS syntax example:

   get file="waveOsubset.sav". 
   sort cases by id. 
   sav out="m1.sav".
   get file="waveAsubset.sav".
   sort cases by id. 
   match files file=* /in=merge1 /file="m1.sav" /in=merge2 /by=id. 
   cro tables=merge1 by merge2. /* merge used to indicate presence */  

Stata syntax example:

   use waveOsubset.dta, clear
   sort id 
   sav temp.dta, replace
   use waveAsubset.dta, clear
   sort id
   /* old-style merge syntax; in Stata 11 or later the equivalent is: merge 1:1 id using temp.dta */
   merge id using temp.dta
   tab _merge
   drop _merge /* _merge generated automatically */

R syntax example:
   waveA <- read.table("waveAsubset.dat", header=T)
   waveO <- read.table("waveOsubset.dat", header=T)
   waveAO <- merge(waveA, waveO, by.x="id",  by.y="id",
               all.x=T, all.y=T, sort=F, suffixes = c(".x",".y") )
   write.table(waveAO, file="waveAOcomb.dat", col.names=T, row.names=F) 
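The role of the coverage indicator in a one-to-one match can also be sketched in plain Python. Again this is a minimal sketch under stated assumptions rather than a recommended implementation: the identifiers and variables ('job_a', 'job_o') are made up, the function one_to_one_match is a hypothetical helper of our own, and the 1/2/3 coding simply mimics the convention of Stata's _merge variable described above.

```python
from collections import Counter

# Made-up stand-ins for the two wave files, keyed by respondent id.
# Using a dictionary keyed on id guarantees one record per respondent.
wave_a = {101: {"job_a": "teacher"}, 102: {"job_a": "nurse"}, 103: {"job_a": "clerk"}}
wave_o = {102: {"job_o": "manager"}, 103: {"job_o": "clerk"}, 104: {"job_o": "driver"}}


def one_to_one_match(master, using):
    """Combine two files side by side on a shared identifier.

    Every respondent found in either file is kept. The '_merge' code is
    1 = master file only, 2 = using file only, 3 = present in both.
    """
    merged = []
    for key in sorted(master.keys() | using.keys()):
        in_master, in_using = key in master, key in using
        code = 3 if (in_master and in_using) else (1 if in_master else 2)
        row = {"id": key, "_merge": code}
        row.update(master.get(key, {}))
        row.update(using.get(key, {}))
        merged.append(row)
    return merged


wave_ao = one_to_one_match(wave_a, wave_o)
coverage = Counter(row["_merge"] for row in wave_ao)
for row in wave_ao:
    print(row)
print(dict(coverage))
```

Tabulating the indicator before dropping it, as the Stata and SPSS examples do, is a cheap check that the numbers of matched and unmatched cases are what you expected.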


Image representing one-to-one matching:



Example 3: Appending multiple related files.

**Under construction - inputs to follow**
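Pending the worked package examples, the general idea of appending can be sketched in plain Python. This is an illustrative sketch only, assuming two made-up files that share some variables but not others; the variable names, the values and the helper append_files are all hypothetical. Cases are stacked one file after the other, a 'source' variable records where each case came from, and variables missing from a file are filled with None.

```python
# Made-up stand-ins for two related survey files with partly different variables.
survey_1991 = [
    {"id": 1, "age": 34, "region": "North"},
    {"id": 2, "age": 51, "region": "South"},
]
survey_2005 = [
    {"id": 2, "age": 65, "income": 21000},
    {"id": 7, "age": 29, "income": 18000},
]


def append_files(*sources):
    """Stack cases from several (label, rows) pairs into one file.

    Cases are not linked to each other: a respondent appearing in two
    source files simply appears as two separate rows.
    """
    # Collect every variable name, in order of first appearance.
    all_vars = []
    for _, rows in sources:
        for row in rows:
            for name in row:
                if name not in all_vars:
                    all_vars.append(name)
    stacked = []
    for label, rows in sources:
        for row in rows:
            out = {"source": label}
            for name in all_vars:
                out[name] = row.get(name)  # None when the variable is absent
            stacked.append(out)
    return stacked


combined = append_files(("1991", survey_1991), ("2005", survey_2005))
for row in combined:
    print(row)
```

Recording the source of each case is a design choice worth copying in any package, since after appending there is otherwise no way to tell which file a row came from.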

Further notes on the syntax examples.
    To simplify the example we haven't included the paths of the files within the syntax. The above syntax will work if you set the 'file handle' in SPSS, or use 'cd' in Stata or 'setwd()' in R, to ensure that the software is pointing to the folder where you've downloaded the example files. Alternatively, put the full path into the file call, or use macros to define paths for the relevant files (see our software guide on this issue).
    Two practical problems are common with match-merge operations. The first is when the storage formats of the key linking variables are not compatible (some data processing may be necessary, such as converting a string format into a numeric format). The second is when the linking variable in the 'one' file is not in fact unique for each case. In a one-to-many or one-to-one link, the 'one' file should contain only one row for every unique combination of values of the key linking variable(s); anything else will give you some form of error, though its appearance will vary between software packages. To address this you may need to subset your data to ensure there is only one case per value (see our section on aggregating data for examples).
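Both of the practical problems with linking variables can be checked for before attempting a match. The following plain Python sketch illustrates the checks under stated assumptions: the lookup records are made up, the helper names (to_numeric_key, duplicated_keys, first_per_key) are our own hypothetical functions, and keeping the first row per key is just one possible way of subsetting, which will not always be the substantively correct one.

```python
from collections import Counter


def to_numeric_key(value):
    """Convert a key stored as text (e.g. ' 2310') to an integer.

    Returns None when the value cannot be read as a whole number.
    """
    try:
        return int(str(value).strip())
    except ValueError:
        return None


def duplicated_keys(rows, key):
    """List the key values that occur on more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def first_per_key(rows, key):
    """Subset to one row per key value, keeping the first row encountered."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


# Made-up 'one' file in which the key is stored as text and is not unique.
lookup = [
    {"isco88": "2310", "isei": 70},
    {"isco88": " 5220", "isei": 40},
    {"isco88": "2310", "isei": 71},   # duplicate key value
]

for row in lookup:
    row["isco88"] = to_numeric_key(row["isco88"])

dups = duplicated_keys(lookup, "isco88")
unique_lookup = first_per_key(lookup, "isco88")
print("Duplicated keys:", dups)
print("Rows kept:", unique_lookup)
```

Running checks of this kind on the 'one' file, and inspecting any duplicates by eye before deciding how to resolve them, is usually quicker than diagnosing a failed or silently incorrect merge afterwards.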



       **DAMES ONLINE TOOL FOR MATCHING DATA FILES**

If you don't fancy writing out the necessary software syntax to merge two related data files, we also have an online service which will perform this task for you if you submit the two data files and provide information on the files that are to be matched.

[IN PREPARATION]












Recoding variables.

Text to follow



       **DAMES ONLINE TOOL FOR RECODING VARIABLES**

[IN PREPARATION]












Averaging and scaling variables.

Text to follow



       **DAMES ONLINE TOOL FOR AVERAGING AND SCALING VARIABLES**

[IN PREPARATION]












Professional standards.

After spending some time thinking about the processes and practice of data management in social survey research, we've come to develop fairly firm views on good and bad habits in this research domain. We present below some intentionally pejorative notes on what we think is required for more effective application of a scientific approach to social survey research.

The model of science that we adopt is influenced by Steuer (2003). We take as central tenets of a scientific approach to our field the principle that empirical survey research should be cumulative (i.e. motivated by, building upon and learning from previous endeavours); and that it should be open to cross-examination and further evaluation, by being explicitly recorded and its processes exposed to others. Of course, these are by no means agreed upon definitions for the term 'science' or social science research, but we think they are useful principles to aspire to in survey research, because we would claim that they are common features of many of the most productive examples of published social survey research (e.g. Townsend, 1979; Breen, 2004), whilst they are frequently absent in more problematic survey-based studies (cf. Huff, 1954).

It is all very well for us to claim that research would be better conducted by adopting certain principles, such as documentation for replication and so forth. We may claim the above, and others may claim differently, and it may not be obvious whether either is always right or who should have influence. We can however appeal to the idea of professional standards in order to make a case that there could usefully be agreed upon standards for academic social survey research. The presence of a code of conduct that is both agreed upon and enforced is commonly seen as a defining feature of professionalism (e.g. Prandy, 1965), and we think that this principle would be a useful device for setting quality standards linked to empirical survey research.

To elaborate the analogy for the example of survey research, the specification of professional standards can be made in order to ensure that relevant work is conducted to a suitable standard, whereas it is argued that using qualification standards alone, and the membership of associations, are not adequate criteria for professionalism. In social survey research, we could point to skill requirements (e.g. the ability to coax a good graph out of a software package), and publication criteria (e.g. the ability to successfully disseminate analytical results via publications), as broad parallels to qualification standards and association membership. Our claim is that neither of these proves sufficient to guarantee that productive scientific work is conducted in our domain - researchers can be both qualified, and successful in publishing their work, but this doesn't ensure a desirable standard of scientific investigation is being undertaken. If such standards are desirable, it is necessary to express more clearly the methodological expectations relevant to such approaches in the context of survey research.

A codification of professional standards in survey research could therefore be used as a device for enforcing more rigorous scientific standards in survey analysis projects. That codification cannot, of course, be printed on quality paper and henceforth disseminated across the research community to immediate effect. It could, however, gradually filter through to a variety of relevant gatekeeping mechanisms, such as journal and research grant refereeing criteria, higher examining standards, appointment criteria and so forth, to the long term benefit of social science research standards. The content of the code of conduct, in turn, is obviously vital, and the critical claim of our argument is that better quality research (more cumulative and replicable) is only likely to be enabled if the relevance of data management is recognised by, and prominent within, a professional code of conduct designed to promote better standards of survey research.

Our claim hinges upon the terms and definitions of data management given above. There, we sought to show that a range of activities involved in the organisation and preparation of data are potentially very consequential to the results of analysis, and that in general terms there is much room for improvement in the way that this range of activities is typically undertaken. Elaborating on these issues, we claim that the following points ought to be embedded within our professional expectations as social science researchers in the domain of social survey analysis:

Implicit in many of the above points is the claim that much contemporary research does not adhere to adequate professional standards. For example, many survey analysis studies select variables on fairly ad hoc grounds, and do not engage with previous studies, nor use sensitivity analysis, to inform their choice; many studies do not provide clear documentation of the syntax files used to generate their results, nor demonstrate selection of methods from a fluent knowledge of feasible alternatives. Our rather tough position, therefore, is that scientific standards would be improved if criteria existed that allowed researchers and reviewers alike to identify such weaknesses, and aspire to higher standards.

We've included these components of professional standards for social survey research because they have emerged as important methodological principles during our consideration of data management as a methodology. The above points might not be perfectly defined or comprehensive, but we believe they reflect the most important components of the relevance of data management within the survey research process. We think that the points above are not widely recognised or acted upon within the existing research environment, and so we see the major challenge for contemporary social survey research as persuading the wider research community to accept these or similar professional standards, and to adequately enforce them.












References cited.

Last update: 23/DEC/2010