Cleaning and aggregating raw data
As described elsewhere (see data entry), raw data may be collected and entered in a variety of formats, and in many files.
Data collection and data entry may generate multiple files of original raw data for any given data collection. For example, in each of the early booklet studies, manual data entry typically generated different sets of files for different cohorts of twins. In later questionnaire studies, data entry by scanning was carried out in batches, generating a new file for every batch. In web studies, there were often two waves or two cohorts of data collection, carried out in successive years, each generating its own set of files. In such cases, permanently aggregating the data for each data collection (e.g. each questionnaire, or each web test) simplifies subsequent processes of dataset generation.
In paper booklet and questionnaire studies, the data from each questionnaire have usually been aggregated into one or more tables in the Access database for the given study. Sometimes two or more database tables are needed for a long questionnaire, because each Access table can hold no more than 255 fields (variables).
In web studies, where the files often contain very large numbers of variables, Access database tables are unsuitable. In such cases, the data are aggregated into a single delimited text file, usually a csv file. Generally, one file is needed for each distinct web test in a given study.
Historically, late returns of paper questionnaires were aggregated with the rest of the data by means of data entry directly into the Access database. In web studies, late returns do not happen because the web activities are halted prior to the final data download.
Aggregation, whether in an Access table or in a csv file, can be combined with processes of data cleaning. For example, duplicated questionnaires from the same participant must be eliminated, by checking for uniqueness; invalid participant IDs must also be eliminated, by checking against a lookup list; and it is essential to ensure that identical coding is used for every variable in every aggregated file.
Other data cleaning operations generally involve checking the set of values, or the range of values, recorded in every numeric field. Out-of-range or invalid values can either be corrected (if there is some record of the original raw data) or deleted. Missing values in the aggregated raw data are often replaced by coded values, typically -99 (missing) or -77 (not applicable), and more recently -11 (prefer not to answer) or -22 (don't know). Direct identifiers, such as names, usernames or email addresses, are permanently removed once they have served their purpose in data collection.
Text data, where collected, are typically converted into numeric codes that can be used as dataset variables. Once coded, the original raw data responses are generally deleted because they may be identifiable.
In conclusion, after a data collection, when the multitude of original raw data files have been aggregated, cleaned and de-identified, and when the data have been thoroughly checked, the original data files are permanently deleted and only the cleaned and aggregated data files are retained.
Sources of raw data
For paper-based data collections, including most of the earlier TEDS studies, the cleaned and aggregated raw data have been stored in Access database files. There is a separate Access file for each main TEDS study. Within a study Access database file, there are typically many tables, used for logically distinct data collections. Typically, the Access database will contain separate tables for twin data, parent data, teacher data and admin data. Access has historically been convenient for several reasons: firstly because it enables convenient data entry via programmed forms; secondly because Access tables can enforce important data integrity rules; and lastly because multiple sources of data can be contained within a single file.
The files of twin web/app data (collected in web studies from age 10 onwards, and via a phone app in TEDS21) are generally too large for convenient storage in Access. Each web data file has very many columns, and would have to be split into many database tables. Furthermore, the raw electronic data files are complete, in the sense that further data are not added manually, and there is no need for the data entry and data integrity facilities that Access provides. Instead, the data from each test/study, sometimes aggregated across two or more waves of data collection, are saved in a single data file. Such files may be periodically updated during the course of the study, as more web data are collected; but once the study ends these files become "static", and do not need further updating in any way.
Some of the raw data used to make analysis datasets are not study-specific, but consist of general family background data. These data include twin sexes, birth dates, birth orders, zygosities, and medical conditions (used to classify medical exclusions). These types of data are maintained in the TEDS admin database; they are continually updated as a result of new information from families, zygosity test results, and so on. Such data must periodically be freshly exported (from the TEDS admin database into delimited text files) for the construction of new dataset versions.
Exporting raw data
SPSS, which is used to process the data, cannot easily open data directly from Access database tables. It is therefore necessary to "export" the raw data by copying them into a more suitable file format. Delimited plain text files (usually in .csv format) are used because they are platform-independent, can be read by many different software programs, have no limit on the number of rows or columns of data, and are non-proprietary (and therefore not subject to format changes with different versions, etc).
Generally, the data from each Access database table are exported into a separate csv file. A csv file consists of plain text, which can be opened in text editors like Notepad, and can also be opened in grid format by Excel. The csv format means that variables are comma-separated in rows of data, and each case starts on a new line. The first line of the file contains the column headings (or variable names).
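For illustration, the first few lines of an exported csv file might look like this (the variable names and values are invented):

FamilyID,returndate,item01,item02
10234,12.05.2002,1,3
10235,03.06.2002,2,-99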
Data are exported indirectly from Access tables using saved "queries" within each Access database file. Each query selects appropriate columns from the relevant table. The queries also modify the format of the data values in some columns, so that they are saved in a form that can easily be read by SPSS from the text file; examples are date columns (changed to dd.mm.yyyy format) and boolean true/false columns (changed to integer 1/0 values).
Background data from the TEDS admin database (SQL Server) can be exported in a similar way. The exported csv files from this source are stored alongside raw data files for the 1st Contact study.
Before a new version of the analysis dataset is made, by running scripts as described below, it is not always necessary to re-export the raw data. Some new dataset versions incorporate changes made to the syntax in the scripts, but not changes in the raw data. Repeating the export of data from Access is therefore only necessary if (1) additional raw data have been collected and stored, or (2) the raw data have been updated by cleaning or other changes.
Processing by scripts
Creating one of the TEDS analysis datasets involves a fairly long sequence of operations, which are described under the headings below. These operations are carried out in statistical software (SPSS). The instructions for each operation, and the order in which the operations are performed, are encoded in proprietary syntax for SPSS; the syntax is saved in a set of "scripts" or syntax files. Each script is a plain text file, saved with a proprietary file extension (.sps for SPSS syntax).
Typically, for each dataset, the syntax is divided into several scripts, mainly in order to reduce the length of each script; each script generally contains a logically related set of functions, such as merging raw data sources, or creating scales; and this makes it easier to find a given piece of syntax, and to understand the operations being carried out. The scripts must be executed in a given order for each dataset, as the syntax operations follow each other in a logical sequence.
Each script contains not only syntax but comments. The comments give an explanation of the operations being carried out. The syntax and comments together form a crucial part of the documentation for each dataset, as they explain exactly how the dataset is constructed.
Different TEDS datasets have been created using slightly different syntax methods, sometimes with operations performed in different orders. However, they generally all involve the operations that are described under the headings below. Some examples of variations between the ways the datasets are constructed are:
- The 1st Contact and 8 Year studies involved only parent booklets/questionnaires and no twin data. Hence, any steps involving merging on TwinID, and double entry of twin data, do not apply.
- The in-home study has its own exclusion variable, based on information collected at the time of the visits. Hence, the usual across-age exclusion variables are not used.
- At ages 7, 9, 10, 12 and 14, the studies involved teacher questionnaires. Hence, additional steps are needed to merge and process the teacher data (treated in essentially the same way as twin data).
- At ages 10, 12, 14, 16, 18 and 21 the studies included twin web tests and (in some cases) parent web activities, in addition to the usual paper booklets/questionnaires. Hence, construction of these datasets involves additional files, typically one per web activity. The web activities were very varied and required different types of processing.
- In TEDS21, parents and twins had a choice of completing the questionnaires via mobile app, on the web or on paper. The app and web files have subsequently been aggregated into one cleaned file per activity, but additional steps are still needed to aggregate the paper-based data with the app/web data for each questionnaire.
Importing raw data
Before further processing can be carried out, the raw data (in csv format) must be "imported" into SPSS. Importing means not just opening the file, but transforming it into SPSS format. This can involve a range of variable transformations, which can be controlled by the syntax scripts. These transformations, including renaming, formatting and coding, are further described below.
The operation of importing a raw data file is generally followed by sorting the rows in ascending order of subject ID, sometimes by variable naming and recoding, and then by saving the file in SPSS format to be merged at a later stage.
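For example, a minimal SPSS sketch of importing, sorting and saving a raw csv file might look like the following; the file paths, variable names and formats are invented for the example, and the actual scripts may differ in detail.

* Import a raw csv file; line 1 holds the variable names, so data start at line 2.
GET DATA
  /TYPE=TXT
  /FILE='C:\TEDS\raw\example_parent.csv'
  /DELIMITERS=","
  /QUALIFIER='"'
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /VARIABLES=
    FamilyID F7.0
    returndate EDATE10
    item01 F2.0
    item02 F2.0 .
* Sort by the subject identifier and save in SPSS format, ready for merging later.
SORT CASES BY FamilyID (A).
SAVE OUTFILE='C:\TEDS\working\example_parent.sav'.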
Naming variables
In SPSS, variable naming is often done simultaneously with importing the raw data, because the syntax for importing a csv file allows the variable names to be specified at the same time. In some dataset scripts, the renaming of variables (from raw variable names to final analysis dataset variable names) is done at a later stage.
The naming of variables in TEDS datasets follows certain conventions, described in the variable naming page.
Formatting variables
To make variables easier to use, they are formatted in SPSS by modifying properties such as the "width", "decimals", "columns" and "measure".
Firstly, by using the SPSS FORMATS syntax command, or by specifying the format in syntax for importing a text file, each variable's "width" and "decimals" properties are set. For example, the SPSS format F1.0 would be used for a single-digit integer variable (width=1, decimals=0), and format F5.2 might be used for a decimal variable such as a twin's height (width=5, decimals=2). Setting the format does not alter the values stored in SPSS; it merely alters the default display of variable values in SPSS.
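For example (with invented variable names):

* Single-digit integer items, and a decimal height measurement.
FORMATS item01 TO item20 (F1.0).
FORMATS height1 (F5.2).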
Secondly, by using the SPSS VARIABLE LEVEL syntax command, each variable can be described (in the SPSS "measure" property) as nominal, ordinal or scale. Conventionally in the TEDS datasets, integer-valued categorical variables are set to nominal or ordinal as appropriate, and this property makes it easier in SPSS to use the variables in analyses such as tabulations. Other variables set to measure=scale include quantitative measurements such as weights or ages, scales, composites and decimal-valued variables generally.
Further variable properties may also be set in SPSS syntax. These properties include the "columns" (set with the SPSS VARIABLE WIDTH syntax command), which determines the width of the column displayed in the SPSS data view. The default width is often appropriate, but may be changed when it is inconveniently wide or narrow. Another variable property that may be set is the "missing" property. For a variable containing a value such as 9=not applicable, such a value can be set as "missing" in SPSS (using the MISSING VALUES syntax command); the result is that the value is conveniently treated as missing in any SPSS analysis, including computation of scales. However, in TEDS datasets most such values are routinely recoded to the "system-missing" (SYSMIS) value in SPSS.
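For example, these properties might be set as follows (again with invented variable names):

* Declare categorical items as ordinal and a quantitative measurement as scale.
VARIABLE LEVEL item01 TO item20 (ORDINAL) /height1 (SCALE).
* Set the displayed column width in the SPSS data view.
VARIABLE WIDTH item01 TO item20 (5).
* Treat a code of 9 (not applicable) as missing in analyses, if it is not recoded to SYSMIS.
MISSING VALUES item01 TO item20 (9).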
Recoding variables
In the raw data, especially where stored in Access database tables, missing values are generally denoted by the code value -99, and not-applicable values are denoted by the code value -77. In some recent web questionnaire collections, like TEDS26, there are further special missing data codes like -11 for "prefer not to answer" and -22 for "don't know". These codes are not used in the analysis datasets, so these values are recoded to missing for all item variables (in SPSS, a system-missing or SYSMIS value is used).
Often, the value coding used in the raw data differs from that used in the analysis datasets. For example, simple value codes of 1,2,3,... are often used at the point of web data storage or data entry of the raw data, but in the analysis dataset it might be more convenient to use value codes of 0,1,2,... (often to aid the computation of scales). Reverse-coding is sometimes necessary, to ensure that all related item variables are coded in a consistent way. The scripts therefore contain syntax for changing the value codes where necessary.
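A minimal sketch of this kind of recoding, with invented variable names:

* Replace special missing data codes with the system-missing value.
RECODE item01 TO item20 (-99, -77, -11, -22 = SYSMIS).
* Shift the value coding from 1,2,3 to 0,1,2 to aid computation of scales.
RECODE item01 TO item20 (1=0) (2=1) (3=2).
* Reverse-code one item so that it is consistent with the other items.
RECODE item05 (0=2) (2=0).
EXECUTE.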
Occasionally, some further data cleaning is needed. This may occur, for example, in instances where a questionnaire allows participants to record contradictory responses in different questions; or a variable may contain extreme or outlying values that are either of dubious quality or at risk of increasing identifiability. Some scripts enact measures to check for and recode invalid or compromised data of these sorts. Where identified, invalid values are generally recoded to missing, to prevent them being used in analysis, or they may be recoded into broader category values. Examples of invalid values might include extreme or out-of-range quantitative measurements, such as ages or heights. Examples of compromised data might include inconsistent responses in multi-part questions.
Recoding and cleaning web/app data
The data collected in web (or app) studies, from age 10 onwards, are recoded for several purposes: in order to replace missing values, to identify item events such as timeouts, discontinues and branching, and to identify anomalies such as item crashes.
Furthermore, attempts have been made to identify instances of tests that were compromised by loss of data, malfunctioning of test rules, or random responding by twins. Such instances are flagged using test status variables, and are effectively excluded by setting the test variable values to missing.
The processes involved in recoding and cleaning the web data are described in more detail in a separate page.
Double entering parent/family data
In parent-reported questionnaire data, and in some admin data, there are items referring specifically to the elder and younger twins, and these occur in the same row of data for each family. Double entering these data consists of two main operations in the script: firstly, the data must be duplicated so that there is one row per twin, not just one row per family; secondly, the twin-specific variables must be "swapped over" in the duplicated data so that they correctly match the twin and co-twin for the relevant row of data. The script usually proceeds through the following stages (a minimal syntax sketch follows the list):
- Start with the parent- and/or family-specific data, as imported from the raw data files. This is often done after merging all the family-specific data together (using FamilyID as the key variable), but before merging the twin/teacher data.
- This is usually the point at which the random variable is added (see also below).
- Within the family-specific data, if not already done, rename twin-specific items so that "elder twin" item variables have names ending in '1', and "younger twin" item variables have names ending in '2'.
- Compute the twin identifier Atempid2 by appending the digit 1 to the end of the FamilyID value.
- At this stage, the dataset represents the elder twins; each variable with name ending in '1' represents the elder twin as identified by Atempid2, while each variable with name ending in '2' represents the younger twin, i.e. the co-twin. Save this as the "top" half of the dataset.
- Now make an equivalent dataset for the younger twins. Start by re-computing the twin identifier Atempid2, so it ends in 2 instead of 1.
- Also, for the second twin, reverse the value of the random variable (0 to 1, or 1 to 0) that has been derived for the elder twin.
- Now swap over the twin-specific variables: the variables having names ending in '1' must be swapped over with the variables having names ending in '2'. This is conveniently done by renaming the variables: for example, to swap variables var1 and var2, rename var1 to var1x, then rename var2 to var1, then rename var1x to var2.
- At this stage, the dataset represents the younger twins; each variable with name ending in '1' represents the younger twin as identified by Atempid2, while each variable with name ending in '2' represents the elder twin, i.e. the co-twin. Save this as the "bottom" half of the dataset.
- Combine the cases from the "top" and "bottom" datasets saved above, into a single dataset.
- Re-sort in ascending order of Atempid2.
- The family-specific data are now double entered. There is one row per twin, each twin being identified uniquely by Atempid2. The variables having names ending in '1' contain data for this twin (whether elder or younger). The variables having names ending in '2' contain data for this twin's co-twin.
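The following minimal SPSS sketch illustrates these stages, assuming that the merged family-specific data are the active dataset, that the IDs are numeric, and that there is a single pair of invented twin-specific variables bvar1 (elder twin) and bvar2 (younger twin); the real scripts handle many such variables and may differ in detail.

* "Top" half: one row per family, representing the elder twins.
COMPUTE random = RV.BERNOULLI(0.5).
COMPUTE Atempid2 = (FamilyID * 10) + 1.
SORT CASES BY Atempid2 (A).
SAVE OUTFILE='top.sav'.
* "Bottom" half: the same rows, re-identified as the younger twins.
COMPUTE Atempid2 = (FamilyID * 10) + 2.
COMPUTE random = 1 - random.
* Swap the twin-specific variables by renaming in three steps.
RENAME VARIABLES (bvar1 = bvar1x).
RENAME VARIABLES (bvar2 = bvar1).
RENAME VARIABLES (bvar1x = bvar2).
SORT CASES BY Atempid2 (A).
SAVE OUTFILE='bottom.sav'.
* Combine the two halves and re-sort, giving one row per twin.
ADD FILES /FILE='top.sav' /FILE='bottom.sav'.
SORT CASES BY Atempid2 (A).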
Creating the random variable
The variable named random has values 0 and 1. For a given twin, the value 0 or 1 is assigned randomly in this variable; however, co-twins must have opposite values. So if the elder twin has random=1, then the younger twin must have random=0, and vice versa. During analysis, if the dataset is filtered or selected on either random=1 or random=0, then exactly half the twins will be selected, but only one twin per pair is selected; and the selection will include an approximately 50:50 mix of elder and younger twins.
Computing this variable cannot be done directly in a dataset that already has one row per twin, because twin and co-twin from a given pair must have opposite values of random in two different rows. Instead, the random variable is often added to the dataset during double entry of the parent/family data (see above), because both processes involve combining "top" and "bottom" halves of the dataset. The processes involved are as follows:
- Start with a dataset containing the elder twin cases only. Elder twins are typically identified by variable Atempid2, with 1 as the final digit.
- Add the random variable, assigning the value 0 or 1 randomly to each elder twin, with equal probabilities.
- Save this as the "top" part of the dataset, representing the elder twins.
- Re-compute the twin identifier Atempid2, by changing the final digit from 1 to 2. This now identifies the younger twin cases.
- Recode the random variable, changing 1 to 0 or 0 to 1.
- Save this as the "bottom" part of the dataset, representing the younger twins.
- Combine the cases from the "top" and "bottom" datasets saved above, by aggregating rows, into a single dataset.
In the second step above, the random variable can be computed in the syntax using an appropriate SPSS random number generator function. In SPSS, a suitable command is COMPUTE random = RV.BERNOULLI(0.5), which will generate approximately equal numbers of 0 and 1 values at random.
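For example (the seed value is arbitrary, and setting it is optional, but doing so makes the random assignment reproducible):

* Optionally fix the random number seed so the assignment can be reproduced.
SET SEED=20050101.
* Assign 0 or 1 at random, with equal probability, to each elder twin.
COMPUTE random = RV.BERNOULLI(0.5).
EXECUTE.
* Later, in the "bottom" half, give the co-twin the opposite value.
COMPUTE random = 1 - random.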
Merging data sources
In a given dataset, data typically come from several sources, e.g. from parents, twins and teachers, and from different booklets, questionnaires, web tests and admin sources. Separate raw data files are usually used to store the data from different sources; each raw data file is imported and saved as an SPSS data file (see importing raw data above). These files must be combined together, or "merged", into a single dataset. This must be done in such a way that all data relating to a given twin appears in the same row of the dataset; to do this, the files must be merged using a twin identifier as the key variable.
Many background variables that are commonly used in all datasets are conveniently merged in from a reference dataset containing such variables. This minimises the need to duplicate steps such as importing raw data and deriving the background variables in every dataset. Background variables treated in this way are twin sexes, zygosities and exclusion variables. See the background variables page for more details.
Different sets of data may identify twins in different ways, so sometimes it is necessary to modify the IDs so that they match up for merging. In raw twin and teacher data, twins are usually identified by a variable called TwinID. In raw parent data, each case is usually identified by a family identifier called FamilyID. As explained above, after double-entering the parent/family data, twins are identified by a variable called Atempid2. In most scripts, files are merged in this order:
- Merge all per-family data (parent and admin data) on FamilyID.
- Double enter the per-family data, resulting in a file where twins are identified by Atempid2.
- Merge all per-twin data (twin and teacher data) on TwinID.
- Replace TwinID with Atempid2 in the per-twin data.
- Merge the per-family data with the per-twin data on Atempid2.
The means of replacing TwinID with Atempid2 is provided by the raw data file (exported from the TEDS admin database) containing the TwinID, FamilyID and twin birth order for every TEDS twin. After importing this file, Atempid2 can be computed by appending the twin birth order (1 or 2) to the end of the FamilyID value. This file can then be merged with the per-twin data (on TwinID), hence providing the Atempid2 variable needed for merging with the per-family data.
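A minimal SPSS sketch of this replacement and of the final merges, assuming numeric IDs, an invented variable twinorder holding the twin birth order, and invented file names:

* Compute Atempid2 in the ID lookup file exported from the admin database.
GET FILE='twinid_lookup.sav'.
COMPUTE Atempid2 = (FamilyID * 10) + twinorder.
SORT CASES BY TwinID (A).
SAVE OUTFILE='lookup.sav' /KEEP=TwinID Atempid2.
* Add Atempid2 to the merged per-twin data, matching on TwinID.
GET FILE='pertwin.sav'.
SORT CASES BY TwinID (A).
MATCH FILES /FILE=* /TABLE='lookup.sav' /BY TwinID.
SORT CASES BY Atempid2 (A).
SAVE OUTFILE='pertwin2.sav'.
* Merge the double-entered per-family data with the per-twin data, matching on Atempid2.
MATCH FILES /FILE='perfamily_de.sav' /FILE='pertwin2.sav' /BY Atempid2.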
Before two datasets are merged, they must both be sorted in ascending order of the relevant key variable. The files must therefore be re-sorted if a new key variable is about to be used for the next merge. In the scripts, data files are usually sorted prior to saving, in steps preceding each merge.
Before merging, each set of data from a particular source is given a new flag variable to show the presence of data from that source. Sometimes these flag variables are already present in the imported raw data; if not, they are computed after importing the data into SPSS. These flag variables can be used later in the scripts to identify any cases without data from any source in the study; such cases can then be deleted.
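For example, a flag variable might be handled like this (the variable name is invented):

* In the imported parent booklet file, before saving: flag the presence of parent data.
COMPUTE bdata = 1.
* After the merge: cases with no parent data have system-missing bdata, so recode to 0.
RECODE bdata (SYSMIS = 0).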
Scrambling IDs
The family and twin identifiers used in the raw data are the same as those used in the TEDS administrative system and in all contacts with the families. These IDs are easily matched against identifiable individuals, and are never used in analysis datasets. In order to protect the identities of twins and their families, and to maintain confidentiality in the data, alternative "scrambled" (pseudonymous) IDs are used in the main TEDS datasets. The scrambling of the raw IDs is a key stage in dataset construction, and it is generally carried out at an early stage, once the files from all sources have been merged together.
Subsequently, when customised datasets are prepared for researcher analysis projects, variables from different datasets can be merged together using the same pseudonymous IDs. Before the merged dataset is delivered to a researcher, the pseudonymous IDs are changed into randomly anonymised IDs, as a way of protecting participant confidentiality. This final anonymisation of the IDs is not, however, part of the routine preparation of the main TEDS datasets.
The treatment of IDs is dealt with more fully in the scrambled IDs page.
Labelling variables
All variables to be retained in the analysis dataset are labelled, to help researchers identify what each variable represents, and to give information about value codes. While there is no rigid convention for labelling variables, where possible the following information is included in the variable label:
- The study or approximate twin age
- The source of data (e.g. parent, twin or teacher, booklet or web)
- The name of the measure
- For item variables, the item number where appropriate
- For questionnaire items, a truncated or simplified version of the text of the question
- Where appropriate, the nature of the item, e.g. response, score or response time for a web item.
- Some indication of the coding, range of values, or units of measurement
Value labels are also added to categorical integer-valued variables, where there are three or more categories present. The use of value labels removes the need to specify all the values within the text of the variable label.
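For example, an invented item labelled loosely along the lines of the conventions above:

VARIABLE LABELS item05 'Parent booklet behaviour measure item 5: child often loses temper (0-2)'.
VALUE LABELS item05 0 'not true' 1 'somewhat true' 2 'certainly true'.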
Creating scales and composites
Scales, composites and other derived variables are generally added to the dataset after the initial stages of merging the data, and before labelling all the variables. Various methods are used to compute scales, but most involve calculation of the mean or sum of a set of related item variables. Sometimes, some of the item variables must be reverse-coded first (if this was not already done earlier in the scripts), so that all the relevant items are coded in a consistent way. Sometimes, the item variables must be transformed in other ways before computing scales: for example, some scales are derived from standardised items; or sometimes, neutral responses like "don't know" must be removed before scaling.
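A minimal sketch of a typical scale computation, with invented item and scale names and an arbitrary threshold for the number of valid items:

* Reverse-code one item first, if not already done, so all items run in the same direction.
RECODE item03 (0=2) (2=0).
* Compute the scale as the mean of the items, requiring at least 3 valid (non-missing) items.
COMPUTE conduct1 = MEAN.3(item01, item02, item03, item04, item05).
EXECUTE.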
For per-twin data (from teachers and twins), it is helpful to create scales before double entry, so that each scale variable only needs to be computed and labelled once. The new scale variables are then effectively copied and matched to the co-twins during double entry.
For per-family data (typically from parents), double entry may precede the creation of scales and derived variables. If the twin-specific items are already correctly double entered, then any scale derived from them will also be correctly double-entered (see double entering parent/family data above). If double-entry has not yet been done when the scales are computed, then the scales will subsequently have to be double entered along with the items.
Some derived variables added may be referred to as "composites" rather than scales. Usually, the term "composite" variable implies that data from two or more different measures have been combined together, whereas a "scale" variable generally is computed from items of the same measure. Use of the term "composite" may also imply that the derivation is more complex than for a "scale", the latter typically involving a straightforward mean, sum or count. However, these distinctions are not strict.
Creating background variables
Many of the background variables are not derived in each separate dataset, but instead are merged in from a reference dataset as mentioned above. This reduces the need to duplicate the same derivations in every dataset, hence shortening the processing and reducing the risk of errors. The background variables merged in this way generally include zygosities and exclusion variables. See the background variables page for details. See also the exclusions page.
However, some background variables must be derived separately for each dataset. These include twin ages, which are specific to the data collection, and sometimes study-specific exclusion variables.
Within each dataset, there are typically different twin age variables for different data collections. For example, within a given study, the data from parent booklets, teacher questionnaires, twin web tests and twin booklets may all have been returned on quite different dates. The two twins in a pair may often return data on different dates, hence their age variables are double entered where appropriate.
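For example, a twin's age at return of a booklet might be computed from the return date and the birth date, in decimal years (the variable names are invented):

* Age in years = days between birth and return, divided by the average year length.
COMPUTE bage1 = DATEDIFF(breturndate, twindob, "days") / 365.25.
FORMATS bage1 (F5.2).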
Double entering twin data
Raw data from twin-specific sources, such as teacher questionnaires, twin tests and twin booklets, will contain one row of data per twin. Every data item relates to this twin. To double enter these data, the corresponding co-twin data items must be added to the same row of data for every twin. The data items must therefore be duplicated and matched to the co-twins.
Note that double entering the per-twin data, as described here, involves different processing from that involved in double entering the per-family data, as described above. Note also that double entering the per-family data is usually done at an early stage in the scripts, before the per-family data are merged with the per-twin data. However, double entering the per-twin data is more conveniently done later, once all the data sources have been merged together, after items have been labelled, and after scales have been computed.
In most cases, this double entry procedure comes after the script that scrambles the IDs. So the description below assumes that the twins are identified by the variable id_twin. The corresponding family identifier is id_fam. The value of id_twin comprises the twin birth order (1 or 2) appended to the value of id_fam. The same double entry procedure can be used with unscrambled IDs Atempid2 and FamilyID if necessary.
The exact method used for double entry can vary between scripts, but the procedure always follows some version of the following steps (a minimal syntax sketch follows the list):
- Rename all item variables from twin-specific sources, so that the variable names end in '1'.
- Sort the dataset in ascending order of twin identifier (usually id_twin).
- At this stage, the dataset represents the data for the "index" twins, i.e. the data relating to the twin identified by id_twin in each row of data. Save this as the "left hand" part of the dataset.
- Now make an equivalent dataset for the co-twins. Start by dropping any per-family variables that do not require double entry, and also dropping variables that have already been double entered (for example, from the parent data).
- Rename all the twin-specific item variables so that the variable names end in '2' instead of '1'.
- Re-compute id_twin, to match the co-twin's id_twin value in each case: if the final digit of id_twin is 2, change it to 1, and vice versa.
- Sort the dataset in ascending order of the re-computed id_twin.
- At this stage, the dataset represents the data for the co-twins. Save this as the "right hand" part of the dataset.
- Merge the left and right hand parts, as saved above, using id_twin as the key variable. The dataset is now double entered.
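A minimal SPSS sketch of these steps, for a single invented twin item tvar1, invented per-family variables pvar1 and pvar2, and invented file names; the real scripts handle many variables at once.

* "Left hand" part: data for the index twins, with twin items already renamed to end in '1'.
SORT CASES BY id_twin (A).
SAVE OUTFILE='left.sav'.
* "Right hand" part: rename the twin items to end in '2' and flip the final digit of id_twin.
RENAME VARIABLES (tvar1 = tvar2).
COMPUTE id_twin = (TRUNC(id_twin / 10) * 10) + (3 - MOD(id_twin, 10)).
SORT CASES BY id_twin (A).
* Drop per-family variables and variables already double entered, then save.
SAVE OUTFILE='right.sav' /DROP=pvar1 pvar2.
* Merge the two parts on id_twin: the dataset is now double entered.
MATCH FILES /FILE='left.sav' /FILE='right.sav' /BY id_twin.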
After double entry, it may be possible to reduce the size of the dataset by deleting any twin pairs that do not have any data in the study. This may be done by selecting on the flag variables that show the presence of data from each source. To maintain paired twins in the dataset, and to avoid unpaired "singletons" in the dataset, selection should be based on the double-entered flag variables for both twin and co-twin for each part of the data.
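For example, assuming invented double-entered flag variables tqdata1/tqdata2 (twin questionnaire) and tcdata1/tcdata2 (teacher questionnaire), each coded 1/0:

* Keep a row only if data are present, for the twin or the co-twin, from at least one source.
* Both rows of a pair satisfy or fail this condition together, so pairs are kept intact.
SELECT IF (tqdata1 = 1 OR tqdata2 = 1 OR tcdata1 = 1 OR tcdata2 = 1).
EXECUTE.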