TEDS Data Dictionary

Algorithm for Scrambling IDs

Contents of this page:

IDs used in TEDS
- Admin IDs
- Pseudonymous and anonymous dataset IDs
Scrambling IDs
Unscrambling IDs

IDs Used in TEDS

Admin IDs

In everyday contacts with families, in the TEDS admin database, and in the raw data, the following IDs are used to identify families and twins:

Name of ID	Identifies	Length	Structure	Fictional example	Comments
FamilyID	a family, or the contact parent of the twins	4 or 5 digits	Numeric values between roughly 1200 and 36000	24501	Assigned at the time of recruitment and unaltered since. (This variable may have names like XFamilid, AFamilid in old scripts)
TwinOrder	a twin within a given family	1 digit	1=elder twin or 2=younger twin	2	Denotes the twin birth order. Generally named twin in scripts and datasets. (May have names like atwin, gtwin in old scripts and datasets.)
TwinID	a twin	7 or 8 digits	Comprises the FamilyID followed by the TwinOrder followed by two randomly generated digits.	24501273	Assigned at the time of recruitment and unaltered since. In rare cases where the birth order (TwinOrder) has been corrected, the value of TwinID has been left unchanged, so in these cases the 3rd-last digit does not match the value of TwinOrder.
Atempid2	a twin	5 or 6 digits	Comprises the FamilyID followed by the TwinOrder.	245012	Used when building datasets, but not generally used in study admin. May have names like Xtempid2, gtempid2 in old scripts.
ChildID	a child of a twin	2, 3 or 4 digits	numeric values between roughly 10 and 2000 (growing as more children are recorded)	1852	Assigned when a child is first recorded in the TEDS admin system. Used to identify a child in the CoTEDS study.
BirthOrder	a child of a given twin	1 digit	1=first born, 2=second born, 3=third born, etc	3	Denotes the child birth order. Generally named childbirthorder in scripts and datasets.
ParentID	a coparent of a child, and partner of a twin	3 or 4 digits	Numeric values between roughly 100 and 1000 (growing as more coparents are added)	623	Assigned when a coparent is first recorded in the TEDS admin system. Used to identify a coparent in the CoTEDS study.

These IDs (except for TwinOrder and BirthOrder) can be directly linked to confidential information about individual TEDS families, twins and children. For this reason, the IDs above are not used in the TEDS analysis datasets. (Furthermore, identifying data such as names and postcodes are not included in the datasets.)

As noted in the table above, the main family and twin identifiers (FamilyID, TwinID) were assigned at the time of recruitment and 1st Contact, when family and twin details were first entered in the TEDS admin database. The twin birth order (TwinOrder) for each named twin was established in the 1st Contact study; in rare cases this was subsequently found to be incorrect and then corrected in the admin database, but FamilyID and TwinID have been left unchanged in all cases. The current and most reliable source of twin birth orders is therefore the admin database; the birth order digits embedded in IDs such as TwinID and TEDS ID may not match it in these rare cases. See also the dataset variables twin and id_twin (below).
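The TwinID structure given in the table can be illustrated with a short Python sketch. It splits a TwinID into its documented parts and compares the embedded birth order digit with a birth order taken from the admin database. The function and field names are hypothetical and chosen for illustration only; they are not part of any TEDS script.

```python
from typing import NamedTuple


class TwinIdParts(NamedTuple):
    family_id: int         # leading 4 or 5 digits
    twin_order_digit: int  # 3rd-last digit: birth order as first recorded
    random_suffix: str     # final two randomly generated digits


def split_twin_id(twin_id: int) -> TwinIdParts:
    """Split a 7- or 8-digit TwinID into the components listed in the table."""
    text = str(twin_id)
    if not text.isdigit() or len(text) not in (7, 8):
        raise ValueError(f"TwinID must have 7 or 8 digits: {twin_id!r}")
    return TwinIdParts(
        family_id=int(text[:-3]),
        twin_order_digit=int(text[-3]),
        random_suffix=text[-2:],
    )


def birth_order_was_corrected(twin_id: int, admin_twin_order: int) -> bool:
    """True when the digit embedded in TwinID disagrees with the admin value."""
    return split_twin_id(twin_id).twin_order_digit != admin_twin_order


# Fictional example from the table above.
print(split_twin_id(24501273))
print(birth_order_was_corrected(24501273, admin_twin_order=2))
```

A check of this kind treats the admin database value as authoritative, and the digit inside TwinID only as a record of what was entered at recruitment.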

In the CoTEDS study, the main child identifier (ChildID) is assigned to each child automatically when a new child is recorded in the TEDS admin system. At the same time, the child's birth order (BirthOrder) is determined from the date of birth relative to any other known children of the same twin. In rare cases, the birth order may subsequently be corrected on discovery of other children, or on correction of birth dates. However, the ChildID remains unaltered and will be linked to CoTEDS questionnaire data.

Additionally, the following IDs have been used to identify DNA samples kept in the lab:

Name of ID	Identifies	Length	Structure	Fictional example	Comments
TEDS ID	a twin	string of 7 or 8 characters	Comprises the letters "TD" followed by the FamilyID followed by the TwinOrder.	TD245012	Historically used for all twin DNA samples collected by TEDS until 2015.
ParentSampleID	parent of a twin	string of 9 characters	Comprises the letters "TDCP" (for the contact parent) or "TDSP" (for the second parent), followed by the FamilyID. A zero is added before a 4-digit FamilyID value, to ensure a consistent ID length of 9 characters.	TDCP24501	Used for the parent DNA collection in 2022.
DNASampleID	twin, child or coparent of child	string of 8 or 9 characters	Comprises the letters "CD" (indicating CoTEDS), followed by a unique 5-digit twin identification number, followed by either "t" (for a twin sample) or "s1" (for a second parent sample) or "c1"/"c2"/"c3" for a child sample, the final digit indicating the child birth order. The 5-digit number is unique to each twin and is different from other participant IDs mentioned above; in the DNASampleID, this 5-digit identifier helps to link members of the same twin family. Note that the DNASampleID was used in the initial CoTEDS DNA collection phases, but has subsequently been replaced by the use of DNAtubeID (below).	CD12345c3	Used for the CoTEDS DNA collections that started in 2023.
DNAtubeID	twin, child or coparent of child	string of 10 characters	Comprises the letter "G" followed by a unique 9-digit sample identifier. These IDs are generated by the manufacturer and printed on each sample tube, using a barcode. When assigned to a participant for DNA collection, the DNAtubeID is scanned from the barcode and copied into the TEDS admin system for the given participant. The structure of this ID does not contain information about whether the sample belongs to a twin, child or coparent. The use of DNAtubeID has now largely replaced the use of DNASampleID (above) for ongoing CoTEDS DNA collections.	G123456789	Used for the CoTEDS DNA collections that started in 2023.
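As an illustration of the string structures in the table above, the following Python sketch builds a TEDS ID and a ParentSampleID, and parses a DNASampleID or DNAtubeID. All function names are hypothetical. The pattern for DNASampleID assumes that a child birth order is a single digit from 1 to 9, which is consistent with the stated length of 8 or 9 characters but is not confirmed beyond the "c1"/"c2"/"c3" examples.

```python
import re


def teds_id(family_id: int, twin_order: int) -> str:
    """'TD' + FamilyID + TwinOrder, for example TD245012."""
    return f"TD{family_id}{twin_order}"


def parent_sample_id(family_id: int, contact_parent: bool) -> str:
    """'TDCP' or 'TDSP' + FamilyID, zero-padded to 5 digits (9 characters)."""
    prefix = "TDCP" if contact_parent else "TDSP"
    return f"{prefix}{family_id:05d}"


# 'CD' + 5-digit twin number + 't', 's1', or 'c' plus the child birth order.
DNA_SAMPLE_ID = re.compile(
    r"CD(?P<twin_number>\d{5})(?P<role>t|s1|c(?P<birth_order>[1-9]))"
)
# 'G' + 9-digit sample identifier printed on the tube.
DNA_TUBE_ID = re.compile(r"G\d{9}")


def parse_dna_sample_id(sample_id: str) -> dict:
    """Return the twin number, sample kind and child birth order (if any)."""
    match = DNA_SAMPLE_ID.fullmatch(sample_id)
    if match is None:
        raise ValueError(f"Not a DNASampleID: {sample_id!r}")
    kinds = {"t": "twin", "s1": "second parent"}
    birth_order = match["birth_order"]
    return {
        "twin_number": match["twin_number"],
        "kind": kinds.get(match["role"], "child"),
        "child_birth_order": int(birth_order) if birth_order else None,
    }


def is_dna_tube_id(value: str) -> bool:
    """Check the format only; the ID itself carries no participant information."""
    return DNA_TUBE_ID.fullmatch(value) is not None


print(teds_id(24501, 2))
print(parent_sample_id(1234, contact_parent=False))
print(parse_dna_sample_id("CD12345c3"))
print(is_dna_tube_id("G123456789"))
```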

Pseudonymous and anonymous dataset IDs

For the purposes of the main TEDS analysis datasets, the IDs are 'scrambled' in order to protect the confidentiality of the data. The IDs in the main TEDS datasets are named id_fam (family ID) and id_twin (twin ID). In CoTEDS datasets, a corresponding ID called id_child may be used. This scrambling is done using an algorithm, devised by Tom Price, which is outlined below. The resulting IDs can be converted back to their original form by a process of 'unscrambling'. Therefore, data identified in this way are categorised as pseudonymous, not strictly anonymous. For all twins, the values of id_fam and id_twin are the same across different datasets, allowing variables to be merged longitudinally.

A further non-reversible and randomised encryption process is necessary to make the IDs truly anonymous. This encryption is a useful additional step in the construction of datasets that are to be shared with researchers; it significantly reduces any risk that the participants could be identified, by irreversibly and randomly modifying the family identifiers. The new IDs are named randomfamid (family ID) and randomtwinid (twin ID) in TEDS datasets and randomchildid (child of twin ID) in CoTEDS datasets. In any given dataset, the encryption is made unique by re-computing the IDs, incorporating random number generation in the computation. Hence, the IDs created in this way differ from one dataset to another, making it impossible to merge with other datasets. This encryption process is not described further on this page. For longitudinal datasets, the data are first merged using identifiable or pseudonymous IDs before the final encryption step.
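The order of operations described above (merge first, anonymise last) can be sketched in Python as follows. This is not the TEDS encryption process, which is deliberately not documented here; it is a minimal illustration that assumes random family IDs are drawn without replacement from the approximate range shown for randomfamid in the summary table on this page. The function name, the example rows and the score variables are hypothetical.

```python
import random


def anonymise(rows, rng):
    """Replace pseudonymous IDs with random IDs drawn afresh for this dataset."""
    families = sorted({row["id_fam"] for row in rows})
    # Assumption: distinct 5-digit values, one per family, no stored key.
    new_ids = rng.sample(range(50000, 70000), k=len(families))
    lookup = dict(zip(families, new_ids))
    anonymous_rows = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in ("id_fam", "id_twin")}
        new_row["randomfamid"] = lookup[row["id_fam"]]
        new_row["randomtwinid"] = new_row["randomfamid"] * 10 + row["twin"]
        anonymous_rows.append(new_row)
    return anonymous_rows


# Step 1: merge two fictional waves of data on the pseudonymous id_twin.
wave_a = {
    876541: {"id_fam": 87654, "twin": 1, "a_score": 10},
    876542: {"id_fam": 87654, "twin": 2, "a_score": 14},
}
wave_b = {876541: {"b_score": 12}, 876542: {"b_score": 9}}
merged = [{"id_twin": k, **wave_a[k], **wave_b.get(k, {})} for k in wave_a]

# Step 2: anonymise only after merging; the mapping is then discarded.
shared = anonymise(merged, random.SystemRandom())
print(shared)
```

Because the mapping from id_fam to randomfamid is regenerated on every run and never stored, two datasets anonymised separately cannot be joined afterwards, which is why the merge has to happen first.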

It is now TEDS policy to use anonymous (not pseudonymous) IDs in datasets provided to researchers. The main exception to this rule is where researchers need to merge their phenotypic dataset with genotypic data for analysis; in these cases, pseudonymous IDs are used. All analysis of genotypic data is done within KCL, and raw genotypic data are not shared externally, hence any dataset shared outside KCL will always be anonymous not pseudonymous.

Datasets used within the LLC TRE, where they can be linked with NHS medical records, have a different twin identifier called STUDY_ID. This is a pseudonymous twin identifier, with long string values. The raw values of the identifier are stored in TEDS, and included as a variable in each dataset submitted to the LLC; the values are then irreversibly hashed by LLC. The hashed values that will be found in the STUDY_ID variable in datasets within the LLC are therefore different from the raw values held in TEDS. However, the hashing is carried out identically for every TEDS dataset, which means that it can still be used for linking TEDS datasets inside the LLC, and it is therefore pseudonymous. The TEDS family identifier randomfamid, as described above, will also be made available to researchers within the LLC; its function is to act as a family 'grouping variable', enabling researchers to identify any pairs of twins related as siblings.
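The reason that identical hashing still allows linkage can be shown with a small Python sketch. The hashing method used by the LLC is not stated on this page; the keyed SHA-256 hash (HMAC) below is an assumption chosen purely for illustration, and the key, raw identifier values and variables are all fictional.

```python
import hashlib
import hmac

# Hypothetical key standing in for a secret held only by the hashing party.
SECRET_KEY = b"example-key-not-a-real-secret"


def hash_study_id(raw_value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministic one-way hash: the same raw value always gives the same output."""
    return hmac.new(key, raw_value.encode("utf-8"), hashlib.sha256).hexdigest()


# Two fictional datasets keyed by the same raw identifier.
dataset_a = {"raw-0001": {"height": 170}}
dataset_b = {"raw-0001": {"weight": 65}}

hashed_a = {hash_study_id(k): v for k, v in dataset_a.items()}
hashed_b = {hash_study_id(k): v for k, v in dataset_b.items()}

# Linking still works on the hashed values, although the raw values are gone.
linked = {
    k: {**hashed_a[k], **hashed_b[k]}
    for k in hashed_a.keys() & hashed_b.keys()
}
print(linked)
```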

This table summarises the pseudonymous and anonymous IDs used in TEDS datasets:

Name of ID	Type	Purpose	Identifies	Length	Structure	Fictional example
twin	-	to specify twin birth order within a pair	a twin within a given family	1 digit	1=elder twin or 2=younger twin	2
childbirthorder	-	to specify birth order for the child of a twin	a child of a given twin	1 digit	1=first born, 2=second born, 3=third born, etc	3
id_fam	Pseudonymous	Protection of confidentiality within the main TEDS datasets while allowing longitudinal data to be merged. Given the scrambling algorithm, it is possible to convert the IDs back to identifiable form.	a family	up to 6 digits	Numeric values between roughly 100 and 999999	87654
id_twin			a twin	up to 7 digits	Comprises the id_fam value followed by the twin birth order.	876542
id_child			a child of a twin	up to 8 digits	Comprises the id_twin value followed by the child birth order.	8765423
STUDY_ID	Pseudonymous	The unique twin identifier used in all TEDS datasets within the LLC TRE, allowing TEDS datasets to be merged with the linked NHS medical record datasets. It will not be possible to trace the hashed values back to original twin identifiers, so for practical purposes the values are anonymous.	a twin	unknown	Hashed string values.	(not available)
randomfamid	Anonymous	Complete protection of confidentiality for shared datasets; randomly generated and unique to each dataset, so merging is impossible. The encryption of the IDs is irreversible.	a family	5 digits	Numeric values between roughly 50000 and 70000	54321
randomtwinid			a twin	6 digits	Comprises the randomfamid value followed by the twin value.	543212
randomchildid			a child	7 digits	Comprises the randomtwinid value followed by the childbirthorder value.	5432123

Note that dataset variable twin is the same as the variable TwinOrder used in admin and in the raw data.
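The way the twin and child IDs in this table are composed from the family ID can be expressed as simple arithmetic. The Python sketch below uses hypothetical function names and assumes a single-digit child birth order, as the stated ID lengths imply.

```python
def make_id_twin(id_fam: int, twin: int) -> int:
    """Append the twin birth order (1 or 2) to the family ID."""
    if twin not in (1, 2):
        raise ValueError("twin must be 1 (elder) or 2 (younger)")
    return id_fam * 10 + twin


def make_id_child(id_twin: int, childbirthorder: int) -> int:
    """Append the child birth order to the twin ID."""
    if not 1 <= childbirthorder <= 9:
        raise ValueError("childbirthorder must be a single digit from 1 to 9")
    return id_twin * 10 + childbirthorder


# Fictional examples from the table above.
id_twin = make_id_twin(87654, 2)
id_child = make_id_child(id_twin, 3)
print(id_twin, id_child)

# The anonymous IDs follow the same pattern.
print(make_id_twin(54321, 2))
```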

Scrambling IDs

Scrambling of IDs refers to the process of converting the original admin IDs (as used in the raw data) into the pseudonymous dataset IDs, as described above. Scrambling of IDs is therefore a routine part of dataset construction, and is included in the scripts used to make the TEDS datasets.

Because this data dictionary is widely shared, as are the TEDS analysis datasets, the actual algorithm for scrambling IDs (in the form of a syntax or script) is not shown here. This is to help protect the confidentiality of the TEDS twins and parents whose data are in the datasets. The detailed algorithm is no longer shared with researchers.

The essential properties of the scrambling algorithm are:

It is reversible.
It converts a value of FamilyID into a value of id_fam. Both values are unique to a specific TEDS family.
For twin data, the algorithm also creates a value of id_twin by appending the twin order (1=elder or 2=younger) to the end of the value of id_fam. Hence, each value of id_twin is unique to a specific TEDS twin.
For CoTEDS child data, the algorithm is adapted to create a value of id_child by appending the child birth order to the end of the value of id_twin. Hence, each value of id_child is unique to a specific twin's child.
The algorithm is guaranteed to work for the fixed range of FamilyID values (roughly 1200 to 36000). For ID values outside this range, the algorithm may not generate unique values of id_fam.
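The properties listed above can be demonstrated with a toy example in Python. This is emphatically not the TEDS scrambling algorithm, which is not published; it is a generic affine transformation with made-up constants, included only to show what "reversible" and "unique within a fixed range" mean in practice.

```python
# Toy illustration only: NOT the TEDS algorithm. All constants are invented.
MODULUS = 1_000_000      # outputs have up to 6 digits
MULTIPLIER = 387_421     # ends in 1, so it shares no factor with 10**6
OFFSET = 271_828
INVERSE = pow(MULTIPLIER, -1, MODULUS)  # modular inverse (Python 3.8+)


def toy_scramble(family_id: int) -> int:
    """Map an ID to a disguised value; one-to-one for 0 <= family_id < MODULUS."""
    return (MULTIPLIER * family_id + OFFSET) % MODULUS


def toy_unscramble(scrambled: int) -> int:
    """Reverse toy_scramble by undoing its steps in the opposite order."""
    return ((scrambled - OFFSET) * INVERSE) % MODULUS


family_ids = range(1200, 36001)
scrambled = [toy_scramble(f) for f in family_ids]

# Reversible: every value converts back to the original.
print(all(toy_unscramble(s) == f for f, s in zip(family_ids, scrambled)))

# Unique: no two IDs in the supported range share a scrambled value.
print(len(set(scrambled)) == len(family_ids))

# Outside the supported range, uniqueness is lost.
print(toy_scramble(24501) == toy_scramble(24501 + MODULUS))
```

In this toy scheme all three checks hold, and the last one shows why a guarantee of uniqueness can only be given for a bounded range of input values.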

The scrambling algorithm is a fairly simple form of encryption, which can be encoded in a syntax or script. The mechanism of the algorithm includes a sequence of steps, which are not described here for reasons of data protection. The algorithm effectively achieves the aim of disguising an original value of FamilyID, such that it is not in any way recognisable from the value of id_fam into which it is converted: the length is typically different, and some or all of the component digits are different.

Unscrambling IDs

This is the reverse of the scrambling algorithm. It converts values of id_fam (or id_twin or id_child) back into the original values of FamilyID. This is effectively achieved by reversing the steps of the scrambling algorithm. For a twin, the unscrambling also generates the twin's birth order, enabling unique identification alongside the FamilyID value. For a child of a twin, the unscrambling generates both the twin's and the child's birth orders, hence enabling unique identification alongside the FamilyID value.
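The part of unscrambling that recovers the birth orders follows directly from the documented structure of id_twin and id_child, and can be sketched in Python as below. The function names are hypothetical, and the final step (converting id_fam back to FamilyID) is intentionally omitted because the algorithm is not published.

```python
def split_id_twin(id_twin: int) -> tuple[int, int]:
    """Return (id_fam, twin birth order) from an id_twin value."""
    id_fam, twin = divmod(id_twin, 10)
    return id_fam, twin


def split_id_child(id_child: int) -> tuple[int, int, int]:
    """Return (id_fam, twin birth order, child birth order) from an id_child value."""
    id_twin, childbirthorder = divmod(id_child, 10)
    id_fam, twin = split_id_twin(id_twin)
    return id_fam, twin, childbirthorder


# Fictional examples from the summary table of dataset IDs.
print(split_id_twin(876542))
print(split_id_child(8765423))
```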

Unscrambling of IDs may be needed for specific admin or research purposes, to identify individual twins or families whose data are of interest in any way, or for checking anomalies in the data, and so on. This unscrambling may be done under the control of the TEDS data manager, but the process is not made available to researchers.