What are 'Data' and 'Data linkage'?



What is ‘data’?

According to the Oxford English Dictionary, data is

‘facts and statistics collected together for reference or analysis’.

What is ‘Big Data’?

Advances in storage and analytics mean we can now capture, store and work with many different types of data all at once.

‘Big data’ just means that the file(s) are too big to process on a normal spreadsheet or database. We need to use a combination of maths, statistics and computer science to get answers from these large, complex datasets.  


For the projects in MHDSS, our main source of ‘data’ is NHS health records.

Every time you access a health service in the UK, information (or ‘data’) is created. This information is confidential and controlled by strict privacy laws. However, once your personal information (name, full address, date of birth etc.) is removed, this patient data becomes de-identified and can no longer be traced back to you. Our researchers always work with de-identified data. (See the What about Security and Privacy page for more details.)
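As a rough illustration only, the short Python sketch below shows the general idea of removing personal information from a record. The field names and the record itself are invented for this example, and keeping only the year of birth is an assumption made here; real de-identification of health records involves many more safeguards than dropping a few fields.

```python
# Illustrative only: field names and values are made up for this sketch.
DIRECT_IDENTIFIERS = {"name", "full_address", "date_of_birth"}

def de_identify(record):
    """Return a copy of the record without directly identifying fields.

    Assumption for this sketch: the year of birth is kept, because an
    approximate age is often needed for research.
    """
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "date_of_birth" in record:
        # Dates are assumed to be written as YYYY-MM-DD.
        cleaned["year_of_birth"] = int(record["date_of_birth"][:4])
    return cleaned

record = {
    "name": "Jane Example",
    "full_address": "1 Example Street, Exampletown",
    "date_of_birth": "1982-03-14",
    "diagnosis_code": "F32",
}

print(de_identify(record))
```

The printed record keeps the diagnosis code and the year of birth, but no longer contains the name, address or full date of birth.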


DNA is another key source of data

By comparing the genetic information of hundreds of thousands of people, we hope to gain insights into the causes and potential treatments of common mental health conditions.
Many thousands of volunteers have kindly donated their DNA data to medical research. (Thank you).


Other data sources

Other data may come from sources like census records, birth/death/marriage records and other research projects that have signed up to be our partners.
Complex legal contracts control all of this data sharing (See What about Security and Privacy [link] for more details).


What is Data Linkage?


The theory is that the more data you have, the more you know. So, by comparing more data points, we may reveal relationships that were previously hidden.

Data linkage allows our researchers to bring together information from a wide variety of sources (see above), to create a new, richer dataset.


Data linkage is done by assigning a number to each person and storing a set of links to all their records. Strict privacy rules protect the security and confidentiality of the data. Only the links are stored; the actual data is never brought together in one place. (See What about Security and Privacy [link] for more details.)

Example of a linked data set:
PPID stands for Project Person Identifier – the number assigned to each person in the data.
Researchers receive the minimum amount of data possible, to allow them to complete their research. 

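| Dataset          | PPID   | Year of Birth | Gender | Year of Admission | Length of stay | Postcode | Primary Diagnosis | Additional Diagnosis | Procedure Code | ARDRG |
|------------------|--------|---------------|--------|-------------------|----------------|----------|-------------------|----------------------|----------------|-------|
| Admitted Patient | 254431 | 1982          | Male   | 2005              | 12             | 7001     | 125.10            | C78.0                | 210093         | F62B  |
| Admitted Patient | 254431 | 1982          | Male   | 2008              | 15             | 7001     | 125.89            |                      | 210099         | F62B  |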

[Table contains 2 rows from an example dataset, showing two hospital admissions for one fictitious 'patient' (both rows share the same PPID). The table headings are Dataset, PPID (project person identifier), Year of Birth, Gender, Year of Admission, Length of stay, Postcode, Primary Diagnosis, Additional Diagnosis, Procedure Code and ARDRG. It shows how data are coded for privacy; for example, postcode appears as '7001' and diagnosis as '125.10'.]

Source: http://www.menzies.utas.edu.au/research/research-centres/data-linkage-unit/what-is-data-linkage [accessed 20/9/18]
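The linkage idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the system our data providers actually use: the dataset names, local record IDs and values are all invented. It shows a link table that stores only pointers to each person's records, and a function that hands back just the fields a researcher has asked for.

```python
# Illustrative sketch: all dataset names, IDs and values are invented.

# Each source keeps its own records, keyed by its own local record ID.
hospital_admissions = {
    "H-001": {"year_admission": 2005, "length_of_stay": 12},
    "H-002": {"year_admission": 2008, "length_of_stay": 15},
}
census = {
    "C-900": {"year_of_birth": 1982, "gender": "Male"},
}
SOURCES = {"hospital": hospital_admissions, "census": census}

# The link table stores only pointers: PPID -> (source name, local record ID).
links = {
    254431: [("hospital", "H-001"), ("hospital", "H-002"), ("census", "C-900")],
}

def extract_for_research(ppid, wanted_fields):
    """Follow the links for one PPID and return only the requested fields."""
    rows = []
    for source_name, record_id in links[ppid]:
        record = SOURCES[source_name][record_id]
        selected = {f: record[f] for f in wanted_fields if f in record}
        if selected:  # skip sources that hold none of the requested fields
            rows.append({"ppid": ppid, "source": source_name, **selected})
    return rows

for row in extract_for_research(254431, ["year_admission", "length_of_stay"]):
    print(row)
```

Because only admission year and length of stay were requested, the two hospital rows are returned and the census details (year of birth, gender) are left out, which mirrors the principle of giving researchers the minimum amount of data they need.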


How is data analysed?


The linked data sets we receive are hundreds of columns wide and hundreds of thousands of rows long. It would not be possible to look at this data and make sense of it by hand.

Instead, our researchers use complex statistical programs and machine learning techniques, which spot patterns much more quickly and reliably than humans ever could.


What is Machine Learning?

Machine learning means that computers use the data they are given to teach themselves how to do tasks, how to recognise patterns and how to make decisions. Machine learning makes it possible for computing systems to become ‘smarter’ as they encounter additional data.

For our researchers, this means they give the computer an example of the data as a starting point (training data). Once the computer has found patterns in this training data, it will know what to look for in any similar dataset it is given. 
Our researchers then examine and interpret these data patterns, by comparing them to currently known facts about that health condition and patterns found by other research methods (e.g. data linkage).
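To make the idea of training data a little more concrete, here is a very small Python sketch. It is not the method our researchers use; it is one of the simplest possible learning techniques (a "nearest average" classifier), and the numbers and group labels are invented and carry no medical meaning. The computer first learns the average of each group from the training data, then uses those averages to sort new data it has not seen before.

```python
# Illustrative sketch of "learn from training data, then apply to new data".
# The numbers and group labels are invented and carry no medical meaning.

def train(examples):
    """Learn the average (centroid) of each labelled group in the training data."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def predict(model, features):
    """Assign new data to the group whose learned average is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

training_data = [
    ([1.0, 2.0], "group A"),
    ([2.0, 1.0], "group A"),
    ([8.0, 9.0], "group B"),
    ([9.0, 8.0], "group B"),
]

model = train(training_data)
print(predict(model, [1.5, 1.5]))  # close to the group A examples
print(predict(model, [8.5, 9.5]))  # close to the group B examples
```

Real research models are far more sophisticated, but the two steps are the same: learn patterns from examples, then apply them to similar data.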

Example: Natural Language Processing

Imagine reading a children’s book and then being asked at the end whether the main character was happy or sad. You could do this relatively easily for one book, or even one shelf of books, but what about a whole library, or every children’s book in every library in Scotland? This would be very time-consuming.

However, computers can be taught to analyse language. They can be told at the start that words like ‘cry’, ‘frown’ and ‘down’ mean sad, and that ‘glad’, ‘smile’ and ‘joy’ mean happy. They will start off looking for these words but will gradually learn that other words often appear beside them (e.g. ‘glee’ and ‘merry’ often appear in happy books). They will therefore ‘learn’ these new words and look out for them in the next set of books that they analyse.
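The children's book example can be sketched in Python as follows. This is a deliberately simple illustration, not the Natural Language Processing software our researchers use. The starting words come from the paragraph above; the four tiny 'books', the function names and the rule that a new word must appear in at least two happy books are all assumptions made for this sketch.

```python
# Illustrative sketch: the starting words come from the text above; the "books" are invented.
from collections import Counter

SAD_SEEDS = {"cry", "frown", "down"}
HAPPY_SEEDS = {"glad", "smile", "joy"}

def label_book(words, happy_words, sad_words):
    """Label a book by counting how many known happy and sad words it contains."""
    happy = sum(w in happy_words for w in words)
    sad = sum(w in sad_words for w in words)
    if happy == sad:
        return "unclear"
    return "happy" if happy > sad else "sad"

def learn_new_words(books, happy_words, sad_words, min_books=2):
    """Find unknown words that appear in several happy books and in no sad book."""
    in_happy, in_sad = Counter(), Counter()
    for words in books:
        label = label_book(words, happy_words, sad_words)
        unknown = set(words) - happy_words - sad_words
        if label == "happy":
            in_happy.update(unknown)
        elif label == "sad":
            in_sad.update(unknown)
    return {w for w, n in in_happy.items() if n >= min_books and w not in in_sad}

books = [
    "the glad rabbit had a merry smile".split(),
    "a merry song of joy and glee".split(),
    "the bear felt glad with glee".split(),
    "the fox had a frown and began to cry".split(),
]

new_happy = learn_new_words(books, HAPPY_SEEDS, SAD_SEEDS)
print(sorted(new_happy))  # ['glee', 'merry']

# A new book that the starting words alone cannot label...
new_book = "the merry owl".split()
print(label_book(new_book, HAPPY_SEEDS, SAD_SEEDS))              # unclear
# ...can be labelled once the learned words are added.
print(label_book(new_book, HAPPY_SEEDS | new_happy, SAD_SEEDS))  # happy
```

Common words such as 'the' and 'a' are not learned as happy words, because they also appear in the sad book.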

Our researchers will use Natural Language Processing technology to ‘read’ the medical notes of stroke patients and investigate the relationship between mental health and recovery after a stroke.
(Read more about this research at Linking Physical & Mental Health [link].)