Newsportal - Ruhr-Universität Bochum
Big data: opportunity or risk?
Big data: the term has many greedily licking their lips in anticipation. Intelligence services sense the opportunity to find solutions for security issues within mass data, marketing experts rejoice at the wealth of information about consumers and researchers gain access to data sets of a size that they themselves could never collect.
We all contribute in passing to the ever increasing mountain of data: when we google something, when we enter data about ourselves in health apps or simply due to the fact that we are registered with the authorities. We spoke to Prof Dr Thomas Bauer to find out what big data could change for better or for worse.
Mr. Bauer, you have already worked a lot with big data. What questions do you seek to answer by analysing large volumes of data?
Labour economists have been working with social insurance data for a long time, especially data held by the Federal Employment Agency. Everyone who has ever held a job subject to mandatory social security contributions in Germany is included in this data set.
This data has been used, for example, to evaluate measures of active labour market policy. The evaluations demonstrated that some of these measures do not help in returning unemployed people to the job market. On the contrary, they worsen their chances. However, it is debatable whether we are really dealing with big data in this context.
When we talk about big data, we often mean large volumes of unstructured data. This would not include social insurance data because it is structured. Unstructured data arises, for example, whenever somebody searches for a product using Google, or when a hospital records lots of different information in the form of a doctor’s letter.
Prof Dr Thomas Bauer is Head of the Empirical Economics Department at the RUB and Vice President of the Leibniz Institute for Economic Research in Essen. He has been a Member of the Statistical Advisory Board of the Federal Statistical Office since 2005. He obtained a degree in economics at the Ludwig Maximilian University of Munich. In 1997, he completed his doctorate on the subject of “The effects of migration and immigration policy on the job market”.
Have you also previously worked with this type of unstructured data?
I have worked with partially unstructured data from the company Immobilienscout, the largest German Internet provider of advertisements for the sale of houses and the rental of apartments or houses.
One reason for the last financial crisis was that a real estate price bubble in a small region of the USA burst and the associated problems spread across the entire world like an infection. We asked ourselves whether we could identify this type of real estate price bubble in a small region in Germany.
The answer was no. There was no data available that served this purpose. There was no indicator to show real estate prices rising in a small region within short periods of time. As a result, we developed a real estate price index in cooperation with Immobilienscout that could fill this gap.
It almost appears as though we could answer any research question if we could only collect enough data.
No, that is certainly not the case. The potential offered by big data in this context is often completely overrated. It is always dependent on the specific question being posed.
From a statistical standpoint, there are often problems with big data because the data generating process and the underlying population are often completely unknown. Yet without this information, we cannot assess the reliability of our statements based on this data.
Do you have an example of this?
In August, a ranking list was circulated by the media that suggested that an above-average number of football fans have a university degree. However, the sample had been put together in such a way that it already contained an above-average number of academics.
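The sampling problem described here can be sketched in a small simulation. This is only an illustration under invented assumptions: the rates, the inclusion probabilities and the names `simulate` and `degree_share_among_fans` are hypothetical and have nothing to do with the actual ranking. In the simulated population, holding a degree and being a football fan are independent, yet a sample that over-represents graduates makes fans look unusually academic.

```python
import random


def degree_share_among_fans(people):
    """Fraction of football fans in `people` who hold a university degree."""
    fans = [p for p in people if p["fan"]]
    return sum(p["degree"] for p in fans) / len(fans)


def simulate(seed=0, n=100_000, degree_rate=0.2, fan_rate=0.4,
             base_inclusion=0.2, graduate_factor=4.0):
    """Return (population share, biased-sample share) of graduates among fans.

    All parameter values are invented for illustration.
    """
    rng = random.Random(seed)
    # Degree and fandom are drawn independently: no real relationship exists.
    population = [
        {"degree": rng.random() < degree_rate, "fan": rng.random() < fan_rate}
        for _ in range(n)
    ]
    # Biased sampling: graduates are `graduate_factor` times as likely
    # to end up in the sample as non-graduates.
    sample = [
        p for p in population
        if rng.random() < base_inclusion * (graduate_factor if p["degree"] else 1.0)
    ]
    return degree_share_among_fans(population), degree_share_among_fans(sample)


if __name__ == "__main__":
    pop_share, sample_share = simulate()
    print(f"graduates among fans, population:    {pop_share:.2f}")
    print(f"graduates among fans, biased sample: {sample_share:.2f}")
```

With these made-up settings, the share of graduates among fans is about 20 percent in the population but about 50 percent in the sample in expectation, purely because of how the sample was drawn.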
There is also a second problem when it comes to analysing big data. One often hears from the big data community that, due to the mass of data, there is no longer any need to worry about causality. This is a complete fallacy. It is irrelevant whether thousands or millions of observations flow into an empirical analysis. Although more observations increase the precision of the estimates, they do not per se help to uncover the causal relationship behind an identified correlation.
We thus do not know what is the cause and what is the effect.
Exactly. Let us assume that the figures from the football example were representative. We would still not know the reason for this relationship. Do you have to have a university degree to understand football? Or does football in some way increase intelligence? Even if we were to ask the whole German population, we still couldn’t say anything about the matter. In order to shed some light on the subject, we would require more in-depth strategies and methods.
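The point that more observations sharpen an estimate without revealing what causes what can be illustrated with a short simulation. It is a sketch under invented assumptions, not a description of any real data set: a hidden variable `z` drives both `x` and `y`, while `x` has no effect on `y` at all. The function names `confounded_sample` and `slope_and_se` are hypothetical.

```python
import math
import random


def slope_and_se(xs, ys):
    """OLS slope of y on x and its conventional standard error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def confounded_sample(n, seed=0):
    """Draw n pairs (x, y) that share a hidden cause z; x does not affect y."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)             # unobserved common cause
        xs.append(z + rng.gauss(0, 1))  # x depends on z only
        ys.append(z + rng.gauss(0, 1))  # y depends on z only, not on x
    return xs, ys


if __name__ == "__main__":
    for n in (1_000, 100_000):
        slope, se = slope_and_se(*confounded_sample(n))
        print(f"n={n:>7}: slope={slope:.3f}, standard error={se:.4f}")
```

In this setup the estimated slope settles near 0.5 and its standard error shrinks as the sample grows, even though the true causal effect of `x` on `y` is zero by construction. A larger sample only makes the misleading number more precise.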
Collecting data is one thing but correctly evaluating the data is another thing entirely. You and your colleagues regularly choose your “Unstatistic of the Month”.
This enables us to draw attention to the problems associated with evaluating data and interpreting statistics. It is a subject close to our hearts. Statistical illiteracy has become widespread – it is even considered hip in Germany. Often the statistics themselves are not incorrect; rather, they have been incorrectly interpreted or the results incorrectly presented. We want to raise awareness of the correct interpretation of statistics.
How do you find the topics for your unstatistics?
We find them by reading the newspaper. Alongside the above-mentioned football example, another prime candidate for an unstatistic in August was the headline “Readers live longer”. We have now built up a large fan base who also regularly supply us with ideas.
Big data is not set to disappear from our everyday lives anytime soon. Will our daily lives be changed by it?
I am convinced that big data will massively change the healthcare industry. Big data could deliver major advances for the world of medicine. Many people now have health apps that collect all manner of data. This could flow into a database that an attending physician could access in the event of an emergency. The doctor would then know when and what the person last ate, as well as their pulse rate and blood pressure. All diagnoses or even X-rays could be saved to one location and then made available to all doctors where required.
It could even be the case in future that, when a patient is admitted to hospital, the admitting physician asks about his or her symptoms according to a predefined routine and records the answers on a tablet. The underlying software would draw on all of the medical findings collected up to that point in time. The doctor would thus receive a suggested diagnosis and could treat the patient accordingly. This is already technically feasible today.
This initially sounds positive. What is the catch?
Data privacy. At least in comparison to other countries, there is a high level of awareness about this subject in Germany, although I don’t understand some of the behaviour exhibited in this context. On the one hand, some people are not willing to participate in a survey conducted by the Federal Statistical Office because they have privacy concerns. Yet on the other hand they voluntarily disclose their health data in an app. Who knows who is behind which app? Or what the operator does with this data?
Maybe people have the feeling that they as individuals will become lost in the thousands and thousands of data sets.
This is unfortunately not the case. There is geo-referenced data that delivers comprehensive information at a grid resolution of 500 by 500 metres, for example how many luxury cars are registered in the area, the unemployment rate, how many households have ever filed for bankruptcy and the proportion of foreigners.
In a city district containing apartment blocks, there are a lot of people living in an area covering 500 by 500 metres. Yet this is not the case in the countryside or in residential areas consisting of detached houses. In addition, it is easy to enrich a data set with further information from another source. Based on data sets that have been combined in this way, it is often possible to identify with a high degree of probability those people who have filed for bankruptcy. Data privacy becomes more problematic with this type of data.
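Why sparse grid cells and linked data sources raise the risk of identification can be sketched with a toy calculation. The records, field names and the helper `unique_share` below are invented for illustration and do not reflect any real geo-referenced data set. The function measures the share of households that are the only ones with their particular combination of attributes.

```python
from collections import Counter


def unique_share(records, keys):
    """Share of records whose combination of `keys` occurs exactly once."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(records)


# Hypothetical households: six in a dense urban grid cell "U",
# two in a sparsely populated rural grid cell "R".
# The "car" attribute stands in for information linked from a second source.
households = [
    {"cell": "U", "car": "compact"},
    {"cell": "U", "car": "compact"},
    {"cell": "U", "car": "compact"},
    {"cell": "U", "car": "compact"},
    {"cell": "U", "car": "luxury"},
    {"cell": "U", "car": "luxury"},
    {"cell": "R", "car": "compact"},
    {"cell": "R", "car": "luxury"},
]

if __name__ == "__main__":
    print(unique_share(households, ["cell"]))
    print(unique_share(households, ["cell", "car"]))
```

In this toy data no household is unique when only the grid cell is known. Once a single linked attribute is added, both rural households become unique, while the urban households still blend into groups.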
Overall: Do you view big data as more of an opportunity or a risk?
I believe it is a great opportunity if we clarify some important issues. In academia, we need to devote ourselves more intensively to some specific problems when using big data: What is the data generating process? What is the underlying population? Furthermore, we also need to talk more often about the issue of data privacy. And finally we need to combat this statistical illiteracy.
28 September 2016