April 7, 2020 – Reading time: 8 minutes
The CoE Analytics & Sensorics regularly analyses the available COVID-19 data to give insights into the possibilities and constraints of data analysis.
Established as a think tank for statistics and data analysis, the CoE AS normally helps companies understand themselves better by analyzing their data (process, metrology, manufacturing…) to extract hidden information and parameters.
With this project we want to share our daily analysis of the current COVID-19 outbreak and the implications of these statistics. Everyone should have the opportunity to understand how decisions in politics, society and our daily lives may have an impact, and why different stakeholders can interpret the same data in such different ways.
Additional comments, additions and corrections are gratefully received.
Analysis of Data Bases for COVID-19
Most current data analyses are performed on data from Johns Hopkins University (JHU), because these data are updated several times per day and therefore always provide the latest figures, which is of great importance for keeping news up to date.
However, we use the publicly available data of the Robert Koch Institute (RKI) here. These data are updated only once a day but offer one advantage for data analytics: newly reported cases are not assigned to the current day, but to the date on which the doctor diagnosed the infection. As a result, the curve of newly reported cases keeps changing for some days afterwards, which makes the analysis a “living document”. At the same time, this removes one blurring parameter: the time between the diagnosis by the doctor and the appearance of the case in the reports of each state, which often amounts to several days (reporting delay; details can be found in the dashboard of the Robert Koch Institute).
Since the number of subsequently reported cases decreases asymptotically, analyses of the situation a few days ago can be carried out with higher quality and allow conclusions to be drawn, albeit less quickly.
Impacts of the data source
This is a good example of how the goals of a data analysis affect the choice of data source: being up to date as quickly as possible (JHU database) vs. a more homogeneous database for a more profound but less current analysis (RKI database). So always choose your data source according to your use case.
Behavior of reported cases over time
Figures 1 and 2 show logarithmic visualisations of the daily reported and the accumulated COVID-19 cases and deaths, displayed versus the reporting date. In Figure 1, the growth of newly reported cases slows down from March 19th onwards and slows down again after March 29th, which is a clear indicator of the success of the contact ban, as we will see in the following.
Figure 1: (Logarithmic scale) Number of new cases and new deaths displayed vs. the reporting date. One can see that the growth of the number of new cases per day slows down after March 19th. The difference between the date range of March 23rd to March 29th and that of March 30th to April 5th is much smaller than between comparable date ranges before.
Hint: the presented figures use a logarithmic scale, in contrast to the official figures presented by the Robert Koch Institute. We have chosen a logarithmic scale because it makes exponential growth appear linear (changes are spotted more easily) and suppresses distracting small changes. In addition, it makes it possible to visualize the death numbers, which are two orders of magnitude smaller than the number of infections.
An additional observation is a periodic behaviour of the curve. The period is 7 days, with the lowest value on each Sunday. It is of course explained by the opening hours of doctors, laboratories and public health departments. This limits us in that we cannot make reliable statements about the last seven days, but have to wait for a repetition of the cycle.
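As a small illustration of why a logarithmic scale turns exponential growth into a straight line, the following sketch uses a purely hypothetical series (100 cases on day 0, doubling every 3 days; these values are assumptions and not taken from the RKI data). On the linear scale the daily steps keep growing, while the steps of the logarithm stay constant.

```python
import math

# Hypothetical series: 100 cases on day 0, doubling every 3 days.
# These numbers are illustrative assumptions, not RKI data.
initial_cases = 100.0
doubling_time_days = 3.0
cases = [initial_cases * 2 ** (day / doubling_time_days) for day in range(10)]

# Day-to-day increments on the linear scale keep growing ...
linear_steps = [later - earlier for earlier, later in zip(cases, cases[1:])]

# ... while the increments of log10(cases) are constant, i.e. a straight line.
log_steps = [math.log10(later) - math.log10(earlier)
             for earlier, later in zip(cases, cases[1:])]

print([round(step, 1) for step in linear_steps])
print([round(step, 4) for step in log_steps])
```

A change in the growth rate then shows up as a change in slope, which is much easier to spot by eye than a change in curvature.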
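One simple way to work around the weekly cycle is to compare only complete 7-day blocks, as done for the date ranges in the caption of Figure 1. The following is a minimal sketch of that idea; the function names and the daily counts are hypothetical and only chosen to show the weekend dip.

```python
def weekly_totals(daily_counts):
    """Sum daily counts over consecutive, complete 7-day blocks.

    A trailing incomplete week is dropped, because a partial cycle
    would be biased by the weekday effect.
    """
    full_weeks = len(daily_counts) // 7
    return [sum(daily_counts[7 * w: 7 * w + 7]) for w in range(full_weeks)]


def week_over_week_ratios(totals):
    """Ratio of each weekly total to the preceding one (1.0 means no growth)."""
    return [later / earlier for earlier, later in zip(totals, totals[1:])]


# Hypothetical daily counts, Monday to Sunday, with a dip at the weekend.
daily_new_cases = [
    400, 420, 450, 430, 410, 250, 150,   # week 1
    800, 830, 900, 870, 820, 500, 300,   # week 2
    1000, 1050, 1100, 1080, 1020, 620,   # week 3, Sunday still missing
]

totals = weekly_totals(daily_new_cases)
ratios = week_over_week_ratios(totals)
print(totals)  # [2510, 5020]
print(ratios)  # [2.0]
```

The incomplete third week is ignored on purpose: it cannot be compared fairly until the cycle has repeated.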
Extracting the doubling time
Figure 2 shows the accumulated cases against their respective reporting dates. Here, too, a bending of the curve can be seen from March 19th and again from March 29th onwards.
Figure 2: (Logarithmic scale) Accumulated COVID-19 cases and deaths displayed vs. the reporting date. The green, grey and black curves are the best fits for the exponential growth. Note that the doubling time of the green curve is 3 days longer than that of the grey curve, which in turn is 3 days longer than that of the black curve. We also added the major government actions, which influence the behavior of the curve. Additionally, we show an estimate of the number of recoveries.
(*) The estimate is based on the WHO report on the Hubei case, which gives us coarse numbers on the recovery time.
This also coincides with the three fitted exponential curves (note again the logarithmic scale; Figure 3 presents the same graph using a linear scale), which intersect each other on March 19th and March 28th.
Figure 3: (Linear scale) Accumulated COVID-19 cases and deaths using a linear scale for comparison. Note that the number of deaths is small compared to the number of infections. Additionally, we show an estimate of the number of recoveries.
(*) The estimate is based on the WHO report on the Hubei case, which gives us coarse numbers on the recovery time.
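The day on which two fitted exponential segments intersect can be computed in closed form, because both are straight lines on the logarithmic scale. The sketch below assumes each segment has the form N(t) = N0 * 2^(t / T) with doubling time T; the function name and all parameter values are hypothetical and not the fit results behind Figures 2 and 3.

```python
import math


def intersection_day(n0_a, doubling_a, n0_b, doubling_b):
    """Day t at which n0_a * 2**(t / doubling_a) equals n0_b * 2**(t / doubling_b).

    Taking logarithms gives ln(n0_a) + rate_a * t = ln(n0_b) + rate_b * t,
    which is solved for t.
    """
    rate_a = math.log(2) / doubling_a
    rate_b = math.log(2) / doubling_b
    if rate_a == rate_b:
        raise ValueError("parallel lines on the log scale never intersect")
    return math.log(n0_b / n0_a) / (rate_a - rate_b)


# Hypothetical segments: the first doubles every 3 days, the second every 6 days.
kink_day = intersection_day(n0_a=1000, doubling_a=3, n0_b=4000, doubling_b=6)
print(round(kink_day, 6))
```

With these assumed values both curves reach 16,000 cases on day 12, which is where the kink would appear in the plot.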
The doubling time has increased to 9.692 days at the transition from the grey to the green curve, which means that it takes about 10 days for the number of reported infections to double. The government wants a doubling time of far more than 10 days before the current measures can be relaxed. So we are on the right track.
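How a doubling time can be extracted from accumulated case numbers is sketched below. We assume pure exponential growth N(t) = N0 * exp(k * t) within the fit window and fit a straight line to ln N by ordinary least squares; the doubling time is then ln(2) / k. This is one possible approach and not necessarily the fitting procedure used for the curves in Figure 2 (a direct non-linear fit weights the data points differently). The function name and the synthetic data are hypothetical.

```python
import math


def fit_doubling_time(days, cumulative_cases):
    """Least-squares line through (day, ln(cases)).

    Returns (doubling_time_in_days, ln_of_initial_value).
    Assumes N(t) = N0 * exp(k * t), so ln N(t) = ln N0 + k * t.
    """
    logs = [math.log(count) for count in cumulative_cases]
    n = len(days)
    mean_t = sum(days) / n
    mean_y = sum(logs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(days, logs))
    sxx = sum((t - mean_t) ** 2 for t in days)
    growth_rate = sxy / sxx                    # slope k of the fitted line
    intercept = mean_y - growth_rate * mean_t  # ln N0
    return math.log(2) / growth_rate, intercept


# Synthetic data: 1000 cases on day 0, doubling every 10 days (assumed values).
days = list(range(8))
cases = [1000 * 2 ** (t / 10) for t in days]

doubling_time, ln_n0 = fit_doubling_time(days, cases)
print(round(doubling_time, 3))
```

On real data the fit window matters: each of the three segments in Figure 2 would need its own window, and the last days of the series are still subject to late reports.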
Time behaviour of exponential data
The analysis of the time behavior of exponential data can be tricky because the human mind has little intuition for exponential growth. Switching to a logarithmic scale, which makes exponential curves appear linear, makes the analysis much easier. If the data are also periodic, reasonable statements are only possible after at least one period has elapsed.
Effects of the government actions
Taking into account the median incubation time of COVID-19 of 5-6 days (source), the kink in the growth rate on March 19th may be linked to the recommendation of Minister Jens Spahn to work from home. However, the data probably do not allow a reliable statement about this, since the actual system is much more complex. Effects that influence the propagation rate include
- Cancellation of flights and closure of borders
- Quarantine and rapid response by health authorities to break the chain of infection
- Increasing uncertainty of the population and social isolation
- The possibility to efficiently reduce the probability of infection with SARS-CoV-2 by washing hands
- Increasing share of already infected people within social groups
- Forced leave and short-time work
However, the kink on March 29th is clearly related to the start of the contact ban on March 22nd, because the time difference is in the range of the median incubation time. Therefore, we can confirm that the actions of the government have been successful in that they have significantly increased the doubling time. As the maximum incubation time of COVID-19 is 14 days, we expect the curve to flatten even more over the current week.
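The time difference used in this argument is simple date arithmetic. The sketch below uses the dates and incubation times quoted in the text; the variable names are our own.

```python
from datetime import date

# Dates quoted in the text (year 2020).
contact_ban_start = date(2020, 3, 22)
observed_kink = date(2020, 3, 29)

# Incubation times quoted in the text, in days.
median_incubation_days = (5, 6)
max_incubation_days = 14

lag_days = (observed_kink - contact_ban_start).days
print(lag_days)  # 7

# The lag lies just above the median range and well below the maximum.
print(lag_days - median_incubation_days[1])  # 1
print(lag_days <= max_incubation_days)       # True
```

The lag of 7 days is slightly longer than the median incubation time of 5-6 days, so "in the range of" is to be read as an order-of-magnitude statement.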
Coincidence and causality
The more complex a system is, the more difficult it becomes to distinguish between coincidence (simultaneous occurrence) and causality (logical consequence) in the data. In such cases, it is helpful to trace existing structures in the data back to logical causes rather than looking for structures from the perspective of logical causes. Otherwise you will probably fool yourself.
Opinions of the scientific community
Martin Eichner, an epidemiologist from Tübingen and co-responsible for the online COVID-19 simulation, was interviewed by Tagesschau.de (interview in German). His core statement is that the current contact ban does not solve the problem, but only postpones it into the future, as the central problem of missing immunity remains. The goal of the contact ban cannot be to sit out COVID-19 completely, but only to delay the spread of the disease until a vaccine is available. However, as this is expected to take until the end of the year, the contact ban (if continued until then) will cause considerable social, economic and financial damage, which may not be in proportion to its benefits (the number of deaths may not necessarily be lower). Eichner recommends letting the disease spread in waves, i.e. taking the health system to its limits and then restricting public life far more harshly until the situation calms down again. This is repeated until 70% of the population has been infected. But even this will be a huge burden on society.
The curve's integral does not change
The integral of the infection curve, i.e. the total number of all infected persons, is an important variable in epidemiology. Because the infectiousness of the virus requires at least 70% of the population to be immune (i.e., to have gone through COVID-19) before the spread stops, the integral of the infection curve must still be 70% of the population. If the curve is flattened, it therefore becomes much wider so that the integral stays constant.
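The trade-off between height and width can be illustrated numerically. The sketch below uses a bell-shaped toy curve for the new infections per day; this shape, the population of 83 million and the widths of 20 and 80 days are assumptions made for illustration only, not an epidemiological model. Both curves have the same integral, so the four times wider curve has a four times lower peak.

```python
import math


def epidemic_curve(day, total_infected, peak_day, width_days):
    """Toy bell-shaped curve of new infections per day (assumed shape)."""
    norm = total_infected / (width_days * math.sqrt(2 * math.pi))
    return norm * math.exp(-0.5 * ((day - peak_day) / width_days) ** 2)


def integrate(curve, start, stop, step=0.1):
    """Midpoint-rule integral of curve(day) over [start, stop]."""
    n_steps = int(round((stop - start) / step))
    return sum(curve(start + (i + 0.5) * step) for i in range(n_steps)) * step


# 70% of a hypothetical population of 83 million.
total = 0.7 * 83_000_000


def steep_curve(day):
    return epidemic_curve(day, total, peak_day=500, width_days=20)


def flat_curve(day):
    return epidemic_curve(day, total, peak_day=500, width_days=80)


peak_ratio = steep_curve(500) / flat_curve(500)
area_steep = integrate(steep_curve, 0, 1000)
area_flat = integrate(flat_curve, 0, 1000)

print(round(peak_ratio, 2))
print(round(area_steep / area_flat, 3))
```

In this picture, flattening the curve lowers the daily load on the health system but stretches the outbreak over a correspondingly longer time.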
Incidences vs. case density
To make the case numbers of the federal states comparable, the so-called incidence is used. The incidence represents the number of cases per 100,000 inhabitants (see Figure 4).
Figure 4: Calculated incidences for each German federal state. Incidences (cases per 100,000 inhabitants) are one way to make the different federal states comparable.
The population figures of the Länder (in 100,000 inhabitants) are shown in Table 1, middle column.
Table 1: Populations and population densities (in units of Berlin) of the German federal states. The population density correlates with the infection probability. Although we will face some deaths, these numbers will not change significantly over time.
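The incidence is a simple normalization of the case numbers by the population. A minimal sketch follows; the two states and their numbers are hypothetical placeholders and are not taken from Table 1.

```python
def incidence_per_100k(cases, population):
    """Number of cases per 100,000 inhabitants."""
    return cases * 100_000 / population


# Hypothetical example values: (reported cases, inhabitants).
states = {
    "State A": (5_000, 2_500_000),
    "State B": (12_000, 12_000_000),
}

incidences = {name: incidence_per_100k(cases, population)
              for name, (cases, population) in states.items()}
print(incidences)
```

Although State B reports more cases in absolute terms, State A has twice the incidence in this example, which is the effect the normalization is meant to reveal.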
Looking at the incidences of the individual federal states, three main groups emerge over time:
- High incidence: Hamburg, Baden-Württemberg, Bavaria and Saarland
- Medium incidence: North Rhine-Westphalia, Berlin and Rhineland-Palatinate
- Low incidence: all other federal states
This suggests that Hamburg, Baden-Württemberg, Bavaria and Saarland are particularly affected by COVID-19 and have a higher rate of spread than the other German states.
What the incidence does not take into account, however, is that the probability of infection is not the same in the different federal states. Since a distance of more than 2 m provides significant protection against infection, the probability of infection must be higher in densely populated areas than in sparsely populated regions. If the cases are presented in relation to the population density (called “case density” here for simplicity, and normalized to the population density of Berlin), a different picture emerges (Figure 5):
- Bavaria now has the highest case density, followed by Baden-Württemberg
- North Rhine-Westphalia and Lower Saxony will follow later
- Hamburg (previously particularly affected) and all other federal states have relatively low case densities
Figure 5: Calculated cases per population density for each federal state. The population densities are taken from table 1 and are normalized to the population density of Berlin. The case density considers the fact that the infection probability correlates with the population density and therefore visualizes the actual spread of an epidemic (higher case density correlates with a higher spread). However, it is still a simplification, because the numbers are only averages and do not consider city vs. rural population densities.
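As we read the definition above, the case density divides the number of cases by the population density expressed in units of the reference state (Berlin). The sketch below follows that reading; the function name, the reference density and the state values are hypothetical and not taken from Table 1.

```python
def case_density(cases, density, reference_density):
    """Cases divided by the population density in units of the reference.

    Equivalent to cases / (density / reference_density).
    """
    return cases * reference_density / density


# Hypothetical values; densities in inhabitants per square kilometre.
reference_density = 4_000   # assumed density of the reference state
states = {
    "Rural state": (6_000, 200),
    "City state": (3_000, 2_400),
}

densities = {name: case_density(cases, density, reference_density)
             for name, (cases, density) in states.items()}
print(densities)
```

A sparsely populated state with many cases ends up with a high case density, because the same number of cases is less expected there than in a dense city state.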
The spread of COVID-19 is therefore particularly strong in Bavaria and, to a lesser extent, in Baden-Württemberg. The proximity to Italy and France could have a significant influence on this. Another possible influence is the presence of large corporations in these regions (Bosch, Zeiss, BMW, Audi, Daimler, etc.), which have many plants in Asian countries and could thus have increased the probability of transmission at the beginning. However, it is not possible to make reliable statements about these contributions on the basis of the data.
Contributions of the Ischgl case
Research by the Bavarian Broadcasting Corporation (BR) attributes a contribution that should not be underestimated to ski tourism to Ischgl/Tyrol. Compared with the calculated case densities, this would imply that the number of unrecorded cases (see BR’s visualisation) must have been particularly high in Bavaria. This is, however, statistically very unlikely, so the Ischgl case cannot be the central cause of the significantly higher spread in Bavaria.
Comparability of different systems
Absolute numbers are rarely helpful, so it is useful to compare results with literature values or other systems. However, the systems to be compared must differ as little as possible, or only in a controlled way. Unfortunately, this is not the case for the federal states, as their populations, population densities and distributions, and their ways of dealing with the spread of COVID-19 differ widely. Using the population density as a base value increases the comparability somewhat, but cannot establish it completely.