Even Correlations Based On Billions Of Data Points Do Not Prove Causation

Wednesday, August 23, 2017

Readers may have already heard about a recent study by Tim Althoff and colleagues from Stanford University, published in Nature, that analyses smartphone-recorded physical activity data covering 68 million days of activity for 717,527 people in 111 countries (only 46 of which were included in the study).

As one might expect, activity levels vary widely not only across countries but also within them (the authors refer to this within-country variation, in general terms, as “activity inequality”).

It turns out that activity inequality, and not the average level of activity, predicts obesity rates (based on BMI).

Furthermore,

“By quantifying the relationship between activity and obesity at the individual level, we were able to determine why a country’s activity inequality is a better predictor of obesity than average activity level. We find that the prevalence of obesity increases more rapidly for females than males as activity decreases. And while lower activity is associated with a substantial increase in obesity prevalence for low-activity individuals, there is little change in obesity prevalence among high-activity individuals. So given two countries with identical average activity levels, the country with higher activity inequality will have a greater fraction of low-activity individuals, many of them female, leading to higher obesity than predicted from average activity levels alone. These findings are analogous to the phenomenon revealed in past studies of the effects of income inequality on health, whereby a relatively small change in income (in our case, activity) for an individual at the bottom of the distribution can lead to substantial improvements in health. On the basis of our model relating activity inequality to obesity prevalence, we also performed a simulation experiment which, assuming perfect information (Methods), suggests that interventions focused on reducing activity inequality could result in a reduction in obesity prevalence up to four times greater than in population-wide approaches.”
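The arithmetic behind the authors' argument can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the risk curve `obesity_risk`, its constants, and the step counts are invented for illustration and are not taken from the paper. It shows that if obesity risk rises steeply at low activity and flattens at high activity, two populations with the same average activity will differ in obesity prevalence when one is more unequal.

```python
import math

def obesity_risk(steps_per_day):
    """Hypothetical risk curve (an assumption, not the paper's model):
    steep at low activity, nearly flat at high activity."""
    return 0.05 + 0.45 * math.exp(-steps_per_day / 2500.0)

def mean(values):
    return sum(values) / len(values)

def prevalence(population_steps):
    """Expected obesity prevalence: the average individual risk."""
    return mean([obesity_risk(s) for s in population_steps])

# Two made-up populations with the same average activity.
equal_country = [5000] * 1000                    # everyone walks 5000 steps
unequal_country = [2000] * 500 + [8000] * 500    # half walk little, half walk a lot

print("mean steps:", mean(equal_country), mean(unequal_country))
print("prevalence (equal):  ", round(prevalence(equal_country), 3))
print("prevalence (unequal):", round(prevalence(unequal_country), 3))
```

Note that the curve is simply assumed here to be causal. The calculation explains why inequality would predict obesity better than the mean if that assumption held; it does nothing to establish that it does.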

The authors go on to discuss various limitations of their study but fail to mention the biggest limitation of all: the simple fact that correlations, no matter how strong or how large the data set, cannot prove causality.

Thus, while the data do prove the point that you can do all sorts of interesting analyses when you have large data sets, they simply do not prove that activity levels (or activity inequality, for that matter) actually have much to do with obesity at all.

Indeed, one could think of a number of confounders that would otherwise differentiate countries with high activity inequality that happen to have high obesity rates from countries that have low activity inequality and low obesity rates (let’s not even mention reverse causality).
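To make the confounding point concrete, here is a minimal simulation, again purely hypothetical: the variable `confounder` stands in for any unmeasured country-level factor, and all numbers are made up. In this toy world, activity inequality has no effect whatsoever on obesity, yet the two are strongly correlated because both depend on the same hidden factor. When inequality is set independently of that factor (as an intervention would do), the correlation disappears.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n_countries=46, intervene=False, seed=1):
    """Toy model: obesity depends only on a hidden confounder,
    never on activity inequality."""
    rng = random.Random(seed)
    inequality, obesity = [], []
    for _ in range(n_countries):
        confounder = rng.gauss(0, 1)  # hypothetical unmeasured factor
        if intervene:
            # Inequality is set from outside, independent of the confounder.
            ineq = rng.gauss(0, 1)
        else:
            # Observational world: the confounder also drives inequality.
            ineq = confounder + rng.gauss(0, 0.3)
        inequality.append(ineq)
        obesity.append(confounder + rng.gauss(0, 0.3))
    return inequality, obesity

observed = pearson(*simulate())
intervened = pearson(*simulate(intervene=True))
print("correlation, observational:", round(observed, 2))
print("correlation, after intervening on inequality:", round(intervened, 2))
```

Adding more countries or more days of data would only tighten the estimate of the observational correlation; it would not change what happens under the intervention.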

Thus, as nice as the figures presented in the paper may be, it is really hard to follow the authors’ conclusion that,

“Our findings can help us to understand the prevalence, spread, and effects of inactivity and obesity within and across countries and subpopulations and to design communities, policies, and interventions that promote greater physical activity.”

This is not to say that designing such communities, policies, and interventions would not be worthwhile for health, given all of the known benefits of physical activity.

Unfortunately, whether or not these policies would do anything to prevent or reverse obesity is another matter altogether, and remains as unclear after this study as before.

@DrSharma
Edmonton, AB
