Apple Watch 'black box' algorithms unreliable for medical research [u]

An Apple Watch showing a blood oxygen reading.

Apple's use of algorithms to analyze data may be an issue for medical research, after a Harvard professor discovered inconsistencies in data from one Apple Watch accessed at different times.

One of the benefits of mobile and wearable devices like the Apple Watch is that improvements can be made in software. In medical research, this is not necessarily a good thing, and it has prompted one study to rethink its methodology.

According to JP Onnela, an associate professor of biostatistics at the Harvard T.H. Chan School of Public Health, these changes can produce inconsistencies in data collection. This can even happen when the same data is retrieved and analyzed at different moments in time.

While Onnela typically prefers using research-grade devices for data collection in studies, The Verge reports a collaboration with the department of neurosurgery at Brigham and Women's Hospital prompted an examination of consumer hardware. Specifically, the study's team wanted to check how much the results from commercial products like the Apple Watch could differ in terms of accuracy.

Two sets of the same daily heart rate variability data were exported from one Apple Watch, both covering the same period from December 2018 until September 2020. Although the sets were exported on different dates, September 5, 2020, and April 15, 2021, they should have been identical because they dealt with the same timeframe. Instead, differences were discovered.
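A check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual method: the function name `compare_exports`, the dictionary layout keyed by date, and the sample values are all assumptions invented for the example, and none of the numbers come from the study.

```python
from statistics import mean


def compare_exports(first, second):
    """Compare two exports of daily values keyed by ISO date string.

    Returns the dates present in both exports, the subset of those
    dates whose values differ, and the mean absolute difference
    across the shared dates.
    """
    shared = sorted(set(first) & set(second))
    diffs = {day: second[day] - first[day] for day in shared}
    changed = [day for day in shared if diffs[day] != 0]
    mean_abs_diff = mean(abs(v) for v in diffs.values()) if diffs else 0.0
    return shared, changed, mean_abs_diff


# Made-up values for illustration only; not data from the study.
export_a = {"2019-01-01": 42.0, "2019-01-02": 38.5, "2019-01-03": 51.0}
export_b = {"2019-01-01": 42.0, "2019-01-02": 45.5, "2019-01-03": 47.0}

shared, changed, mean_abs_diff = compare_exports(export_a, export_b)
print(f"{len(changed)} of {len(shared)} days differ")
```

If the two exports really described the same recorded values, the list of changed days would be empty. Any non-empty result means the numbers were altered somewhere between recording and export.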

It is thought that tweaks by Apple to algorithms used in the Apple Watch changed how the data was interpreted before being collected.

"These algorithms are what we would call black boxes - they're not transparent. So it's impossible to know what's in them," said Onnela. "What was surprising was how different they are. This is probably the cleanest example that I have seen of this phenomenon."

The changes are a concern for scientific researchers, who want minimal variance in how devices record and report the same sets of data. Small changes may not be a problem for typical users, but for researchers who require consistency, Onnela says "that's the concern."

The findings caused the team to shift away from using consumer hardware and back to medical-grade devices. Onnela proposes that the Apple Watch and other wearable items should only be used if raw data is available or if researchers can be informed of when algorithm changes occur.

The Apple Watch and other Apple hardware have been used for medical studies in the past, and sometimes as the primary device. In April, Apple partnered with the University of Washington to study how the Apple Watch could be used to predict illnesses like flu, or the coronavirus.

Stanford University also looked into whether an iPhone and Apple Watch could be used to remotely assess a heart disease patient's frailty, in a study funded by Apple. Researchers found there was a slight dip in accuracy in at-home testing versus in-clinic versions, though it was put down to "out-of-clinic variability" rather than Apple's sensors.

Update: Apple later told The Verge that algorithm changes are not retroactively applied to past data. The company had no explanation for the discrepancy found by Onnela, but suggested issues might arise when using third-party apps to export data.



35 Comments

mike1 10 Years · 3437 comments


It is thought that tweaks by Apple to algorithms used in the Apple Watch changed how the data was interpreted before being collected.

"These algorithms are what we would call black boxes - they're not transparent. So it's impossible to know what's in them," said Onnela. "What was surprising was how different they are. This is probably the cleanest example that I have seen of this phenomenon."

It's amazing how smart people can be so stupid. It's not a phenomenon. It's called continual updates to tweak the algorithm over the course of almost two years. If this is an issue for your research, you make sure the software is locked down for the length of the study.

dws-2 22 Years · 277 comments

I'm surprised that the Apple Watch is used for research anyway. If I take a heart rate variability reading with the Breathe app, I get around 110-120. The automatic readings are around 20-30. Sort of silly, and it's unclear what meaning I would attach to any of it.

neoncat 5 Years · 165 comments

mike1 said:

It is thought that tweaks by Apple to algorithms used in the Apple Watch changed how the data was interpreted before being collected.

"These algorithms are what we would call black boxes - they're not transparent. So it's impossible to know what's in them," said Onnela. "What was surprising was how different they are. This is probably the cleanest example that I have seen of this phenomenon."

It's amazing how smart people can be so stupid. It's not a phenomenon. It's called continual updates to tweak the algorithm over the course of almost two years. If this is an issue for your research, you make sure the software is locked down for the length of the study.

I realize this is the new-normal for Apple sites—to be a dismissive prick, like a badge of honor—but the problem the article highlights is the researchers *don't have control* over this algorithmic versioning. This is public data, that *Apple itself* is encouraging be used for these studies (ResearchKit, anyone?) I think the frustration is entirely above-board. 

If you're going to welcome someone to do their job with your data, and create an interface for that data, don't be so surprised if that person says, "yeah but... this data isn't what we need." So, where's the problem again?

Right, yes, of course, Anyone But Apple™. Sorry, I'll be better.  

albatrossflyer 16 Years · 27 comments

dws-2 said:
I'm surprised that the Apple Watch is used for research anyway. If I take a heart rate variability reading with the Breathe app, I get around 110-120. The automatic readings are around 20-30. Sort of silly, and it's unclear what meaning I would attach to any of it.

That was the whole point of the study.  Can consumer grade hardware be used for medical studies?

crowley 15 Years · 10431 comments

mike1 said:

It is thought that tweaks by Apple to algorithms used in the Apple Watch changed how the data was interpreted before being collected.

"These algorithms are what we would call black boxes - they're not transparent. So it's impossible to know what's in them," said Onnela. "What was surprising was how different they are. This is probably the cleanest example that I have seen of this phenomenon."

It's amazing how smart people can be so stupid. It's not a phenomenon. It's called continual updates to tweak the algorithm over the course of almost two years. If this is an issue for your research, you make sure the software is locked down for the length of the study.

Why is any post-facto algorithm being applied over historical data?  That seems highly dubious for anything that requires data integrity and is forming part of ongoing health monitoring.

Apple should stop marketing ResearchKit as anything close to fit for purpose if they're going to dick around behind the scenes and without transparency.  Hell, even HealthKit should have some alarm bells going off.  

The opacity around this totally undermines the reliability of health data on the Apple Watch.  Stupid own goal by Apple.