Your Apple Watch Is Wrong About Your Deep Sleep — By How Much, and What to Trust Instead
You wake up, glance at the Apple Watch sleep graph, see 1 hour 38 minutes of deep sleep, and feel good about recovery. That number — the deep sleep tile specifically — is almost certainly wrong. Not a little wrong. Systematically, structurally wrong in a way that matters if you're using sleep data to decide whether to train hard today.
Recent polysomnography-controlled (PSG) studies have documented the pattern. The Berryhill et al. 2025 head-to-head of six wearables against PSG found that most devices, including the Apple Watch Series 8, differed significantly from the gold standard on total sleep time, sleep efficiency, wake after sleep onset, and light sleep classification. The Journal of Clinical Sleep Medicine 2025 meta-analysis of 24 studies and 798 patients confirmed the direction: wearables overestimate total sleep time and sleep efficiency, underestimate wake after sleep onset, and, specifically for Apple Watch, significantly overestimate light sleep and underestimate deep sleep.
Key Takeaways
- Apple Watch systematically overestimates light sleep and underestimates deep sleep (N3) compared to EEG-based polysomnography.
- The error isn't small — studies show deep sleep can be underreported by 20–40 minutes per night depending on the individual and sleep architecture.
- You can't fix it by wearing the watch tighter or sleeping differently. The limitation is physics: wrist-based PPG + accelerometer cannot reliably separate light NREM from deep NREM.
- Total sleep time and HRV trend are more trustworthy signals from the watch. Stage breakdown is a rough direction, not a calibrated value.
- If you're using deep sleep to decide whether to train hard or back off, you're optimizing against a noisy signal. Use a multi-day trend instead.
How Much Is It Off?
The Berryhill study tested 62 adults (52 men, 10 women, mean age 46) in a single-night PSG session at Antwerp University Hospital. Apple Watch Series 8 was one of six wearables compared against synchronized EEG scoring. The result: Apple Watch significantly overclassified light sleep (N1 + N2) and underclassified deep sleep (N3). The mean absolute error for deep sleep wasn't published as a single number in the abstract, but the pattern was consistent across the device class.
The JCSM meta-analysis pooled 24 studies and 798 patients. The direction was unanimous: wrist-worn devices overestimate total sleep time and sleep efficiency, underestimate wake after sleep onset, and misclassify sleep stages — with the most consistent error being light sleep overestimation and deep sleep underestimation. For Apple Watch specifically, the meta-analysis flagged it as significantly worse than some competitors on stage classification, though still better than pure actigraphy.
Brigham and Women's 2024 head-to-head put Apple Watch, Fitbit, and Oura Ring against PSG in a single-lab setting. Apple Watch and Fitbit both significantly overestimated light sleep and underestimated deep sleep. Oura Ring had no significant difference from PSG on total sleep time, wake, light, deep, REM, or sleep efficiency — but Oura uses a different sensor placement (finger vs wrist) and a different algorithm. The physics of measuring blood flow at the finger is easier than at the wrist, where motion artifact and perfusion variability are higher.
Why the Watch Gets It Wrong
The Apple Watch infers sleep stages from a fusion of accelerometer (movement), optical heart rate (PPG), and heart rate variability. PSG uses EEG to score N1, N2, and N3 directly from brain electrical activity. The gap is not something a software update can close completely.
PPG plus motion can detect "asleep versus awake" reasonably well. It cannot reliably separate light NREM from deep NREM, because the underlying physiological differences — slow-wave EEG activity, muscle tone, specific patterns of autonomic activity — aren't visible on the wrist. The watch sees a still body with a stable heart rate and low HRV. That pattern could be deep sleep, or it could be light sleep in a person who doesn't move much. The algorithm guesses, and it guesses "light sleep" more often than it should.
Wake misclassified as light sleep, and deep sleep collapsed into light sleep — both pull the same direction. The watch sees a lot of "light sleep" because it's lumping together actual light sleep, some deep sleep, and some quiet wakefulness. The deep sleep number you see is the remainder after that lumping, and it's systematically low.
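The lumping effect is easier to see with arithmetic. The sketch below is purely illustrative: the stage minutes and the misclassification fractions are made-up assumptions, not figures from any of the studies cited here, and `reported_stages` is a hypothetical helper, not how the watch's algorithm works. It only shows that when some deep sleep and some quiet wake are both labeled light, the light number grows while deep and wake shrink.

```python
# Illustrative only: every number here is invented to show the direction of
# the bias. None of it comes from the cited studies or from Apple.

def reported_stages(true_minutes, deep_to_light, wake_to_light):
    """Return the stage minutes a device would report if a fixed fraction of
    true deep sleep and true wake were both labeled as light sleep."""
    deep_lost = true_minutes["deep"] * deep_to_light
    wake_lost = true_minutes["wake"] * wake_to_light
    return {
        "wake": true_minutes["wake"] - wake_lost,
        "light": true_minutes["light"] + deep_lost + wake_lost,
        "deep": true_minutes["deep"] - deep_lost,
        "rem": true_minutes["rem"],
    }

# Hypothetical lab-scored night, in minutes (480 minutes in bed).
true_night = {"wake": 40, "light": 240, "deep": 80, "rem": 120}

# Assumed error rates: a quarter of deep sleep and half of wake read as light.
watch_night = reported_stages(true_night, deep_to_light=0.25, wake_to_light=0.5)

for stage in true_night:
    print(f"{stage:>5}: true {true_night[stage]:>3} min, "
          f"reported {watch_night[stage]:.0f} min")
```

Under these assumed rates, 80 true minutes of deep sleep show up as 60, and light sleep climbs from 240 to 280. Total time in bed is unchanged, but reported sleep time rises because half the wake minutes were counted as sleep.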
What This Means for Training Decisions
If you're using your deep sleep number to decide whether to train hard or back off, you're optimizing against a noisy signal. A single night's deep sleep reading can be off by 20–40 minutes. Over a week, the error doesn't stay constant either: some nights the watch guesses closer, some nights further, and you have no way to tell which is which.
The high-confidence information on the wrist is total sleep time and HRV trend. These are the Apple Watch numbers that should change training. Watch-reported total sleep time correlates reasonably well with PSG-measured total sleep time (though the watch still overestimates it slightly). HRV trend, measured consistently each morning, reflects autonomic recovery in a way that stage breakdown doesn't.
Stage breakdown — especially the deep sleep tile — should be treated as a rough direction, not a calibrated value. If your watch shows 30 minutes of deep sleep for three nights in a row, you're probably not getting enough deep sleep. If it shows 1 hour 40 minutes, you might be fine or you might be in the same boat. The trend across a week beats any single night.
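One way to read the deep sleep tile as a direction rather than a value is to look only at the weekly median and at sustained runs of low nights. The sketch below assumes the 30-minute, three-nights-in-a-row rule of thumb from the paragraph above. The function name `deep_sleep_read` and the sample week are hypothetical.

```python
from statistics import median

def deep_sleep_read(nightly_minutes, low_threshold=30, run_length=3):
    """Summarize a week of watch-reported deep sleep as a trend.

    low_threshold and run_length mirror the rule of thumb in the text
    (30 minutes, three nights in a row). They are not clinical cutoffs.
    """
    longest_run = current_run = 0
    for minutes in nightly_minutes:
        # Extend the run on a low night, reset it on any other night.
        current_run = current_run + 1 if minutes <= low_threshold else 0
        longest_run = max(longest_run, current_run)
    return {
        "weekly_median": median(nightly_minutes),
        "sustained_low": longest_run >= run_length,
    }

# Hypothetical watch readings for one week, in minutes.
week = [62, 28, 30, 25, 71, 55, 98]
print(deep_sleep_read(week))
```

A single 98-minute night at the end does not cancel the three low nights in the middle, and a single low night on its own would not trip the flag.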
What to Trust Instead
The most actionable signal from your watch for training is not the sleep stage pie chart. It's the combination of:
- Total sleep time: How many hours did you actually spend in bed asleep? This is the single best predictor of next-day recovery for most lifters.
- HRV trend: A rolling 7-day average of your morning HRV, measured consistently. This reflects autonomic balance better than any single night's stage breakdown. If you're seeing a downward trend, that's a real signal to pull back on training volume.
- Subjective feel: How did you wake up? The watch can't measure this. If you feel beat up and the watch says you slept great, trust the feel.
The stage breakdown is a nice-to-have, not a need-to-have. It's the part of the sleep dashboard that looks the most informative and is actually the least reliable.
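The rolling 7-day HRV average from the list above can be computed from exported morning readings in a few lines. This is a minimal sketch, assuming one HRV value per morning in milliseconds. The 5% drop used to call a trend "downward" is an arbitrary assumption for illustration, not a validated cutoff, and the function names are hypothetical.

```python
from statistics import mean

def rolling_mean(values, window=7):
    """Trailing mean over each full window of consecutive readings."""
    return [mean(values[i - window:i]) for i in range(window, len(values) + 1)]

def hrv_trending_down(morning_hrv_ms, window=7, drop_fraction=0.05):
    """True if the latest 7-day average is more than drop_fraction below the
    average from one full window earlier. The 5% default is an assumption."""
    averages = rolling_mean(morning_hrv_ms, window)
    if len(averages) <= window:
        # Fewer than two non-overlapping windows of data: no comparison yet.
        return False
    latest, previous = averages[-1], averages[-1 - window]
    return latest < previous * (1 - drop_fraction)

# Hypothetical morning HRV readings (ms) for two consecutive weeks.
two_weeks = [60, 62, 58, 61, 59, 63, 60, 55, 54, 56, 53, 55, 52, 54]
print(hrv_trending_down(two_weeks))
```

Comparing two non-overlapping weeks keeps one bad morning from triggering the flag, which is the same reason to prefer a trend over any single night.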
The Wearable-as-Dashboard Problem
There's a deeper issue here that goes beyond Apple Watch. The consumer wearable industry has trained us to look at a dashboard of numbers — sleep stages, readiness scores, recovery percentages — and treat them as definitive. The numbers are colored green, yellow, red. They come with congratulatory messages or gentle nudges. The UX is designed to make you feel informed, not to make you question the data.
This is the Whoop memory problem applied to sleep: the device remembers last night's score, shows it to you with a trend arrow, and you make a decision based on it. But the score is built on a model that has known, documented weaknesses. The deep sleep component is the weakest part of that model.
The better approach is to use the watch as an input to a system that understands its limitations — not as a dashboard you interpret yourself. The watch collects raw data. A coaching layer that knows which signals are reliable (total sleep time, HRV trend) and which are noisy (stage breakdown) can make better decisions than you can by staring at the pie chart.
That's where Dorsi sits. I read your sleep data — total time, HRV, the trend — and I don't show you the stage breakdown because it's not reliable enough to act on. The decision about today's training intensity is made before you open the app. You bring the consistency. I read the watch.
The Compressed Version
If you only remember a few things from this:
- Apple Watch overestimates light sleep and underestimates deep sleep. The error is documented in PSG-controlled studies — Berryhill 2025 (n=62), JCSM 2025 meta-analysis (24 studies, 798 patients), Brigham and Women's 2024 head-to-head.
- The limitation is physics, not software. Wrist-based PPG plus accelerometer cannot reliably separate light NREM from deep NREM.
- Trust total sleep time and HRV trend over stage breakdown. Use the trend across a week, not a single night.
- If you feel beat up and the watch says you slept great, trust the feel.
- The best use of sleep data is one you don't have to stare at — when it quietly shapes today's session in the background.
Chase the inputs — bedtime consistency, sleep environment, stress management — and the sleep numbers follow on their own clock. Chase the deep sleep number directly and you'll spend a year optimizing a chart while the actual recovery work goes undone.
Sources
The Berryhill et al. SLEEP Advances 2025 head-to-head of six wearables against polysomnography is the cleanest recent dataset on stage-scoring accuracy, with 62 adults tested in a controlled lab setting. The Journal of Clinical Sleep Medicine 2025 meta-analysis pooling 24 studies and 798 patients confirms the direction: consumer wrist devices overestimate light sleep and underestimate deep sleep, with Apple Watch specifically flagged for significant misclassification. Brigham and Women's 2024 work adds the head-to-head with Oura that hints at where the physics gets easier — finger-based PPG outperforms wrist-based for stage classification, though the gap is still meaningful for all wrist devices.
Ready to just show up?
Open the free TestFlight beta — Dorsi handles your training decisions.