Correcting Data Quality Problems In Running Data

From Fellrnr.com, Running tips

There is a saying that "you can't control what you can't measure." This applies to exercise as much as anything else. There are various approaches to training that use heart rate, pace, power, or other metrics to model and plan training. The problem we face is that all the data we use for training can have quality issues. Every sensor, including GPS, altimeter, and heart rate monitor, is subject to bad readings. Unfortunately, these bad readings can swamp the good data. An easy run where the heart rate reads at or above maximum would appear to be a vastly higher training stress than anything else you might do. I've been working on improving the data quality of my running data for some time, and I've come up with a few strategies for detecting and correcting these issues.

1 Detecting data quality issues

  • The simplest way of detecting a data quality issue is to look for impossible values. Most data associated with running has a physical limit. So, GPS data that is showing a pace faster than world record pace has a problem. Altitude data that is above the highest mountain or below the lowest point on earth has a data quality issue. It's a bit trickier with heart rate, as you have to be sure of your maximum heart rate to detect data that is too high.
  • The next approach is to look for data that is highly dubious. So, a heart rate value that goes up too quickly, or comes down too quickly, or stays too high for too long, all suggest data quality problems. However, you have to be careful of things like pauses in the data. If you pause and restart your watch in a different place, or after a big change in heart rate, your analysis needs to detect the pause.
  • A more sophisticated approach is to look at the relationships between the data streams. You can compare heart rate with power to determine if one of the two is bad.
  • A hierarchical approach to data quality is important. You can validate pace and altitude, which will allow you to calculate an estimated power. You can use that estimated power to validate a running power meter, or to create a power estimate that you have confidence in. You can also validate your power or power estimate based on W' (W prime) balance. If the value goes down too far, it's an indication of problems with the power. Once you have some confidence in your power values, you can use those to validate your heart rate data.
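The first two checks above (impossible values and implausible rates of change, with pauses taken into account) could be sketched roughly as follows. This is a minimal illustration rather than the code I actually run: the `Sample` type, the function name, and every threshold (`max_hr`, `min_hr`, `max_change_per_s`, `pause_gap_s`) are assumptions chosen for the example and would need tuning to your own data.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float   # seconds since the start of the activity
    hr: float  # heart rate in beats per minute


def flag_bad_heart_rate(samples, max_hr=190.0, min_hr=30.0,
                        max_change_per_s=3.0, pause_gap_s=10.0):
    """Return one boolean per sample, True where the reading looks suspect.

    All thresholds are illustrative assumptions, not recommended values.
    """
    flags = []
    prev = None  # last sample that passed both checks
    for s in samples:
        # Check 1: impossible values (outside the physical limits).
        bad = s.hr > max_hr or s.hr < min_hr
        if not bad and prev is not None:
            dt = s.t - prev.t
            # A long gap is treated as a pause, so the rate check is skipped.
            if 0 < dt <= pause_gap_s:
                # Check 2: heart rate changing faster than is plausible.
                if abs(s.hr - prev.hr) / dt > max_change_per_s:
                    bad = True
        flags.append(bad)
        if not bad:
            prev = s
    return flags


readings = [Sample(0, 120), Sample(1, 121), Sample(2, 180),
            Sample(3, 122), Sample(4, 250), Sample(30, 90)]
print(flag_bad_heart_rate(readings))
# [False, False, True, False, True, False]
```

In this example the jump to 180 is flagged by the rate check, the 250 reading by the limit check, and the drop to 90 after a 26 second gap is accepted because the gap is treated as a pause.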

2 Correcting data quality issues

  • The simplest correction is to delete the data stream. While this is drastic, sometimes it's the only possibility if the data stream is too bad.
  • For altitude data where you have good GPS positions, it's possible to look up the altitude online. Strava will do this for you, but it's hard to propagate that downstream. There are various online APIs that will provide altitude data. Unfortunately, this is the only reliable way of replacing bad data that I've come across.
  • It may be possible to replace the deleted data stream with fully synthetic data. For instance, if the GPS or pace data is deleted, you could estimate pace and distance from elapsed time and your average pace. Or you could replace the heart rate data with either a broad average or a value based on the relationship between heart rate and pace or heart rate and power.
  • The ideal approach is to detect the section of bad data and replace just that section.
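As one possible illustration of replacing just the bad section, the sketch below fills each flagged run by linear interpolation between the nearest good readings on either side. Linear interpolation is my assumption for the example, not a method described above, and the function name `repair_sections` is hypothetical. It also assumes evenly spaced samples and a list of flags produced by some earlier detection step.

```python
def repair_sections(values, flags):
    """Replace each flagged run with a straight line between its good neighbours.

    Assumes evenly spaced samples. A run at the very start or end has only
    one good neighbour, so it is filled with that neighbour's value.
    """
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if not flags[i]:
            i += 1
            continue
        start = i
        while i < n and flags[i]:
            i += 1
        end = i  # index of the first good sample after the run, or n
        left = out[start - 1] if start > 0 else None
        right = out[end] if end < n else None
        if left is None and right is None:
            break  # every sample is flagged: nothing to anchor on
        if left is None:
            left = right
        if right is None:
            right = left
        span = end - start + 1
        for k in range(start, end):
            frac = (k - start + 1) / span
            out[k] = left + (right - left) * frac
    return out


heart_rate = [120, 122, 250, 250, 128]
suspect = [False, False, True, True, False]
print(repair_sections(heart_rate, suspect))
```

Here the two readings of 250 are replaced with values stepping evenly from 122 up to 128. Interpolation only makes sense for short gaps; for long bad sections, a value derived from another data stream is likely to be a better guess.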

3 Summary

Here is the outline of my current approach.

  • Validate GPS data or pace data against physical limits.
  • Use the GPS data to validate altitude data.
  • Use GPS and altitude data to calculate running power. If you have data from a power meter, you can validate that data against your calculated power, and then use the power meter data.
  • Do a second pass validation of power data by calculating W' balance.
  • Calculate the relationship between power and heart rate for your overall data. Then use that relationship to find problems with the heart rate data.
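The last step can be sketched as follows, assuming the relationship between power and heart rate is roughly linear and that power has already been validated. The straight-line fit, the function names, and the `tolerance_bpm` threshold are all assumptions for illustration; real heart rate lags behind changes in power, so in practice you would probably want to smooth or average both streams first.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_hr_against_power(power, hr, tolerance_bpm=15.0):
    """Flag heart rate readings that sit far from the value predicted by power.

    tolerance_bpm is an illustrative assumption, not a recommended value.
    """
    slope, intercept = fit_line(power, hr)
    return [abs(h - (intercept + slope * p)) > tolerance_bpm
            for p, h in zip(power, hr)]


power_watts = [200, 220, 240, 260, 280, 300]
heart_rate = [130, 135, 140, 190, 150, 155]
print(flag_hr_against_power(power_watts, heart_rate))
# [False, False, False, True, False, False]
```

Note that the bad reading pulls the fitted line towards itself, so a simple fit like this becomes less reliable as the proportion of bad data grows. Fitting the relationship across many activities, rather than within one, should reduce that effect.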

4 What about machine learning?

The difficulty with machine learning is creating the training data sets. We would need a set of activities with known good data and a set with known bad data to train the machine learning algorithm. Creating the training data set requires most of the steps above, and so I haven't pursued this approach so far. It seems that companies with access to larger datasets, such as Garmin or Strava, might be able to achieve this level of data correction.