What are data unit tests and why we need them

  • Theodore Meynard from GetYourGuide

  • Talk: https://2022.pycon.de/program/MPWLWP/

  • What?

  • Frameworks for Data unit testing
  • In practice at GetYourGuide

What?

  • Data product = Code + Data
  • Data product test = Code test + Data test

How to do data unit testing?

  • Verify expectations about the data, for example:
     • Range and mean
     • Missing values
     • Duplicates
     • Number of samples
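
A minimal, framework-free sketch of the checks listed above, using only the Python standard library. The records, the column names (`booking_id`, `price`), the thresholds, and the `check_data` helper are all hypothetical and chosen only for illustration; they are not taken from the talk.

```python
from statistics import mean

# Hypothetical sample data: booking records with a price column.
rows = [
    {"booking_id": 1, "price": 25.0},
    {"booking_id": 2, "price": 40.0},
    {"booking_id": 3, "price": 31.5},
]


def check_data(rows, min_rows=1):
    """Return the list of failed expectations (an empty list means all passed)."""
    failures = []
    prices = [r["price"] for r in rows if r["price"] is not None]
    ids = [r["booking_id"] for r in rows]

    # Number of samples: the batch should not be suspiciously small.
    if len(rows) < min_rows:
        failures.append("too few rows")
    # Missing values: every record should carry a price.
    if len(prices) != len(rows):
        failures.append("missing price")
    # Duplicates: booking_id is assumed to be a unique key.
    if len(set(ids)) != len(ids):
        failures.append("duplicate booking_id")
    # Range: each price must fall inside an assumed plausible interval.
    if any(p < 0 or p > 1000 for p in prices):
        failures.append("price out of range")
    # Mean: the average price should stay inside an assumed band.
    if prices and not 10 <= mean(prices) <= 100:
        failures.append("mean price outside expected band")
    return failures
```

Calling `check_data(rows)` on the sample above returns an empty list. Appending a record that reuses `booking_id` 3 and has no price makes the missing-value and duplicate checks fail.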

Frameworks

  1. Great Expectations
     • Supports SQL, Pandas, Spark
     • Data profiling: gives a draft of expectations
     • Data validation
     • Data documentation
     • Supports distribution and statistical tests
  2. Pandera
     • Pandas and Spark
  3. TFDV
     • TensorFlow
  4. SODA
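
The profiling and validation steps noted for Great Expectations can be illustrated with a small standard-library sketch: draft expectations from a reference sample, then check a new batch against that draft. This is not the API of Great Expectations or of any other framework listed here; the `profile` and `validate` helpers and the sample values are hypothetical.

```python
def profile(values):
    """Draft expectations from a reference sample of one numeric column."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        # Missing values are only tolerated if the reference sample had some.
        "allow_missing": len(present) < len(values),
    }


def validate(values, expectations):
    """Check a new batch against the drafted expectations; return failure messages."""
    failures = []
    for v in values:
        if v is None:
            if not expectations["allow_missing"]:
                failures.append("unexpected missing value")
        elif not expectations["min"] <= v <= expectations["max"]:
            failures.append(
                f"value {v} outside [{expectations['min']}, {expectations['max']}]"
            )
    return failures


# Hypothetical reference sample and new batch.
reference = [12.0, 30.0, 45.5]
draft = profile(reference)

new_batch = [20.0, None, 60.0]
problems = validate(new_batch, draft)
```

A drafted expectation is only a starting point: in this sketch the range is whatever the reference sample happened to contain, so a person would normally review and loosen or tighten it before using it as a test.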