During a training session on the fundamentals of testing, a participant asked me the following question: what is the meaning of software testing for “Data”?
I take it to mean the significance of software testing in the context of “Data.” But “Data” is a generic term with several meanings depending on the situation, so answering this question requires a well-defined use case in a given context.
Here, I consider a simple case study: investigating a movie dataset from TMDb through data analysis with Python. Among other things, it offers a clear understanding of the test types and test levels that apply when testing for “Data.”
Investigating the dataset
The movie dataset is a CSV data file containing details of around ten thousand movies. The analysis aims to answer critical business questions, such as identifying the traits of high-grossing movies or the features common to less popular ones. I run Python code in a Jupyter Notebook to acquire, load, and transform the data and to create reports.
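As a minimal sketch of the acquisition and loading step (the file name `tmdb-movies.csv` is an assumption for illustration):

```python
import pandas as pd

# Load the movie dataset; the file name is an illustrative assumption.
movies = pd.read_csv("tmdb-movies.csv")

# First look at size and structure: expect roughly ten thousand rows.
print(movies.shape)
print(movies.dtypes)
print(movies.head())
```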
The Jupyter Notebook running the Python code, together with the reports it generates, represents the “system under test”: the test object, as opposed to the other systems it may interact with. Be aware that Jupyter Notebook itself, as software, is an existing subsystem that is out of scope for testing here. My approach to answering the business questions is thus exploratory analysis in the Notebook, following a data processing and transformation phase.
To understand the meaning of software testing in this given context, I recommend viewing this system through different working domains:
- The data at the source,
- The loading and transformation,
- The data visualization.
The data at the source
The dataset, a CSV file in this case, plays a vital role in the system under test. Its quality is crucial, as it directly affects the accuracy of the analysis results. Consequently, customers often specify the dataset’s quality as a critical requirement, and checking this quality is part of the functional testing. In the Notebook, this evaluation is carried out step by step in the “Data Assessing and Inspection” section.
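Such checks might look like the following minimal sketch; the column names (`id`, `revenue`) are assumptions based on the public TMDb dataset, and `movies` is the DataFrame loaded above:

```python
# Completeness: report missing values per column.
print(movies.isnull().sum())

# Uniqueness: a movie id should not appear twice.
duplicate_ids = movies.duplicated(subset="id").sum()
assert duplicate_ids == 0, f"Found {duplicate_ids} duplicated movie ids"

# Validity: revenue should never be negative.
assert (movies["revenue"] >= 0).all(), "Negative revenue values found"
```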
Data loading and transformation
Viewed from another perspective, the system comprises two distinct parts:
- The Source System: the system described earlier, but excluding the Python scripts, the transformation results, and any visualizations.
- The Target System: the same system, but this time including all the Python scripts, their results, and the visualizations.
In the context of “Data,” integration tests typically focus on the data flow from the source to the target system. In our case, the target system is established by executing Python script blocks within the Notebook, so such integration tests do not apply here. However, there are cases where the Notebook’s Python scripts call external software components, such as a web service, whether for specific data processing or to access external data sources. When this happens, we conduct standard integration tests between the two components, for instance to verify that the web service behaves correctly as an interface.
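A hedged sketch of such an integration test; the URL and the response contract are assumptions for illustration:

```python
import requests

# Hypothetical external web service the Notebook could call for data enrichment.
API_URL = "https://api.example.com/movies"

def test_movie_service_interface():
    """Integration test: check the web service honors its interface contract."""
    response = requests.get(API_URL, params={"year": 2015}, timeout=5)
    assert response.status_code == 200
    payload = response.json()
    assert isinstance(payload, list), "Contract: the service returns a list"
    assert all("title" in movie for movie in payload)
```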
Looking at it differently, we can treat data transformation as a part of unit testing. In this approach, we develop and execute individual components – blocks of Python scripts – in a Notebook specifically for data transformation. A development team’s responsibility is to create tests for these individual components while producing them. As an illustration, in the “Identify Customer Segments” Notebook (where I use unsupervised learning techniques to segment a population), I utilize several “assert” blocks for unit testing during data transformation.
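For instance, a transformation step with inline assert blocks might look like this (assuming the `genres` column stores pipe-separated values, as in the public TMDb dataset):

```python
# Transformation: split pipe-separated genre strings into lists.
movies["genres_list"] = movies["genres"].str.split("|")

# Unit-test the transformation on the spot with assert blocks.
first_valid = movies["genres_list"].dropna().iloc[0]
assert isinstance(first_valid, list), "Splitting should yield a list of genres"
assert movies["genres_list"].isnull().sum() == movies["genres"].isnull().sum(), \
    "The transformation must not create or drop missing values"
```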
Moreover, I conduct a system test by running the entire Notebook. This process involves starting with raw data and, through various transformations, arriving at the answers to the initial business questions.
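One way to automate such a system test is to execute the Notebook end to end programmatically. This sketch uses the `nbclient` library; the notebook file name is an assumption:

```python
import nbformat
from nbclient import NotebookClient

# Execute the whole Notebook from raw data to final answers.
nb = nbformat.read("investigate_tmdb_movies.ipynb", as_version=4)
client = NotebookClient(nb, timeout=600)

# Any failing cell (including failing asserts) raises an exception here.
client.execute()
print("Notebook executed end to end without errors")
```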
When data processing is substantial, a separate test team can validate the functional data transformation rules at the system test level.
As a developer or tester, I may walk through the Notebook with a client several times while the client awaits the data analysis results. This is an acceptance testing scenario: the client validates that the visualizations and answers meet the previously outlined requirements.
Data visualization
For data visualization, the testing activity includes observing or interacting with the visual representations (such as graphs or dashboards) from an end user’s perspective. We conduct so-called “reporting” tests against specific visualization requirements, which puts us in the scenario of functional testing as system tests and then acceptance tests. Evaluating non-interactive visual elements against specified requirements is a review process; in essence, it is akin to static testing.
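As a hedged sketch of such a reporting check (the chart, column names, and required labels are illustrative assumptions; `movies` is the DataFrame loaded earlier):

```python
import matplotlib.pyplot as plt

# Build a simple report chart from the movie data.
fig, ax = plt.subplots()
movies.groupby("release_year")["revenue"].mean().plot(ax=ax)
ax.set_title("Average revenue per release year")
ax.set_xlabel("Release year")
ax.set_ylabel("Average revenue (USD)")

# Reporting test: the visual must carry the labels the requirements demand.
assert ax.get_title() == "Average revenue per release year"
assert ax.get_xlabel() and ax.get_ylabel(), "Both axes must be labeled"
```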
Additionally, here are several suggestions for non-functional testing related to “Data”:
- Performance Testing: Employing a load testing tool to simultaneously create numerous reports and visualizations. This type of testing is crucial for evaluating how quickly visualizations are generated and for detecting any potential memory leaks.
- Security Testing: Under certain circumstances, transformations or the generation of visualizations might only be allowed after particular processing steps that safeguard the original data, a process often referred to as anonymization. It is the tester’s responsibility to verify, whether guided by specifications or not, that this anonymization is properly executed (a minimal check is sketched just after this list).
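A minimal sketch of such an anonymization check, assuming a hypothetical `anonymized_df` DataFrame and illustrative identifier column names:

```python
# Direct identifiers that must not survive anonymization (illustrative names).
DIRECT_IDENTIFIERS = {"customer_name", "email", "phone"}

def test_no_direct_identifiers(anonymized_df):
    """Security test: no direct identifiers may remain after anonymization."""
    leaked = DIRECT_IDENTIFIERS.intersection(anonymized_df.columns)
    assert not leaked, f"Direct identifiers leaked: {sorted(leaked)}"
```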
Testing within the Data domain differs from typical testing projects in several respects:
- Proficiency in terminology specific to Data
- A well-informed viewpoint on the object under test
- Knowledge of working domains such as data visualization and transformation
The rising prioritization of data calls for state-of-the-art quality techniques and practices in supplying and using those data. This is why an ever-growing number of professionals seek training in software quality specifically tailored to data.
- The test types are: Functional Testing, Non-Functional Testing, White-box Testing, and Change-Related Testing.
- The test levels are: Component Testing, Integration Testing, System Testing, and Acceptance Testing.
- A specific tool for anonymization: https://www.ihsn.org/software/disclosure-control-toolbox