
Test Routine Automation through Natural Language Processing Techniques


Academic year: 2024


A big thank you to all my ground segment colleagues who made my experience at the European Space Operations Center memorable.

Experiments performed on a dataset of European Space Agency Ground Segment test scenarios demonstrate the ability of this domain-specific tool to produce results close to human reasoning and to facilitate test procedures.

Motivation

Problem Statements

Furthermore, the current test routine in the ESOC Ground Segment includes a well-established test framework that allows test cases to be created in a model-driven manner. Can we achieve traceability of test execution requirements based on the natural language expression of test scenarios?

Proposed Solution

Structure

Test Scenarios

The purpose of scenario testing is to test the end-to-end functionality of a software application and ensure that business processes and flows work as required. Test scenarios are the high-level concept of what needs to be tested and are considered critical as they help determine the most important end-to-end transactions or actual use of software applications.

Test Case

Tracing Requirements

Requirements tracing documents the relationships between the user requirements for the system being built and the work products developed to implement and verify those requirements. It helps the project team understand which parts of the design and code implement which user requirements, and which tests are needed to verify that those requirements have been implemented correctly.

Text Mining

These work products include software requirements, design specifications, software code, test plans, and other artifacts of the system development process.

Natural Language Processing

  • Syntax
  • Semantics
  • Discourse
  • Speech

Lexical semantics, Named Entity Recognition (NER), and natural language understanding are also important tasks in this work. Other common tasks in this category are machine translation, natural language generation, optical character recognition, question answering, text relation recognition, and word sense disambiguation. Natural language understanding converts text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate.

Similarity Calculation between Documents

Vector Space Models

The first step in calculating semantic similarity is to obtain a semantic vector representation of the relevant terms. The frequency of each term in each document (tf) is calculated as the square root of the number of times the term appears in the document, while the inverse document frequency (idf) is the logarithm of the total number of documents divided by the number of documents that contain the term. After calculating the semantic vectors, the cosine similarity between them defines the similarity score.
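The tf-idf weighting and cosine scoring described above can be sketched as follows. This is a minimal illustration of the stated formulas (tf = square root of the raw count, idf = log of total documents over document frequency); the example documents are hypothetical.

```python
import math

def tfidf_vectors(docs):
    """TF-IDF vectors using the variant described in the text:
    tf = sqrt(raw term count), idf = log(N / document frequency)."""
    n = len(docs)
    vocab = {t for d in docs for t in d}
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vecs = []
    for d in docs:
        vecs.append({t: math.sqrt(d.count(t)) * math.log(n / df[t])
                     for t in set(d)})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical tokenized documents:
docs = [["open", "telemetry", "display"],
        ["close", "telemetry", "display"],
        ["send", "telecommand"]]
vecs = tfidf_vectors(docs)
```

Documents sharing weighted terms score higher; documents with no common terms score zero.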

Association Rules Mining

Recommender Systems

Recommender Systems for Software Engineering

A test scenario with attributes expressed in natural language acts as input to the system. The user chooses what they consider most relevant until the test scenario is complete, and the system recommends requirements relevant to the functionality under test. The user is provided with intermediate outputs of appropriate recommendations so that they can interact with the system quickly.

Design Constraints

GUI

Test Blocks Tab

The Test Blocks tab allows the user to enter a test step and select one or more functionally equivalent test blocks. When assigning a value to a parameter, the user must enter an equals sign before the value. In this section, the user can select the appropriate test blocks from among the recommended ones.

Requirements Tab

The first option displays several recommendations, while the second opens a dialog box where the user can search for a test block by entering its full name or part of it. The parentheses should be replaced according to the desired start and end step IDs in this table.

System Design

The logic of the system regarding the Test Blocks tab is described in detail in the component diagram of Figure 3.4. Because the software implementation behind the Requirements tab is similar, we discuss the components of the Test Blocks tab in detail and note the core differences where they apply.

Deep Learning Model

Most of the above parameters were chosen based on the suggestions of the creators of Word2Vec, while the size and architecture of the model were decided by the experimental evaluation detailed in Section 4.3.1. We selected a window of 5; min_count is set to 1, since our data set is small, and workers to 8, the number of cores of our development machine.
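To make the window parameter concrete, the sketch below shows how a CBOW-style model derives (context, target) training pairs from a sentence: each word is predicted from up to `window` words on either side. The sentence is a hypothetical test-step phrase; the parameter names mirror the gensim-style interface mentioned above.

```python
def cbow_pairs(tokens, window=5):
    """Generate (context, target) pairs as a CBOW architecture would:
    the target word is predicted from up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

# Hypothetical test-step sentence, shown with a small window for brevity:
sentence = ["verify", "that", "the", "telemetry", "display", "opens"]
pairs = cbow_pairs(sentence, window=2)
```

With min_count = 1, every token in such pairs would be retained in the vocabulary, which suits a small corpus.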

Presenter

Spell Checker

Parser

NLP Filter

The final retrieved information of a test block is a group of individual words that appear in both the description and name of the test block, without any parameter values.
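A minimal sketch of that filtering step, under the assumption that parameter assignments use the `param=value` form mentioned earlier; the tokenizer and test-block names here are illustrative, not the tool's actual implementation.

```python
import re

def extract_keywords(name, description):
    """Sketch of the NLP filter: tokenize the test block's name and
    description, drop parameter assignments (anything of the form
    token=value), and keep only words present in both fields."""
    def tokens(text):
        text = re.sub(r"\S+=\S+", " ", text)      # strip parameter values
        return {w.lower() for w in re.findall(r"[a-zA-Z]+", text)}
    return tokens(name) & tokens(description)

# Hypothetical test block:
kw = extract_keywords(
    "Open_Telemetry_Display",
    "Open the telemetry display with refresh=5")
```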

Recommender

Score of Keywords

Score of Parameters

Association Analysis and Re-scoring

Flow Checker

We analyze the content of the test case to check whether an application must be started before the action in the selected test blocks can be applied. For this reason, we check the prerequisites of the selected block and open or close applications as needed. The Tear Down phase is implemented by selecting the test blocks that close all the applications remaining on the application stack.
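The application stack can be sketched as below: opening an application pushes it, closing pops it, and Tear Down closes whatever is still open in reverse order of opening. The class and method names are assumptions for illustration; the application names are examples from the MICONYS package.

```python
class FlowChecker:
    """Sketch of the Flow Checker's application stack."""
    def __init__(self):
        self.stack = []

    def open_app(self, app):
        # the "push" operation is equivalent to opening a program
        self.stack.append(app)

    def close_app(self, app):
        # the "pop" operation is equivalent to closing it
        if self.stack and self.stack[-1] == app:
            self.stack.pop()

    def tear_down(self):
        """Return the applications to close, most recently opened first."""
        closed = list(reversed(self.stack))
        self.stack.clear()
        return closed

fc = FlowChecker()
fc.open_app("MATIS")
fc.open_app("SCOS-2000")
remaining = fc.tear_down()
```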

Data Storage

In addition, we retrieved information from a database containing 5040 test scenarios, 5569 requirements, and 2160 test blocks from 21 test libraries. For the testing part, the test blocks were extensively analyzed so that only those providing high-level information, close to human reasoning, were included. These comprise 685 of the total 2160 test blocks and usually consist of groups of two or more lower-level test blocks.

Evaluation Measures

Experiments on Recommender Decisions

Text Similarity between Test Steps and Test Blocks

This experiment involves all the test steps of the collected test scenarios with associated test blocks. We aim to observe whether the correctly assigned test blocks are included in the recommendation list, and therefore we calculate the Recall@K metric for each test step with corresponding test blocks in each test scenario. Recall@K tells us for what fraction of test steps the correct relevant test block is retrieved within the first K recommendations.
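Recall@K as used here can be computed as follows; the ranked recommendation list and the relevant test block are hypothetical.

```python
def recall_at_k(recommendations, relevant, k):
    """Fraction of the relevant items found in the top-k recommendations."""
    return len(set(recommendations[:k]) & set(relevant)) / len(relevant)

# Hypothetical ranked test-block recommendations for one test step:
recs = ["TB_OPEN_DISPLAY", "TB_SEND_TC", "TB_CHECK_TM", "TB_CLOSE_DISPLAY"]
relevant = ["TB_CHECK_TM"]
```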

Text Similarity between Test Scenarios and Requirements

Experiments on User Feedback

The MRR is calculated for each test scenario as we iterate through the data set. Similar sequences of blocks are not uncommon, especially in the setup part of a developed automated test. However, this does not mean the sequences are identical, and it negatively affects the ranking of the correct blocks.
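For reference, the Mean Reciprocal Rank averages the reciprocal rank of the first relevant item in each recommendation list; the lists below are hypothetical.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """MRR: average of 1/rank of the first relevant item in each list
    (contributing 0 when no relevant item is retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Hypothetical recommendation lists for two test steps:
lists = [["TB_A", "TB_B", "TB_C"], ["TB_X", "TB_Y", "TB_Z"]]
truth = [{"TB_B"}, {"TB_X"}]
mrr = mean_reciprocal_rank(lists, truth)
```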

Experiments on Efficiency and User Productivity

Time performance

Test Coverage

In addition, we extended the approach to improve mismatched rankings in the provided recommendations using association analysis re-scoring. The proposed system is implemented as a standalone tool suitable for integration with the software components of ESOC's Ground Segment. The tool was designed in accordance with the software testing routine guidelines and procedures.

Contributions

Furthermore, we demonstrated the developed system using appropriate data from the Ground Segment's Mission Control System and showed its strengths and weaknesses in recommending test blocks and requirements. Specifically, the proposed method performs efficiently in associating test steps with test blocks that may contain free text and parameters. In addition, it offers the ability to link proposed requirements to a test scenario, a very time-consuming and laborious task for a human.

Directions for Future Extensions

Applications

Linear relationships captured from Word2Vec

Le and Mikolov [Le and Mikolov, 2014] propose a variation of the Word2Vec algorithm for computing paragraph vectors by adding an explicit paragraph feature to the input of the neural network. When we pass the user input through the NLP filter, we use this vector space model, where each term is the value of a dimension of the model. The hybrid approach has been introduced to avoid the limitations of the content-based and collaborative filtering approaches [Adomavicius and Tuzhilin, 2005].

In [Azizi and Do, 2018], the proposed recommendation system uses three data sets: code coverage, change history and user sessions, to produce a list of the riskiest components of a system for regression testing. To ensure the quality of the automated test cases generated by the tool, the user must follow the guidelines described in section 3.3. Using this benchmark, we can set the weights of the heuristic function of the recommendation system.

Depending on the case, a training corpus should be collected with caution and implementation should take into account the design of the recommendation system in advance to achieve the highest possible quality and effectiveness.

A simple Word2Vec model with CBOW architecture containing only

Word2Vec model with CBOW architecture

As we mentioned above, the similarity measure refers to semantic similarity, a metric defined over a set of documents or terms, where the idea of distance between them is based on the similarity of their meaning or semantic content. We will use an example to go through similarity calculations between a query and a list of document candidates to examine the vector space models listed below. The similarity between two text segments is calculated using the term frequency-inverse document frequency (TFIDF).

Word2Vec model with skip-gram architecture

LSI is based on the principle that words used in the same contexts tend to have similar meanings. The system then recommends to the target user the movies that they have rated highly in the past by those users. The combination of collaborative filtering and content-based approaches is mostly used in industry today [Adomavicius and Tuzhilin, 2005].

Graphical User Interface - Test Blocks Tab

Each step must be listed as a new statement and may contain one or more user-system interactions relevant to the feature being developed. Both the description and the expected result must be a single sentence to achieve top quality recommendations. A table of test scenario steps lists the corresponding steps sequentially (in the order they are executed) to achieve the functionality being developed.

Graphical User Interface - Requirements Tab

High-Level System Design

The "push" operation of the stack is equivalent to opening a program, while the "pop" operation is equivalent to closing it. We have chosen the Mission Control System (MICONYS) software package in the Ground Segment (Fig. 4.1) to perform the evaluation due to its complexity and its ability to collect tagged data.


Component Diagram: Assignment of Test Blocks to Test Steps

Component Diagram: Assignment of Requirements to Test Scenario

Functionality of the Spell Checker component

Spell Checker example

Information extraction from a test step

Information extraction from a test block

We define applications that are opened only once and not closed during the scenario phase as belonging to the installation phase.

Data Storage Objects

The latest version of MICONYS consists of the following software systems: DABYS, DARC, EDDS, FARC, GFTS, MATIS, NIS, SCOS-2000, SFT, SLE API, and SMF [Peccia, 2005]. The weights for these experiments are set to wk = 0.8 and wp = 0.2; however, further improvements in recall values are achieved with different weights, as shown in the weight-tuning experiments. In Figure 4.5 we observe that for some test scenarios the MRR shows little or even negative improvement over 50 iterations, while in others the corresponding test blocks reach high ranks.
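The combined recommender score implied by these weights can be sketched as a simple weighted sum of the keyword and parameter similarity scores; the function name and the example scores are illustrative.

```python
def recommender_score(keyword_score, parameter_score, wk=0.8, wp=0.2):
    """Weighted combination of the keyword and parameter similarity
    scores, with the weights wk = 0.8 and wp = 0.2 from the experiments."""
    return wk * keyword_score + wp * parameter_score

# Hypothetical test block with moderate keyword match, exact parameter match:
score = recommender_score(0.5, 1.0)
```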

The two recommender systems are decoupled in our system architecture, which makes it possible to defer the requirements tracing task to a later point in the testing process.

A simplified spacecraft system. Orange arrows denote radio links;

Relevant Test Blocks in the first k recommendations. The model with

Relevant test blocks recommendations from different Vector Space

Relevant requirements recommendations from different Vector Space

Test Block rankings improvement from user feedback in 50 iterations

Time performance of the tool in the testing dataset

Test Coverage of the testing dataset

This thesis considered the problem of improving a testing routine using natural language processing innovations and techniques.

Example - Candidate Documents in a Recommender Engine

Example - Search Query in a Recommender Engine

TFIDF Recommendations

LSI Recommendations

Jaccard Index Recommendations

Google Word2Vec Model Average Recommendations

Google Word2Vec Model Comprehensive Recommendations

Word2Vec model: Values assigned to numerical parameters after eval-

Presenter: System responses to User actions

Performance of Word2Vec with different sets of parameters for Test

Recommender Scores - Weights Tuning

Performance of Word2Vec with different sets of parameters for Require-

Recommender Scores - Weights Tuning
