Automatic generation of multiple-choice tests (Geração automática de testes de escolha múltipla)

In this process, a set of syntactic patterns is generated and used to create multiple choice tests. Results show that the majority of the multiple choice tests returned by both approaches can be used with minor modifications.

Motivation

In this example, we can easily identify that the option "A dictionary" is the only answer that can be selected for a question asking something about words, so this multiple choice test item will be very poor at evaluating knowledge. Another thing that must be taken into account for the success of the distractor selection process is that the selected distractors must be related to the correct answer, not only in topic, but also, for example, in gender.

Goals

Document Overview

This state-of-the-art survey examines current research on computer-aided multiple-choice test generation and related work. In addition, several frameworks for question generation are described (Section 2.3) and possible sources of distractors are analyzed (Section 2.4).

Systems

Computer-aided multiple-choice test items development

Question generation
Distractor generation
Test item review and finalization

WebExperimenter

Web application for question generation assistance

REAP

Free Assessment of Structural Tests

Lexical information for composing multiple-choice cloze items

System comparison

For example, after discovering the sentence "A dictionary is a collection of words with related meanings", the system generates the query. For example, given the previous sentence "A dictionary is a collection of words with related meanings", the system would search WordNet for distractors for "dictionary". To improve the quality of distractor generation, the system selects words with the same part of speech and inflection as the correct answer.
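The part-of-speech and inflection filter described above can be sketched as follows. This is a minimal illustration, not the surveyed system's implementation: the `Word` record, the `select_distractors` function, and the candidate list are all hypothetical, and the candidates stand in for whatever a WordNet lookup would return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    pos: str         # part of speech, e.g. "NOUN" (hypothetical label set)
    inflection: str  # e.g. "singular" or "plural" (hypothetical label set)


def select_distractors(answer, candidates, limit=3):
    """Keep candidates sharing the answer's part of speech and inflection."""
    selected = []
    for cand in candidates:
        if cand.text == answer.text:
            continue  # the correct answer is never its own distractor
        if cand.pos == answer.pos and cand.inflection == answer.inflection:
            selected.append(cand.text)
        if len(selected) == limit:
            break
    return selected


# Illustrative data only; a real system would obtain candidates from WordNet.
answer = Word("dictionary", "NOUN", "singular")
candidates = [
    Word("thesaurus", "NOUN", "singular"),
    Word("glossaries", "NOUN", "plural"),
    Word("define", "VERB", "base"),
    Word("lexicon", "NOUN", "singular"),
    Word("dictionary", "NOUN", "singular"),
]
print(select_distractors(answer, candidates))
```

The plural noun and the verb are rejected because they would let a test taker eliminate options on grammatical grounds alone.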

When all test items are generated, the user prints the test item list and a complete multiple-choice exam developed with the support of the system is presented. To find the meaning of "use", the key on the target phrase is replaced by each main word, and those that better fit that phrase determine the meaning of "consumption".

Question Generation

Framework for question generation
Automatic question generation from queries
Generating high quality questions from low quality questions
Question Classification Schemes

Some of the systems presented aim to assist in language learning, while some use user interaction to achieve high quality standards. This template can be used to find rules for generating questions from keyword-based topics. This is particularly important because most of the available content from the Internet, one of the largest sources of data that can be used for question generation, may contain one or more of these errors that can affect the output of the question generation system.

This can be used to organize patterns for generating questions, and some classifications may require additional mechanisms to improve the quality of generated questions. In both situations, the relation is important, but they are not of the same type.

Distractors Sources

WordNet
Wikipedia
Dictionaries
FrameNet

The Berkeley version of the project is described in the e-book FrameNet II: Extended Theory and Practice (Ruppenhofer et al., 2006), available on the website. First the system architecture is described in Section 3.1, followed by the question generation module in Section 3.2, the distractor generation module in Section 3.3, and finally the editing of question items in Section 3.4. At the end of this chapter, two evaluations of the system, each with a different set of rules, are presented in Section 3.5.

The processed corpus and the rules file are then received by the question generation module (described in Section 3.2), which matches the rules against the dependencies found in the processed corpus and creates question items. In the final stage, the output of the question generation and distractor generation modules is presented to the user, who approves or discards each question item.

Question Generation Module

Rules

The dependency type, dependency member type, dependency flag, and dependency member flag are defined by the NLP chain's processing of the corpus. This rule is activated if, in the corpus processed by the NLP chain, we find a dependency of type "People" with the flag "OK" and two members of the types "Noun[people]" and "Noun[title]". When this rule is matched, #1 and #2 are replaced by the words that matched "Noun[people]" and "Noun[title]", respectively.

In this case, the system looks for dependencies of the type "Business" with the flag "profession", whose two members may be of any type and are flagged as "people" and "profession", respectively. For each corpus unit, the question generation module attempts to match the loaded rules against the dependencies identified by the NLP chain.
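The rule matching described above might look roughly like the sketch below. It covers only the first kind of rule (matching on dependency type, dependency flag, and member types) and assumes a simple in-memory representation; the `Rule` and `Dependency` classes, the question template text, and the sample words are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    dep_type: str
    flag: str
    members: list  # list of (member type, matched word) tuples


@dataclass
class Rule:
    dep_type: str
    flag: str
    member_types: list
    question: str  # template using #1, #2, ... as placeholders
    answer: str


def apply_rule(rule, dep):
    """Return (question, answer) if the rule matches the dependency, else None."""
    if dep.dep_type != rule.dep_type or dep.flag != rule.flag:
        return None
    if len(dep.members) != len(rule.member_types):
        return None
    for (member_type, _), expected in zip(dep.members, rule.member_types):
        if member_type != expected:
            return None
    question, answer = rule.question, rule.answer
    # Substitute from the highest index down so "#1" never clobbers "#10".
    for i in range(len(dep.members), 0, -1):
        word = dep.members[i - 1][1]
        question = question.replace(f"#{i}", word)
        answer = answer.replace(f"#{i}", word)
    return question, answer


# Hypothetical rule and dependency, for illustration only.
rule = Rule("People", "OK", ["Noun[people]", "Noun[title]"],
            "What is the title of #1?", "#2")
dep = Dependency("People", "OK",
                 [("Noun[people]", "Ana Silva"), ("Noun[title]", "director")])
print(apply_rule(rule, dep))
```

A rule that matches on member flags instead of member types, as in the "Business" example, would need an additional field per member; that variant is omitted here.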

Distractor Generation Module

Distractor Search

For the current prototype, the threshold was set to 100% to improve the quality of the distractors generated, at the cost of not being able to find enough distractors.
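The trade-off can be illustrated with a small sketch. It assumes, as a guess at the mechanism, that the threshold is the minimum fraction of the correct answer's features that a candidate must share; the feature names and the functions `feature_overlap` and `filter_by_threshold` are hypothetical.

```python
def feature_overlap(answer_features, candidate_features):
    """Fraction of the answer's features that the candidate also has."""
    if not answer_features:
        return 0.0
    return len(answer_features & candidate_features) / len(answer_features)


def filter_by_threshold(answer_features, candidates, threshold=1.0):
    """Keep candidate words whose overlap with the answer meets the threshold."""
    return [
        word
        for word, features in candidates.items()
        if feature_overlap(answer_features, features) >= threshold
    ]


# Hypothetical feature sets, for illustration only.
answer_features = {"noun", "singular", "feminine"}
candidates = {
    "candidate_a": {"noun", "singular", "feminine"},
    "candidate_b": {"noun", "singular", "masculine"},
    "candidate_c": {"verb"},
}
print(filter_by_threshold(answer_features, candidates, threshold=1.0))
print(filter_by_threshold(answer_features, candidates, threshold=0.6))
```

At a threshold of 1.0 only exact feature matches survive, which is why a strict setting can leave a test item with too few distractors.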

Question Items Edition

Evaluation

First Evaluation

Corpora and Evaluation Parameters
Question/Answer Pairs Generation
Distractor Generation

In future work, an investigation into which features should be prioritized could make it possible to lower this threshold while ensuring the required quality. The first step in evaluating the generated question/answer pairs was simply to accept or discard each one. In addition, CAID's model of test item assessment (see 2.2.1) was then used on the items generated by the prototype.

Large - the test item requires grammatical correction; for example, the question needs to be changed.
Moderate - all distractors of the test item require some corrections, and at most one distractor needs to be replaced or is missing.

Second Evaluation

Corpora and Evaluation Parameters
Question/Answer Pairs Generation
Distractor Generation

One of the major problems with feature search is that it ignores relationships between words when looking for distractors. In the case of the Wikipedia article "História de Portugal", 13 question/answer pairs were created and accepted. In corpora with a historical theme, questions should be generated in the past tense instead of the present tense.

Exploring the fixes needed for each theme would be an interesting point of improvement for this system in the future. Regarding distractor generation, only the generation based on the newspaper corpus will be analyzed, since the system did not find enough distractors in the Wikipedia article.

Evaluation Results Summary

Given multiple question/answer pairs, the system starts by grouping them into sets of two, called seeds. In each seed, the first question/answer pair is used to discover patterns and the second to validate the patterns. These are then used to identify text segments that match the pattern information, which are transformed by the system to generate questions with answers.

Regarding the syntactic analysis of questions, we use the Berkeley parser (Petrov & Klein, 2007) trained on QuestionBank (Judge et al., 2006), a treebank of 4,000 parse-annotated questions. Regarding question category, we use Li and Roth's two-layer taxonomy (Li & Roth, 2002), one of the most popular taxonomies for question classification, which consists of six coarse-grained categories (ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION and NUMERIC) and fifty fine-grained ones.

Seed Creation

QA: How many years did Rip Van Winkle sleep?;twenty
VQA: How many years did Sleeping Beauty sleep?;100

QA: Who pulled the thorn out of the Lion's paw?;androcles
VQA: What was the color of Christ's hair in St John's

To ensure the quality of these seeds, the system sorts the given question/answer pairs according to their syntactic structure, as presented in the previous example.
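Seed creation can be sketched as grouping question/answer pairs by structure and then pairing them off. The sketch below is only an approximation: the real system relies on parser output, whereas `crude_structure_key` (a hypothetical stand-in) simply uses the first two tokens of the question.

```python
from collections import defaultdict


def crude_structure_key(question):
    """Stand-in for syntactic structure: the first two lowercased tokens."""
    return tuple(question.lower().split()[:2])


def make_seeds(pairs, key=crude_structure_key):
    """Group (question, answer) pairs by structure, then pair them into seeds.

    In each seed the first pair plays the QA role (pattern discovery) and
    the second the VQA role (pattern validation). Unpaired leftovers are dropped.
    """
    groups = defaultdict(list)
    for question, answer in pairs:
        groups[key(question)].append((question, answer))
    seeds = []
    for group in groups.values():
        for i in range(0, len(group) - 1, 2):
            seeds.append((group[i], group[i + 1]))
    return seeds


pairs = [
    ("How many years did Rip Van Winkle sleep?", "twenty"),
    ("What is the last word of the Bible?", "amen"),
    ("How many years did Sleeping Beauty sleep?", "100"),
    ("Who pulled the thorn out of the Lion's paw?", "androcles"),
]
for qa, vqa in make_seeds(pairs):
    print("QA:", qa, "VQA:", vqa)
```

Only the two "How many years" questions share a key here, so they form the single seed; the other pairs have no partner and are left out.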

Incorrect question pairing:
QA: What was the language of Nineteen Eighty Four?;newspeak
VQA: What was it.

QA: How many years did Rip Van Winkle sleep?;twenty
VQA: How many years did Sleeping Beauty sleep?;100

QA: What is the last word of the Bible?;amen
VQA: What is the Hebrew word for peace used as both a hello and a goodbye?;shalom

Pattern Bootstrap

Permutations of the components of the first question/answer pair (excluding the first component of the question) together with a wildcard are then created and sent to the selected search engine. The system then analyzes the query results and locates the permutation components in each of the results. If no pattern has been registered for a given seed, the main verb of the question is replaced by each of its inflections, and the auxiliary verb is removed from the sequences to be sent to the search engine.
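The query construction step could be sketched as below. The segmentation of the question into components, the use of `*` as the wildcard, and the quoting of each component are assumptions made for illustration; the actual query syntax depends on the selected search engine.

```python
from itertools import permutations


def build_queries(question_components, answer, wildcard="*"):
    """Build one query per permutation of the seed's components.

    The first question component (typically the interrogative phrase) is
    excluded; the answer and a wildcard are added to the remaining components.
    """
    parts = list(question_components[1:]) + [answer, wildcard]
    queries = []
    for perm in permutations(parts):
        # Quote every component except the wildcard so it is matched as a phrase.
        queries.append(" ".join(p if p == wildcard else f'"{p}"' for p in perm))
    return queries


# Hypothetical segmentation of a seed question from the examples above.
components = ["How many years", "did", "Rip Van Winkle", "sleep"]
queries = build_queries(components, "twenty")
print(len(queries))
print(queries[0])
```

Because the number of queries grows factorially with the number of components, a practical system would likely cap or prune the permutations it actually sends.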

This allows the system to find patterns that are not in the same verb tense as the question. Each pattern component consists of a POS tag followed by parentheses containing the identification of the component of the question/answer pair (structured as "type, component number, subcomponent number").

Corpus Sources

For pattern components that have a "!" at the end of the POS tag, the parentheses contain text that must be found in the corpus to trigger the pattern; the matched text is not used in generating the question/answer pair.
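To make the component notation concrete, the sketch below parses one possible textual encoding of it. The exact syntax is an assumption: strings such as `NP(Q,1,0)` and `VBD!(wrote)`, the type labels, and the names `PatternComponent` and `parse_component` are hypothetical and may differ from the format the system actually stores.

```python
import re
from dataclasses import dataclass
from typing import Optional

# POS tag, optional "!", then a parenthesized body (assumed encoding).
COMPONENT_RE = re.compile(r"^(?P<pos>[A-Z$]+)(?P<lit>!?)\((?P<body>[^()]*)\)$")


@dataclass
class PatternComponent:
    pos: str
    literal: Optional[str] = None       # required corpus text for "!" components
    source: Optional[str] = None        # type of the referenced component
    component: Optional[int] = None     # component number
    subcomponent: Optional[int] = None  # subcomponent number


def parse_component(token):
    """Parse one pattern component written in the assumed notation."""
    match = COMPONENT_RE.match(token.strip())
    if match is None:
        raise ValueError(f"not a pattern component: {token!r}")
    if match.group("lit"):
        # "!" components carry literal text that only triggers the pattern.
        return PatternComponent(pos=match.group("pos"), literal=match.group("body"))
    kind, comp, sub = [part.strip() for part in match.group("body").split(",")]
    return PatternComponent(pos=match.group("pos"), source=kind,
                            component=int(comp), subcomponent=int(sub))


print(parse_component("NP(Q,1,0)"))
print(parse_component("VBD!(wrote)"))
```

Keeping literal components separate from reference components mirrors the distinction in the text: the former only gate the match, while the latter supply material for the generated question/answer pair.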

Question/Answer Pair Generation

If the triggered pattern is characterized as weak, generation is delegated to the strong or inflected patterns that share both the query structure and the category. The information from the strong or inflected patterns is used during query reconstruction to fill in missing elements that are not supported by the weak pattern. The final step is to remove all question/answer items that are based on anaphoric references, as these references are not processed by the current system.

Question/answer pairs that meet all the filters described above are sent to distractor generation.

Distractor Generation

Interfaces

Pattern Generation Interface

This information is then sent by the servlet to the web service responsible for pattern bootstrap, and the user is notified that the request is being processed (Figure 4.3). When the processing of the given seed is finished, the web interface presents the results to the user (Figure 4.4). The user can then return to the start of the application and enter another seed for processing.

Multiple-choice Test Generation Interface

When processing is complete, the web interface presents a transparent interface for each generated item that presents the question, answer, distractors, and sentence that gave rise to the test item. When the entire review is completed, the web interface presents the entire exam with multiple possible answers and the parameters used to create it (Figure 4.8).

Web-service Interface

Evaluation

Evaluation Parameters

Some of the presented systems communicate with the user to improve the quality of query generation and to give the user some control over the process (e.g. web applications).
Extend the interface to enable the use of the patterns discovered by the bootstrap component in the generation component;
Apply an anaphoric reference resolution component to improve the quality of the generated test items;

In Human Language Technologies 2007: Conference of the North American Chapter of the Association for Computational Linguistics.

The following rules were used for the second evaluation of the rule-based system presented in this thesis.

WordNet search for distractors of “dictionary”

WordNet - “Alphabet” search results

Rule based architecture overview

Rule example

Pattern Generation - System overview

Pattern Generation - Seed request

Pattern Generation - Processing

Pattern Generation - Results

Multiple-choice test Generation - Request

Multiple-choice test Generation - Processing

Multiple-choice test Generation - Item review

Multiple-choice test Generation - Results