A general data model for questionnaires

Question Time

A simple example of a second-order data model is generating schemas for individual questionnaires: most questionnaires differ, but in readily visible ways.

Let's say we have the following text, with no context (perhaps from an OCR program):

### Section 1
- Question 1) What is your name?
- Question 2) Where do you live?

### Section 2A
- Question 1) What is your middle name?

### Section 2B
- Question 1) What is your favourite colour?

We need to ingest the entire questionnaire to determine the data model (e.g. to determine that the data type of the section names is not integer). Once we have done that, we can constrain generation to this data model to obtain a structured representation in a format such as JSON.
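For concreteness, the structured representation for the questionnaire above might look like the following. The field names here are illustrative assumptions, not output from any model:

```python
import json

# One record per question, with the fields we might want per row:
# the section it came from, its number, and the question text.
records = [
    {"section": "Section 1", "question_number": 1, "question_text": "What is your name?"},
    {"section": "Section 1", "question_number": 2, "question_text": "Where do you live?"},
    {"section": "Section 2A", "question_number": 1, "question_text": "What is your middle name?"},
    {"section": "Section 2B", "question_number": 1, "question_text": "What is your favourite colour?"},
]
print(json.dumps(records, indent=2))
```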

We may also want to validate the result: "Does this capture all the info?" or "Are all data types correct?"

Demo Program

Here is a demo program that constrains generation to a Pydantic model, QuestionnaireSchema, composed of QuestionnaireField submodels whose dtype belongs to the DataType string Enum (either str or int).

Notice that the prompt more or less gives the answer away here; we are being fairly rigid. (This is cheating, really: the goal is to be more flexible than this, not to embed the desired result in the prompt.)

from enum import Enum

import outlines
from pydantic import BaseModel


class DataType(str, Enum):
    STR = "str"
    INT = "int"


class QuestionnaireField(BaseModel):
    name: str
    dtype: DataType
    example: str | int


class QuestionnaireSchema(BaseModel):
    fields: list[QuestionnaireField]


model = outlines.models.transformers("Qwen/Qwen2.5-3B-Instruct", device="cuda")

generator = outlines.generate.json(model, QuestionnaireSchema)
seed = 789001

prompt = """
Identify the fields needed to record each question in a questionnaire.

For example, given a questionnaire, what fields would you need in a spreadsheet to record:
- Which section the question is from
- The question's number
- The question text itself

Output the schema of fields, for each one giving field name, the field's data type
(`str` or `int`), and an example value (which must come from the text):
""".strip()

text = """
### Part 1
- Question 1) What is your name?
- Question 2) Where do you live?

### Part 2A
- Question 1) What is your middle name?

### Part 2B
- Question 1) What is your favourite colour?
"""

print("String-type section names:")
schema = generator(f"{prompt}{text}", seed=seed)
print(schema.model_dump_json(indent=2))
print()
text = """
### Section 1
- Question 1) What is your name?
- Question 2) Where do you live?

### Section 2
- Question 1) What is your middle name?

### Section 3
- Question 1) What is your favourite colour?
"""
print("Integer-type section names:")
schema = generator(f"{prompt}{text}", seed=seed)
print(schema.model_dump_json(indent=2))

This gives a somewhat underwhelming result! The goal of using two inputs was to try to distinguish an integer section name from a section number (i.e. the section's dtype should have been assigned flexibly based on the questionnaire input). Instead, the integer-typed section names made the schema deteriorate:

String-type section names:
{
  "fields": [
    {
      "name": "section_name",
      "dtype": "str",
      "example": "Q&A"
    },
    {
      "name": "question_number",
      "dtype": "int",
      "example": 1
    },
    {
      "name": "question_text",
      "dtype": "str",
      "example": "What is your name?"
    }
  ]
}

Integer-type section names:
{
  "fields": [
    {
      "name": "section_name",
      "dtype": "str",
      "example": "Section 1"
    },
    {
      "name": "question_number",
      "dtype": "int",
      "example": 1
    },
    {
      "name": "question_text",
      "dtype": "str",
      "example": "What is your name?"
    },
    {
      "name": "section_number",
      "dtype": "str",
      "example": "Section 1"
    },
    {
      "name": "question_index_in_section",
      "dtype": "int",
      "example": 1
    }
  ]
}
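The flexibility we were after could also be recovered by post-processing: infer a field's dtype from the observed values rather than trusting the generated schema. This `infer_dtype` helper is a hypothetical sketch, not part of the demo above:

```python
def infer_dtype(values: list[str]) -> str:
    """Return "int" if every observed value parses as an integer, else "str"."""
    try:
        for v in values:
            int(v)
    except ValueError:
        return "str"
    return "int"


print(infer_dtype(["1", "2", "3"]))    # int
print(infer_dtype(["2A", "2B", "1"]))  # str
```

Applied to the section names extracted from each questionnaire, this would assign `int` to "1, 2, 3" and `str` to "1, 2A, 2B", regardless of what the language model guessed.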