A trivial example of a second-order data model is the generation of schemas for individual questionnaires (most questionnaires will differ, but in readily visible ways).
Let's say we have the following text, with no context (perhaps from an OCR program):
```
### Section 1
- Question 1) What is your name?
- Question 2) Where do you live?
### Section 2A
- Question 1) What is your middle name?
### Section 2B
- Question 1) What is your favourite colour?
```
We need to ingest the entire questionnaire to determine the data model (e.g. to determine that the section names cannot be typed as integers, given sections 2A and 2B). Once we have done that, we can constrain generation to this data model to obtain a structured representation in a format such as JSON.
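For instance, under a schema with section, question number, and question text fields (illustrative names, not ones the model is guaranteed to choose), the first question above might be captured as:

```json
{
  "section_name": "Section 1",
  "question_number": 1,
  "question_text": "What is your name?"
}
```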
We may also want to validate the result: "Does this capture all the info?" or "Are all data types correct?" (A sketch of such a check follows the demo output below.)
## Demo Program
Here is a demo program, generating according to Pydantic models which describe the `QuestionnaireSchema`, composed of `QuestionnaireField` submodels with a `dtype` belonging to the `DataType` string Enum (either `str` or `int`).
Notice that the prompt somewhat gives the answer away here; we are being fairly rigid (this is cheating, really: the goal is to be more flexible than this, not to embed the desired result in the prompt):
```python
from enum import Enum

import outlines
from pydantic import BaseModel


class DataType(str, Enum):
    STR = "str"
    INT = "int"


class QuestionnaireField(BaseModel):
    name: str
    dtype: DataType
    example: str | int


class QuestionnaireSchema(BaseModel):
    fields: list[QuestionnaireField]


model = outlines.models.transformers("Qwen/Qwen2.5-3B-Instruct", device="cuda")
generator = outlines.generate.json(model, QuestionnaireSchema)

seed = 789001

prompt = """
Identify the fields needed to record each question in a questionnaire.
For example, given a questionnaire, what fields would you need in a spreadsheet to record:
- Which section the question is from
- The question's number
- The question text itself
Output the schema of fields, for each one giving field name, the field's data type
(`str` or `int`), and an example value (which must come from the text):
""".strip()

text = """
### Part 1
- Question 1) What is your name?
- Question 2) Where do you live?
### Part 2A
- Question 1) What is your middle name?
### Part 2B
- Question 1) What is your favourite colour?
"""

print("String-type section names:")
schema = generator(f"{prompt}:{text}", seed=seed)
print(schema.model_dump_json(indent=2))
print()

text = """
### Section 1
- Question 1) What is your name?
- Question 2) Where do you live?
### Section 2
- Question 1) What is your middle name?
### Section 3
- Question 1) What is your favourite colour?
"""

print("Integer-type section names:")
schema = generator(f"{prompt}:{text}", seed=seed)
print(schema.model_dump_json(indent=2))
```
This gives a somewhat underwhelming result! The goal of having two inputs was to try to distinguish an integer section name from a section number (i.e. the dtype of the section should have been assigned flexibly based on the questionnaire input), but instead the integer-typed section names made the schema deteriorate:
```
String-type section names:
{
  "fields": [
    {
      "name": "section_name",
      "dtype": "str",
      "example": "Q&A"
    },
    {
      "name": "question_number",
      "dtype": "int",
      "example": 1
    },
    {
      "name": "question_text",
      "dtype": "str",
      "example": "What is your name?"
    }
  ]
}

Integer-type section names:
{
  "fields": [
    {
      "name": "section_name",
      "dtype": "str",
      "example": "Section 1"
    },
    {
      "name": "question_number",
      "dtype": "int",
      "example": 1
    },
    {
      "name": "question_text",
      "dtype": "str",
      "example": "What is your name?"
    },
    {
      "name": "section_number",
      "dtype": "str",
      "example": "Section 1"
    },
    {
      "name": "question_index_in_section",
      "dtype": "int",
      "example": 1
    }
  ]
}
```
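Returning to the validation questions from earlier, here is a minimal sketch of such a check, reusing the `QuestionnaireSchema`/`DataType` models and the `schema` and `text` variables from the script above (`check_schema` is an illustrative helper, not part of Outlines). It verifies that each field's example value actually appears in the source text and is consistent with its declared dtype:

```python
def check_schema(schema: QuestionnaireSchema, source_text: str) -> list[str]:
    """Return a list of problems found in a generated schema."""
    problems = []
    for field in schema.fields:
        # "Does this capture all the info?" -- at minimum, the example value
        # must have been lifted from the source text rather than invented.
        if str(field.example) not in source_text:
            problems.append(f"{field.name}: example {field.example!r} not found in text")
        # "Are all data types correct?" -- the example should match its declared dtype.
        expected_type = int if field.dtype is DataType.INT else str
        if not isinstance(field.example, expected_type):
            problems.append(
                f"{field.name}: dtype is {field.dtype.value} "
                f"but example is {type(field.example).__name__}"
            )
    return problems


# Run after each generator(...) call, e.g. on the last schema generated above:
print(check_schema(schema, text))
```

Run right after the first call instead, a check like this would flag the `"Q&A"` example, which does not appear anywhere in the input text.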