Question Answer Generator
Project: Question Answer Generator and Judgement
Implement a synthetic data pipeline in which an open-source model (Llama 3 served locally via Ollama) generates instruction-response pairs and the same model, acting as a judge, grades each pair and filters out low-quality samples.
In [1]:
"""
Synthetic Data Pipeline — LangGraph + Llama 3 (Ollama)
Generator and Judge both use the same local model.
Structured output via Pydantic eliminates ALL manual JSON parsing.
Requirements:
pip install langgraph langchain-ollama langchain-core pydantic
Ollama must be running with Llama 3 pulled:
    ollama pull llama3.1:8b
ollama serve
"""
"""
Synthetic Data Pipeline — LangGraph + Llama 3 (Ollama)
Generator and Judge both use the same local model.
Structured output via Pydantic eliminates ALL manual JSON parsing.
Requirements:
pip install langgraph langchain-ollama langchain-core pydantic
Ollama must be running with Llama 3 pulled:
ollama pull llama3
ollama serve
"""
Out[1]:
'\nSynthetic Data Pipeline — LangGraph + Llama 3 (Ollama)\nGenerator and Judge both use the same local model.\n\nStructured output via Pydantic eliminates ALL manual JSON parsing.\n\nRequirements:\n pip install langgraph langchain-ollama langchain-core pydantic\n\nOllama must be running with Llama 3 pulled:\n ollama pull llama3.1:8b\n ollama serve\n'
In [38]:
import json
import time
from typing import TypedDict
from pathlib import Path
from pydantic import BaseModel, Field
from langgraph.graph import StateGraph, END
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage
In [ ]:
OLLAMA_BASE_URL = "http://localhost:11434"
MODEL = "llama3.1:8b" # used for both generator and judge
SCORE_THRESHOLD = 5 # min score (1–5) to keep a sample
SAMPLES_PER_SEED = 3 # variations to generate per seed
MAX_RETRIES = 2 # max re-generation rounds for failures
OUTPUT_FILE = "clean_dataset.jsonl"
SEED_EXAMPLES = [
{
"instruction": "Explain what a binary search tree is.",
"response": (
"A binary search tree (BST) is a tree where each node has at most two children. "
"For every node, left subtree values are smaller and right subtree values are larger. "
"This enables O(log n) average-case search, insert, and delete."
),
},
{
"instruction": "What is the difference between a list and a tuple in Python?",
"response": (
"Lists are mutable sequences (square brackets []) — you can change elements after creation. "
"Tuples are immutable (parentheses ()) and cannot be modified. "
"Use lists for changing collections; tuples for fixed data like coordinates."
),
},
{
"instruction": "How does garbage collection work in Python?",
"response": (
"Python uses reference counting: each object tracks how many references point to it. "
"When the count hits zero the memory is freed immediately. "
"A cyclic GC handles reference cycles — groups of objects that reference each other "
"but are no longer reachable from the program."
),
},
]
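Before invoking the pipeline, it can be worth confirming that Ollama is reachable and that the configured model has actually been pulled. The sketch below is an optional preflight check: it assumes Ollama's standard GET /api/tags endpoint (which lists locally available models), and the helper name check_ollama is ours, not part of the pipeline.

import json
from urllib.request import urlopen
from urllib.error import URLError

def check_ollama(base_url: str = OLLAMA_BASE_URL, model: str = MODEL) -> bool:
    """Return True if Ollama responds and `model` appears in its local model list."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            tags = json.load(resp)
    except URLError as e:
        print(f"Ollama not reachable at {base_url}: {e}")
        return False
    names = [m.get("name", "") for m in tags.get("models", [])]
    if model not in names:
        print(f"Model {model!r} not found; run: ollama pull {model}")
        return False
    return True

print("Ollama ready:", check_ollama())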
In [40]:
class Sample(BaseModel):
"""A single instruction-response training pair."""
instruction: str = Field(description="The user question or task")
response: str = Field(description="The ideal model response")
class GeneratedBatch(BaseModel):
"""Batch of generated training samples."""
samples: list[Sample] = Field(description="List of generated instruction-response pairs")
class JudgeVerdict(BaseModel):
"""Quality verdict from the judge model."""
score: int = Field(ge=1, le=5, description="Quality score from 1 (worst) to 5 (best)")
reason: str = Field(description="One-sentence explanation of the score")
class PipelineState(TypedDict):
seeds: list[dict] # hand-crafted seed examples
raw_samples: list[dict] # generated, unscored samples
scored_samples: list[dict] # samples with score + reason attached
clean_samples: list[dict] # samples that passed the threshold
retry_queue: list[dict] # failed samples queued for retry
retry_count: int # number of retry rounds so far
stats: dict # final run statistics
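Because JudgeVerdict.score carries ge=1 and le=5 constraints, Pydantic rejects out-of-range scores at construction time, which is what lets the pipeline trust the judge's numbers without extra range checks. A quick offline illustration (no LLM involved):

from pydantic import ValidationError

ok = JudgeVerdict(score=4, reason="Mostly correct with only minor issues.")
print(ok.score, ok.reason)

try:
    JudgeVerdict(score=6, reason="Out of range.")
except ValidationError as e:
    print("Rejected:", e.errors()[0]["type"])  # 'less_than_equal' in Pydantic v2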
In [41]:
def base_llm(temperature: float) -> ChatOllama:
return ChatOllama(
model=MODEL,
base_url=OLLAMA_BASE_URL,
temperature=temperature,
num_predict=1024,
)
# Structured-output LLMs — each bound to its Pydantic schema.
# .invoke() returns the schema object directly; no parsing needed.
generator_llm = base_llm(0.85).with_structured_output(GeneratedBatch)
judge_llm = base_llm(0.1).with_structured_output(JudgeVerdict)
retry_llm = base_llm(0.85).with_structured_output(Sample)
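With with_structured_output, .invoke() returns the bound Pydantic object directly instead of a raw message, so there is no JSON to parse by hand. A minimal smoke test, assuming Ollama is running with the model pulled (the exact verdict will vary between runs):

verdict = judge_llm.invoke([
    SystemMessage(content="Score this pair from 1 to 5 and give a one-sentence reason."),
    HumanMessage(content="INSTRUCTION: What is 2 + 2?\n\nRESPONSE: 4."),
])
print(type(verdict).__name__)          # JudgeVerdict: already parsed, no JSON handling
print(verdict.score, verdict.reason)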
In [50]:
def generate_samples(state: PipelineState) -> PipelineState:
print("\n[NODE] generate_samples")
seeds_text = json.dumps(state["seeds"], indent=2)
messages = [
SystemMessage(content=(
"You are a synthetic training-data generator. "
"Given seed examples, produce new instruction-response pairs "
"in the same educational Q&A style. Each must be original and factually accurate."
)),
HumanMessage(content=(
f"Seed examples:\n{seeds_text}\n\n"
f"Generate {SAMPLES_PER_SEED} new instruction-response pairs "
"inspired by these seeds. Vary the topic and phrasing."
)),
]
try:
result: GeneratedBatch = generator_llm.invoke(messages)
"""
For each Sample object `s`:
1. s.model_dump() converts the Pydantic object into a dictionary like {"instruction": "...", "response": "..."}.
2. The `|` operator (Python 3.9+) merges this dictionary with {"source": "generated"}, adding a "source" field to indicate
that the sample was generated (as opposed to, e.g., retried).
"""
raw = [s.model_dump() | {"source" : "generated"} for s in result.samples]
print(f"Generated {len(raw)} raw samples")
except Exception as e:
print(f"GENERATION ERROR: {e}")
raw = []
return {**state, "raw_samples": state.get("raw_samples", []) + raw}
def judge_samples(state: PipelineState) -> PipelineState:
print("\n[NODE] judge_samples")
system_message = SystemMessage(content=(
"You are a strict quality judge for AI fine-tuning data.\n"
"Score each instruction-response pair from 1 to 5 AND provide a one-sentence reason.\n\n"
"Scoring rubric:\n"
" 5 = perfect: accurate, clear, well-structured, genuinely useful\n"
" 4 = good: mostly correct with only minor issues\n"
" 3 = mediocre: noticeable inaccuracies or poor structure\n"
" 2 = poor: mostly wrong, incomplete, or confusing\n"
" 1 = reject: harmful, factually wrong, or malformed\n\n"
"Your reason must explain WHY you gave that score — point to a specific strength or flaw.\n"
"Example reason for 5: 'Covers the concept accurately with a clear real-world analogy.'\n"
"Example reason for 2: 'Response confuses lists with tuples and gives incorrect syntax.'"
))
scored = []
for i, sample in enumerate(state.get("raw_samples", [])):
messages = [
system_message,
HumanMessage(content = (
f"INSTRUCTION: {sample['instruction']}\n\n"
f"RESPONSE: {sample['response']}\n\n"
"Rate this sample and provide the reason for the score."
))
]
try:
verdict: JudgeVerdict = judge_llm.invoke(messages)
score, reason = verdict.score, verdict.reason
except Exception as e:
            score, reason = 1, f"JUDGE ERROR: {e}"
scored.append({**sample, "score": score, "reason": reason})
mark = "✓" if score >= SCORE_THRESHOLD else "✗"
print(f" [{mark}] Sample {i+1}: score={score} — {reason[:70]}")
time.sleep(0.1)
return {**state, "scored_samples": scored}
def filter_samples(state: PipelineState) -> PipelineState:
print("\n[NODE] filter_samples")
scored = state.get("scored_samples", [])
passed = [s for s in scored if s["score"] >= SCORE_THRESHOLD]
failed = [s for s in scored if s["score"] < SCORE_THRESHOLD]
print(f"PASSED : {len(passed)}")
print(f"FAILED : {len(failed)}")
return {
**state,
"clean_samples" : state.get("clean_samples", []) + passed,
"retry_queue" : failed,
"raw_samples" : [],
"scored_samples" : [],
}
def retry_generation(state: PipelineState) -> PipelineState:
print("\n[NODE] retry_generation")
system = SystemMessage(content=(
"You are a precise and accurate AI assistant tasked with improving low-quality responses.\n"
"You will be given an instruction, a previous poor response, its quality score (1-5), "
"and the reason it failed.\n"
"Your job is to write a significantly better response that directly addresses the failure reason.\n"
"Be accurate, clear, and well-structured."
))
new_samples = []
for item in state.get("retry_queue", []):
messages = [
system,
HumanMessage(content=(
f"INSTRUCTION: {item['instruction']}\n\n"
f"PREVIOUS RESPONSE: {item['response']}\n\n"
f"QUALITY SCORE: {item['score']}/5\n\n"
f"REASON IT FAILED: {item['reason']}\n\n"
"Now write an improved response that fixes the issues mentioned above."
)),
]
try:
result: Sample = retry_llm.invoke(messages)
new_samples.append(result.model_dump() | {
"source": "retried",
"previous_score": item["score"],
"previous_reason": item["reason"],
})
print(f" ↺ Retried: {item['instruction'][:65]}")
print(f" Previous score: {item['score']}/5 — {item['reason'][:60]}")
except Exception as e:
print(f" ✗ Retry error: {e}")
return {
**state,
"raw_samples": new_samples,
"retry_count": state.get("retry_count", 0) + 1,
"retry_queue": [],
}
def save_dataset(state: PipelineState) -> PipelineState:
print("\n[NODE] save_dataset")
clean = state.get("clean_samples", [])
out_path = Path(OUTPUT_FILE)
with out_path.open("w") as f:
for s in clean:
f.write(json.dumps({
"instruction": s["instruction"],
"response": s["response"],
"score": s["score"],
"reason": s["reason"],
}) + "\n")
stats = {
"clean_samples" : len(clean),
"retry_rounds" : state.get("retry_count", 0),
"output_file" : str(out_path.resolve())
}
print(f"\n{'='*52}")
print(f" Pipeline complete! {len(clean)} clean samples saved.")
print(f" → {out_path.resolve()}")
print(f"{'='*52}\n")
return {**state, "stats": stats}
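The filter node is pure Python, so it can be sanity-checked without any model calls. A minimal check with hand-made scored samples (the field values are illustrative only); with SCORE_THRESHOLD = 5, the score-5 sample passes and the score-3 sample lands in the retry queue:

demo_state: PipelineState = {
    "seeds": [], "raw_samples": [], "clean_samples": [],
    "retry_queue": [], "retry_count": 0, "stats": {},
    "scored_samples": [
        {"instruction": "q1", "response": "a1", "score": 5, "reason": "accurate and clear"},
        {"instruction": "q2", "response": "a2", "score": 3, "reason": "noticeable inaccuracies"},
    ],
}
out = filter_samples(demo_state)
assert len(out["clean_samples"]) == 1 and len(out["retry_queue"]) == 1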
In [51]:
def should_retry(state: PipelineState) -> str:
has_failures = bool(state.get("retry_queue"))
under_limit = state.get("retry_count", 0) < MAX_RETRIES
if has_failures and under_limit:
n = len(state["retry_queue"])
r = state["retry_count"] + 1
print(f"\n[ROUTER] {n} failures → retry round {r}/{MAX_RETRIES}")
return "retry"
print("\n[ROUTER] Done retrying → saving")
return "save"
def build_pipeline():
g = StateGraph(PipelineState)
g.add_node("generate", generate_samples)
g.add_node("judge", judge_samples)
g.add_node("filter", filter_samples)
g.add_node("retry", retry_generation)
g.add_node("save", save_dataset)
g.set_entry_point("generate")
g.add_edge("generate", "judge")
g.add_edge("judge", "filter")
g.add_conditional_edges("filter", should_retry, {"retry": "retry", "save": "save"})
g.add_edge("retry", "judge") # retried samples go back through judge
g.add_edge("save", END)
return g.compile()
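The router is also plain Python and cheap to exercise with mock states, which is a quick way to confirm the retry cap behaves before spending any LLM calls. The helper _mock below is hypothetical scaffolding for this check only:

def _mock(queue_len: int, retries: int) -> PipelineState:
    return {"seeds": [], "raw_samples": [], "scored_samples": [],
            "clean_samples": [], "retry_queue": [{}] * queue_len,
            "retry_count": retries, "stats": {}}

print(should_retry(_mock(2, 0)))            # "retry": failures remain, under the cap
print(should_retry(_mock(2, MAX_RETRIES)))  # "save": retry cap reached
print(should_retry(_mock(0, 0)))            # "save": nothing failed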
In [52]:
def main():
print("=" * 52)
print(" Synthetic Data Pipeline (LangGraph + Llama 3)")
print("=" * 52)
print(f" Model : {MODEL} @ {OLLAMA_BASE_URL}")
print(f" Seeds : {len(SEED_EXAMPLES)}")
print(f" Per seed : {SAMPLES_PER_SEED} samples")
print(f" Threshold : score >= {SCORE_THRESHOLD}/5")
print(f" Max retries : {MAX_RETRIES}")
print("=" * 52)
pipeline = build_pipeline()
final = pipeline.invoke({
"seeds": SEED_EXAMPLES,
"raw_samples": [],
"scored_samples": [],
"clean_samples": [],
"retry_queue": [],
"retry_count": 0,
"stats": {},
})
print("Final stats:")
for k, v in final.get("stats", {}).items():
print(f" {k}: {v}")
print("\nSample outputs:")
for s in final.get("clean_samples", [])[:2]:
print(f"\n Q : {s['instruction']}")
print(f" A : {s['response'][:120]}...")
print(f" * : {s['score']}/5 — {s['reason']}")
In [53]:
main()
====================================================
Synthetic Data Pipeline (LangGraph + Llama 3)
====================================================
Model : llama3.1:8b @ http://localhost:11434
Seeds : 3
Per seed : 3 samples
Threshold : score >= 5/5
Max retries : 2
====================================================
[NODE] generate_samples
Generated 3 raw samples
[NODE] judge_samples
[✗] Sample 1: score=4 — Accurately describes Python's reference counting mechanism, but doesn'
[✗] Sample 2: score=4 — Accurately describes the main differences between HashMap and HashSet,
[✗] Sample 3: score=4 — Accurately describes the purpose of try-except blocks in Python, but l
[NODE] filter_samples
PASSED : 0
FAILED : 3
[ROUTER] 3 failures → retry round 1/2
[NODE] retry_generation
↺ Retried: Describe how Python's memory management differs from languages li
Previous score: 4/5 — Accurately describes Python's reference counting mechanism,
↺ Retried: What are some key differences between a HashMap and a HashSet in
Previous score: 4/5 — Accurately describes the main differences between HashMap an
↺ Retried: Explain how error handling in try-except blocks works in Python.
Previous score: 4/5 — Accurately describes the purpose of try-except blocks in Pyt
[NODE] judge_samples
[✓] Sample 1: score=5 — Covers the concept accurately with a clear explanation of both Python'
[✓] Sample 2: score=5 — Covers the concept accurately with a clear real-world analogy, provide
[✗] Sample 3: score=4 — Accurately explains the basic concept of try-except blocks in Python,
[NODE] filter_samples
PASSED : 2
FAILED : 1
[ROUTER] 1 failures → retry round 2/2
[NODE] retry_generation
↺ Retried: Explain how error handling in try-except blocks works in Python.
Previous score: 4/5 — Accurately explains the basic concept of try-except blocks i
[NODE] judge_samples
[✗] Sample 1: score=4 — The response is mostly accurate and clear, but it could benefit from a
[NODE] filter_samples
PASSED : 0
FAILED : 1
[ROUTER] Done retrying → saving
[NODE] save_dataset
====================================================
Pipeline complete! 2 clean samples saved.
→ /home/nirajanbekoju/Documents/Personal/Agentic AI course/src/question_answer_generator/clean_dataset.jsonl
====================================================
Final stats:
clean_samples: 2
retry_rounds: 2
output_file: /home/nirajanbekoju/Documents/Personal/Agentic AI course/src/question_answer_generator/clean_dataset.jsonl
Sample outputs:
Q : Describe how Python's memory management differs from languages like Java
A : Python utilizes a reference counting system for memory management. In this approach, each object keeps track of its own ...
* : 5/5 — Covers the concept accurately with a clear explanation of both Python's reference counting system and its differences from Java's garbage collection-based approach.
Q : What are some key differences between a HashMap and a HashSet in programming?
A : A `HashMap` (also known as a `dictionary` or `associative array`) is a data structure that stores mappings of unique key...
* : 5/5 — Covers the concept accurately with a clear real-world analogy, provides specific examples of use cases, and effectively highlights the key differences between HashMaps and HashSets.