Dataset Schema Reference

All Stravoris medical MCQ datasets use the same schema. Every subject, file, and row has the same columns, the same types, and the same semantics.

Columns

Column                Type          Nullable  Description
id                    string        No        Human-readable unique identifier (e.g., ANA-UG-uplimnera-0042)
subject               string        No        Subject name
source                string        No        "medmcqa" or "synthetic_YYYY-MM-DD"
syllabus_topic        string        No        Normalized curriculum topic
question              string        No        Question stem and lead-in
options               list[string]  No        Exactly 4 answer options
correct_index         integer       No        Index of correct option (0-3)
explanation           string        No        Reference explanation (pedagogical)
training_explanation  string        No        Training-optimized explanation (for SFT)
blooms_level          string        No        recall / comprehension / application / analysis
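The constraints in the table can be checked mechanically. The following is a minimal sketch, not part of the dataset tooling: `SCHEMA`, `validate_record`, and the sample record are hypothetical names and values introduced for illustration, and the regular expression assumes that `YYYY-MM-DD` in `source` stands for a literal date.

```python
import re

# Column name -> expected Python type, taken from the table above.
SCHEMA = {
    "id": str,
    "subject": str,
    "source": str,
    "syllabus_topic": str,
    "question": str,
    "options": list,
    "correct_index": int,
    "explanation": str,
    "training_explanation": str,
    "blooms_level": str,
}
BLOOMS_LEVELS = {"recall", "comprehension", "application", "analysis"}
# Assumption: synthetic sources carry a literal date, e.g. synthetic_2024-01-31.
SOURCE_PATTERN = re.compile(r"medmcqa|synthetic_\d{4}-\d{2}-\d{2}")


def validate_record(record):
    """Return a list of schema violations; an empty list means the record passes."""
    errors = []
    for column, expected in SCHEMA.items():
        value = record.get(column)
        if value is None:
            errors.append(f"{column}: missing or null")
        elif not isinstance(value, expected) or isinstance(value, bool):
            errors.append(f"{column}: expected {expected.__name__}")
    if errors:
        return errors  # value checks below need well-typed fields
    options = record["options"]
    if len(options) != 4 or not all(isinstance(o, str) for o in options):
        errors.append("options: expected exactly 4 strings")
    if not 0 <= record["correct_index"] <= 3:
        errors.append("correct_index: expected 0-3")
    if record["blooms_level"] not in BLOOMS_LEVELS:
        errors.append("blooms_level: unknown level")
    if not SOURCE_PATTERN.fullmatch(record["source"]):
        errors.append("source: unexpected format")
    return errors


# Made-up record for illustration only; it is not taken from the dataset.
sample = {
    "id": "ANA-UG-uplimnera-0042",
    "subject": "anatomy",
    "source": "medmcqa",
    "syllabus_topic": "Upper limb",
    "question": "Which nerve supplies the deltoid muscle?",
    "options": ["Radial nerve", "Axillary nerve", "Ulnar nerve", "Median nerve"],
    "correct_index": 1,
    "explanation": "The axillary nerve innervates the deltoid.",
    "training_explanation": "The deltoid is supplied by the axillary nerve.",
    "blooms_level": "recall",
}
print(validate_record(sample))
```

The sketch is written for plain dictionaries such as those produced by `json.loads`. Rows loaded through pandas may expose `options` as an array and `correct_index` as a NumPy integer, so they would likely need converting first.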

Loading Examples

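pandas

import pandas as pd

# Load one subject's Parquet file into a DataFrame.
df = pd.read_parquet("anatomy/anatomy.parquet")

Hugging Face datasets

from datasets import load_dataset

# Load the dataset from the Hugging Face Hub.
ds = load_dataset("stravoris/medical-mcq-dataset")

JSONL

import json

# Each line is one JSON object; this reads only the first record.
with open("anatomy/anatomy.jsonl", encoding="utf-8") as f:
    for line in f:
        q = json.loads(line)
        break

All subjects at once

import pandas as pd

# Each subject is stored as <subject>/<subject>.parquet.
subjects = [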
    "biochemistry",
    "anatomy",
    "cell-biology",
    "physiology",
    "pathology",
    "microbiology",
    "pharmacology",
    "ophthalmology",
    "obstetrics-gynaecology",
    "ent",
    "radiology",
    "orthopaedics",
    "anaesthesia",
    "psychiatry",
    "genetics",
    "dermatology",
    "paediatrics",
    "venereology",
    "medicine",
    "surgery",
]
full = pd.concat([pd.read_parquet(f"{s}/{s}.parquet") for s in subjects], ignore_index=True)
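
Once a record is loaded by any of the methods above, `correct_index` is used to look up the answer in `options`. The sketch below shows one way to do that; `format_question`, `correct_answer`, the A-D lettering, and the sample record are illustrative assumptions and not part of the schema.

```python
LETTERS = "ABCD"  # assumption: options are displayed as A-D; the schema stores only indices


def format_question(record):
    """Render the question stem followed by its four lettered options."""
    lines = [record["question"]]
    for letter, option in zip(LETTERS, record["options"]):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def correct_answer(record):
    """Return (letter, text) for the option that correct_index points to."""
    index = record["correct_index"]
    return LETTERS[index], record["options"][index]


# Made-up record for illustration only; real rows carry all ten columns.
sample = {
    "question": "Which nerve supplies the deltoid muscle?",
    "options": ["Radial nerve", "Axillary nerve", "Ulnar nerve", "Median nerve"],
    "correct_index": 1,
}
print(format_question(sample))
print(correct_answer(sample))
```

Because `correct_index` is zero-based, an index of 1 selects the second option.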

Key Properties

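  • Exactly 4 options per question (options), with correct_index in the range 0-3
  • Curated against 24 published item-writing flaws
  • Tagged with a Bloom's taxonomy level (blooms_level)
  • Aligned to a normalized syllabus topic (syllabus_topic)
  • Provenance tracked in source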
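
Provenance can be read back from the `source` column, which holds either "medmcqa" or "synthetic_YYYY-MM-DD". The following is a rough sketch under two assumptions: that the suffix is a literal ISO date, and that it can be treated as the date the item was generated. `parse_source` is a hypothetical helper, not a function shipped with the dataset.

```python
from datetime import date


def parse_source(source):
    """Split a source value into (origin, date); date is None for medmcqa rows."""
    if source == "medmcqa":
        return "medmcqa", None
    prefix, sep, stamp = source.partition("_")
    if prefix == "synthetic" and sep:
        # date.fromisoformat raises ValueError if the suffix is not a valid date.
        return "synthetic", date.fromisoformat(stamp)
    raise ValueError(f"unrecognized source: {source!r}")


print(parse_source("medmcqa"))
print(parse_source("synthetic_2024-01-31"))  # the date here is an invented example
```

Raising on unrecognized values, instead of returning a default, keeps rows with unexpected provenance from passing silently.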