Dataset Schema Reference
All Stravoris medical MCQ datasets share one schema. Every subject, row, and file has the same columns, the same types, and the same semantics.
Columns
| Column | Type | Nullable | Description |
|---|---|---|---|
| id | string | No | Human-readable unique identifier (e.g., ANA-UG-uplimnera-0042) |
| subject | string | No | Subject name |
| source | string | No | "medmcqa" or "synthetic_YYYY-MM-DD" |
| syllabus_topic | string | No | Normalized curriculum topic |
| question | string | No | Question stem and lead-in |
| options | list[string] | No | Exactly 4 answer options |
| correct_index | integer | No | Index of correct option (0-3) |
| explanation | string | No | Reference explanation (pedagogical) |
| training_explanation | string | No | Training-optimized explanation (for SFT) |
| blooms_level | string | No | recall / comprehension / application / analysis |
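
The column rules above can be checked mechanically. The following is a minimal sketch, not part of the dataset tooling: `validate_row` is a hypothetical helper, and it assumes rows are plain Python dicts as produced by `json.loads` on a JSONL line (rows loaded through pandas carry NumPy types and would need converting first). The sample row uses placeholder text, not real dataset content.

```python
from typing import Any

# Column names and allowed values copied from the schema table.
STRING_COLUMNS = (
    "id", "subject", "source", "syllabus_topic", "question",
    "explanation", "training_explanation", "blooms_level",
)
BLOOMS_LEVELS = {"recall", "comprehension", "application", "analysis"}


def validate_row(row: dict[str, Any]) -> list[str]:
    """Return schema violations for one row (an empty list means valid)."""
    errors = []
    # Every string column is non-nullable, so a missing key or None fails.
    for col in STRING_COLUMNS:
        if not isinstance(row.get(col), str):
            errors.append(f"{col}: expected a non-null string")
    options = row.get("options")
    if not (isinstance(options, list) and len(options) == 4
            and all(isinstance(o, str) for o in options)):
        errors.append("options: expected a list of exactly 4 strings")
    idx = row.get("correct_index")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(idx, int) or isinstance(idx, bool) or not 0 <= idx <= 3:
        errors.append("correct_index: expected an integer from 0 to 3")
    level = row.get("blooms_level")
    if isinstance(level, str) and level not in BLOOMS_LEVELS:
        errors.append("blooms_level: unexpected value")
    return errors


# Hypothetical row with placeholder text, for illustration only.
sample = {
    "id": "ANA-UG-uplimnera-0042",
    "subject": "anatomy",
    "source": "medmcqa",
    "syllabus_topic": "placeholder topic",
    "question": "placeholder question",
    "options": ["option 1", "option 2", "option 3", "option 4"],
    "correct_index": 2,
    "explanation": "placeholder explanation",
    "training_explanation": "placeholder training explanation",
    "blooms_level": "recall",
}
print(validate_row(sample))  # → []
```

Returning a list of messages rather than raising on the first problem makes it easier to report every violation in a row at once.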
Loading Examples
pandas

```python
import pandas as pd

df = pd.read_parquet("anatomy/anatomy.parquet")
```

HF datasets

```python
from datasets import load_dataset

ds = load_dataset("stravoris/medical-mcq-dataset")
```

JSONL

```python
import json

# Read only the first question as a quick check of the file.
with open("anatomy/anatomy.jsonl", encoding="utf-8") as f:
    for line in f:
        q = json.loads(line)
        break
```

All subjects at once

```python
import pandas as pd

subjects = [
    "biochemistry",
    "anatomy",
    "cell-biology",
    "physiology",
    "pathology",
    "microbiology",
    "pharmacology",
    "ophthalmology",
    "obstetrics-gynaecology",
    "ent",
    "radiology",
    "orthopaedics",
    "anaesthesia",
    "psychiatry",
    "genetics",
    "dermatology",
    "paediatrics",
    "venereology",
    "medicine",
    "surgery",
]

# Each subject is read from <subject>/<subject>.parquet and stacked into one frame.
full = pd.concat(
    [pd.read_parquet(f"{s}/{s}.parquet") for s in subjects],
    ignore_index=True,
)
```

Key Properties
- 4 options per question, always
- Curated against 24 published item-writing flaws
- Bloom's taxonomy tagged (`blooms_level`)
- Syllabus-aligned (`syllabus_topic`)
- Provenance tracked (`source`)
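
Because provenance is encoded in the `source` string, it can be split into an origin and a date for filtering or grouping. This is a minimal sketch under the two formats documented in the schema ("medmcqa" and "synthetic_YYYY-MM-DD"); `parse_source` is a hypothetical helper, and the date in the usage line is a made-up value, not one taken from the dataset.

```python
from datetime import date
from typing import Optional


def parse_source(source: str) -> tuple[str, Optional[date]]:
    """Split a `source` value into (origin, embedded date or None)."""
    if source == "medmcqa":
        return ("medmcqa", None)
    prefix = "synthetic_"
    if source.startswith(prefix):
        # The remainder is expected to be an ISO date (YYYY-MM-DD);
        # date.fromisoformat raises ValueError if it is not.
        return ("synthetic", date.fromisoformat(source[len(prefix):]))
    raise ValueError(f"unrecognized source value: {source!r}")


# Made-up example date, for illustration only.
print(parse_source("synthetic_2025-01-15"))
```

Raising `ValueError` on anything outside the two documented formats surfaces unexpected values early instead of silently grouping them.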
