I was diagnosed autistic in my fifties. By then, I’d already built a company, burned out twice, and reverse-engineered my own social scripts the way an engineer debugs a legacy system with no documentation. My support needs are Level 1: “mild” on paper, yet relentless in real life. What hurt most wasn’t autism; it was not knowing I was autistic. Which is why The Noor Project caught my eye: a new study using speech to help flag autism earlier, with a specific focus on model fairness. The approach is technical, but the stakes are human.
At its core, the Noor project asks a practical question: can we hear autism in children’s voices well enough to prompt earlier evaluation? The team uses transformer models in two flavors.
First, discriminative fine-tuning (D-FT): pre-train on one dataset (DE-ENIGMA) and fine-tune on another (CPSD).
Second, Wav2Vec 2.0 fine-tuning (W2V2-FT): start from self-supervised speech representations trained on massive, general audio (LibriSpeech) and then fine-tune on the target clinical data, as in the sketch below.
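To make the second recipe concrete, here’s a minimal sketch of W2V2-FT using the Hugging Face transformers library. The checkpoint name, the three-second clip, and the label coding are illustrative stand-ins, not the study’s actual pipeline.

```python
# A minimal W2V2-FT sketch: load self-supervised Wav2Vec 2.0 weights
# (pre-trained on LibriSpeech) and fine-tune a small classification head
# on labeled child speech. Checkpoint, clip, and label coding are all
# illustrative placeholders, not the study's actual pipeline.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=2,  # the binary "typical vs. atypical" task
)
model.freeze_feature_encoder()  # keep the convolutional front end fixed

waveform = torch.randn(16000 * 3)  # stand-in for 3 s of 16 kHz child speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([1])  # placeholder coding: 1 = atypical, 0 = typical

outputs = model(input_values=inputs.input_values, labels=labels)
outputs.loss.backward()  # one fine-tuning step; a real run loops over batches
```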
On a binary “typical vs. atypical” task, D-FT achieved a test UAR (unweighted average recall) of 94.8%, outperforming the W2V2 model. On a tougher four-class diagnosis task (TD, ASD, dysphasia, and PDD-NOS), D-FT still led but with a more modest 60.9% UAR, a reminder that multi-class classification with scarce, messy labels is hard.
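For readers who haven’t met UAR: it’s recall averaged over classes with equal weight (macro recall), which keeps a majority-class guesser from looking good on skewed data. A toy illustration, with made-up labels:

```python
# UAR is recall averaged over classes with equal weight (macro recall),
# so a majority-class guesser can't hide behind raw accuracy. Toy labels:
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1]  # imbalanced: six "typical", two "atypical"
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]  # one atypical case missed

uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.2f}")  # (6/6 + 1/2) / 2 = 0.75, vs. raw accuracy 0.875
```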
Those numbers are promising; they’re also not the headline. Fairness is.
The authors explicitly probe whether models trained on skewed data perform differently by gender. They do.
Performance is better for boys than girls, mirroring the male-heavy training distribution and the field’s long-standing blind spot. In their datasets, male participation dominates. For example, CPSD includes 54 boys vs. 13 girls (~81% vs. 19%), and DE-ENIGMA also skews male (~78% male by participants; sample counts can be even more imbalanced). When they test fairness, the models favor male voices. That’s not a moral failing of machine learning; it’s a data-pipeline problem with real-world consequences.
For women like me, especially those of us with Level 1 support needs, the implications are immediate. If a tool like Noor is deployed without correcting the imbalance, it will simply recreate the historical under-diagnosis of girls and women in algorithmic form.
The literature has documented this pattern for years: camouflaging/masking (the social effort to “pass”) delays or prevents recognition; misdiagnoses with anxiety, depression, eating disorders, or personality disorders are more frequent in women; and late diagnosis is associated with greater mental-health burden. None of this is soft science anymore.[1]
Let me spell that out in lived-experience terms. Without an early, accurate label, we’re often told we’re “too sensitive,” “too intense,” or “unmotivated,” while simultaneously praised for competence that’s actually chronic overcompensation. We learn the rules by rote, not intuition. We people-please to survive. Then the bill comes due. Co-occurring psychiatric diagnoses are more common in adults diagnosed late, and many of us arrive at the clinic with a thicket of symptoms that could have been mitigated had someone joined the dots when we were eight, not twenty-eight.[2] [3]
The Noor Project’s promise, if we do it right
There’s a quiet elegance to using speech as a screening signal. It’s ubiquitous, relatively low-cost, and already known to carry ASD-related acoustic patterns (think atypical prosody). If models trained on brief vocal samples can flag “this kid needs a closer look,” we could shorten diagnostic waitlists and nudge families toward evaluation sooner, especially in communities without specialist access. The Noor study demonstrates that transfer learning can squeeze value from limited labeled child data, a pragmatic choice in a field where large, balanced clinical corpora are rare.
But let’s not gloss over the limits.
Limitations (and the elephant in the room)
Gender skew. Noor’s training data are overwhelmingly male, and the models perform worse on female voices. The authors both quantify this and call it out, a commendable step, but it still constrains external validity for girls and women.
Dataset quirks. DE-ENIGMA contains only autistic participants (no TD controls) and was recorded in home settings; CPSD includes multiple diagnostic categories and lab-based recordings. Mixing across such domain shifts can depress multi-class performance and bias models toward the dominant setting.
Small, heterogeneous samples. The four-class task struggles because the labels (ASD/DYS/NOS/TD) are imbalanced and the total minutes are slim. Resampling helps, sometimes; sometimes it harms (a toy example follows this list). Anyone who’s wrangled clinical audio knows the pain.
Unimodal. Speech is one window. Children’s social-communication differences also surface in facial affect, gesture, and motor patterns; a speech-only model may miss cases that show fewer prosodic deviations (which, anecdotally, is common in high-masking girls). The authors suggest moving to multi-modal next. Good.
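On the resampling point above: the most common quick fix is naive random over-sampling of minority classes before training. With only minutes of audio per class, it can lift UAR or simply teach the model to memorize duplicates. A sketch with invented counts and stand-in embeddings:

```python
# Naive random over-sampling of minority classes before training. With only
# minutes of audio per class, this can lift UAR or just encourage the model
# to memorize duplicates. Class counts and embeddings are invented.
import numpy as np
from collections import Counter
from sklearn.utils import resample

labels = np.array(["ASD"] * 60 + ["TD"] * 25 + ["DYS"] * 10 + ["NOS"] * 5)
features = np.random.randn(len(labels), 128)  # stand-in acoustic embeddings

target = Counter(labels).most_common(1)[0][1]  # up-sample every class to the max
X_bal, y_bal = [], []
for cls in np.unique(labels):
    idx = np.where(labels == cls)[0]
    picked = resample(idx, replace=True, n_samples=target, random_state=0)
    X_bal.append(features[picked])
    y_bal.extend([cls] * target)
X_bal = np.vstack(X_bal)

print(Counter(y_bal))  # every class now at 60 samples, duplicates included
```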
What this means for Level 1 folks who weren’t caught early
For the “quiet strugglers” (the bright kid with perfect grades and a meltdown that arrives only at home; the employee praised for attention to detail who crumples after the third ambiguous meeting), early recognition is not about labels for their own sake. It’s about unlocking supports that reduce secondary harm: sensory accommodations, explicit social curricula, executive-function scaffolding, therapy that understands autistic cognition. We now have longitudinal evidence that kids diagnosed later carry higher emotional and behavioral difficulties into adolescence, and adult-diagnosed people report more psychiatric comorbidity than those recognized in childhood. A screening tool that is biased against girls effectively withholds those supports from half the population.[4]
The next steps (from a scientist who’s also the use-case)
Over-sample girls and non-binary youth, on purpose. New data collection must set targets for female and gender-diverse participants and keep recruiting until the targets are met. Weighting after the fact isn’t enough; we need representation at the source. (Noor documents 19–22% female participation by dataset; aim for parity.)
Fairness-by-design training. Use stratified batch sampling, group-aware loss (e.g., re-weighting by subgroup), and post-hoc calibration per gender. Report subgroup UAR as a primary metric, not a footnote. Noor’s explicit fairness testing is a solid precedent; now bake it into the training loop (a sketch follows this list).
Multi-modal fusion. Combine speech with facial expression, gesture, and context (home vs. clinic) to reduce reliance on a single channel that may vary by gendered socialization; a toy fusion head also follows this list. The datasets Noor touches already contain richer annotations that could be leveraged.
Lifecycle evaluation. Test models on younger ages (pre-linguistic vocalizations), older kids, and adults, because many of us are only diagnosed in adulthood, and our voices aren’t eight anymore. The authors note infant vocalization work; extend that to the opposite end of the pipeline.
Real-world deployments with guardrails. In practice, Noor-like tools should be framed as triage, not diagnosis. Pair algorithmic flags with human-led assessment pathways, provide clear risk communication to caregivers, and audit for false reassurance in girls. Privacy and consent are non-negotiable when recording children. (Common sense, yet historically uneven.)
Co-design with late-diagnosed women. Put us on the grant, in the lab, and on the IRB. We’ll spot failure modes you won’t because we’ve lived them.
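Here’s what the fairness-by-design point could look like in code, under assumed metadata: weight each training example inversely to the frequency of its (label, gender) cell so girls’ voices aren’t drowned out in batching, and report per-gender UAR as a first-class number. Every array below is illustrative.

```python
# (1) Group-aware sampling: weight each example inversely to the frequency
# of its (label, gender) cell, so sparse cells (autistic girls) are seen as
# often as dense ones. (2) Per-gender UAR as a first-class metric.
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler
from sklearn.metrics import recall_score

# Hypothetical metadata aligned with the training examples.
labels = np.array([0, 0, 1, 1, 1, 0, 1, 0])
gender = np.array(["m", "m", "m", "f", "m", "f", "m", "m"])

cells = list(zip(labels, gender))
counts = {c: cells.count(c) for c in set(cells)}
weights = torch.tensor([1.0 / counts[c] for c in cells], dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=len(cells), replacement=True)
# Pass `sampler` to DataLoader(dataset, sampler=sampler) during training.

# Subgroup UAR, reported per gender rather than buried in an aggregate.
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])  # placeholder model outputs
for g in ("m", "f"):
    mask = gender == g
    uar = recall_score(labels[mask], y_pred[mask], average="macro")
    print(f"UAR ({g}): {uar:.3f}")  # prints 1.000 for "m", 0.500 for "f"
```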
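And a toy version of the multi-modal point: a late-fusion head that concatenates a per-clip speech embedding with a face/gesture embedding before classifying. Dimensions and names are assumptions, not anything from the study.

```python
# A toy late-fusion head: concatenate a per-clip speech embedding with a
# face/gesture embedding, then classify. Dimensions are assumptions.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, speech_dim=768, video_dim=512, n_classes=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(speech_dim + video_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),  # TD / ASD / DYS / NOS
        )

    def forward(self, speech_emb, video_emb):
        # Fuse modalities by concatenation; one channel can compensate for
        # the other when, say, prosody is near-typical but affect isn't.
        return self.head(torch.cat([speech_emb, video_emb], dim=-1))

model = LateFusionClassifier()
logits = model(torch.randn(2, 768), torch.randn(2, 512))  # batch of two clips
```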
A personal postscript
The week I got my diagnosis, a friend texted, “So, does anything really change?” Everything did. My life didn’t become less autistic; it became better configured. I stopped wasting energy pretending to be a kind of person who never quite fit. If Noor and projects like it can move even a fraction of girls and Level 1 kids from “undiagnosed until burnout” to “recognized and supported when it matters,” that’s not just an academic win. That’s fewer lost years.
But the technology will only be as fair as the data we feed it. Noor shows that the engineering is there; the ethics are, too, if we choose them. Now we have to do the unglamorous work: collect inclusive data, measure subgroup performance on purpose, and refuse to accept models that work mainly for boys. Otherwise, we’ll keep automating the very gap that harmed so many of us in the first place.
[1] Alaghband-Rad J, Hajikarim-Hamedani A, Motamed M. Camouflage and masking behavior in adult autism. Front Psychiatry. 2023 Mar 16;14:1108110. doi: 10.3389/fpsyt.2023.1108110. PMID: 37009119; PMCID: PMC10060524.
[2] Jadav N, Bal VH. Associations between co-occurring conditions and age of autism diagnosis: Implications for mental health training and adult autism research. Autism Res. 2022 Nov;15(11):2112-2125. doi: 10.1002/aur.2808. Epub 2022 Aug 27. PMID: 36054777; PMCID: PMC9637770. https://pmc.ncbi.nlm.nih.gov/articles/PMC9637770/
[3] Autism: The challenges and opportunities of an adult diagnosis. Harvard Health Publishing. https://www.health.harvard.edu/mind-and-mood/autism-the-challenges-and-opportunities-of-an-adult-diagnosis
[4] Mandy W, Midouhas E, Hosozawa M, Cable N, Sacker A, Flouri E. Mental health and social difficulties of late-diagnosed autistic children, across childhood and adolescence. J Child Psychol Psychiatry. 2022 Nov;63(11):1405-1414. doi: 10.1111/jcpp.13587. Epub 2022 Feb 16. PMID: 35174492; PMCID: PMC9790627. https://pmc.ncbi.nlm.nih.gov/articles/PMC9790627/