Validation Highlights
Critique quality on H-Max: 6.1. Above the best-human-review anchor of 5.0, benchmarked using the ScholarPeer framework (ScholarPeer, arXiv 2601.22638).
Inter-rater reliability (ICC2) with adjudication: 0.81, against a human peer-review benchmark of 0.34 (Bornmann, Mutz & Daniel, 2010).
Retracted papers flagged in the bottom two quality tiers: 85%+. Flagged blind, before retraction information was available. The evaluation also surfaces the latest retraction status.
Validation corpus. Numbers update as the expanded corpus completes.
1. Abstract
Nabu evaluates research papers on intrinsic merit - Quality, Reliability, and Impact Potential - benchmarked against the standard in the paper’s field and methodology type. Evaluation is blinded to publication venue, citation count, and other proxies.
Quality is scored against the 4C framework (Contribution, Craft, Clarity, Context) by three blinded AI-Human hybrid reviewers per paper, with adjudicated scoring and documented rationale per component. A separate Reliability rating (Red / Amber / Green) flags concerns - methodological, statistical, or post-publication - that could change how the findings should be interpreted. A separate Impact Potential score captures likely real-world significance and translation. The three signals are never blended.
The framework is rubric-driven by design: explicit criteria, field-specific calibration anchors, and component-level scoring keep AI judgments evidence-bound and surface insufficient-information cases rather than guessing through them.
The framework has been tested four ways against an initial validation corpus (sampled across OECD Fields of Science): critique quality benchmarked against expert human reviews, blind detection of retracted work, inter-rater reliability, and decision-divergence from the journal-prestige default. Nabu’s critiques score above the best human review on average (6.1 vs the 5.0 best-human-review anchor, with no review scoring below 4; n=50); inter-rater reliability (ICC2) reaches 0.81 (n=400+); and blind evaluation placed 85%+ of retracted papers in the bottom two quality tiers (n=50). Details below.
2. Methodology
Every paper is evaluated on three independent signals: Quality, Reliability, and Impact Potential. Each draws on the same evaluation pipeline, with criteria calibrated to the paper’s methodology type and field of science. The three signals are scored, displayed, and reasoned about separately - never blended into a single number.
2.1 Quality (4C) Framework
The Quality score is built from four weighted dimensions. Each dimension is scored holistically against a set of guiding criteria rather than many individually scored sub-components.
Contribution · 25%
Does this paper move the field forward?
Craft · 45%
Is the methodology sound for the question asked?
Clarity · 10%
Is the work clearly and precisely told?
Context · 20%
Is the scope of the conclusions defensible?
Quality score calibration
| Score range | Tier | What the paper looks like |
|---|---|---|
| 4.4 - 5.0 | Exceptional | Reference-quality work with strong evidence across dimensions. |
| 3.8 - 4.3 | Very Good | Clearly strong work with limited, non-fundamental weaknesses. |
| 3.0 - 3.7 | Solid | Capable work with identifiable strengths and some meaningful gaps. |
| 2.0 - 2.9 | Weak | Minimum threshold met; notable concerns reduce confidence. |
| 0.0 - 1.9 | Poor | Fundamental concerns that materially limit reliability. |
Scores are calibrated against published work at the time of publication, not against current standards. II (Insufficient Information) is used for non-reportable components; weight is redistributed across reportable components, not penalised.
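The II rule can be sketched in a few lines of Python. The weights come from the 4C list above; the proportional redistribution of II weight, the rounding to one decimal, and the name `quality_score` are assumptions made for illustration, not Nabu's published implementation.

```python
from typing import Optional

# Dimension weights as listed for the 4C framework.
QUALITY_WEIGHTS = {
    "contribution": 0.25,
    "craft": 0.45,
    "clarity": 0.10,
    "context": 0.20,
}


def quality_score(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean of the scored dimensions; None marks a dimension as II.

    Assumption: the weight of an II dimension is spread across the remaining
    dimensions in proportion to their own weights.
    """
    reportable = {name: s for name, s in scores.items() if s is not None}
    if not reportable:
        return None  # every dimension is II, so no overall score is reported
    remaining_weight = sum(QUALITY_WEIGHTS[name] for name in reportable)
    weighted_sum = sum(QUALITY_WEIGHTS[name] * s for name, s in reportable.items())
    # Dividing by the remaining weight is the proportional redistribution.
    return round(weighted_sum / remaining_weight, 1)


full = quality_score({"contribution": 4.0, "craft": 3.5, "clarity": 4.5, "context": 3.0})
partial = quality_score({"contribution": 4.0, "craft": 3.5, "clarity": None, "context": 3.0})
print(full, partial)  # 3.6 3.5
```

In the second call Clarity is II, so its 10% weight is shared among the other three dimensions rather than counted as a zero.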
2.2 Reliability Screen
The Reliability rating is a pre- and post-scoring gate that flags methodological, statistical, or evidentiary concerns.
A Red rating does not mean "bad research." It means "proceed with caution - specific concerns identified." Quality work can carry an Amber or Red flag if reliability concerns are present, and weak work can carry a Green flag if no specific concerns are identified.
Internal Coherence
Methods–results alignment. Statistical results computable. Claims supported by evidence presented in the paper.
Research Conduct
Ethics approval declared. Conflicts of interest disclosed. Preregistration referenced with identifier. Data and code availability as reported.
Reference Integrity
Every cited reference resolves to a real publication whose metadata matches the citation (Topaz et al. 2026, The Lancet).
Post-Publication Record
Retraction status. Post-publication concerns raised by the scientific community.
Each component returns one of three statuses: Clean (no concerns identified), Noted (concerns present but do not change interpretation of core findings), or Concern (concerns that could change interpretation if confirmed). The overall Reliability flag is derived from the worst component status across the four.
Green: No reliability concerns identified. Quality and Impact Potential scores reflect merit on their own.
Amber: Watch-outs noted. The concerns do not change the interpretation of core findings, but readers should consult the rationale.
Red: Concerns identified that could change the interpretation of findings. Affected components are capped, and the paper carries an explicit reliability flag in downstream displays.
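The worst-status rule can be expressed as a small sketch. The mapping Clean to Green, Noted to Amber, and Concern to Red is inferred from the descriptions above; the `Status` enum, the component keys, and `reliability_flag` are hypothetical names used only for illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    """Component statuses, ordered so that a higher value is worse."""
    CLEAN = 0
    NOTED = 1
    CONCERN = 2


# Inferred mapping from the worst component status to the overall flag.
FLAG_BY_WORST_STATUS = {
    Status.CLEAN: "Green",
    Status.NOTED: "Amber",
    Status.CONCERN: "Red",
}


def reliability_flag(components: dict[str, Status]) -> str:
    """Derive the overall flag from the worst status across the components."""
    if not components:
        raise ValueError("at least one component status is required")
    worst = max(components.values())
    return FLAG_BY_WORST_STATUS[worst]


example = {
    "internal_coherence": Status.CLEAN,
    "research_conduct": Status.NOTED,
    "reference_integrity": Status.CLEAN,
    "post_publication_record": Status.CLEAN,
}
print(reliability_flag(example))  # Amber
```

A single Concern on any component is enough to turn the overall flag Red, regardless of how the other three are rated.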
2.3 Impact Potential (4T) Framework
Quality and Impact Potential are measured separately, always. A perfectly executed study of a trivial question may score high on Quality and low on Impact Potential. The reverse also occurs: an ambitious agenda paper with weak execution can carry strong Impact Potential and a low Quality score. The two signals answer different questions, and the framework keeps them apart.
The Impact Potential score is built from four components: Traction, Translation, Transferability, and Trajectory.
Traction · 30%
Does the paper address a problem that real stakeholders are actively trying to solve?
Translation · 30%
How close are the findings to deployment or applied use?
Transferability · 20%
How likely are the findings to hold beyond the specific study conditions?
Trajectory · 20%
Do the findings advance a cumulative evidence base?
Impact Potential calibration
| Score range | Tier | What the paper looks like |
|---|---|---|
| 4.4 - 5.0 | Very High | Directly addresses a documented need with clear stakeholder pathway. |
| 3.8 - 4.3 | High | Clear real-world relevance with a plausible pathway to use. |
| 3.0 - 3.7 | Medium | Meaningful potential, but pathway remains partly defined. |
| 2.0 - 2.9 | Low | Narrow or early-stage pathway requiring substantial additional work. |
| 0.0 - 1.9 | Minimal | No clear pathway to application, policy influence, or practical use yet. |
The same II (Insufficient Information) rule applies: where a component cannot be evaluated from what the paper reports, it is marked II and weight is redistributed.
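As a worked example, the sketch below scores a paper whose Transferability component is II and looks up the resulting tier. The weights and band boundaries are taken from the text above; rounding to one decimal before the lookup and proportional redistribution of the II weight are assumptions, and `impact_tier` is a hypothetical helper name.

```python
# Component weights as listed for the 4T framework.
IMPACT_WEIGHTS = {
    "traction": 0.30,
    "translation": 0.30,
    "transferability": 0.20,
    "trajectory": 0.20,
}

# (lower bound, tier) pairs from the calibration table, highest band first.
IMPACT_TIERS = [
    (4.4, "Very High"),
    (3.8, "High"),
    (3.0, "Medium"),
    (2.0, "Low"),
    (0.0, "Minimal"),
]


def impact_tier(score: float) -> str:
    """Map a 0.0-5.0 Impact Potential score to its calibration tier."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    rounded = round(score, 1)  # bands are stated to one decimal place
    for lower_bound, tier in IMPACT_TIERS:
        if rounded >= lower_bound:
            return tier
    raise AssertionError("unreachable: the lowest band starts at 0.0")


# Transferability is II, so it is left out and its weight is redistributed.
scored = {"traction": 4.5, "translation": 3.0, "trajectory": 4.0}
remaining_weight = sum(IMPACT_WEIGHTS[name] for name in scored)
score = sum(IMPACT_WEIGHTS[name] * value for name, value in scored.items()) / remaining_weight
print(round(score, 1), impact_tier(score))  # 3.8 High
```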
2.4 Evaluation Pipeline
Each paper passes through a structured pipeline. A classification agent first assigns the paper to its OECD Field of Science and selects the appropriate methodology module. Three independent reviewers then evaluate the paper against the rubric, each blind to the others' scoring. An adjudicator resolves any divergence on strength of evidence, applying the same rubric and field-specific calibration anchors.
The architecture is designed so that the rubric, not any single reviewer, is the evaluator. When reviewers disagree, the rubric - applied by the adjudicator - decides.
Pipeline diagram: classification agent (field + module) → three reviewers (independent evaluations) → adjudicator (resolves divergence) → output (Quality + Reliability + Impact).
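The flow for a single rubric component might look roughly like the sketch below. Everything here is hypothetical: the `Review` record, the `Classifier` and `Reviewer` callables, the 0.5 divergence tolerance, and especially the use of cited-passage count as a stand-in for "strength of evidence". The source does not document how the adjudicator weighs evidence, so this only illustrates the shape of the pipeline.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable


@dataclass(frozen=True)
class Review:
    reviewer: str
    score: float               # one component score on the 0.0-5.0 scale
    evidence: tuple[str, ...]  # passages from the paper cited in the rationale


# Hypothetical types: the classifier returns (field, module); a reviewer takes
# (paper_text, field, module) and returns its own Review.
Classifier = Callable[[str], tuple[str, str]]
Reviewer = Callable[[str, str, str], Review]

DIVERGENCE_TOLERANCE = 0.5  # hypothetical threshold, not a documented value


def adjudicate(reviews: list[Review]) -> float:
    """Resolve divergence between independent reviews of one component."""
    scores = [r.score for r in reviews]
    if max(scores) - min(scores) <= DIVERGENCE_TOLERANCE:
        return median(scores)  # reviewers agree closely enough
    # Crude stand-in for "strength of evidence": prefer the review whose
    # rationale cites the most passages from the paper.
    return max(reviews, key=lambda r: len(r.evidence)).score


def evaluate_component(paper_text: str, classify: Classifier, reviewers: list[Reviewer]) -> float:
    """Classify the paper, collect independent reviews, then adjudicate."""
    field, module = classify(paper_text)
    # Each reviewer receives only the paper and its classification, never
    # another reviewer's output, which keeps the evaluations independent.
    reviews = [review(paper_text, field, module) for review in reviewers]
    return adjudicate(reviews)
```

Passing the reviewers in as plain callables keeps the sketch agnostic about whether a reviewer is a model, a human, or a hybrid of the two.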
Two further controls operate inside this pipeline:
Bias control - reviewers are blinded to author, journal, and institutional information; calibration anchors are field-specific to prevent default-to-prestigious patterns.
Anti-metric gaming - rubric components reward methodological substance, not surface signals; scores cannot be improved by edits that do not reflect underlying quality.
2.5 Role of AI in the Pipeline
Nabu’s reviewers are AI-Human hybrid agents operating within the structured rubric. Adjudication is also AI-executed within the rubric, with an escalation and quality-assurance path to human reviewers. The system is engineered so that the rubric does the work, not the model.
“The rubric is the evaluator. The AI is the instrument.”
Four operational guardrails constrain the AI to evidence-based judgments:
Component-level scoring, not holistic judgment.
Each reviewer scores each rubric component independently, not the paper as a whole.
Evidence-bound rationale per component.
Every component score must be accompanied by rationale that cites specific text, methods, or results from the paper.
The Insufficient Information (II) flag.
When a reviewer cannot extract enough evidence from the paper to score a component, the reviewer marks the component II rather than guess.
Field-specific calibration anchors.
Each rubric component is calibrated against literature and standards for the paper's Field of Science. This prevents the AI from defaulting to a generic prior of good research.
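One way to make the first three guardrails mechanical is to reject any component score that arrives without evidence. The sketch below is an assumption about how such a record could be validated; `ComponentScore` and its fields are hypothetical names, not part of Nabu's published schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComponentScore:
    """One reviewer's score for one rubric component."""
    component: str
    score: Optional[float]         # None means Insufficient Information (II)
    rationale: str = ""
    evidence: tuple[str, ...] = ()  # text, methods, or results cited from the paper

    def __post_init__(self) -> None:
        if self.score is None:
            return  # II: the reviewer declines to score rather than guess
        if not 0.0 <= self.score <= 5.0:
            raise ValueError(f"score out of range: {self.score}")
        if not self.rationale.strip() or not self.evidence:
            raise ValueError("a numeric score needs a rationale that cites the paper")


scored = ComponentScore(
    component="craft",
    score=3.5,
    rationale="Sample size is justified, but the primary outcome changed after registration.",
    evidence=("Methods, para. 2", "Table 1"),
)
insufficient = ComponentScore(component="context", score=None)
print(scored.score, insufficient.score)  # 3.5 None
```

Because validation runs at construction time, a score without cited evidence cannot enter the pipeline in the first place.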
3. Validation
Initial validation results from the Nabu evaluation corpus. Numbers will be updated as the expanded validation corpus completes.
Result 1: Critique quality vs. expert human reviews
Nabu’s review critiques were scored on H-Max - a metric that calibrates critique quality against the full set of human expert reviews, where the best human review anchors at 5.0. The approach follows ScholarPeer, a peer-review framework published by Google. Nabu scored 6.1, above the best-human-review anchor (n=50). Notably, none of Nabu’s reviews scored below 4 - none fell short of the human benchmark.
ScholarPeer, a peer-review framework from Google (arXiv 2601.22638).
Result 2: Inter-rater reliability
Nabu’s primary reviewers achieve an ICC2 of 0.81 (absolute agreement) across all scoring dimensions (n=400+, sampled across OECD Fields of Science) - more than double the published meta-analytic benchmark for human peer review (0.34, across 48 studies and 19,443 manuscripts).
Bornmann, Mutz & Daniel (2010). A reliability-generalization study of journal peer reviews: a multilevel meta-analysis. PLOS ONE, 5(12), e14331. doi:10.1371/journal.pone.0014331
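For readers who want to reproduce this kind of figure on their own ratings, the sketch below computes a two-way random-effects, absolute-agreement ICC. It assumes the single-rater form, ICC(2,1) in the Shrout and Fleiss notation; the source does not say whether the reported 0.81 is a single-rater or average-rater value. The rating matrix is illustrative and is not Nabu data.

```python
def icc2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` has one row per paper and one column per rater.
    """
    n = len(ratings)     # number of papers
    k = len(ratings[0])  # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete matrix with at least 2 papers and 2 raters")

    grand_mean = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative ratings: 6 papers scored by 4 raters.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(ratings), 2))  # 0.29
```

Absolute agreement penalises raters who are consistently more lenient or harsher than one another, which is why the between-rater mean square appears in the denominator.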
Result 3: Retraction detection
A curated corpus of confirmed-retracted papers (n=50) was evaluated blind alongside a control set of non-retracted papers from the same sources, with no knowledge of retraction status available to the reviewers.
- 85%+ of retracted papers were placed in the bottom two quality tiers (Poor or Weak, 0.0–2.9).
- The low scores were driven by specific, documented methodological concerns - not a generic “this seems bad” signal.
- The remainder were papers retracted for reasons not reliably visible in the rubric-scorable text alone (post-hoc data fabrication, image manipulation, ethical violations).
The rubric identified what post-publication scrutiny later confirmed.
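The headline percentage is a simple proportion: the share of papers whose Quality score lands at or below 2.9, the top of the Weak band. A minimal sketch follows, using made-up scores that are not Nabu's validation data; `share_in_bottom_tiers` is a hypothetical helper name.

```python
BOTTOM_TWO_TIERS_CEILING = 2.9  # upper bound of the Weak band


def share_in_bottom_tiers(quality_scores: list[float]) -> float:
    """Fraction of papers scored Poor or Weak (0.0-2.9)."""
    if not quality_scores:
        raise ValueError("at least one score is required")
    flagged = sum(1 for s in quality_scores if round(s, 1) <= BOTTOM_TWO_TIERS_CEILING)
    return flagged / len(quality_scores)


# Made-up scores for illustration only.
retracted = [1.2, 2.5, 2.9, 3.4, 0.8]
control = [3.9, 2.7, 4.1, 3.3, 3.6]
print(share_in_bottom_tiers(retracted), share_in_bottom_tiers(control))  # 0.8 0.2
```

Reporting the same proportion for the control set shows how often non-retracted papers fall into the bottom tiers, which is the baseline the retracted-paper rate should be read against.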
4. Limitations
The framework is calibrated against published research. It is most reliable for paper formats where methodology, claims, and evidence are explicit and reportable. It is less reliable, by design, in the following cases:
Reliability and validation results in the Validation section should be read with these scope conditions in mind.
5. Commitments
Open methodology.
The rubric, weights, scoring criteria, and adjudication principles are published in full. Evaluate the evaluator.
Full blinding.
No author, journal, or institution signals enter the evaluation. Publication year is retained solely for era-appropriate calibration.
No unpaid reviewers.
All reviewers engaged by Nabu for calibration, escalation and quality control are dedicated professional reviewers trained and compensated accordingly.
No conflicts of interest.
Nabu has no relationship with publishers, journals, or institutions being evaluated. Reviewers have no career incentive tied to the scores they produce. The evaluation is structurally independent: there is no scenario in which scoring a paper higher or lower benefits Nabu commercially.
Venue-independent.
The same rubric everywhere. A paper is a paper.
Living evaluations.
Post-publication signals (replications, retractions, method validity, practitioner feedback) update the reliability layer over time.
DORA and CoARA alignment.
Paper-level, methodology-based, venue-independent. The framework operationalises what 3,000+ signatory institutions committed to.