How Nabu Evaluates Research

Cite as: Nabu Science (2026). Nabu Evaluation Framework v1.0. nabu.science/methodology

1. Abstract

Nabu evaluates research papers on intrinsic merit - Quality, Reliability, and Impact Potential - benchmarked against the standard in the paper’s field and methodology type. Evaluation is blinded to publication venue, citation count, and other proxies.

Quality is scored against the 4C framework (Contribution, Craft, Clarity, Context) by three blinded AI-Human hybrid reviewers per paper, with adjudicated scoring and documented rationale per component. A separate Reliability rating (Red / Amber / Green) flags concerns - methodological, statistical, or post-publication - that could change how the findings should be interpreted. A separate Impact Potential score captures likely real-world significance and translation. The three signals are never blended.

The framework is rubric-driven by design: explicit criteria, field-specific calibration anchors, and component-level scoring keep AI judgments evidence-bound and surface insufficient-information cases rather than guessing through them.

The framework has been tested four ways against an initial validation corpus sampled across OECD Fields of Science: critique quality benchmarked against expert human reviews, blind detection of retracted work, inter-rater reliability, and decision divergence from the journal-prestige default. On critique quality, Nabu’s reviews scored above the best human review on average (6.1 vs the 5.0 best-human-review anchor; none scored below 4, i.e. none ranked worse than the human reviewers; n=50). Inter-rater reliability (ICC2) reached 0.81 (n=400+). In blinded evaluation, 85%+ of retracted papers were placed in the bottom two quality tiers (n=50). Details below.

2. Methodology

Every paper is evaluated on three independent signals: Quality, Reliability, and Impact Potential. Each draws on the same evaluation pipeline, with criteria calibrated to the paper’s methodology type and field of science. The three signals are scored, displayed, and reasoned about separately - never blended into a single number.

2.1 Quality (4C) Framework

The Quality score is built from four weighted dimensions. Each dimension is scored holistically against a set of guiding criteria rather than many individually scored sub-components.

  • Contribution · 25%

    Does this paper move the field forward?

  • Craft · 45%

    Is the methodology sound for the question asked?

  • Clarity · 10%

    Is the work clearly and precisely told?

  • Context · 20%

    Is the scope of the conclusions defensible?

Quality score calibration

Score range   Tier          What the paper looks like
4.4 - 5.0     Exceptional   Reference-quality work with strong evidence across dimensions.
3.8 - 4.3     Very Good     Clearly strong work with limited, non-fundamental weaknesses.
3.0 - 3.7     Solid         Capable work with identifiable strengths and some meaningful gaps.
2.0 - 2.9     Weak          Minimum threshold met; notable concerns reduce confidence.
0.0 - 1.9     Poor          Fundamental concerns that materially limit reliability.

Scores are calibrated against published work at the time of publication, not against current standards. II (Insufficient Information) is used for non-reportable components; weight is redistributed across reportable components, not penalised.
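
The weighting and the II rule can be illustrated with a short sketch. The weights are the published 4C weights; the function name, the use of `None` to represent II, and the choice to redistribute weight proportionally across the reportable components are assumptions made for illustration, not a description of Nabu's implementation.

```python
from typing import Dict, Optional

# Published 4C weights (Contribution 25%, Craft 45%, Clarity 10%, Context 20%).
WEIGHTS_4C = {"Contribution": 0.25, "Craft": 0.45, "Clarity": 0.10, "Context": 0.20}


def quality_score(components: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean of the reportable components on the 0.0-5.0 scale.

    A value of None stands for II (Insufficient Information). Assumption:
    the weight of an II component is spread over the remaining components
    in proportion to their own weights, so II neither raises nor lowers
    the score.
    """
    reportable = {name: s for name, s in components.items() if s is not None}
    if not reportable:
        return None  # nothing could be scored
    total_weight = sum(WEIGHTS_4C[name] for name in reportable)
    weighted_sum = sum(WEIGHTS_4C[name] * s for name, s in reportable.items())
    return weighted_sum / total_weight


# All four components reportable.
full = quality_score(
    {"Contribution": 4.0, "Craft": 3.5, "Clarity": 4.5, "Context": 3.0}
)

# Clarity marked II: its 10% is redistributed across the other three.
partial = quality_score(
    {"Contribution": 4.0, "Craft": 3.5, "Clarity": None, "Context": 3.0}
)
```

Under these assumptions the first paper scores 3.625, and the second scores 3.175 / 0.9, roughly 3.53; both fall in the Solid range.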

2.2 Reliability Screen

The Reliability rating is a pre- and post-scoring gate that flags methodological, statistical, or evidentiary concerns.

A Red rating does not mean "bad research." It means "proceed with caution - specific concerns identified." Quality work can carry an Amber or Red flag if reliability concerns are present, and weak work can carry a Green flag if no specific concerns are identified.

  • Internal Coherence

    Methods–results alignment. Statistical results computable. Claims supported by evidence presented in the paper.

  • Research Conduct

    Ethics approval declared. Conflicts of interest disclosed. Preregistration referenced with identifier. Data and code availability as reported.

  • Reference Integrity

    Every cited reference resolves to a real publication whose metadata matches the citation (Topaz et al. 2026, The Lancet).

  • Post-Publication Record

    Retraction status. Post-publication concerns raised by the scientific community.

Each component returns one of three statuses: Clean (no concerns identified), Noted (concerns present but do not change interpretation of core findings), or Concern (concerns that could change interpretation if confirmed). The overall Reliability flag is derived from the worst component status across the four.

  • Green: No reliability concerns identified. Quality and Impact Potential scores reflect merit on their own.

  • Amber: Watch-outs noted. The concerns do not change the interpretation of core findings, but readers should consult the rationale.

  • Red: Concerns identified that could change the interpretation of findings. Affected components are capped, and the paper carries an explicit reliability flag in downstream displays.
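
The worst-of rule described above can be sketched in a few lines. The mapping Clean to Green, Noted to Amber, and Concern to Red is inferred from the flag descriptions; the enum and function names are hypothetical.

```python
from enum import IntEnum
from typing import Dict


class Status(IntEnum):
    """Component statuses, ordered so that a larger value is worse."""
    CLEAN = 0
    NOTED = 1
    CONCERN = 2


# Assumed mapping from the worst component status to the overall flag.
FLAG_FOR_WORST = {
    Status.CLEAN: "Green",
    Status.NOTED: "Amber",
    Status.CONCERN: "Red",
}


def reliability_flag(component_status: Dict[str, Status]) -> str:
    """Overall flag is determined by the worst status across the components."""
    worst = max(component_status.values())
    return FLAG_FOR_WORST[worst]


example = {
    "Internal Coherence": Status.CLEAN,
    "Research Conduct": Status.NOTED,
    "Reference Integrity": Status.CLEAN,
    "Post-Publication Record": Status.CLEAN,
}
flag = reliability_flag(example)  # one Noted, no Concern: Amber
```

A single Concern on any component yields Red regardless of the other three, which is what makes the screen a gate rather than an average.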

2.3 Impact Potential (4T) Framework

Quality and Impact Potential are measured separately, always. A perfectly executed study of a trivial question may score high on Quality and low on Impact Potential. The reverse also occurs: an ambitious agenda paper with weak execution can carry strong Impact Potential and a low Quality score. The two signals answer different questions, and the framework keeps them apart.

The Impact Potential score is built from four components: Traction, Translation, Transferability, and Trajectory.

  • Traction · 30%

    Does the paper address a problem that real stakeholders are actively trying to solve?

  • Translation · 30%

    How close are the findings to deployment or applied use?

  • Transferability · 20%

    How likely are the findings to hold beyond the specific study conditions?

  • Trajectory · 20%

    Do the findings advance a cumulative evidence base?

Impact Potential calibration

Score range   Tier        What the paper looks like
4.4 - 5.0     Very High   Directly addresses a documented need with clear stakeholder pathway.
3.8 - 4.3     High        Clear real-world relevance with a plausible pathway to use.
3.0 - 3.7     Medium      Meaningful potential, but pathway remains partly defined.
2.0 - 2.9     Low         Narrow or early-stage pathway requiring substantial additional work.
0.0 - 1.9     Minimal     No clear pathway to application, policy influence, or practical use yet.

The same II (Insufficient Information) rule applies: where a component cannot be evaluated from what the paper reports, it is marked II and weight is redistributed.
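
The Impact Potential calibration table can be read as a lookup from score to tier. The sketch below assumes scores are reported to one decimal place, which is how the published ranges appear to be stated (4.3 is followed directly by 4.4); the function name and the rounding step are illustrative assumptions.

```python
# Lower bound of each Impact Potential tier, highest first (from the table above).
IMPACT_TIERS = [
    (4.4, "Very High"),
    (3.8, "High"),
    (3.0, "Medium"),
    (2.0, "Low"),
    (0.0, "Minimal"),
]


def impact_tier(score: float) -> str:
    """Return the tier label for a 0.0-5.0 Impact Potential score.

    Assumption: the score is rounded to one decimal before lookup, so a
    raw value such as 3.57 is treated as 3.6.
    """
    if not 0.0 <= score <= 5.0:
        raise ValueError("score must lie between 0.0 and 5.0")
    rounded = round(score, 1)
    for lower_bound, label in IMPACT_TIERS:
        if rounded >= lower_bound:
            return label
    raise AssertionError("unreachable: the lowest tier starts at 0.0")


tier = impact_tier(3.57)  # rounds to 3.6, which falls in Medium
```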

2.4 Evaluation Pipeline

Each paper passes through a structured pipeline. A classification agent first assigns the paper to its OECD Field of Science and selects the appropriate methodology module. Three independent reviewers then evaluate the paper against the rubric, each blind to the others' scoring. An adjudicator resolves any divergence on strength of evidence, applying the same rubric and field-specific calibration anchors.

The architecture is designed so that the rubric, not any single reviewer, is the evaluator. When reviewers disagree, the rubric - applied by the adjudicator - decides.
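
One way to picture the hand-off from reviewers to adjudicator is a check that lists the components on which the three independent scores diverge. The text does not state what counts as divergence, so the tolerance below is a purely hypothetical value, as are the function and variable names; the adjudication itself (a judgment on strength of evidence) is not modelled.

```python
from typing import Dict, List

# Hypothetical tolerance: the source does not define a divergence threshold.
DIVERGENCE_TOLERANCE = 0.5


def divergent_components(
    reviews: List[Dict[str, float]],
    tolerance: float = DIVERGENCE_TOLERANCE,
) -> List[str]:
    """List the components whose scores spread by more than the tolerance.

    Each entry in `reviews` holds one reviewer's component scores, produced
    without sight of the other reviewers' scores.
    """
    flagged = []
    for name in reviews[0]:
        scores = [review[name] for review in reviews]
        if max(scores) - min(scores) > tolerance:
            flagged.append(name)
    return flagged


reviews = [
    {"Contribution": 4.0, "Craft": 3.5},
    {"Contribution": 4.2, "Craft": 2.5},
    {"Contribution": 3.9, "Craft": 3.6},
]
to_adjudicate = divergent_components(reviews)  # only Craft spreads by more than 0.5
```

In this sketch only the flagged components would be passed to the adjudicator, who applies the same rubric and calibration anchors to the cited evidence.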

Two further controls operate inside this pipeline:

  • Bias control - reviewers are blinded to author, journal, and institutional information; calibration anchors are field-specific to prevent default-to-prestigious patterns.

  • Anti-metric gaming - rubric components reward methodological substance, not surface signals; scores cannot be improved by edits that do not reflect underlying quality.

2.5 Role of AI in the Pipeline

Nabu’s reviewers are AI-Human hybrid agents operating within the structured rubric. Adjudication is also AI-executed within the rubric, with an escalation and quality-assurance path to human reviewers. The system is engineered so that the rubric does the work, not the model.

“The rubric is the evaluator. The AI is the instrument.”

Four operational guardrails constrain the AI to evidence-based judgments:

  • Component-level scoring, not holistic judgment.

    Each reviewer scores each rubric component independently, not the paper as a whole.

  • Evidence-bound rationale per component.

    Every component score must be accompanied by rationale that cites specific text, methods, or results from the paper.

  • The Insufficient Information (II) flag.

    When a reviewer cannot extract enough evidence from the paper to score a component, the reviewer marks the component II rather than guess.

  • Field-specific calibration anchors.

    Each rubric component is calibrated against literature and standards for the paper's Field of Science. This prevents the AI from defaulting to a generic prior of good research.
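
The first three guardrails can be read as constraints on the shape of a reviewer's output. The sketch below is one possible encoding, not Nabu's schema: the class name, field names, and the use of `None` for II are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ComponentScore:
    """One reviewer's judgment on a single rubric component."""
    component: str
    score: Optional[float]  # None stands for II (Insufficient Information)
    evidence: List[str] = field(default_factory=list)  # passages cited from the paper

    def __post_init__(self) -> None:
        if self.score is None:
            return  # II: the component is left unscored rather than guessed
        if not 0.0 <= self.score <= 5.0:
            raise ValueError("score must lie between 0.0 and 5.0")
        if not self.evidence:
            # Evidence-bound rationale: a score without cited text is rejected.
            raise ValueError("a scored component must cite evidence from the paper")

    @property
    def is_insufficient(self) -> bool:
        return self.score is None


scored = ComponentScore("Craft", 3.5, ["Methods: randomisation procedure described"])
unscored = ComponentScore("Clarity", None)  # marked II
```

Constructing `ComponentScore("Craft", 4.0)` with no evidence raises `ValueError`, which is the point of the design: an unsupported score cannot enter the pipeline.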

3. Validation

Initial validation results from the Nabu evaluation corpus. Numbers will be updated as the expanded validation corpus completes.

Result 1: Critique quality vs. expert human reviews

Nabu’s review critiques were scored on H-Max - a metric that calibrates critique quality against the full set of human expert reviews, where the best human review anchors at 5.0. The approach follows ScholarPeer, a peer-review framework published by Google. Nabu scored 6.1, above the best-human-review anchor (n=50). Notably, none of Nabu’s reviews scored below 4, the level below which a critique ranks worse than the human reviews.

[Figure: mean H-Max on a 0-7 scale. Best human review (anchor): 5.0. Nabu critiques (mean H-Max): 6.1. Reference line: best-human-review anchor.]

H-Max calibrates critique quality against the full set of human expert reviews. Approach based on ScholarPeer, a peer-review framework from Google (arXiv 2601.22638).

Result 2: Inter-rater reliability

Nabu’s primary reviewers achieve an ICC2 of 0.81 (absolute agreement) across all scoring dimensions (n=400+, sampled across OECD Fields of Science) - more than double the published meta-analytic benchmark for human peer review (0.34, across 48 studies and 19,443 manuscripts).

Bornmann, Mutz & Daniel (2010). A reliability-generalization study of journal peer reviews: a multilevel meta-analysis. PLOS ONE, 5(12), e14331. doi:10.1371/journal.pone.0014331

[Figure: inter-rater reliability on a 0.00-1.00 scale. Human peer review (meta-analysis mean): 0.34. Nabu reviews: 0.81. Reference line: good reliability threshold.]
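
For readers unfamiliar with the statistic, the sketch below computes an absolute-agreement, two-way random-effects intraclass correlation from a papers-by-raters score matrix. It assumes the single-rater form usually written ICC(2,1); the text does not say whether the reported 0.81 is the single-rater or the averaged form, and the matrix shown is illustrative, not Nabu data.

```python
from typing import List


def icc2_single(ratings: List[List[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings[i][j] is the score that rater j gave to paper i.
    """
    n = len(ratings)      # number of papers
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: papers (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative matrix: 6 papers scored by 4 raters who use the scale differently.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc2_single(example)  # about 0.29
```

Because absolute agreement penalises systematic differences between raters (the rater mean square appears in the denominator), raters who rank papers identically but score on shifted scales still obtain a low value, as in the example.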

Result 3: Retraction detection

A curated corpus of confirmed-retracted papers (n=50) was evaluated blind against a control set of non-retracted papers from the same sources, with no knowledge of retraction status available to the reviewers.

  • 85%+ of retracted papers were placed in the bottom two quality tiers (Poor or Weak, 0.0–2.9).
  • The low scores were driven by specific, documented methodological concerns - not a generic “this seems bad” signal.
  • The remainder were papers retracted for reasons not reliably visible in the rubric-scorable text alone (post-hoc data fabrication, image manipulation, ethical violations).

The rubric identified what post-publication scrutiny later confirmed.

4. Limitations

The framework is calibrated against published research. It is most reliable for paper formats where methodology, claims, and evidence are explicit and reportable. It is less reliable, by design, in the following cases:

Reliability and validation results in the Validation section should be read with these scope conditions in mind.

5. Commitments

  • Open methodology.

    The rubric, weights, scoring criteria, and adjudication principles are published in full. Evaluate the evaluator.

  • Full blinding.

    No author, journal, or institution signals enter the evaluation. Publication year is retained solely for era-appropriate calibration.

  • No unpaid reviewers.

    All reviewers engaged by Nabu for calibration, escalation and quality control are dedicated professional reviewers trained and compensated accordingly.

  • No conflicts of interest.

    Nabu has no relationship with publishers, journals, or institutions being evaluated. Reviewers have no career incentive tied to the scores they produce. The evaluation is structurally independent: there is no scenario in which scoring a paper higher or lower benefits Nabu commercially.

  • Venue-independent.

    The same rubric everywhere. A paper is a paper.

  • Living evaluations.

    Post-publication signals (replications, retractions, method validity, practitioner feedback) update the reliability layer over time.

  • DORA and CoARA alignment.

    Paper-level, methodology-based, venue-independent. The framework operationalises what 3,000+ signatory institutions committed to.
