How Nabu Evaluates Research

A structured, rubric-based quality assessment framework for scientific publications

Nabu Science · v1.0 · March 2026 · Open methodology

Status: Peer review invited

Cite as: Nabu Science (2026). Nabu Evaluation Framework v1.0. nabu.science/methodology

Current research evaluation relies on venue-based proxies (journal impact factor, citation counts) that measure attention rather than quality. Over 3,000 institutions have signed DORA and CoARA commitments to evaluate research on intrinsic merit, but they lack the operational measurement infrastructure to do so. Nabu addresses this gap with a structured, rubric-based quality assessment framework. Every paper is scored across four weighted dimensions (Contribution, Craft, Clarity, Context) by three blinded reviewers, with adjudicated scoring and documented rationale. In validation testing, the framework achieves an inter-rater reliability (ICC₂) of 0.81, more than double the published benchmark for human peer review (0.34). Blind evaluation of retracted papers correctly identified 8 of 9 as fundamentally flawed. The framework, rubric, and validation data are published here in full.

1. The measurement gap in research evaluation

Most research evaluation systems measure reputation, not quality. Citation counts reward visibility. Journal impact factor scores the venue, not the work. Qualitative peer review, where it happens at all, is inconsistently applied, rarely structured, and produces judgments on which two reviewers often disagree.

This is not a new observation. The San Francisco Declaration on Research Assessment (DORA, 2012) and the Coalition for Advancing Research Assessment (CoARA, 2022) represent formal commitments by over 3,000 institutions to abandon venue-based metrics in favour of content-based evaluation. But thirteen years after DORA, the infrastructure to deliver on that commitment does not exist. Institutions are publicly reform-aligned, privately still using Impact Factor — because there is nothing operational to replace it with.

Nabu is built to fill that gap: a standardised, auditable, paper-level quality assessment framework that evaluates the work, not the wrapper.

What gets measured today vs. what should be measured

Current practice

Journal Impact Factor: measures venue prestige
Citation count: measures attention
h-index: measures career volume
Peer review: unstructured, variable, often single-reviewer

Nabu

4C quality score: measures intrinsic merit
Impact Potential: measures likely significance
Reliability Layer: monitors post-publication evidence
Multi-reviewer adjudication: structured, blinded, documented

2. Evaluation framework

2.1 The 4C Framework

Every paper is scored against four dimensions, weighted to reflect their contribution to scientific quality. Craft carries the highest weight (45%) because methodological rigour is the foundation of reliable findings. In the current version of the rubric, each dimension is scored holistically against a small set of guiding criteria rather than many individually scored sub-components.

Contribution (25%): What does it add?

1.1 Advance
1.2 Claim-evidence proportionality

Craft (45%): How well was it done?

2.1 Design-execution fit
2.2 Analytical soundness
2.3 Methodological transparency

Clarity (10%): How well is it told?

3.1 Argument structure
3.2 Precision of language

Context (20%): How well does it sit?

4.1 Engagement with prior work
4.2 Honest positioning
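
The weights above can be combined into a single quality score. The following is a minimal sketch, assuming the composite is a plain weighted average of the four dimension scores on the 0-5 scale; the source states the weights but not the aggregation formula, and the function name `composite_quality` and the example scores are hypothetical.

```python
# Dimension weights as stated in the framework text.
WEIGHTS = {"contribution": 0.25, "craft": 0.45, "clarity": 0.10, "context": 0.20}


def composite_quality(scores: dict[str, float]) -> float:
    """Return the weighted mean of the four dimension scores (each 0-5).

    Assumption: a plain weighted average. The actual Nabu aggregation
    may differ; only the weights come from the source.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("expected exactly the four 4C dimensions")
    for name, value in scores.items():
        if not 0.0 <= value <= 5.0:
            raise ValueError(f"{name} score {value} is outside the 0-5 scale")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


# Illustrative scores only, not taken from any real evaluation.
example = {"contribution": 4.0, "craft": 3.5, "clarity": 4.5, "context": 3.0}
print(composite_quality(example))  # about 3.625
```

Because Craft carries 45% of the weight, a one-point change in Craft moves the composite by 0.45, against 0.10 for the same change in Clarity.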

Quality score calibration:

Range	Label	Meaning
4.5–5.0	Exceptional	Reference-quality work with strong evidence across dimensions.
3.5–4.4	Very Good	Clearly strong work with limited, non-fundamental weaknesses.
2.5–3.4	Good	Solid work with identifiable strengths and some meaningful gaps.
1.5–2.4	Acceptable	Minimum threshold met; notable concerns reduce confidence.
0.0–1.4	Poor	Fundamental concerns that materially limit reliability.

Scores are calibrated against published work at the time of publication — not against current standards. II (Insufficient Information) is used for non-reportable components; weight is redistributed, not penalised.
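
The calibration bands and the II rule can be sketched in a few lines. This is an illustration under stated assumptions, not Nabu's implementation: it assumes scores are reported to one decimal place, and that "redistributed" means the weights of the reported items are rescaled proportionally so they sum to 1. The names `quality_label` and `redistribute_weights` are hypothetical.

```python
from typing import Optional

# Lower bound of each band, taken from the calibration table above.
QUALITY_BANDS = [
    (4.5, "Exceptional"),
    (3.5, "Very Good"),
    (2.5, "Good"),
    (1.5, "Acceptable"),
    (0.0, "Poor"),
]


def quality_label(score: float) -> str:
    """Map a 0-5 quality score to its calibration label."""
    if not 0.0 <= score <= 5.0:
        raise ValueError("score must lie on the 0-5 scale")
    rounded = round(score, 1)  # assumption: bands apply to one-decimal scores
    for lower_bound, label in QUALITY_BANDS:
        if rounded >= lower_bound:
            return label
    raise AssertionError("unreachable: the last band starts at 0.0")


def redistribute_weights(
    weights: dict[str, float], scores: dict[str, Optional[float]]
) -> dict[str, float]:
    """Drop items marked II (score is None) and rescale the remaining weights.

    Assumption: proportional rescaling; the source says only that the
    weight is redistributed rather than penalised.
    """
    reported = {k: w for k, w in weights.items() if scores.get(k) is not None}
    total = sum(reported.values())
    if total == 0:
        raise ValueError("no reportable items to score")
    return {k: w / total for k, w in reported.items()}


weights = {"contribution": 0.25, "craft": 0.45, "clarity": 0.10, "context": 0.20}
scores = {"contribution": 4.0, "craft": 3.5, "clarity": None, "context": 3.0}
print(quality_label(3.9))
print(redistribute_weights(weights, scores))
```

With Clarity marked II, the remaining weights are divided by 0.90, so Craft rises from 0.45 to 0.50 and the paper is neither rewarded nor penalised for the missing item.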

2.2 Methodology Modules

Craft dimension component 2.5 applies methodology-specific criteria. An RCT is evaluated differently from a qualitative study, because it should be. Six modules cover the primary research designs.

Randomised Controlled Trial
Observational Study
Systematic Review / Meta-Analysis
Computational / ML
Qualitative Research
Theoretical / Conceptual

2.3 Reliability Screen

Before any dimension is scored, every paper passes through an independent reliability screen. This is a pre-scoring gate — not a dimension. Red does not mean “bad research.” It means “proceed with caution — specific concerns identified.”

Green — No concerns. Scores reflect merit.

Amber — Watch-outs noted. Does not change interpretation of core findings.

Red — Concern that could change interpretation. Relevant scores capped.
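
The gate logic above can be sketched as follows. The source does not publish the cap value or say which scores count as "relevant", so both are labelled assumptions here: `HYPOTHETICAL_RED_CAP` is a placeholder, and the affected dimensions are passed in by the caller.

```python
from enum import Enum


class Screen(Enum):
    GREEN = "green"  # no concerns; scores reflect merit
    AMBER = "amber"  # watch-outs noted; interpretation unchanged
    RED = "red"      # concern that could change interpretation


# Placeholder value for illustration only; not taken from the source.
HYPOTHETICAL_RED_CAP = 2.4


def apply_screen(
    scores: dict[str, float],
    screen: Screen,
    affected: set[str],
    cap: float = HYPOTHETICAL_RED_CAP,
) -> dict[str, float]:
    """Return scores after the reliability screen.

    Green and Amber leave scores untouched. Red caps only the dimensions
    named in `affected`; all other dimensions keep their original score.
    """
    if screen is not Screen.RED:
        return dict(scores)
    return {
        dim: min(value, cap) if dim in affected else value
        for dim, value in scores.items()
    }


raw = {"contribution": 3.8, "craft": 4.0, "clarity": 4.2, "context": 3.5}
print(apply_screen(raw, Screen.RED, affected={"craft"}))
```

Keeping the screen as a separate step, rather than folding it into the dimension scores, matches the text's description of it as a pre-scoring gate and not a dimension.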

2.4 Impact Potential (separate score)

Quality and impact are measured separately, always. A perfectly executed study of a trivial question scores high on quality, low on impact. Nabu never blends these into a single number.

Quality Score: 3.9 / 5.0 (Very Good)
Impact Potential: 4.1 / 5.0

Two scores, always separate.

Four IP components (the 4Ts):

Traction
Translation
Transferability
Trajectory

Impact Potential calibration:

Range	Label	Calibration Anchor
4.5–5.0	High	Directly addresses a documented need with clear stakeholder pathway.
3.5–4.4	Strong	Clear real-world relevance with a plausible pathway to use.
2.5–3.4	Moderate	Meaningful potential, but pathway remains partly defined.
1.5–2.4	Limited	Narrow or early-stage pathway requiring substantial additional work.
0.0–1.4	Minimal	No clear pathway to application, policy influence, or practical use yet.

2.5 Evaluation Pipeline

Each paper is evaluated by three independent, blinded reviewers using the same rubric. Calibration anchors are defined by subject area, following the OECD Fields of Science classification, and set the field-specific standards against which each component is scored. Score divergence is flagged across all dimensions in a divergence map. An adjudicator resolves disagreements on strength of evidence, applying six principles. The primary reviewers alone achieve an inter-rater reliability (ICC₂) of 0.81; with the adjudication layer, this rises to 0.89, compared with a meta-analytic benchmark of 0.34 for human peer review.

Pipeline stages:

Reviewers A, B and C: AI-Human hybrid agents, fully blinded (ICC₂ = 0.81)
Divergence Map: score gaps flagged
Adjudicator: resolves on strength of evidence (ICC₂ = 0.89)
Final Evaluation
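
The divergence-map step can be pictured as a per-dimension comparison of the three reviewers' scores. The sketch below assumes divergence is measured as the gap between the highest and lowest score, and that a gap of 1.0 or more is flagged; the source specifies neither the measure nor the threshold, and `divergence_map` is a hypothetical name.

```python
def divergence_map(
    reviews: dict[str, dict[str, float]], threshold: float = 1.0
) -> dict[str, dict[str, object]]:
    """Flag dimensions where reviewer scores differ by `threshold` or more.

    `reviews` maps reviewer id -> {dimension: score}. The range (max - min)
    and the default threshold are illustrative assumptions.
    """
    dimensions = list(next(iter(reviews.values())))
    result: dict[str, dict[str, object]] = {}
    for dim in dimensions:
        values = [scores[dim] for scores in reviews.values()]
        gap = max(values) - min(values)
        result[dim] = {"gap": round(gap, 2), "flagged": gap >= threshold}
    return result


# Illustrative scores only.
reviews = {
    "A": {"contribution": 3.5, "craft": 4.0, "clarity": 4.0, "context": 3.0},
    "B": {"contribution": 3.6, "craft": 3.8, "clarity": 4.2, "context": 3.2},
    "C": {"contribution": 3.4, "craft": 2.5, "clarity": 3.9, "context": 3.1},
}
for dim, info in divergence_map(reviews).items():
    print(dim, info)
```

In this example only Craft is flagged (gap of 1.5), so the adjudicator would be directed to the one dimension where the reviewers materially disagree.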

Six adjudication principles:

1. Evidence over assertion
2. Methodology over narrative
3. Specific over vague
4. Conservative when uncertain
5. Era-appropriate standards
6. No inference beyond what is reported

2.6 The role of AI in the evaluation pipeline

Nabu's reviewers are AI-Human hybrid agents — large language models operating within a structured rubric, with calibration anchors defined by subject area, under human oversight at the adjudication layer. The AI does not produce a holistic judgment. It scores each component independently against explicit criteria, with documented reasoning. The adjudicator — which applies human editorial judgment — resolves divergence, enforces the six adjudication principles, and produces the final evaluation. The system is designed so that the rubric does the work, not the model. A different model following the same rubric and calibration anchors should produce convergent scores — and cross-model consistency testing is underway to validate this.

Rubric + Calibration Anchors: defines what is measured and how
AI Reviewer Agents: execute structured scoring per component
Human-Overseen Adjudicator: resolves divergence, applies editorial judgment

“The rubric is the evaluator. The AI is the instrument.”

3. Results

Early validation results from the Nabu evaluation corpus. Additional validation studies are in progress.

Result 1: Inter-rater reliability

Figure: inter-rater reliability on a 0.0 to 1.0 scale

Human peer review (meta-analysis mean): 0.34
Nabu (primary reviewers, ICC₂): 0.81
Nabu (with adjudication, ICC₂): 0.89
Good reliability threshold: 0.70

Nabu's primary reviewers achieve an ICC₂ of 0.81 (absolute agreement) on the composite quality score — more than double the published meta-analytic benchmark for human peer review (0.34, across 48 studies and 19,443 manuscripts). When the adjudication layer is applied — resolving score divergence on strength of evidence — reliability rises to 0.89.

Human peer review benchmark: Bornmann, Mutz & Daniel (2010). A reliability-generalization study of journal peer reviews: a multilevel meta-analysis. PLOS ONE, 5(12), e14331. doi:10.1371/journal.pone.0014331
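
For readers who want to reproduce a reliability figure on their own data, here is a minimal sketch. It assumes that ICC₂ refers to the two-way random-effects, absolute-agreement, single-rater coefficient, often written ICC(2,1); the source does not say whether the single-rater or average-rater form is reported. The ratings are illustrative and are not Nabu data.

```python
def icc2_single(ratings: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` has one row per paper and one column per reviewer.
    Computed from the two-way ANOVA mean squares:
        (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 papers and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between papers
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Six papers rated by four raters; illustrative values only.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_single(ratings), 2))  # 0.29
```

Absolute agreement penalises systematic differences between raters (one reviewer scoring consistently higher than another), which is why the rater mean square MSC appears in the denominator.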

Result 2: Score distribution

Distribution of quality scores across the non-retracted evaluation corpus. The majority of evaluated papers score between 3.5 and 3.9, the lower part of the Very Good band in the calibration table. No papers have scored Exceptional: the standard is intentionally conservative and calibrated to the published literature, not to internal benchmarks. The rubric discriminates: scores are distributed, not clustered.

Result 3: Retraction detection

8 of 9 retracted papers correctly identified as fundamentally flawed

9 papers with confirmed retracted status evaluated blind
8 of 9 scored in the bottom quality tier (Poor, 0.0–1.9)
8 of 9 flagged Red for reliability concerns by all three reviewers independently
Mean quality score across retracted papers: 0.74 / 5.0
1 of 9 correctly scored higher — a paper analysing retraction patterns, not itself retracted for methodological failure

Named example

Surgisphere / Lancet HCQ Paper

Published in The Lancet. Retracted June 2020.

1.4 / 5.0 · Red flag

Rejected by all 3 reviewers independently — with no knowledge of the retraction.

The rubric identified what post-publication scrutiny later confirmed. In each case, the low scores were driven by specific, documented methodological concerns — not by a generic “this seems bad” signal. The one paper that scored higher was correctly assessed: it was a study about retraction patterns, not a paper retracted for methodological failure.

Validation roadmap

Systematic retraction corpus (n=50+): In development
Impact Potential longitudinal validation: In development
Cross-model consistency analysis: Collecting data
Expert calibration panels: Planned

4. What this means in practice

For researchers

Compare across journals on equal terms.

A paper in PLOS ONE and a paper in Nature are evaluated with the same rubric, the same blinding, the same adjudication. The score reflects the work.

Read a structured assessment before the full paper.

Dimension-level scores and rationale give you a summary of where a paper is strong and where it has gaps — before you invest time in a full read. For systematic reviewers screening hundreds of papers, this changes the workflow.

Build a quality record that travels.

Your Nabu profile aggregates structured quality signals across your publications, independent of venue prestige. Useful for grant applications, hiring panels, and promotion cases.

For evaluators

Replace subjective impressions with structured evidence.

When you see a score of 3.8 on Craft, you know it means the methodology meets field standards with specific strengths and identified gaps — documented in the rationale. That is a defensible basis for an evaluation conversation.

Turn evaluation into feedback.

Dimension-level scores make it possible to say "your methodological rigour is strong, your contextual positioning needs work" — not just "your h-index is below the threshold." Evaluation becomes developmental, not just gatekeeping.

Meet your DORA commitment operationally.

If your institution has signed DORA or CoARA, you have committed to evaluating research on intrinsic merit. Nabu provides the quality signals to make that commitment real.

5. Commitments

Open methodology

The rubric, weights, scoring criteria, and adjudication principles are published in full. Evaluate the evaluator.

Full blinding

No author, journal, or institution signals enter the evaluation. Publication year is retained solely for era-appropriate calibration.

No unpaid labour

Evaluations are performed by dedicated agents, not by unpaid academic reviewers.

No conflicts of interest

Nabu has no relationship with publishers, journals, or institutions being evaluated. Reviewers have no career incentive tied to the scores they produce. The evaluation is structurally independent — there is no scenario in which scoring a paper higher or lower benefits Nabu commercially.

Venue-independent

The same standard everywhere. A paper is a paper.

Living evaluations

Post-publication signals (replications, retractions, method validity, practitioner feedback) update the reliability layer over time.

DORA and CoARA alignment

Paper-level, methodology-based, venue-independent. This is what 3,000+ institutions signed up for.

For the first time, I can point to a structured assessment of my work that doesn’t reduce everything to which journal accepted it. The dimension-level breakdown is genuinely useful for understanding where my methodology is strong and where I need to improve.

— Postdoctoral Researcher, Biomedical Sciences

We’ve been DORA signatories for three years, but we had nothing to replace Impact Factor with in practice. Nabu gives us structured, article-level quality signals that make our evaluation panels defensible.

— Vice-Dean Research, Faculty of Social Sciences

The standardised rubric across methodology types is what sold me. When I’m screening 200 papers for a systematic review, having a consistent quality signal — especially on Craft — saves weeks of full-text assessment.

— Senior Research Fellow, Evidence Synthesis