How Nabu Evaluates Research
A structured, rubric-based quality assessment framework for scientific publications
Nabu Science · v1.0 · March 2026 · Open methodology
Status: Peer review invited.
Cite as: Nabu Science (2026). Nabu Evaluation Framework v1.0. nabu.science/methodology
Current research evaluation relies on venue-based proxies — journal impact factor, citation counts — that measure attention rather than quality. Over 3,000 institutions have signed DORA and CoARA commitments to evaluate research on intrinsic merit, but lack operational measurement infrastructure to do so. Nabu addresses this gap with a structured, rubric-based quality assessment framework. Every paper is scored across four weighted dimensions (Contribution, Craft, Clarity, Context) by three blinded reviewers, with adjudicated scoring and documented rationale. In validation testing, the framework achieves an inter-rater reliability (ICC₂) of 0.81 — more than double the published benchmark for human peer review (0.34). Blind evaluation of retracted papers correctly identified 8 of 9 as fundamentally flawed. The framework, rubric, and validation data are published here in full.
1. The measurement gap in research evaluation
Most research evaluation systems measure reputation, not quality. Citation counts reward visibility. Journal impact factor scores the venue, not the work. Qualitative peer review — where it happens at all — is inconsistently applied, rarely structured, and produces judgments that two reviewers would often disagree on.
This is not a new observation. The San Francisco Declaration on Research Assessment (DORA, 2012) and the Coalition for Advancing Research Assessment (CoARA, 2022) represent formal commitments by over 3,000 institutions to abandon venue-based metrics in favour of content-based evaluation. But thirteen years after DORA, the infrastructure to deliver on that commitment does not exist. Institutions are publicly reform-aligned, privately still using Impact Factor — because there is nothing operational to replace it with.
Nabu is built to fill that gap: a standardised, auditable, paper-level quality assessment framework that evaluates the work, not the wrapper.
What gets measured today vs. what should be measured
| Current practice | Nabu |
|---|---|
| Journal Impact Factor: measures venue prestige | 4C quality score: measures intrinsic merit |
| Citation count: measures attention | Impact Potential: measures likely significance |
| h-index: measures career volume | Reliability Layer: monitors post-publication evidence |
| Peer review: unstructured, variable, often single-reviewer | Multi-reviewer adjudication: structured, blinded, documented |
2. Evaluation framework
2.1 The 4C Framework
Every paper is scored against four dimensions, weighted to reflect their contribution to scientific quality. Craft carries the highest weight (45%) because methodological rigour is the foundation of reliable findings. Each dimension is scored holistically against a small set of guiding criteria rather than a long list of individually scored sub-components.
| Weight | Dimension | Guiding question | Guiding criteria |
|---|---|---|---|
| 25% | Contribution | What does it add? | 1.1 Advance · 1.2 Claim-evidence proportionality |
| 45% | Craft | How well was it done? | 2.1 Design-execution fit · 2.2 Analytical soundness · 2.3 Methodological transparency |
| 10% | Clarity | How well is it told? | 3.1 Argument structure · 3.2 Precision of language |
| 20% | Context | How well does it sit? | 4.1 Engagement with prior work · 4.2 Honest positioning |
Quality score calibration:
| Range | Label | Meaning |
|---|---|---|
| 4.5–5.0 | Exceptional | Reference-quality work with strong evidence across dimensions. |
| 3.5–4.4 | Very Good | Clearly strong work with limited, non-fundamental weaknesses. |
| 2.5–3.4 | Good | Solid work with identifiable strengths and some meaningful gaps. |
| 1.5–2.4 | Acceptable | Minimum threshold met; notable concerns reduce confidence. |
| 0.0–1.4 | Poor | Fundamental concerns that materially limit reliability. |
Scores are calibrated against published work at the time of publication — not against current standards. II (Insufficient Information) is used for non-reportable components; weight is redistributed, not penalised.
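To make the weighting concrete, here is a minimal sketch of how a weighted 4C composite could be computed, including weight redistribution when a dimension is marked II. The weights come from the rubric above; the function name and the proportional redistribution rule are illustrative assumptions, not Nabu's published implementation.

```python
# Minimal sketch of a weighted 4C composite. Weights follow the published
# rubric; the redistribution rule for II (Insufficient Information) is an
# illustrative assumption, not Nabu's published implementation.

WEIGHTS = {"Contribution": 0.25, "Craft": 0.45, "Clarity": 0.10, "Context": 0.20}

def composite_score(scores: dict[str, float | None]) -> float:
    """Combine dimension scores (0-5) into a weighted composite.

    A score of None marks II (Insufficient Information): its weight is
    redistributed proportionally across the scored dimensions rather
    than counted against the paper.
    """
    scored = {d: s for d, s in scores.items() if s is not None}
    if not scored:
        raise ValueError("No scorable dimensions")
    total_weight = sum(WEIGHTS[d] for d in scored)
    return sum(WEIGHTS[d] * s for d, s in scored.items()) / total_weight

# Example: Clarity is non-reportable, so its 10% is redistributed.
print(round(composite_score(
    {"Contribution": 3.5, "Craft": 4.0, "Clarity": None, "Context": 3.0}), 2))
```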
2.2 Methodology Modules
Within the Craft dimension, methodology-specific criteria are applied through dedicated modules. An RCT is evaluated differently from a qualitative study, because it should be. Six modules cover the primary research designs.
2.3 Reliability Screen
Before any dimension is scored, every paper passes through an independent reliability screen. This is a pre-scoring gate — not a dimension. Red does not mean “bad research.” It means “proceed with caution — specific concerns identified.”
- Green — No concerns. Scores reflect merit.
- Amber — Watch-outs noted. Does not change interpretation of core findings.
- Red — Concern that could change interpretation. Relevant scores are capped (a minimal sketch of this gate follows below).
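The sketch below shows one way a Red screen could cap affected scores while leaving Green and Amber evaluations untouched. The cap value, the set of affected dimensions, and the names used are assumptions for illustration only.

```python
# Illustrative sketch of the pre-scoring reliability gate. The cap value
# and which dimensions are capped are assumptions, not the published rule.

from enum import Enum

class Screen(Enum):
    GREEN = "no concerns"
    AMBER = "watch-outs noted"
    RED = "concern that could change interpretation"

def apply_screen(scores: dict[str, float], screen: Screen,
                 affected: set[str], cap: float = 2.4) -> dict[str, float]:
    """Return scores after the reliability screen.

    Green and Amber leave scores untouched; Red caps the dimensions named
    in `affected` at `cap` so the concern is reflected in the record.
    """
    if screen is not Screen.RED:
        return dict(scores)
    return {d: min(s, cap) if d in affected else s for d, s in scores.items()}

capped = apply_screen({"Craft": 4.2, "Contribution": 3.8}, Screen.RED, {"Craft"})
print(capped)  # {'Craft': 2.4, 'Contribution': 3.8}
```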
2.4 Impact Potential (separate score)
Quality and impact are measured separately, always. A perfectly executed study of a trivial question scores high on quality, low on impact. Nabu never blends these into a single number.
Worked example: Quality Score 3.9 · Impact Potential 4.1. Two scores, always separate.
Four IP components (the 4Ts):
- Traction
- Translation
- Transferability
- Trajectory
Impact Potential calibration:
| Range | Label | Calibration Anchor |
|---|---|---|
| 4.5–5.0 | High | Directly addresses a documented need with clear stakeholder pathway. |
| 3.5–4.4 | Strong | Clear real-world relevance with a plausible pathway to use. |
| 2.5–3.4 | Moderate | Meaningful potential, but the pathway to use is only partly defined. |
| 1.5–2.4 | Limited | Narrow or early-stage pathway requiring substantial additional work. |
| 0.0–1.4 | Minimal | No clear pathway to application, policy influence, or practical use yet. |
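The separation is easiest to see as a data structure: a minimal, hypothetical record that stores the quality score and the Impact Potential components side by side, with no field that blends them. The class and field names are illustrative, not Nabu's schema.

```python
# Sketch of an evaluation record that keeps the two scores separate.
# Field names are illustrative, not Nabu's published schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class ImpactPotential:
    traction: float        # the 4Ts, each scored 0-5
    translation: float
    transferability: float
    trajectory: float

@dataclass(frozen=True)
class Evaluation:
    quality_score: float               # 4C composite, 0-5
    impact_potential: ImpactPotential  # reported alongside, never merged
    # Deliberately no property that blends the two into one number.
```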
2.5 Evaluation Pipeline
Each paper is evaluated by three independent, blinded reviewers using the same rubric. Calibration anchors defined by subject area (using the OECD Fields of Science classification) set the field-specific standards against which each component is scored. Score divergence across all dimensions is flagged in a divergence map, and an adjudicator resolves disagreements on strength of evidence, applying six principles. The primary reviewers alone achieve an inter-rater reliability (ICC₂) of 0.81; with the adjudication layer, this rises to 0.89, compared to a meta-analytic benchmark of 0.34 for human peer review.
Reviewer A · Reviewer B · Reviewer C (AI-Human hybrid agents, fully blinded; ICC₂ = 0.81) → Divergence Map (score gaps flagged) → Adjudicator (resolves on strength of evidence; ICC₂ = 0.89) → Final Evaluation
Six adjudication principles:
1. Evidence over assertion
2. Methodology over narrative
3. Specific over vague
4. Conservative when uncertain
5. Era-appropriate standards
6. No inference beyond what is reported
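A minimal sketch of the divergence map may help: given three reviewers' dimension scores, flag any dimension whose score gap exceeds a threshold and hand it to the adjudicator. The 1.0-point threshold and the data layout are assumptions for illustration; the published pipeline defines divergence in its own terms.

```python
# Illustrative divergence map: flag dimensions where the three reviewers'
# scores diverge by more than a threshold. The 1.0-point threshold is an
# assumption for illustration, not Nabu's published rule.

DIMENSIONS = ("Contribution", "Craft", "Clarity", "Context")

def divergence_map(reviews: list[dict[str, float]],
                   threshold: float = 1.0) -> dict[str, float]:
    """Return {dimension: max score gap} for gaps that exceed the threshold."""
    gaps = {}
    for dim in DIMENSIONS:
        scores = [r[dim] for r in reviews]
        gap = max(scores) - min(scores)
        if gap > threshold:
            gaps[dim] = gap
    return gaps

reviews = [
    {"Contribution": 3.5, "Craft": 4.0, "Clarity": 3.0, "Context": 3.5},  # Reviewer A
    {"Contribution": 3.0, "Craft": 2.5, "Clarity": 3.5, "Context": 3.5},  # Reviewer B
    {"Contribution": 3.5, "Craft": 3.5, "Clarity": 3.0, "Context": 4.0},  # Reviewer C
]
print(divergence_map(reviews))  # {'Craft': 1.5} -> sent to the adjudicator
```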
2.6 The role of AI in the evaluation pipeline
Nabu's reviewers are AI-Human hybrid agents — large language models operating within a structured rubric, with calibration anchors defined by subject area, under human oversight at the adjudication layer. The AI does not produce a holistic judgment. It scores each component independently against explicit criteria, with documented reasoning. The adjudicator — which applies human editorial judgment — resolves divergence, enforces the six adjudication principles, and produces the final evaluation. The system is designed so that the rubric does the work, not the model. A different model following the same rubric and calibration anchors should produce convergent scores — and cross-model consistency testing is underway to validate this.
Rubric + Calibration Anchors (defines what is measured and how) → AI Reviewer Agents (execute structured scoring per component) → Human-Overseen Adjudicator (resolves divergence, applies editorial judgment)
“The rubric is the evaluator. The AI is the instrument.”
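One way to picture that principle is to treat the rubric and calibration anchors as explicit data that any scoring agent consumes, one criterion at a time, with a required rationale. The sketch below is a hypothetical schema in that spirit; none of the names reflect Nabu's internal implementation.

```python
# Sketch of "the rubric is the evaluator": criteria and field-specific
# anchors are explicit data, and any reviewer agent scores each criterion
# against them with documented reasoning. Names are illustrative only.

from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    code: str                 # e.g. "2.1 Design-execution fit"
    prompt: str               # what the reviewer must assess
    anchors: dict[int, str]   # field-specific descriptions for scores 1-5

@dataclass(frozen=True)
class ComponentScore:
    criterion: str
    score: float
    rationale: str            # documented reasoning, required for every score

def review(criteria: list[Criterion], score_fn) -> list[ComponentScore]:
    """Apply any scoring agent (`score_fn`) criterion by criterion.

    Swapping the underlying model swaps `score_fn`; the criteria and
    anchors stay fixed, which is what convergent scores depend on.
    """
    return [score_fn(c) for c in criteria]
```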
3. Results
Early validation results from the Nabu evaluation corpus. Additional validation studies are in progress.
Result 1: Inter-rater reliability
| Evaluation approach | ICC₂ |
|---|---|
| Human peer review (meta-analytic mean) | 0.34 |
| Nabu, primary reviewers | 0.81 |
| Nabu, with adjudication | 0.89 |
Nabu's primary reviewers achieve an ICC₂ of 0.81 (absolute agreement) on the composite quality score — more than double the published meta-analytic benchmark for human peer review (0.34, across 48 studies and 19,443 manuscripts). When the adjudication layer is applied — resolving score divergence on strength of evidence — reliability rises to 0.89.
Human peer review benchmark: Bornmann, Mutz & Daniel (2010). A reliability-generalization study of journal peer reviews: a multilevel meta-analysis. PLOS ONE, 5(12), e14331. doi:10.1371/journal.pone.0014331
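For readers who want to check the statistic itself: ICC₂ here refers to the standard two-way random-effects, absolute-agreement coefficient, ICC(2,1) in Shrout & Fleiss (1979). The sketch below computes it from an n-papers by k-raters score matrix; it is not Nabu's analysis code, and the example data are invented.

```python
# ICC(2,1): two-way random effects, single rater, absolute agreement,
# computed from an (n_papers x n_raters) score matrix via the Shrout &
# Fleiss (1979) mean-square formula. Example data are made up.

import numpy as np

def icc2_1(scores: np.ndarray) -> float:
    """scores: shape (n_targets, k_raters), one composite score per cell."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    col_means = scores.mean(axis=0)

    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((scores - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-paper variance
    ms_cols = ss_cols / (k - 1)                 # between-rater variance
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual variance

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Three reviewers' composite scores on five invented papers.
scores = np.array([
    [3.5, 3.7, 3.4],
    [2.1, 2.4, 2.0],
    [4.2, 4.0, 4.3],
    [1.0, 1.3, 0.9],
    [3.0, 3.3, 2.8],
])
print(round(icc2_1(scores), 2))
```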
Result 2: Score distribution
Distribution of quality scores across the non-retracted evaluation corpus. The majority of evaluated papers score in the 3.5–3.9 band. No papers have scored Exceptional — the standard is intentionally conservative and calibrated to the published literature, not to internal benchmarks. The rubric discriminates: scores are distributed, not clustered.
Result 3: Retraction detection
8 of 9 retracted papers correctly identified as fundamentally flawed.
- 9 papers with confirmed retracted status evaluated blind
- 8 of 9 scored in the bottom quality tier (Poor)
- 8 of 9 flagged Red for reliability concerns by all three reviewers independently
- Mean quality score across retracted papers: 0.74 / 5.0
- 1 of 9 correctly scored higher — a paper analysing retraction patterns, not itself retracted for methodological failure
Named example: the Surgisphere / Lancet HCQ paper. Published in The Lancet; retracted June 2020. Rejected by all three reviewers independently, with no knowledge of the retraction.
The rubric identified what post-publication scrutiny later confirmed. In each case, the low scores were driven by specific, documented methodological concerns — not by a generic “this seems bad” signal. The one paper that scored higher was correctly assessed: it was a study about retraction patterns, not a paper retracted for methodological failure.
Validation roadmap
| Validation study | Status |
|---|---|
| Systematic retraction corpus (n = 50+) | In development |
| Impact Potential longitudinal validation | In development |
| Cross-model consistency analysis | Collecting data |
| Expert calibration panels | Planned |
4. What this means in practice
For researchers
Compare across journals on equal terms.
A paper in PLOS ONE and a paper in Nature are evaluated with the same rubric, the same blinding, the same adjudication. The score reflects the work.
Read a structured assessment before the full paper.
Dimension-level scores and rationale give you a summary of where a paper is strong and where it has gaps — before you invest time in a full read. For systematic reviewers screening hundreds of papers, this changes the workflow.
Build a quality record that travels.
Your Nabu profile aggregates structured quality signals across your publications, independent of venue prestige. Useful for grant applications, hiring panels, and promotion cases.
For evaluators
Replace subjective impressions with structured evidence.
When you see a score of 3.8 on Craft, you know it means the methodology meets field standards with specific strengths and identified gaps — documented in the rationale. That is a defensible basis for an evaluation conversation.
Turn evaluation into feedback.
Dimension-level scores make it possible to say "your methodological rigour is strong, your contextual positioning needs work" — not just "your h-index is below the threshold." Evaluation becomes developmental, not just gatekeeping.
Meet your DORA commitment operationally.
If your institution has signed DORA or CoARA, you have committed to evaluating research on intrinsic merit. Nabu provides the quality signals to make that commitment real.
5. Commitments
Open methodology
The rubric, weights, scoring criteria, and adjudication principles are published in full. Evaluate the evaluator.
Full blinding
No author, journal, or institution signals enter the evaluation. Publication year is retained solely for era-appropriate calibration.
No unpaid labour
Evaluations are performed by dedicated agents, not by unpaid academic reviewers.
No conflicts of interest
Nabu has no relationship with publishers, journals, or institutions being evaluated. Reviewers have no career incentive tied to the scores they produce. The evaluation is structurally independent — there is no scenario in which scoring a paper higher or lower benefits Nabu commercially.
Venue-independent
The same standard everywhere. A paper is a paper.
Living evaluations
Post-publication signals (replications, retractions, method validity, practitioner feedback) update the reliability layer over time.
DORA and CoARA alignment
Paper-level, methodology-based, venue-independent. This is what 3,000+ institutions signed up for.
For the first time, I can point to a structured assessment of my work that doesn’t reduce everything to which journal accepted it. The dimension-level breakdown is genuinely useful for understanding where my methodology is strong and where I need to improve.
— Postdoctoral Researcher, Biomedical Sciences
We’ve been DORA signatories for three years, but we had nothing to replace Impact Factor with in practice. Nabu gives us structured, article-level quality signals that make our evaluation panels defensible.
— Vice-Dean Research, Faculty of Social Sciences
The standardised rubric across methodology types is what sold me. When I’m screening 200 papers for a systematic review, having a consistent quality signal — especially on Craft — saves weeks of full-text assessment.
— Senior Research Fellow, Evidence Synthesis