Reliability and Validity

What reliability, validity, norms, and measurement error mean in assessment, and how FermatMind should state evidence cautiously.

Source
science-contentpage-en-review-draft-2026-06-09/pages/03-reliability-validity-content-en-01.md

Reliability and Validity Are Not the Same Thing

Many users ask whether a test is "accurate." In assessment, that question cannot be answered with one word. At minimum, reliability and validity need to be separated.

Reliability concerns whether results are stable. For example, under similar conditions, do items of the same kind produce relatively consistent results? If the same person answers again within a short period, are the results broadly similar?

Validity concerns whether an assessment is actually measuring what it claims to measure. For example, is a career-interest assessment observing interest rather than ability? Is a personality dimension explaining behavioral tendency rather than moral value?

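An assessment can produce stable results without being valid, for instance by consistently measuring something other than its stated target. It can also have a clearly defined target but produce results that are not stable enough to interpret. Reliability and validity therefore need to be considered separately.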

Common Reliability Questions

Reliability is not a single indicator. Commonly discussed forms include internal consistency, test-retest stability, and scoring consistency.

Internal consistency asks whether items under the same dimension are broadly observing related content. Test-retest stability asks whether results stay relatively stable after some time. Scoring consistency is more relevant when human scoring is involved.
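As a rough illustration of how the first two ideas are often quantified, the sketch below computes Cronbach's alpha (a common internal-consistency statistic) and a Pearson correlation between two sittings (a common test-retest statistic). The function names and all data are made up for this example. They are assumptions, not FermatMind items, scores, or methods, and the printed values say nothing about any real test.

```python
from math import sqrt
from statistics import mean, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one dimension.

    responses: one row per respondent, one column per item.
    """
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item (column) and of the per-person total score.
    item_variances = [variance(column) for column in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson_r(first, second):
    """Pearson correlation between two equally long lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_1, mean_2 = mean(first), mean(second)
    covariation = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    spread_1 = sqrt(sum((a - mean_1) ** 2 for a in first))
    spread_2 = sqrt(sum((b - mean_2) ** 2 for b in second))
    if spread_1 == 0 or spread_2 == 0:
        raise ValueError("scores do not vary; correlation is undefined")
    return covariation / (spread_1 * spread_2)


# Hypothetical toy data: 5 respondents answering 3 items of one dimension.
toy_responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 3],
]
# Hypothetical total scores for the same 5 people at two sittings.
toy_first_sitting = [sum(row) for row in toy_responses]
toy_second_sitting = [12, 8, 14, 6, 9]

alpha = cronbach_alpha(toy_responses)
retest_r = pearson_r(toy_first_sitting, toy_second_sitting)
print(f"alpha (toy data): {alpha:.2f}")
print(f"test-retest r (toy data): {retest_r:.2f}")
```

Both statistics depend on the sample they are computed from, so a value obtained this way describes one dataset under one set of conditions, not the test in general.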

Current public documentation does not provide specific FermatMind reliability numbers. Without public and reviewable materials, no number should be written as a validated conclusion.

Common Validity Questions

Validity also has several layers. Content validity asks whether items reasonably cover the target construct. Structural validity asks whether the structure of the results matches the structure the model assumes, for example the expected number of dimensions. Criterion-related validity asks whether results have a reasonable relationship with external indicators.

For example, if a career-interest assessment claims to explain work-environment preferences, it should organize content around activity preferences, environment preferences, and career exploration. It should not imply that it can judge ability or predict high-stakes career outcomes.

Current public documentation does not provide specific validity numbers, sample ranges, or norm information for FermatMind tests. Before any specific number is published, science review and legal/compliance review are required.

Why Norms and Comparison Groups Matter

Some assessment results need a reference group for interpretation. A score by itself may not mean much. The important question is: compared with what kind of group, on which test version, and in which language?

If the sample, language version, and applicable range are not explained, users should not read a result as a universal ranking. When current public documentation does not provide specific norms or sample sizes, those fields should remain Unknown.
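The rule above can be sketched in code: a percentile rank is only computed when a documented reference group exists, and the result stays Unknown otherwise. The `NormGroup` class, its fields, and `percentile_rank` are hypothetical names invented for this example, and the scores are made up. Nothing here describes an actual FermatMind data model or norm sample.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class NormGroup:
    """Hypothetical description of a reference group."""
    description: str          # who was sampled
    test_version: str         # which version of the test they took
    language: str             # which language version they took
    scores: Sequence[float]   # their scores on the same scale


def percentile_rank(score: float, norm: Optional[NormGroup]) -> Union[float, str]:
    """Percentile rank of a score within a norm group, or "Unknown"."""
    if norm is None or len(norm.scores) == 0:
        # No documented comparison group: do not invent a ranking.
        return "Unknown"
    below = sum(1 for s in norm.scores if s < score)
    equal = sum(1 for s in norm.scores if s == score)
    # Mid-rank convention: ties count as half below, half above.
    return 100 * (below + 0.5 * equal) / len(norm.scores)


# Made-up reference group, for illustration only.
toy_norm = NormGroup(
    description="illustrative sample, not real data",
    test_version="example-v1",
    language="en",
    scores=[10, 20, 30, 40],
)

print(percentile_rank(30, toy_norm))
print(percentile_rank(30, None))
```

Keeping the group description, version, and language next to the scores makes it harder to report a percentile without also stating what it is relative to.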

Error Is Not Failure, But It Must Be Acknowledged

Assessment includes error. Error can come from item interpretation, response state, language differences, cultural experience, recent events, and model limitations. Acknowledging error does not remove the value of assessment; it tells users how results should be used.

A more appropriate approach is to treat results as reference clues and continue testing them through real experience and observation. Major decisions should not rely on one assessment result alone.
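One conventional way to express the error discussed above is the standard error of measurement from classical test theory, SEM = SD × sqrt(1 − reliability), which can be turned into a band around an observed score. The sketch below is a minimal illustration under that assumption. The function names and the example numbers (a standard deviation of 10 and a reliability of 0.84) are invented, and when reliability is not documented the band is reported as Unknown rather than guessed.

```python
from math import sqrt


def standard_error_of_measurement(score_sd, reliability):
    """SEM = SD * sqrt(1 - reliability); None if an input is not documented."""
    if score_sd is None or reliability is None:
        return None
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * sqrt(1 - reliability)


def score_band(score, score_sd, reliability, z=1.96):
    """Approximate band around an observed score, or "Unknown".

    z=1.96 corresponds to a roughly 95% band under a normality assumption.
    """
    sem = standard_error_of_measurement(score_sd, reliability)
    if sem is None:
        return "Unknown"
    return (score - z * sem, score + z * sem)


# Invented numbers, for illustration only.
low, high = score_band(score=50, score_sd=10, reliability=0.84)
print(f"observed 50, band roughly {low:.2f} to {high:.2f}")

# With no documented reliability, no band is claimed.
print(score_band(score=50, score_sd=10, reliability=None))
```

A symmetric band around the observed score is a common simplification. It shows why two nearby scores may not reflect a real difference, which is one reason a single result should be read as a clue and not as a verdict.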

FermatMind's Current Evidence-Wording Principle

When public validation materials are not available, FermatMind should not claim that a test is absolutely accurate, absolutely authoritative, already proven at scale, or able to predict outcomes. A more appropriate statement is that the test is intended to support self-observation, while specific reliability, validity, sample, and norm information remain Unknown in current public documentation.

Frequently Asked Questions

Does high reliability mean a test is valid?

Not necessarily. Reliability means results are relatively stable, while validity asks whether the assessment measures what it claims to measure. They need to be considered separately.

What is internal consistency?

Internal consistency asks whether items under the same dimension are observing related content. It is one aspect of reliability, not the whole picture.

Can I read results without norm data?

Yes, as a self-observation reference, but the score should not be treated as a universal ranking or strict comparison. Missing norm data should remain Unknown.

Does FermatMind currently publish specific reliability and validity numbers?

Current public documentation does not provide specific numbers. Materials that have not been reviewed should not be written as validated conclusions.

Does measurement error mean the assessment is useless?

No. Error means results have interpretation boundaries. An assessment can still help users ask questions and reflect, but it should not be the only basis for judgment.