Multidimensional Quality Metrics Definition

This document is a draft of the MQM specification. It is subject to frequent and substantial revision and should not be relied upon for implementation.

Feedback

Overview

This document defines the Multidimensional Quality Metrics (MQM) framework. It contains a description of the issue types, scoring mechanism, and markup, as well as informative mappings to various quality systems. MQM provides a flexible framework for defining custom metrics for the assessment of translation quality. These metrics may be considered to be within the same “family” as they draw on a common inventory of values for data categories and a common structure. MQM supports multiple levels of granularity and provides a way to describe translation-oriented quality assessment systems, exchange information between them, and embed that information in XML or HTML5 documents.

1. Introduction (non-normative)

Multidimensional Quality Metrics (MQM) provides a framework for describing and defining quality metrics used to assess the quality of translated texts and to identify specific issues in those texts. It provides a systematic framework to describe quality metrics based on the identification of textual features. This framework consists of the following items:

A vocabulary for categorizing quality issues. This vocabulary was based on an extensive examination of existing translation quality metrics. (3. Issue types (normative))
A scoring mechanism to arrive at quality scores based on either counts of errors or actual error annotations. (6. Scoring (normative))
Markup (7. Markup (normative)) including the following:
- A formal XML mechanism for declaring/describing quality metrics. This vocabulary allows implementers and users to precisely describe metrics and can be used to emulated existing metrics in MQM format. (7.1. MQM metrics description)
- A set of attributes in the mqm: namespace that can be used with XML or HTML5 (with appropriate adjustments to the HTML5 format) to embed MQM data in these file formats. These attributes are designed to work with Internationalization Tag Set 2.0 (ITS 2.0) localization quality metadata (7.2. MQM inline attributes)
- A set of elements in the mqm: namespace that can be used to insert MQM data into XML files when existing elements do not meet requirements. (7.3. MQM inline elements)
A set of guidelines for selecting issues based on the ASTM F2575:2014 specification (8. Guidelines for selecting issues for use in a metric (non-normative))
A set of informative mappings from existing legacy metrics to MQM that may prove useful for implementers (10. Mappings of existing metrics to MQM (non-normative))

MQM does not define a single metric intended for use with all translations. Instead it adopts the “functionalist” approach that quality can be defined by how well a text meets its communicative purpose. In practical terms, this statement means that MQM is a framework for defining a family of related metrics.

MQM is intended to provide a set of criteria which can be used to assess the quality of translations. While these criteria are intended to promote objectivity in assessment, a certain degree of subjectivity is inherent in assessing translation quality and MQM may not be able to distinguish between high-end translations that meet all specifications (other than to assure that they do, in fact, meet those specifications).

1.1. Scope

This document applies primarily to quality assessment of translated content (and thus to the output of translation systems). It does not apply to assessment of translation processes or projects. Here “translated content” is to be understood broadly to include text, graphics, and any other content which may be translated or adapted for multiple locales (i.e., a combination of language and geographical region). MQM applies to the translation industry, interpreted broadly to include localization (of software and other technical content) and “transcreation” (creative adaptation of content for target audiences and purposes, including but not limited to, adaptation of marketing materials and multi-media content), as well to various types of purely textual translation.

MQM is useful for assessing verifiable qualities of translations. It is not intended to address purely subjective criteria (such as “artistry” or “elegance”) that may be of key importance in some circumstances. Rather, it provides a functional approach to quality that seeks to see whether a translation meets specifications and to identify aspects that may fall short of expectations.

MQM is designed to apply to (monolingual) source texts as well as translated target texts. MQM’s , , , , and branches apply equally to source texts and target texts (although some specific issues within them might apply more to one or the other). The dimension is specific to translated texts (or, more properly, to the relationship between source and target text). While is more likely to apply to target texts, it can apply to source texts. And, finally, the dimension applies solely to source content (many of its issues correspond to specific faults in the target text that can be identified under ).

1.2. Quality assessment, quality assurance, and quality control

Within the translation industry, three terms are used somewhat interchangeably to refer to quality activities: quality assessment, quality assurance, and quality control. However, within broader literature on quality these terms have distinct meanings and should be distinguished:

Quality assessment (sometimes considered synonymous with “quality evaluation”) is the measurement of the extent to which a product complies with quality specifications (see the definition of “quality” below).
Quality assurance refers to ways of preventing mistakes or defects in manufactured products and avoiding problems when delivering solutions or services to customers. Quality assurance relies on continual assessment of quality.
Quality control is the process of checking whether manufactured products meet stated quality specifications. Quality control is a decision process based on the results of quality assessment.

The focus of MQM is on quality assessment, which is essential to quality assurance and quality control. This document does not, however, specify or recommend particular quality assurance or quality control processes. (Note that within the translation industry there is widespread confusion between “quality assessment” and “quality assurance” within the localization industry, partially due to the adoption of the LISA Quality Assurance Model, which actually provided a model for quality assessment.)

5. Issue types (normative)

5.1. MQM issues

5.1.1. List of MQM issues

The full list of MQM issues is maintained in a separate document at .

5.1.2. High-level structure

At the top level, MQM is defined into major dimensions:

: Accuracy issues address the relationship of the target text to the source text and can be assessed only by considering this relationship. Changes in intended meaning, addition and omission of content, and similar issues are considered in it.
: Design includes issues related to the physical presentation of text, typically in a “rich text” or “markup” environment.
: Fluency includes those issues about the linguistic “well-formedness” of the text that can be assessed without regard to whether the text is a translation or not. Most Fluency issues apply equally to source and target texts.
: Internationalization covers areas related to the preparation of the source content for subsequent translation or localization. Internationalization issues may be detected through problems found in the target (particularly from those included in , but an Internationalization audit is generally conducted separately from a general assessment of translation quality.
: Issues in Locale convention relate to the formal compliance of content with locale-specific conventions, such as use of proper number formats. If content is otherwise correctly translated and fluent but violates specific locale expectations (as defined in the translation specifications), it is addressed in this dimension. This dimension does not cover issues related to whether the content itself is appropriate for the locale (these issues are covered under .
: Style issues relate to what is commonly known as “Style”, defined both formally (in style guides) and informally (e.g., a “light style” or an “engaging style”). These issues are closely related to , but are often treated separately by tools and quality processes and so are grouped as a separate dimension in MQM.
: Terminology issues relate to the use of domain- or organization-specific terminology (i.e., the use of words to relate to specific concepts not considered part of general language). Adherence to specified terminology is widely considered an issue of central concern in both translation and content authoring. Issues in this branch should not be used for general language mistranslation (e.g., translations that would not be considered correct under reasonable circumstances), and should be reserved for issues related to terminology (e.g., a translation is reasonable but incorrect in the context of specific technical domain or for a particular organization).
: Verity issues relate to the suitability of content for the target locale and audience. They do not relate to fluency or accuracy since content may be fluently written and accurately translated and still be inappropriate for the target locale or audience. For example, if a text translated for Germans in Germany refers to options available only in the UK, these portions will likely be problematic. For more details on Verity, see the discussion below.
: The Compatibility dimension includes issues taken from legacy metrics that are not considered appropriate for general use in MQM (because they are related to areas not covered by MQM, such as deadlines, software functionality, or physical production). They are included only for compatibility with these older metrics and should not be used for new MQM metrics.
: This dimension is used for issues which cannot be otherwise classified into a dimension of MQM. In cases where an unforeseen issue can be classified as belonging to a dimension, it should be classified in that dimension under the top level or using a custom issue type. In practice Other should be used extremely rarely.

More information on the dimensions and their content can be found in the full list of MQM issues at

5.2. MQM Core

In order to simplify the application of MQM, MQM defines a smaller “Core” consisting of 20 issue types that represent the most common issues arising in quality assessment of translated texts. The Core represents a relatively high level of granularity suitable for many tasks. Where possible, users of MQM are encouraged to use issues from the Core to promote greater interoperability between systems.

The MQM Core can be graphically represented as follows (branches in gray italics represent major branches not included in the MQM core) (available here in SVG format):

The 19 issues are defined in the MQM core as follows:

Definitions for these issues can be found in the list of MQM issue types.

Even the 20 issues of the Core represent more issues than are likely to be checked in any given quality application and users may define subsets of the core for their needs. It is recommended for translation quality assessment tasks that issues contain at least the issue types and if no other more granular types are included.

5.3. User extension

While users are strongly encouraged to limit issue types to pre-defined MQM issues, they may add additional issue types to MQM to meet additional requirements. User-defined issue types MUST include the following information:

Name: A human-readable name for the issue type.
ID: An unprefixed QName that serves as the XML identified of the issue type. The ID MUST begin with x- to indicate that the issue is a user-defined issue. E.g., x-respeaking-error would be a valid ID for a user-defined “respeaking error” but respeaking-error would be invalid.
Parent: The ID value of the parent issue type. The parent may by a predefined MQM issue type or a user-defined issue type.
Definition: A human-readable explanation of the issue type.

User extensions do not provide interoperability between systems and impede the exchange of data. Nevertheless they may be needed to support requirements not anticipated in MQM. Users should tie extensions into the predefined hierarchy using the parent value as much as possible since doing so provides consumers of MQM data with the best guidance in interpreting unknown categories and mapping them to other systems. As with other aspects of MQM, users should limit granularity to the least granular level that meets requirements.

Users who encounter frequent need for custom extensions are encouraged to communicate their requirements to the MQM project for possible inclusion of these types in future versions of MQM.

5.4. Integration with other metrics

In addition to MQM, it may be desirable to use other metrics that cannot be converted to a native MQM representation for various purposes. The key principle in integrating metrics is that they must be scoped to indicate to what MQM content they apply. For example, if a metric assesses only readability, it would be scoped to provide a score for MQM , while a metric that provides a score for “Adequacy” would provide a score for MQM . A metric that provides an undifferentiated “quality” score would take all of an MQM metric as its scope and thus provide an overall score.

Non-MQM scores may be indicated in an MQM report by using the nodeScore and scoreType attributes, which may be appended to any node in the score report.

As the interpretation of any particular metric’s result/score is likely to depend on the specifics of the assessment, MQM can provide no guidance on how to utilize the result/score of non-MQM metrics. Results may be appended to MQM reports at the appropriate nodes in the MQM hierarchy and users may wish to combine these results with the results of MQM-based evaluation, (e.g., through averaging MQM and non-MQM scores normalized on a 1-100 scale). Such combinations are outside the scope of MQM.

As an example, the BLEU metric, an automatic metric for assessing machine translation (MT) quality with respect to human reference translation(s), is widely used in MT research. In the case of BLEU, the scope is global because BLEU provides a single, undifferentiated quality score. A BLEU score would thus be provided as parallel to the overall MQM score (see Section 8. Scoring for a recommended method for generating an MQM score). An implementer could utilize the BLEU score in various ways in conjunction with MQM: e.g., only assessing those translations that obtain a BLEU score over a specific threshold, averaging the BLEU and MQM scores, or using both scores for thresholds.

While the specific use of other scores cannot be mandated, their usage should not conflict with the MQM principles. For example, a metric’s results should not be stated to apply to the branch of MQM if they include the results of an evaluation of whether or not terms have been translated correctly.

6. Markup (normative)

This section describes the MQM declarative markup. Use of the metrics declaration markup is mandatory for declaring an interoperable MQM metric. When used with XML or HTML, it is strongly recommended that the ITS 2.0 Localization Quality Issue data category be used to declare MQM issues in conjunction with the locQualityProfileRef pointing to a valid MQM definition. Note that when implemented with ITS 2.0 quality markup that the requirements for implementing are also mandatory.

6.1. MQM metrics description

MQM provides an XML mechanism for exchanging descriptions of MQM-compliant metrics. MQM metrics description files use the .mqm file name extension. An .mqm file contains a hierarchical list of MQM issue types. This listing MUST conform to the hierarchy of issue types.

The following is an example of a small metric description file with issue names in both English and German. It includes a user-defined extension (x-respeaking) used to identify errors caused when a vocal text being respoken without background noise based on a live audio feed is incorrectly repeated by the person doing the respeaking, leading to a mistranscription.

<?xml version="1.0" encoding="UTF-8"?>
<mqm version="0.9">
  <head>
    <name>Small metric</name>
    <descrip>A small metric intended for human consumption</descrip>
    <version>1.5</version>
    <src>http://www.example.com/example.mqm</src>
  </head>
  <issues>
    <issue type="accuracy" display="no">
      </issue>
      <issue type="omission" weight="0.7"/>
      <issue type="addition"/>
    </issue>
    <issue type="terminology" weight="1.5"/>
        <issue type="style" weight="0.5"/>
    <issue type="fluency" display="no">
      <issue type="spelling"/>
      <issue type="grammar"/>
      <issue type="unintelligible" weight="1.5"/>
    </issue>
    <issue type="x-respeaking" weight="1.5"/>
  </issues>
  <displayNames>
    <displaNameSet lang="en">
      <displayName typeRef="accuracy">Adequacy</displayName>
      <displayName typeRef="terminology">Terminology</displayName>
      <displayName typeRef="omission">Omission</displayName>
      <displayName typeRef="addition">Addition</displayName>
      <displayName typeRef="fluency">Fluency</displayName>
      <displayName typeRef="style">Style</displayName>
      <displayName typeRef="spelling">Spelling</displayName>
      <displayName typeRef="grammar">Grammar</displayName>
      <displayName typeRef="unintelligible">Unintelligible</displayName>
      <displayName typeRef="x-respeaking">Respeaking</displayName>
    </displaNameSet>
    <displayNameSet lang="de">
      <displayName typeRef="accuracy">Genauigkeit</displayName>
      <displayName typeRef="terminology">Terminologie</displayName>
      <displayName typeRef="omission">Auslassung</displayName>
      <displayName typeRef="addition">Ergänzung</displayName>
      <displayName typeRef="fluency">Sprachkompetenz</displayName>
      <displayName typeRef="style">Stil</displayName>
      <displayName typeRef="spelling">Rechtschreibung</displayName>
      <displayName typeRef="grammar">Grammatik</displayName>
      <displayName typeRef="unintelligible">Unverständlich</displayName>
      <displayName typeRef="x-respeaking">Sprecherfehler</displayName>
    </displayNameSet>
  </displayNames>
  <severities>
    <severity id="minor" multiplier="1"/>
    <severity id="major" multiplier="10"/>
    <severity id="critical" multiplier="100"/>
  </severities>
</mqm>

6.2. MQM inline attributes

MQM implements the following attributes in the mqm namespace:

issueType. Contains the MQM issue type, listed by ID. Note: MQM implementations MUST use the ID and MUST NOT use a localized name.
issueSeverity. Contains the MQM issue severity using the name defined in the metric. Note that the default severity level names are minor, major, and critical. While other values MAY be used, if they are used they MUST be defined in the metric definition for proper interpretation.

MQM is designed to be used in conjunction with the following ITS 2.0 attributes from the localization quality issue data category:

locQualityIssueType: Contains the issue type as defined by ITS 2.0. Mapping the native MQM value to the appropriate ITS issue type helps ensure compatibility with ITS 2.0-aware implementations, even if they do not implement MQM.
locQualityIssueComment: Contains a human-readable comment about the issue.
locQualityIssueSeverity: Contains a rating of severity from 0 to 100. Mapping from the name contained in the MQM issueSeverity attribute to this attribute enables ITS 2.0-aware tools to interpret the severity of the issue.

To ensure compatibility with ITS 2.0 markup, implementers SHOULD use ITS 2.0 markup where possible. All of the ITS 2.0 localization quality annotation may be used. MQM markup adds capability to the ITS 2.0 quality markup.

			
<?xml version="1.0"?>
<doc xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0">
<doc xmlns:mqm="[XXXXXXXXXXX]" mqm:version="1.0">
  <para><span 
      mqm:issueType="spelling"
      mqm:issueSeverity="major"
      its:locQualityIssueType="misspelling"
      its:locQualityIssueComment="Should be Roquefort"
      its:locQualityIssueSeverity="50">Roqfort</span> is an cheese</para>
</doc>

To create this markup the following process is followed:

The MQM issue type () is mapped to the corresponding ITS 2.0 type (ITS 2.0 is less fine-grained than MQM in many cases) as described in 8. Relationship to ITS and added as the value of its:locQualityIssueType.
The MQM issue type and severity are declared in the mqm: namespace
The value of the severity multiplier is declared on a scale from 0 to 100 and inserted as the value of the its:locQualityIssueSeverity attribute. In this case the multiplier value was 5 (out of 10), so it is represented as 50 in ITS markup.
A comment is added using the its:locQualityIssueComment attribute.
Globally, the relevant profile (specifications and metric definition) are linked using the its:locQualityProfile attribute.

6.3. MQM inline elements

In general, MQM XML implementations should use existing span-level elements in the native XML format that MQM is being added to where possible. This use can be done using any of the ITS 2.0 methods with the addition of the MQM-specific attributes. However, such elements may not be available. In such cases, MQM defines two elements that can be used to add inline markup:

<mqm:startIssue />. This element defines the starting position of an MQM span.
<mqm:endIssue />. This element defines the end position of an MQM span.

Two empty elements are used so as to prevent any interference between MQM tags and existing XML structure, such as those that could be caused by improperly nested elements. To pair these tags the id attribute is used. ID values MUST be unique within the document to prevent confusion.

An example of an MQM annotation is seen in the following XML snippet:

    <para>“Instead of strengthening
        <mqm:startIssue type="function-words" id="1f59a2" severity="minor" agent="f-deluz" comment="article unneeded here" active="yes"/>
        the<mqm:endIssue idref="1f59a2"/>
        civil society, the president cancels
        <mqm:startIssue type="agreement" severity="major" comment="should be “it”" agent="f-deluz" id="3c469d" active="yes"/>them<mqm:endIssue idref="3c469d"/>
        de facto”, deplores Saeda.
    </para>

The mqm:startIssue element MUST take the following mandatory attributes:

id. Used to match the corresponding mqm:startIssue and mqm:endIssue tags within text
type. Provides the MQM issue type.

The mqm:startIssue element CAN take the following optional attributes:

severity. Permissible values defined by the MQM metric in use. Provides the severity of the issue. Default value is undefined.
agent. Text string identifying the agent that supplied the annotation. Default value is undefined.
comment. Text string containing a human-readable comment attached to an issue. Default value is undefined
active. One of yes or no. Indicates whether the issue is considered active (yes) or inactive (no). Default value is yes. If an issue is marked as inactive, this means that it has either been resolved or determined not to be an actual error.

In addition, ITS 2.0 attributes MAY be added to these elements to promote greater interoperability.

The mqm:endIssue element MUST take the following mandatory attribute:

idref. A value corresponding to the id of the mqm:startIssue tag that begins the identified span.

Use of these inline elements also requires that the mqm namespace be declared in the document. The method for declaring this namespace needs to be determined.

7. Relationship to ITS 2.0 (normative)

The Internationalization Tag Set (ITS) 2.0 specification holds a privileged position with respect to MQM due to its use as a standard format for interchanging localization quality information through its localization quality issue data category.

This section describes the mapping process from MQM to ITS 2.0 and from ITS 2.0 to MQM. As MQM allows the declaration of arbitrary translation quality assessment metrics, it serves a different purpose from ITS, which provides high-level interoperability between different metrics. While ITS is much less granular than the full MQM hierarchy, individual MQM metrics may be either more or less granular than the set of ITS 2.0 localization quality issue types (or may be more granular in some areas and less in other). As a result it is likely that conversion between MQM-based metrics and ITS will be “lossy” to some extent. In general the mapping process from MQM to ITS 2.0 is straight-forward since ITS 2.0 does not allow subsetting of the possible values for localization quality issue type, but the conversion from ITS 2.0 to MQM may be more challenging since an arbitrary MQM metric may or may not contain the default target mappings provided below and mappings may account for the MQM hierarchy.

MQM metrics that map to ITS MUST use the mappings described in this section, subject to the limitations described below.

7.1. MQM-to-ITS mapping

MQM issue types are mapped to ITS issue types according to the following table. Note that this mapping is unambiguous and MUST be followed to ensure consistency between applications.

MQM issue type ITS 2.0 issue type

mistranslation

addition

mistranslation

mistranslation

inconsistent-entities

mistranslation

numbers

mistranslation

omission

mistranslation

untranslated

other (for all children)

formatting

length

formatting

markup

length

formatting

other

other

characters

other

other

non-conformance

duplication

grammar

register (ITS register covers both and )

inconsistency

other

other

characters

other

pattern-problem

other

misspelling

typographical

uncategorized

internationalization (for all subtypes)

locale-violation (for all subtypes)

style

register (ITS register covers both and )

style

terminology (for all subtypes)

other

legal

locale-specific-content

Note that the entire Internationalization branch of MQM maps to the ITS internationalization type. It is anticipated that this mapping will apply to all children of the MQM Internationalization issue type that may be added in the future.

7.2. ITS-to-MQM mapping

Mapping from ITS to MQM is less likely to be used and presents particular problems since MQM metrics typically contain only a small subset of the full MQM issue set. As a result MQM issues to which ITS localization quality issue type values are mapped may not exist in a particular MQM metric. In such cases processes MUST map the ITS value to the closest higher-level issue type in MQM if one exists in the target MQM metric. If no higher-level issue type exists in the target MQM metric, the process MUST skip the ITS 2.0 issue type (but MAY preserve the ITS 2.0 markup).

For example, if a process encounters the ITS 2.0 type and the target MQM metric does not contain but does contain , the ITS omission value would be mapped to MQM . However, if the MQM metric does not contain , the higher node in the MQM hierarchy, the ITS omission issue type would be ignored/omitted by the conversion process.

Note that the above requirements mean that in some cases there may be a many-to-one mapping from ITS to MQM. For example, if a document contains ITS annotations for omission, untranslated, and addition, but the target MQM metric contains and no daughter categories, all of these categories would be mapped to MQM . In other words, there is no universal mapping from ITS to all MQM metrics since MQM metrics do not all contain the same issues.

Processes encountering issues such as those described in the previous paragraphs SHOULD alert the user about the information loss or remapping if user interaction is expected by the process.

In most cases the table shows that the ITS issue types map to MQM issue types with identical (except for casing) or similar names, highlighting the evolutionary relationship between ITS and MQM. Those items where names are different in a non-trivial manner are marked with an asterisk (*) to help draw attention to the fact that the names do not match.

ITS 2.0 Localization Quality Issue type	MQM issue type	Notes
terminology
mistranslation
omission
untranslated
addition
duplication
inconsistency
grammar
legal	*
register	*	Register in ITS can also describe (under ). If a mapping process is sophisticated enough to distinguish the two meanings, it may map to the appropriate issue. Otherwise, use as it is the more common issue
locale-specific-content
locale-violation	*
style
characters	*
misspelling	*
typographical	*
formatting	*
inconsistent-entities	*
numbers	*
markup
pattern-problem
whitespace
internationalization
length
non-conformance	*
uncategorized	*
other

Note that the ITS uncategorized category maps to MQM even though MQM maps to ITS uncategorized. In other words, the mapping is asymmetric because the semantics of uncategorized are broader than .

8. Scoring (non-normative)

The MQM scoring model applies only to error-count implementations of MQM. At present this specification does not define a default scoring model for holistic systems, which are less detailed in nature than error-count metrics. Future versions, however, MAY define a default model to holistic systems.

Note that MQM-conformant tools are NOT required to implement any scoring module at all. For example, an automatic tool that identifies possible issues but which does not determine their severity might not provide a score.

This scoring model provides one method to calculate a single quality score as a percentage value. Such scores are frequently used for acceptance testing in translation quality assurance processes. In addition, it generates sub-scores for various aspects of the both the target and, optionally, the source text. Additional scoring methods may apply to specific circumstances. It is RECOMMENDED, but not required, that implementers of MQM provide scores the conform to this section in addition to any other scores they may provide.

8.1. Default severity levels for error-count metrics

Version 0.3.0 made major changes with respect to severity multipliers. These changes render the default scoring for versions 0.3.0 and later incompatible with earlier versions. Version 0.9.1 introduced a new severity level, none, that always has a penalty of 0, i.e., it does not count against the transaltion, it is used to mark items that should be changed, but which are not considered errors for scoring purposes (see below).

For the purposes of calculating quality scores, the following default values apply:

Weight

All issues have a default weight of 1.0. This weight can be updated on a per-issue basis to reflect specific requirements.

Severity

The default severity levels are defined as follows:

none: 0. Issues with the severity level none are items that need to be noted for further attention or fixing but which should not count against the translation. This severity level can be conceived of as a flag for attention that does not impose a penalty. It should be used for “preferential errors” (i.e, items that are not wrong, per se, but where the reviewer or requester would like to see a different solution), systematic repeated errors that can be easily fixed (e.g., a translator has systematically used an incorrect domain term but it is a simple matter of search and replace to correct them all). Because no penalty is assessed for this level, it is not discussed in the scoring formulae.
minor: 1. Minor issues are issues that do not impact usability or understandability of the content. For example, if an extra space appears after a full stop, this may be considered an error, but does not render the text difficult to use or problematic (even if it should be corrected). If the typical reader/user is able to correct the error reliably and it does not impact the usability of the content, it SHOULD be classified as minor. Since minor errors do not impact the usability of the content, resolution of them is at the discretion of those responsible for the content.
major: 10. Major issues are issues that impact usability or understandability of the content but which do not render it unusable. For example, a misspelled word may require extra effort for the reader to understand the intended meaning, but do not make it impossible. If an error cannot be reliably corrected by the reader/user (e.g., the intended meaning is not clear) but it does not render the content unfit for purpose, it SHOULD be categorized as major. While it is generally advisable to fix major errors prior to use of the content, the inclusion of major errors may not, by themselves, render the text unfit for purpose.
critical: 100. Critical issues are issues that render the content unfit for use. For example, a particularly bad grammatical error that changes the meaning of the text would be considered critical. If the error prevents the reader/user from using the content as intended or if it presents incorrect information that could result in harm to the user it MUST be categorized as critical. In general, critical errors have to be fixed prior to use of the text since even a single critical error is likely to cause serious problems.

8.2 Scoring algorithm

MQM can generate target document quality scores according to the following formula:

TQ = 100 - TP + SP

where:

TQ = quality score: The overall rating of quality
TP = penalties for the target content: Sum of all weighted penalty points assigned to the target text
SP = penalties for the source content: Sum of all weighted penalty points assigned to the target text

All penalties are relative to the sample size (in words) and are calculated as follows (assuming default weights and severity levels):

P = \frac{({Issues}_{minor} + {Issues}_{major} \times {SeverityMultiplier}_{major} + {Issues}_{critical} \times {SeverityMultiplier}_{critical})}{Word count}

where:

${Issues}_{minor}$ = Number of issues with a “minor” severity

${Issues}_{major}$ = Number of issues with a “major” severity

${Issues}_{critical}$ = Number of issues with a “critical” severity

A score can thus be generated through the following (pseudo-code) algorithm:

foreach targetIssue {
	targetIssueTotal = targetIssueTotal +
	(targetIssue * weight[sourceIssueType] * severityMultiplier);
}

foreach sourceIssue {
	sourceIssueTotal = sourceIssueTotal +
	(sourceIssue * weight[sourceIssueType] * severityMultiplier);
}

// Generate overall score
translationQualityScore = 100 - (targetIssueTotal / wordcount) + (sourceIssueTotal / wordcount);

In this algorithm, each issue type has a weight assigned by the metric that is retrieved and used to determine the individual penalties. Penalties are cumulative. Note that if the source is examined, penalties against the source are effectively added to the overall score for the translation, reflecting the fact that they indicate problems in the source the translator had to deal with. If the source is not assessed, the source penalties are by definition 0 and do not count for or against the translation’s quality score.

(Scores can be generated for any dimension or branch in the MQM hierarchy by counting only those issues in that selection. Note that counting source issues is optional and that if a score for a source document is desired then the formula should ignore target issues and instead subtract the total of source issues divided by the wordcount from 100 to arrive at a source content score.)

This algorithm can serve as a model for other systems, such as metrics with two severity levels or those with four. However, using other models will impede comparability of scores generated by various metrics.

8.3. Default severity multipliers from versions earlier than 0.3.0 (deprecated)

The following severity multipliers were recommended as default multipliers prior to version 0.3.0. The former default severity weights were taken from the LISA QA Model and represent common industry practice. Discussion with experts in psychometrics, however, revealed that the range of values was too close to provide sufficient discrimination between relatively insignificant errors and those considered serious enough to reject a project. As these values were implemented in a number of tools they are documented here:

minor: 1
major: 5
critical: 10

Scores using these multipliers can be easily updated to reflect the new values simply by changing the multipliers in the formula. Similarly, new scores can be compared with old scores by using this values in place of the new ones. However, as the old multipliers are deprecated, they SHOULD NOT be used as the default model for any new implementations.

9. Creating MQM metrics (non-normative)

This section describes the process for creating an MQM metric in cases where a suitable predefined metric is not available. The process may be graphically represented as shown below:

In this view, implementers first determine what sort of metric they wish to use (analytic, holistic, task-based testing, functional testing, etc.) based on the following criteria:

What they are assessing. Possible options include a translation product (e.g., a translated text or piece of software), a translation process, or a system for producing translation. Depending on what is being assessed, certain methods are more likely than others. Note that some possibilities, such as assessing a process, may be outside the scope of MQM.
Who does the assessment. Possibile options include expert translators, language scholars, community members, and end users. Some methods will be more suitable than others for potential classes of assessors.
Where (the context in which) the assessment will take place. This question helps users understand the requirements for assessment in the particular context. For example, in an academic or research environment, very detailed feedback may be needed that would not be possible to obtain if the assessment is to be done on texts published on a customer portal.
When the assessment will take place. This question asks users to consider the process step in which assessment will be done. Requirements for assessing an initial/draft translation before revision to determine suitability for further work may be different than those for assessing a translation thought to be ready for publication.
Why the assessment is being done. Assessment may be done for many different reasons. For example, it may be done to determine whether a translation is ready for publication, to fix problems in a translation, or to improve a system. Each of these options suggests different possibilities for assessment.

Based on the answers to the questions given above, users may select a method (the “how”) for assessing the translation. Some of the possible options include the following:

Analytic. Analytic approaches involve the identification of specific issues in text. These approaches are detail-oriented and are particularly suited for the detection and repair of individual errors. This approach is also time- and labor-intensive, although in some cases analytic approached may be partially or entirely automated. Analytic approaches typically require training and are best suited for expert assessors or cases where detailed information is needed to improve systems or processes. They are generally less suitable for assessment by end-user or community members.
Holistic. Holistic approaches consider the text (or portions of the text) as a whole. Rather than identifying specific locations in the text with problems, they evaluate the overall quality of the text with respect to specific criteria. There are two sub-types of holistic assessment methods:
- Simple/mono-dimensional. Simple or mono-dimensional holistic approaches assess the text on a single criterion. For example, a mono-dimensional holistic approach might ask the assessors to rate the translated text on a scale from one to five based on how good a translation they find it, with no other criteria or aspects. These approaches are particularly suitable for community or end-user evaluation where more complex assessment methods would most likely be ignored.
- Complex/multi-dimensional. In complex or multidimensional approaches, assessors are asked to evaluate the text on multiple criteria. Depending on the number of criteria, such assessment may be appropriate for various classes of evaluators, but the more criteria are features, the more training of assessors will be required.
Task-based testing. Task-based testing evaluates how well the translation serves to assist a users to accomplish a task. It is particularly useful for determining problems with textual verity. In a simple form it may be equivalent to a simple holistic approach (e.g., if users are asked “Did this (translated) text answer your question adequately?”). Complex task-based testing is a discipline in its own right and is not generally covered in MQM, although aspects of task-based testing may be related to MQM issue types.
Functional testing. Functional testing evaluates whether a translated produce functions as expected. It is most suitable for evaluating translated applications or other products that combine translation with other functions. Functional testing is beyond the scope of MQM, although, as with usage testing, it may relate to MQM issue types.

In addition to selecting an assessment method based on the answers to the questions on the left of the diagram, users also need to define the specifications (i.e., the values of the parameters) for the translation(s) to be assessed. (The MQM parameters are defined in section 8.2. Definition of MQM parameters below.) Based on the specifications, users decide which dimensions of the text will be assessed. Dimensions defined in MQM are the following:

Accuracy
Design
Fluency
Internationalization
Locale convention
Style
Terminology
Verity

Note that the dimensions correspond to top-level branches in the MQM hierarchy.

Depending upon which dimensions are selected and the degree of granularity required for the assessment task, MQM issues are then selected to ensure that the required dimensions are adequately assessed. In the case of it is likely that different assessment methods will be needed since internationalization cannot generally be assessed from examining texts (versus doing a code audit).

9.1. Example of defining a metric

The following example will help clarify how the process works. The example is for a case in which a company that makes network diagnostic gear wishes to evaluate whether automatic (machine) translations into Japanese of user-generated forum content written in English is helping their Japanese users solve technical problems with their equipment.

Selection of assessment method.
1. What: The company wishes to assess a translation product (the forum content) and also the MT system they are using to translate the content.
2. Who: The company wants to use its customers to evaluate the translation since they are the only ones who can determine whether the content meets their needs.
3. Where: The assessment must be done on the user-to-user forum with end users who are not experts in translation or language and who cannot be trained in advance.
4. When: The assessment will take place after texts have already been published on the website. The texts will be raw MT output with no post-editing or other correction.
5. Why: The assessment will be used to determine if the MT system’s results help users meet their needs or whether more manual processes (e.g., MT + post-editing) are required.
Based on these answers, the company decides to use a holistic assessment method with a low number of dimensions (no more than three).
Creating the specifications. The company fills out a worksheet to define the values for the parameters in their specification (described in section 8.2. Definition of MQM parameters below) and creates a full set of translation specifications.
Selection of Dimensions. Based on their translation specifications they determine that the following dimensions are relevant to this task: Accuracy, Fluency, and Verity. Because of the nature of the assessment method and the assessors, however, the company decides to limit assessment to three issues: Terminology, Fluency, and Verity. Although Accuracy is highly important, they cannot expect their users to understand English well enough to assess the accuracy of translated texts.
Building the metric. Based on the selection of a holistic metric and three issues, the company selects three issue types and implements a metric with three questions on their website at the end of each translated forum entry:
1. Did this answer enable you to solve your problem? (Yes/No) (Addresses Verity)
2. Was this answer grammatically correct? (Yes/No) (Addresses Grammar)
3. Did this answer use the correct words to describe your product and the solution? (Yes/No) (Addresses Terminology)
In addition, because the company realizes that their customers cannot assess some core aspects and help them evaluate their MT system, they decide to create a second, analytic metric for human assessors to check a subset of the output.

Although simple, this example, shows how it is possible to build customized metrics to meet specific requirements using MQM.

9.2. Definition of MQM parameters

MQM makes use of a selection of 11 of the 21 parameters defined in ASTM F2575, with the addition of one additional parameter, Output modality, which is subsumed under Text type in ASTM F2575 but which is broken out in MQM because of its special impact on some translationed. The parameters are defined as follows:

Parameter	Description
1. Language/locale	Definition: The language into which the text is to be translated Note/Explanation: This parameter should specify geographical language variants where appropriate. Examples: the text is to be translated into Swiss German (de-CH) the text is to be translated into Cantonese as spoken in Hong Kong using Traditional Chinese characters (zh-HK-Hant)
2. Subject field/domain	Definition: Subject field(s) (domain(s)) of the source text Note/Explanation: This information should be as specific as possible to assist translation providers in finding the best translators for the job Examples: the text is a specialized text dealing with meteorological science the text is a sixteenth-century legal text regarding fishing rights in the North Sea
3. Teminology	Definition: List of terms or reference to terms to be used Note/Explanation: These terms are domain- or project-specific ones Examples: the requester provides instructions to see a website that defines many of the domain-specific terms in the project the requester states that specialist physics terms are to be used
4. Text type	Definition: The type of the source content Note/Explanation: Needed to locate resources with the appropriate linguistic skills. For example, a translator who specializes in technical translations may not be ideal to translate a compilation of 12th-century religious poems. Note that “Text type” is known as “Form of the text” in ASTM F2575 Examples: user manual literary novel set in medieval Ireland
5. Audience	Definition: The project’s target audience Note/Explanation: The audience should be described or defined as precisely as possible without being too restrictive Examples: business analysts with a background in Russian mineral exploration activity teenage users of tablet computers
6. Purpose	Definition: statement of the purpose or intended use of the translation Note/Explanation: This information is useful in helping the translator decide the appropriate manner in which to translate the text. In some cases the purpose of the translation may differ significantly from the purpose of the source text. Examples: the text is intended for entertainment, to transmit information, or to persuade an audience of a political point the source text was written to convince youth to join a political movement but the translation is to used by foreign journalists to help them understand the goals of this political movement
7. Register	Definition: Description of the linguistic register to be used in the target language Note/Explanation: Register is often difficult to infer from the source text and must be defined on a per-language basis Examples: the text is an informal conversation between friends and should be translated in German using the du form the text is a formal letter to the Hungarian ambassador and should be translated using the Őn pronouns and very formal honorifics, salutations, and grammatical structures
8. Style	Definition: Information about the document’s style. Note/Explanation: Could include formal style guides, references to comparable documents, or other clear indications of style expectations Examples: the text is a promotional piece for investors and style is highly relevant, with the translation trying to capture an air of excitement the text is intended for use by technicians in a service environment and style is considered irrelevant the text is to be published by a press with very specific in-house style rules that must be followed
9. Content correspondence	Definition: Specifies how the content is to be translated Note/Explanation: The default assumption is that text is to be fully translated and adapted to the target locale (a covert, localized translation). In some instances, requesters may ask for partial or summary translations Examples: a British English text should be fully translated into German but all prices should be left in pounds sterling rather than converted to euros a marketing text should be heavily adapted to match target language conventions, with the translator free to rewrite portions as needed to appeal to the audience the text should be translated as a summary that presents the main points but leaves out details
10. Output modality	Definition: Information about the way in which the translated text will be displayed/presented Note/Explanation: This parameter provides information about the specific environments in which the text will be output and any limitations or special requirements they may impose. Examples: the text is to be output as captions on a YouTube video the text will be used in voice prompts for a telephone dialogue system with a female voice reading the prompts the text will be displayed on an embedded LCD screen of a device and is limited to a length of 25 characters
11. File format	Definition: The file format(s) in which the translated content is to be delivered Note/Explanation: It is quite common for the target file format to differ from the source file format Examples: the translator is asked to translate a text in an InDesign file but to return the translation as an RTF text the translator is to return text in Microsoft Word (.docx) format and graphics in layered TIFF format
12. Production technology	Definition: Any technology or software to be used in the translation process Note/Explanation: May be generic or specific as to particular translation tools. Production technology is included, even though it is not a product parameter, because specific technologies may have an impact on likely issues in the target texts they produce. Examples: the project is to be completed using a translation memory tool of the translator’s choice the translation must use TTC TermBase v3

After the values for these parameters are fully specified, MQM implementers should verify that the selection of issue types will ensure that the requirements defined by the parameters are met. Note that parameters may override each other. For example, under Content correspondence the parameters might specify that a “gist” translation is acceptable, in which case would not normally be assessed; however if Audience specifies that the target audience consists of young readers with low literacy, might be assessed to assure that the “simple” style needed for the target audience is achieved.

At this stage in MQM development, there are no normative guidelines for selecting issues. Instead implementers are encouraged to go through each parameter to identify project-relevant issues that will enable them to verify whether the translation meets the requirements set out in those parameters. Future versions of MQM may provide a more formal approach to issue selection.

9.3. Analytic metrics

Analytic metrics are created by making a selection of relevant issues from the listing of MQM issue types. The following procedure may be used to create a metric:

Complete a full set of project specifications, including the 12 MQM parameters. Ensure that all stakeholders are in agreement about the values of the parameters. (Note that the value of some parameters, such as the target language, may change from project to project, so implementers should consider the range of likely values. For example, if a project will be translated into 15 languages, the impact each language might have should be considered.)
For the value of each parameter, consider what features of the text would be needed to verify that the text meets specifications and note these issue types down. Note that “doesn’t matter” is an acceptable value for many parameters and if this value is chosen, the parameter may be skipped. (E.g., if Style is judged to be insignificant, then this parameter will be skipped in assessment.
After deciding what features need to be checked, determine which issue types can be used to assess that feature and note these types.
From the list of issue types, prioritize them based on the importance of each parameter and then make a selection of issue types based on this list and the priorities. (Note that it may be impractical to do fine-grained analysis of every potential issue type identified. Feedback from LSPs suggests that six to seven issue types is sufficient for most assessment tasks, although some use up to twenty.
If a score is to be assigned, assign weights to the issues. Assigning weights is a tricky process and should be done by assessing existing translations deemed to be acceptable, borderline acceptable, and unacceptable to see what impact each issue type has on that judgment. Note that some existing metrics, such as SAE J2450, have predefined weights that should be honored. The default issue weight in MQM is 1.0 and any positive decimal value may be used.
If the resulting metric is to be implemented in an MQM-compliant tool chain, it should be declared as described in Section 7.1. MQM metrics description.

When considering which issues to check, creators of metrics should consider the following practical guidelines:

Are there any requirements for compatibility with legacy systems or standard/semi-standard specifications? If so, choose issue types that correspond to those used by those systems/specifications. In most cases it is possible to emulate legacy metrics in MQM with little or no modification, although some might require the use of custom extensions.
Select the least granular issue types that allow assessment of whether the text meets specifications. For example, in many cases use of the category would be sufficient because it is not particularly relevant to know what subcategory is used. On the other hand, when trying to diagnose problems generated by an MT system, finer-grained types might be necessary.
When possible, choose issues from the MQM Core. Using these issues helps ensure compatibility. However, the Core does not cover all cases, including common ones such as checking formatting, because it is focused on text translations.
Consider not just requirements for one set of specifications/parameters, but also for other likely sets. For example, if two types of translations are frequently assessed, it may make sense to develop one list of issues with different sets of weights and to use the single (master) set of issues. This practice is recommended to prevent the need to train evaluators on multiple metrics.

9.4. Holistic metrics

Holistic assessment methods are more flexible in some respects than error-count metrics. They are designed to provide an assessment of the translated text as a whole rather than a detailed accounting of all errors. As analytic assessment can be time consuming and is not needed in all cases (e.g., when the question is whether a text should be accepted or not), holistic methods may be more appropriate in some cases. Most of the MQM issue types can be easily used as either analytic types or holistic types that apply to the text as a whole. For example, the MQM issue type can be used by asking assessors using a holistic tool whether the text is punctuated correctly. In this context some issues will be more useful than others. For example, the issue type is unlikely to be useful in most holistic assessments since it generally makes sense only with regard to very specific sections of a text. By contrast, categories like can more readily be applied to entire texts.

Note that there is no single method for building holistic scores. In a holistic approach specific issues are addressed through qualitative questions that may be assessed via ranking or on a binary- or scalar-value system. For example, a holistic assessment might address the issue via questions like the following:

The translated text is spelled correctly:
[ ] Yes
[ ] No
The translated text is spelled correctly:
[ ] Strongly disagree
[ ] Disagree
[ ] Neither disagree nor agree
[ ] Agree
[ ] Strongly agree
Does the translated text meet expectations with regard to correct spelling?
[ ] The text does not meet expectations
[ ] The text meets expectations
[ ] The text exceeds expectations

Because the scoring for holistic systems is highly dependent on the type of assessment scale used, no specific scoring system is provided here. Users of MQM who wish to implement it in a holistic environment should tie holistic questions to specific MQM issue types and develop appropriate scoring systems. This version of MQM does not define a system for describing holistic scoring systems, although future versions may do so. However, by using the MQM issue types and associating specific holistic questions with them, implementers can make their metrics more transparent and tie them to project parameters in the same way that can be done with error-count metrics.

The following guidelines may assist in designing appropriate holistic assessments and selecting issue types:

Use the highest-level issues that suffice to provide the needed assessment. For example, “Is the text translated accurately?” (assessing ) is more likely to be answered accurately than “Does the text show any problems related to ‘false friends’?” (assessing ), especially as false friends are less likely to occur than generalized mistranslation. However, for some assessment purposes, fine-grained holistics may be appropriate.
Write holistic statements or questions that clearly address the desired issue type. For example, “Does the text comply with specified guidelines for style?” is more likely to obtain an accurate assessment for than “Does the text read well?”, which is vague in its intent.
Design holistic questions that allow assessors to give credit, not just penalties. Holistic assessment tools often allow assessors to indicate that a text outperforms expectations in order to give credit for jobs well done. Using scalars that give translators credit is in line with the MQM principle of fairness and is roughly analogous to the MQM practice of assigning credits to translators for errors detected in the source.

13. Previous versions (non-normative)

Changes from version 0.9.2 to 0.9.3 [Diffs]

No substantive changes
Changed format for display of administrative data

Changes from version 0.9.1 to 0.9.2 [Diffs]

Removed extraneous definition of “metric” that should have been deleted earlier.
Corrected errors in diagram of DQF

Changes from version 0.9.0 to 0.9.1

Added the “none” severity level for DQF compatibility.

Changes from version 0.3.0 to 0.9.0

Note: major jump in version number reflects major changes to MQM structure.
Major revision with new list of dimensions, new scoring model, improved terms and definitions, additional issue types, revision of mappings to other metrics, etc.
Added design to the core for compatibility with DQF.

Changes from version 0.2.0 to 0.3.0

Old default scoring model deprecated (see new section, 8.2. Default severity multipliers from versions earlier than 0.3.0 (deprecated)) and replaced with new default severity multipliers.
Corrections to version history

Changes from version 0.1.16 to 0.2.0

Added section 5.4. Integration with other metrics
Fixed numerous errors in linking to issue types and added back-end mechanism to ensure that issue names and IDs are always correct and that any incorrect links are apparent.
Changed core structure to reflect Style split with updated graphics
Corrected errors in TOC and section numbering

Changes from version 0.1.15 to 0.1.16

Changed Style to Stylistics and split Style-guide out to resolve problem that Style included both mechanical and content-related issues.

Changes from version 0.1.14 to 0.1.15

Fixed section numbering for section 9
Updated section 9 to use current graphic and current Dimensions list
Added scope section and adjusted other text to match scope statement
Updated name locale-violation to Locale-convention
Updated name locale-applicability to Locale-specific-content
Linked to updated version of issue type list to address issue type name changes
Added link to two-page overview document

Changes from version 0.1.13 to 0.1.14

Made scoring metric non-normative

Changes from version 0.1.12 to 0.1.13

Added normative mappings to and from ITS 2.0
Clarified some points about MQM markup and its relationship to ITS 2.0 markup

Changes from version 0.1.11 to 0.1.12

Made usage of “quality assessment” consistent
Added explanatory note about varying quality terms

Changes from version 0.1.10 to 0.1.11

Added link to introductory presentation

Changes from version 0.1.8 to 0.1.10

Changed default location for files
Split issues list into separate document
Added graphical representation of MQM Core
Added more information on building metrics and types of metrics

Changes from version 0.1.7 to 0.1.8

Added additional issue types for TAUS DQF compatibility
Replaced hierarchy graphic to reflect latest changes
Added informative TAUS DQF and SAE J2450 mappings

Changes from version 0.1.6 to 0.1.7

Changed “dimension” to “parameter”
Various textual changes
Additions to Overview section
Added definition for “data category”

Changes from version 0.1.5 to 0.1.6

Added error-count description
Added mqm inline elements description
Added holistic description
Added list of dimensions and description of dimensions
Various corrections
Added copyright notice and acknowledgement of QTLaunchPad

Changes from version 0.1 to 0.1.5

Added MQM graphic representation.

This version:	0.9.3 (2015-06-16) (http://www.qt21.eu/mqm-definition/definition-2015-06-16.html)
Latest version:	http://www.qt21.eu/mqm-definition/
Previous Version:	0.9.2 (2015-06-12) (http://www.qt21.eu/mqm-definition/definition-2015-06-12.html)
Diff from last major version (0.3.0):	http://www.qt21.eu/mqm-definition/diffs/mqm-0_3-0_9_3.html

SAE J2450 issue type	MQM issue type	Note(s)
Wrong term
Omission
Misspelling
Punctuation error
Syntactic error
Word structure or agreement error
Miscellaneous error

${Issues}_{minor}$	= Number of issues with a “minor” severity
${Issues}_{major}$	= Number of issues with a “major” severity
${Issues}_{critical}$	= Number of issues with a “critical” severity

Multidimensional Quality Metrics (MQM) Definition

Copyright

Editor

Contributors

Document status