Machine learning models/Proposed/Tone Check
This model card page currently has draft status. It is a piece of model documentation that is in the process of being written. Once the model card is completed, this template should be removed.
Model card
This page is an on-wiki machine learning model card. A model card is a document about a machine learning model that seeks to answer basic questions about the model.

Model Information Hub
Model creator(s) | Diego Sáez Trumper |
Model owner(s) | Diego Sáez Trumper |
Model interface | Coming soon |
Code | Coming soon |
Uses PII | No |
In production? | Yes |
Which projects? | Tone Check |
This model uses the page title and edit text to predict the likelihood that a new edit to an article contains a tone violation.
Motivation
The model was developed to support Wikipedia’s commitment to a neutral point of view, especially in contributions from newer editors who may be less familiar with the platform’s content policies. Initially aimed at detecting “peacock language” (overly promotional or flattering phrases), the scope of the model expanded to address a broader category of tone issues, including language that may come across as promotional, subjective, or derogatory.
This model powers Tone Check, an edit check that prompts editors to consider neutralizing their language when their edits include problematic tone. By surfacing tone-related suggestions at the point of editing, this model helps contributors, especially newcomers, improve the quality and neutrality of their contributions, making it easier to participate in Wikipedia in a productive and policy-aligned way.
Users and uses
- Predicting (with an associated probability) whether a given sentence or paragraph of human-written content has a tone that is promotional, derogatory, or otherwise subjective
- Providing suggestions to editors to improve the tone of an edit or published article
- As with any AI/ML model, we recommend keeping humans in the loop when applying this model in a new setting
- We don't recommend using this model's predictions as training data for other ML models
- Currently used in production by the Tone Check feature
Ethical considerations, caveats, and recommendations
This model relies on Multilingual BERT, a pretrained language model that might contain certain biases.
Model
Tone Check leverages a Small Language Model (SLM) to detect the presence of promotional, derogatory, or otherwise subjective language. The SLM we use is a BERT model, which is open source and whose weights are openly available.
The model is fine-tuned on examples of Wikipedia revisions. It learns from instances where experienced editors have applied a specific template ("peacock") to flag tone violations, as well as instances where that template was removed. This process teaches the BERT model to identify patterns associated with appropriate and inappropriate tone based on Wikipedia's editorial standards. Under the hood, the model transforms text into high-dimensional vectors; during fine-tuning, a classification layer learns a decision boundary (a hyperplane) in that vector space that separates positive cases from negative ones.
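To make this concrete, here is a minimal fine-tuning sketch using the Hugging Face transformers library. The checkpoint name and the toy examples are illustrative assumptions, not the exact checkpoint or data used in production.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: a standard multilingual BERT checkpoint; the production
# checkpoint may differ.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # 0 = neutral, 1 = tone violation
)

texts = [
    "She is a visionary, world-renowned pioneer of the field.",  # peacock-style text
    "She served as department chair from 2005 to 2012.",         # neutral text
]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()  # one gradient step of an ordinary training loop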
The model was trained using 20,000 data points from 10 languages, consisting of:
- Positive examples: Revisions on Wikipedia that were marked with the "peacock" template, indicating a tone policy violation.
- Negative examples: Revisions where the "peacock" template had been removed (signifying no policy violation).
The languages covered are:
Language | wiki_db | lang_code |
---|---|---|
English | enwiki | en |
French | frwiki | fr |
Arabic | arwiki | ar |
Russian | ruwiki | ru |
Spanish | eswiki | es |
Japanese | jawiki | ja |
Dutch | nlwiki | nl |
Portuguese | ptwiki | pt |
Chinese | zhwiki | zh |
German | dewiki | de |
Small Language Models (like the one used for Tone Check) differ from Large Language Models (LLMs) in that the former are trained to adapt to particular use cases by learning from a focused dataset. In the case of Tone Check, this means the SLM learns directly from the expertise of experienced Wikipedia volunteers. Hence, SLMs offer more explainability and flexibility than LLMs, and they require significantly fewer computational resources than their larger counterparts.
LLMs, on the other hand, are designed for general-purpose use, typically accessed with limited context through a chat or prompting interface. They require a huge amount of computational resources, and their behavior is difficult to explain due to the large number of parameters involved.
Performance
wiki_db | precision | f1 | recall | accuracy |
---|---|---|---|---|
nlwiki | 0.763 | 0.556 | 0.438 | 0.651 |
frwiki | 0.706 | 0.581 | 0.494 | 0.644 |
ptwiki | 0.694 | 0.555 | 0.462 | 0.629 |
enwiki | 0.668 | 0.519 | 0.424 | 0.607 |
ruwiki | 0.665 | 0.509 | 0.412 | 0.602 |
eswiki | 0.650 | 0.529 | 0.446 | 0.603 |
dewiki | 0.647 | 0.489 | 0.394 | 0.589 |
zhwiki | 0.631 | 0.465 | 0.368 | 0.576 |
arwiki | 0.617 | 0.634 | 0.651 | 0.624 |
jawiki | 0.597 | 0.393 | 0.292 | 0.547 |
Implementation
The information provided to the BERT model is the article’s title, language, and content:
{article_title} <SEP> {language} <SEP> {article_content}
This is the original model input format; see below for how to invoke the model on LiftWing.
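For illustration, a small helper (hypothetical, not part of the released code) that assembles this input string might look like:

def build_model_input(article_title: str, language: str, article_content: str) -> str:
    # Join the fields with the <SEP> separator shown above.
    return f"{article_title} <SEP> {language} <SEP> {article_content}"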
{
"prediction": true,
"probability": 0.831,
}
This is the original model output; the section below shows how to call the model on LiftWing.
This is the format for calling the model on LiftWing:
curl https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict \
  -X POST \
  -d '{"instances": [{"lang": "en", "check_type": "tone", "page_title": "this is a test", "original_text": "text", "modified_text": "this is a great example of work"}]}'
{
"predictions": [
{
"status_code": 200,
"model_name": "edit-check",
"model_version": "v1",
"check_type": "tone",
"language": "en",
"page_title": "this is a test",
"prediction": true,
"probability": 0.831,
"details": {}
}
]
}
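The same call can be made from Python. The sketch below mirrors the curl example using the requests library; sending the request without authentication headers is an assumption and may not hold for all access levels.

import requests

response = requests.post(
    "https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict",
    json={
        "instances": [
            {
                "lang": "en",
                "check_type": "tone",
                "page_title": "this is a test",
                "original_text": "text",
                "modified_text": "this is a great example of work",
            }
        ]
    },
    timeout=30,
)
response.raise_for_status()

# Print the boolean prediction and its probability for each instance.
for prediction in response.json()["predictions"]:
    print(prediction["prediction"], prediction["probability"])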
Data
We use templates as labels; in this case, the Peacock Language template was used. We collect pairs of positive/negative samples for each template, where a positive sample corresponds to the revision that adds the template, and a negative sample to the first revision where the template is removed.
More details can be found here.
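As an illustration, the pairing logic described above could be sketched as follows; the function and the template-detection test are simplified assumptions, not the actual pipeline.

def collect_pairs(revisions):
    """revisions: chronologically ordered (rev_id, wikitext) tuples for one article."""
    pairs = []
    open_positive = None
    for rev_id, text in revisions:
        has_template = "{{peacock" in text.lower()
        if has_template and open_positive is None:
            # Template added: this revision is a positive (violation) sample.
            open_positive = (rev_id, text)
        elif not has_template and open_positive is not None:
            # First revision with the template removed: the negative sample of the pair.
            pairs.append((open_positive, (rev_id, text)))
            open_positive = None
    return pairs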
We collect this data for all languages covered by the AYA 23 model, then exclude all languages with fewer than 1K pairs; 10 languages remain after this step.
We sample 1k pairs for each remaining language as evaluation data, yielding 20k evaluation samples in total (10 languages × 1k pairs, with each pair contributing one positive and one negative sample).
Licenses
- Code: Apache 2.0 License
- Model: Apache 2.0 License
Citation
Cite this model as:
@misc{Tone_Check,
title={Tone Check},
author={Saez-Trumper, Diego and Baigutanova, Aitolkyn and Chou, Aiko and Aslam, Muniza},
year={2024},
url={https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Tone_Check}
}