
Machine learning models/Proposed/Tone Check

Model card
This page is an on-wiki machine learning model card.
A model card is a document about a machine learning model that seeks to answer basic questions about the model.
Model Information Hub
Model creator(s): Diego Sáez Trumper
Model owner(s): Diego Sáez Trumper
Model interface: Coming soon
Code: Coming soon
Uses PII: No
In production?: Yes
Which projects?: Tone Check
This model uses the page title and edit text to predict the likelihood that a new edit to an article contains a tone violation.


Motivation


The model was developed to support Wikipedia’s commitment to a neutral point of view, especially in contributions from newer editors who may be less familiar with the platform’s content policies. Initially aimed at detecting “peacock language” (overly promotional or flattering phrases), the scope of the model expanded to address a broader category of tone issues, including language that may come across as promotional, subjective, or derogatory.

This model powers Tone Check, an edit check that prompts editors to consider neutralizing their language when their edits include problematic tone. By surfacing tone-related suggestions at the point of editing, this model helps contributors, especially newcomers, improve the quality and neutrality of their contributions, making it easier to participate in Wikipedia in a productive and policy-aligned way.

Users and uses

Use this model for
  • Predicting (with an associated probability) whether a given sentence or paragraph of human-written content has a tone that is promotional, derogatory, or otherwise subjective
  • Providing suggestions to editors to improve the tone of an edit or published article
Don't use this model for
  • Fully automated decisions: as with any AI/ML model, we recommend keeping humans in the loop when applying this model in a new setting
  • Training other models: we don't recommend using this model's predictions as training data for other ML models
Current uses
  • Tone Check

Ethical considerations, caveats, and recommendations


This model relies on Multilingual BERT, a pretrained language model that may carry over biases from its training data.

Model


Tone Check leverages a Small Language Model (SLM) to detect the presence of promotional, derogatory, or otherwise subjective language. The SLM we are using is a BERT model, which is open source and has openly available weights.

The model is fine-tuned on examples of Wikipedia revisions. It learns from instances where experienced editors have applied a specific template ("peacock") to flag tone violations, as well as instances where that template was removed. This process teaches the BERT model to identify patterns associated with appropriate and inappropriate tone based on Wikipedia's editorial standards. Under the hood, SLMs work by transforming text into high-dimensional vectors, which are compared against the training labels, allowing the model to learn a hyperplane that separates positive from negative cases.
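As an illustration, here is a minimal sketch (not the production training pipeline) of fine-tuning a multilingual BERT model as a binary tone classifier with the Hugging Face transformers library; the model name, example texts, labels, and hyperparameters are illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Multilingual BERT with a 2-class head (0 = neutral, 1 = tone violation).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy stand-ins for the peacock-template pairs described above.
texts = [
    "Example article <SEP> en <SEP> She is a visionary, world-renowned genius.",  # promotional
    "Example article <SEP> en <SEP> She received the award in 2003.",             # neutral
]
labels = torch.tensor([1, 0])

# One fine-tuning step: encode, compute cross-entropy loss, update weights.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()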

The model was trained on 20,000 data points across 10 languages, consisting of:

  • Positive examples: Revisions on Wikipedia that were marked with the "peacock" template, indicating a tone policy violation.
  • Negative examples: Revisions where the "peacock" template had been removed (signifying no policy violation).

The languages covered are:

Language   | wiki_db | lang_code
English    | enwiki  | en
French     | frwiki  | fr
Arabic     | arwiki  | ar
Russian    | ruwiki  | ru
Spanish    | eswiki  | es
Japanese   | jawiki  | ja
Dutch      | nlwiki  | nl
Portuguese | ptwiki  | pt
Chinese    | zhwiki  | zh
German     | dewiki  | de

Small Language Models (like the one used for Tone Check) differ from Large Language Models (LLMs) in that the former are adapted to particular use cases by learning from a focused dataset. In the case of Tone Check, this means the SLM learns directly from the expertise of experienced Wikipedia volunteers. Hence, SLMs offer more explainability and flexibility than LLMs, and they require significantly fewer computational resources than their larger counterparts.

LLMs, on the other hand, are designed for general-purpose use, with limited context and through a chat or prompting interface. They require a huge amount of computational resources, and their behavior is difficult to explain due to the large number of parameters involved.

Performance

wiki_db | precision | f1    | recall | accuracy
nlwiki  | 0.763     | 0.556 | 0.438  | 0.651
frwiki  | 0.706     | 0.581 | 0.494  | 0.644
ptwiki  | 0.694     | 0.555 | 0.462  | 0.629
enwiki  | 0.668     | 0.519 | 0.424  | 0.607
ruwiki  | 0.665     | 0.509 | 0.412  | 0.602
eswiki  | 0.650     | 0.529 | 0.446  | 0.603
dewiki  | 0.647     | 0.489 | 0.394  | 0.589
zhwiki  | 0.631     | 0.465 | 0.368  | 0.576
arwiki  | 0.617     | 0.634 | 0.651  | 0.624
jawiki  | 0.597     | 0.393 | 0.292  | 0.547
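
For reference, these metrics can be computed from model predictions on held-out samples as in the following sketch using scikit-learn; the labels below are illustrative, not actual evaluation data.

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0]  # gold labels: 1 = tone violation, 0 = neutral
y_pred = [1, 0, 0, 1, 0, 1]  # model predictions on the same samples

print("precision:", precision_score(y_true, y_pred))  # share of flagged edits that are true violations
print("recall:", recall_score(y_true, y_pred))        # share of true violations that get flagged
print("f1:", f1_score(y_true, y_pred))                # harmonic mean of precision and recall
print("accuracy:", accuracy_score(y_true, y_pred))    # share of all predictions that are correct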


Implementation

Model architecture

The information provided to the BERT model is the article’s title, language, and content:

{article_title} <SEP> {language} <SEP> {article_content}

This is the raw model input; for how to call the model on LiftWing, see below.
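
For illustration, a hypothetical helper (not part of the released code) that assembles this input string might look like:

def build_model_input(article_title: str, language: str, article_content: str) -> str:
    # Concatenate the three fields with the <SEP> markers shown above.
    return f"{article_title} <SEP> {language} <SEP> {article_content}"

print(build_model_input("Ada Lovelace", "en", "She was a brilliant, visionary pioneer."))
# Ada Lovelace <SEP> en <SEP> She was a brilliant, visionary pioneer.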

Output schema
{
      "prediction": true,
      "probability": 0.831,
}

This is the raw model output; for how to call the model on LiftWing, see below.

Example input and output

This is the format for calling the model on LiftWing:

curl https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict \
  -X POST \
  -d '{"instances": [{"lang": "en", "check_type": "tone", "page_title": "this is a test", "original_text": "text", "modified_text": "this is a great example of work"}]}'
{
  "predictions": [
    {
      "status_code": 200,
      "model_name": "edit-check",
      "model_version": "v1",
      "check_type": "tone",
      "language": "en",
      "page_title": "this is a test",
      "prediction": true,
      "probability": 0.831,
      "details": {}
    }
  ]
}
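
The same request can be made from Python; here is a minimal sketch using the requests library (error handling, and any authentication that api.wikimedia.org may require, are omitted):

import requests

payload = {
    "instances": [
        {
            "lang": "en",
            "check_type": "tone",
            "page_title": "this is a test",
            "original_text": "text",
            "modified_text": "this is a great example of work",
        }
    ]
}
response = requests.post(
    "https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict",
    json=payload,
)
prediction = response.json()["predictions"][0]
print(prediction["prediction"], prediction["probability"])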

Data

Data pipeline

We use templates as labels; in this case, the "peacock" template was used. We collect pairs of positive/negative samples for each template, where a positive sample corresponds to the revision that adds the template, and a negative sample to the first revision where the template is removed.
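
As a concrete illustration, here is a hypothetical sketch of this pair-collection logic; the Revision type and the template matching are simplified stand-ins for the actual pipeline:

from dataclasses import dataclass

@dataclass
class Revision:
    rev_id: int
    text: str

def has_peacock(rev: Revision) -> bool:
    # Simplified check; the real pipeline parses template usage properly.
    return "{{peacock" in rev.text.lower()

def collect_pairs(revisions: list[Revision]) -> list[tuple[Revision, Revision]]:
    """Return (positive, negative) pairs: the revision that adds the
    peacock template and the first later revision that removes it."""
    pairs, positive = [], None
    for prev, curr in zip(revisions, revisions[1:]):
        if has_peacock(curr) and not has_peacock(prev):
            positive = curr                 # template added: tone violation
        elif positive and has_peacock(prev) and not has_peacock(curr):
            pairs.append((positive, curr))  # template removed: violation fixed
            positive = None
    return pairs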

More details can be found here.

Training data

We collect this data for all languages covered by the AYA 23 model, then exclude all languages with fewer than 1,000 pairs; 10 languages remain after this step.

Test data

We sample 1,000 pairs from each remaining language as evaluation data, for a total of 20,000 evaluation samples (each pair contributes one positive and one negative sample).

Licenses



Citation


Cite this model as:

@misc{Tone_Check,
   title={Tone Check},
   author={Saez-Trumper, Diego and Baigutanova, Aitolkyn and Chou, Aiko and Aslam, Muniza},
   year={2024},
   url={https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Tone_Check}
}