Drupal Mountain Camp 2024

Supported tagging of articles using AI
03-09, 13:55–14:45 (Europe/Zurich), Parsenn

With growing website complexity, content tagging can become a key asset in content management. In this talk, I will give an introduction to how AI can support editors by automatically suggesting tags for their content, without depending on black boxes or third-party services such as ChatGPT.
After this talk, the audience will be able to train their own machine learning model on their own dataset, and suggest relevant keywords from a taxonomy to editors.

On websites with a large number of nodes, tags become essential to sort and find content. However, tagging can be tedious and time-consuming. To support editors with classifying content, we are developing a tool that uses machine learning to suggest relevant tags from a taxonomy.

There is already one module on drupal.org that integrates ChatGPT into Drupal (https://www.drupal.org/project/openai) and implements various features, including content tagging. However, this module has a few drawbacks. For example, it is not possible out of the box to give ChatGPT a taxonomy of tags to choose from. The website also shares information with ChatGPT, which may or may not be desirable.
On the other hand, the method that I will present does not depend on ChatGPT: everything happens internally, so there is no need to send data to a third party. You have full control over what happens to your data, and you can understand exactly why a given taxonomy term was selected, unlike with ChatGPT, which is a black box.
The method that I will present (TF-IDF + model fitting) consists of training your own model on your own data. This comes with a few drawbacks compared to using ChatGPT. One of them is that you first need a sizeable set of tagged articles (a few hundred at minimum). The quality of the model also depends on the quality of the manually tagged content.
The run-down of the method is:
1. convert the available text into numbers, which is what models can work with, using the TF-IDF method
2. train a clustering model that essentially classifies the content into distinct clusters, where each cluster corresponds to a taxonomy term
3. assess the performance of the model, perhaps try another more advanced model, until the performance is satisfactory
4. use the model to classify new content.
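
The four steps above can be sketched in a few lines of standard-library Python. This is only a minimal illustration under stated assumptions, not the implementation presented in the talk: the toy articles and tags are made up, the function names (`fit_idf`, `tfidf`, `train_centroids`, `suggest_tag`) are hypothetical, and a simple nearest-centroid rule stands in for whichever model is actually fitted.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Deliberately naive: lowercase, split on whitespace, keep alphabetic words.
    return [w for w in text.lower().split() if w.isalpha()]

def fit_idf(docs):
    # Step 1a: inverse document frequency, using a smoothed variant
    # log((1 + n) / (1 + df)) + 1 (an assumption; other variants exist).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc, idf):
    # Step 1b: term frequency times IDF, L2-normalised, as a sparse dict.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(docs, tags, idf):
    # Step 2: one summed vector per taxonomy term. Cosine similarity ignores
    # scale, so the sum behaves like the mean here.
    centroids = defaultdict(lambda: defaultdict(float))
    for doc, tag in zip(docs, tags):
        for term, weight in tfidf(doc, idf).items():
            centroids[tag][term] += weight
    return centroids

def suggest_tag(doc, idf, centroids):
    # Step 4: suggest the taxonomy term whose centroid is most similar.
    vec = tfidf(doc, idf)
    return max(centroids, key=lambda tag: cosine(vec, centroids[tag]))

# Hypothetical, manually tagged training articles.
docs = [
    "hiking trails in the alps",
    "mountain hiking with snow boots",
    "drupal module update release",
    "install the drupal module with composer",
]
tags = ["outdoor", "outdoor", "drupal", "drupal"]

idf = fit_idf(docs)
centroids = train_centroids(docs, tags, idf)

# Step 3: assess the model on content it has not seen during training.
held_out = [("snow hiking weekend", "outdoor"),
            ("new drupal module for editors", "drupal")]
accuracy = sum(suggest_tag(d, idf, centroids) == t for d, t in held_out) / len(held_out)
```

Because the vectors are plain term-to-weight dictionaries, it is easy to inspect which words pushed an article towards a given term, which is the kind of transparency the method aims for. A real setup would need far more training data and a proper train/test split.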

The presentation will focus on the following points:
- Present the currently available modules that use ML to support content tagging, with their strengths and their shortcomings
- Explain in more detail what TF-IDF is and how it turns text into a data structure usable by machine learning algorithms
- Give a short introduction to machine learning, and especially clustering
- Give pointers as to whether to use a pre-trained model
- Present a few examples of the simplest and most used algorithms for clustering
- Present metrics that allow you to assess whether an algorithm was successful in clustering your data
- Give insights into how to apply the model on the website, at runtime or as a cron job

See also: Slides

Mathilde discovered the Drupal world 4 years ago.
After completing her PhD in computational biology in France, Mathilde moved to Switzerland for a postdoc before starting to work with MD Systems.
She uses the strengths she acquired during her scientific career to develop high quality back-end systems and data analysis tools.