
Model Card for swp-berlin/deberta-base-news-topics-kenia-europe

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card was initially generated automatically.

  • Developed by: Paul Bochtler
  • Finetuned from model: microsoft/deberta-v3-base

Uses

The model has been trained on about 700 articles from Kenyan newspapers to detect the presence of the following topics:

  • Coronavirus: Includes topics related to the outbreak and vaccines.
  • Cultural Cooperation: Topics covering cultural exchanges and partnerships.
  • Development Cooperation: Focuses on areas such as agriculture, transport, and renewable energies.
  • Diaspora Affairs/Remittances: Topics involving the Kenyan diaspora and financial remittances.
  • European Domestic and Regional Politics: Includes issues such as Brexit and market regulation/standards.
  • Financing/Loans/Debt: Covers financial aspects including loans and debt management.
  • Global Affairs/International (Geo)politics: Topics related to international relations and geopolitical dynamics.
  • Kenyan Foreign Policy/Diplomacy: Focus on Kenya's foreign relations and diplomatic efforts.
  • Regional Affairs/African Politics: Topics on regional dynamics and African political issues.
  • Social Controversies: Includes discussions on the colonial past, visa/migration issues, energy justice, and the ICC case.
  • Tourism: Covers aspects related to the tourism industry.
  • Trade/Investment: Includes import/export, tenders, and investment projects.

Direct Use

This model can be directly applied to classify articles based on the above topics, making it suitable for use in media analysis, content categorization, and research on public discourse in Kenyan media.

Bias, Risks, and Limitations

The model swp-berlin/deberta-base-news-topics-kenia-europe was trained on approximately 700 articles from Kenyan newspapers, which may introduce certain biases and limitations:

  • Data Bias: The model's predictions are influenced by the specific articles and sources used during training, which may reflect the perspectives, biases, and linguistic styles of those publications. This can result in an overrepresentation of certain viewpoints or underrepresentation of others, especially those outside the mainstream media.

  • Cultural and Regional Bias: Since the training data is centered around Kenyan newspapers, the model may perform better on content related to East African contexts and may not generalize well to other regions or cultural settings.

  • Topic Limitations: The model is designed to detect specific topics such as global affairs, development cooperation, and social controversies. It may not perform well on texts that fall outside these predefined categories.

  • Risks of Misclassification: As with any classification model, there is a risk of misclassification, especially with nuanced or ambiguous content. Users should verify model outputs where high accuracy is critical.

  • Ethical Considerations: Users should be mindful of how the model’s outputs are used, particularly in sensitive contexts such as media analysis or public discourse monitoring, to avoid reinforcing biases or misinformation.

How to Get Started with the Model

To get started with the model, use the following code snippet:

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import torch

# Define model path and device
model_name = "swp-berlin/deberta-base-news-topics-kenia-europe"
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # Use GPU if available, otherwise CPU

# Load the model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)

# Initialize the pipeline for text classification
pipe_classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    framework="pt",
    device=device,
    batch_size=2
)

# Example usage
result = pipe_classifier("Example text to classify")
print(result)
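
Because the model is intended to detect the presence of multiple topics in a single article, you may want a score for every label rather than only the top prediction. A minimal sketch, assuming the standard top_k option of the text-classification pipeline (adjust if your transformers version behaves differently):

# Return a score for every topic label instead of only the top one
all_scores = pipe_classifier("Example text to classify", top_k=None)
for entry in all_scores:
    print(entry["label"], round(entry["score"], 3))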

Training Data

The model was trained on a custom dataset comprising approximately 700 articles from Kenyan newspapers. The dataset includes a variety of topics relevant to Kenyan and international contexts, including health, politics, development, and cultural affairs. Preprocessing involved filtering irrelevant articles and balancing the dataset across the target topics.

Training Procedure

The model was fine-tuned from the pre-trained microsoft/deberta-v3-base checkpoint using the following training configuration:

Preprocessing

  • Texts were tokenized using the DeBERTa tokenizer, with particular attention to sentence splitting and removal of noise such as URLs and non-text elements (an illustrative cleaning sketch follows below).
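
The exact preprocessing code is not published in this card; the following is only an illustrative sketch of the kind of cleaning described above, assuming simple regex-based URL removal and whitespace normalization:

import re

def clean_text(text: str) -> str:
    # Remove URLs (assumed noise pattern, not the exact rule used in training)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Collapse the whitespace left over after removal
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Read more at https://example.com about the new trade deal."))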

Training Hyperparameters

  • Learning Rate: 6e-5
  • Batch Size: 8
  • Epochs: 20
  • Gradient Accumulation Steps: 4
  • Warm-up Ratio: 0.06, to ramp up the learning rate gradually at the start of training
  • Weight Decay: 0.01, to regularize the model and prevent overfitting
  • Evaluation Strategy: Evaluation was performed at the end of each epoch, and the checkpoint with the best f1_macro score was retained.

Training was conducted on a GPU to optimize performance and speed. The training script utilized Hugging Face's Trainer class for model management and evaluation; a configuration sketch follows below.
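
As an illustration only, the hyperparameters above map onto Hugging Face's TrainingArguments roughly as follows. This is a sketch under assumptions (argument names as in recent transformers releases, an output path chosen here for illustration, and a compute_metrics function that reports a metric named "f1_macro"), not the published training script:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="deberta-news-topics",   # hypothetical output path
    learning_rate=6e-5,
    per_device_train_batch_size=8,
    num_train_epochs=20,
    gradient_accumulation_steps=4,
    warmup_ratio=0.06,                  # ramp up the learning rate at the start of training
    weight_decay=0.01,                  # regularize the model to prevent overfitting
    evaluation_strategy="epoch",        # evaluate at the end of each epoch
    save_strategy="epoch",              # save per epoch so the best checkpoint can be restored
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",   # keep the checkpoint with the best macro F1
)
# A Trainer would then combine these arguments with the model and tokenizer loaded
# above, the prepared train/eval datasets, and a compute_metrics function.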

Model size: 184M params (F32 tensors, Safetensors format)