Model - Results

This page summarises the evaluation results for turkish-news-bert-base on the held-out test split (15% of the dataset, stratified by category).

For the full training report — including loss and metric curves per epoch, run configuration, and per-class breakdown — see the W&B report:

W&B Training Report →

Evaluation Metrics

Evaluated on the held-out test split.

| Metric | Score |
|---|---|
| Accuracy | 0.8705 |
| Macro F1 | 0.8674 |
| Weighted F1 | 0.8687 |

- **Accuracy** — fraction of correctly classified articles across all classes.
- **Macro F1** — unweighted mean of per-class F1 scores; gives equal weight to each category regardless of sample count. This was the primary metric used to select the best model checkpoint during training.
- **Weighted F1** — mean of per-class F1 scores weighted by class frequency in the test set.

The closeness of macro and weighted F1 (0.8674 vs 0.8687) reflects the dataset's balanced class distribution.
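The relationship between the two averages can be illustrated directly. The per-class F1 scores and supports below are hypothetical, not the model's actual per-class results; they only show how near-equal class counts make the weighted mean collapse toward the unweighted one:

```python
# Hypothetical per-class F1 scores and test-set supports (illustrative
# numbers only, not taken from this model's evaluation).
per_class_f1 = {"POLITIKA": 0.88, "EKONOMI": 0.85, "SPOR": 0.93,
                "SAGLIK": 0.86, "KULTUR_SANAT": 0.84, "DUNYA": 0.83,
                "TEKNOLOJI": 0.88}
support = {"POLITIKA": 450, "EKONOMI": 455, "SPOR": 448,
           "SAGLIK": 452, "KULTUR_SANAT": 449, "DUNYA": 451,
           "TEKNOLOJI": 450}

# Macro F1: unweighted mean over classes.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted F1: each class weighted by its share of the test set.
total = sum(support.values())
weighted_f1 = sum(per_class_f1[c] * support[c] / total for c in per_class_f1)

print(round(macro_f1, 4), round(weighted_f1, 4))
```

With supports this close to uniform, the two numbers agree to roughly the third decimal place; any visible gap between macro and weighted F1 would signal class imbalance or a weak minority class.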

Category Map

| ID | Label | Description |
|---|---|---|
| 0 | POLITIKA | Politics and government |
| 1 | EKONOMI | Economy and finance |
| 2 | SPOR | Sports |
| 3 | SAGLIK | Health and medicine |
| 4 | KULTUR_SANAT | Culture and arts |
| 5 | DUNYA | World / international news |
| 6 | TEKNOLOJI | Technology |
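Expressed as a label mapping, the table above corresponds to the `id2label` / `label2id` dictionaries used by the Hugging Face model-config convention (the dict names follow that convention; this is a sketch, not a verified excerpt of the repository's code):

```python
# Category map from the table above, in Hugging Face config style.
id2label = {
    0: "POLITIKA", 1: "EKONOMI", 2: "SPOR", 3: "SAGLIK",
    4: "KULTUR_SANAT", 5: "DUNYA", 6: "TEKNOLOJI",
}
# Inverse mapping, used when encoding dataset labels as class IDs.
label2id = {label: i for i, label in id2label.items()}

print(label2id["TEKNOLOJI"])  # → 6
```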

Notes

Dataset balance. The dataset was collected with explicit per-category targets to ensure balanced class distribution across all three sources. Per-category shortfalls in one source were compensated by higher quotas from others (e.g. the TEKNOLOJI quota was higher for Sabah and Sözcü to offset Hürriyet's lower technology coverage).

Split strategy. The 70/15/15 train/val/test split is stratified by category label. Data leakage is verified before every training run by checking both unique_id overlap and SHA-256 text fingerprint overlap across all split pairs. See the Code Flow page for details.
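The SHA-256 fingerprint half of that leakage check can be sketched as follows. The function names and text normalisation here are assumptions for illustration; the page only states that text fingerprints (and `unique_id`s) are compared across all split pairs:

```python
import hashlib


def fingerprint(text: str) -> str:
    """SHA-256 hex digest of lightly normalised article text."""
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()


def assert_no_leakage(train_texts, val_texts, test_texts):
    """Raise if any article's fingerprint appears in more than one split."""
    splits = {"train": {fingerprint(t) for t in train_texts},
              "val": {fingerprint(t) for t in val_texts},
              "test": {fingerprint(t) for t in test_texts}}
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = splits[a] & splits[b]
            if overlap:
                raise ValueError(f"{len(overlap)} duplicate texts in {a}/{b}")
```

Hashing normalised text rather than comparing raw strings catches near-verbatim duplicates that differ only in surrounding whitespace or casing, while keeping the pairwise check O(n) per split.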

Macro vs. weighted F1. Macro F1 was used as metric_for_best_model in TrainingArguments because it treats each category equally and penalises poor performance on any single class. Weighted F1 is reported as a secondary measure of overall quality weighted by class frequency.
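The checkpoint-selection setup described above maps onto a `TrainingArguments` fragment roughly like the following. Only the arguments relevant to this page are shown; the `output_dir` value and the `"macro_f1"` key (which must match what the run's `compute_metrics` function returns) are assumptions, and the strategy argument is named `eval_strategy` in recent `transformers` releases (`evaluation_strategy` in older ones):

```python
from transformers import TrainingArguments

# Sketch of the checkpoint-selection configuration; other hyperparameters
# (learning rate, batch size, epochs) are not specified on this page.
args = TrainingArguments(
    output_dir="turkish-news-bert-base",   # assumed name
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",      # key from compute_metrics
    greater_is_better=True,
)
```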