aa-news-encoder

Turkish news classification with encoder-only BERT — a completed proof-of-concept.

This project was built as an educational case study to answer one question: can a fine-tuned Turkish BERT model reliably predict the category of a news article from its title and summary? The answer, at 87% accuracy across 7 categories, is yes.

The full system covers scheduled data ingestion from the Anadolu Ajansı (AA) news API, a Kafka-based processing pipeline, a classification model served over FastAPI and gRPC, a REST/SSE API, and a management dashboard. Training data was scraped from Hürriyet, Sabah, and Sözcü using a custom dataset CLI (dscli). The model and training analytics are publicly available.

Pipeline

AA API  → Producer (NestJS) — scheduled fetch, Redis dedup, Kafka publish (raw_content_aa)
        → Consumer (NestJS) — Kafka consume, text preprocessing, gRPC call to model
        → Model (Python)    — FastAPI /predict + gRPC ModelService.Predict
        → PostgreSQL        — persist results
        → API (NestJS)      — REST, SSE, analytics, manual override
        → Dashboard (React) — management panel, live updates
| Component | Technology | Role |
| --- | --- | --- |
| Producer | NestJS, TypeORM, Redis | Polls AA API, deduplicates (SHA-256, 3 h TTL), publishes to Kafka |
| Consumer | NestJS, gRPC client | Consumes Kafka, preprocesses text, calls model, stores results |
| Model | Python, FastAPI, HF Transformers | Serves fine-tuned BERT; HTTP + gRPC endpoints |
| API | NestJS, PostgreSQL | REST API with filtering, pagination, analytics, SSE |
| Dashboard | React, TanStack Router | Management panel with magic-link auth and SSE-powered updates |
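The Producer's dedup step can be illustrated with a short Python sketch. The real service uses Redis from NestJS (for example a SET with NX and a 3-hour TTL); the in-memory stand-in below only shows the idea: hash the article, and skip publishing if the same hash was seen within the TTL window. Class and method names here are illustrative, not from the repo.

```python
import hashlib
import time


class TTLDedup:
    """In-memory stand-in for the Redis dedup step (SHA-256 key, 3 h TTL)."""

    def __init__(self, ttl_seconds=3 * 3600):
        self.ttl = ttl_seconds
        self._seen = {}  # hash -> timestamp of last sighting

    def key(self, title, summary):
        # SHA-256 over the fields that identify an article
        return hashlib.sha256(f"{title}\n{summary}".encode("utf-8")).hexdigest()

    def is_new(self, title, summary, now=None):
        now = time.time() if now is None else now
        k = self.key(title, summary)
        seen_at = self._seen.get(k)
        if seen_at is not None and now - seen_at < self.ttl:
            return False  # duplicate within the TTL window: skip publish
        self._seen[k] = now  # record (or refresh) the hash
        return True
```

In Redis the same effect is a single atomic `SET key 1 NX EX 10800`: the article is new exactly when the SET succeeds.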

Model

Fine-tuned dbmdz/bert-base-turkish-cased on a balanced dataset of Turkish news articles from Hürriyet, Sabah, and Sözcü.

Published model: mehmetraufoguz/turkish-news-bert-base
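Under the hood, a `/predict` response reduces to a softmax and an argmax over the model's seven logits. A minimal, dependency-free sketch of that post-processing step (the category order follows the list in the Results section; the actual id2label mapping lives in the published model's config, so treat this ordering as an assumption):

```python
import math

# Category list from the Results section; real index order comes from the model config.
CATEGORIES = ["POLITIKA", "EKONOMI", "SPOR", "SAGLIK",
              "KULTUR_SANAT", "DUNYA", "TEKNOLOJI"]


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Map raw model logits to a category name and confidence score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"category": CATEGORIES[best], "confidence": probs[best]}
```

In the real service this sits behind both the FastAPI route and the gRPC `ModelService.Predict` handler, with the logits coming from the fine-tuned BERT forward pass.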

Results

Evaluated on the held-out test split (15%, stratified by category):

| Metric | Score |
| --- | --- |
| Accuracy | 87.05% |
| Macro F1 | 86.74% |
| Weighted F1 | 86.87% |

7 categories: POLITIKA · EKONOMI · SPOR · SAGLIK · KULTUR_SANAT · DUNYA · TEKNOLOJI

Full training report, loss curves, and per-class metrics → W&B Training Report

See the Model — Results page for more detail.
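For reference, macro F1 averages the per-class F1 scores equally, while weighted F1 weights them by class support; on a balanced test split the two stay close, as the table above shows. A small pure-Python sketch of both (the actual evaluation presumably uses scikit-learn or a similar library; this only makes the distinction concrete):

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro F1, weighted F1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(labels)       # every class counts equally
    n = len(y_true)
    weighted = sum(per_class[c] * support[c] / n for c in labels)  # weighted by support
    return macro, weighted
```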

Dataset & Training

Training data was scraped from three Turkish news sources — Hürriyet, Sabah, and Sözcü — using a custom local CLI called dscli, with scrape scripts prepared in phases. GitHub Copilot (OpenClaw with the free GPT-5 mini model) assisted in the gathering process. All scraping was conducted strictly for educational and learning purposes. The dataset is published on HuggingFace Hub: mehmetraufoguz/turkish-news-dataset (private — not publicly accessible, educational use only).

Model training and creation were also for educational and learning purposes, conducted on Google Colab using the free T4 GPU runtime.

Notes for reproducing on Colab:
  • Upload `model/train.py`, `model/config.py`, and `model/utils/` to Colab, or use `train.ipynb` directly.
  • Load secrets with `from google.colab import userdata` and `userdata.get('KEY_NAME')` instead of `os.environ.get`.
  • After installing requirements, upgrade NumPy to avoid version conflicts:
    !pip install -q -r requirements.txt
    !pip install --upgrade numpy
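One way to let the same script run both locally and on Colab is a small fallback helper. This is a sketch, not code from the repo; `get_secret` is a hypothetical name:

```python
import os


def get_secret(name, default=None):
    """Read a secret from Colab's userdata if available, else from the environment."""
    try:
        # Only importable on Colab runtimes; userdata.get raises if the
        # secret is not configured for the notebook.
        from google.colab import userdata
        return userdata.get(name)
    except ImportError:
        # Local run: fall back to plain environment variables.
        return os.environ.get(name, default)
```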

See Datasets Gathering Plan and dscli & Scraping Methods for the full data pipeline.

Documentation

| Section | Description |
| --- | --- |
| Producer — Getting Started | Setup and configuration for the AA API ingestion service |
| Model — Getting Started | Installation, inference server, and environment variables |
| Model — Results | Evaluation metrics, category map, and training report |
| Consumer — Getting Started | Kafka consumer and model integration setup |
| API — Getting Started | REST API reference, SSE, analytics endpoints |
| Dashboard — Getting Started | Frontend setup and authentication |
| Datasets Gathering Plan | Dataset schema, collection strategy, and category targets |
| dscli & Scraping Methods | Scraping scripts and dscli CLI reference |
| AA News API | AA API endpoint reference used by the Producer |

External Resources

| Resource | Link |
| --- | --- |
| HuggingFace Model | mehmetraufoguz/turkish-news-bert-base |
| HuggingFace Dataset | mehmetraufoguz/turkish-news-dataset (private) |
| W&B Training Report | W&B Report |
| GitHub Repository | mehmetraufoguz/aa-news-encoder |

Known Issues

  • AA XML parsing: The kategori and ozet fields are not always present or well-formed in the AA NewsML feed. Most records parse correctly; malformed edge cases are silently skipped. This is non-critical, since the main goal is classification rather than data completeness.
  • FETCH_CRON env variable: This variable exists in the config but is not currently used. The Producer fetches on a fixed 1-hour interval regardless of the FETCH_CRON value.
  • There are likely other edge-case bugs. The main flow — data ingestion, Kafka pipeline, model inference, API, and dashboard — has been tested end-to-end and works.

This project started as a case study and turned into a full exploration of fine-tuning, dataset scraping, microservice pipelines, and model serving — all on free-tier tooling. It was a great journey to work on.