aa-news-encoder

Turkish news classification with encoder-only BERT — a completed proof-of-concept.

This project was built as an educational case study to answer one question: can a fine-tuned Turkish BERT model reliably predict the category of a news article from its title and summary? The answer, at 87% accuracy across 7 categories, is yes.

The full system covers scheduled data ingestion from the Anadolu Ajansı (AA) news API, a Kafka-based processing pipeline, a classification model served over FastAPI and gRPC, a REST/SSE API, and a management dashboard. Training data was scraped from Hürriyet, Sabah, and Sözcü using a custom dataset CLI (dscli). The model and training analytics are publicly available.

Pipeline

AA API  → Producer (NestJS) — scheduled fetch, Redis dedup, Kafka publish (raw_content_aa)
        → Consumer (NestJS) — Kafka consume, text preprocessing, gRPC call to model
        → Model (Python)    — FastAPI /predict + gRPC ModelService.Predict
        → PostgreSQL        — persist results
        → API (NestJS)      — REST, SSE, analytics, manual override
        → Dashboard (React) — management panel, live updates
| Component | Technology | Role |
| --- | --- | --- |
| Producer | NestJS, TypeORM, Redis | Polls AA API, deduplicates (SHA-256, 3 h TTL), publishes to Kafka |
| Consumer | NestJS, gRPC client | Consumes Kafka, preprocesses text, calls model, stores results |
| Model | Python, FastAPI, HF Transformers | Serves fine-tuned BERT; HTTP + gRPC endpoints |
| API | NestJS, PostgreSQL | REST API with filtering, pagination, analytics, SSE |
| Dashboard | React, TanStack Router | Management panel with magic-link auth and SSE-powered updates |
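The Producer's dedup step can be illustrated with a short Python sketch. The real service uses Redis from NestJS (for example a SET with NX and a 3-hour TTL); the in-memory stand-in below only shows the idea: hash the article, and skip publishing if the same hash was seen within the TTL window. Class and method names here are illustrative, not from the repo.

```python
import hashlib
import time


class TTLDedup:
    """In-memory stand-in for the Redis dedup step (SHA-256 key, 3 h TTL)."""

    def __init__(self, ttl_seconds=3 * 3600):
        self.ttl = ttl_seconds
        self._seen = {}  # hash -> timestamp of last sighting

    def key(self, title, summary):
        # SHA-256 over the fields that identify an article
        return hashlib.sha256(f"{title}\n{summary}".encode("utf-8")).hexdigest()

    def is_new(self, title, summary, now=None):
        now = time.time() if now is None else now
        k = self.key(title, summary)
        seen_at = self._seen.get(k)
        if seen_at is not None and now - seen_at < self.ttl:
            return False  # duplicate within the TTL window: skip publish
        self._seen[k] = now  # record (or refresh) the hash
        return True
```

In Redis the same effect is a single atomic `SET key 1 NX EX 10800`: the article is new exactly when the SET succeeds.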

Model

Fine-tuned dbmdz/bert-base-turkish-cased on a balanced dataset of Turkish news articles from Hürriyet, Sabah, and Sözcü.

Published model: mehmetraufoguz/turkish-news-bert-base
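Under the hood, a `/predict` response reduces to a softmax and an argmax over the model's seven logits. A minimal, dependency-free sketch of that post-processing step (the category order follows the list in the Results section; the actual id2label mapping lives in the published model's config, so treat this ordering as an assumption):

```python
import math

# Category list from the Results section; real index order comes from the model config.
CATEGORIES = ["POLITIKA", "EKONOMI", "SPOR", "SAGLIK",
              "KULTUR_SANAT", "DUNYA", "TEKNOLOJI"]


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Map raw model logits to a category name and confidence score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"category": CATEGORIES[best], "confidence": probs[best]}
```

In the real service this sits behind both the FastAPI route and the gRPC `ModelService.Predict` handler, with the logits coming from the fine-tuned BERT forward pass.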

Results

Evaluated on the held-out test split (15%, stratified by category):

| Metric | Score |
| --- | --- |
| Accuracy | 87.05% |
| Macro F1 | 86.74% |
| Weighted F1 | 86.87% |

7 categories: POLITIKA · EKONOMI · SPOR · SAGLIK · KULTUR_SANAT · DUNYA · TEKNOLOJI

Full training report, loss curves, and per-class metrics → W&B Training Report

See the Model — Results page for more detail.
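For reference, macro F1 averages the per-class F1 scores equally, while weighted F1 weights them by class support; on a balanced test split the two stay close, as the table above shows. A small pure-Python sketch of both (the actual evaluation presumably uses scikit-learn or a similar library; this only makes the distinction concrete):

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro F1, weighted F1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(labels)       # every class counts equally
    n = len(y_true)
    weighted = sum(per_class[c] * support[c] / n for c in labels)  # weighted by support
    return macro, weighted
```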

Dataset & Training

Training data was scraped from three Turkish news sources — Hürriyet, Sabah, and Sözcü — using a custom local CLI called dscli, with scrape scripts prepared in phases. GitHub Copilot (OpenClaw with the free GPT-5 mini model) assisted in the gathering process. All scraping was conducted strictly for educational and learning purposes. The dataset is published on HuggingFace Hub: mehmetraufoguz/turkish-news-dataset (private — not publicly accessible, educational use only).

Model training and creation were also for educational and learning purposes, conducted on Google Colab using the free T4 GPU runtime.

Notes for reproducing on Colab:
  • Upload `model/train.py`, `model/config.py`, and `model/utils/` to Colab, or use `train.ipynb` directly.
  • Load secrets with `from google.colab import userdata` and `userdata.get('KEY_NAME')` instead of `os.environ.get`.
  • After installing requirements, upgrade NumPy to avoid version conflicts:
    !pip install -q -r requirements.txt
    !pip install --upgrade numpy
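One way to let the same script run both locally and on Colab is a small fallback helper. This is a sketch, not code from the repo; `get_secret` is a hypothetical name:

```python
import os


def get_secret(name, default=None):
    """Read a secret from Colab's userdata if available, else from the environment."""
    try:
        # Only importable on Colab runtimes; userdata.get raises if the
        # secret is not configured for the notebook.
        from google.colab import userdata
        return userdata.get(name)
    except ImportError:
        # Local run: fall back to plain environment variables.
        return os.environ.get(name, default)
```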

See Datasets Gathering Plan and dscli & Scraping Methods for the full data pipeline.

Documentation

| Section | Description |
| --- | --- |
| Producer — Getting Started | Setup and configuration for the AA API ingestion service |
| Model — Getting Started | Installation, inference server, and environment variables |
| Model — Results | Evaluation metrics, category map, and training report |
| Consumer — Getting Started | Kafka consumer and model integration setup |
| API — Getting Started | REST API reference, SSE, analytics endpoints |
| Dashboard — Getting Started | Frontend setup and authentication |
| Datasets Gathering Plan | Dataset schema, collection strategy, and category targets |
| dscli & Scraping Methods | Scraping scripts and dscli CLI reference |
| AA News API | AA API endpoint reference used by the Producer |

External Resources

| Resource | Link |
| --- | --- |
| HuggingFace Model | mehmetraufoguz/turkish-news-bert-base |
| HuggingFace Dataset | mehmetraufoguz/turkish-news-dataset (private) |
| W&B Training Report | W&B Report |
| GitHub Repository | mehmetraufoguz/aa-news-encoder |

Known Issues

  • AA XML parsing: The kategori and ozet fields are not always present or well-formed in the AA NewsML feed. Most records parse correctly; malformed edge cases are silently skipped. This is non-critical, since the main goal is classification rather than data completeness.
  • FETCH_CRON env variable: This variable exists in the config but is not currently used. The Producer fetches on a fixed 1-hour interval regardless of the FETCH_CRON value.
  • There are likely other edge-case bugs. The main flow — data ingestion, Kafka pipeline, model inference, API, and dashboard — has been tested end-to-end and works.

This project started as a case study and turned into a full exploration of fine-tuning, dataset scraping, microservice pipelines, and model serving — all on free-tier tooling. It was a great journey to work on.