# aa-news-encoder
This project was built as an educational case study to answer one question: can a fine-tuned Turkish BERT model reliably predict the category of a news article from its title and summary? The answer, at 87% accuracy across 7 categories, is yes.
The full system covers scheduled data ingestion from the Anadolu Ajansı (AA) news API, a Kafka-based processing pipeline, a classification model served over FastAPI and gRPC, a REST/SSE API, and a management dashboard. Training data was scraped from Hürriyet, Sabah, and Sözcü using a custom dataset CLI (dscli). The model and training analytics are publicly available.
## Pipeline
```
AA API → Producer (NestJS) — scheduled fetch, Redis dedup, Kafka publish (raw_content_aa)
       → Consumer (NestJS) — Kafka consume, text preprocessing, gRPC call to model
       → Model (Python)    — FastAPI /predict + gRPC ModelService.Predict
       → PostgreSQL        — persist results
       → API (NestJS)      — REST, SSE, analytics, manual override
       → Dashboard (React) — management panel, live updates
```

| Component | Technology | Role |
|---|---|---|
| Producer | NestJS, TypeORM, Redis | Polls AA API, deduplicates (SHA-256, 3 h TTL; sketched below), publishes to Kafka |
| Consumer | NestJS, gRPC client | Consumes Kafka, preprocesses text, calls model, stores results |
| Model | Python, FastAPI, HF Transformers | Serves fine-tuned BERT; HTTP + gRPC endpoints |
| API | NestJS, PostgreSQL | REST API with filtering, pagination, analytics, SSE |
| Dashboard | React, TanStack Router | Management panel with magic-link auth and SSE-powered updates |
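
The Producer's dedup step (referenced in the table above) is worth a quick illustration. The real service is NestJS, but the idea is language-agnostic: hash the incoming payload with SHA-256 and use a Redis key with a 3-hour TTL as a seen-recently marker. A minimal Python sketch, assuming a local Redis instance; the Producer's actual key naming and payload shape may differ:

```python
import hashlib
import json

import redis

r = redis.Redis()  # assumes a local Redis instance

def is_new_article(article: dict) -> bool:
    """Return True the first time this payload is seen within the TTL window."""
    digest = hashlib.sha256(
        json.dumps(article, sort_keys=True, ensure_ascii=False).encode("utf-8")
    ).hexdigest()
    # SET with NX + EX: succeeds only if the key does not already exist,
    # and the key expires after 3 hours, matching the Producer's dedup TTL.
    return bool(r.set(f"dedup:{digest}", 1, nx=True, ex=3 * 60 * 60))
```

Only articles for which `is_new_article` returns `True` would go on to the `raw_content_aa` topic; repeats within the 3-hour window are dropped.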
## Model

Fine-tuned `dbmdz/bert-base-turkish-cased` on a balanced dataset of Turkish news articles from Hürriyet, Sabah, and Sözcü.

Published model: `mehmetraufoguz/turkish-news-bert-base`
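
The published checkpoint can be tried with the standard transformers pipeline API. A quick sketch (the label names are assumed to follow the category list under Results, and the sample output is illustrative, not a recorded prediction):

```python
from transformers import pipeline

# Downloads the published checkpoint from the HuggingFace Hub on first use.
clf = pipeline("text-classification", model="mehmetraufoguz/turkish-news-bert-base")

# Title and summary concatenated, mirroring the model's training inputs.
print(clf("Merkez Bankası yıl sonu enflasyon tahminini güncelledi"))
# e.g. [{'label': 'EKONOMI', 'score': 0.97}]
```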
## Results
Evaluated on the held-out test split (15%, stratified by category):
| Metric | Score |
|---|---|
| Accuracy | 87.05% |
| Macro F1 | 86.74% |
| Weighted F1 | 86.87% |
7 categories: POLITIKA · EKONOMI · SPOR · SAGLIK · KULTUR_SANAT · DUNYA · TEKNOLOJI
Full training report, loss curves, and per-class metrics → W&B Training Report
See the Model — Results page for more detail.
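
For reference, a 15% stratified hold-out like the one above is a one-liner in scikit-learn. A toy sketch (the texts and labels here are stand-ins, and scikit-learn is an assumption, not necessarily what the project's training code uses):

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 100 fake articles spread evenly over 7 category ids.
texts = [f"örnek haber {i}" for i in range(100)]
labels = [i % 7 for i in range(100)]

# stratify=labels keeps the category proportions identical in both splits.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.15, stratify=labels, random_state=42
)
```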
## Dataset & Training
Training data was scraped from three Turkish news sources (Hürriyet, Sabah, and Sözcü) using a custom local CLI called `dscli`, with scrape scripts prepared in phases. GitHub Copilot (OpenClaw, using the free GPT-5 mini model) assisted in the gathering process. The entire scraping and training process was conducted for educational and learning purposes. The dataset is published on the HuggingFace Hub: `mehmetraufoguz/turkish-news-dataset` (private, not publicly accessible; educational use only).
Model training was likewise conducted for educational and learning purposes, on Google Colab using the free T4 GPU runtime.
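
In outline, fine-tuning the base model follows the standard transformers Trainer recipe. The sketch below is a runnable toy with stand-in data and guessed hyperparameters; the project's actual settings live in `model/train.py` and `model/config.py`:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

base = "dbmdz/bert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=7)

# Toy stand-in for the real scraped dataset (title + summary as one text field).
ds = Dataset.from_dict({
    "text": ["Borsa güne yükselişle başladı", "Milli takım finale yükseldi"],
    "label": [0, 1],  # hypothetical category ids
}).map(lambda b: tokenizer(b["text"], truncation=True, max_length=256), batched=True)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,  # small enough for a free Colab T4
    num_train_epochs=1,
    report_to="none",  # the real run logged to W&B
)
Trainer(model=model, args=args, train_dataset=ds,
        data_collator=DataCollatorWithPadding(tokenizer)).train()
```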
Colab notes for reproducing:

- Upload `model/train.py`, `model/config.py`, `model/utils/` to Colab, or use `train.ipynb` directly.
- Load secrets with `from google.colab import userdata` and `userdata.get('KEY_NAME')` instead of `os.environ.get` (see the sketch after this list).
- After installing requirements, run `!pip install --upgrade numpy` to avoid version conflicts:

```
!pip install -q -r requirements.txt
!pip install --upgrade numpy
```
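
For the secrets note above, the pattern inside Colab looks like this (the key names are hypothetical; use whichever keys `train.py` actually reads):

```python
# Works only inside Google Colab, after adding the keys in the Secrets panel.
from google.colab import userdata

hf_token = userdata.get("HF_TOKEN")        # hypothetical key name
wandb_key = userdata.get("WANDB_API_KEY")  # hypothetical key name
```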
See Datasets Gathering Plan and dscli & Scraping Methods for the full data pipeline.
## Documentation
| Section | Description |
|---|---|
| Producer — Getting Started | Setup and configuration for the AA API ingestion service |
| Model — Getting Started | Installation, inference server, and environment variables |
| Model — Results | Evaluation metrics, category map, and training report |
| Consumer — Getting Started | Kafka consumer and model integration setup |
| API — Getting Started | REST API reference, SSE, analytics endpoints |
| Dashboard — Getting Started | Frontend setup and authentication |
| Datasets Gathering Plan | Dataset schema, collection strategy, and category targets |
| dscli & Scraping Methods | Scraping scripts and dscli CLI reference |
| AA News API | AA API endpoint reference used by the Producer |
## External Resources
| Resource | Link |
|---|---|
| HuggingFace Model | mehmetraufoguz/turkish-news-bert-base |
| HuggingFace Dataset | mehmetraufoguz/turkish-news-dataset (private) |
| W&B Training Report | W&B Report |
| GitHub Repository | mehmetraufoguz/aa-news-encoder |
## Known Issues
- **AA XML parsing:** The `kategori` and `ozet` fields are not always present or well-formed in the AA NewsML feed. Most records parse correctly; edge cases are silently skipped. This is non-critical, since the main goal is classification, not data completeness.
- **`FETCH_CRON` env variable:** This variable exists in the config but is not currently used; the Producer fetches on a fixed 1-hour interval regardless of the `FETCH_CRON` value.
- There are likely other edge-case bugs. The main flow (data ingestion, Kafka pipeline, model inference, API, and dashboard) has been tested end-to-end and works.
This project started as a case study and turned into a full exploration of fine-tuning, dataset scraping, microservice pipelines, and model serving, all on free-tier tooling. It was a great project to work on.