
Structured & Detailed

Summary (based on Raw)

This structured section restates and organizes the decisions captured in the Raw section without adding assumptions. It follows the exact components and choices you already listed.

Producer

  • Framework: NestJS standalone microservice (NestFactory.createMicroservice).
  • Source: AA API https://api.aa.com.tr/abone/ (requires provided credentials).
  • Validation: apply zod validation to API responses.
  • Storage: save full payload into Postgres using TypeORM. Required fields to persist: id, baslik, ozet, icerik, kategori_aa, yayim_tarihi, kaynak_url, dil.
  • Publish: push compact message to Kafka topic raw_content_aa with id, baslik, ozet.
  • Deduplication: compute sha256(id + source) and store in Redis with a 3h TTL (key pattern aan_source_-key-) to avoid duplicates (see the sketch after this list).
  • Scheduling: use task scheduling for periodic fetches; follow API pagination and limit to 8 pages per run; stop when existing contents are reached. Provide a manual trigger by publishing to fetch_content_aa Kafka topic.
  • Observability: record analytics for how many fetched, duplicates encountered, and counts of rate-limit / DB errors.
  • Deploy/dev: require Docker Compose; ensure Kafka heartbeat is handled.
  • Mock fallback: if AA API credentials are unavailable, the producer should load or generate a set of 64 mock contents, randomizing IDs and varying fields on each pick; ensure mock data is distinct from model training data and suitable for testing the ingestion and consumer flows.
  • Two-step fetch design: perform a metadata gather step and publish metadata to a discovery topic, then separately enqueue downloads to a dedicated download topic (e.g. raw_content_aa_download) to decouple fetching and file downloads and avoid lost work on failures.
  • Download format & structured flow: newsml29 is implemented in Raw and should be the primary structured download format; support newsml12 as a fallback where needed. Producer must request metadata first and enqueue newsml29 (or newsml12) downloads to the download topic for a separate downloader worker.
  • XML handling & cleaning: the download worker (producer or separate downloader) must parse NewsML/XML responses, run the XML parser and a text-cleaner pipeline (strip tags, normalize whitespace, extract title/summary/body), and persist normalized text into Postgres (a cleaning sketch follows this list).
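
A minimal sketch of the dedup-and-publish step referenced above, assuming ioredis and kafkajs clients; the key prefix follows the aan_source_ pattern from the notes, and the hypothetical publishIfNew helper is illustrative, not the actual implementation.

```ts
import { createHash } from "node:crypto";
import Redis from "ioredis";
import { Kafka } from "kafkajs";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const kafka = new Kafka({ clientId: "aa-producer", brokers: ["kafka:9092"] });
const producer = kafka.producer(); // producer.connect() is assumed to run at startup

const DEDUP_TTL_SECONDS = 3 * 60 * 60; // 3h TTL, per the notes

// Publish the compact message only when sha256(id + source) has not been seen.
async function publishIfNew(
  content: { id: string; baslik: string; ozet: string },
  source = "aa",
): Promise<boolean> {
  const hash = createHash("sha256").update(content.id + source).digest("hex");
  // NX + EX sets the key atomically, only if absent, with the 3h TTL.
  const fresh = await redis.set(`aan_source_${hash}`, "1", "EX", DEDUP_TTL_SECONDS, "NX");
  if (fresh === null) return false; // duplicate within the TTL window, skip
  await producer.send({
    topic: "raw_content_aa",
    messages: [{
      key: content.id,
      value: JSON.stringify({ id: content.id, baslik: content.baslik, ozet: content.ozet }),
    }],
  });
  return true;
}
```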
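And a sketch of the text-cleaner half of the download worker, assuming fast-xml-parser; the element paths used for title/summary/body are placeholders, since the real paths must come from the actual newsml29 schema.

```ts
import { XMLParser } from "fast-xml-parser";

// Strip residual markup and normalize whitespace before persisting.
function cleanText(raw: string): string {
  return raw
    .replace(/<[^>]+>/g, " ") // strip tags left inside text nodes
    .replace(/&nbsp;/g, " ")
    .replace(/\s+/g, " ")     // collapse whitespace and newlines
    .trim();
}

const parser = new XMLParser({ ignoreAttributes: false });

// Illustrative extraction only; swap in the real newsml29 element paths.
function extractArticle(xml: string): { title: string; summary: string; body: string } {
  const doc = parser.parse(xml);
  const item = doc?.newsItem ?? {};
  return {
    title: cleanText(String(item?.contentMeta?.headline ?? "")),
    summary: cleanText(String(item?.contentMeta?.description ?? "")),
    body: cleanText(String(item?.contentSet?.inlineXML ?? "")),
  };
}
```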

Consumer

  • Framework: NestJS standalone microservice (clusterable) to support horizontal scaling.
  • Behavior: subscribe to the raw_content_aa Kafka topic, apply zod validation, check Redis and Postgres for an existing result (aan_source_result-key-), and skip the model call if present (the per-message flow is sketched after this list).
  • Inference: call model runtime via gRPC with the required fields.
  • Preprocessing: consumer should lowercase incoming texts and strip HTML tags when present before sending to the model.
  • Categories: categories must be numerized (mapping shared between consumer and model). Ensure the numerization mapping is available to the consumer.
  • Post-processing: save inference result into Postgres and set aan_source_result-key- in Redis with 3h TTL.
  • Publish: emit processed_content_aa Kafka event with content id and result; rebroadcast via SSE when required by API.
  • Observability: capture processing latency per item and model connection failure rate.
  • Deployment: intend to run with Docker Compose in a cluster of 3 for dev; ensure Kafka heartbeat.
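
A sketch of the per-message consumer flow described above, assuming a zod schema for the compact payload; the injected redis/repo/model/producer interfaces, the predict method name, and how the result key is derived from the id are all assumptions.

```ts
import { z } from "zod";

// Compact payload the producer publishes on raw_content_aa.
const RawContent = z.object({ id: z.string(), baslik: z.string(), ozet: z.string() });

// Narrow interfaces for the injected dependencies; all names are assumptions.
interface Deps {
  redis: { get(k: string): Promise<string | null>; set(k: string, v: string, mode: "EX", ttl: number): Promise<unknown> };
  resultRepo: { findOneBy(w: object): Promise<object | null>; save(e: object): Promise<unknown> };
  model: { predict(req: { id: string; text: string }): Promise<{ category: number; score: number }> };
  producer: { send(r: { topic: string; messages: { key: string; value: string }[] }): Promise<unknown> };
}

const RESULT_TTL_SECONDS = 3 * 60 * 60; // 3h, matching the producer's dedup TTL

async function handleMessage(raw: string, { redis, resultRepo, model, producer }: Deps) {
  const msg = RawContent.parse(JSON.parse(raw)); // zod validation on the Kafka payload
  const resultKey = `aan_source_result_${msg.id}`;

  // Skip the model call when a result already exists: Redis first, then Postgres.
  if (await redis.get(resultKey)) return;
  if (await resultRepo.findOneBy({ contentId: msg.id })) return;

  // Preprocess as the model expects: lowercase and strip HTML tags.
  const text = `${msg.baslik} ${msg.ozet}`.toLowerCase().replace(/<[^>]+>/g, " ");

  const started = Date.now();
  const result = await model.predict({ id: msg.id, text }); // gRPC call
  const latencyMs = Date.now() - started; // per-item latency for analytics

  await resultRepo.save({ contentId: msg.id, category: result.category, latencyMs });
  await redis.set(resultKey, "1", "EX", RESULT_TTL_SECONDS);
  await producer.send({
    topic: "processed_content_aa",
    messages: [{ key: msg.id, value: JSON.stringify({ id: msg.id, category: result.category }) }],
  });
}
```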

Model

  • Endpoints: provide an HTTP /predict endpoint (FastAPI) for direct inference and a gRPC API for the consumer flow.
  • Concurrency: gRPC server may use asyncio or multiprocessing for throughput and to support batching; implement an async batcher to aggregate requests for efficient GPU use.
  • Offline image: Docker image build should support loading the trained model (e.g., AutoModel.from_pretrained) during the image build or at startup so the container can run offline if required.
  • Checkpoints & training workflow: support saving checkpoints to Google Drive (for Colab) and use PushToHubCallback / WandbCallback when pushing experiments to Hugging Face / W&B. Use AutoModelForSequenceClassification.from_pretrained to load model artifacts in the service.
  • Model choice: initial experiment with dbmdz/bert-base-turkish-cased (as noted in Raw).
  • Preprocessing: lowercase all texts and remove HTML tags before tokenization.
  • Data strategy: aim for balanced training sets; suggested staged sizes in Raw: 128 per class, then 256, then 512 (augmentation or synthetic generation may be used to speed early iterations). Randomize content between sets to vary tone.
  • Label handling: categories must be numerized and the same numeric mapping shared between consumer and model; separator for composite fields: "|=|" (as in Raw; a mapping sketch follows this list).
  • Dev/Colab integration: provide Colab/VSCode configuration and keep notebooks / scripts under version control; checkpoints can be copied to Google Drive during Colab runs.
  • Artifact/versioning: save model artifacts in safetensors/PyTorch format and include model_version, git commit, and training metadata. Support PushToHub as optional flow.
  • Scraping & tooling (optional): as noted in Raw, you may augment training data by scraping public news sites using a controlled crawler (ClawBot). If used, include a SKILL.md that documents allowed targets and crawling behavior to keep scraping under control. Provide a small CLI utility to manage scraped data (normalize, split, assign ids) so edits to large datasets remain reliable. Optionally add SearxNG support to find candidate sites for controlled scraping.
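
A minimal sketch of the shared numerization mapping mentioned above; the category names are placeholders and must match the labels the model was trained on. The "|=|" separator comes straight from the notes.

```ts
// Placeholder category names; the real set must match the model's training labels.
export const CATEGORY_TO_ID: Record<string, number> = {
  gundem: 0,
  ekonomi: 1,
  spor: 2,
};

export const ID_TO_CATEGORY = Object.fromEntries(
  Object.entries(CATEGORY_TO_ID).map(([name, id]) => [id, name] as const),
) as Record<number, string>;

// Composite-field separator from the notes.
export const FIELD_SEPARATOR = "|=|";

// Join title/summary/body into the single text the model consumes.
export function composeModelInput(parts: string[]): string {
  return parts.join(FIELD_SEPARATOR);
}

export function numerize(category: string): number {
  const id = CATEGORY_TO_ID[category];
  if (id === undefined) throw new Error(`unknown category: ${category}`);
  return id;
}
```

Keeping this mapping in a single shared package (or generated file) is what guarantees the consumer and the model agree on label ids.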

API + BetterAuth

  • Endpoints: implement a list endpoint (pagination, limit 16) with query support for date range, category, and title search; validate queries with zod (see the schema sketch after this list).
  • Detail endpoint: return content and prediction history.
  • Auth: simple magic-link PoC (BetterAuth) where the magic link is logged to console (no real email sending) for proof-of-concept authentication.
  • Editing: allow manual category changes; when an editor changes the category, save the new value while preserving the original model suggestion in the model_category field.
  • Pending workflow: provide pending list and ability to requeue selected items back to raw_content_aa via fetch_content_aa topic for batch processing.
  • Analytics: expose category weights, daily processed content count, model trust (median), and comparison between model suggestion and source category.
  • Integration: use TypeORM for Postgres; publish processed_content_aa events and support server-sent-events (SSE) to the panel for live updates.
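
A sketch of the list-endpoint query validation with zod; the parameter names are assumptions, while the limit cap of 16 comes from the notes.

```ts
import { z } from "zod";

// Query params arrive as strings, so coerce numerics and dates; names are assumptions.
export const ListQuerySchema = z.object({
  page: z.coerce.number().int().min(1).default(1),
  limit: z.coerce.number().int().min(1).max(16).default(16), // hard cap of 16 per the notes
  dateFrom: z.coerce.date().optional(),
  dateTo: z.coerce.date().optional(),
  category: z.string().optional(),
  title: z.string().min(1).optional(), // title search term
});

export type ListQuery = z.infer<typeof ListQuerySchema>;

// Example: ListQuerySchema.parse({ page: "2", limit: "16", title: "ekonomi" })
```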

Panel

  • Framework: TanStack Start (React) with UI components from Kumo UI.
  • Auth: magic-link login using BetterAuth client integration.
  • Data flow: proxy API endpoints under /api; panel consumes SSE for live updates and stores data in TanStack DB collection with local storage.
  • UI: TanStack Table for the content list, TanStack Form for filters and login; show a "new updates" button when SSE indicates new activity instead of auto re-rendering (see the hook sketch after this list).
  • Actions: support approving/rejecting model suggestions, batch requeue to fetch_content_aa, and view analytics dashboards.
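
A sketch of the deferred-re-render SSE pattern; the /api/events path, the payload shape, and the hook name are assumptions.

```tsx
import { useEffect, useRef, useState } from "react";

// Buffers SSE events instead of re-rendering the table on every message.
export function useNewUpdates(onApply: (pending: unknown[]) => void) {
  const pending = useRef<unknown[]>([]);
  const [count, setCount] = useState(0);

  useEffect(() => {
    const source = new EventSource("/api/events");
    source.onmessage = (event) => {
      pending.current.push(JSON.parse(event.data));
      setCount(pending.current.length); // only the badge re-renders, not the table
    };
    return () => source.close();
  }, []);

  const apply = () => {
    onApply(pending.current); // e.g. merge into the TanStack DB collection
    pending.current = [];
    setCount(0);
  };

  return { count, apply };
}

// Usage: const { count, apply } = useNewUpdates(mergeIntoCollection);
// {count > 0 && <button onClick={apply}>new updates ({count})</button>}
```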

Raw

Producer

  • might be nestjs standalone NestFactory.createMicroservice
  • get required content from AA API (https://api.aa.com.tr/abone/)
  • the aa api requires creds, which will be shared
  • fetching may need to follow the aa api's rate limiting.
  • on api response apply zod validation, use vanilla fetch lib.
  • use postgres with typeorm
  • id, baslik, ozet, icerik, kategori_aa, yayim_tarihi, kaynak_url, dil fields from the api should be enough; save them into postgres contents. until the api creds are shared, create mock data with those fields for testing and model training.
  • for dedup protection, sha256 -> id + source, saved into redis with a 3 hour ttl. key would be aan_source_-key-
  • on raw_content_aa kafka topic, share id, baslik, ozet fields
  • use task scheduling to keep fetching
  • follow the api's pagination, max 8 pages per run, and on scheduling continue until we reach already-existing contents
  • need analytics of
    • how many we fetched during the period.
    • how many contents hit the dedup check.
    • error counts for rate limiting, db
  • require docker compose
  • kafka auto heartbeat
  • manual trigger on a request to the fetch_content_aa kafka topic.
  • if no api creds are available, use mock contents: prepare 64 of them, randomize on each pick and change their ids. those mock contents need to differ from the model training data

updates after working on - 1

  • the api's fetching flow is 2-stepped: first gather metadata, then download the required contents. i will choose the newsml12 type on download.
  • we need 2 topics on the producer side. i don't want to gather and directly download; something might go wrong and we could lose track, so a separate topic to queue downloads is the better way.
  • i will investigate the xml response, then continue working on the producer side

updates after working on - 2

  • newsml29 implemented due to schema requirements; structured flow added to the download-content section.
  • added xml parser and text cleaner.

https://docs.nestjs.com/techniques/configuration https://docs.nestjs.com/standalone-applications https://www.pertek.bel.tr/dosya/ihale/dosyalar/aa-api.pdf https://docs.nestjs.com/microservices/kafka https://docs.nestjs.com/recipes/sql-typeorm https://docs.nestjs.com/microservices/redis https://docs.nestjs.com/techniques/task-scheduling

Consumer

  • might be nestjs standalone NestFactory.createMicroservice
  • could be clustered; needs to follow the consumer group for horizontal scaling.
  • check for a result in redis, then the postgres db, if already processed (aan_source_result-key-). if it exists, skip the model call
  • get content from the raw_content_aa kafka topic, apply zod validation, then pass the fields to the model over grpc; save the result into the db and save the key aan_source_result-key- into redis with a 3 hour ttl
  • get content id and result then share in processed_content_aa kafka topic.
  • require docker compose with a cluster of 3
  • kafka auto heartbeat
  • need analytics of
    • how many seconds it took to process the related content.
    • failure rate on model connections
  • categories need to be numerized before sending to the model.

https://docs.nestjs.com/techniques/configuration https://docs.nestjs.com/standalone-applications https://docs.nestjs.com/microservices/kafka https://docs.nestjs.com/microservices/grpc https://docs.nestjs.com/microservices/redis

Model

  • by case requirement add fastapi support with /predict endpoint.
  • add grpc support for consumer flow, we might require multiprocessing or asyncio
  • on docker compose, build might require AutoModel.from_pretrained to keep image offline
  • need a vs code -> colab configuration to keep colab files in git commits, with jupyter
  • checkpoints need to go to google drive
  • the model needs to go to the huggingface hub via PushToHubCallback from transformers; we might also need to use WandbCallback
  • on the python service we could use AutoModelForSequenceClassification.from_pretrained to get our model from huggingface
  • i want to try first with dbmdz/bert-base-turkish-cased
  • before passing through from the consumer service, we might need to:
    • lowercase all texts
    • remove html tags if they exist in the source
  • for training i need to pass balanced data, in 3 sets: first 128 per category, then 256, then 512. to save time we might use several ai models to generate those. we could randomize contents between sets so the tone would vary.
  • categories need to be numerized, shared with the consumer side.
  • separator -> "|=|"
  • we could scrape real news sites with clawbot using free models; with that we could generate 3x512 data sets. we might need to write a SKILL.md to keep clawbot's behavior under control. models have trouble editing big files, so we might build a mini-cli to keep the data structured and easy to add. for finding different websites we might add SearxNG support to clawbot.

https://grpc.io/docs/languages/python/basics/

API + BetterAuth

  • list endpoint, with pagination, limit 16. query with date range, category, title search. zod validation on queries
  • detail endpoint
  • handle auth on all endpoints with the better-auth integration, magic link only. the magic link just needs to be logged to the console; we should not send emails since it's only a poc.
  • manual category change on content, i.e. approving or rejecting the model's suggestion: save the new category into the category field but keep the old one in the model_category field.
  • pending list; we could re-add those into the kafka topic again as a batch action.
  • analytics of:
    • category weights
    • daily processed content count
    • model trust, median
    • comparing model suggestion category and given source category
  • typeorm connection to postgres
  • kafka topic processed_content_aa, rebroadcast over sse.
  • manual trigger endpoint, send ping into fetch_content_aa kafka topic.
  • serving producer and consumer analytics

https://docs.nestjs.com/recipes/router-module https://docs.nestjs.com/openapi/introduction https://better-auth.com/docs/integrations/nestjs https://github.com/thallesp/nestjs-better-auth https://docs.nestjs.com/recipes/sql-typeorm https://docs.nestjs.com/microservices/kafka https://docs.nestjs.com/techniques/server-sent-events

Panel

  • tanstack start framework
  • proxy to api endpoint, use api on "/api" path
  • simple magic link login with better auth
  • kumo ui (cloudflare ui kit) for ui components
  • tanstack form on table filtering and login form.
  • tanstack table on table rendering
  • sse connection feeds tanstack db collection with local storage
    • on new activity, for example on the contents table, show a "new updates" button to trigger a re-render; we should not auto re-render the table and charts.
  • on auth, fetch available contents from api endpoints

https://tanstack.com/start/latest/docs/framework/react/overview https://better-auth.com/docs/basic-usage#client-side https://kumo-ui.com https://tanstack.com/form/latest/docs/overview https://tanstack.com/table/latest/docs/introduction https://tanstack.com/db/latest/docs/overview