Model - Code Flow

This page explains the runtime sequence for both the training pipeline (train.py) and the inference service (server.py / app.py / grpc_server.py).

Training Pipeline

The full training pipeline is implemented in train.py and runs the following steps in order:

seed_everything()
  → W&B init
  → load_and_split()
  → verify_no_leakage()
  → tokenize_dataset()
  → load BertForSequenceClassification
  → Trainer.train()
  → evaluate_test()
  → save_artifacts()
  → (optional) copy to Google Drive
  → (optional) push_to_hub()

1. Seed

seed_everything(seed) sets seeds for random, numpy, torch, torch.cuda, PYTHONHASHSEED, and HuggingFace set_seed to ensure full reproducibility across runs.
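
A minimal sketch of what seed_everything() covers, per the list above (the default seed value is illustrative):

import os
import random

import numpy as np
import torch
from transformers import set_seed


def seed_everything(seed: int = 42) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash seed env var
    random.seed(seed)                         # stdlib RNG
    np.random.seed(seed)                      # numpy RNG
    torch.manual_seed(seed)                   # torch CPU RNG
    torch.cuda.manual_seed_all(seed)          # all CUDA devices
    set_seed(seed)                            # HuggingFace helper (re-seeds the above too)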

2. Data Loading and Splitting

load_and_split() loads data/data.jsonl from the mehmetraufoguz/turkish-news-dataset repository and applies three transformations:

  • SHA-256 deduplication — articles are fingerprinted by their baslik field; exact duplicates are dropped before splitting.
  • ClassLabel casting — the kategori column is cast to a ClassLabel feature so HuggingFace's train_test_split can stratify by class.
  • Two-step stratified split — first splits off 30% (val + test), then splits that 30% in half, yielding a 70 / 15 / 15 train / val / test ratio (see the sketch below).
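
A hedged sketch of the three transformations using HuggingFace datasets; the loading call and variable names are assumptions, the steps themselves come from the list above:

import hashlib

from datasets import ClassLabel, load_dataset


def load_and_split(seed: int = 42):
    ds = load_dataset(
        "mehmetraufoguz/turkish-news-dataset", data_files="data/data.jsonl", split="train"
    )

    # 1. SHA-256 dedup on the baslik field (single-process filter)
    seen = set()

    def is_new(example):
        h = hashlib.sha256(example["baslik"].encode("utf-8")).hexdigest()
        if h in seen:
            return False
        seen.add(h)
        return True

    ds = ds.filter(is_new)

    # 2. Cast kategori to ClassLabel so train_test_split can stratify on it
    ds = ds.cast_column("kategori", ClassLabel(names=sorted(set(ds["kategori"]))))

    # 3. Two-step stratified split: 70 / 30, then the 30% in half -> 70 / 15 / 15
    first = ds.train_test_split(test_size=0.30, stratify_by_column="kategori", seed=seed)
    second = first["test"].train_test_split(test_size=0.50, stratify_by_column="kategori", seed=seed)
    return first["train"], second["train"], second["test"]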

3. Leakage Verification

verify_no_leakage() checks all three splits for two kinds of overlap and raises RuntimeError on any hit:

  • Record ID overlap — compares unique_id sets across all split pairs.
  • Text fingerprint overlap — compares SHA-256 hashes of the combined baslik + ozet text across all split pairs.

This guarantees that neither the validation nor the test set contains records, by ID or by exact text, that the model saw during training.
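
A minimal sketch of the two checks, assuming each split exposes unique_id, baslik, and ozet columns:

import hashlib
from itertools import combinations


def verify_no_leakage(splits: dict) -> None:
    # e.g. verify_no_leakage({"train": train, "val": val, "test": test})
    def ids(ds):
        return set(ds["unique_id"])

    def fingerprints(ds):
        return {
            hashlib.sha256((b + o).encode("utf-8")).hexdigest()
            for b, o in zip(ds["baslik"], ds["ozet"])
        }

    for (name_a, a), (name_b, b) in combinations(splits.items(), 2):
        if ids(a) & ids(b):
            raise RuntimeError(f"unique_id overlap between {name_a} and {name_b}")
        if fingerprints(a) & fingerprints(b):
            raise RuntimeError(f"text fingerprint overlap between {name_a} and {name_b}")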

4. Preprocessing and Tokenization

tokenize_dataset() applies prepare_input(baslik, ozet) to every example, which in turn calls clean_and_lowercase() on each field (both are sketched after the list):

  1. clean_and_lowercase(text) — strips HTML tags, decodes HTML entities (&lt;, &amp;, &quot;, etc.), replaces \n, \r, and \t with a space, removes bracket characters, collapses whitespace, strips, and lowercases. Returns None if the result is empty.
  2. prepare_input(baslik, ozet) — joins the cleaned fields as "{baslik} |=| {ozet}". Empty fields become "" so the separator is always present.
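
A sketch of the pair under those rules; the regexes (including the assumed set of bracket characters) implement the listed steps and are not the exact source:

import html
import re


def clean_and_lowercase(text):
    if not text:
        return None
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = html.unescape(text)                  # decode &lt; &amp; &quot; ...
    text = re.sub(r"[\n\r\t]", " ", text)       # newlines/tabs -> space
    text = re.sub(r"[\[\]\(\)\{\}]", "", text)  # remove bracket characters (assumed set)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text or None                         # None if nothing survives


def prepare_input(baslik, ozet):
    b = clean_and_lowercase(baslik) or ""       # empty fields become ""
    o = clean_and_lowercase(ozet) or ""
    return f"{b} |=| {o}"                       # separator is always present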

This preprocessing mirrors shared/src/utils/text-cleaner.ts → cleanAndLowercase() exactly, ensuring consistent behaviour between training and production inference.

Tokenization uses the BertTokenizer with truncation=True, max_length=256, and no padding at this stage. Padding is applied dynamically per-batch by DataCollatorWithPadding.

The kategori column is renamed to labels to match the expected input for BertForSequenceClassification.
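
A sketch of tokenize_dataset() plus the dynamic-padding collator, assuming the splits are datasets.Dataset objects (prepare_input is from the sketch above):

from transformers import BertTokenizer, DataCollatorWithPadding

tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")


def tokenize_dataset(ds):
    def encode(example):
        text = prepare_input(example["baslik"], example["ozet"])
        return tokenizer(text, truncation=True, max_length=256)  # no padding here

    ds = ds.map(encode)
    return ds.rename_column("kategori", "labels")  # expected by BertForSequenceClassification


# Padding is applied per batch at train time, not during tokenization:
collator = DataCollatorWithPadding(tokenizer=tokenizer)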

5. Model

BertForSequenceClassification is loaded from dbmdz/bert-base-turkish-cased with num_labels=7. The id2label / label2id maps are injected into the model config at load time.
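
Loading as described above; class_names stands in for the seven category names carried by the ClassLabel feature from step 2:

from transformers import BertForSequenceClassification

class_names = [f"class_{i}" for i in range(7)]  # placeholder for the real category names
id2label = {i: name for i, name in enumerate(class_names)}
label2id = {name: i for i, name in id2label.items()}

model = BertForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased",
    num_labels=7,
    id2label=id2label,   # injected into model.config at load time
    label2id=label2id,
)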

Architecture overview:

Property               Value
Hidden layers          12
Attention heads        12
Hidden size            768
Vocabulary size        32 000
Max position embeds    512

6. Training Arguments and Evaluation

build_training_args() configures the HuggingFace Trainer:

  • Eval and save strategy: "epoch" (evaluate and checkpoint after each epoch)
  • Best model selection metric: eval_macro_f1
  • fp16=True on CUDA devices
  • EarlyStoppingCallback with patience 2 — training stops if eval_macro_f1 does not improve for 2 consecutive epochs (see the sketch below)
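
A hedged sketch of build_training_args(); only the four settings listed above are grounded, while batch size, epochs, and the rest are placeholders:

import torch
from transformers import EarlyStoppingCallback, TrainingArguments


def build_training_args(output_dir: str = "out") -> TrainingArguments:
    return TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",                  # "evaluation_strategy" on older transformers
        save_strategy="epoch",
        load_best_model_at_end=True,            # required for metric_for_best_model
        metric_for_best_model="eval_macro_f1",
        fp16=torch.cuda.is_available(),
        num_train_epochs=10,                    # placeholder
        per_device_train_batch_size=16,         # placeholder
        report_to="wandb",
    )


# Passed via Trainer(callbacks=[...]); stops after 2 epochs without improvement
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)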

compute_metrics(eval_pred) is called after each eval step and returns:

Metric        Method
accuracy      accuracy_score
macro_f1      Unweighted mean F1 across all 7 classes
weighted_f1   Class-frequency-weighted mean F1
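
A sketch of compute_metrics() consistent with the table:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
        "weighted_f1": f1_score(labels, preds, average="weighted"),
    }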

7. Test Evaluation and Artifact Saving

After training, evaluate_test() runs inference on the held-out test split, produces a full classification_report (sklearn), and logs all per-class and aggregate metrics to W&B under the test/ prefix.
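
A hedged sketch of evaluate_test(), assuming a fitted Trainer and the tokenized test split; the exact flattening of the report into W&B keys is an assumption:

import numpy as np
import wandb
from sklearn.metrics import classification_report


def evaluate_test(trainer, test_ds, class_names):
    out = trainer.predict(test_ds)
    y_pred = np.argmax(out.predictions, axis=-1)
    report = classification_report(
        out.label_ids, y_pred, target_names=class_names, output_dict=True
    )
    # Flatten per-class dicts and aggregates under the test/ prefix
    flat = {}
    for key, value in report.items():
        if isinstance(value, dict):
            flat.update({f"test/{key}_{m}": v for m, v in value.items()})
        else:
            flat[f"test/{key}"] = value  # e.g. test/accuracy
    wandb.log(flat)
    return report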

save_artifacts() writes the best checkpoint as the following files (sketched after the list):

  • model.safetensors — model weights in safe serialization format
  • tokenizer.* files — tokenizer vocabulary and configuration
  • label_map.json — id2label, label2id, num_labels, separator
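
A sketch matching the artifact list; the separator value comes from the preprocessing section:

import json
from pathlib import Path


def save_artifacts(model, tokenizer, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(out, safe_serialization=True)  # model.safetensors
    tokenizer.save_pretrained(out)                       # tokenizer.* files
    label_map = {
        "id2label": model.config.id2label,
        "label2id": model.config.label2id,
        "num_labels": model.config.num_labels,
        "separator": "|=|",
    }
    (out / "label_map.json").write_text(
        json.dumps(label_map, ensure_ascii=False, indent=2), encoding="utf-8"
    )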

Inference Flow

Combined Server (server.py)

The production entry point runs FastAPI and gRPC concurrently in a single Python process:

main()
  → _ModelSingleton.load(model_dir)    # weights loaded once
  → asyncio.TaskGroup
      ├── _run_grpc(port)              # grpc.aio async server
      └── _run_fastapi(port)           # uvicorn.Server (signal handlers disabled)

Both servers share the same _ModelSingleton instance. The model is loaded once before the TaskGroup starts; neither server begins accepting requests until the weights are fully loaded.
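
A minimal sketch of that startup sequence, assuming Python 3.11+ (asyncio.TaskGroup) and the app object exported by app.py; the ports, the model directory, and the singleton stub are placeholders:

import asyncio

import grpc.aio
import uvicorn


class _ModelSingleton:
    # Stub of the real singleton; load() deserializes the weights once.
    @classmethod
    def load(cls, model_dir: str) -> None:
        pass


async def _run_fastapi(port: int) -> None:
    config = uvicorn.Config("app:app", host="0.0.0.0", port=port)
    server = uvicorn.Server(config)
    server.install_signal_handlers = lambda: None  # main() owns signal handling
    await server.serve()


async def _run_grpc(port: int) -> None:
    server = grpc.aio.server()
    # add_ModelServiceServicer_to_server(servicer, server)  # generated stub
    server.add_insecure_port(f"[::]:{port}")
    await server.start()
    await server.wait_for_termination()


async def main() -> None:
    _ModelSingleton.load("model/")  # weights loaded once, before either server starts
    async with asyncio.TaskGroup() as tg:
        tg.create_task(_run_grpc(50051))
        tg.create_task(_run_fastapi(8000))


if __name__ == "__main__":
    asyncio.run(main())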

FastAPI Predict Path

POST /predict
  → validate PredictRequest (Pydantic: id, baslik, ozet)
  → prepare_input(baslik, ozet)          # HTML-strip + lowercase + "|=|" join
  → tokenizer(text, truncation, max_length=256)
  → model(**inputs)  [torch.no_grad()]
  → softmax(logits)
  → return PredictResponse {
        predicted_category,
        confidence,
        all_confidences
    }

The endpoint returns 503 if the model singleton is not loaded and 422 if the input is empty after preprocessing.
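
A hedged sketch of the endpoint behind that diagram. It assumes module-level model and tokenizer objects populated by _ModelSingleton.load(), plus prepare_input() from the preprocessing section; only the diagrammed steps and status codes are grounded:

import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
model = None      # populated once by _ModelSingleton.load()
tokenizer = None  # idem


class PredictRequest(BaseModel):
    id: str
    baslik: str
    ozet: str


@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    if model is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    text = prepare_input(req.baslik, req.ozet)  # HTML-strip + lowercase + "|=|" join
    if not text.replace("|=|", "").strip():
        raise HTTPException(status_code=422, detail="empty input after preprocessing")
    inputs = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    top = int(probs.argmax())
    id2label = model.config.id2label
    return {
        "predicted_category": id2label[top],
        "confidence": float(probs[top]),
        "all_confidences": {id2label[i]: float(p) for i, p in enumerate(probs)},
    }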

gRPC Predict Path

ModelService.Predict(PredictRequest { id, baslik, ozet })
  → _ModelSingleton.predict(baslik, ozet)
  → (same inference path as FastAPI)
  → return PredictResponse { predicted_category, confidence, all_confidences }

On error, the gRPC handler sets grpc.StatusCode.INTERNAL.
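
A sketch of the handler's error path; model_pb2 / model_pb2_grpc are assumed names for the generated protobuf stubs, and _ModelSingleton.predict() is assumed to return the response fields as a dict:

import grpc

import model_pb2
import model_pb2_grpc


class ModelService(model_pb2_grpc.ModelServiceServicer):
    def Predict(self, request, context):
        try:
            result = _ModelSingleton.predict(request.baslik, request.ozet)
            return model_pb2.PredictResponse(
                predicted_category=result["predicted_category"],
                confidence=result["confidence"],
                all_confidences=result["all_confidences"],
            )
        except Exception as exc:
            context.set_code(grpc.StatusCode.INTERNAL)  # per the note above
            context.set_details(str(exc))
            return model_pb2.PredictResponse()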

The standalone gRPC server (grpc_server.py) uses a ThreadPoolExecutor(max_workers=cpu_count) and handles graceful shutdown on SIGTERM / SIGINT with a 5-second grace period.
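
A sketch of grpc_server.py's lifecycle per that description; only the executor sizing, the two signals, and the 5-second grace period are grounded:

import os
import signal
from concurrent import futures

import grpc


def serve(port: int = 50051) -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=os.cpu_count()))
    # model_pb2_grpc.add_ModelServiceServicer_to_server(ModelService(), server)
    server.add_insecure_port(f"[::]:{port}")
    server.start()

    def _shutdown(signum, frame):
        server.stop(grace=5)  # stop accepting new RPCs; 5 s for in-flight ones

    signal.signal(signal.SIGTERM, _shutdown)
    signal.signal(signal.SIGINT, _shutdown)
    server.wait_for_termination()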