Sentiment Attribution & Methodology
Last updated: December 2025
PublicMoodTracker generates political sentiment scores using a reproducible, version-controlled AI pipeline. This document explains how scores are calculated, what they represent, and what their known limitations are. It is published in compliance with emerging AI transparency obligations and as part of a good-faith commitment to responsible AI deployment in Kenya's democratic context.
1. What a Sentiment Score Represents
A PublicMoodTracker sentiment score for a politician represents the weighted average sentiment of public media discourse about that individual over a defined time window (24 hours, 7 days, or 30 days). It is derived from news articles, social media posts, and broadcast content — not from direct polling, surveys, or voter intention data.
A score of +0.72 (on a scale from −1.0 to +1.0) means that, on average, public media coverage of this politician is moderately positive. It does not mean that 72% of Kenyans support them.
2. Data Ingestion Pipeline
2.1 Sources (9 active scrapers)
| Source | Type | Language | Update Frequency |
|---|---|---|---|
| Nation Media Group | Online news | English / Swahili | Every 15 min |
| Standard Group Digital | Online news | English | Every 15 min |
| Citizen Digital (RMS) | Online news | English / Swahili | Every 15 min |
| Tuko.co.ke | Online tabloid | English / Swahili | Every 15 min |
| The Star Kenya | Online news | English | Every 15 min |
| Twitter / X (Kenyan political) | Social media | English / Swahili / Sheng | Real-time stream |
| YouTube (KE political channels) | Video comments | English / Swahili / Sheng | Every 30 min |
| The Elephant (blog) | Long-form analysis | English | Daily |
| Parliamentary Hansard | Official record | English | Weekly (session days) |
2.2 Content Preprocessing
- De-duplication: Near-duplicate articles (cosine similarity > 0.92) are collapsed to prevent single stories dominating scores.
- Spam filtering: Content flagged by our spam classifier (precision: 94%) is excluded.
- Sentence segmentation: Articles are split into sentences; sentiment is scored per sentence, not per article.
- Language detection: The langdetect library classifies language; Swahili and Sheng trigger the multilingual model path.
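The de-duplication step above can be illustrated with a minimal sketch. The source states only the cosine-similarity threshold (0.92); the bag-of-words vectors, the keep-the-first-article policy, and the function names `vectorize`, `cosine_similarity`, and `collapse_near_duplicates` are assumptions made for illustration, not the production implementation.

```python
import math
import re
from collections import Counter

SIMILARITY_THRESHOLD = 0.92  # threshold stated in Section 2.2


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts (an assumed representation, not confirmed by the source)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def collapse_near_duplicates(articles: list) -> list:
    """Keep the first article of every near-duplicate group (assumed policy)."""
    kept = []  # list of (text, vector) pairs already accepted
    for text in articles:
        vec = vectorize(text)
        is_duplicate = any(
            cosine_similarity(vec, kept_vec) > SIMILARITY_THRESHOLD
            for _, kept_vec in kept
        )
        if not is_duplicate:
            kept.append((text, vec))
    return [text for text, _ in kept]


articles = [
    "The governor opened a new road in Kisumu today",
    "The governor opened a new road in Kisumu today.",
    "Parliament debated the finance bill",
]
unique_articles = collapse_near_duplicates(articles)
```

In this example the second article differs from the first only by punctuation, so it is collapsed, while the unrelated third article is kept. The pairwise comparison is quadratic in the number of articles, which is acceptable for a sketch but would likely need an approximate index at production volume.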
3. Named Entity Recognition (NER)
Our custom NER model identifies Kenyan political entities within text: politicians, counties, parties, and institutions. It is fine-tuned from bert-base-multilingual-cased on a labelled Kenyan political corpus of 28,000 sentences.
- Mention linking: Aliases are resolved (e.g., "DP Ruto", "His Excellency Ruto", "Hustler" → William Ruto).
- Ambiguity handling: Ambiguous names (e.g., common surnames) require county or party context; unresolvable ambiguities are discarded.
- Minimum threshold: A politician requires ≥3 mentions within a scoring window to appear in rankings.
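The mention-linking and minimum-threshold rules above can be sketched as follows. A plain dictionary lookup stands in for the NER model here, which is a deliberate simplification: the `ALIASES` table and the function names are hypothetical, only the three Ruto aliases come from the text, and context-based disambiguation by county or party is not modelled.

```python
from collections import Counter
from typing import Optional

# Hypothetical alias table; only these example aliases appear in the source.
ALIASES = {
    "dp ruto": "William Ruto",
    "his excellency ruto": "William Ruto",
    "hustler": "William Ruto",
}
MIN_MENTIONS = 3  # minimum mentions per scoring window (Section 3)


def link_mention(surface_form: str) -> Optional[str]:
    """Resolve an alias to a canonical name, or None if it cannot be resolved."""
    return ALIASES.get(surface_form.strip().lower())


def rankable_politicians(surface_forms: list) -> dict:
    """Count resolved mentions and keep politicians that meet the threshold."""
    counts = Counter()
    for form in surface_forms:
        canonical = link_mention(form)
        if canonical is not None:  # unresolvable mentions are discarded
            counts[canonical] += 1
    return {name: n for name, n in counts.items() if n >= MIN_MENTIONS}


window = ["DP Ruto", "Hustler", "His Excellency Ruto", "Unknown Person"]
ranked = rankable_politicians(window)
```

Here three aliases resolve to the same politician, so the threshold of three mentions is met; the unresolved fourth mention is dropped rather than guessed.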
4. Sentiment Classification Model
4.1 Base Model
XLM-RoBERTa-large (Phase 13.1) — a transformer model supporting 100 languages, fine-tuned on 12,000 labelled Kenyan political sentences (English, Swahili, Sheng). Labels: POSITIVE, NEGATIVE, NEUTRAL.
4.2 Custom Lexicon Enhancement
A supplemental lexicon of 40 Swahili political terms and 150 Sheng political terms (e.g., "hustler", "dynasty", "kamwana") modifies token embeddings before classification to improve accuracy for Kenyan-specific discourse.
4.3 Output
Each classified mention produces:
- sentiment_label: POSITIVE / NEGATIVE / NEUTRAL
- confidence: 0.0–1.0 (model softmax probability)
- raw_score: +1.0 (positive), −1.0 (negative), 0.0 (neutral)
5. Score Aggregation Formula
5.1 Weighted Average Score
The displayed score for a politician over a time window is a confidence-weighted mean:
score = Σ(raw_score_i × confidence_i) / Σ(confidence_i)
Mentions with confidence < 0.55 are excluded to reduce noise from ambiguous text.
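A minimal sketch of the confidence-weighted mean may make the formula concrete. The `Mention` class, the `weighted_score` function, and the choice to return `None` when no mention survives the filter are assumptions for illustration; the field names, the raw-score mapping, and the 0.55 cut-off are taken from Sections 4.3 and 5.1.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_FLOOR = 0.55  # mentions below this confidence are excluded
RAW_SCORES = {"POSITIVE": 1.0, "NEGATIVE": -1.0, "NEUTRAL": 0.0}


@dataclass
class Mention:
    sentiment_label: str  # POSITIVE / NEGATIVE / NEUTRAL
    confidence: float     # model softmax probability, 0.0 to 1.0

    @property
    def raw_score(self) -> float:
        return RAW_SCORES[self.sentiment_label]


def weighted_score(mentions: list) -> Optional[float]:
    """Confidence-weighted mean of raw scores over sufficiently confident mentions."""
    kept = [m for m in mentions if m.confidence >= CONFIDENCE_FLOOR]
    total_confidence = sum(m.confidence for m in kept)
    if total_confidence == 0:
        return None  # assumed behaviour: no usable mentions, no score
    return sum(m.raw_score * m.confidence for m in kept) / total_confidence


mentions = [
    Mention("POSITIVE", 0.9),
    Mention("NEGATIVE", 0.6),
    Mention("NEUTRAL", 0.5),   # excluded: below 0.55
    Mention("POSITIVE", 0.5),  # excluded: below 0.55
]
score = weighted_score(mentions)
```

With these inputs only the first two mentions pass the filter, so the score is (0.9 − 0.6) / (0.9 + 0.6), or about +0.2. Note that a confident neutral mention adds to the denominator without adding to the numerator, which pulls the score toward zero.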
5.2 Polarity Confidence
Measures how consistently directional public discourse is (used in the Rivals comparison table):
polarity_confidence = |positive_count − negative_count| / total_mentions
A value near 1.0 means near-unanimous sentiment; near 0.0 means deeply polarised.
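The polarity formula can be sketched in a few lines. The function name and the return value of 0.0 for an empty window are assumptions; the source also does not say whether the 0.55 confidence filter is applied before counting, so this sketch simply counts whatever labels it is given.

```python
def polarity_confidence(labels: list) -> float:
    """|positive_count - negative_count| / total_mentions for a list of labels."""
    total_mentions = len(labels)
    if total_mentions == 0:
        return 0.0  # assumed behaviour for an empty window
    positive_count = labels.count("POSITIVE")
    negative_count = labels.count("NEGATIVE")
    return abs(positive_count - negative_count) / total_mentions


mostly_positive = ["POSITIVE"] * 8 + ["NEGATIVE"] + ["NEUTRAL"]
evenly_split = ["POSITIVE"] * 5 + ["NEGATIVE"] * 5

one_sided = polarity_confidence(mostly_positive)
polarised = polarity_confidence(evenly_split)
```

The first window gives |8 − 1| / 10 = 0.7 and the evenly split window gives 0.0. Because neutral mentions count toward total_mentions, a window dominated by neutral coverage also produces a value near 0.0 under this formula.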
5.3 Mention Velocity (Spike Detection)
velocity = mentions_in_last_hour / avg_mentions_per_hour_last_7d
spike_flag = True if velocity ≥ 3.0 and |score_delta_1h| ≥ 0.4
6. Score Display Rules
| Condition | Display Behaviour |
|---|---|
| < 3 mentions in window | Score hidden; shown as "Insufficient data" |
| 3–9 mentions | Score shown with ⚠ low-confidence warning |
| 10–49 mentions | Score shown with confidence label |
| ≥ 50 mentions | Full score with high-confidence badge |
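The display rules in the table can be expressed as a simple threshold ladder. This is a sketch only: the function name and the returned tier strings are hypothetical labels, while the boundaries (3, 10, and 50 mentions) come from the table.

```python
def display_behaviour(mention_count: int) -> str:
    """Map the number of mentions in a scoring window to a display tier."""
    if mention_count < 3:
        return "hidden: Insufficient data"
    if mention_count < 10:
        return "score with low-confidence warning"
    if mention_count < 50:
        return "score with confidence label"
    return "full score with high-confidence badge"


# Boundary values on either side of each threshold in the table.
tiers = {count: display_behaviour(count) for count in (2, 3, 9, 10, 49, 50)}
```

Checking the values on either side of each boundary is a cheap way to confirm that the ladder matches the table's inclusive and exclusive ranges.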
7. Known Limitations
- Media ≠ public opinion: Scores reflect what journalists and social media users write, not what the broader public thinks.
- Sheng coverage gaps: Rapidly evolving Sheng slang may lag our lexicon updates by 2–4 weeks.
- Satire misclassification: The model may misread satirical content as genuine sentiment; estimated error rate ~3%.
- Source bias: Not all Kenyan media outlets are included; sources skew toward urban, online audiences.
- Coordinated inauthentic behaviour: Social media manipulation (bot networks) can temporarily distort scores; we apply velocity filters but cannot guarantee full detection.
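The velocity filter mentioned in the last limitation relies on the quantities defined in Section 5.3. A minimal sketch, assuming the thresholds given there (velocity of at least 3.0 and an absolute one-hour score change of at least 0.4): the function names and the handling of a zero baseline are assumptions, and what the pipeline does with a flagged spike is not specified in this document.

```python
VELOCITY_THRESHOLD = 3.0     # velocity threshold from Section 5.3
SCORE_DELTA_THRESHOLD = 0.4  # |score_delta_1h| threshold from Section 5.3


def mention_velocity(mentions_in_last_hour: int, avg_mentions_per_hour_last_7d: float) -> float:
    """Ratio of the last hour's mentions to the 7-day hourly average."""
    if avg_mentions_per_hour_last_7d <= 0:
        return 0.0  # assumed: with no baseline, no spike is reported
    return mentions_in_last_hour / avg_mentions_per_hour_last_7d


def spike_flag(velocity: float, score_delta_1h: float) -> bool:
    """True when volume and score both move sharply within the hour."""
    return velocity >= VELOCITY_THRESHOLD and abs(score_delta_1h) >= SCORE_DELTA_THRESHOLD


velocity = mention_velocity(mentions_in_last_hour=30, avg_mentions_per_hour_last_7d=10.0)
flagged = spike_flag(velocity, score_delta_1h=-0.5)
```

In this example, mentions run at three times the weekly baseline while the score drops by 0.5 in an hour, so the window is flagged. Requiring both conditions means that a surge in volume with a stable score, or a score swing on normal volume, is not treated as a spike.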
8. Model Versioning and Change Log
| Version | Date | Key Changes |
|---|---|---|
| Phase 13.1 | Nov 2025 | Sheng lexicon expanded to 150 terms; Hansard source added |
| Phase 12.0 | Jun 2025 | XLM-RoBERTa-large (replaced base); confidence threshold raised to 0.55 |
| Phase 11.3 | Jan 2025 | County-level NER improved; duplicate detection threshold tuned |
9. Requesting Source Evidence
Any registered user may click the transparency icon (🔍) on a politician's profile to view the source articles contributing to the current score. Users with a Daily Access subscription can export these citations as a reference list.
Academic researchers may request full methodology documentation by contacting research@siasaiq.com.