Sentiment Attribution & Methodology

Last updated: December 2025

Transparency commitment: Every sentiment score on PublicMoodTracker is traceable to its source articles and model outputs. This document provides full methodological transparency so users can critically evaluate and contextualise the data they see.

PublicMoodTracker generates political sentiment scores using a reproducible, version-controlled AI pipeline. This document explains how scores are calculated, what they represent, and their known limitations, in line with emerging AI transparency obligations and as a good-faith commitment to responsible AI deployment in Kenya's democratic context.

1. What a Sentiment Score Represents

A PublicMoodTracker sentiment score for a politician represents the weighted average sentiment of public media discourse about that individual over a defined time window (24 hours, 7 days, or 30 days). It is derived from news articles, social media posts, and broadcast content — not from direct polling, surveys, or voter intention data.

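Scores range from −1.0 (uniformly negative coverage) to +1.0 (uniformly positive coverage). A score of +0.72 means that, on average, public media coverage of this politician is moderately positive. It does not mean that 72% of Kenyans support them.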

2. Data Ingestion Pipeline

2.1 Sources (9 active scrapers)

Source	Type	Language	Update Frequency
Nation Media Group	Online news	English / Swahili	Every 15 min
Standard Group Digital	Online news	English	Every 15 min
Citizen Digital (RMS)	Online news	English / Swahili	Every 15 min
Tuko.co.ke	Online tabloid	English / Swahili	Every 15 min
The Star Kenya	Online news	English	Every 15 min
Twitter / X (Kenyan political)	Social media	English / Swahili / Sheng	Real-time stream
YouTube (KE political channels)	Video comments	English / Swahili / Sheng	Every 30 min
The Elephant (blog)	Long-form analysis	English	Daily
Parliamentary Hansard	Official record	English	Weekly (session days)

2.2 Content Preprocessing

De-duplication: Near-duplicate articles (cosine similarity > 0.92) are collapsed to prevent single stories dominating scores.
Spam filtering: Content flagged by our spam classifier (precision: 94%) is excluded.
Sentence segmentation: Articles are split into sentences; sentiment is scored per sentence, not per article.
Language detection: The langdetect library classifies the language of each text; Swahili and Sheng content is routed to the multilingual model path.
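
The de-duplication rule can be illustrated with a minimal sketch. It assumes a plain bag-of-words vector for each article, because the vectorisation used in the production pipeline is not specified in this document; the function names are hypothetical, and only the 0.92 threshold comes from the rule above.

```python
import math
import re
from collections import Counter

DEDUP_THRESHOLD = 0.92  # similarity threshold stated in section 2.2


def _term_counts(text: str) -> Counter:
    # Assumption: lowercase word counts stand in for the real vectoriser.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = _term_counts(a), _term_counts(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def collapse_near_duplicates(articles: list[str]) -> list[str]:
    """Keep the first article of each near-duplicate group, in input order."""
    kept: list[str] = []
    for article in articles:
        # An article survives only if it is not too similar to any kept one.
        if all(cosine_similarity(article, k) <= DEDUP_THRESHOLD for k in kept):
            kept.append(article)
    return kept
```

This pairwise comparison is quadratic in the number of articles, so it is suitable only for illustrating the rule on small batches.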

3. Named Entity Recognition (NER)

Our custom NER model identifies Kenyan political entities within text: politicians, counties, parties, and institutions. It is fine-tuned from bert-base-multilingual-cased on a labelled Kenyan political corpus of 28,000 sentences.

Mention linking: Aliases are resolved (e.g., "DP Ruto", "His Excellency Ruto", "Hustler" → William Ruto).
Ambiguity handling: Ambiguous names (e.g., common surnames) require county or party context; unresolvable ambiguities are discarded.
Minimum threshold: A politician requires ≥3 mentions within a scoring window to appear in rankings.
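
Mention linking and the minimum-mention threshold can be sketched as follows. This is an illustration only: the alias table is hypothetical and contains just the three example aliases listed above, and context-based disambiguation by county or party is not modelled.

```python
from typing import Optional

# Hypothetical alias table built from the examples in this section.
ALIASES = {
    "dp ruto": "William Ruto",
    "his excellency ruto": "William Ruto",
    "hustler": "William Ruto",
}

MIN_MENTIONS = 3  # minimum mentions per scoring window to appear in rankings


def resolve_mention(surface_form: str) -> Optional[str]:
    """Return the canonical name, or None when the mention cannot be resolved."""
    return ALIASES.get(surface_form.strip().lower())


def rankable(mention_counts: dict[str, int]) -> list[str]:
    """Names with enough mentions in the window to appear in rankings."""
    return sorted(name for name, n in mention_counts.items() if n >= MIN_MENTIONS)
```

A mention that resolves to None corresponds to the "discarded" case above and would not be counted toward any politician.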

4. Sentiment Classification Model

4.1 Base Model

XLM-RoBERTa-large (Phase 13.1) — a transformer model supporting 100 languages, fine-tuned on 12,000 labelled Kenyan political sentences (English, Swahili, Sheng). Labels: POSITIVE, NEGATIVE, NEUTRAL.

4.2 Custom Lexicon Enhancement

A supplemental lexicon of 40 Swahili political terms and 150 Sheng political terms (e.g., "hustler", "dynasty", "kamwana") modifies token embeddings before classification to improve accuracy for Kenyan-specific discourse.

4.3 Output

Each classified mention produces:

sentiment_label: POSITIVE / NEGATIVE / NEUTRAL
confidence: 0.0–1.0 (model softmax probability)
raw_score: +1.0 (positive), −1.0 (negative), 0.0 (neutral)
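
One way to represent this output is a small record type in which raw_score is derived from the label. The class name and the validation behaviour are assumptions made for illustration; the three fields and their value ranges are the ones listed above.

```python
from dataclasses import dataclass

# Label-to-score mapping as listed in section 4.3.
RAW_SCORES = {"POSITIVE": 1.0, "NEGATIVE": -1.0, "NEUTRAL": 0.0}


@dataclass(frozen=True)
class Mention:
    """Hypothetical container for one classified mention."""

    sentiment_label: str
    confidence: float  # model softmax probability, 0.0 to 1.0

    def __post_init__(self) -> None:
        # Assumption: invalid values are rejected rather than silently kept.
        if self.sentiment_label not in RAW_SCORES:
            raise ValueError(f"unknown label: {self.sentiment_label}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0.0, 1.0]")

    @property
    def raw_score(self) -> float:
        return RAW_SCORES[self.sentiment_label]
```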

5. Score Aggregation Formula

5.1 Weighted Average Score

The displayed score for a politician over a time window is a confidence-weighted mean:

score = Σ(raw_score_i × confidence_i) / Σ(confidence_i)

Mentions with confidence < 0.55 are excluded to reduce noise from ambiguous text.
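
A minimal sketch of this calculation, assuming each mention is passed as a (raw_score, confidence) pair. The function name and the choice to return None when no mention passes the filter are assumptions, not part of the documented pipeline.

```python
from typing import Optional

MIN_CONFIDENCE = 0.55  # mentions below this confidence are excluded


def weighted_score(mentions: list[tuple[float, float]]) -> Optional[float]:
    """Confidence-weighted mean of raw scores over (raw_score, confidence) pairs."""
    kept = [(raw, conf) for raw, conf in mentions if conf >= MIN_CONFIDENCE]
    total_confidence = sum(conf for _, conf in kept)
    if total_confidence == 0:
        return None  # assumption: no usable mentions means no score
    return sum(raw * conf for raw, conf in kept) / total_confidence
```

For example, the mentions (+1.0, 0.9), (−1.0, 0.6), (0.0, 0.7) and (+1.0, 0.4) give (0.9 − 0.6 + 0.0) / (0.9 + 0.6 + 0.7) ≈ 0.14, because the last mention falls below the 0.55 threshold. Neutral mentions contribute to the denominator only, so they pull the score toward zero.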

5.2 Polarity Confidence

Measures how consistently directional public discourse is (used in the Rivals comparison table):

polarity_confidence = |positive_count − negative_count| / total_mentions

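A value near 1.0 means sentiment is almost entirely one-directional. A value near 0.0 means positive and negative mentions are roughly balanced (polarised discourse) or, where neutral mentions count toward total_mentions, that most mentions are neutral.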
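
The formula can be sketched as below. The sketch assumes that total_mentions includes neutral mentions, which this document does not state explicitly; the function name and the zero-mention behaviour are also assumptions.

```python
def polarity_confidence(positive: int, negative: int, neutral: int = 0) -> float:
    """|positive - negative| / total mentions, with total assumed to include neutrals."""
    total = positive + negative + neutral
    if total == 0:
        return 0.0  # assumption: no mentions is reported as zero
    return abs(positive - negative) / total
```

Under this assumption, 9 positive and 1 negative mention give 0.8, 5 of each give 0.0, and 1 positive with 9 neutral mentions gives 0.1.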

5.3 Mention Velocity (Spike Detection)

velocity = mentions_in_last_hour / avg_mentions_per_hour_last_7d
spike_flag = True if velocity ≥ 3.0 and |score_delta_1h| ≥ 0.4
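
A minimal sketch of the spike rule, using the two thresholds given above. The function names and the handling of a zero 7-day baseline are assumptions.

```python
VELOCITY_THRESHOLD = 3.0  # last-hour mentions relative to the 7-day hourly average
DELTA_THRESHOLD = 0.4     # minimum absolute score change over the last hour


def velocity(mentions_last_hour: int, avg_per_hour_7d: float) -> float:
    if avg_per_hour_7d <= 0:
        return 0.0  # assumption: with no baseline, no velocity is reported
    return mentions_last_hour / avg_per_hour_7d


def spike_flag(
    mentions_last_hour: int, avg_per_hour_7d: float, score_delta_1h: float
) -> bool:
    """True only when both the volume and the score-change conditions hold."""
    return (
        velocity(mentions_last_hour, avg_per_hour_7d) >= VELOCITY_THRESHOLD
        and abs(score_delta_1h) >= DELTA_THRESHOLD
    )
```

Requiring both conditions means a surge in volume with a stable score, or a sharp score change on ordinary volume, does not raise the flag.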

6. Score Display Rules

Condition	Display Behaviour
< 3 mentions in window	Score hidden; shown as "Insufficient data"
3–9 mentions	Score shown with ⚠ low-confidence warning
10–49 mentions	Score shown with confidence label
≥ 50 mentions	Full score with high-confidence badge
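
The table above maps directly onto a small lookup. The function name and the returned strings are illustrative only, not the labels used in the interface.

```python
def display_tier(mention_count: int) -> str:
    """Map the number of mentions in a scoring window to a display behaviour."""
    if mention_count < 3:
        return "Insufficient data"
    if mention_count < 10:
        return "Score with low-confidence warning"
    if mention_count < 50:
        return "Score with confidence label"
    return "Full score with high-confidence badge"
```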

7. Known Limitations

Media ≠ public opinion: Scores reflect what journalists and social users write, not what the broader public thinks.
Sheng coverage gaps: Rapidly evolving Sheng slang may lag our lexicon updates by 2–4 weeks.
Satire misclassification: The model may misread satirical content as genuine sentiment; estimated error rate ~3%.
Source bias: Not all Kenyan media outlets are included; sources skew toward urban, online audiences.
Coordinated inauthentic behaviour: Social media manipulation (bot networks) can temporarily distort scores; we apply velocity filters but cannot guarantee full detection.

8. Model Versioning and Change Log

Version	Date	Key Changes
Phase 13.1	Nov 2025	Sheng lexicon expanded to 150 terms; Hansard source added
Phase 12.0	Jun 2025	XLM-RoBERTa-large (replaced base); confidence threshold raised to 0.55
Phase 11.3	Jan 2025	County-level NER improved; duplicate detection threshold tuned

9. Requesting Source Evidence

Any registered user may click the transparency icon (🔍) on a politician's profile to view the source articles contributing to the current score. Users with a Daily Access subscription can export these citations as a reference list.

Academic researchers may request full methodology documentation by contacting research@siasaiq.com.