When an evaluation spans 15 countries and six humanitarian contexts, processing evidence manually has limits. AI can help close the gap. Here is how it can be done responsibly.
Evaluations at scale sift through enormous volumes of information to build their evidence base. The independent evaluation of UNFPA's capacity in humanitarian action 2019–2025 was no exception: the evaluation team reviewed more than 1,500 documents and publications across 15 countries (a significant proportion in French and Spanish) and conducted over 600 interviews and focus group discussions with staff, partners, and community members spanning six distinct humanitarian contexts. Processing this evidence base manually, at the pace and depth required, would have constrained both the breadth and the rigour of the evaluation. This created a clear methodological opportunity: to explore whether artificial intelligence (AI) tools could enhance the work of experienced evaluators, and whether they could do so responsibly.
The evaluation team developed a deliberate, ethics-first strategy before applying any AI tool. The team tested two tools, InsightWise and Google NotebookLM, and selected NotebookLM for the majority of the analytical work. The selection reflected the tool's ease of use and its integration with Google Translate, enabling seamless processing of French and Spanish source material alongside English-language documents. It also aligned with the IEO's institutional preference for solutions within the Google ecosystem, which meets the office's information security and data privacy requirements. This multilingual capability was critical: four of the 15 sampled countries were francophone and three were Spanish-speaking, and ensuring their evidence contributed equitably to the analysis was a methodological imperative.
Evaluators applied AI across three stages of the evaluation. In the data analysis phase, the team removed all personal identifiers from key informant interview and focus group transcripts before ingesting any data into the platform. The tool then generated concise summaries and identified recurring themes, patterns and emergent evidence across a very large transcript dataset. This complemented manual coding by the evaluation team against a pre-agreed evaluation matrix. In the secondary data review phase, AI enabled structured scanning and extraction across the 1,500-document evidence base, drawing out relevant passages organized by country, evaluation assumption and timeframe. In the synthesis phase, AI assisted with cross-country comparisons and flagged points of convergence and divergence between data sources, signaling agreement or tension that warranted deeper evaluator scrutiny.
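For teams wondering what removing personal identifiers before ingestion might look like in practice, the following is a minimal Python sketch. It is not the evaluation team's actual procedure, which the article does not describe. The patterns, the `redact` function and the sample text are all hypothetical, and pattern matching alone would miss many identifiers, so a human reviewer would still need to check every transcript.

```python
import re

# Hypothetical patterns for illustration only; a real evaluation would need
# a reviewed, context-specific list of identifiers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str, known_names: list[str]) -> str:
    """Replace emails, phone numbers and listed names with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    # Replace longer names first so a full name is handled before its parts.
    for name in sorted(known_names, key=len, reverse=True):
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, "[NAME]", text, flags=re.IGNORECASE)
    return text


# Invented sample text, not drawn from any real transcript.
sample = "Jane Doe (jane.doe@example.org, +1 555 010 9999) described the response."
print(redact(sample, ["Jane Doe"]))
```

The list of names is supplied by the person preparing the transcript, which keeps a human decision at the centre of the step rather than relying on automated name detection.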
The evaluation team embedded human oversight at every stage, ensuring that human validation remained the final, authoritative word on all outputs. To maintain this standard, evaluators reviewed all AI-generated outputs, checked findings directly against primary and secondary sources, and corrected or discarded any results that were inaccurate or reflected bias. Two workshops provided external quality control: an analysis workshop with evaluation managers immediately after data collection, and a co-creation workshop where the full Evaluation Reference Group critically reviewed AI-supported findings. The IEO's strategy on GenAI-powered evaluation function at UNFPA and the UNEG Ethical Principles for Harnessing AI in United Nations Evaluations (2025) governed the process throughout. UNFPA's enterprise agreement with Google further protected data security, ensuring that sensitive evaluation data, even in anonymized form, was neither retained nor used beyond the immediate analytical task. In keeping with IEO transparency requirements, the evaluation report also includes an annex documenting exactly how AI was applied at each stage, which tools were used, and how the team upheld ethical and responsible use. This level of explainability strengthens the credibility of the findings and offers a practical resource for other evaluation teams working in this growing area.
The results were tangible. AI cut the time needed to process and structure a dataset that would otherwise have demanded weeks of sequential manual review, delivering a verified cost saving of $12,000 on the evaluation and freeing evaluators to concentrate on interpretation, validation and professional judgement. Multilingual processing ensured that evidence from contexts that might otherwise have been deprioritized due to language constraints shaped the synthesis on equal terms. The systematic cross-checking of data sources also strengthened the triangulation that underpins credible evaluation findings.
Even so, the experience surfaced a clear lesson. The AI tools currently available to UNFPA are exceptionally efficient at extracting and summarizing large volumes of information, but the resulting analysis can be superficial. Complex, nuanced interpretation still depends on the cognitive depth and institutional expertise of evaluators, who transform raw summaries into meaningful insights.
The humanitarian evaluation reflects an institutional commitment to responsible AI, not a one-off experiment. Published in 2024, the GenAI-powered evaluation strategy sets out six core principles and a phased implementation roadmap to embed responsible AI practices across evaluation processes. The strategy prioritizes a demand-driven approach (use AI only where it genuinely improves evaluation quality and efficiency) alongside a human rights-based approach that foregrounds transparency, fairness and the principle of leaving no one behind. The humanitarian evaluation is among the first centralized evaluations to put these principles into practice at scale.
The evaluation insights AI helped surface are now being translated into action. On 28-29 April 2026, the IEO hosted the Utilization Lab, a facilitated space designed to turn evaluation lessons into practical regional and country action plans. Drawing on examples from Bangladesh, Colombia, Egypt, Moldova, and Uganda, the Lab helped regional and country offices identify priority actions and brainstorm context-specific mechanisms for integrating the evaluation's lessons directly into humanitarian decision-making and programming.
This article was written with AI support, with human authors in the lead.