The Unseen Hand: Data Science Is What Makes RAG and Agentic AI Work
Published: September 2025
By Amy Humke, Ph.D.
Founder, Critical Influence
Why is the data science so hard to see?
Over the past year of going deep on agentic RAG, one question kept nagging me: Where is the data science? The NLP pieces in ingestion/preprocessing are obviously data science. After that, it gets harder to see. With public, affordable libraries, you can deploy a working RAG stack without training a single model yourself: load content, use an off-the-shelf embedding service (or start with plain keyword search), retrieve, and send the context to a foundation model. That's part of why it feels like assembly instead of analysis. The ease is seductive. It also hides where the scientific work should live.
Myth: RAG is mostly pipelines and APIs
Reality: The hard parts are data science
Evaluation
Start with evaluation, not the diagram. Too many tutorials and trainings leave evaluation to the end and start with wiring. In real life, if you want something that launches and keeps trust, you think about the end first. How will you know it works? Build a labeled dataset (a small "gold set") with questions you care about and source-backed answers. Use it to check two simple retrieval metrics and two answer checks:
• Precision: How much of what we retrieved was actually relevant?
• Recall: How much did we retrieve of all the relevant material out there?
• Correctness: Did the answer match the gold answer?
• Faithfulness: Did the answer stick to the retrieved sources (i.e., avoid hallucinating)?
That's your scorecard.
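Here's a minimal sketch of that scorecard in code. The gold-set schema, the helper names, and the correctness/faithfulness judges are illustrative placeholders (in practice you'd swap in SME review or an LLM-as-judge step), not a fixed framework:

```python
def retrieval_precision(retrieved_ids, relevant_ids):
    """Share of retrieved passages that were actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for pid in retrieved_ids if pid in relevant_ids)
    return hits / len(retrieved_ids)

def retrieval_recall(retrieved_ids, relevant_ids):
    """Share of all relevant passages that we managed to retrieve."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in relevant_ids if pid in retrieved_ids)
    return hits / len(relevant_ids)

def score_gold_set(gold_set, retrieve, answer, judge_correct, judge_faithful):
    """Run every gold question through the pipeline and average the four checks.

    retrieve(question) -> list of passage IDs
    answer(question, passage_ids) -> answer string
    judge_correct(answer, gold_answer) -> bool   (SME or LLM-as-judge)
    judge_faithful(answer, passage_ids) -> bool  (grounded in the retrieved sources?)
    """
    rows = []
    for item in gold_set:
        retrieved = retrieve(item["question"])
        ans = answer(item["question"], retrieved)
        rows.append({
            "precision": retrieval_precision(retrieved, item["relevant_ids"]),
            "recall": retrieval_recall(retrieved, item["relevant_ids"]),
            "correct": judge_correct(ans, item["gold_answer"]),
            "faithful": judge_faithful(ans, retrieved),
        })
    n = len(rows)
    return {metric: sum(r[metric] for r in rows) / n for metric in rows[0]}
```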
Three pragmatic ways to build the gold set (business needs should guide the selection):
• SME-authored Q&A with citations. Partner with subject-matter experts to write the "best available" answer and point to the exact passages that justify it. This anchors correctness and faithfulness from day one.
• Evidence labeling. Give SMEs (or trained reviewers) a mix of passages and ask them to label each as supports/contradicts/irrelevant for a target question. This is slower, but it teaches you what good evidence looks like and sets up precision/recall tests for retrieval.
• LLM-assisted candidates, human-verified. Use an LLM to propose questions, draft answers, and score likely evidence; then require human spot-checks before anything becomes "gold." This accelerates coverage but keeps humans in the loop for quality.
I also ask SMEs to tag each question with a small topic taxonomy (policy, pricing, eligibility, etc.). Those tags help me check coverage (do we even have the sources to answer?) and estimate an honest abstention rate for each topic bucket. If you cannot answer well, say so, and fix the content gap before you tweak prompts.
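For concreteness, a gold-set record might look something like this; the field names and topic labels are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class GoldItem:
    """One row of the gold set: a question, its best available answer, and the evidence."""
    question: str
    gold_answer: str            # SME-authored or human-verified answer
    evidence_ids: list[str]     # passage IDs that support the answer
    topic: str                  # small taxonomy tag, e.g., "policy", "pricing", "eligibility"
    answerable: bool = True     # False = the right behavior is to abstain

# Illustrative record; the IDs and values are placeholders.
example = GoldItem(
    question="What is the refund window for annual plans?",
    gold_answer="Annual plans can be refunded within 30 days of purchase.",
    evidence_ids=["billing-policy-v3#sec-2.1"],
    topic="policy",
)
```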
Ingestion
Ingestion and curation are not busywork. If your source text is stale, unfocused, duplicated, or stripped of the metadata your users search for, retrieval will fail before you vectorize a single byte. Mature guidance calls out rigorous prep and evaluation, yet most quick-starts sprint to a "hello world" chatbot and stop. This step isn't glamorous. It is decisive.
Chunking
Chunking is feature engineering. Text must be split into pieces that the system can understand. Too big and you miss specifics; too small and you lose the thread. Treat chunking as an experiment: try structure-aware or parent-child strategies, then validate against your gold set. Poor chunking tanks relevance and sufficiency long before the LLM sees the prompt.
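As a starting point, the simplest version of that experiment is a fixed-size chunker with overlap, which you'd then compare against structure-aware or parent-child variants on the gold set. The sizes below are placeholders, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into word-based chunks with overlap; one candidate strategy among several."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    words = text.split()
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

# Try several settings and score each against the gold set rather than guessing.
sample = " ".join(f"word{i}" for i in range(2000))  # stand-in for a real document
for size, ov in [(200, 40), (400, 80), (800, 160)]:
    print(size, ov, "chunks:", len(chunk_text(sample, size, ov)))
```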
Retrieval
Retrieval quality is not a single toggle. High-performing systems combine meaning-based (semantic) search with keyword (lexical) search and then re-rank so the best passages rise to the top. In practice, I run it in two stages:
• Stage 1: bi-encoder retrieval (fast). Encode the question and passages separately (dense vectors) and run a keyword match (lexical). Blend them to pull a focused top-k (for example, 20). This maximizes recall without dragging in the whole corpus.
• Stage 2: cross-encoder re-rank (accurate). Score each candidate with a joint model that reads the question with the passage. Re-rank and keep only the top few to feed the LLM. This lifts precision where it matters, at the top. If you aren't testing hybrid plus re-rank, you're leaving accuracy on the table.
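Here's a sketch of that two-stage setup, assuming the sentence-transformers and rank_bm25 packages; the model names, blend weight, and cutoffs are assumptions to tune on the gold set, not recommendations:

```python
import numpy as np
from rank_bm25 import BM25Okapi                                       # lexical (keyword) scoring
from sentence_transformers import SentenceTransformer, CrossEncoder   # dense retrieval + re-rank

passages = [
    "Annual plans can be refunded within 30 days of purchase.",
    "Support is available by email on business days.",
]  # stand-ins for your chunked corpus

# Stage 1a: bi-encoder embeddings (semantic match), computed once for the corpus.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")        # model choice is an assumption
passage_vecs = bi_encoder.encode(passages, normalize_embeddings=True)

# Stage 1b: BM25 over tokenized passages (exact terms, acronyms).
bm25 = BM25Okapi([p.lower().split() for p in passages])

def hybrid_retrieve(query: str, k: int = 20, alpha: float = 0.5) -> list[int]:
    """Blend normalized dense and lexical scores; alpha is the hybrid weight to tune."""
    dense = passage_vecs @ bi_encoder.encode(query, normalize_embeddings=True)
    lexical = np.asarray(bm25.get_scores(query.lower().split()))
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)  # make the two signals comparable
    blended = alpha * norm(dense) + (1 - alpha) * norm(lexical)
    return list(np.argsort(-blended)[:k])

# Stage 2: cross-encoder re-rank of the top-k candidates; keep only the top few for the prompt.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidate_ids: list[int], keep: int = 4) -> list[int]:
    scores = reranker.predict([(query, passages[i]) for i in candidate_ids])
    order = np.argsort(-np.asarray(scores))
    return [candidate_ids[i] for i in order[:keep]]
```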
Gatekeeping
Set thresholds before generation. Don't ask the model to write an answer unless it has met a bar you defined from the gold set: enough relevant evidence, disagreements resolved, and sources that actually cover the question. Concretely, require a minimum evidence score (or coverage rule); otherwise, abstain or ask a clarifying question and search again. This is how you control faithfulness and reduce bad confidence spikes.
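One way such a gate can look in code, assuming the re-ranker's scores are available; the threshold and minimum-passage count are placeholders you'd calibrate on the gold set so the false-answer rate stays acceptable:

```python
def evidence_gate(reranked, min_score=0.35, min_passages=2):
    """Decide whether to answer, ask a clarifying question, or abstain.

    reranked: list of (passage_id, rerank_score) pairs, best first.
    Thresholds are illustrative; set them from the gold set, not intuition.
    """
    strong = [(pid, s) for pid, s in reranked if s >= min_score]
    if len(strong) >= min_passages:
        return {"action": "answer", "evidence": strong}
    if strong:
        return {"action": "clarify", "evidence": strong}  # partial coverage: ask a follow-up
    return {"action": "abstain", "evidence": []}          # no defensible evidence: say so

# Example: only one passage clears the bar, so the system asks a clarifying question.
print(evidence_gate([("doc-12#3", 0.62), ("doc-07#1", 0.18)]))
```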
Where the data science actually lives (the levers I move)
When you stop thinking of a pipeline and start thinking of a decision boundary, science shows up fast. Retrieval and generation create the same kinds of precision and recall trade-offs we know from classifiers. Here are the levers I actually use, and how I judge them, in human terms:
• Chunk size and overlap: More overlap usually raises recall (you miss fewer details) but may lower precision (you pull extra noise). I watch how often the sources truly cover the question.
• How many candidates to retrieve (top-k): Pulling more boosts recall but risks crowding the prompt with off-target text, which hurts precision and confuses generation. I dial this in on the gold set.
• Hybrid weighting (keyword vs. semantic): Tilting toward keywords helps with acronyms and exact phrases; tilting toward semantics helps with paraphrases and concept matches. I adjust the balance and see which way improves precision without collapsing recall.
• Re-rank cutoffs: A second-pass re-rank usually raises precision in the top few slots; the trade-off is latency. I pick the smallest cutoff that still improves correctness on the gold set.
• Gate for enough evidence: I prefer a simple rule: answer only if the sources plausibly cover the question; otherwise, ask a follow-up or search again. This improves faithfulness and trust.
• Prompt constraints and citation policy: Telling the model to answer from retrieved context reduces hallucinations and makes faithfulness checkable. You'll see fewer slick guesses and more grounded answers.
That's what data science in RAG looks like to me: move a lever, watch the precision, recall, correctness, and faithfulness dashboard, and keep the change only if users win.
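Tying the levers to the scorecard can be as plain as a grid sweep. This sketch leans on the helpers sketched earlier in this piece (hybrid_retrieve, rerank, score_gold_set) plus your own answer and judging functions; the candidate values are illustrative:

```python
from itertools import product

# Illustrative sweep over three levers; every combination is scored on the same gold set.
settings = product([10, 20, 40],      # top-k candidates
                   [0.3, 0.5, 0.7],   # hybrid weight (keyword vs. semantic)
                   [3, 5])            # re-rank cutoff

results = []
for top_k, alpha, keep in settings:
    retrieve = lambda q, k=top_k, a=alpha, kp=keep: rerank(q, hybrid_retrieve(q, k, a), kp)
    scores = score_gold_set(gold_set, retrieve, answer, judge_correct, judge_faithful)
    results.append({"top_k": top_k, "alpha": alpha, "keep": keep, **scores})

# Prefer the smallest, cheapest configuration that still wins on correctness and faithfulness.
for row in sorted(results, key=lambda r: (-r["correct"], -r["faithful"], r["top_k"])):
    print(row)
```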
Why trainings make it hard to see the science
Most mainstream guides lead with setup: load data, make embeddings, add a vector store, chat loop. Evaluation and monitoring typically appear as an end module under production, which makes them feel optional. It's great for momentum. It's terrible for rigor. It trains teams to ship before they measure.
Am I saying that's the only reason projects stall? No. Unclear use cases and weak governance are real. However, the evaluation gap is a big culprit. Coverage debunking the viral "95% of AI pilots fail" claim, along with industry forecasts on agentic AI, points to brittle proofs-of-concept that never mature into products.
Agentic RAG raises the bar, not the magic
Agents add planning, tool use, memory, and autonomy. That multiplies the places you can make (or break) precision and recall. The more serious playbooks show versioned pipelines, evaluation gates, and promotion criteria for agent workflows. That is evidence the data science didn't go away; it got more important.
Why this feels different from traditional ML
In classic ML, many knobs sit inside one model, and you tune in one place. In RAG, the knobs live across the system: chunking, candidate counts, hybrid weights, re-rank cutoffs, gating rules, and grounding constraints. You still optimize for precision, recall, correctness, and faithfulness, but you do it across steps, not inside a single training loop. That dispersion is why it can feel less like data science at first glance, and why an integrated evaluation habit matters more than ever.
Ownership without the org chart
I'm not trying to hand you a RACI chart. In small teams, the same person may write the loader, tune chunking, and design the evaluation pack. In bigger orgs, the titles differ. What matters is simple: someone owns the gold set and the scorecard, someone owns the retrieval knobs (and can explain the precision and recall trade-offs), and someone owns the gating rules before generation. Engineering assembles and hardens the plumbing; data science defines the thresholds, measures the impact, and decides what ships. If those responsibilities are unassigned, quality becomes nobody's job, and that's when brittle demos masquerade as products.
Who does what: engineering vs. data science
| Stage | Engineering owns | Data science owns |
| --- | --- | --- |
| Ingest | Connectors, ETL, orchestration | Deduping, normalization, metadata extraction, quality checks |
| Indexing | Vector store config, scaling | Embedding choice, chunking strategy design and validation |
| Retrieval | API wiring, latency and cost budgets | Balancing precision and recall via candidate counts and hybrid search |
| Re-rank | Integrating the re-ranker | Picking cutoffs and judging precision gains at the top |
| Gate | Implementing control flow | Defining enough-evidence rules and measuring false-answer rate |
| Generate | Endpoint configs, throughput, caching | Prompt templates, citation policy, hallucination controls |
| Safety/Citation | Enforcement, logging | Spot-checks for faithfulness and contradiction, escalation rules |
| Evaluation | Dashboards and alerts | Gold-set design, metric definitions, thresholds, and promotion criteria |
If your plan doesn't assign ownership for these data-science tasks, you don't have a plan.
Conclusion
If RAG and agentic AI haven't felt like data science, you're not wrong; you've been looking for a single model to tune. The science is there, but it's spread across the system: how you chunk, what you retrieve, when you abstain, and how you prove it works. Start with evaluation, not wiring. Build the gold set. Gate answers with thresholds you can defend. Then keep moving the levers (chunking, candidate counts, hybrid balance, re-rank cutoffs) and watch the precision, recall, correctness, and faithfulness scoreboard. That's the scientific method applied to a system, not a single model. Do this, and the unseen hand stops being a complaint and becomes the reason your product earns trust.