Computer Science Thesis Defense Questions: What Examiners Actually Ask
CS committees probe four things above everything else: what exactly is new in your work and how it differs from prior state of the art; how you evaluated it and whether the evidence is trustworthy; why this approach rather than a plausible alternative; and whether the results travel beyond the specific datasets and conditions you tested. Getting those four lines straight before the defense covers most of what the room will ask.
What makes a CS defense different from other disciplines
Most doctoral defenses test whether you can articulate your contribution and defend your methodology. A CS defense does that and then adds a layer that most other fields do not: you must explain why the thing you built or proved could not have been derived trivially from existing work, and you must justify every major empirical decision — baseline selection, benchmark choice, dataset scope, statistical procedure — as if you were the reviewer who nearly rejected the paper.
The engineering-versus-research distinction is also live in a way it isn't in, say, sociology. Building a system that works is not automatically a doctoral contribution. Examiners will ask what the system reveals or proves that was not already known. If you built something, you need a claim about what the building demonstrates. If the claim is purely "it works faster," you need to explain why that speed matters theoretically and not just practically.
In the UK viva, this scrutiny typically comes from one or two examiners, with the external examiner having read your thesis closely and prepared specific technical questions. In the US defense, a committee of four or five may divide the questioning: one member may probe the formal correctness of your algorithms, another the evaluation methodology, another the related work. The questions themselves are largely the same; the rhythm is different.
Novelty, delta over prior work, and the state of the art
These questions come first in most CS defenses and set the tone for everything that follows. An examiner who is not convinced you know where your work sits relative to the field will return to that doubt throughout the session.
What is the single most original contribution of this thesis?
This is the most common opener. Examiners want one sentence that names the result — not the topic, not the system, not the approach — and why the field could not have gotten it from existing work. "I present a method for X that achieves Y on benchmark Z" is not enough; add why that required genuinely new work rather than straightforward application of known techniques. Candidates who lead with the problem rather than the result usually get a follow-up asking them to sharpen the claim.
How does your work differ from [specific paper the examiner names]? They seem to be doing the same thing.
Prepare for the examiner to name a paper you cited — possibly one they wrote. The question is not a trap; it is a genuine test of whether you understand your own delta. Have a precise two-sentence answer for each paper in your related-work section: what they did, and what your work adds, changes, or contradicts. Vague answers like "we extend their approach" invite harder follow-ups.
Is your contribution a research contribution or an engineering contribution? What's the difference in this case?
Examiners in CS ask this partly to see if you've thought about it at all. An engineering contribution is a system that solves a problem well. A research contribution is a result that changes what the field knows — a formal proof, a falsified assumption, an empirical finding that generalises. If your thesis is primarily a system, name the research claims the system supports. If it is primarily theoretical, explain what implementation or evaluation you used to ground the theory.
Could a competent engineer have implemented this as a straightforward application of existing methods?
This is the sharpest version of the novelty question. The answer must be no, and you must say why not. Name the specific technical obstacle that required new work — not "it was hard" but the concrete problem that existing techniques failed to address. If your contribution is primarily in the combination or adaptation of existing techniques rather than in a new technique, say so clearly and defend why the combination itself constitutes a scholarly contribution.
Your thesis makes several contributions. Which one would survive if the others were removed?
Examiners use this to identify what you believe the core claim is. It also surfaces dependency relationships between your chapters that you may not have articulated. Have a clear answer: one contribution that stands alone and is the minimum for the thesis to pass. The others should be presented as additions that strengthen or extend the core, not as independent load-bearing walls.
Evaluation methodology, benchmarks, and baseline selection
For empirical CS — systems, machine learning, software engineering, HCI — evaluation questions often occupy the longest part of the exam. Your contribution is exactly as strong as your evidence for it, and the examiner's job is to find where the evidence is thinner than the claim.
Why did you choose these baselines? The strongest baseline in the literature is [method X] — why isn't it here?
Baseline selection is one of the most scrutinised decisions in empirical CS. If you excluded a competitive method, you need a principled reason: it was not reproducible from the published description, it required data you could not access, or its experimental setup was genuinely incompatible with yours. Saying a method was "out of scope" without explaining why will prompt follow-ups. If the examiner names a baseline you didn't include, acknowledge it and say specifically what including it would or would not have changed about your conclusions.
Are your benchmarks still representative of the problem, or are they saturated?
Several standard CS benchmarks, particularly in NLP and vision, can become effectively saturated in the time it takes to complete a PhD. If your evaluation relies on a well-known benchmark, be prepared to discuss whether performance on it still carries information. The examiner is checking whether your results show genuine progress or benchmark overfitting. Have a response that addresses what the benchmark captures and what it doesn't.
How do you know the performance improvement is real and not noise?
For any quantitative result, know your statistical significance approach and its limits. In ML papers, this typically means repeated runs across different random seeds and confidence intervals. In software engineering experiments, it may mean effect sizes and non-parametric tests. If your improvement margins are small relative to variance, say what that means for the practical significance of the claim — an improvement that is statistically significant but smaller than the noise in real deployment is a result, but a different kind of result.
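One way to make the "repeated runs and confidence intervals" answer concrete is a paired bootstrap over per-seed differences. The sketch below is only an illustration under stated assumptions: the function name `paired_bootstrap_ci` is hypothetical, the scores are invented placeholders rather than real results, and a percentile bootstrap is one reasonable choice among several.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, method, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-seed improvement.

    baseline[i] and method[i] are assumed to come from the same random seed.
    """
    if len(baseline) != len(method):
        raise ValueError("scores must be paired by seed")
    diffs = [m - b for b, m in zip(baseline, method)]
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    # Resample the per-seed differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), low, high


# Hypothetical accuracies from five seeds; these are NOT real results.
baseline_acc = [0.812, 0.807, 0.815, 0.809, 0.811]
method_acc = [0.821, 0.818, 0.819, 0.823, 0.816]

mean_gain, ci_low, ci_high = paired_bootstrap_ci(baseline_acc, method_acc)
print(f"mean gain {mean_gain:.4f}, 95% CI [{ci_low:.4f}, {ci_high:.4f}]")
```

If the interval excludes zero, the gain is unlikely to be seed noise. Whether a gain of that size matters in deployment is a separate argument that the interval does not make for you, and with only a handful of seeds the interval itself is rough.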
You report results on dataset X. How much of your improvement is specific to that dataset?
This is the generalisability question in its most direct CS form. Have a clear statement of what you believe generalises and what is dataset-specific, grounded in the characteristics of the dataset. If you tested only one dataset, acknowledge that squarely and explain what additional experiments would have been needed to broaden the claim. Don't claim more transferability than the evidence supports.
Did you run ablation studies? What did they tell you about which components of your method are actually doing the work?
Examiners ask about ablations to check whether you understand your own system. If your method has multiple components, you should know what happens when each is removed: performance degrades, the method fails entirely, or nothing much changes. The last outcome is the one to prepare for carefully — if removing a component doesn't hurt, the examiner will ask why it's there.
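The three outcomes described above can be tabulated before the exam. The following is a minimal sketch, assuming you already have one score for the full method and one per removed component; the component names, scores, and noise floor are hypothetical placeholders.

```python
def ablation_report(full_score, ablated_scores, noise_floor):
    """Classify each component by how much the score drops without it.

    ablated_scores maps a component name to the score obtained with that
    component removed. noise_floor is the run-to-run variation you measured;
    here it is simply assumed.
    """
    report = {}
    for component, score in ablated_scores.items():
        drop = full_score - score
        if drop > noise_floor:
            verdict = "contributes"
        elif drop < -noise_floor:
            verdict = "hurts"  # the method is better without it
        else:
            verdict = "no measurable effect"
        report[component] = (round(drop, 4), verdict)
    return report


# Hypothetical numbers for illustration only.
report = ablation_report(
    full_score=0.84,
    ablated_scores={"attention": 0.71, "augmentation": 0.835, "aux_loss": 0.86},
    noise_floor=0.01,
)
for name, (drop, verdict) in report.items():
    print(f"{name:>12}: drop={drop:+.3f} -> {verdict}")
```

Any row labelled "no measurable effect" or "hurts" is a question you should expect, so prepare the reason the component is still in the method.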
How would a practitioner reproduce your results? What would they need that isn't in the paper?
Reproducibility is now a genuine concern in computer science, not just a theoretical one. Examiners will ask whether you have released code and data, and if not, why not. If you have a proprietary dataset or a dependency on infrastructure you cannot share, name that explicitly and say what the minimum requirement for reproduction would be. Saying you ran experiments on a specific GPU configuration that affects results is the kind of honesty that builds credibility.
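A lightweight way to answer "what would they need" is to save a small manifest alongside every experiment. This is a sketch, not a standard: the field names and the `run_manifest` helper are assumptions made for illustration, and a real manifest would likely also record library versions and hardware details.

```python
import hashlib
import json
import platform
import sys


def run_manifest(seed, data_bytes, hyperparams):
    """Collect the facts someone would need to rerun one experiment."""
    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # A content hash pins the exact dataset version that was used.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "hyperparams": hyperparams,
    }


# Hypothetical inputs standing in for a real dataset file and configuration.
manifest = run_manifest(
    seed=42,
    data_bytes=b"id,label\n1,0\n2,1\n",
    hyperparams={"lr": 3e-4, "batch_size": 32},
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```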
Threats to validity in empirical CS work
The Wohlin et al. taxonomy — construct, internal, external, and conclusion validity — is the standard reference in empirical software engineering and is increasingly used across other CS empirical subfields. Examiners may not use this exact vocabulary, but the questions they ask map onto these categories.
What are the main threats to the validity of your results, and which did you accept?
The right approach is to name them categorically: construct threats (do your metrics measure what you claim they measure?), internal threats (could something in the experimental setup have produced the result independently of your method?), external threats (how far do the results travel?), and conclusion threats (are the statistical procedures appropriate?). Name the ones you accepted and say what they limit — not the whole thesis, but the specific inference they affect.
Your evaluation metric is [accuracy / F1 / latency / BLEU score]. Why is that the right metric for the claim you're making?
Construct validity in CS is often a metric validity question. The examiner wants to know that your metric captures what you claim it captures. BLEU score does not measure translation quality directly; accuracy on imbalanced datasets does not measure classifier usefulness; throughput in a synthetic benchmark may not reflect real workload behaviour. Know your metric's limits and say what they mean for the confidence you can place in your conclusions.
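The accuracy point is easy to demonstrate with a toy case. The sketch below uses a made-up dataset with 5 positives in 100 examples and a degenerate classifier that always predicts the negative class; the helper names are illustrative and not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_positive(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic, imbalanced labels: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100

acc = accuracy(y_true, always_negative)
f1 = f1_positive(y_true, always_negative)
print(acc)  # 0.95
print(f1)   # 0.0
```

A classifier that never finds a positive case scores 95% accuracy here, which is why the metric needs to be argued for rather than assumed.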
You tested on a specific set of programs / datasets / users. What happens to your claims if those aren't representative?
External validity in empirical CS most often surfaces as sampling: are the subjects (programs, queries, participants, networks) representative of the population you are generalising to? Name the population you intend your claim to cover, then characterise how well your sample approximates it. If the sample was opportunistic — you used open-source projects that were convenient to instrument, for example — say that and describe what it implies for scope.
Could your results be explained by the fact that you tuned your method on the test data, even inadvertently?
This is the data leakage question, one of the most serious threats in ML-adjacent work. If you used a held-out test set throughout development to guide design decisions, that is leakage even if you never directly trained on it. Have a precise account of how you kept evaluation data separate from development decisions. If there was any contact between development and evaluation, describe it, quantify the risk where possible, and say what additional experiments (a completely fresh held-out set, cross-validation) would have added.
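The separation described above starts with a split that is made once, with a fixed seed, before any design decisions are taken. The following is a minimal sketch under that assumption; `three_way_split` and the example IDs are hypothetical, and real projects often need grouped or time-based splits instead of a random shuffle.

```python
import random


def three_way_split(items, val_frac=0.1, test_frac=0.1, seed=13):
    """Shuffle once with a fixed seed and cut into train / validation / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical example IDs standing in for a real dataset.
example_ids = range(1000)
train_ids, val_ids, test_ids = three_way_split(example_ids)

# The partitions must not share any example.
assert not set(train_ids) & set(val_ids)
assert not set(train_ids) & set(test_ids)
assert not set(val_ids) & set(test_ids)

print(len(train_ids), len(val_ids), len(test_ids))  # 800 100 100
```

Under this discipline, tuning and design decisions consult only `val_ids`, and `test_ids` is scored once at the end. The code cannot enforce that; it only makes the partition auditable when an examiner asks how it was made.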
Does your system scale? What happens at ten times the input size?
Scalability questions test both formal claims and practical ones. If you have a complexity analysis, know its assumptions and where the analysis stops being tight. If you have empirical scaling experiments, know what the curve looks like and where it starts to diverge from the theoretical prediction. If you don't have either, say what the limiting factor is (memory, quadratic components, I/O) and why it does not affect the scope of the claim you are making.
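If you have timing measurements at several input sizes, a log-log fit gives a quick empirical exponent to compare against your complexity analysis. The sketch below assumes runtime follows a power law in input size, which is itself a modelling assumption; the timings are synthetic and generated from an exact quadratic so the fit can be checked.

```python
import math


def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size).

    A slope near 1 suggests linear scaling and near 2 quadratic,
    assuming the power-law model holds over the measured range.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic timings from an exactly quadratic cost model (not measurements).
sizes = [1000, 2000, 4000, 8000]
times = [3e-6 * n ** 2 for n in sizes]

k = scaling_exponent(sizes, times)
slowdown_at_10x = 10 ** k
print(round(k, 3))             # 2.0
print(round(slowdown_at_10x))  # 100
```

With real measurements the slope will not be this clean, and the size at which it drifts away from the predicted exponent is usually the more interesting thing to report.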
Design choices, alternatives, and why this approach
Once the examiner is satisfied that the contribution is real and the evaluation is sound, the questioning often shifts to the choices that shaped the approach. These questions are less about whether you were right and more about whether you can think clearly about the options you had.
Why this approach rather than [deep learning / formal methods / simulation / whatever the dominant alternative is in your subfield]?
Name the alternatives, say what they would have been suited for, and explain specifically what they would have failed to give you. This is not a question about which approach is best in general. It is a question about fit between your research question and your method. An examiner who specialises in the approach you didn't use is not asking you to justify not using their approach — they want to see that you thought about it.
You made a design decision in Chapter 3 that you describe as pragmatic. What would have changed if you had made the more principled choice?
Examiners read your thesis as carefully as you wrote it, sometimes more carefully. If you used phrases like "for practical reasons" or "we simplified by assuming," those are flagged as questions. Have a specific answer for what the principled version would have required and what it would have changed in your results — either a direction (probably better / probably the same / probably worse) or an argument for why it wouldn't have mattered.
If you were starting this project today, knowing what you know now, what would you do differently?
Answer this with one specific, technical change — not a sweeping revision and not a denial that anything would change. The best answers name a design decision early in the project that you now understand had downstream consequences, explain what those consequences were, and say what you would have done instead. This demonstrates exactly the kind of reflective ownership that examiners are looking for in a candidate claiming independent researcher status.
Your results are better than the baseline, but only on these specific conditions. How confident are you that the advantage holds in practice?
This is the examiner testing whether you can separate statistical results from practical claims. Be precise: the advantage was demonstrated under conditions X, Y, and Z. Under different conditions — higher noise, larger scale, different domain — the picture is less clear, and you should say so. Overselling a result that works only under constrained conditions is the fastest way to lose an examiner's confidence.
Frequently asked questions
How long do CS thesis defenses and vivas typically last?
UK vivas for CS PhDs usually run between one and a half and three hours. US thesis defenses often pair a public presentation of around 45 minutes with a private examination of one to two hours. The technical depth of CS work means examiners often spend more time on evaluation and methodology than in some other fields, so preparation time on those sections pays off disproportionately.
What if the examiner finds a genuine bug in my implementation?
This happens. The right response is to acknowledge it, describe what you believe the impact on your results is, and, if you can, say what a corrected version would look like. A bug that affects a secondary experiment is a different matter from one that undermines the main claim. Examiners are generally more concerned with whether you understand the scope of the problem than with the bug itself. Trying to argue that the bug doesn't matter when it clearly might is more damaging than admitting the problem cleanly.
My thesis uses machine learning. Do examiners expect me to justify every hyperparameter choice?
Not every hyperparameter, but the ones that could plausibly affect your conclusions. If you tuned learning rate, batch size, or architecture depth and the final values differ from common defaults, be ready to say whether you tuned on validation data (acceptable) or on the test set (a validity threat). Examiners in ML-adjacent CS are increasingly attentive to the difference between reported results that are the product of a single run and results that are stable across seeds and configurations.
How should I handle a question about a paper published after I submitted my thesis?
This is more common than candidates expect, especially in fast-moving subfields. You are not expected to have incorporated work you couldn't have known about, but you are expected to be able to discuss how it relates to your own. If the new paper reports results better than yours, have a brief response about whether it changes the validity of your contribution or simply extends the state of the art beyond what you claimed.
Should I bring slides to a CS viva or defense?
For a US defense, yes: the public talk is typically slide-based and the committee may ask you to return to specific figures. For a UK viva, bring a copy of your thesis and potentially a one-page summary of your contributions, but most examiners work from the thesis directly. Check your department's norms and ask your supervisor what the external examiner usually prefers. Arriving with the wrong format in either direction is easily avoidable.
The MockDefense Committee
Doctoral defense preparation, MockDefense
MockDefense builds AI examiners that rehearse the questions a real doctoral committee asks — on methodology, contribution, and the gaps you haven't patched yet. Our guides are written from that examiner's-eye view of what defenses actually test.