
Choosing Your Ascent: How Pre-Benchmark Vetting Processes Shape Liveness Detection Performance

This comprehensive guide examines how the often-overlooked pre-benchmark vetting process — the selection, curation, and preparation of test data and attack simulations — fundamentally determines the real-world performance of liveness detection systems. Drawing on shared professional practices as of May 2026, we explore why two systems with identical benchmark scores can behave radically differently in production. We compare three approaches (vendor-led, in-house composite, and third-party audit), walk through a step-by-step vetting pipeline, and close with composite scenarios and answers to frequently asked questions.

Introduction: The Hidden Variable in Liveness Detection

Teams often find themselves staring at a benchmark report, confident that a liveness detection system scoring 99.5% will secure their deployment. Then, in production, the false acceptance rate climbs, or a simple paper mask fools the system during a pilot. The culprit is almost never the algorithm itself — it is the pre-benchmark vetting process. This guide argues that the way you select, prepare, and challenge your test data shapes performance more than any model tweak. As of May 2026, this overview reflects widely shared professional practices; verify critical details against current official guidance where applicable.

The core problem is that standard benchmarks, like those from well-known standards bodies, use controlled datasets that may not reflect your threat model. A system tuned to reject high-quality replay attacks might fail against a low-resolution deepfake video. The pre-benchmark vetting process — the workflow you use to curate genuine samples, define attack types, and set pass-fail thresholds — is where real-world robustness is built or broken. This article provides a conceptual framework for evaluating and designing that process, drawing on composite scenarios from industry practice.

We will cover three main approaches to vetting, a step-by-step pipeline, and common pitfalls. The goal is to help you make informed trade-offs, not to promise a perfect solution. Liveness detection is a rapidly evolving field, and no single process guarantees security. By focusing on workflow comparisons, we aim to give you the tools to ask better questions of vendors and internal teams.

Core Concepts: Why Pre-Benchmark Vetting Matters

To understand why pre-benchmark vetting is critical, we must first separate two distinct activities: evaluation and vetting. Evaluation is the formal scoring of a system against a fixed benchmark — for example, using a public dataset like the one from a major biometric conference. Vetting, by contrast, is the preparatory phase where you decide what data to include, how to simulate attacks, and what performance thresholds are acceptable for your use case. This guide focuses on the latter, because it is where the assumptions that drive evaluation results are formed.

Common Mistakes in Genuine Sample Curation

In one composite scenario drawn from industry discussions, a team collected 10,000 selfie images from employee smartphones for a pilot. They assumed this was a representative genuine set. However, the images were all taken under office lighting, with the same phone model, and with minimal variation in expression. When the system was deployed to a customer base using older devices in dimly lit rooms, the false rejection rate tripled. The mistake was not in the algorithm but in the vetting process: they did not account for environmental variability. A robust vetting process would include samples from multiple devices, lighting conditions, angles, and demographics, as well as metadata about capture quality.

Attack Simulation Diversity

Another common error is focusing too narrowly on known presentation attack instrument (PAI) types — for instance, only testing against printed photos and video replay. Modern attackers use silicone masks, 3D-printed heads, and deepfake videos generated from stolen images. A vetting process that only includes low-tech attacks will miss advanced threats. Conversely, overemphasizing exotic attacks can lead to overfitting, where the system rejects legitimate users who wear heavy makeup or have facial hair. The key is to match attack diversity to your threat model, not to an abstract benchmark.

Trade-offs in Threshold Setting

Pre-benchmark vetting also involves setting preliminary thresholds for false acceptance rate (FAR) and false rejection rate (FRR). These thresholds are not arbitrary; they should be derived from a cost-benefit analysis of your application. For a high-security bank transaction, a very low FAR (e.g., 0.001%) might be worth a higher FRR. For a low-stakes convenience feature, such as unlocking a photo app, a higher FAR may be acceptable to keep FRR low. The vetting process should include a sensitivity analysis to understand how different thresholds affect performance on your specific data mix.
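
To make the sensitivity analysis concrete, the sketch below sweeps a decision threshold across two synthetic score distributions and reports FAR and FRR at each point. The score arrays, the threshold range, and the assumption that the system outputs a single liveness score in [0, 1] are all illustrative, not properties of any particular product.

```python
# A minimal threshold-sensitivity sketch. The score distributions below are
# synthetic placeholders, not real benchmark data.
import numpy as np

rng = np.random.default_rng(0)
genuine_scores = rng.beta(8, 2, size=5000)   # hypothetical genuine-user scores
attack_scores = rng.beta(2, 8, size=1000)    # hypothetical attack scores

for threshold in np.linspace(0.1, 0.9, 9):
    far = np.mean(attack_scores >= threshold)   # attacks accepted as live
    frr = np.mean(genuine_scores < threshold)   # genuine users rejected
    print(f"threshold={threshold:.1f}  FAR={far:.4%}  FRR={frr:.4%}")
```

Plotting FAR against FRR across the sweep on your own data mix makes the trade-off visible before any formal evaluation begins.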

Approach 1: Vendor-Led Vetting

Many organizations rely on the vendor of the liveness detection system to provide vetting data, attack simulations, and performance reports. This approach is convenient and often uses the vendor's proprietary test sets, which may be optimized to show the system in the best light. However, it introduces a risk of overfitting to the vendor's data distribution. This section outlines the workflow, strengths, and weaknesses of vendor-led vetting, based on common industry patterns.

Workflow Overview

In a typical vendor-led process, the organization receives a test kit containing a set of genuine sample images and attack videos. The vendor provides instructions for running the system on this data and reports metrics like FAR, FRR, and attack presentation classification error rate (APCER). The organization may also receive a sample report template. The entire process can take a few days to a week, depending on the vendor's responsiveness. The key assumption is that the vendor's data is representative of the deployment environment.

Strengths

Vendor-led vetting is fast, requires minimal internal expertise, and often includes a warranty or service-level agreement (SLA). The vendor has deep knowledge of their system's failure modes and can provide guidance on optimal threshold settings. For organizations with simple threat models — for example, a low-security office access system — this approach may be sufficient. It also reduces the burden of data collection, which can be expensive and time-consuming.

Weaknesses

The primary weakness is lack of independence. The vendor's test data may not include the specific attack types or environmental conditions relevant to your deployment. For example, a vendor might test against standard printed photos and video replay but not against silicone masks, which are becoming more accessible. Additionally, the vendor may select genuine samples that are unusually clean (e.g., high resolution, good lighting), leading to an inflated performance estimate. Another issue is that the vendor may not share raw data or attack generation scripts, making it difficult to reproduce or extend tests later.

When to Use / When to Avoid

Use vendor-led vetting when you have limited internal resources, a low-risk application, and trust in the vendor's track record. Avoid it when your deployment involves high-value transactions (e.g., financial services), diverse user demographics, or a rapidly evolving threat landscape. In those cases, the lack of transparency and customization can lead to costly failures.

Approach 2: In-House Composite Vetting

For organizations with more resources and higher security requirements, an in-house composite vetting process offers greater control. This approach combines vendor-provided data with internally collected samples, third-party attack tools, and custom attack simulations. The goal is to create a test set that mirrors the deployment environment as closely as possible. This section describes the workflow, including data collection, attack generation, and iterative refinement, based on composite scenarios from multiple teams.

Building a Genuine Sample Library

The first step is to collect a library of genuine samples that represent your target user base. This includes images and videos from different devices (smartphones, tablets, laptops), lighting conditions (indoor, outdoor, low light), angles (frontal, profile, tilted), and demographics (age, ethnicity, facial hair, glasses). A practical rule of thumb is to collect at least 1,000 samples per significant variation (e.g., device type). Metadata should be recorded for each sample: capture device, resolution, lighting level, and any occlusions (masks, hats, sunglasses). This library becomes the baseline for measuring false rejection.
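
One way to keep that metadata consistent is a small per-sample record. The sketch below is a minimal example; the field names and values are assumptions for illustration, not a standard schema.

```python
# A sketch of a per-sample metadata record for the genuine library.
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class GenuineSample:
    sample_id: str
    file_path: str
    device_model: str               # capture phone, tablet, or laptop model
    resolution: str                 # e.g. "1920x1080"
    lighting_lux: Optional[float]   # measured lux, if available
    angle: str                      # "frontal", "profile", "tilted"
    occlusions: list[str]           # e.g. ["glasses", "hat"]
    consent_recorded: bool          # privacy/consent flag

sample = GenuineSample(
    sample_id="g-000123",
    file_path="genuine/phone_a/g-000123.png",
    device_model="mid-range-phone-2023",
    resolution="1280x720",
    lighting_lux=45.0,
    angle="frontal",
    occlusions=["glasses"],
    consent_recorded=True,
)
print(asdict(sample))
```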

Attack Simulation Pipeline

Attack simulations should cover multiple PAI categories: print attacks (paper, glossy, matte), replay attacks (video on various screen sizes and resolutions), masks (silicone, latex, 3D-printed), and digital injection attacks (deepfakes, face swaps). For digital attacks, use publicly available deepfake generation tools (e.g., FaceSwap, DeepFaceLab) to create videos from stolen images. For physical attacks, purchase or create masks using affordable materials (e.g., silicone molds from online retailers). Document each attack with metadata: PAI type, material, resolution, lighting condition, and any preprocessing (e.g., scaling, cropping).
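
Because coverage gaps are easy to miss, some teams enumerate the attack matrix up front and write out a plan before creating any samples. The sketch below shows one way to do that; the PAI categories, variants, and output file name are illustrative assumptions.

```python
# A sketch of enumerating an attack-simulation matrix so every PAI category
# in the threat model gets documented variants.
from itertools import product
import csv

pai_matrix = {
    "print": ["matte paper", "glossy paper"],
    "replay": ["phone screen", "tablet screen", "laptop screen"],
    "mask": ["silicone", "latex", "3d-printed"],
    "deepfake": ["tool_a", "tool_b"],   # placeholders for two generation tools
}
lighting_conditions = ["indoor", "outdoor", "low-light"]

with open("attack_plan.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["pai_type", "variant", "lighting"])
    for pai_type, variants in pai_matrix.items():
        for variant, lighting in product(variants, lighting_conditions):
            writer.writerow([pai_type, variant, lighting])
```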

Iterative Test Cycles

In-house vetting is not a one-shot process. Start with a small subset of genuine and attack samples (e.g., 500 each) to tune system parameters. Then expand to a larger set (e.g., 5,000 genuine, 1,000 attacks) for formal evaluation. After each cycle, analyze false rejections (genuine users rejected) and false acceptances (attacks accepted) to identify failure patterns. For example, if the system consistently rejects users with glasses, add more samples of glasses-wearing users and adjust preprocessing. This iterative loop is the core of robust vetting.
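
The analysis step of each cycle can be as simple as grouping errors by a metadata attribute. The sketch below does this for false rejections grouped by occlusion type, using hypothetical result records.

```python
# A sketch of post-cycle failure analysis: group false rejections by a
# metadata attribute. The records are hypothetical placeholders.
from collections import Counter

results = [
    {"label": "genuine", "accepted": False, "occlusion": "glasses"},
    {"label": "genuine", "accepted": True,  "occlusion": "none"},
    {"label": "genuine", "accepted": False, "occlusion": "glasses"},
    {"label": "genuine", "accepted": True,  "occlusion": "hat"},
]

rejections = Counter()
totals = Counter()
for r in results:
    if r["label"] != "genuine":
        continue
    totals[r["occlusion"]] += 1
    if not r["accepted"]:
        rejections[r["occlusion"]] += 1

for occlusion, total in totals.items():
    rate = rejections[occlusion] / total
    print(f"{occlusion}: FRR {rate:.1%} ({rejections[occlusion]}/{total})")
```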

Strengths and Weaknesses

The main strength is relevance: the test data reflects your actual deployment conditions. The iterative process also surfaces edge cases that vendor tests might miss. However, this approach requires significant time (weeks to months), expertise in data curation and attack generation, and ongoing maintenance to keep the library current as new attack types emerge. It is best suited for organizations with a dedicated security or biometrics team.

Approach 3: Third-Party Audit Vetting

A third approach involves hiring an independent testing laboratory or consultancy to design and execute the vetting process. This combines the independence of an external party with the customization of in-house work. Third-party auditors typically have experience across multiple deployments and can bring best practices from other industries. They may also have access to proprietary attack libraries or tools. This section explores the workflow, trade-offs, and typical scenarios for this approach.

Auditor Selection Criteria

Choosing an auditor is itself a vetting process. Look for firms with experience in biometric security, a published methodology (e.g., based on ISO/IEC 30107 standards), and a track record of testing systems similar to yours. Ask for anonymized case studies or sample reports. Avoid auditors who promise guaranteed results or refuse to share their test data definitions. A good auditor will work with you to define the threat model and data requirements, not impose a one-size-fits-all test.

Test Design Collaboration

In a typical engagement, the auditor interviews your team to understand the deployment environment, user demographics, and security requirements. They then design a test plan that includes genuine sample collection (which you may provide or they may collect), attack simulation (using their own tools and your threat model), and pass-fail criteria. The plan should specify sample sizes, attack types, and metrics. The auditor runs the tests and provides a detailed report with findings, including failure analysis and recommendations.

Strengths and Weaknesses

The main strengths are independence, expertise, and a standardized methodology that reduces bias. Auditors can also provide a benchmark against industry peers (anonymized, of course). The weaknesses include cost (often tens of thousands of dollars), time (weeks to months), and potential lack of familiarity with your specific deployment quirks. There is also a risk that the auditor's test data is still not perfectly representative, especially for niche applications (e.g., a system designed for users in extreme cold weather with heavy clothing).

Step-by-Step Guide to Building Your Vetting Pipeline

Regardless of which approach you choose, a structured pipeline will improve consistency and reproducibility. The following steps are a composite of practices observed across multiple organizations. They are designed to be adaptable to different resource levels and threat models. This guide assumes you have at least basic familiarity with biometric evaluation concepts.

Step 1: Define Your Threat Model

Before collecting any data, write down the specific attacks you are most concerned about. For example, a mobile banking app might prioritize deepfake videos and printed photos, while a physical access system might worry about silicone masks and replay attacks. Also define the acceptable FAR and FRR thresholds based on business impact. This document will guide all subsequent decisions. Without a threat model, you risk testing for the wrong things.
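
Writing the threat model down as a structured, machine-readable artifact makes it easier to check later tests against it. The sketch below shows one possible shape; the attack names and threshold values are placeholders, not recommendations.

```python
# A sketch of a threat model captured as a structured config.
import json

threat_model = {
    "application": "mobile banking onboarding",          # hypothetical use case
    "priority_attacks": ["deepfake_video", "printed_photo", "screen_replay"],
    "secondary_attacks": ["silicone_mask"],
    "target_far": 0.0001,    # illustrative 0.01% false acceptance ceiling
    "target_frr": 0.03,      # illustrative 3% false rejection ceiling
    "environments": ["indoor", "outdoor_daylight", "low_light"],
    "devices": ["recent_smartphones", "older_smartphones", "tablets"],
}

with open("threat_model.json", "w") as f:
    json.dump(threat_model, f, indent=2)
```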

Step 2: Collect Genuine Samples with Metadata

Using the criteria from your threat model, collect at least 2,000 genuine samples that cover the anticipated variations. Record metadata for each sample: device model, capture timestamp, lighting condition (lux), resolution, and any anomalies (blur, occlusion). Organize the samples into a structured directory or database. This library is the foundation for all false rejection testing. Ensure you have informed consent and comply with relevant privacy regulations.
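
A small validation pass can catch samples that enter the library without metadata. The sketch below assumes a directory of capture files and a CSV manifest with a file_path column; both are hypothetical conventions.

```python
# A sketch that flags library files missing a metadata row in the manifest.
import csv
from pathlib import Path

sample_dir = Path("genuine_library")
manifest = Path("genuine_manifest.csv")

with manifest.open() as f:
    known = {row["file_path"] for row in csv.DictReader(f)}

missing = [
    str(p) for p in sample_dir.rglob("*")
    if p.suffix.lower() in {".png", ".jpg", ".mp4"} and str(p) not in known
]
if missing:
    print(f"{len(missing)} samples lack metadata, e.g. {missing[:3]}")
```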

Step 3: Generate Attack Samples with Diversity

Create or acquire attack samples for each PAI type in your threat model. For print attacks, print photos on different paper types (matte, glossy) and at different sizes. For replay attacks, record videos on phones, tablets, and laptops at various resolutions. For masks, purchase or 3D-print at least three different models. For digital attacks, generate deepfakes using at least two different tools. Aim for at least 100 samples per PAI type, with metadata on creation method and quality.

Step 4: Run Initial Calibration Tests

Using a small subset (e.g., 200 genuine, 50 attack samples per type), run the liveness detection system to establish baseline performance. Analyze the results to identify obvious issues: are there any PAI types that the system accepts at a high rate? Are there genuine samples that are consistently rejected? Adjust system parameters (e.g., threshold, preprocessing) as needed. Document all changes and their rationale.

Step 5: Conduct Full Evaluation

Run the system on the full test set (all genuine and attack samples). Calculate FAR, FRR, and APCER for each PAI type. Examine failure patterns: do certain demographics or devices cause higher false rejection? Do certain attack types achieve high acceptance? Generate confusion matrices and ROC curves to visualize performance. If results are unsatisfactory, revisit Step 4 or consider a different system.
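
If the per-sample results are collected into a table, the headline metrics and the failure breakdowns fall out of a few group-by operations. The sketch below assumes a pandas DataFrame with illustrative column names and a handful of placeholder rows.

```python
# A sketch of the full-evaluation metrics pass over per-sample results.
import pandas as pd

results = pd.DataFrame([
    # label: "genuine" or a PAI type; accepted: the system's live/not-live call
    {"label": "genuine",  "device": "phone_a",  "accepted": True},
    {"label": "genuine",  "device": "tablet_b", "accepted": False},
    {"label": "print",    "device": "phone_a",  "accepted": False},
    {"label": "replay",   "device": "phone_a",  "accepted": True},
    {"label": "deepfake", "device": "laptop_c", "accepted": False},
])

genuine = results[results["label"] == "genuine"]
attacks = results[results["label"] != "genuine"]

frr = 1 - genuine["accepted"].mean()                  # false rejection rate
apcer = attacks.groupby("label")["accepted"].mean()   # acceptance rate per PAI type
frr_by_device = 1 - genuine.groupby("device")["accepted"].mean()

print(f"Overall FRR: {frr:.1%}")
print("APCER by PAI type:\n", apcer)
print("FRR by device:\n", frr_by_device)
```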

Step 6: Iterate and Update

Vetting is not a one-time event. As new attack types emerge or your user base changes, update the genuine and attack libraries. Re-run tests at regular intervals (e.g., quarterly) or after significant system updates. Maintain a changelog of all modifications to the pipeline and results. This iterative approach ensures that your vetting process remains relevant over time.

Real-World Scenarios: Lessons from Composite Cases

The following scenarios are anonymized composites based on patterns observed in industry discussions. They illustrate common failures and successes in pre-benchmark vetting. Names, organizations, and specific numbers are invented to protect confidentiality, but the workflow lessons are drawn from real practices.

Scenario 1: The Retail Kiosk That Failed During Holiday Rush

A retail chain deployed a liveness detection system for self-checkout kiosks, relying on vendor-led vetting. The vendor tested with indoor lighting and standard smartphone cameras. During the holiday season, the kiosks were placed near windows with bright sunlight, and many customers wore sunglasses or hats. The false rejection rate jumped from 2% to 18%, causing long queues and abandoned transactions. The root cause was that the vendor's genuine sample library lacked outdoor lighting and accessory variations. The chain then built an in-house library with such scenarios, updated the system, and reduced FRR to 3%. The lesson: vetting must match the deployment environment, not a lab.

Scenario 2: The Financial App That Missed Deepfakes

A fintech startup used a third-party audit for its liveness detection system, but the audit only tested against print and video replay attacks. The startup's threat model did not initially include deepfakes, as they were considered exotic. Six months after launch, a security researcher demonstrated that a deepfake generated from a stolen selfie could bypass the system. The startup then expanded its attack library to include deepfakes, and the system's APCER for that category was 45%. They had to retrain the model and re-issue tokens. The lesson: threat models must anticipate evolving attack types, and vetting should include a horizon scan of emerging threats.

Scenario 3: The Hospital That Reduced False Rejections with Iterative Vetting

A hospital system used in-house composite vetting for its patient identity verification system. The first iteration relied on a uniform set of genuine samples collected from staff; once patient samples were added, the system showed a 5% FRR for patients with darker skin tones, traced to lighting biases in the underlying training data. By adding more diverse samples and adjusting preprocessing, they reduced FRR to 1.5% across all demographics. They continued to iterate quarterly, adding new devices and lighting conditions. The key was the iterative loop and willingness to address bias proactively.

Common Questions and Frequent Concerns

Teams evaluating liveness detection often have recurring questions about the vetting process. This section addresses the most common ones based on discussions with practitioners. Remember that this is general information only; consult a qualified security professional for organization-specific decisions.

How many genuine samples are enough?

There is no single number, but a practical guideline is to have at least 1,000 samples per significant variation (e.g., device type, lighting condition). For a deployment with 100,000 users, a library of 5,000–10,000 diverse samples is often sufficient to estimate FRR with reasonable confidence. The key is diversity, not just volume. A library of 100,000 identical images is less useful than 1,000 images covering 10 different conditions.
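
To see why sample size matters for confidence as well as coverage, the sketch below computes an approximate 95% confidence interval (a Wilson score interval) around a hypothetical observed FRR of 2% at several library sizes.

```python
# A sketch of how library size affects the precision of an FRR estimate.
from math import sqrt

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for an observed error rate."""
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

for n in (500, 2000, 10000):
    errors = round(0.02 * n)          # assume an observed 2% FRR
    lo, hi = wilson_interval(errors, n)
    print(f"n={n:>6}: FRR 2.0% -> 95% CI [{lo:.2%}, {hi:.2%}]")
```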

Should I test against every possible attack type?

No. Focus on attack types that are credible for your threat model and resource constraints. A bank handling high-value transactions should test against advanced attacks like silicone masks and deepfakes. A low-security app might only need to test against printed photos and video replay. Over-testing can lead to overfitting and wasted effort. The trade-off is between comprehensiveness and practicality.

How often should I update the vetting pipeline?

At minimum, update the genuine sample library annually to reflect device changes and user demographics. Update attack libraries whenever a new attack type becomes commercially available or is reported in security literature. For high-security deployments, consider quarterly updates and continuous monitoring of false acceptances in production.

Can I trust vendor-provided benchmark numbers?

Vendor benchmarks are useful for initial screening but should not be the sole basis for deployment decisions. Request the vendor's test methodology, data sources, and any limitations. Ideally, conduct your own vetting using a representative subset of your data. Independent verification is a strong trust signal.

Conclusion: Choosing Your Ascent

The choice of pre-benchmark vetting process is the most consequential decision in liveness detection deployment. It determines whether your system will perform well in the real world or fail under unexpected conditions. This guide has outlined three approaches — vendor-led, in-house composite, and third-party audit — each with distinct trade-offs. The best choice depends on your resources, threat model, and security requirements.

Key takeaways include: always define your threat model before collecting data; prioritize diversity in genuine and attack samples; use iterative testing to surface edge cases; and update your pipeline regularly. Avoid the trap of relying solely on vendor benchmarks or one-shot tests. A robust vetting process is not a cost but an investment in security and user trust.

As you choose your ascent, remember that no process is perfect. The goal is to make informed trade-offs, not to eliminate all risk. By understanding the conceptual workflow behind vetting, you can ask better questions, evaluate vendors more critically, and build systems that are resilient against evolving threats. The path is not always straightforward, but with careful preparation, you can significantly improve your chances of a successful deployment.

About the Author

This article was prepared by the editorial team for this publication. We focus on practical explanations and update articles when major practices change.

Last reviewed: May 2026
