Academic and vendor benchmarks for construction PPE detection consistently report mAP (mean average precision) figures above 90%. On clean datasets, with good lighting and minimal occlusion, YOLOv8-based models genuinely achieve those numbers. The question no one in the industry wants to answer clearly is: what happens on an actual jobsite at 6:30 AM in February under sodium vapor lighting, when workers are wearing non-standard hard hat colors from three different subcontractors?
We measured it. Here's what we found, and why the gap matters for anyone making a procurement decision based on vendor-supplied benchmark figures.
How benchmark datasets are constructed
The most-cited public datasets for construction PPE detection — including CHV (Construction Hardhat and Vest), SHD (Safety Helmet Detection), and the datasets used in the 2021 IEEE Access paper that many vendors reference — share a common characteristic: they were assembled from controlled environments or curated video footage. Images are photographed at standard heights, in daylight, with workers positioned to ensure good visibility of PPE items.
The CHV dataset, for example, contains 7,000 images with hard hat and safety vest annotations. Approximately 73% were photographed outdoors in daylight, and the annotation guidelines excluded frames where the hard hat was more than 60% occluded. That's a reasonable academic constraint for building a clean training benchmark. It's not representative of an active jobsite where workers are frequently seen from overhead angles, partially occluded by equipment, or wearing hard hats that differ significantly from the training distribution.
Vendors who train on these datasets and publish benchmarks derived from held-out test splits of the same datasets are reporting in-distribution performance. That number tells you how well the model works on data similar to what it was trained on. It does not tell you how it performs on your site, with your workers, under your lighting conditions.
Our measurement methodology
During our 14-week pilot at a 47-acre highway interchange project in the Houston area, we measured detection accuracy against a ground-truth protocol: one trained observer with a tally counter confirmed or rejected each machine detection in real time during three 2-hour observation windows per day. Observer sessions were randomly distributed across shifts and weather conditions. Total manually validated frames: 41,200 across 14 weeks.
We measured true positive rate (correct PPE non-compliance detections), false positive rate (workers flagged who were actually compliant), and false negative rate (non-compliant workers missed by the system). We did not use mAP because it's a precision-recall curve aggregate that doesn't map cleanly to operational decision-making requirements. What a site supervisor cares about is: when a worker is out of compliance, does the alert fire? And when it does fire, is the worker actually out of compliance?
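The two operational questions above map directly onto per-frame confusion counts. A minimal sketch, using illustrative counts rather than the pilot's actual tallies:

```python
# Operational detection metrics from raw confusion counts, as described
# above, rather than a precision-recall aggregate like mAP.
# The counts passed in below are illustrative, not measured data.

def detection_rates(tp: int, fp: int, fn: int, tn: int) -> dict:
    """True/false positive rates from per-frame validation counts.

    tp: non-compliant worker correctly flagged
    fp: compliant worker incorrectly flagged
    fn: non-compliant worker missed
    tn: compliant worker correctly not flagged
    """
    return {
        # "When a worker is out of compliance, does the alert fire?"
        "true_positive_rate": tp / (tp + fn),
        # "When an alert fires on a compliant worker, how often?"
        "false_positive_rate": fp / (fp + tn),
    }

rates = detection_rates(tp=970, fp=21, fn=30, tn=979)
print(rates)  # TPR 0.97, FPR 0.021
```

Note that both rates come from simple counts a human observer can validate in real time, which is what makes the tally-counter protocol workable on a live site.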
The numbers by condition
Daylight, standard PPE variants, clear camera view: 97.3% true positive rate, 2.1% false positive rate. These numbers are consistent with published benchmarks, which makes sense — controlled daytime conditions approximate the conditions under which benchmarks are measured.
Post-sunset, sodium vapor lighting only: 89.1% true positive rate, 4.7% false positive rate. The drop is driven primarily by yellow hard hats under sodium vapor light, which shift in apparent color toward the orange-white spectrum and confuse the classifier. We've addressed this in a model update using nighttime-specific augmentation, which improved post-sunset performance to 93.4% in subsequent testing.
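To illustrate the failure mode, here is a sketch of the kind of color shift a sodium-vapor source induces. The light color and blending math are assumptions for illustration, not our production augmentation parameters:

```python
# Illustrative sketch: simulate the warm, narrow-band cast of sodium
# vapor lighting on an RGB pixel. The reference light color and the
# blend formula are hypothetical, chosen only to show why a yellow
# hard hat drifts toward orange at night.

def sodium_vapor_cast(pixel, strength=0.6):
    """Push an (r, g, b) pixel toward a sodium-vapor orange cast.

    pixel:    (r, g, b) values in 0-255
    strength: 0.0 = unchanged, 1.0 = full cast
    """
    light = (255, 167, 0)  # assumed sodium-vapor reference color
    return tuple(
        round(c * (1 - strength) + c * (l / 255) * strength)
        for c, l in zip(pixel, light)
    )

# A bright yellow hard hat pixel loses green under the simulated light,
# moving it toward the orange range the daytime classifier confuses.
print(sodium_vapor_cast((255, 220, 0)))
```

Nighttime-specific augmentation applies transforms in this spirit to daytime training images so the classifier sees the shifted color distribution before it encounters it on site.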
Partial occlusion (worker partially behind equipment or formwork): 84.3% true positive rate. This is the hardest category to improve without fundamentally changing the detection architecture. When only 30% of a worker's torso is visible, hard hat detection from overhead angles becomes ambiguous even for a human observer.
Non-standard PPE variants (hard hat colors or models not in training distribution): 91.2% on initial deployment, improving to 96.8% after site-specific fine-tuning at week 4. This is the strongest argument for on-site model calibration during deployment. Models trained on generic datasets perform acceptably but not excellently on novel PPE variants; a week of labeled site footage substantially closes the gap.
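Site-specific fine-tuning of this kind can be sketched with the Ultralytics API. The `YOLO` class and `train` call are real; the weights file, dataset config path, and hyperparameters below are placeholders standing in for a week of labeled site footage:

```python
# Sketch of site-specific fine-tuning with the Ultralytics YOLOv8 API.
# File names and hyperparameters are hypothetical placeholders.
from ultralytics import YOLO

# Start from generic pretrained weights rather than training from scratch.
model = YOLO("yolov8s.pt")

# Fine-tune on a small site-specific dataset (hypothetical config file
# pointing at labeled frames of this site's PPE variants and angles).
model.train(
    data="site_week1.yaml",
    epochs=30,
    freeze=10,  # keep early backbone layers fixed; adapt the heads
)
```

Freezing the early backbone layers is a common choice when the adaptation target is a shifted appearance distribution (new hard hat colors, new camera angles) rather than new object categories.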
What the false positive rate actually costs
False positive rates get less attention than true positive rates in vendor marketing, but they determine whether supervisors trust and act on the system. A 5% false positive rate at 1,000 detection events per day is 50 spurious alerts. If each alert requires 2 minutes of supervisor attention to investigate and dismiss, that's 100 minutes of false alert management per day, roughly 21% of a safety supervisor's 8-hour shift spent on events that didn't happen.
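The alert-fatigue arithmetic is simple enough to sketch directly. Inputs are the illustrative figures from the text, not site-specific measurements:

```python
# Daily supervisor time consumed by false alerts, using the
# illustrative figures discussed above.

def false_alert_load(events_per_day, fp_rate, minutes_per_alert=2,
                     shift_minutes=8 * 60):
    """Return (false alerts/day, minutes lost/day, share of shift)."""
    false_alerts = events_per_day * fp_rate
    minutes_lost = false_alerts * minutes_per_alert
    return false_alerts, minutes_lost, minutes_lost / shift_minutes

alerts, minutes, share = false_alert_load(events_per_day=1000, fp_rate=0.05)
print(alerts, minutes, f"{share:.0%}")  # 50 alerts, 100 minutes, ~21% of shift
```

Running the same function at a 2% false positive rate shows why the difference between 2% and 5% is operationally significant: it is the difference between minutes and hours of daily supervisor attention.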
Our 2.1% daytime false positive rate at the pilot site generated approximately 8-12 false alerts per day. Site supervisors reported dismissing these in under 30 seconds each after the first week of familiarity with common false positive patterns. That's manageable. A system at 8-10% false positive rate — which we measured in an early model iteration before site-specific tuning — was not manageable. Supervisors stopped responding to alerts once false positives exceeded approximately 15-20% of daily alert volume.
This matters for evaluating competitor claims. If a vendor reports 95% accuracy without specifying whether that's precision, recall, mAP, or a precision-recall trade-off at a specific threshold, the number is uninterpretable for operational planning. Ask specifically for false positive rates under live conditions, not controlled benchmark conditions.
The occlusion problem at multi-level sites
Multi-level construction sites present a camera placement challenge that doesn't appear in any published benchmark. When workers are on elevated platforms, scaffold decks, or mezzanine-level work areas, overhead cameras positioned on mast poles at ground level have severely degraded view angles. The effective coverage area of a mast camera drops by approximately 60% when the camera needs to cover workers at an elevation above 15 feet from the camera base.
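A simplified geometry sketch shows why coverage collapses so quickly with elevation. Assuming a downward-facing camera whose covered floor area scales with the square of the vertical distance to the plane it views (a simple pinhole-style model; the 40 ft mast height is an assumption for illustration):

```python
# Simplified model of coverage loss when the working plane rises toward
# a mast camera. Assumes covered area scales with the square of the
# vertical distance between camera and working plane; mast height is
# an assumed illustrative value.

def coverage_loss(mast_height_ft, work_elevation_ft):
    """Fractional loss of covered floor area at an elevated working plane."""
    remaining = ((mast_height_ft - work_elevation_ft) / mast_height_ft) ** 2
    return 1 - remaining

# 40 ft mast (assumed), workers on a deck 15 ft up:
print(f"{coverage_loss(40, 15):.0%}")  # ~61% of ground-level coverage lost
```

Under these assumptions a 15 ft working elevation costs roughly 60% of the ground-level coverage area, consistent with the figure above, which is why elevated supplementary cameras become necessary as decking rises.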
Our site survey process specifically maps elevation profiles and identifies camera placement gaps that would result in coverage dead zones at working elevations. For the highway interchange pilot, we installed three additional cameras at elevated positions on temporary mast structures to maintain coverage as formwork and decking rose. That's an installation cost that doesn't appear in the hardware price but is a real project cost for multi-level deployments.
YOLOv8 vs. earlier architectures
It's worth noting the specific model architecture choice. YOLOv8 (Ultralytics, 2023) represents a meaningful improvement over YOLOv5 for construction safety detection for two reasons: better small object detection — critical for hard hat detection at standard camera installation distances — and improved performance under partial occlusion due to changes in anchor-free detection head design. Our side-by-side comparison at the pilot site measured 97.3% (YOLOv8) vs. 91.4% (YOLOv5) true positive rate under identical daylight conditions, with YOLOv8 also showing lower false positive rates (2.1% vs. 3.8%).
Vendors still using YOLOv5 or earlier architectures and citing 2021-era benchmark papers are reporting numbers that aren't representative of what their deployed system currently achieves. The benchmark papers cited don't correspond to the model they're shipping.
The calibration requirement
Our honest recommendation: any serious deployment requires two to four weeks of site-specific calibration before production alerting goes live. That's the period where the model adapts to your specific camera angles, PPE variants, and worker movement patterns. Running pre-calibration detection on a live site and acting on those alerts will generate false positive rates that damage trust in the system before it has a chance to prove its value.
We build the 48-hour initial calibration period into every deployment contract. The 14-week pilot used a more extended calibration window, which is why our production accuracy numbers were higher than what you'd see in a week-one deployment. If a vendor tells you the system performs at 95%+ accuracy on day one without any site-specific adaptation, ask to see the raw logs from their first week at a comparable site. The numbers will tell a different story.
Why this matters for safety outcomes
The gap between benchmark accuracy and live performance matters beyond marketing honesty. At 97% true positive rate, you detect 970 real non-compliance events out of 1,000. At 84%, which is what you get under partial occlusion conditions without site tuning, you detect 840 and miss 160. Those missed events represent unaddressed safety exposures on your jobsite. Knowing the actual detection rate under your conditions — not the benchmark figure — lets you calibrate how much you're relying on machine detection versus supplementing with periodic human walk-throughs in camera dead zones.
Contact us at contact@hardhatpulse.com if you want to see the full benchmark data from our Houston pilot, including per-condition breakdowns by camera position and lighting condition. We share actual numbers, not confidence intervals on curated datasets.