5 Critical Questions to Ask Every Data Security Vendor Now

If you’ve evaluated data security vendors in the past year, you’ve probably heard one of two promises: “your data stays local” or “we don’t take your data outside your environment.” It has become a stock assurance, particularly when vendors describe their support for on-premises datastores and file shares. Every vendor emphasizes their commitment to data privacy, their respect for data sovereignty, and their “local-only” classification architecture. But here’s the uncomfortable truth: for most of our competitors, this claim doesn’t survive contact with technical reality.

The disconnect isn’t always intentional. Some vendors genuinely believe their architecture keeps data local, until you ask about the GPU infrastructure powering their AI classification models. Others have convinced themselves that “extracting metadata” doesn’t count as data exfiltration. And some are simply betting that you won’t ask the right questions. This post provides five interconnected questions designed to test the “your data stays local” promise. These aren’t gotcha questions – they’re straightforward technical inquiries that any legitimate vendor should answer confidently and consistently. The catch? When vendors can’t answer these questions without contradicting themselves, you’ve just discovered where your data actually goes.

The Five Critical Questions

Question 1: “If you support on-premise data stores, is your classification performed on-premises or remotely?”

This should be the easiest question to answer. Either classification happens on-premises, or it doesn’t. There’s no middle ground: your data either travels across the network for classification, or the classification engine comes to your data. For organizations with on-premise databases, data warehouses, or file servers, this question has direct compliance implications. GDPR, HIPAA, and industry-specific regulations often restrict where sensitive data can be processed. If your compliance strategy depends on data remaining in specific geographic locations or security boundaries, remote classification may break that promise.

What to Look For

A straightforward vendor will answer: “On-premises. We deploy classification engines in your environment that process data locally.” Any other answer (“hybrid approach,” “we process metadata centrally,” “we use an outpost model”) means that data is leaving your environment. This question directly tests the “local processing” claim. If the answer isn’t an unambiguous “on-premises,” then the vendor’s marketing about keeping data local is false. Watch for vendors promoting “outpost” or “edge” models where classification agents run locally but send “only metadata” back to central infrastructure. This introduces dangerous opacity. What exactly is “metadata”? Classification results? Data statistics? Sample fingerprints? Without transparency into what’s transmitted, you can’t verify that sensitive information isn’t included.

Worse, outpost models can hide classification quality. A vendor might deploy a lightweight, less-accurate engine in your environment while marketing their advanced GPU-powered models. You get the appearance of local processing without the advertised accuracy—and no visibility into the gap. True local processing means classification happens in your environment, results stay in your environment, and you have complete control over what, if anything, is transmitted outbound. Any architecture that sends information back to vendor infrastructure—whether called “metadata,” “telemetry,” or “encrypted fingerprints”—requires scrutiny about what’s actually in those packets.
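One practical way to test this is to inspect the outbound traffic yourself. The sketch below assumes you have captured an outpost agent’s outbound JSON payloads (for example, through an egress proxy) into a file; the field names and the capture file are hypothetical placeholders, not any specific vendor’s format.

    import json
    import re

    # Hypothetical field names that would suggest derived content, not just labels.
    SUSPECT_KEYS = {"sample", "snippet", "raw_text", "fingerprint", "excerpt"}
    SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def review_payload(path: str) -> None:
        """Flag fields in a captured payload that may carry more than classification labels."""
        with open(path) as f:
            payload = json.load(f)
        for key, value in payload.items():
            if key in SUSPECT_KEYS:
                print(f"Field '{key}' may carry derived content, not just labels")
            if isinstance(value, str) and SSN_PATTERN.search(value):
                print(f"Field '{key}' contains what looks like an SSN")

    review_payload("outpost_capture.json")  # hypothetical capture file

If the payload turns out to contain anything beyond labels and counts, the “only metadata” claim deserves harder questions.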

Question 2: “Do you use GPU-powered LLM classification or other forms of AI? Where are those GPUs located, and are they shared?”

Modern data classification relies on sophisticated machine learning models that identify sensitive information across structured and unstructured data. These models – especially when vendors implement them using LLMs – require significant computational power. GPUs deliver the power and speed that make LLMs viable, but GPUs are expensive. Enterprise-grade GPU infrastructure costs tens of thousands of dollars per node, requires specialized expertise to maintain, and needs constant updates as models evolve. Most vendors solve this problem by centralizing GPU resources in their cloud infrastructure.

What to Look For

Listen carefully to where the vendor places their GPU infrastructure if they claim to use LLMs. Answers like “we leverage advanced cloud infrastructure” or “we use optimized processing” are red flags. If the vendor runs GPU workloads in AWS, Azure, Snowflake, or their own data centers, that’s where classification happens, not in your environment. A consistent architecture would require deploying GPU resources wherever classification is performed: in your cloud tenants and in your on-premise data centers. This is technically possible but economically impractical for most vendors.

See the contradiction? GPU-powered LLMs cannot classify your data without accessing your data. If a vendor claims “your data never leaves your environment” but runs the LLMs in their cloud, the math doesn’t work. Your data must travel to those GPUs for classification to occur. Some vendors try to split the difference: “We extract features locally and classify those features remotely.” This is still data exfiltration. Those “features” are derived from your sensitive data and can reveal significant information about your data’s content and structure.
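To make that concrete, here is a deliberately simplified illustration of how a “features only” payload can still expose record contents. The record and feature fields are invented for the example; real feature payloads vary by vendor, which is exactly why you need to see them.

    # A single hypothetical record and the "features" a local agent might extract from it.
    record = {"name": "Jane Doe", "ssn": "123-45-6789", "balance": 182400}

    features = {
        "detected_entities": [f"PERSON: {record['name']}", f"US_SSN: {record['ssn']}"],
        "column_stats": {"balance": {"min": record["balance"], "max": record["balance"]}},
    }

    # Anything in this payload that leaves your network is derived from the
    # underlying record and, in this case, reproduces its most sensitive values.
    print(features)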

Question 3: “Are your classification models consistent across all environments you support? How are they kept consistent?”

Inconsistent classification is a governance nightmare. Imagine data classified as “Highly Confidential” in your cloud environment but “Internal Use Only” in your on-premise database. Different classifications for identical data create policy gaps, compliance blind spots, and organizational confusion about what’s actually sensitive. Vendors love to promise consistent classification. It’s a compelling value proposition: one classification standard, uniformly applied, regardless of where data lives. But delivering on this promise requires solving a complex technical problem.

To achieve truly consistent classification across diverse environments, vendors need either:

  • Centralized classification engines: All data funnels to a single classification engine. This ensures consistency but requires data exfiltration.
  • Distributed engines with identical models: Deploy the exact same classification engine everywhere. This is technically challenging, and when classification depends heavily on GPU-powered LLMs, it becomes economically expensive as well.
  • A combination of the above, depending on where compute can be deployed (for example, within SaaS environments).

Most vendors choose the first option because it’s simpler to scale, and building an LLM-based classifier centrally is easier than ever. If they’re truly using local classification everywhere, maintaining consistency requires synchronized model updates, version control, and constant validation that distributed engines produce identical results. Few vendors have this infrastructure.
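If a vendor does claim identical distributed engines, ask to run a consistency check yourself. The sketch below assumes each deployment exposes a model version and some way to classify a shared validation set; the classify() call and model_version attribute are hypothetical placeholders for whatever interface the vendor actually provides.

    def compare_deployments(validation_records, cloud_engine, onprem_engine):
        """Compare model versions and per-record labels across two deployments."""
        mismatches = []
        if cloud_engine.model_version != onprem_engine.model_version:
            print("Model versions differ:",
                  cloud_engine.model_version, "vs", onprem_engine.model_version)
        for record in validation_records:
            cloud_label = cloud_engine.classify(record)    # hypothetical call
            onprem_label = onprem_engine.classify(record)  # hypothetical call
            if cloud_label != onprem_label:
                mismatches.append((record, cloud_label, onprem_label))
        agreement = 1 - len(mismatches) / max(len(validation_records), 1)
        print(f"Agreement rate: {agreement:.1%} ({len(mismatches)} mismatches)")
        return mismatches

Anything short of near-perfect agreement means the “one classification standard” promise is not being delivered.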

Question 4: “How can we verify the accuracy of your classification? Do you show us the data samples you retain?”

Classification accuracy isn’t academic – it’s the foundation of your entire data security strategy. Over-classification creates operational friction and desensitizes everyone to the importance of classification. Under-classification leaves sensitive data exposed. You need to validate that the vendor’s classification engine works correctly on your specific data. Most vendors demonstrate accuracy through sample sets: “Here are 100 records we classified, and here’s the sensitive information we found in each.” This transparency builds trust and allows you to calibrate the system. 

If a vendor can show you classified samples from your environment, they must be storing those samples somewhere. Ask explicitly: “Where are these samples stored? How long do you retain them? Who has access?”

Vendors who truly don’t take data outside your environment will either:

  • Not retain samples at all and not show them in the UI (less ideal for validation)
  • Store samples securely within your environment with clear access controls
  • Be completely transparent about how samples are redacted before any storage in their infrastructure, and provide only ephemeral links for review.

See the contradiction? If you can review samples in a vendor portal, those samples are stored in vendor infrastructure. This question creates a direct contradiction with the “we don’t take your data” claim: “We retain samples for validation” and “We don’t store your data” cannot both be true. If those samples contain sensitive data (which they must, to demonstrate accuracy), the vendor is storing your sensitive information, exactly what they promised not to do. Some vendors try to square this circle by “anonymizing” or “redacting” samples. But truly effective anonymization often makes samples useless for validation. And imperfect anonymization still represents data retention and potential exposure. Think about how many processes rely on confirming the last four digits of your SSN.
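A tiny illustration of why partial redaction still counts as retaining sensitive data: many verification workflows treat the last four digits of an SSN as a shared secret, so a sample redacted down to those digits still retains exactly the value an attacker would want. The SSN below is an invented example.

    full_ssn = "123-45-6789"                # illustrative value, not real data
    redacted = "***-**-" + full_ssn[-4:]

    print(redacted)                         # ***-**-6789
    print(redacted[-4:] == full_ssn[-4:])   # True: the verification secret survives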

Question 5: “Is the classification shown in your demo environment identical to what we’ll get in production at scale? How much will it cost us at that scale?”

Demos are designed to impress. Vendors showcase instant classification, comprehensive results, and sophisticated pattern matching. But demo environments run under ideal conditions: pre-loaded data, optimized infrastructure, and often, centralized cloud resources with abundant GPU power. The critical question is whether that demo performance translates to your actual environment, especially if you’re deploying on-premise or in restricted cloud environments. Ask the vendor to replicate demo performance in a controlled test environment that mirrors your production constraints or scale. If you’re deploying on-premise, can they show you classification running on local hardware without cloud connectivity? If you’re in a regulated industry with data residency requirements, can they demonstrate classification within those boundaries?

Watch for answers like “production performance depends on your infrastructure and budget.” These suggest that the impressive demo relies on centralized cloud resources you won’t have in production.

See the contradiction? If demos show instant, GPU-powered classification, that compute power lives somewhere. For most vendors, it lives in their cloud infrastructure. Those impressive demos work by uploading your sample data to vendor-controlled environments where abundant GPU resources deliver fast results. Enterprise GPU infrastructure costs hundreds of thousands of dollars and requires specialized expertise. Vendors that show impressive GPU-powered demos but promise local classification face an economic impossibility: they can’t afford to deploy GPU infrastructure in every customer environment. GPUs are centralized in vendor cloud infrastructure. Your data travels to those GPUs for classification. The “local processing” claim is probably false if you expect the same performance. And worse, if your production environment won’t have network connectivity to those cloud GPUs (because you’ve deployed on-premise or in an air-gapped environment), that demo performance won’t materialize. The vendor must choose: accurate demos using cloud resources (data exfiltration) or honest demos showing actual on-premise performance (often slower and less impressive).
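One way to turn this into a proof-of-concept test is to time the vendor’s local engine against a known corpus on in-environment hardware, with egress to the vendor blocked, and compare the numbers to the demo. The scan_path() call and the vendor hostname below are hypothetical stand-ins for whatever interface and endpoints the vendor actually uses.

    import socket
    import time

    def assert_no_vendor_egress(host="api.vendor.example", port=443):
        """Confirm the test host cannot reach vendor infrastructure before benchmarking."""
        try:
            conn = socket.create_connection((host, port), timeout=3)
        except OSError:
            print("No route to vendor infrastructure: the test is genuinely local")
            return
        conn.close()
        raise RuntimeError("Egress to vendor infrastructure is still open")

    def benchmark(engine, corpus_path, expected_records):
        """Time classification of a known corpus and report throughput."""
        start = time.monotonic()
        results = engine.scan_path(corpus_path)  # hypothetical local-engine call
        elapsed = time.monotonic() - start
        print(f"{len(results)} of {expected_records} records classified "
              f"in {elapsed:.1f}s ({len(results) / elapsed:.0f} records/s)")

If throughput collapses or the engine simply cannot run without reaching back to the vendor, you have your answer about where the demo’s performance really came from.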

What This Means for Your Organization

These contradictions aren’t just technical curiosities-they have real implications for your security posture and compliance obligations.

  • Compliance Risk: If your compliance strategy depends on data remaining within specific boundaries (geographic, network, or organizational), remote classification breaks that model. GDPR’s restrictions on cross-border data transfers, HIPAA’s Security Rule, and industry-specific regulations all require understanding exactly where data is processed. “We classify in our cloud” may violate your compliance obligations, even if the vendor promises encryption and security.
  • Data Sovereignty: For organizations in regulated industries or government sectors, data sovereignty isn’t negotiable. If data cannot leave your environment under any circumstances, vendors that rely on centralized cloud classification aren’t viable, regardless of their marketing claims.
  • Attack Surface Expansion: Every location where your data exists or is processed expands your attack surface. Data transmitted to vendor infrastructure for classification creates new interception opportunities. Data stored in vendor environments for sample validation creates new breach targets. Understanding your actual data flows is essential for threat modeling.
  • Trust and Transparency: Perhaps most importantly, vendors who can’t answer these five questions consistently to your satisfaction aren’t being transparent about their architecture. If they’re misleading you about data flows (even unintentionally), what else don’t you know about their security model, their incident response capabilities, or their sub-processor relationships?

The Bottom Line

The “we don’t take your data outside your environment” promise is the easiest claim to make and the hardest to deliver. It’s technically challenging and often incompatible with other vendor promises about performance, consistency, and use of LLMs. At Symmetry, we know, because we’ve spent decades researching this problem and engineering a distributed data lake that can be deployed wherever your data lives. Even with our SaaS connectors, we take great care to keep the data within your environment, even though your environment is not the SaaS environment itself.

These five questions force vendors to explain the technical reality behind the marketing. Ask them in sequence. Listen for contradictions. Watch for hedging about what counts as “data” or what “local processing” really means. In your next vendor evaluation, try this exercise: Start with question one (“Is classification performed on-premises or remotely?”) and get a clear answer. Then ask the other four questions and see if the answers remain consistent. If a vendor claims on-premise classification but uses cloud GPUs, demonstrates impressive cloud-based demos, shows consistent results across all environments, and retains samples for validation, you’ve found a logical impossibility.

The vendors who can answer all five questions consistently, without contradicting themselves, are rare. They’re also the ones who’ve made the substantial technical and economic investments required to truly keep your data in your environment. Those vendors are worth finding, because they offer actual, not aspirational, data security at their core. Transparency about data flows shouldn’t be a competitive advantage. It should be table stakes. But until it is, these five questions will help you separate genuine architectural commitments from marketing promises that don’t survive technical scrutiny.

Your data deserves better than promises. It deserves proof.


About Symmetry Systems

Symmetry Systems is the Data+AI Security company. Symmetry’s leading cybersecurity platform helps organizations of all sizes safeguard data at scale, detect and reduce identity threats, ensure compliance & reduce AI risks. Born from the award-winning and DARPA-funded Spark Research Lab at UT Austin, Symmetry is backed by leading security investors like ForgePoint, Prefix Capital, and others. With total visibility into what data you have, where it lives, who can access it, and how it’s being used, Symmetry’s innovative platform merges identity access with DSPM, delivering security outcomes that matter, including:

  • Finding significant savings by eliminating petabytes of unnecessary data
  • Removing thousands of dormant identities and excessive permissions
  • Satisfying HIPAA and PCI compliance requirements in record time
  • Reducing data blast radius and attack surface
  • Detecting ransomware attacks and enforcing least-privilege access

Symmetry’s platform works across both structured and unstructured data in all major cloud environments (AWS, GCP, Azure and OCI), SaaS, and on-premise databases and data lakes. As a read-only service, it inherits all existing security and compliance controls, making it deployable even in the most strictly regulated environments. 

Organizations of all sizes trust Symmetry to protect their data without it ever leaving their custody and control. 

Innovate with confidence with Symmetry Systems.
