Industry · May 13, 2026

How to Evaluate AI for Trade Compliance

By Aidan Gallary

The trade compliance software market has become an AI market. Every vendor pitching an importer, broker, or 3PL in 2026 is leading with AI features. The variance in quality is enormous, and the cost of choosing wrong is asymmetric: the wrong AI tool does not just fail to deliver value, it creates audit exposure. This is your framework for evaluating AI in trade compliance: four demo questions, six red flags to avoid, and six attributes that define a pilot-worthy tool.

Four questions that cut through any demo

Every AI vendor pitch should answer these four questions clearly. If a vendor cannot answer all four in the first 15 minutes, the tool is not fit for trade compliance.

1. Is the tool built by trade professionals or by software engineers?

A team that has never filed an entry, written a CF-29 response, or read a CBP ruling cannot build a defensible product in this space.

The right question to a vendor: who on your team has done customs work, and can I speak with them directly? If the vendor cannot produce that person on a call, the team does not have the domain depth to sell AI in trade compliance.

2. Does it connect to your actual data?

AI accuracy is bounded by the inputs the system can see. If a tool cannot ingest a product database, HTS codes, purchase orders, commercial invoices, and entries, it is operating blind. Whatever it produces is a reasonable-sounding guess based on whatever description was typed into a text box.

The right question to a vendor: how does this platform integrate with our workflows and existing data sources? If the answer is "we do not integrate, the user types in the product description," the tool is a general-purpose chatbot in a trade compliance wrapper.

3. Are outputs source-grounded and auditable?

Every classification, valuation, and origin determination must cite the regulation or ruling that supports it. In an audit, the importer or broker has to walk the chain from the entry back to the underlying reasoning.

The right question to a vendor: show me a determination and trace it back to its source. If the demo cannot produce that trace in real time, the production system will not produce it either.
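To make "trace it back to its source" concrete, here is a minimal sketch of what a source-grounded output could look like as a data structure. It is an illustration only, not any vendor's actual schema: the class names, field names, and placeholder identifiers are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    source: str        # e.g. "HTS", "19 CFR", "CBP ruling"
    reference: str     # placeholder identifier, not a real citation
    version_date: str  # date of the regulatory text that was consulted


@dataclass
class Determination:
    kind: str          # "classification", "valuation", or "origin"
    entry_line: str    # hypothetical entry line identifier
    result: str
    citations: list[Citation] = field(default_factory=list)

    def is_traceable(self) -> bool:
        """A determination is traceable only if it cites at least one source."""
        return len(self.citations) > 0

    def trace(self) -> list[str]:
        """Return the chain an auditor would walk, from entry to source."""
        steps = [f"{self.entry_line}: {self.kind} -> {self.result}"]
        steps += [
            f"  supported by {c.source} {c.reference} (as of {c.version_date})"
            for c in self.citations
        ]
        return steps
```

In a demo, the equivalent of `trace()` is what you are asking the vendor to show on screen. A determination for which `is_traceable()` would be false is one that cannot be walked back in an audit.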

4. Is there a human in the loop?

Confidence scoring and workflow triggers separate useful AI from dangerous AI. Every output should carry a confidence score. Low-confidence outputs should automatically route to a human reviewer. Edge cases should not be auto-resolved by the model.

The right question to a vendor: what happens when the AI's confidence on a determination falls below your threshold? If the answer is vague, the vendor has not thought seriously about the failure modes that matter most in compliance work.
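A clear answer to that question usually reduces to a routing rule. The sketch below shows one possible shape for it. The threshold value, the field names, and the `route` function are hypothetical, chosen only to illustrate the behavior described above.

```python
from dataclasses import dataclass

# Assumed value for illustration; a real threshold would be set per the
# importer's risk tolerance.
REVIEW_THRESHOLD = 0.85


@dataclass
class ScoredOutput:
    item_id: str
    result: str
    confidence: float          # expected range: 0.0 to 1.0
    is_edge_case: bool = False


def route(output: ScoredOutput, threshold: float = REVIEW_THRESHOLD) -> str:
    """Return "human_review" or "auto_accept" for a scored output."""
    if not 0.0 <= output.confidence <= 1.0:
        # A missing or malformed score is treated as a failure, not a pass.
        return "human_review"
    if output.is_edge_case or output.confidence < threshold:
        # Low confidence and flagged edge cases never auto-resolve.
        return "human_review"
    return "auto_accept"
```

The design choice worth probing with a vendor is the default: in this sketch anything ambiguous, including an out-of-range score, falls to a human reviewer rather than passing through.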

Six red flags that should end the conversation

Beyond the four questions, these six red flags should trigger an immediate walk-away. Any single red flag indicates a tool that does not belong in a compliance workflow.

Red flag 1: No confidence scoring. If the system cannot tell you how certain it is about a determination, the user cannot manage risk.

Red flag 2: No audit trail. If the system cannot reconstruct how it arrived at a classification, valuation, or origin call, the determination cannot be defended. Every output should be traceable to its inputs and its reasoning.

Red flag 3: Free-form LLM output only. Long narrative answers with no structure, no cited assumptions, and no source list are a liability. A general-purpose chatbot dressed up in a compliance UI is still a general-purpose chatbot.

Red flag 4: No integration with your data. If the platform cannot read the importer's actual orders, invoices, product database, and HTS codes, the AI is guessing. Integration depth is a leading indicator of how serious the vendor is about real-world deployment.

Red flag 5: No trade expertise on the team. If the vendor cannot discuss the importer's specific compliance challenges in a consultative way, the team does not understand the domain.

Red flag 6: "The model knows." When asked where regulatory data comes from, a serious vendor cites a curated regulatory library, versioned content, and a process for keeping HTS, CFR, and rulings current. A vendor who says the regulatory knowledge "comes from the model's training data" has not built a defensible product.

Six attributes that define a tool worth piloting

The inverse of the red flags is the positive checklist. A tool worth piloting demonstrates all six.

1. Updated regulations and resources. The tool has live access to the current HTS, CFR Title 19, customs rulings, explanatory notes, and the importer's own internal policies. Regulatory content is versioned and dated.

2. Confidence scoring on every output. Every determination carries a confidence score. The scoring is calibrated against real-world outcomes. Low-confidence outputs automatically trigger human review.

3. Controlled data sources. The AI does not pull from the open internet to fill gaps. Regulatory information comes from a defined, restricted, versioned library. There is a clear answer to the question "where did this come from?"

4. Multi-layered audit. AI audits human output. Humans audit AI output. Both directions, continuously. The audit model is not redundancy. It is the only structure that catches both classes of error before they reach CBP.

5. Expansive, global datasets. For sourcing, PO management, and restricted party screening, the platform draws on global trade datasets that reflect counterparty risk, country of origin patterns, and sanctions exposure.

6. Customizable workflows. Configurable triggers for supervisor review, complex classifications, and regulatory changes. The tool adapts to the importer's risk tolerance, product mix, and team structure.

Where to start: post-entry audit

For importers and brokers evaluating AI for the first time, the highest-leverage starting point is a post-entry audit.

  • The data already exists. A post-entry audit compares filed entry data to commercial invoices and product records. No new data collection is required.
  • The findings are immediate. Misclassifications, missing PGA data, 99-code sequencing errors, and duty overpayments show up within the first audit cycle.
  • The risk is contained. A post-entry audit is read-only. The AI is not making determinations that get filed with CBP. It identifies issues for human review and correction.
  • The ROI is measurable. Overpayment findings are recoverable through PSCs and protests. Underpayment findings can be corrected through prior disclosure. Either result has financial value.
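The core of the comparison described in the points above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production audit: the record types, field names, and the `audit` function are hypothetical, and a real audit would cover far more than code and value mismatches.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class EntryLine:
    line_id: str
    hts_code: str            # code as filed on the entry
    entered_value: Decimal   # value as filed on the entry


@dataclass(frozen=True)
class SourceRecord:
    line_id: str
    expected_hts_code: str   # from the importer's product database
    invoice_value: Decimal   # from the commercial invoice


def audit(entries: list[EntryLine], records: list[SourceRecord]) -> list[str]:
    """Read-only comparison: returns findings for human review, changes nothing."""
    by_id = {r.line_id: r for r in records}
    findings = []
    for e in entries:
        r = by_id.get(e.line_id)
        if r is None:
            findings.append(f"{e.line_id}: no matching source record")
            continue
        if e.hts_code != r.expected_hts_code:
            findings.append(
                f"{e.line_id}: HTS mismatch "
                f"({e.hts_code} filed, {r.expected_hts_code} on record)"
            )
        if e.entered_value != r.invoice_value:
            findings.append(
                f"{e.line_id}: value mismatch "
                f"({e.entered_value} filed, {r.invoice_value} invoiced)"
            )
    return findings
```

The function only reads its inputs and returns a list of findings, which mirrors the contained-risk point: nothing is filed or corrected until a person reviews the output.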

For a tool that promises to support customs operations, a post-entry audit is the cleanest demonstration of capability. Any vendor that cannot produce a meaningful audit result on a recent sample of entries within a short pilot has not built the underlying capability.

The framework in one paragraph

Ask the four questions: trade expertise, data integration, source-grounded outputs, human in the loop. Walk away from the six red flags: no confidence scoring, no audit trail, free-form LLM output, no data integration, no trade expertise, "the model knows." Pilot tools that demonstrate the six positive attributes: updated regulations, confidence scoring, controlled sources, multi-layered audit, global datasets, customizable workflows. Start with a post-entry audit. The cost of getting this evaluation wrong shows up in the next audit cycle, and that math gets worse every quarter.