A Market Researcher’s Guide to Data Quality in an AI World

Researchers have always had to deal with bad data. In the days of in-person interviews, people would lie about their age, income, or habits just to qualify or to avoid embarrassment. Phone interviewers faced their own challenges: respondents giving half-hearted answers while distracted or multitasking, and interviewers biasing respondents by the way they asked questions or by offering their own opinions. Then came online surveys, which made research quicker and more scalable but opened the door to duplicates, speeders, straightliners, and people motivated only by the incentives.

Now AI bots have raised the stakes. They can fill out surveys quickly, with perfect grammar and logic, and are smart enough to mimic real people, which makes them even harder to catch. Their scalability also makes it profitable for bad actors to flood surveys with fraudulent responses.

We previously addressed AI’s impact on the integrity of market research, including ways to identify poor-quality responses and bots. This guide takes that a step further with a practical approach to tackling data quality head-on.

It starts with the sample provider

The source of your sample is the main driver of the quality of your overall insights, and not all sources are equal. If it sounds too good to be true (like a huge sample size for a low-incidence-rate (IR) study at a low cost per interview (CPI)), it probably is.

Key tips for choosing your sample provider:

  • Know their sources

    • Understanding where the responses are coming from is critical. Are they pulling from vetted communities or random traffic networks?

    • What does it take to become part of their network?

  • Ask about their trust metrics

    • Do they use fraud scores, device fingerprinting, or behavioral validation? How often is their panel cleaned? 

  • Ask about their reconciliation policy

    • If poor-quality responses slip through, will they replace them or credit you?

    • What are the parameters for removing responses?

  • Do a test run

    • Evaluate the data quality. How does your vendor respond if you flag a high number of AI-generated responses? Do they acknowledge the issue and work with you to find solutions, or do they push back and suggest your standards are too strict? 

    • The first step toward improvement is acceptance—if your sample provider isn’t there yet, this partnership may not work out.

  • Trust, but validate

    • Even with a good provider, you still need to build in your own data quality checks. Don’t rely entirely on upstream quality controls. Plan to catch what slips through:

Program a smartER survey

Program a survey that strikes a balance between effective data quality (DQ) questions and being too annoying or too conservative with your data quality checks. Key tips for programming a smart survey:

  • Include red herring questions/options: 

    • Insert impossible or highly unlikely answer options (e.g. "I’ve used the brand SuperGlue Shampoo").

  • Include low-likelihood events as options:

    • When was the last time you piloted a helicopter, boat, and airplane in the same day?

  • Set up time traps within the survey

    • Track time per page and overall survey time. Humans need time to read and think.

  • Logic traps: 

    • Flag inconsistent logic (e.g. says they’re 18, retired, and a parent of a 25-year-old).

  • Geo/IP mismatches: 

    • Respondent says they live in Boston, but IP is pinging from Bangladesh.

  • Utilize your survey software technology

    • Ours includes flags for straightlining, speeding, copy/paste and bot detection, and gibberish answers.

  • Set up a series of data quality questions

    • Set up a bank of data quality questions (radio, checkbox, etc.) and show any 3-5 at random for each respondent (a sketch of several of these checks follows this list).
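
To make a few of these checks concrete, below is a minimal Python sketch of what post-field flagging could look like once responses are exported. The field names, thresholds, and the example DQ bank are illustrative assumptions, not any particular survey platform’s API.

```python
import random

# Illustrative thresholds -- tune these to your own survey length and audience.
MIN_SECONDS_PER_PAGE = 4      # humans need time to read and think
MIN_TOTAL_SECONDS = 120       # assumed floor for the full questionnaire

# Hypothetical bank of trap questions; show a random few to each respondent.
DQ_BANK = [
    "I have used the brand SuperGlue Shampoo.",                # red herring
    "I piloted a helicopter, a boat, and an airplane today.",  # low-likelihood event
    "I recognize the (fictional) brand Zorblatt Cola.",        # red herring
    "Please select 'Somewhat agree' for this row.",            # attention check
]

def pick_dq_questions(n=3):
    """Show a random subset of the DQ bank so bots can't memorize one trap."""
    return random.sample(DQ_BANK, n)

def is_speeder(page_seconds, total_seconds):
    """Time trap: flag anyone moving faster than a human plausibly could."""
    return (total_seconds < MIN_TOTAL_SECONDS
            or any(t < MIN_SECONDS_PER_PAGE for t in page_seconds))

def is_straightliner(grid_answers):
    """Flag respondents who give the identical answer to every row of a grid."""
    return len(grid_answers) > 1 and len(set(grid_answers)) == 1

def fails_logic_trap(age, retired, oldest_child_age):
    """Logic trap: e.g. claims to be 18, retired, and the parent of a 25-year-old."""
    return age < 25 and (retired or (oldest_child_age or 0) >= age)

# Hypothetical respondent record from a survey export.
resp = {"page_seconds": [3, 2, 5], "total_seconds": 95,
        "grid": ["Agree"] * 8, "age": 18, "retired": True, "oldest_child_age": 25}

flags = {
    "speeder": is_speeder(resp["page_seconds"], resp["total_seconds"]),
    "straightliner": is_straightliner(resp["grid"]),
    "logic_trap": fails_logic_trap(resp["age"], resp["retired"], resp["oldest_child_age"]),
}
print(pick_dq_questions())  # a different 3 traps per respondent
print(flags)                # {'speeder': True, 'straightliner': True, 'logic_trap': True}
```

In practice you would run flags like these across every respondent and weigh the combination of flags rather than removing anyone on a single trigger.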

Ask smarter open-ended questions

Open-ended questions used to be only about depth. Now they’re also extremely useful for manual bot detection and data quality checks. Three tactics that work:

  • Ask very technical questions that most respondents won’t know, but AI would

    • “Off the top of your head without searching, what is the 37th digit of Pi after the decimal point?”

      • Most humans would answer “I don’t know,” but bots will know the answer.

  • Include hidden/hard-to-read text on the page

    • Some bots will crawl the page, so include hidden text or hard-to-read fonts in certain questions. If the response addresses the hidden question, we know it’s a bot and we can remove it.

  • Get hands-on practice

    • As you do more DQ checks, you’ll get better at identifying how most bots leave their answers: perfect grammar and punctuation, and generally vague, similar response patterns across the dataset (see the sketch after this list).
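
As a rough illustration of the hidden-text and pattern-spotting tactics above, here is a sketch that flags open ends which echo a honeypot prompt humans never see, and open ends that are suspiciously similar to one another. The honeypot phrase, the similarity threshold, and the sample answers are all made up for the example.

```python
from difflib import SequenceMatcher

# Hypothetical honeypot phrase hidden on the page (e.g. via styling) that a
# human never sees, but a page-crawling bot may echo back in its answer.
HONEYPOT_MARKER = "purple dinosaur"

def hits_honeypot(open_end):
    """Flag responses that address a prompt no human respondent could see."""
    return HONEYPOT_MARKER in open_end.lower()

def near_duplicates(open_ends, threshold=0.85):
    """Flag open ends that are nearly identical to each other.

    Bots often leave vague, templated answers; unrelated humans rarely
    produce near-word-for-word matches on an open-ended question.
    """
    flagged = set()
    for i in range(len(open_ends)):
        for j in range(i + 1, len(open_ends)):
            ratio = SequenceMatcher(None, open_ends[i].lower(), open_ends[j].lower()).ratio()
            if ratio >= threshold:
                flagged.update({i, j})
    return flagged

# Made-up example answers.
answers = [
    "I buy it because the price is reasonable and the quality is consistently excellent.",
    "I buy it because the price is reasonable and the quality is consistently excellent!",
    "honestly just grabbed whatever was on sale lol",
    "As requested, I will also mention your favorite purple dinosaur.",
]

print([i for i, a in enumerate(answers) if hits_honeypot(a)])  # [3]
print(sorted(near_duplicates(answers)))                        # [0, 1]
```

The pairwise similarity check is deliberately simple and slow on large datasets, but the underlying idea of hunting for near-identical open ends carries over to whatever text-matching approach you prefer.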

Reconcile

Even with these DQ questions in place, bad data can still sneak in, and you shouldn’t have to pay for it.

  • Clean your data on a daily basis, and keep a master document of the IDs you want to remove (a sketch of this follows below).

  • If you notice a huge spike in removals, reach out to your provider immediately.
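
One lightweight way to operationalize this, sketched below with hypothetical file names and an assumed spike threshold, is to keep one running master-removal file, append each day’s flagged IDs to it, and alert yourself when removals spike relative to that day’s completes.

```python
from pathlib import Path

MASTER_FILE = Path("master_removals.txt")  # hypothetical running list of bad IDs
SPIKE_THRESHOLD = 0.15                     # assumed: >15% removals in one day is a spike

def load_master():
    """Read the master list of respondent IDs flagged for removal so far."""
    return set(MASTER_FILE.read_text().split()) if MASTER_FILE.exists() else set()

def reconcile_daily(todays_completes, todays_flagged):
    """Append today's flagged IDs to the master list and warn on spikes."""
    master = load_master() | set(todays_flagged)
    MASTER_FILE.write_text("\n".join(sorted(master)))

    removal_rate = len(todays_flagged) / max(len(todays_completes), 1)
    if removal_rate > SPIKE_THRESHOLD:
        print(f"WARNING: {removal_rate:.0%} of today's completes were flagged -- "
              "reach out to your sample provider.")

    # Keep only the IDs you intend to keep (and pay for).
    return set(todays_completes) - master

clean_ids = reconcile_daily(
    todays_completes=["r001", "r002", "r003", "r004"],
    todays_flagged=["r002"],
)
print(sorted(clean_ids))  # ['r001', 'r003', 'r004'] (assuming an empty master list)
```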

The truth is, there is no silver-bullet solution. But layered, adaptive, evolving protections like smart sampling, thoughtful programming, and manual review will give you better data and better insights. It just takes more work than it used to.

Jon Pirc

Jon has spent his professional career as an entrepreneur and is constantly looking to disrupt traditional industries by using new technologies. After working at Sandbox Industries as a ‘Founder in Residence’, Jon founded Lab42 in 2010 as a way to make research more accessible to smaller companies. Jon has a Bachelor of Science in Psychology from Northern Illinois University.
