The Wall Street Journal posted a troubling article today on how more and more companies are scraping data from websites (i.e., obtaining the data by pulling it from the user-facing side instead of by accessing a site’s database or other private/proprietary repositories).
Media research firm Nielsen Co. was recently caught scraping data from the forums at PatientsLikeMe.com by signing in as a regular forum member, then systematically copying the data there, including users’ profile information and posts. While not technically illegal, the practice is pretty creepy.
According to the WSJ article, Nielsen isn’t the only company doing this. It’s just one of many.
“Customers for whom we were regularly blocking about 1,000 to 2,000 scrapes a month are now seeing three times or in some cases 10 times as much scraping,” says Marino Zini, managing director of Sentor Anti Scraping System. The company’s Stockholm team blocks scrapers on behalf of website clients.
This kind of data has always been used for marketing and advertising purposes, and many people don’t see that as a big deal. The more disturbing stuff comes when scrapers use that information for more targeted, personal reasons. Employers aren’t just relying on Google to tell them about you; they’re using scraping firms to dig deeper. Your tweets, status updates, and other social networking chatter may not be safe even if they don’t show up on your public profile. And then there are companies like PeekYou LLC, which is attempting to patent a method meant to match people’s real names to the screen names, handles, and pseudonyms they use on blogs, social networks, etc. Classy.
Is there any way to protect yourself from scraping? There’s likely no completely foolproof way, and perhaps the best you can hope for is that whatever data these companies mine can’t be linked to you if you don’t want it to be. Right now there’s no law on this here in the U.S., and apparently international law is fuzzy and varies. Looking at this diagram, taken from PeekYou’s patent application, it may be possible to thwart their efforts to link you to your screen name by changing up the details you give different websites. Don’t always use the same email, don’t give your real birthday, leave out your middle name and initial, and avoid giving your address or even city.
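To see why varying your details might help, here is a toy sketch of how two profiles could be linked by counting the fields they share. This is not PeekYou’s actual method, which I don’t know the details of; the field names, the sample profiles, and the two-match threshold are all assumptions made up for illustration.

```python
# Toy illustration only: the fields and threshold below are assumptions,
# not a description of any real company's matching method.
LINK_FIELDS = ("email", "birthday", "middle_name", "city")


def shared_fields(profile_a, profile_b):
    """Return the fields on which both profiles give the same non-empty value."""
    matches = []
    for field in LINK_FIELDS:
        a = (profile_a.get(field) or "").strip().lower()
        b = (profile_b.get(field) or "").strip().lower()
        if a and a == b:
            matches.append(field)
    return matches


def looks_linked(profile_a, profile_b, threshold=2):
    """Guess that two profiles belong to one person if enough fields agree."""
    return len(shared_fields(profile_a, profile_b)) >= threshold


# Hypothetical profiles a scraper might collect from different sites.
forum_profile = {
    "handle": "nightowl42",
    "email": "jane@example.com",
    "birthday": "1980-03-14",
    "city": "Austin",
}
reused_details = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "birthday": "1980-03-14",
    "city": "Austin",
}
varied_details = {
    "name": "Jane Doe",
    "email": "jd.mail@example.org",
    "birthday": "1979-01-01",
}

print(looks_linked(forum_profile, reused_details))  # True: three fields match
print(looks_linked(forum_profile, varied_details))  # False: nothing matches
```

A real matcher would presumably be fuzzier and draw on far more signals, but the principle is the same: every detail you repeat across sites is one more thing to match on.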
There is probably very little you can do to completely protect yourself if you spend time on the Internet at all, but minimizing the amount of data these companies can scrape and use to their benefit is a good first step.