Data Mining like NSA/DHS (little guys can play)

What about data mining on a budget? How would you like to know CLAPPER’s every movement? Or photos of Obama taking a dump in the toilet? Why not turn it around? They say spying harms no one, so how about every free person starts spying on government officials to see how they like it?

Consider relying on one or more GPUs. A CPU is designed to be a multitasker that can quickly switch between tasks, whereas a graphics processing unit (GPU) is designed to repeat the same calculation across many data elements in parallel, which gives large increases in performance on that kind of work. The setups in the papers listed below, while far faster than their CPU baselines, did not use the latest designs or graphics cards, so they could have run faster still.

The GPU is changing the face of large-scale data mining by significantly speeding up data mining algorithms. For example, a GPU-accelerated version of the K-Means clustering algorithm was found to be 200x-400x faster than the popular benchmark program MineBench running on a single-core CPU, and 6x-12x faster than a highly optimised CPU-only version running on an 8-core CPU workstation.

These GPU-accelerated performance results also hold for large data sets. For example, in 2009, on a data set with 1 billion 2-dimensional data points and 1,000 clusters, the GPU-accelerated K-Means algorithm took 26 minutes (using a GTX 280 GPU with 240 cores), whilst the CPU-only version running MineBench on a single-core CPU workstation took close to 6 days (see the research paper “Clustering Billions of Data Points using GPUs” by Ren Wu and Bin Zhang, HP Laboratories). Substantial additional speed-ups would be expected if the tests were run today on the latest Fermi GPUs with 480 cores and 1 TFLOPS performance.
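To see why K-Means maps so well onto a GPU, here is a minimal CPU-only sketch in plain Python. It is not the HP Labs implementation; the function names and the tiny data set are made up for illustration. The point to notice is the assignment step: every point picks its nearest centroid independently of every other point, which is exactly the kind of repetitive work a GPU can spread over hundreds of cores.

```python
def assign(points, centroids):
    """Assignment step: each point independently picks its nearest centroid.

    This loop is the data-parallel part; on a GPU each point could be
    handled by its own thread.
    """
    labels = []
    for x, y in points:
        dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new_centroids.append(old)  # keep an empty cluster where it was
            continue
        mean_x = sum(p[0] for p in members) / len(members)
        mean_y = sum(p[1] for p in members) / len(members)
        new_centroids.append((mean_x, mean_y))
    return new_centroids


def kmeans(points, centroids, max_iters=100):
    """Alternate the two steps until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iters):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(pts, [(0, 0), (10, 10)]))
```

With a billion points the assignment step dominates the run time, so speeding up that one loop is where most of the gain comes from.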

Over the last two years, hundreds of research papers have been published reporting substantial GPU speed-ups in data mining. Below are a further 7 data mining algorithms where substantial GPU acceleration has been achieved, in the hope that they will stimulate your interest in using GPUs to accelerate your own data mining projects:

Hidden Markov Models (HMM) have many data mining applications, such as financial economics, computational biology, financial time series modelling (with its non-stationarity and non-linearity), and analysing network intrusion logs. Using parallel HMM algorithms designed for the GPU, researchers (see “cuHMM: a CUDA Implementation of Hidden Markov Model Training and Classification” by Chuan Liu, May 2009) were able to achieve a speed-up of up to 800x on a GPU compared with a single-core CPU workstation.
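As a rough illustration of what gets parallelised, here is the classic forward algorithm in plain Python. This is a textbook sketch, not code from cuHMM; the parameter names (`start`, `trans`, `emit`) and the two-state model in the usage example are assumptions made up for the example. At each time step, the new value for every state can be computed independently of the other states, so a GPU can work on all states at once.

```python
def forward(obs, start, trans, emit):
    """Return the probability of a non-empty observation sequence.

    start[i]    : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability of state i emitting symbol o
    """
    n = len(start)
    # Initialise with the first observation.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        # Each entry j depends only on the previous alpha vector,
        # so all n entries could be computed in parallel.
        alpha = [
            emit[j][o] * sum(alpha[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
    return sum(alpha)


if __name__ == "__main__":
    start = [0.6, 0.4]
    trans = [[0.7, 0.3], [0.4, 0.6]]
    emit = [[0.5, 0.5], [0.1, 0.9]]
    print(forward([0, 1], start, trans, emit))
```

Real implementations work in log space or rescale `alpha` at every step to avoid underflow on long sequences; that is left out here to keep the sketch short.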

Sorting is a very important part of many data mining applications. Last month Duane Merrill and Andrew Grimshaw (from the University of Virginia) reported a very fast implementation of the radix sorting method that exceeded an average sort rate of 1G keys/sec on the GTX 480 (NVidia Fermi GPU).
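For readers who have not met radix sort, the following plain-Python sketch shows the least-significant-digit variant. It is only an illustration of the method, not the Merrill and Grimshaw implementation, and the default parameters (32-bit keys, 8 bits per pass) are assumptions. Each pass has three phases (histogram, prefix sum, scatter), and GPU versions parallelise those phases.

```python
def radix_sort(keys, key_bits=32, radix_bits=8):
    """Sort non-negative integers below 2**key_bits, lowest digit first."""
    buckets = 1 << radix_bits
    mask = buckets - 1
    for shift in range(0, key_bits, radix_bits):
        # 1. Histogram: count how many keys fall into each digit bucket.
        counts = [0] * buckets
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # 2. Exclusive prefix sum: first output slot for each bucket.
        offsets = [0] * buckets
        total = 0
        for d in range(buckets):
            offsets[d] = total
            total += counts[d]
        # 3. Stable scatter: copy keys to their slots in input order.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys


if __name__ == "__main__":
    print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```

No key is ever compared with another key, which is why the amount of work grows linearly with the number of keys for a fixed key width.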

Density-based Clustering is an important paradigm in clustering since it is typically robust to noise and outliers and very good at finding clusters of arbitrary shape in metric and vector spaces. Tests have shown GPU speed-ups ranging from 3.5x for 30k points to almost 15x for 2 million data points, with a speed-up of at least 10x on data sets of more than 250k points. (See “Density-based Clustering using Graphics Processors” by Christian Bohm et al.)
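To make the idea concrete, here is a small sketch of DBSCAN, the best-known density-based method, in plain Python. It is used here purely as an illustration and is not claimed to be the algorithm from the paper above; the names `eps` and `min_pts` are the usual DBSCAN parameters. The expensive part is `region_query`, which compares one point against all the others, and that is the work a GPU can do in bulk.

```python
NOISE = -1


def region_query(points, i, eps):
    """Return the indices of all points within eps of point i (brute force)."""
    xi, yi = points[i]
    return [
        j for j, (x, y) in enumerate(points)
        if (x - xi) ** 2 + (y - yi) ** 2 <= eps * eps
    ]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster number, or NOISE for outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # core point: keep growing
        cluster += 1
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
    print(dbscan(pts, eps=1.5, min_pts=3))
```

The isolated point at (50, 50) ends up labelled as noise instead of being forced into a cluster, which is the robustness to outliers mentioned above.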

Similarity Join is an important building block for similarity search and data mining algorithms. Researchers used a special algorithm called index-supported similarity join on the GPU and outperformed the CPU by a factor of 15.9x on 180 Mbytes of data. (See “Index-supported Similarity Join on Graphics Processors” by Christian Bohm et al.)
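A similarity join simply returns every pair of items that lie within some distance of each other. The sketch below does this for 2-D points in plain Python, using a uniform grid as a stand-in index so that each point is only compared with points in neighbouring cells. The grid is an assumption chosen for brevity; it is not the index structure used in the paper.

```python
from collections import defaultdict


def similarity_join(points, eps):
    """Return all index pairs (i, j), i < j, whose points lie within eps."""
    # Build the index: bucket every point into a grid cell of side eps.
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(int(x // eps), int(y // eps))].append(idx)

    pairs = set()
    for (cx, cy), members in grid.items():
        # Two points within eps must sit in the same or an adjacent cell.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j:
                            xi, yi = points[i]
                            xj, yj = points[j]
                            if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps * eps:
                                pairs.add((i, j))
    return sorted(pairs)


if __name__ == "__main__":
    pts = [(0, 0), (0.5, 0), (3, 3), (3.2, 3.1), (10, 10)]
    print(similarity_join(pts, eps=1.0))
```

Without the index, every point has to be compared with every other point; with it, the candidate set per point shrinks to at most nine cells, and the work for each cell is independent of the others.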

Bayesian Mixture Models have applications in many areas; of particular interest is the Bayesian analysis of structured massive multivariate mixtures with large data sets. Recent research work (see “Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures” by Marc Suchard et al.) has demonstrated that an older-generation GPU (GeForce GTX 285 with 240 cores) was able to achieve a 120x speed-up over a quad-core CPU version.

Support Vector Machines (SVM) have many diverse data mining uses, including classification and regression analysis. Training SVMs and using them for classification remains computationally intensive. The GPU version of an SVM algorithm was found to be 43x-104x faster than the CPU version for building classification models and 112x-212x faster than the CPU version for building regression models. See “GPU Accelerated Support Vector Machines for Mining High-Throughput Screening Data” by Quan Liao, Jibo Wang, et al.

Kernel Machines. Algorithms based on kernel methods play a central part in data mining, including modern machine learning and non-parametric statistics. Central to these algorithms are a number of linear operations on matrices of kernel functions, which take the training and testing data as arguments. Recent work (see “GPUML: Graphical processors for speeding up kernel machines” by Balaji Srinivasan et al. 2009) involves transforming these kernel machines into parallel kernel algorithms on a GPU. Here are two examples where considerable speed-ups were achieved: (1) To estimate the densities of 10,000 data points on 10,000 samples, the CPU implementation took 16 seconds whilst the GPU implementation took 13 ms, a speed-up of roughly 1,230x. (2) In a Gaussian process regression on 8-dimensional data, the GPU took 2 seconds to make predictions whilst the CPU version took hours to make the same predictions, again a significant speed-up over the CPU version.
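The density estimation in the first example above boils down to evaluating a kernel function for every (query point, sample) pair and summing the results. A minimal one-dimensional sketch in plain Python, using a Gaussian kernel and a hand-picked bandwidth (both assumptions, as the paper’s exact setup is not reproduced here), looks like this:

```python
import math


def gaussian_kde(queries, samples, bandwidth):
    """Estimate the density at each query point from the samples.

    Every (query, sample) kernel evaluation is independent of the others,
    so a GPU can compute the whole queries-by-samples grid in parallel.
    """
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((q - s) / bandwidth) ** 2) for s in samples)
        for q in queries
    ]


if __name__ == "__main__":
    print(gaussian_kde([0.0, 0.5, 1.0], [0.0, 1.0], bandwidth=1.0))
```

For 10,000 query points and 10,000 samples that is 100 million kernel evaluations, all of which can run at the same time, which is why this workload shows such a large GPU speed-up.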

If you want to use GPUs but do not want to get your hands “dirty” writing CUDA C/C++ code (or using other language bindings such as Python, Java, .NET, Fortran, Perl, or Lua), then consider using the MATLAB Parallel Computing Toolbox. This is a powerful solution for those who know MATLAB. Alternatively, R now has GPU plugins. A subsequent post will cover using MATLAB and R for GPU-accelerated data mining.


4 thoughts on “Data Mining like NSA/DHS (little guys can play)”

  1. Computer programming for PRISM seems fairly simple. A knowledgeable twelve-year-old girl could probably write most of the programs. The program that extracts data from Google and the email data would require some experience. We would let the CIA send the badass phone numbers and internet protocol addresses to her. We could let a fourteen-year-old boy print the reports on Monday morning for the CIA. We would need a huge number of computers with a huge amount of disc storage. The computer programs consist of a large amount of sorting and comparing, which is very simple programming.


    A knowledgeable 12-year-old with a PhD in machine learning from Stanford University, maybe, and those have all been hired to work at Palantir next to Stanford, just like the CIA groomed young GOOGLE at Stanford back in the 1990s.

    Last month I wrote an article about how they correlate this data. It’s largely machine learning, SVM (support vector machine); most of the math is hardly 5+ years old, and Israel leads the world in applying SVM to SPYING.

    No, a 12-year-old girl can’t do this. The problem is you have a million constraints and a quadrillion pieces of data, and that takes vector algebra light years beyond Einstein. But the end result is graphics on anything, like ‘computer, graph the location of all tea party folk in the USA now’, and it’s done: on the screen above, in red, is where all the tea party folk are. If you see a hot spot, a DHS fusion center is notified.

    This link will teach you about the math. The trouble is it takes about 10 years to learn this math, so the average developer is over 20. But sure, once it’s operational you can put a monkey in front of a PALANTIR system (or a high school dropout).

  2. This is why it’s critical to think about ‘data mining for the little guy’. I have written many articles, but you need to understand that all these ‘toilet cams’ are online.

    The NSA is simply doing what Facebook and Google built for the corporations, and they’re data mining with a different angle.

    Given that data mining by the corporations is on the same legal level as the NSA/CIA, given that all the algorithms are public domain, and given that high-speed computers are now available for less than $1000 (cuda, nvidia, amdX),

    Now it’s possible for any little guy with the right direction to offer 24/7 SPYING on all politicians. I see it coming, perhaps as a service or a distributed wiki, with client apps consolidating the data.

    Given that FEINSTEIN & McCAIN are OK with 24/7 information awareness, they should be OK with we the people watching Feinstein in the toilet.

    This is coming, and there is nothing they can do. They made spying legal for the corporations first, then sold it to the local cops, then the NSA decided to play catch-up, and now the little guy can spy on the spies…

    Feinstein of course will call for little guy spying to be illegal, but the cat is out of the bag, and ALL this stuff is public if you know how to find it.

  3. PRISM and related programs may harvest metadata of every phone call, every email, every Internet search, every Facebook post — everything — and use algorithmic filtering to find suspicious communication. Once they’ve found it, they can get a warrant to listen to the actual phone calls and read the actual email to find clues that enable authorities to stop terrorist attacks before they happen. (You know, Minority Report-style precrime.)

    Metadata is not the content of the phone call or email, but the information about them: Who contacted whom, when, from where and for how long.

    PRISM inspires shock and awe. But if you set aside the shock part — the privacy and constitutional implications — you realize the awe component is worth exploring.

    The PRISM approach is this: Cast the widest possible information net, then use machine intelligence to serve up just the needles without the haystack.

    PRISM works. It gets government snoops what they’re looking for. And if it works for the NSA, it can work for you, too.

    In fact, the ideas behind PRISM are built into a wide variety of tools available to everybody.

    So here’s how to run your own private PRISM program:

    1. Capture massive amounts of data

    One of the NSA’s goals is to record the metadata on every phone call and email.

    Obviously, no human personally reads all that data. But it’s copied and stored anyway for searching later.

    You can take the same approach. One easy way is to use integrated Google services together.

    Google now offers 15 GB of free storage that can be divided any way you like between Gmail, Google Drive and Google+ photos. And they’ll give you more if you pay for it.

    Google also offers an Alerts service that searches the Internet and mails you the results. Most people set up only the number of Alerts that they can read. But that’s not the NSA way.

    The PRISM approach would be to harvest far more Google Alerts than any human could possibly process, then use Gmail filters to automatically skip the inbox and send them straight to a specially created folder within Gmail. You can set up new Alerts every day each time you think of an area of interest. These can include people you know, companies to watch, ideas to keep up with.

    Alerts won’t send you the data (the story), but the metadata (information about the story, plus the link). One advantage of this approach is that if a site is deleted, making it vanish also from Google Search, you’ll still have a record of it with enough metadata to pursue leads.

    Note that Google also offers Google Scholar Alerts, which works like regular alerts but that searches academic books, papers and other resources. This is one of the great underappreciated services on the Internet.

    You can also spy on yourself NSA style by capturing the metadata on your phone calls and chats. (Of course, the email is already there.)

    The trick is to use Google Voice, and turn on the features that save your information to email. (Note that Voice will send your data to any email address, not just a Gmail one.) You’ll find the appropriate checkboxes under the Voicemail & Text tab of Google Voice Settings.

    This will send metadata on all of your calls, plus full data on all your SMS chats, transcripts of your recorded calls and voicemails and even the sound recording of your voicemails for searching later.

    Note that Google’s new Hangouts feature, which is accessible in Gmail, Google+ and in the dedicated Hangouts mobile apps, will send the full text of all your chats plus metadata on your video calls to the Gmail address associated with your Google+ account.

    You can also use various tools like IFTTT or Zapier to automatically drop all content or metadata from any RSS feed into Google Drive, or alternatives like Evernote for searching later on.

    Remember: Do it the NSA way and go nuts with this, dropping dozens, or even hundreds of items per day into your searchable storage. Don’t worry about having too much data. Have faith in existing and future search tools to later find what you’re looking for.

    Beyond the automated harvesting of data, don’t forget the manual approach, either. Capture every document that might someday be relevant and dump it into a special folder in Google Drive by using a browser extension like the Save to Google Drive plug-in. (Chrome has other extensions and so do other browsers.) You can do similar one-click saving using Evernote Web Clipper.

    Once all this data and metadata is pouring into Gmail and Drive, you can simply use Google’s search features to find what you’re looking for.

    The key to great NSA-style data harvesting, by the way, is to constantly tweak your code. Keep adding, deleting and modifying your Google Alerts and RSS feeds to make sure they deliver the kind of data you want.

    2. Use algorithmic filtering

    Algorithmic “noise filters” are popping up everywhere these days, especially on social networks and social media services where users could be overwhelmed by too much information.

    But thinking like the NSA, we can use these filters to cast a massively wide information net, then let the filters weed out duplicate and irrelevant information for us. (Note that I got this tip from a conversation with blogger Robert Scoble this week.)

    The idea is to set up a special-purpose Twitter feed for information harvesting, then use it to follow vastly more content sources than any human could possibly keep up with.

    Then, read that feed using Flipboard, Prismatic or some other site that filters content for you and that supports Twitter. (Note that these services also support Facebook and Google Reader, but Google will discontinue Reader soon. Twitter is probably your best bet.)

    One thing these filters do well is eliminate content duplicates. Instead of getting 500 stories about the name of Kanye and Kim’s baby, you’ll get just one story — probably the best or most popular one — and get it over with.

    Another way to think about the power of algorithmic de-duping is that normally you might not follow a news source from which only one story in 100 is unique or exclusive. But because duplicate stories are filtered out, you get only the one unique story from that source and not the 99 also-ran stories.

    This elimination of duplicates frees you to follow news and content sources promiscuously, casting an ultra-wide net without fear of overloading yourself with redundant content.

    3. Don’t forget the new photograph recognition tech

    One of the amazing spy tools at the disposal of the NSA is the ability to process photographs for face, object and location information.

    These tools are at your disposal, too.

    Facebook’s new Graph Search feature lets you quickly experiment with finding photos by trying different queries. For example, if you search for “Pictures taken by people who work at …” followed by a company, you’ll get what you asked for. (This is one way to spy on a competitor, for example.)

    Google’s picture searching takes it even further, enabling you to search not only for tags, keywords, associated text and location, but also content categorization. Google can actually recognize objects, landmarks and other stuff, even if the person who posted it added no such context.

    For example, if you search Google+ for something like Sydney Opera House, you’ll get a massive trove of pictures of the building, many of which are not accompanied by any mention of the words Sydney, Opera or House. Google actually recognizes the building using machine intelligence.

    The same goes for categories of things. You can search for the word “car,” which is not a specific thing but a type or category of thing. Google still gives you cars, whether they’re tagged or not.

    There’s one ironic caveat to using the NSA’s methods for wide-scale information harvesting and algorithmic filtering, which is that the NSA may theoretically know everything you’re doing.

    The NSA’s domestic surveillance programs are controversial and possibly unconstitutional. But let’s face it: They work.

    And the NSA’s methods can work for you, too.

    This article, How to run your own NSA spy program, was originally published at


    Well, the technicals are missing, but this has some good information. It’s probably possible to do even more interesting things with the information in time.

    All of it can be done: sound, photos, and lip reading like the ‘HAL’ computer in the movie 2001.

    Another spy correlation could be stock market and spying, as GREED&FEAR always feed the markets, and what better way to measure and/or correlate the markets than FEAR metrics based on spying.
