Data Mining in Homeland Security
Essay by review • February 11, 2011 • Research Paper • 4,487 Words (18 Pages) • 1,992 Views
Abstract
Data mining is an analytical process that involves searching through vast amounts of data to spot useful but previously undiscovered patterns. The process typically involves three major steps: exploration, model building and validation, and deployment.
Data mining is used in numerous applications, particularly business-related endeavors such as market segmentation, customer churn analysis, fraud detection, direct marketing, interactive marketing, market basket analysis and trend analysis. However, since the 1993 World Trade Center bombing and the terrorist attacks of September 11, 2001, data mining has increasingly been used in homeland security efforts.
Two of the earlier homeland security programs were the Total Information Awareness Program (TIA) and the Computer-Assisted Passenger Prescreening System (CAPPS II). Privacy and other concerns led to the eventual demise of these programs.
In addition to efforts by the federal government, state programs are also being implemented. The Texas Fusion Center is a prime example of state agencies mining data in an effort to thwart attacks against the public.
Data mining is not difficult to implement, as an example of detecting potential subversives from Amazon.com wishlists illustrates.
The primary drawbacks of data mining relate to privacy. Prime examples are false positives, in which individuals are wrongly identified as "terrorists," and inadequate government control over data.
In conclusion, data mining can be enormously beneficial in homeland security efforts; however, until privacy and other concerns are adequately addressed, it will be difficult for the government to win approval from its citizens for many programs.
Introduction
This technical paper introduces the reader to the analytical process known as data mining and its growing application in homeland security endeavors. Some of the more popular techniques and applications are briefly addressed before the paper turns to data mining in homeland security and related anti-terrorism initiatives.
The paper then briefly discusses the growing concerns about the negative aspects of the process, chiefly privacy.
Finally, some overall conclusions and prospects for the future are touched upon.
Data Mining Overview--Definition, Techniques, & Applications
Definition and Techniques
Data mining, sometimes also referred to as Knowledge Discovery in Databases (KDD), is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables. Once a relationship has been discovered, the findings are typically validated by applying the detected patterns to new sets of data. The process usually involves large amounts of data, and the data is typically business or market related. The ultimate goal of data mining is prediction: predictive data mining is the most common type and the one with the most direct business applications.
Depending on the source, the steps in the data mining process are described slightly differently. The common theme, however, is a three-stage process, as described by StatSoft: "(1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).
Stage 1: Exploration. This stage usually starts with data preparation which may involve cleaning data, data transformations, selecting subsets of records and - in case of data sets with large numbers of variables ("fields") - performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of the analytic problem, this first stage of the process of data mining may involve anywhere between a simple choice of straightforward predictors for a regression model, to elaborate exploratory analyses using a wide variety of graphical and statistical methods in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.
Stage 2: Model building and validation. This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal - many of which are based on so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques - which are often considered the core of predictive data mining - include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
Stage 3: Deployment. That final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome."
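The three stages quoted above can be illustrated with a small sketch. It is not taken from StatSoft or from any program discussed in this paper; it is a minimal illustration under stated assumptions. The data is synthetic, and every name in it (make_records, select_feature, fit_stump, fit_bagged and so on) is hypothetical. The "competitive evaluation of models" is reduced to comparing three simple candidates on a holdout sample, one of which uses bagging with majority voting.

```python
import random
from statistics import mean

def make_records(n, rng):
    """Synthetic (hypothetical) data: feature 0 drives the label, feature 1 is noise."""
    rows = []
    for _ in range(n):
        signal, noise = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append(([signal, noise], 1 if signal > 0 else 0))
    return rows

# Stage 1: exploration. Clean the data and select the most relevant variable.
def clean(rows):
    """Drop records that contain a missing value."""
    return [(x, y) for x, y in rows if None not in x]

def class_gap(rows, j):
    """Absolute difference between the two class means of feature j."""
    pos = [x[j] for x, y in rows if y == 1]
    neg = [x[j] for x, y in rows if y == 0]
    return abs(mean(pos) - mean(neg))

def select_feature(rows):
    """Preliminary feature selection: keep the feature with the largest class gap."""
    return max(range(len(rows[0][0])), key=lambda j: class_gap(rows, j))

# Stage 2: model building and validation. Fit candidates, compare on a holdout.
def fit_majority(rows):
    """Baseline: always predict the most common label."""
    majority = 1 if sum(y for _, y in rows) * 2 >= len(rows) else 0
    return lambda x: majority

def fit_stump(rows, j):
    """One-variable threshold model, split at the midpoint of the class means."""
    pos = mean(x[j] for x, y in rows if y == 1)
    neg = mean(x[j] for x, y in rows if y == 0)
    threshold, sign = (pos + neg) / 2, (1 if pos >= neg else -1)
    return lambda x: 1 if sign * (x[j] - threshold) > 0 else 0

def fit_bagged(rows, j, rng, n_models=15):
    """Bagging: fit stumps on bootstrap resamples and combine them by majority vote."""
    stumps = [fit_stump(rng.choices(rows, k=len(rows)), j) for _ in range(n_models)]
    return lambda x: 1 if sum(s(x) for s in stumps) * 2 > n_models else 0

def accuracy(model, rows):
    """Share of records whose label the model predicts correctly."""
    return mean(1 if model(x) == y else 0 for x, y in rows)

rng = random.Random(0)
data = make_records(400, rng)
data[3] = ([None, 0.0], 1)            # inject one dirty record for the cleaning step
data = clean(data)
train, holdout = data[:300], data[300:]

best_j = select_feature(train)
candidates = {
    "majority": fit_majority(train),
    "stump": fit_stump(train, best_j),
    "bagged": fit_bagged(train, best_j, rng),
}
scores = {name: accuracy(model, holdout) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

# Stage 3: deployment. Apply the selected model to new, unlabeled records.
new_records = [[1.5, -0.2], [-2.0, 0.7]]
predictions = [candidates[best_name](x) for x in new_records]
print(best_name, scores, predictions)
```

The holdout comparison is the important design choice here: each candidate is scored on records it was not fitted on, which is what "producing stable results across samples" asks for. A real project would more likely rely on an established library than on hand-written models like these.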
Applications
Data mining software allows users to analyze large databases to solve business decision problems. Data mining is, in some ways, an extension of statistics, with a few artificial intelligence and machine learning twists thrown in. Like statistics, data mining is not a business solution; it is just a technology. For example, consider a catalog retailer who needs to decide who should receive information about a new product. The information that the data mining process operates on is contained in a historical database of previous interactions with customers, along with the features associated with those customers, such as age, zip code, and their responses. The data mining software would use this historical information to build a model of customer behavior that could be used to predict which customers are likely to respond to the new product. By using this information, a marketing manager
...
...