Hi all, this blog is an English archive of my PhD experience at Imperial College London, mainly logging my research and working process, as well as some visual records.
Monday, 10 December 2007
Thursday, 29 November 2007
Dummy Variables
Why use dummies?
Regression analysis is used with numerical variables. Results only have a valid interpretation if it makes sense to assume that a value of 2 on some variable really does mean having twice as much of something as a 1, and a 50 means 50 times as much as a 1.
However, social scientists often need to work with categorical variables in which the different values have no real numerical relationship with each other. Examples include variables for race, political affiliation, or marital status. If you have a variable for political affiliation with possible responses including Democrat, Independent, and Republican, it obviously doesn't make sense to assign values of 1 - 3 and interpret that as meaning that a Republican is somehow three times as politically affiliated as a Democrat.
The solution is to use dummy variables - variables that take only two values, zero and one. It does make sense to create a variable called "Republican" and interpret it as meaning that someone assigned a 1 on this variable is Republican and someone assigned a 0 is not.
Nominal variables with multiple levels
If you have a nominal variable that has more than two levels, you need to create multiple dummy variables to "take the place of" the original nominal variable. For example, imagine that you wanted to predict depression from year in school: freshman, sophomore, junior, or senior. Obviously, "year in school" has more than two levels.
What you need to do is to recode "year in school" into a set of dummy variables, each of which has two levels. The first step in this process is to decide the number of dummy variables. This is easy; it's simply k-1, where k is the number of levels of the original variable.
You could also create dummy variables for all levels in the original variable, and simply drop one from each analysis.
In this instance, we would need to create 4-1=3 dummy variables. In order to create these variables, we are going to take 3 of the levels of "year of school", and create a variable corresponding to each level, which will have the value of yes or no (i.e., 1 or 0). In this instance, we can create a variable called "sophomore," "junior," and "senior." Each instance of "year of school" would then be recoded into a value for "sophomore," "junior," and "senior." If a person were a junior, then "sophomore" would be equal to 0, "junior" would be equal to 1, and "senior" would be equal to 0.
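A minimal Python sketch of this recoding (the data frame and values are hypothetical, with "freshman" as the omitted reference level):

```python
# Minimal sketch (hypothetical data): recode "year in school" into k-1 = 3 dummies.
import pandas as pd

df = pd.DataFrame({
    "year": ["freshman", "sophomore", "junior", "senior", "junior"],
    "depression": [10, 12, 15, 11, 14],
})

# Fix the category order so that "freshman" is the first (and therefore dropped) level.
df["year"] = pd.Categorical(
    df["year"], categories=["freshman", "sophomore", "junior", "senior"])

dummies = pd.get_dummies(df["year"], drop_first=True)  # columns: sophomore, junior, senior
data = pd.concat([df["depression"], dummies], axis=1)
print(data)  # a junior gets sophomore=0, junior=1, senior=0
```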
Interpreting results
The decision as to which level is not coded is often arbitrary. The level which is not coded is the category to which all other categories will be compared. As such, often the biggest group will be the not-coded category. For example, often "Caucasian" will be the not-coded group if that is the race of the majority of participants in the sample. In that case, if you have a variable called "Asian", the coefficient on the "Asian" variable in your regression will show the effect that being Asian rather than Caucasian has on your dependent variable.
In our example, "freshman" was not coded so that we could determine if being a sophomore, junior, or senior predicts a different depressive level than being a freshman. Consequently, if the variable, "junior" was significant in our regression, with a positive beta coefficient, this would mean that juniors are significantly more depressed than freshman. Alternatively, we could have decided to not code "senior," if we thought that being a senior is qualitatively different from being of another year.Levels of Measurement
The level of measurement refers to the relationship among the values that are assigned to the attributes for a variable. What does that mean? Begin with the idea of the variable, in this example "party affiliation." That variable has a number of attributes. Let's assume that in this particular election context the only relevant attributes are "republican", "democrat", and "independent". For purposes of analyzing the results of this variable, we arbitrarily assign the values 1, 2 and 3 to the three attributes. The level of measurement describes the relationship among these three values. In this case, we simply are using the numbers as shorter placeholders for the lengthier text terms. We don't assume that higher values mean "more" of something and lower numbers signify "less". We don't assume that the value of 2 means that democrats are twice something that republicans are. We don't assume that republicans are in first place or have the highest priority just because they have the value of 1. In this case, we only use the values as a shorter name for the attribute. Here, we would describe the level of measurement as "nominal".
Why is Level of Measurement Important?
First, knowing the level of measurement helps you decide how to interpret the data from that variable. When you know that a measure is nominal (like the one just described), then you know that the numerical values are just short codes for the longer names. Second, knowing the level of measurement helps you decide what statistical analysis is appropriate on the values that were assigned. If a measure is nominal, then you know that you would never average the data values or do a t-test on the data.
There are typically four levels of measurement that are defined:
- Nominal
- Ordinal
- Interval
- Ratio
In nominal measurement the numerical values just "name" the attribute uniquely. No ordering of the cases is implied. For example, jersey numbers in basketball are measures at the nominal level. A player with number 30 is not more of anything than a player with number 15, and is certainly not twice whatever number 15 is.
In ordinal measurement the attributes can be rank-ordered. Here, distances between attributes do not have any meaning. For example, on a survey you might code Educational Attainment as 0=less than H.S.; 1=some H.S.; 2=H.S. degree; 3=some college; 4=college degree; 5=post college. In this measure, higher numbers mean more education. But is the distance from 0 to 1 the same as the distance from 3 to 4? Of course not. The interval between values is not interpretable in an ordinal measure.
In interval measurement the distance between attributes does have meaning. For example, when we measure temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable. Because of this, it makes sense to compute an average of an interval variable, whereas it doesn't make sense to do so for ordinal scales. But note that in interval measurement ratios don't make any sense - 80 degrees is not twice as hot as 40 degrees (although the attribute value is twice as large).
Finally, in ratio measurement there is always an absolute zero that is meaningful. This means that you can construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable. In applied social research most "count" variables are ratio, for example, the number of clients in past six months. Why? Because you can have zero clients and because it is meaningful to say that "...we had twice as many clients in the past six months as we did in the previous six months."
It's important to recognize that there is a hierarchy implied in the level of measurement idea. At lower levels of measurement, assumptions tend to be less restrictive and data analyses tend to be less sensitive. At each level up the hierarchy, the current level includes all of the qualities of the one below it and adds something new. In general, it is desirable to have a higher level of measurement (e.g., interval or ratio) rather than a lower one (nominal or ordinal).
Wednesday, 28 November 2007
What is overfitting and how can I avoid it?
The critical issue in developing a neural network is generalization: how
well will the network make predictions for cases that are not in the
training set? NNs, like other flexible nonlinear estimation methods such as
kernel regression and smoothing splines, can suffer from either underfitting
or overfitting. A network that is not sufficiently complex can fail to
detect fully the signal in a complicated data set, leading to underfitting.
A network that is too complex may fit the noise, not just the signal,
leading to overfitting. Overfitting is especially dangerous because it can
easily lead to predictions that are far beyond the range of the training
data with many of the common types of NNs. Overfitting can also produce wild
predictions in multilayer perceptrons even with noise-free data.
For an elementary discussion of overfitting, see Smith (1996). For a more
rigorous approach, see the article by Geman, Bienenstock, and Doursat (1992)
on the bias/variance trade-off (it's not really a dilemma). We are talking
about statistical bias here: the difference between the average value of an
estimator and the correct value. Underfitting produces excessive bias in the
outputs, whereas overfitting produces excessive variance. There are
graphical examples of overfitting and underfitting in Sarle (1995, 1999).
The best way to avoid overfitting is to use lots of training data. If you
have at least 30 times as many training cases as there are weights in the
network, you are unlikely to suffer from much overfitting, although you may
get some slight overfitting no matter how large the training set is. For
noise-free data, 5 times as many training cases as weights may be
sufficient. But you can't arbitrarily reduce the number of weights for fear
of underfitting.
Given a fixed amount of training data, there are at least six approaches to
avoiding underfitting and overfitting, and hence getting good
generalization:
o Model selection
o Jittering
o Early stopping
o Weight decay
o Bayesian learning
o Combining networks
The first five approaches are based on well-understood theory. Methods for
combining networks do not have such a sound theoretical basis but are the
subject of current research. These six approaches are discussed in more
detail under subsequent questions.
The complexity of a network is related to both the number of weights and the
size of the weights. Model selection is concerned with the number of
weights, and hence the number of hidden units and layers. The more weights
there are, relative to the number of training cases, the more overfitting
amplifies noise in the targets (Moody 1992). The other approaches listed
above are concerned, directly or indirectly, with the size of the weights.
Reducing the size of the weights reduces the "effective" number of
weights--see Moody (1992) regarding weight decay and Weigend (1994)
regarding early stopping. Bartlett (1997) obtained learning-theory results
in which generalization error is related to the L_1 norm of the weights
instead of the VC dimension.
Overfitting is not confined to NNs with hidden units. Overfitting can occur
in generalized linear models (networks with no hidden units) if either or
both of the following conditions hold:
1. The number of input variables (and hence the number of weights) is large
with respect to the number of training cases. Typically you would want at
least 10 times as many training cases as input variables, but with
noise-free targets, twice as many training cases as input variables would
be more than adequate. These requirements are smaller than those stated
above for networks with hidden layers, because hidden layers are prone to
creating ill-conditioning and other pathologies.
2. The input variables are highly correlated with each other. This condition
is called "multicollinearity" in the statistical literature.
Multicollinearity can cause the weights to become extremely large because
of numerical ill-conditioning--see "How does ill-conditioning affect NN
training?"
Methods for dealing with these problems in the statistical literature
include ridge regression (similar to weight decay), partial least squares
(similar to Early stopping), and various methods with even stranger names,
such as the lasso and garotte (van Houwelingen and le Cessie, ????).
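A small Python sketch of the ridge remedy on synthetic, deliberately collinear data (the penalty strength is an arbitrary illustrative value):

```python
# Minimal sketch: ridge regression as a remedy for multicollinearity.
# Synthetic data; the penalty strength alpha is an arbitrary illustrative value.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # nearly collinear with x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)            # L2 penalty shrinks the weights

print("OLS coefficients:  ", ols.coef_)       # often unstable and large in magnitude
print("Ridge coefficients:", ridge.coef_)     # shrunk toward more reasonable values
```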
References:
Bartlett, P.L. (1997), "For valid generalization, the size of the weights
is more important than the size of the network," in Mozer, M.C., Jordan,
M.I., and Petsche, T., (eds.) Advances in Neural Information Processing
Systems 9, Cambridge, MA: The MIT Press, pp. 134-140.
Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
Generalization and Regularization in Nonlinear Learning Systems", in
Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural
Information Processing Systems 4, 847-854.
Sarle, W.S. (1995), "Stopped Training and Other Remedies for
Overfitting," Proceedings of the 27th Symposium on the Interface of
Computing Science and Statistics, 352-360,
ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large
compressed postscript file, 747K, 10 pages)
Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results,"
ftp://ftp.sas.com/pub/neural/dojo/dojo.html
Smith, M. (1996). Neural Networks for Statistical Modeling, Boston:
International Thomson Computer Press, ISBN 1-850-32842-0.
van Houwelingen,H.C., and le Cessie, S. (????), "Shrinkage and penalized
likelihood as methods to improve predictive accuracy,"
http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/shrinkage.pdf and
http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/figures.pdf
Weigend, A. (1994), "On overfitting and the effective number of hidden
units," Proceedings of the 1993 Connectionist Models Summer School,
335-342.
Copyright 1997, 1998, 1999, 2000, 2001, 2002 by Warren S. Sarle, Cary, NC,
USA. Answers provided by other authors as cited below are copyrighted by
those authors, who by submitting the answers for the FAQ give permission for
the answer to be reproduced as part of the FAQ in any of the ways specified
in part 1 of the FAQ.
Overfitting and Underfitting
Saturday, 24 November 2007
Data Mining Techniques
referenced from http://www.statsoft.com/textbook/stdatmin.html
- Data Mining
- Crucial Concepts in Data Mining
- Data Warehousing
- On-Line Analytic Processing (OLAP)
- Exploratory Data Analysis (EDA) and Data Mining Techniques
- Neural Networks
Data Mining
Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).
Stage 1: Exploration. This stage usually starts with data preparation, which may involve cleaning data, transforming data, and selecting subsets of records and (in data sets with large numbers of variables) performing some preliminary feature selection to bring the number of variables to a manageable range.
Stage 2: Model building and validation. This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal - many of which are based on so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques - which are often considered the core of predictive data mining - include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
Stage 3: Deployment. That final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.
The concept of Data Mining is becoming increasingly popular as a business information management tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business Data Mining (e.g., Classification Trees), but Data Mining is still based on the conceptual principles of statistics including the traditional Exploratory Data Analysis (EDA) and modeling and it shares with them both some components of its general approaches and specific techniques.
However, an important general difference in the focus and purpose between Data Mining and the traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented towards applications than the basic nature of the underlying phenomena. In other words, Data Mining is relatively less concerned with identifying the specific relations between the involved variables. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables are not the main goal of Data Mining. Instead, the focus is on producing a solution that can generate useful predictions. Therefore, Data Mining accepts among others a "black box" approach to data exploration or knowledge discovery and uses not only the traditional Exploratory Data Analysis (EDA) techniques, but also such techniques as Neural Networks which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based.
Data Mining is often considered to be "a blend of statistics, AI [artificial intelligence], and data base research" (Pregibon, 1997, p. 8), which until very recently was not commonly recognized as a field of interest for statisticians, and was even considered by some "a dirty word in Statistics" (Pregibon, 1997, p. 8). Due to its applied importance, however, the field emerges as a rapidly growing and major area (also in statistics) where important theoretical advances are being made (see, for example, the recent annual International Conferences on Knowledge Discovery and Data Mining, co-hosted by the American Statistical Association).
For information on Data Mining techniques, please review the summary topics included below in this chapter of the Electronic Statistics Textbook. There are numerous books that review the theory and practice of data mining; the following books offer a representative sample of recent general books on data mining, representing a variety of approaches and perspectives:
Berry, M., J., A., & Linoff, G., S., (2000). Mastering data mining. New York: Wiley.
Edelstein, H., A. (1999). Introduction to data mining and knowledge discovery (3rd ed). Potomac, MD: Two Crows Corp.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and data mining. Cambridge, MA: MIT Press.
Han, J., Kamber, M. (2000). Data mining: Concepts and Techniques. New York: Morgan-Kaufman.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning : Data mining, inference, and prediction. New York: Springer.
Pregibon, D. (1997). Data Mining. Statistical Computing and Graphics, 7, 8.
Weiss, S. M., & Indurkhya, N. (1997). Predictive data mining: A practical guide. New York: Morgan-Kaufman.
Westphal, C., Blaxton, T. (1998). Data mining solutions. New York: Wiley.
Witten, I. H., & Frank, E. (2000). Data mining. New York: Morgan-Kaufmann
Crucial Concepts in Data Mining
Bagging (Voting, Averaging)
The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining, to combine the predicted classifications (or predictions) from multiple models, or from the same type of model fitted to different learning data (e.g., bootstrap samples of the original data set).
Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers (for prediction or classification), and to derive weights for combining the predictions from those models into a single prediction or predicted classification.
A simple algorithm for boosting works like this: Start by applying some method (e.g., a tree classifier such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight. Compute the predicted classifications, then assign greater weight to those observations that were difficult to classify (where the misclassification rate was high) and lower weight to those that were easy to classify (where the misclassification rate was low). Re-apply the classifier to the re-weighted data, and continue with the next iteration.
Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an "expert" in classifying observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best prediction or classification.
Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassification costs. In that case, random sub-sampling can be applied to the learning data in the successive steps of the iterative boosting procedure, where the probability of selecting an observation into the subsample is inversely proportional to the accuracy of its classification in the previous iteration.
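A rough AdaBoost-style Python sketch of the re-weighting loop described above, on synthetic data (the base learner and number of rounds are arbitrary illustrative choices):

```python
# Minimal sketch of the re-weighting idea behind boosting, using shallow
# decision trees ("stumps") as the base classifier. Illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
y = 2 * y - 1                                   # recode labels to {-1, +1}
w = np.full(len(y), 1.0 / len(y))               # start with equal weights

classifiers, alphas = [], []
for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y]) / np.sum(w)      # weighted error rate
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
    w *= np.exp(-alpha * y * pred)              # upweight misclassified cases
    w /= w.sum()
    classifiers.append(stump)
    alphas.append(alpha)

# Deployment: weighted vote over the sequence of classifiers.
scores = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
print("training accuracy:", np.mean(np.sign(scores) == y))
```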
CRISP
See Models for Data Mining.
Data Preparation (in Data Mining)
Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying "garbage-in-garbage-out" is particularly applicable to typical data mining projects, where large data sets collected via some automatic methods (e.g., via the Web) serve as the input into the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values, impossible data combinations, and missing entries that need to be identified and dealt with before any modeling begins.
Data Reduction (for Data Mining)
The term Data Reduction in the context of data mining is usually applied to projects where the goal is to aggregate or amalgamate the information contained in large datasets into manageable (smaller) information nuggets. Data reduction methods can include simple tabulation, aggregation (computing descriptive statistics) or more sophisticated techniques like clustering, principal components analysis, etc.
See also predictive data mining, drill-down analysis.
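A small, hypothetical Python sketch of both ideas (the data frame, column names, and number of components are made up):

```python
# Minimal sketch of two data-reduction ideas: aggregation (descriptive
# statistics per group) and principal components analysis. Hypothetical data.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "sales":  [100.0, 120.0, 80.0, 90.0],
    "visits": [10, 12, 8, 9],
})

# Aggregation: collapse many records into a few summary "nuggets".
summary = df.groupby("region")[["sales", "visits"]].agg(["mean", "sum"])
print(summary)

# PCA: compress correlated numeric columns into a smaller set of components.
components = PCA(n_components=1).fit_transform(df[["sales", "visits"]])
print(components)
```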
Deployment
The concept of deployment in predictive data mining refers to the application of a model for prediction or classification to new data. After a satisfactory model or set of models has been identified (trained) for a particular application, one usually wants to deploy those models so that predictions or predicted classifications can quickly be obtained for new data.
Drill-Down Analysis
The concept of drill-down analysis applies to the area of data mining, to denote the interactive exploration of data, in particular of large databases. The process of drill-down analyses begins by considering some simple break-downs of the data by a few variables of interest (e.g., Gender, geographic region, etc.). Various statistics, tables, histograms, and other graphical summaries can be computed for each group. Next one may want to "drill down" to expose and further analyze the data "underneath" one of the categorizations, for example, to look at only one gender within a particular region; the same summaries can then be computed for that subgroup, which may suggest further break-downs by other variables, and so on, until the lowest level of the raw data is reached.
Feature Selection
One of the preliminary stages in predictive data mining, when the data set includes more variables than could be included (or would be efficient to include) in the actual model building phase (or even in initial exploratory operations), is to select predictors from a large list of candidates. For example, when data are collected via automated (computerized) methods, it is not uncommon that measurements are recorded for thousands or hundreds of thousands (or more) of predictors. The standard analytic methods for predictive data mining, such as neural network analyses, classification and regression trees, generalized linear models, or general linear models become impractical when the number of predictors exceeds a few hundred variables.
Feature selection selects a subset of predictors from a large list of candidate predictors without assuming that the relationships between the predictors and the dependent or outcome variables of interest are linear, or even monotone. Therefore, this is used as a pre-processor for predictive data mining, to select manageable sets of predictors that are likely related to the dependent (outcome) variables of interest, for further analyses with any of the other methods for regression and classification.
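A minimal Python sketch of such a screening step, using a criterion (mutual information) that does not assume linear or monotone relationships; the data are synthetic and the cut-off of ten predictors is arbitrary:

```python
# Minimal sketch: screen a large list of candidate predictors with a criterion
# that does not assume linear or monotone relationships with the outcome.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=10, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
X_reduced = selector.transform(X)                       # manageable predictor set
print("kept predictor indices:", np.sort(selector.get_support(indices=True)))
```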
Machine Learning
Machine learning, computational learning theory, and similar terms are often used in the context of Data Mining, to denote the application of generic model-fitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the "models" or techniques that are used to generate the prediction are interpretable or open to simple explanation.
Meta-Learning
The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as Stacking (Stacked Generalization).
Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications
One can apply meta-learners to the results from different meta-learners to create "meta-meta"-learners, and so on; however, in practice such exponential increase in the amount of data processing yields rapidly diminishing returns.
Models for Data Mining
In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization. In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements.
One such model, CRISP (Cross-Industry Standard Process for data mining), was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining.
Another approach - the Six Sigma methodology - is a well-structured, data-driven methodology for eliminating defects, waste, and quality problems of all kinds. It postulates a sequence of so-called DMAIC steps (Define, Measure, Analyze, Improve, Control) that grew up from the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," i.e., service industries).
Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by SAS Institute called SEMMA (Sample, Explore, Modify, Model, Assess), which focuses more on the technical activities typically involved in a data mining project.
All of these models are concerned with the process of how to integrate data mining methodology into an organization, how to "convert data into information," how to involve important stake-holders, and how to disseminate the information in a form that can easily be converted by stake-holders into resources for strategic decision making.
Some software tools for data mining are specifically designed and documented to fit into one of these specific frameworks.
The general underlying philosophy of StatSoft's STATISTICA Data Miner is to provide a flexible data mining workbench that can be integrated into any organization, industry, or organizational culture, regardless of the general data mining process-model that the organization chooses to adopt. For example, STATISTICA Data Miner can include the complete set of (specific) necessary tools for ongoing company-wide Six Sigma quality control efforts, and users can take advantage of its (still optional) DMAIC-centric user interface for industrial data mining tools. It can equally well be integrated into ongoing marketing research, CRM (Customer Relationship Management) projects, etc. that follow either the CRISP or SEMMA approach - it fits both of them perfectly well without favoring either one. Also, STATISTICA Data Miner offers all the advantages of a general data mining oriented "development kit" that includes easy-to-use tools for incorporating into your projects not only such components as custom database gateway solutions, prompted interactive queries, or proprietary algorithms, but also systems of access privileges, workgroup management, and other collaborative work tools that allow you to design large-scale, enterprise-wide data mining projects.
Predictive Data Mining
The term Predictive Data Mining is usually applied to data mining projects whose goal is to identify a statistical or neural network model or set of models that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining, to derive a (trained) model or set of models (e.g., neural networks, meta-learner) that can quickly identify transactions which have a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g., to identify clusters or segments of customers), in which case drill-down descriptive and exploratory methods would be applied. Data reduction is another possible objective for data mining (e.g., to aggregate or amalgamate the information in very large data sets into useful and manageable chunks).
SEMMA
See Models for Data Mining.
Stacked Generalization
See Stacking.
Stacking (Stacked Generalization)
The concept of stacking (short for Stacked Generalization) applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different.
Suppose your data mining project includes tree classifiers, such as C&RT or CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications
Other methods for combining the prediction from multiple models or methods (e.g., from multiple datasets used for learning) are Boosting and Bagging (Voting).
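A minimal Python sketch of stacking with three dissimilar base classifiers feeding a simple meta-learner (the particular models and settings are illustrative stand-ins):

```python
# Minimal sketch of stacking: cross-validated predictions from several base
# classifiers are combined by a meta-learner. Illustrative only.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("lda", LinearDiscriminantAnalysis()),
        ("nn", MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000)),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,                                  # base predictions come from cross-validation
)
stack.fit(X, y)
print("training accuracy:", stack.score(X, y))
```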
Text Mining
While Data Mining is typically concerned with the detection of patterns in numeric data, very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data, text is often amorphous, and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, etc. and the preparation of the text processed in that manner for further analyses with numeric data mining techniques (e.g., to determine co-occurrences of concepts, key phrases, names, addresses, product names, etc.).
Data Warehousing
StatSoft defines data warehousing as a process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes.
The most efficient data warehousing architecture will be capable of incorporating, or at least referencing, all data available in the relevant enterprise-wide information management systems.
On-Line Analytic Processing (OLAP)
The term On-Line Analytic Processing - OLAP (or Fast Analysis of Shared Multidimensional Information - FASMI) refers to technology that allows users of multidimensional databases to generate on-line descriptive or comparative summaries ("views") of data and other analytic queries.
Exploratory Data Analysis (EDA)
As opposed to traditional hypothesis testing designed to verify a priori hypotheses about relations between variables (e.g., "There is a positive correlation between the AGE of a person and his/her RISK TAKING disposition"), exploratory data analysis (EDA) is used to identify systematic relations between variables when there are no (or not complete) a priori expectations as to the nature of those relations. In a typical exploratory data analysis process, many variables are taken into account and compared, using a variety of techniques in the search for systematic patterns.
Computational exploratory data analysis methods include both simple basic statistics and more advanced, designated multivariate exploratory techniques designed to identify patterns in multivariate data sets.
Basic statistical exploratory methods. The basic statistical exploratory methods include such techniques as examining distributions of variables (e.g., to identify highly skewed or non-normal distributions, such as bi-modal patterns), reviewing large correlation matrices for coefficients that meet certain thresholds, or examining multi-way frequency tables (e.g., "slice by slice" systematically reviewing combinations of levels of control variables).
Multivariate exploratory techniques. Multivariate exploratory techniques designed specifically to identify patterns in multivariate (or univariate, such as sequences of measurements) data sets include: Cluster Analysis, Factor Analysis, Discriminant Function Analysis, Multidimensional Scaling, Log-linear Analysis, Canonical Correlation, Stepwise Linear and Nonlinear Regression, Correspondence Analysis, Time Series Analysis, and Classification Trees.
Neural Networks. Neural Networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data.
For more information, see Neural Networks; see also STATISTICA Neural Networks.
Graphical (data visualization) EDA techniques
A large selection of powerful exploratory data analytic techniques is also offered by graphical data visualization methods that can identify relations, trends, and biases "hidden" in unstructured data sets.
Brushing. Perhaps the most common and historically first widely used technique explicitly identified as graphical exploratory data analysis is brushing, an interactive method allowing one to select on-screen specific data points or subsets of data and identify their (e.g., common) characteristics, or to examine their effect on relations between relevant variables.
Other graphical EDA techniques. Other graphical exploratory analytic techniques include function fitting and plotting, data smoothing, overlaying and merging of multiple displays, categorizing data, splitting/merging of subsets of data in graphs, aggregating data in graphs, identifying and marking subsets of data that meet specific conditions, icon plots, shading, plotting confidence intervals and confidence areas (e.g., ellipses), generating tessellations, spectral planes, integrated layered compressions, and projected contours, data image reduction techniques, interactive (and continuous) rotation with animated stratification (cross-sections) of 3D displays, and selective highlighting of specific series and blocks of data.
Verification of results of EDA
The exploration of data can only serve as the first stage of data analysis and its results can be treated as tentative at best as long as they are not confirmed, e.g., crossvalidated, using a different data set (or an independent subset). If the result of the exploratory stage suggests a particular model, then its validity can be verified by applying it to a new data set and testing its fit (e.g., testing its predictive validity). Case selection conditions can be used to quickly define subsets of data (e.g., for estimation and verification), and for testing the robustness of results.
Neural Networks
(see also Neural Networks chapter)
Neural Networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data. Neural Networks is one of the Data Mining techniques.
The first step is to design a specific network architecture (that includes a specific number of "layers" each consisting of a certain number of "neurons"). The size and structure of the network needs to match the nature (e.g., the formal complexity) of the investigated phenomenon. Because the latter is obviously not known very well at this early stage, this task is not easy and often involves multiple "trials and errors." (Now, there is, however, neural network software that applies artificial intelligence techniques to aid in that tedious task and finds "the best" network architecture.)
The new network is then subjected to the process of "training." In that phase, neurons apply an iterative process to the number of inputs (variables) to adjust the weights of the network in order to optimally predict (in traditional terms one could say, find a "fit" to) the sample data on which the "training" is performed. After the phase of learning from an existing data set, the new network is ready and it can then be used to generate predictions.
The resulting "network" developed in the process of "learning" represents a pattern detected in the data. Thus, in this approach, the "network" is the functional equivalent of a model of relations between variables in the traditional model building approach. However, unlike in the traditional models, in the "network," those relations cannot be articulated in the usual terms used in statistics or methodology to describe relations between variables (such as, for example, "A is positively correlated with B but only for observations where the value of C is low and D is high"). Some neural networks can produce highly accurate predictions; they represent, however, a typical a-theoretical (one can say, "a black box") research approach. That approach is concerned only with practical considerations,
However, it should be mentioned that Neural Network techniques can also be used as a component of analyses designed to build explanatory models because Neural Networks can help explore data sets in search for relevant variables or groups of variables; the results of such explorations can then facilitate the process of model building. Moreover, now there is neural network software that uses sophisticated algorithms to search for the most relevant input variables, thus potentially contributing directly to the model building process.
One of the major advantages of neural networks is that, theoretically, they are capable of approximating any continuous function, and thus the researcher does not need to have any hypotheses about the underlying model, or even to some extent, which variables matter. An important disadvantage, however, is that the final solution depends on the initial conditions of the network, and, as stated before, it is virtually impossible to "interpret" the solution in traditional, analytic terms, such as those used to build theories that explain phenomena.
Some authors stress the fact that neural networks use, or one should say are expected to use, massively parallel computation models. For example, Haykin (1994) defines a neural network as:
"a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: (1) Knowledge is acquired by the network through a learning process, and (2) Interneuron connection strengths known as synaptic weights are used to store the knowledge." (p. 2).
However, as Ripley (1996) points out, the vast majority of contemporary neural network applications run on single-processor computers, and he argues that a large speed-up can be achieved not only by developing software that takes advantage of multiprocessor hardware, but also by designing better (more efficient) learning algorithms.
Neural networks is one of the methods used in Data Mining; see also Exploratory Data Analysis. For more information on neural networks, see Haykin (1994), Masters (1995), Ripley (1996), and Welstead (1994). For a discussion of neural networks as statistical tools, see Warner and Misra (1996). See also, STATISTICA Neural Networks.
Friday, 23 November 2007
Supervised learning
Supervised learning
From Wikipedia, the free encyclopedia
Supervised learning is a machine learning technique for creating a function from training data. The training data consist of pairs of input objects (typically vectors), and desired outputs. The output of the function can be a continuous value (called regression), or can predict a class label of the input object (called classification). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a "reasonable" way (see inductive bias). (Compare with unsupervised learning.) The parallel task in human and animal psychology is often referred to as concept learning.
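A minimal Python sketch of this setting, with made-up training pairs and an arbitrary choice of learner:

```python
# Minimal sketch of supervised learning: learn a function from
# (input, desired output) pairs, then predict for an unseen input.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[150, 50], [160, 55], [180, 80], [190, 90]]  # input objects (vectors)
y_train = ["small", "small", "large", "large"]          # desired outputs (class labels)

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(model.predict([[175, 70]]))   # generalize to a previously unseen input
```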
Sunday, 18 November 2007
Overfitting
In this post we will look at some techniques for preventing our model from becoming too powerful (overfitting). In the next, we address the related question of selecting an appropriate architecture with just the right number of trainable parameters.
Bias-Variance trade-off
Consider the two fitted functions below. The data points (circles) have all been generated from a smooth function, h(x), with some added noise. Obviously, we want to end up with a model which approximates h(x), given a specific set of data y(x) generated as:
y(x) = h(x) + ε    (1)
where ε denotes the random noise added to the smooth function h(x).
In the left hand panel we try to fit the points using a function g(x) which has too few parameters: a straight line. The model has the virtue of being simple; there are only two free parameters. However, it does not do a good job of fitting the data, and would not do well in predicting new data points. We say that the simpler model has a high bias.
The right hand panel shows a model which has been fitted using too many free parameters. It does an excellent job of fitting the data points, as the error at the data points is close to zero. However it would not do a good job of predicting h(x) for new values of x. We say that the model has a high variance. The model does not reflect the structure which we expect to be present in any data set generated by equation (1) above.
Clearly what we want is something in between: a model which is powerful enough to represent the underlying structure of the data (h(x)), but not so powerful that it faithfully models the noise associated with this particular data sample.
The bias-variance trade-off is most likely to become a problem if we have relatively few data points. In the opposite case, where we have essentially an infinite number of data points (as in continuous online learning), we are not usually in danger of overfitting the data, as the noise associated with any single data point plays a vanishingly small role in our overall fit. The following techniques therefore apply to situations in which we have a finite data set, and, typically, where we wish to train in batch mode.
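A rough numerical sketch of this picture (all values illustrative): noisy samples of a smooth h(x) are fitted with a straight line and with a high-degree polynomial, and each fit is compared against h(x) at new inputs.

```python
# Minimal sketch: under- vs over-fitting noisy data generated as y = h(x) + noise.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
h = np.sin(2 * np.pi * x)                     # the smooth underlying function h(x)
y = h + rng.normal(scale=0.2, size=x.size)    # equation (1): added noise

x_new = np.linspace(0, 1, 200)                # new inputs, to probe generalization
h_new = np.sin(2 * np.pi * x_new)
for degree in (1, 15):                        # too few vs too many free parameters
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    true_err = np.mean((np.polyval(coeffs, x_new) - h_new) ** 2)
    print(f"degree {degree:2d}: error at the data points = {train_err:.4f}, "
          f"error against h(x) elsewhere = {true_err:.4f}")
```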
Preventing overfitting
Early stopping
One of the simplest and most widely used means of avoiding overfitting is to divide the data into two sets: a training set and a validation set. We train using only the training data. Every now and then, however, we stop training and test network performance on the independent validation set. No weight updates are made during this test! As the validation data is independent of the training data, network performance on it is a good measure of generalization: as long as that performance keeps improving, training continues, and when it begins to deteriorate, training is stopped, since further training would largely mean fitting the noise in the training data.
One detail of note when using early stopping: if we wish to test the trained network on a set of independent data to measure its ability to generalize, we need a third, independent, test set. This is because we used the validation set to decide when to stop training, and thus our trained network is no longer entirely independent of the validation set. The requirements of independent training, validation and test sets means that early stopping can only be used in a data-rich situation.
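A minimal Python sketch of the procedure (the network, split sizes, and patience threshold are arbitrary illustrative choices):

```python
# Minimal sketch of early stopping with a held-out validation set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# warm_start=True with max_iter=1 lets us train one pass at a time.
net = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1, warm_start=True, random_state=0)

best_score, best_epoch, patience = -np.inf, 0, 10
for epoch in range(200):
    net.fit(X_train, y_train)           # train only on the training data
    score = net.score(X_val, y_val)     # evaluation only; no weight updates here
    if score > best_score:
        best_score, best_epoch = score, epoch
    elif epoch - best_epoch >= patience:
        break                           # validation performance has stopped improving
print(f"stopped at epoch {epoch}; best validation accuracy {best_score:.3f}")
```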
Weight decay
The over-fitted function above shows a high degree of curvature, while the linear function is maximally smooth. Regularization refers to a set of techniques which help to ensure that the function computed by the network is no more curved than necessary. This is achieved by adding a penalty to the error function, giving:
E_new = E + νΩ    (2)
where E is the usual error measure (e.g., the sum of squared errors), Ω is the penalty term, and the coefficient ν controls how strongly the penalty is weighted.
One possible form of the regularizer comes from the informal observation that an over-fitted mapping with regions of large curvature requires large weights. We thus penalize large weights by choosing
Ω = (1/2) Σ_i w_i²    (3)
Using this modified error function, the weights are now updated as
Δw_i = −η ∂E/∂w_i − ην w_i    (4)
where the right hand term causes the weight to decrease as a function of its own size. In the absence of any input, all weights will tend to decrease exponentially, hence the term "weight decay".
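For a single linear unit trained by gradient descent on squared error, the update in equation (4) looks roughly like this (synthetic data; η and ν are arbitrary illustrative values):

```python
# Minimal sketch of the weight-decay update in equation (4).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

w = np.zeros(5)
eta, nu = 0.01, 0.1                          # learning rate and decay coefficient
for _ in range(500):
    grad = -2 * X.T @ (y - X @ w) / len(y)   # dE/dw for squared error
    w = w - eta * grad - eta * nu * w        # gradient step plus weight decay
print(w)                                     # weights shrunk toward zero
```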
Training with noise
A final method which can often help to reduce the importance of the specific noise characteristics of the training sample is to add a small amount of artificial noise (jitter) to the inputs during training. At first, this may seem a rather odd thing to do: to deliberately corrupt one's own data. However, perhaps you can see that it will now be difficult for the network to approximate any specific data point too closely. In practice, training with added noise has indeed been shown to reduce overfitting and thus improve generalization in some situations.
If we have a finite training set, another way of introducing noise into the training process is to use online training, that is, updating weights after every pattern presentation, and to randomly reorder the patterns at the end of each training epoch. In this manner, each weight update is based on a noisy estimate of the true gradient.
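A minimal Python sketch of training with jittered inputs (the network and noise level are arbitrary illustrative choices):

```python
# Minimal sketch of training with added input noise (jitter): each epoch sees
# a slightly perturbed copy of the inputs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
sigma = 0.1                                  # noise level (illustrative)

net = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1, warm_start=True,
                    random_state=0)
for epoch in range(100):
    X_jittered = X + rng.normal(scale=sigma, size=X.shape)  # corrupt the inputs
    net.fit(X_jittered, y)   # the network never sees exactly the same point twice
print("training accuracy on clean data:", net.score(X, y))
```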
Tuesday, 23 October 2007
Algebra - Imperial College
M2P2 Algebra
Lecturer: Prof M W Liebeck
Recommended books
R. Allenby, Rings, Fields and Groups, Arnold
I.N. Herstein, Topics in Algebra
For linear algebra, here are a few good ones:
J. Fraleigh and R. Beauregard, Linear Algebra, Addison Wesley
M2S1 PROBABILITY AND STATISTICS II - Imperial College
G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes (2nd Edition/3rd Edition).
[Very useful for probability material of the course].
W. Feller, An Introduction to Probability Theory and Its Applications. Vols 1 and 2. [A classical
reference text].
G. Casella and R.L. Berger, Statistical Inference. [A very useful text, which covers statistical ideas as
well as probability material].
There are many such introductory texts in the Mathematics library. Other books relating to specific
parts of the course will be recommended when relevant.
Also, there will be a course WWW page accessible from http://stats.ma.ic.ac.uk/ayoung. It will
contain links to course handouts, exercises and solutions.
Professor A. Young (room 529, email alastair.young@imperial.ac.uk)
Friday, 19 October 2007
Confusion Matrix
The entries in the confusion matrix have the following meaning in the context of our study:
- a is the number of correct predictions that an instance is negative,
- b is the number of incorrect predictions that an instance is positive,
- c is the number of incorrect predictions that an instance is negative, and
- d is the number of correct predictions that an instance is positive.
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | a | b |
| Actual Positive | c | d |
Several standard terms have been defined for the 2 class matrix:
- The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation: AC = (a + d) / (a + b + c + d).
- The recall or hit rate or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation: TP = d / (c + d).
- The false alarm rate or false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive, as calculated using the equation: FP = b / (a + b).
- The true negative rate (TN) is defined as the proportion of negative cases that were classified correctly, as calculated using the equation: TN = a / (a + b).
- The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative, as calculated using the equation: FN = c / (c + d).
- Finally, precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation: P = d / (b + d).
The accuracy determined using equation 1 may not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases (Kubat et al., 1998). Suppose there are 1000 cases, 995 of which are negative cases and 5 of which are positive cases. If the system classifies them all as negative, the accuracy would be 99.5%, even though the classifier missed all positive cases. Other performance measures account for this by including TP in a product: for example, geometric mean (g-mean) (Kubat et al., 1998), as defined in equations 7 and 8, and F-Measure (Lewis and Gale, 1994), as defined in equation 9.
g-mean1 = sqrt(TP × P)    [7]
g-mean2 = sqrt(TP × TN)    [8]
F = ((β² + 1) × TP × P) / (β² × P + TP)    [9]
In equation 9, β has a value from 0 to infinity and is used to control the weight assigned to TP and P. Any classifier evaluated using equations 7, 8 or 9 will have a measure value of 0 if all positive cases are classified incorrectly.
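Putting the definitions together, a small Python sketch with made-up counts:

```python
# Minimal sketch computing the measures above from confusion-matrix counts
# a, b, c, d (the counts are made-up illustrative numbers).
import math

a, b = 990, 5    # correct negatives, false positives
c, d = 3, 2      # false negatives, correct positives

AC = (a + d) / (a + b + c + d)       # accuracy, equation 1
TP = d / (c + d)                     # recall / true positive rate
FP = b / (a + b)                     # false positive rate
TN = a / (a + b)                     # true negative rate
FN = c / (c + d)                     # false negative rate
P  = d / (b + d)                     # precision

g_mean1 = math.sqrt(TP * P)          # equation 7
g_mean2 = math.sqrt(TP * TN)         # equation 8
beta = 1.0                           # weight between TP and P in the F-measure
F = ((beta**2 + 1) * TP * P) / (beta**2 * P + TP)   # equation 9

print(f"AC={AC:.3f} TP={TP:.3f} P={P:.3f} "
      f"g1={g_mean1:.3f} g2={g_mean2:.3f} F={F:.3f}")
```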
Another way to examine the performance of classifiers is to use a ROC graph, described on the next page.
Timetable II
Autumn 2007
MSc Advanced Computing (Weeks 2 - 11)
Week 2 start date: Monday 8 October, 2007
Date Published: 19 October 2007
Time | Monday | Tuesday | Wednesday | Thursday | Friday |
---|---|---|---|---|---|
0900 | Advanced Topics in Software Engineering LEC (2-10) / jnm (2-10),sue (2-10) / 311 | Modal and Temporal Logic LEC (2-10) / imh (2-10),mjs (2-10) / 144 | Commemoration Day (No Teaching Week 4 - 24.10.07) Wks (4-4) / None (4-4) / Intelligent Data and Probabilistic Inference LEC (2-10) / dfg (2-10) / 145 | Advanced Issues in Object Oriented Programming LEC (2-10) / scd (2-10) / 145 | Network Security LEC (2-10) / ecl1 (2-10),mrh (2-10) / 308 |
1000 | Advanced Topics in Software Engineering TUT (2-10) / jnm (2-10),sue (2-10) / 311 Laboratory Prolog LAB (2-10) / nr600 (2-10) / 202,206 Intelligent Data and Probabilistic Inference LEC (11-11) / dfg (11-11) / 308 | Modal and Temporal Logic TUT (2-10) / imh (2-10),mjs (2-10) / 144 | Commemoration Day (No Teaching Week 4 - 24.10.07) Wks (4-4) / None (4-4) / Intelligent Data and Probabilistic Inference LEC (2-10) / dfg (2-10) / 308 | Advanced Issues in Object Oriented Programming TUT (2-10) / scd (2-10) / 145 | Lexus Preperation Wks (11-11) / nr600 (11-11) / 202,206 Network Security TUT (2-10) / ecl1 (2-10),mrh (2-10) / 308 |
1100 | Laboratory Prolog LAB (2-10) / nr600 (2-10) / 202,206 Advanced Topics in Software Engineering LEC (2-10) / jnm (2-10),sue (2-10) / 311 Intelligent Data and Probabilistic Inference LEC (11-11) / dfg (11-11) / 308 Lexus Preperation Wks (5-5) / nr600 (5-5) / 202,206 | Models of Concurrent Computation LEC (2-10) / dirk (2-10),pg (2-10) / 145 | Commemoration Day (No Teaching Week 4 - 24.10.07) Wks (4-4) / None (4-4) / Laboratory Prolog LAB (2-10) / nr600 (2-10) / 202,206 Intelligent Data and Probabilistic Inference TUT (2-10) / dfg (2-10) / 344 | Machine Learning LAB (2-10) / maja (2-10),shm (2-10) / 219 Computing for Optimal Decisions LEC (2-10) / br (2-10) / 145 | Machine Learning LAB (2-10) / maja (2-10),shm (2-10) / 219 Lexis Prolog LAB (11-11) / nr600 (11-11) / 202,206 Modal and Temporal Logic LEC (2-10) / imh (2-10),mjs (2-10) / 144 |
1200 | Laboratory Workshop (MAC & MSc CS in depth pathway) LEC (2-11) / nr600 (2-11) / 311 Lexis Prolog LAB (5-5) / nr600 (5-5) / 202,206 Intelligent Data and Probabilistic Inference TUT (11-11) / dfg (11-11) / 308 | Prolog Support Lectures LEC (2-10) / cjh (2-10),klc (2-10) / 144 | Commemoration Day (No Teaching Week 4 - 24.10.07) Wks (4-4) / None (4-4) / Laboratory Prolog LAB (2-10) / nr600 (2-10) / 202,206 | Machine Learning Alternative Lab LAB (2-10) / maja (2-10),shm (2-10) / 219 | Machine Learning Alternative Lab LAB (2-10) / maja (2-10),shm (2-10) / 219 Lexis Prolog LAB (11-11) / nr600 (11-11) / 202,206 |
1300 | Lexis Prolog LAB (5-5) / nr600 (5-5) / 202,206 | | | | |
1400 | Advanced Issues in Object Oriented Programming LEC (2-10) / scd (2-10) / 144 | Network Security LEC (2-10) / ecl1 (2-10),mrh (2-10) / 144 | | Automated Reasoning LEC (2-10) / kb (2-10) / 145 | Automated Reasoning LEC (2-10) / kb (2-10) / 144 |
1500 | | Models of Concurrent Computation TUT (2-10) / dirk (2-10),pg (2-10) / 145 | | Computer Vision LEC (2-10) / gzy (2-10) / 144 | Automated Reasoning TUT (2-10) / kb (2-10) / 144 |
1600 | Machine Learning LEC (2-10) / maja (2-10),shm (2-10) / 311 | Models of Concurrent Computation LEC (2-10) / dirk (2-10),pg (2-10) / 145 | | Computer Vision TUT (2-10) / gzy (2-10) / 144 | Computing for Optimal Decisions LEC (2-10) / br (2-10) / 145 |
1700 | Machine Learning LEC (2-10) / maja (2-10),shm (2-10) / 311 | Project Lecture LEC (8-8) / / 308 | | Computer Vision LEC (2-10) / gzy (2-10) / 144 | Computing for Optimal Decisions TUT (2-10) / br (2-10) / 145 |