<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-2556963392609813338</id><updated>2011-11-27T23:34:22.175Z</updated><category term='Data Mining: Information Theory'/><category term='Data Mining: Sampling'/><category term='Stachastic Process'/><category term='Life: Double Seventh Festival'/><category term='Data Mining: Data'/><category term='English'/><category term='Statistics'/><category term='Basic Concept'/><category term='Data Mining: Statistics'/><category term='Regression'/><category term='Data Mining: Concept'/><category term='Probability Theory'/><category term='nonparametric Test'/><category term='Process of Data mining'/><category term='Data Mining: Graph'/><category term='Data Mining: Regression'/><category term='Data Mining: Math'/><category term='Symbol'/><category term='Data Mining: Database'/><category term='Data Mining: Code'/><category term='Data Mining: Course'/><category term='Book of Data mining'/><category term='Data Mining: Syllabus'/><category term='Conference Data Mining'/><category term='Data Mining: Theory'/><title type='text'>Johnny Deng's Column</title><subtitle type='html'>Hi all,
This blog is an English archive of my PhD experience at Imperial College London, mainly logging my research and work process, as well as some visual records.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>73</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-172182031341526308</id><published>2008-05-15T17:49:00.000+01:00</published><updated>2008-05-15T17:50:06.969+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Stachastic Process'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Regression'/><title type='text'>Autoregressive Moving-Average Process</title><content type='html'>&lt;p class="body"&gt;An &lt;i&gt;n&lt;/i&gt;-dimensional &lt;b&gt;       &lt;a name="autoregressive moving-average process"&gt;autoregressive        moving-average process&lt;/a&gt;&lt;/b&gt; of orders &lt;i&gt;p&lt;/i&gt; and &lt;i&gt;q&lt;/i&gt;, ARMA(&lt;i&gt;p&lt;/i&gt;,&lt;i&gt;q&lt;/i&gt;),        is a &lt;a href="http://www.riskglossary.com/articles/time_series_stochastic_process.htm"&gt;stochastic        process&lt;/a&gt; of the form &lt;/p&gt;        &lt;table 
style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" height="36"&gt;         &lt;tbody&gt;&lt;tr&gt;           &lt;td align="center" width="100%"&gt;           &lt;p style="margin-top: 18px; margin-bottom: 18px;"&gt;           &lt;img src="http://www.riskglossary.com/formulas/ARMA_%5B1%5D.gif" border="0" height="57" width="254" /&gt;&lt;/p&gt;&lt;/td&gt;           &lt;td align="right" width="5%"&gt;[1]&lt;/td&gt;         &lt;/tr&gt;       &lt;/tbody&gt;&lt;/table&gt;         &lt;p class="body"&gt;where &lt;i&gt;&lt;b&gt;a&lt;/b&gt;&lt;/i&gt; is an &lt;i&gt;n&lt;/i&gt;-dimensional vector,        the       &lt;img src="http://www.riskglossary.com/formulas/moving_average_process_beta_k.gif" align="absbottom" border="0" height="15" width="15" /&gt;        and       &lt;img src="http://www.riskglossary.com/formulas/autoregressive_process_bk.gif" align="absbottom" border="0" height="15" width="14" /&gt;        are &lt;i&gt;n&lt;/i&gt;&lt;img src="http://www.riskglossary.com/symbols/cross.gif" align="absbottom" border="0" height="13" width="11" /&gt;&lt;i&gt;n&lt;/i&gt;        matrices, and &lt;i&gt;&lt;b&gt;W&lt;/b&gt;&lt;/i&gt; is &lt;i&gt;n&lt;/i&gt;-dimensional       &lt;a href="http://www.riskglossary.com/articles/white_noise.htm"&gt;white noise&lt;/a&gt; (see the &lt;i&gt;       &lt;a target="_blank" href="http://www.riskglossary.com/Notation.pdf"&gt;       notation conventions&lt;/a&gt;&lt;/i&gt; documentation). As the name        suggests, this combines an       &lt;a href="http://www.riskglossary.com/articles/autoregressive_process.htm"&gt;AR(&lt;i&gt;p&lt;/i&gt;) model&lt;/a&gt;        with an &lt;a href="http://www.riskglossary.com/articles/moving_average_process.htm"&gt;MA(&lt;i&gt;q&lt;/i&gt;)        model&lt;/a&gt; of the same dimension &lt;i&gt;n&lt;/i&gt;. In applications, ARMA(1,1)        processes are common. 
&lt;/p&gt;       &lt;p class="body"&gt;Exhibit 1 shows a realization of the univariate        ARMA(1,1) &lt;a name="[2]"&gt;process&lt;/a&gt; &lt;/p&gt;        &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" height="36"&gt;         &lt;tbody&gt;&lt;tr&gt;           &lt;td align="center" width="100%"&gt;           &lt;p style="margin-top: 18px; margin-bottom: 18px;"&gt;           &lt;img src="http://www.riskglossary.com/formulas/ARMA_%5B2%5D.gif" border="0" height="28" width="207" /&gt;&lt;/p&gt;&lt;/td&gt;           &lt;td align="right" width="5%"&gt;[2]&lt;/td&gt;         &lt;/tr&gt;       &lt;/tbody&gt;&lt;/table&gt;         &lt;p&gt;where &lt;i&gt;W&lt;/i&gt; is &lt;a href="http://www.riskglossary.com/articles/white_noise.htm"&gt;Gaussian white noise&lt;/a&gt; with        &lt;a href="http://www.riskglossary.com/articles/standard_deviation.htm"&gt;variance&lt;/a&gt; 1. &lt;/p&gt;        &lt;div align="center"&gt;         &lt;center&gt;         &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" width="300"&gt;           &lt;tbody&gt;&lt;tr&gt;             &lt;td&gt;             &lt;p class="exhibit_header"&gt;&lt;span class="exhibit_header"&gt;ARMA Process&lt;/span&gt;&lt;br /&gt;            &lt;span class="exhibit_subheader"&gt;Exhibit 1&lt;/span&gt;&lt;/p&gt;&lt;/td&gt;           &lt;/tr&gt;           &lt;tr&gt;             &lt;td bgcolor="#cc9966" height="2"&gt;             &lt;img src="http://www.riskglossary.com/images/blank_2x2.gif" border="0" height="2" width="2" /&gt;&lt;/td&gt;           &lt;/tr&gt;           &lt;tr&gt;             &lt;td&gt;             &lt;img src="http://www.riskglossary.com/images/ex1_ARMA.gif" border="0" height="164" width="250" /&gt;&lt;/td&gt;           &lt;/tr&gt;           &lt;tr&gt;             &lt;td&gt;             &lt;p class="exhibit_legend" style="margin-bottom: 30px;"&gt;A realization of              the 
ARMA(1,1) process [&lt;a href="http://www.riskglossary.com/articles/ARMA.htm#%5B2%5D"&gt;2&lt;/a&gt;].&lt;/p&gt;&lt;/td&gt;           &lt;/tr&gt;         &lt;/tbody&gt;&lt;/table&gt;         &lt;/center&gt;       &lt;/div&gt;                 &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" width="100%"&gt;&lt;tbody&gt;&lt;tr&gt;           &lt;td colspan="2" height="30" valign="bottom"&gt;           &lt;p class="header2_black"&gt;Exercises&lt;/p&gt;&lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;           &lt;td colspan="2" bgcolor="#cc9966" height="2"&gt;           &lt;img src="http://www.riskglossary.com/images/blank_2x2.gif" border="0" height="2" width="2" /&gt;&lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;           &lt;td colspan="2" bgcolor="#ffffcc"&gt;           &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="6" cellspacing="0" width="100%"&gt;             &lt;tbody&gt;&lt;tr&gt;               &lt;td width="100%"&gt;               &lt;p class="body" style="margin-left: 10px; margin-top: 8px; margin-bottom: 8px;"&gt;               &lt;img src="http://www.riskglossary.com/images/bullet_exercise.gif" align="absbottom" border="0" height="17" width="19" /&gt;Below                is a realization of 50                    consecutive terms of variance 1                &lt;a href="http://www.riskglossary.com/articles/white_noise.htm"&gt;Gaussian white noise&lt;/a&gt;.&lt;/p&gt;&lt;div align="center"&gt;                     &lt;center&gt;                     &lt;table style="border-collapse: collapse;" border="0" cellpadding="0" cellspacing="0" width="320"&gt;                       &lt;colgroup&gt;                         &lt;col style="width: 48pt;" span="5" width="64"&gt;                       &lt;/colgroup&gt;                       &lt;tbody&gt;&lt;tr style="height: 12.75pt;" height="17"&gt;                         &lt;td style="border-style: 
solid none none; border-color: rgb(204, 153, 102) -moz-use-text-color -moz-use-text-color; border-width: 2px medium medium; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="0.29299999999999998" class="exhibit_legend" align="right" bgcolor="#ffffff" height="17" width="64"&gt;                         0.293&lt;/td&gt;                         &lt;td style="border-style: solid none none; border-color: rgb(204, 153, 102) -moz-use-text-color -moz-use-text-color; border-width: 2px medium medium; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="0.317" class="exhibit_legend" align="right" bgcolor="#ffffff" width="64"&gt;                         0.317&lt;/td&gt;                         &lt;td style="border-style: solid none none; border-color: rgb(204, 153, 102) -moz-use-text-color -moz-use-text-color; border-width: 2px medium medium; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="4.7E-2" class="exhibit_legend" align="right" bgcolor="#ffffff" width="64"&gt;                         0.047&lt;/td&gt;                         &lt;td style="border-style: solid none none; border-color: rgb(204, 153, 102) -moz-use-text-color -moz-use-text-color; border-width: 2px medium medium; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="-0.28599999999999998" class="exhibit_legend" align="right" 
bgcolor="#ffffff" width="64"&gt;                         -0.286&lt;/td&gt;                         &lt;td style="border-style: solid none none; border-color: rgb(204, 153, 102) -moz-use-text-color -moz-use-text-color; border-width: 2px medium medium; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="-1.2370000000000001" class="exhibit_legend" align="right" bgcolor="#ffffff" width="64"&gt;                         -1.237&lt;/td&gt;                       &lt;/tr&gt;                       &lt;tr style="height: 12.75pt;" height="17"&gt;                         &lt;td style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="-0.55400000000000005" class="exhibit_legend" align="right" bgcolor="#ffffff" height="17"&gt;                         -0.554&lt;/td&gt;                         &lt;td num="0.53500000000000003" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.535&lt;/td&gt;                         &lt;td style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -1.640&lt;/td&gt;                         &lt;td num="-0.89900000000000002" style="border: medium 
none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -0.899&lt;/td&gt;                         &lt;td num="-0.70399999999999996" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -0.704&lt;/td&gt;                       &lt;/tr&gt;                       &lt;tr style="height: 12.75pt;" height="17"&gt;                         &lt;td style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="-1.8859999999999999" class="exhibit_legend" align="right" bgcolor="#ffffff" height="17"&gt;                         -1.886&lt;/td&gt;                         &lt;td num="0.27100000000000002" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.271&lt;/td&gt;                         &lt;td num="0.41799999999999998" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" 
align="right" bgcolor="#ffffff"&gt;                         0.418&lt;/td&gt;                         &lt;td num="1.651" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         1.651&lt;/td&gt;                         &lt;td num="7.8E-2" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.078&lt;/td&gt;                       &lt;/tr&gt;                       &lt;tr style="height: 12.75pt;" height="17"&gt;                         &lt;td style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="0.52800000000000002" class="exhibit_legend" align="right" bgcolor="#ffffff" height="17"&gt;                         0.528&lt;/td&gt;                         &lt;td num="1.0129999999999999" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         1.013&lt;/td&gt;                         &lt;td num="2.2959999999999998" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; 
vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         2.296&lt;/td&gt;                         &lt;td num="8.5999999999999993E-2" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.086&lt;/td&gt;                         &lt;td num="1.4710000000000001" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         1.471&lt;/td&gt;                       &lt;/tr&gt;                       &lt;tr style="height: 12.75pt;" height="17"&gt;                         &lt;td style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="-0.58" class="exhibit_legend" align="right" bgcolor="#ffffff" height="17"&gt;                         -0.580&lt;/td&gt;                         &lt;td num="-1.776" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -1.776&lt;/td&gt;                         &lt;td num="-2.2170000000000001" style="border: medium 
none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -2.217&lt;/td&gt;                         &lt;td num="0.502" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.502&lt;/td&gt;                         &lt;td num="-1.1040000000000001" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -1.104&lt;/td&gt;                       &lt;/tr&gt;                       &lt;tr style="height: 12.75pt;" height="17"&gt;                         &lt;td style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="-1.2110000000000001" class="exhibit_legend" align="right" bgcolor="#ffffff" height="17"&gt;                         -1.211&lt;/td&gt;                         &lt;td num="0.20499999999999999" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" 
bgcolor="#ffffff"&gt;                         0.205&lt;/td&gt;                         &lt;td num="0.11" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.110&lt;/td&gt;                         &lt;td num="1.0999999999999999E-2" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.011&lt;/td&gt;                         &lt;td num="0.77800000000000002" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.778&lt;/td&gt;                       &lt;/tr&gt;                       &lt;tr style="height: 12.75pt;" height="17"&gt;                         &lt;td style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="-1.036" class="exhibit_legend" align="right" bgcolor="#ffffff" height="17"&gt;                         -1.036&lt;/td&gt;                         &lt;td num="1.1950000000000001" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: 
bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         1.195&lt;/td&gt;                         &lt;td num="-1.169" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -1.169&lt;/td&gt;                         &lt;td num="-0.16200000000000001" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -0.162&lt;/td&gt;                         &lt;td num="-0.504" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -0.504&lt;/td&gt;                       &lt;/tr&gt;                       &lt;tr style="height: 12.75pt;" height="17"&gt;                         &lt;td style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="-0.67900000000000005" class="exhibit_legend" align="right" bgcolor="#ffffff" height="17"&gt;                         -0.679&lt;/td&gt;                         &lt;td num="-1.3660000000000001" style="border: medium none ; color: 
windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -1.366&lt;/td&gt;                         &lt;td num="0.88500000000000001" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.885&lt;/td&gt;                         &lt;td num="-0.47599999999999998" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -0.476&lt;/td&gt;                         &lt;td num="1.6439999999999999" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         1.644&lt;/td&gt;                       &lt;/tr&gt;                       &lt;tr style="height: 12.75pt;" height="17"&gt;                         &lt;td style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="-1.665" class="exhibit_legend" align="right" bgcolor="#ffffff" 
height="17"&gt;                         -1.665&lt;/td&gt;                         &lt;td num="0.129" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.129&lt;/td&gt;                         &lt;td num="2.8820000000000001" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         2.882&lt;/td&gt;                         &lt;td num="0.97799999999999998" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.978&lt;/td&gt;                         &lt;td num="5.3999999999999999E-2" style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.054&lt;/td&gt;                       &lt;/tr&gt;                       &lt;tr style="height: 12.75pt;" height="17"&gt;                         &lt;td style="border-style: none none solid; border-color: -moz-use-text-color -moz-use-text-color rgb(204, 153, 102); border-width: medium medium 2px; color: windowtext; font-size: 10pt; font-weight: 
400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" num="-0.39600000000000002" class="exhibit_legend" align="right" bgcolor="#ffffff" height="17"&gt;                         -0.396&lt;/td&gt;                         &lt;td num="0.68500000000000005" style="border-style: none none solid; border-color: -moz-use-text-color -moz-use-text-color rgb(204, 153, 102); border-width: medium medium 2px; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.685&lt;/td&gt;                         &lt;td num="1.403" style="border-style: none none solid; border-color: -moz-use-text-color -moz-use-text-color rgb(204, 153, 102); border-width: medium medium 2px; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         1.403&lt;/td&gt;                         &lt;td num="-8.9999999999999993E-3" style="border-style: none none solid; border-color: -moz-use-text-color -moz-use-text-color rgb(204, 153, 102); border-width: medium medium 2px; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         -0.009&lt;/td&gt;                         &lt;td num="0.91800000000000004" style="border-style: none none solid; border-color: -moz-use-text-color 
-moz-use-text-color rgb(204, 153, 102); border-width: medium medium 2px; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; padding-left: 1px; padding-right: 1px; padding-top: 1px;" class="exhibit_legend" align="right" bgcolor="#ffffff"&gt;                         0.918&lt;/td&gt;                       &lt;/tr&gt;                       &lt;tr style="height: 12.75pt;" height="17"&gt;                         &lt;td style="border: medium none ; color: windowtext; font-size: 10pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; vertical-align: bottom; white-space: nowrap; width: 310px; padding-left: 1px; padding-right: 1px; padding-top: 1px;" colspan="5"&gt;                         &lt;p class="exhibit_legend"&gt;Realization of 50 consecutive terms of a variance 1                          Gaussian white noise.&lt;/p&gt;&lt;/td&gt;                       &lt;/tr&gt;                     &lt;/tbody&gt;&lt;/table&gt;                     &lt;/center&gt;                   &lt;/div&gt;                   &lt;p class="body" style="margin-left: 10px;"&gt;Use this to generate a                    corresponding realization of                    the ARMA(1,1) process&lt;/p&gt;                        &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" width="100%"&gt;       &lt;tbody&gt;&lt;tr&gt;         &lt;td align="center" width="98%"&gt;         &lt;p class="body"&gt;         &lt;img src="http://www.riskglossary.com/formulas/ARMA_%5Be1%5D.gif" border="0" height="29" width="216" /&gt;&lt;/p&gt;&lt;/td&gt;         &lt;td width="2%"&gt;         &lt;p class="body" align="right"&gt;[e1]&lt;/p&gt;&lt;/td&gt;       &lt;/tr&gt;     &lt;/tbody&gt;&lt;/table&gt;                   &lt;p style="margin-left: 10px;"&gt;where &lt;i&gt;&lt;sup&gt;t&lt;/sup&gt;W&lt;/i&gt; is a variance 1 Gaussian 
white                    noise. Initialize the realization with term &lt;sup&gt;0&lt;/sup&gt;&lt;i&gt;x&lt;/i&gt;                    = 0.                 [&lt;a target="_blank" href="http://www.value-at-risk.net/spreadsheets/solution_04_16.xls"&gt;spreadsheet solution&lt;/a&gt;]&lt;/p&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-172182031341526308?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/172182031341526308/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=172182031341526308' title='40 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/172182031341526308'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/172182031341526308'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2008/05/autoregressive-moving-average-process.html' title='Autoregressive Moving-Average Process'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>40</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-4448228472281515858</id><published>2008-05-15T17:38:00.000+01:00</published><updated>2008-05-15T17:39:56.388+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Stachastic Process'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: 
Regression'/><title type='text'>White Noise</title><content type='html'>&lt;p class="body" style="margin-bottom: 24px;"&gt;A &lt;b&gt;&lt;a name="white noise"&gt;white noise&lt;/a&gt;&lt;/b&gt;        is a simple type of &lt;a href="http://www.riskglossary.com/articles/time_series_stochastic_process.htm"&gt;stochastic process&lt;/a&gt;. Precise definitions vary. One        simple definition is that a white noise is a (univariate or multivariate)        &lt;a href="http://www.riskglossary.com/articles/time_series_stochastic_process.htm"&gt;discrete-time&lt;/a&gt; stochastic process whose terms are independent and        identically distributed (IID), all with zero &lt;a href="http://www.riskglossary.com/articles/mean.htm"&gt;mean&lt;/a&gt;. While this definition        captures the spirit of what constitutes a white noise, the IID requirement        is often too restrictive for applications. Typically the IID requirement        is replaced with a requirement that terms have constant second moments,        zero &lt;a href="http://www.riskglossary.com/articles/time_series_stochastic_process.htm"&gt;autocorrelations&lt;/a&gt; and zero means. Let's formalize this.&lt;/p&gt;        &lt;p class="body" style="margin-bottom: 28px;"&gt;If you have not already done so, see the &lt;i&gt;       &lt;a target="_blank" href="http://www.riskglossary.com/Notation.pdf"&gt;       notation conventions&lt;/a&gt;&lt;/i&gt; documentation. 
A one-dimensional &lt;a href="http://www.riskglossary.com/articles/time_series_stochastic_process.htm"&gt;       stochastic process&lt;/a&gt; &lt;/p&gt;        &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" height="36"&gt;         &lt;tbody&gt;&lt;tr&gt;           &lt;td align="center" width="100%"&gt;           &lt;p style="margin-top: 6px; margin-bottom: 12px;"&gt;           ..., &lt;sup&gt;&lt;i&gt;t&lt;/i&gt;–2&lt;/sup&gt;&lt;i&gt;W&lt;/i&gt;, &lt;sup&gt;&lt;i&gt;t&lt;/i&gt;–1&lt;/sup&gt;&lt;i&gt;W&lt;/i&gt;,           &lt;sup&gt;&lt;i&gt;t&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;W&lt;/i&gt;, &lt;sup&gt;&lt;i&gt;t&lt;/i&gt;+1&lt;/sup&gt;&lt;i&gt;W&lt;/i&gt;, ...&lt;/p&gt;&lt;/td&gt;           &lt;td align="right" width="5%"&gt;[1]&lt;/td&gt;         &lt;/tr&gt;       &lt;/tbody&gt;&lt;/table&gt;         &lt;p class="body" style="margin-top: 26px;"&gt;is said to be white noise        if unconditional means, &lt;a href="http://www.riskglossary.com/articles/standard_deviation.htm"&gt;standard deviations&lt;/a&gt; and autocorrelations       &lt;a name="[2] [3] [4]"&gt;satisfy&lt;/a&gt;       &lt;/p&gt;                 &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" height="36"&gt;&lt;tbody&gt;&lt;tr&gt;           &lt;td align="center" width="100%"&gt;           &lt;p style="margin-top: 6px; margin-bottom: 6px;"&gt;           &lt;i&gt;E&lt;/i&gt;(&lt;sup&gt;&lt;i&gt;t&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;W&lt;/i&gt;) = 0&lt;/p&gt;&lt;/td&gt;           &lt;td align="right" width="5%"&gt;           &lt;p style="margin-top: 6px; margin-bottom: 6px;"&gt;[2]&lt;/p&gt;&lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;           &lt;td align="center" width="100%"&gt;           &lt;p style="margin-top: 6px; margin-bottom: 6px;"&gt;           &lt;i&gt;std&lt;/i&gt;(&lt;sup&gt;&lt;i&gt;t&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;W&lt;/i&gt;) =           &lt;img 
src="http://www.riskglossary.com/symbols/sigma.gif" align="absbottom" border="0" height="12" width="9" /&gt;&lt;/p&gt;&lt;/td&gt;           &lt;td align="right" width="5%"&gt;           &lt;p style="margin-top: 6px; margin-bottom: 6px;"&gt;[3]&lt;/p&gt;&lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;           &lt;td align="center" width="100%"&gt;           &lt;p style="margin-top: 6px; margin-bottom: 6px;"&gt;           &lt;i&gt;cor&lt;/i&gt;(&lt;sup&gt;&lt;i&gt;t&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;W, &lt;/i&gt;&lt;sup&gt;&lt;i&gt;t+n&lt;/i&gt;&lt;/sup&gt;&lt;i&gt;W&lt;/i&gt;)            = 0&lt;/p&gt;&lt;/td&gt;           &lt;td align="right" width="5%"&gt;           &lt;p style="margin-top: 6px; margin-bottom: 6px;"&gt;[4]&lt;/p&gt;&lt;p style="margin-top: 6px; margin-bottom: 6px; text-align: left;"&gt;&lt;br /&gt;&lt;/p&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;p class="body"&gt;for some constant       &lt;img src="http://www.riskglossary.com/symbols/sigma.gif" align="absbottom" border="0" height="12" width="9" /&gt; and any       &lt;a href="http://www.riskglossary.com/articles/set.htm"&gt;integer&lt;/a&gt; &lt;i&gt;n&lt;/i&gt;. To distinguish this definition of        white noise from that which requires IID terms, we call the latter an       &lt;b&gt;&lt;a name="independent white noise"&gt;independent white noise&lt;/a&gt;&lt;/b&gt; or       &lt;b&gt;&lt;a name="strong white noise"&gt;strong white noise&lt;/a&gt;&lt;/b&gt;. Note        that the definition of white noise is more restrictive than that of        independent white noise in just one respect. With a white noise, means,        standard deviations and autocorrelations must exist. For independent white        noise, they need not.&lt;/p&gt;         &lt;p class="body"&gt;While the definition of independent white noise is        otherwise more restrictive than that of white noise, it is also simpler.        An independent white noise is necessarily a very simple process.        
Conditions [&lt;a href="http://www.riskglossary.com/articles/white_noise.htm#%5B2%5D%20%5B3%5D%20%5B4%5D"&gt;2&lt;/a&gt;] through [&lt;a href="http://www.riskglossary.com/articles/white_noise.htm#%5B2%5D%20%5B3%5D%20%5B4%5D"&gt;4&lt;/a&gt;],        which define a white noise, can accommodate more complicated        processes. For example, conditions [&lt;a href="http://www.riskglossary.com/articles/white_noise.htm#%5B2%5D%20%5B3%5D%20%5B4%5D"&gt;2&lt;/a&gt;] and [&lt;a href="http://www.riskglossary.com/articles/white_noise.htm#%5B2%5D%20%5B3%5D%20%5B4%5D"&gt;3&lt;/a&gt;]        apply only to unconditional moments. There is nothing to stop a white        noise from being &lt;a href="http://www.riskglossary.com/articles/heteroskedasticity.htm"&gt;conditionally        heteroskedastic&lt;/a&gt;. That is impossible with an independent white noise.&lt;/p&gt;         &lt;p class="body"&gt;An independent white noise whose terms are all        &lt;a href="http://www.riskglossary.com/articles/normal_distribution.htm"&gt;normally&lt;/a&gt;        distributed is called a &lt;b&gt;&lt;a name="Gaussian white noise"&gt;Gaussian white noise&lt;/a&gt;&lt;/b&gt;.        A realization of a univariate Gaussian white noise with       &lt;a href="http://www.riskglossary.com/articles/standard_deviation.htm"&gt;variance&lt;/a&gt; 1 is graphed in        Exhibit 1. 
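&lt;/p&gt;         &lt;p class="body"&gt;Conditions [2] through [4] can be checked numerically on a simulated realization. The following is a minimal Python sketch, not part of the original article: the helper names gaussian_white_noise and sample_autocorrelation, the sample size of 10,000 and the seed are all assumptions made for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
import random
import statistics


def gaussian_white_noise(n_terms, sigma=1.0, seed=0):
    """Draw n_terms IID normal terms with mean 0 and standard deviation sigma."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n_terms)]


def sample_autocorrelation(series, lag):
    """Sample correlation between the series and itself shifted by lag."""
    mean = statistics.fmean(series)
    centered = [value - mean for value in series]
    denominator = sum(c * c for c in centered)
    numerator = sum(centered[i] * centered[i + lag] for i in range(len(series) - lag))
    return numerator / denominator


# Hypothetical sample size, chosen so that the sample statistics are reasonably stable.
w = gaussian_white_noise(10_000)
print("sample mean:", round(statistics.fmean(w), 3))      # condition [2]: expected near 0
print("sample std:", round(statistics.pstdev(w), 3))      # condition [3]: expected near sigma
print("lag-1 autocorrelation:", round(sample_autocorrelation(w, 1), 3))  # condition [4]: expected near 0
```
&lt;/pre&gt;
&lt;p class="body"&gt;With this many terms, the three printed values should lie close to 0, 1 and 0 respectively, although any finite sample will deviate slightly from the unconditional moments.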
&lt;/p&gt;        &lt;div align="center"&gt;         &lt;center&gt;         &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" width="300"&gt;           &lt;tbody&gt;&lt;tr&gt;             &lt;td&gt;             &lt;p class="exhibit_header"&gt;             &lt;span class="exhibit_header"&gt;Univariate Gaussian White Noise&lt;/span&gt;&lt;br /&gt;            &lt;span class="exhibit_subheader"&gt;Exhibit 1&lt;/span&gt;&lt;/p&gt;&lt;/td&gt;           &lt;/tr&gt;           &lt;tr&gt;             &lt;td bgcolor="#cc9966" height="2"&gt;             &lt;img src="http://www.riskglossary.com/images/blank_2x2.gif" border="0" height="2" width="2" /&gt;&lt;/td&gt;           &lt;/tr&gt;           &lt;tr&gt;             &lt;td&gt;             &lt;img src="http://www.riskglossary.com/images/ex1_white_noise.gif" border="0" height="165" width="250" /&gt;&lt;/td&gt;           &lt;/tr&gt;           &lt;tr&gt;             &lt;td&gt;             &lt;p class="exhibit_legend"&gt;A realization of              a univariate Gaussian white noise with variance 1.&lt;/p&gt;&lt;/td&gt;           &lt;/tr&gt;         &lt;/tbody&gt;&lt;/table&gt;         &lt;/center&gt;       &lt;/div&gt;         &lt;p class="body"&gt;All these concepts generalize to multivariate processes. 
An &lt;i&gt;n&lt;/i&gt;-dimensional       stochastic process &lt;/p&gt;                            &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" height="36"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align="center" width="100%"&gt;           &lt;p style="margin-top: 12px; margin-bottom: 12px;"&gt;           &lt;img src="http://www.riskglossary.com/formulas/white_noise_%5B1%5D.gif" border="0" height="27" width="191" /&gt;&lt;/p&gt;&lt;/td&gt;           &lt;td align="right" width="5%"&gt;[5]&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;p class="body"&gt;is said to be white noise        if unconditional &lt;a href="http://www.riskglossary.com/articles/mean.htm"&gt;expectations&lt;/a&gt; satisfy       &lt;/p&gt;        &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" height="36"&gt;         &lt;tbody&gt;&lt;tr&gt;           &lt;td rowspan="2" align="center" width="100%"&gt;           &lt;p style="margin-top: 12px; margin-bottom: 12px;"&gt;           &lt;img src="http://www.riskglossary.com/formulas/white_noise_%5B2%5D_%5B3%5D.gif" border="0" height="92" width="171" /&gt;&lt;/p&gt;&lt;/td&gt;           &lt;td align="right" width="5%"&gt;[6]&lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;           &lt;td align="right" width="5%"&gt;[7]&lt;/td&gt;         &lt;/tr&gt;       &lt;/tbody&gt;&lt;/table&gt;         &lt;p class="body"&gt;for some constant &lt;a href="http://www.riskglossary.com/articles/correlation.htm"&gt;       covariance matrix&lt;/a&gt;       &lt;img src="http://www.riskglossary.com/symbols/sigma_bold_cap.gif" align="absbottom" border="0" height="15" width="10" /&gt;.        Condition [7] does not require that the       &lt;img src="http://www.riskglossary.com/symbols/t_W_bold_italic.gif" align="absbottom" border="0" height="19" width="18" /&gt;        be independent. 
If we make this stronger assumption, the process is called independent white noise. If        we further assume the       &lt;img src="http://www.riskglossary.com/symbols/t_W_bold_italic.gif" align="absbottom" border="0" height="19" width="18" /&gt;        are &lt;a href="http://www.riskglossary.com/articles/joint_normal_distribution.htm"&gt;joint normal&lt;/a&gt;,        it is called Gaussian white noise. &lt;/p&gt;         &lt;p class="body"&gt;White noises are important in       &lt;a href="http://www.riskglossary.com/articles/time_series_stochastic_process.htm"&gt;time series analysis&lt;/a&gt;        because more complicated stochastic processes are generally defined in        terms of white noises.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-4448228472281515858?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/4448228472281515858/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=4448228472281515858' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/4448228472281515858'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/4448228472281515858'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2008/05/white-noise.html' title='White Noise'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-2980588129394850358</id><published>2008-05-15T17:31:00.001+01:00</published><updated>2008-05-15T17:34:11.837+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Regression'/><title type='text'>ARCH and GARCH Processes</title><content type='html'>&lt;p class="body"&gt;&lt;a name="autoregressive conditional heteroskedastic"&gt;Autoregressive        conditional heteroskedastic&lt;/a&gt; (ARCH) processes are a form of       &lt;a href="http://www.riskglossary.com/articles/time_series_stochastic_process.htm"&gt;stochastic        process&lt;/a&gt; that are widely used in finance and economics for modeling       &lt;a href="http://www.riskglossary.com/articles/heteroskedasticity.htm"&gt;conditional        heteroskedasticity&lt;/a&gt; and &lt;a href="http://www.riskglossary.com/articles/volatility_clustering.htm"&gt;volatility clustering&lt;/a&gt;. First proposed by Engle        (&lt;a href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#Engle82"&gt;1982&lt;/a&gt;), ARCH processes are univariate conditionally heteroskedastic       &lt;a href="http://www.riskglossary.com/articles/white_noise.htm"&gt;white noises&lt;/a&gt;. 
An ARCH(&lt;i&gt;q&lt;/i&gt;)        process &lt;i&gt;X&lt;/i&gt; is defined by two interrelated formulas (see the &lt;i&gt;       &lt;a target="_blank" href="http://www.riskglossary.com/Notation.pdf"&gt;       notation conventions&lt;/a&gt;&lt;/i&gt; documentation):&lt;/p&gt;        &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" height="36"&gt;         &lt;tbody&gt;&lt;tr&gt;           &lt;td rowspan="2" align="center" width="100%"&gt;           &lt;p&gt;           &lt;img src="http://www.riskglossary.com/formulas/ARCH_GARCH_%5B1%5D_%5B2%5D.gif" border="0" height="89" width="164" /&gt;&lt;/p&gt;&lt;/td&gt;           &lt;td align="right" width="5%"&gt;[1]&lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;           &lt;td align="right" width="5%"&gt;[2]&lt;/td&gt;         &lt;/tr&gt;       &lt;/tbody&gt;&lt;/table&gt;         &lt;p class="body"&gt;where &lt;i&gt;&lt;b&gt;W&lt;/b&gt;&lt;/i&gt; is a       &lt;a href="http://www.riskglossary.com/articles/normal_distribution.htm"&gt;standard normal&lt;/a&gt;       &lt;a href="http://www.riskglossary.com/articles/white_noise.htm"&gt;Gaussian white noise&lt;/a&gt;. This means that the time &lt;i&gt;t&lt;/i&gt; distribution of       &lt;b&gt;&lt;i&gt;X&lt;/i&gt;&lt;/b&gt;,        conditional on information available at time &lt;i&gt;t&lt;/i&gt;–1, is       &lt;a href="http://www.riskglossary.com/articles/normal_distribution.htm"&gt;normal&lt;/a&gt;, with constant       &lt;a href="http://www.riskglossary.com/articles/mean.htm"&gt;mean&lt;/a&gt; 0 and a conditional       &lt;a href="http://www.riskglossary.com/articles/standard_deviation.htm"&gt;variance&lt;/a&gt;       &lt;img src="http://www.riskglossary.com/formulas/ARCH_GARCH_stdev.gif" align="absbottom" border="0" height="20" width="37" /&gt;        that changes with time. 
Our notation indicates that       &lt;img src="http://www.riskglossary.com/formulas/ARCH_GARCH_stdev.gif" align="absbottom" border="0" height="20" width="37" /&gt;        is a variance at time &lt;i&gt;t&lt;/i&gt;, but conditional on information available        at time &lt;i&gt;t&lt;/i&gt;–1. Formula [2] defines       &lt;img src="http://www.riskglossary.com/formulas/ARCH_GARCH_stdev.gif" align="absbottom" border="0" height="20" width="37" /&gt;        as a function of preceding values of &lt;b&gt;&lt;i&gt;X&lt;/i&gt;&lt;/b&gt;. Together, formulas        [1] and [2] ensure that, if &lt;b&gt;&lt;i&gt;X&lt;/i&gt;&lt;/b&gt; takes on large positive or        negative values at some point in time, its conditional variance will be        elevated for subsequent points in time, thereby making it likely that &lt;b&gt;       &lt;i&gt;X&lt;/i&gt;&lt;/b&gt; will also take on large positive or negative values at those        times. In this manner, an ARCH process models volatility        clustering, that is, periods of high or low &lt;a href="http://www.riskglossary.com/articles/volatility.htm"&gt;       volatility&lt;/a&gt;.&lt;/p&gt;         &lt;p class="body"&gt;Bollerslev (&lt;a href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#Bollerslev86"&gt;1986&lt;/a&gt;) extended the model by allowing       &lt;img src="http://www.riskglossary.com/formulas/ARCH_GARCH_stdev.gif" align="absbottom" border="0" height="20" width="37" /&gt; to also depend        on its own past values. 
His generalized ARCH, or &lt;b&gt;&lt;a name="GARCH"&gt;GARCH&lt;/a&gt;&lt;/b&gt;(&lt;i&gt;p&lt;/i&gt;,&lt;i&gt;q&lt;/i&gt;), process has        form &lt;/p&gt;        &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" height="36"&gt;         &lt;tbody&gt;&lt;tr&gt;           &lt;td rowspan="2" align="center" width="100%"&gt;           &lt;img src="http://www.riskglossary.com/formulas/ARCH_GARCH_%5B3%5D_%5B4%5D.gif" border="0" height="92" width="278" /&gt;&lt;/td&gt;           &lt;td align="right" width="5%"&gt;[3]&lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;           &lt;td align="right" width="5%"&gt;[4]&lt;/td&gt;         &lt;/tr&gt;       &lt;/tbody&gt;&lt;/table&gt;         &lt;p class="body"&gt;See Hamilton (&lt;a href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#Hamilton"&gt;1994&lt;/a&gt;) for       &lt;a href="http://www.riskglossary.com/articles/time_series_stochastic_process.htm"&gt;stationarity&lt;/a&gt; conditions. In applications,        GARCH(1,1) processes are common. 
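&lt;/p&gt;         &lt;p class="body"&gt;The GARCH(1,1) recursion can be sketched in a few lines of Python. The formulas in this article survive only as images, so the sketch assumes the standard textbook form, in which the conditional variance equals omega plus alpha times the previous squared value plus beta times the previous conditional variance. The function name simulate_garch_1_1 and the parameter values are hypothetical and are not taken from the article.&lt;/p&gt;
&lt;pre&gt;
```python
import math
import random


def simulate_garch_1_1(n_terms, omega, alpha, beta, seed=0):
    """Simulate x_t = sigma_t * w_t with
    sigma_t^2 = omega + alpha * x_{t-1}^2 + beta * sigma_{t-1}^2,
    where w_t is a variance 1 Gaussian white noise (assumed standard form)."""
    rng = random.Random(seed)
    # Start from the unconditional variance, which is finite when alpha + beta is below 1.
    variance = omega / (1.0 - alpha - beta)
    xs, variances = [], []
    for _ in range(n_terms):
        x = math.sqrt(variance) * rng.gauss(0.0, 1.0)
        xs.append(x)
        variances.append(variance)
        # The next conditional variance depends on the latest value and the latest variance.
        variance = omega + alpha * x * x + beta * variance
    return xs, variances


# Hypothetical parameters chosen only for illustration.
xs, variances = simulate_garch_1_1(1000, omega=0.1, alpha=0.1, beta=0.8)
print("largest conditional variance:", round(max(variances), 3))
print("smallest conditional variance:", round(min(variances), 3))
```
&lt;/pre&gt;
&lt;p class="body"&gt;In a run like this, a large value of x raises the conditional variance for the following terms, which then decays gradually. This is the volatility clustering described above.&lt;/p&gt;         &lt;p class="body"&gt;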
Exhibit 1 indicates a realization of        the GARCH(1,1) &lt;a name="[6]"&gt;process&lt;/a&gt; &lt;/p&gt;        &lt;table style="border-collapse: collapse;" border="0" bordercolor="#111111" cellpadding="0" cellspacing="0" height="36"&gt;         &lt;tbody&gt;&lt;tr&gt;           &lt;td rowspan="2" align="center" width="100%"&gt;           &lt;p style="margin-top: 0pt; margin-bottom: 0pt;"&gt;           &lt;img src="http://www.riskglossary.com/formulas/ARCH_GARCH_%5B5%5D_%5B6%5D.gif" border="0" height="63" width="240" /&gt;&lt;/p&gt;&lt;/td&gt;           &lt;td align="right" width="5%"&gt;           &lt;p style="margin-top: 0pt; margin-bottom: 0pt;"&gt;[5]&lt;/p&gt;&lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;           &lt;td align="right" width="5%"&gt;           &lt;p style="margin-bottom: 0pt; margin-top: 0pt;"&gt;[6]&lt;/p&gt;&lt;/td&gt;         &lt;/tr&gt;       &lt;/tbody&gt;&lt;/table&gt;         &lt;div align="center"&gt;         &lt;center&gt;         &lt;table style="border-collapse: collapse; color: rgb(17, 17, 17);" border="0" cellpadding="0" cellspacing="0" width="275"&gt;           &lt;tbody&gt;&lt;tr&gt;             &lt;td&gt;             &lt;p class="exhibit_header" style="margin-top: 24px;"&gt;&lt;span class="exhibit_header"&gt;GARCH(1,1)              Process&lt;/span&gt;&lt;br /&gt;          &lt;span class="exhibit_subheader"&gt;Exhibit 1&lt;/span&gt;&lt;/p&gt;&lt;/td&gt;           &lt;/tr&gt;           &lt;tr&gt;             &lt;td bgcolor="#cc9966" height="2"&gt;             &lt;img src="http://www.riskglossary.com/images/blank_2x2.gif" border="0" height="2" width="2" /&gt;&lt;/td&gt;           &lt;/tr&gt;           &lt;tr&gt;             &lt;td&gt;             &lt;img src="http://www.riskglossary.com/images/ex1_ARCH_GARCH.gif" border="0" height="165" width="250" /&gt;&lt;/td&gt;           &lt;/tr&gt;           &lt;tr&gt;             &lt;td&gt;             &lt;p class="exhibit_legend"&gt;A realization of the GARCH(1,1) process            
  defined by [&lt;a href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#%5B6%5D"&gt;5&lt;/a&gt;] and [&lt;a href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#%5B6%5D"&gt;6&lt;/a&gt;].&lt;/p&gt;&lt;/td&gt;           &lt;/tr&gt;         &lt;/tbody&gt;&lt;/table&gt;         &lt;/center&gt;       &lt;/div&gt;        &lt;table style="border-collapse: collapse; color: rgb(17, 17, 17);" align="left" border="0" cellpadding="0" cellspacing="0" width="345"&gt;         &lt;tbody&gt;&lt;tr&gt;           &lt;td height="14" width="336"&gt;&lt;span style="font-size:85%;"&gt; &lt;/span&gt;&lt;span style="font-size:130%;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;/td&gt;           &lt;td height="14" width="9"&gt;&lt;br /&gt;&lt;/td&gt;         &lt;/tr&gt;                &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;p class="body"&gt;There have been many attempts to generalize GARCH models to multiple        dimensions. Attempts include: &lt;/p&gt;       &lt;p class="bullet_line_space"&gt;&lt;img src="http://www.riskglossary.com/images/bullet.gif" border="0" height="10" width="20" /&gt;the        vech and BEKK models of Engle and Kroner (&lt;a href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#Engle%20Kroner"&gt;1995&lt;/a&gt;), &lt;/p&gt;       &lt;p class="bullet_line_space"&gt;&lt;img src="http://www.riskglossary.com/images/bullet.gif" border="0" height="10" width="20" /&gt;the        CCC-GARCH of Bollerslev (&lt;a href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#Bollerslev"&gt;1990&lt;/a&gt;), &lt;/p&gt;       &lt;p class="bullet_line_space"&gt;&lt;img src="http://www.riskglossary.com/images/bullet.gif" border="0" height="10" width="20" /&gt;the        orthogonal GARCH of Ding (&lt;a href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#Ding"&gt;1994&lt;/a&gt;), Alexander and Chibumba (&lt;a href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#Alexander"&gt;1997&lt;/a&gt;), and Klaassen (&lt;a 
href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#Klaassen"&gt;2000&lt;/a&gt;), and &lt;/p&gt;       &lt;p class="bullet_line_space"&gt;&lt;img src="http://www.riskglossary.com/images/bullet.gif" border="0" height="10" width="20" /&gt;the        DCC-GARCH of Engle (&lt;a href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#Engle00"&gt;2000&lt;/a&gt;), and Engle and Sheppard (&lt;a href="http://www.riskglossary.com/articles/ARCH_GARCH.htm#Engle%20Sheppard"&gt;2001&lt;/a&gt;). &lt;/p&gt;       &lt;p class="body"&gt;With some of these approaches, the number of parameters that must be        specified becomes unmanageable as dimensionality &lt;i&gt;n&lt;/i&gt; increases. With some, estimation requires considerable user intervention or entails other challenges. Some require assumptions that are difficult to reconcile with phenomena to be modeled. This is an area of ongoing research. &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-2980588129394850358?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/2980588129394850358/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=2980588129394850358' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/2980588129394850358'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/2980588129394850358'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2008/05/arch-and-garch-processes.html' title='ARCH and GARCH Processes'/><author><name>Xiong  (Johnny) 
Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-8071453525758157433</id><published>2007-12-10T13:15:00.001Z</published><updated>2007-12-10T13:18:49.753Z</updated><title type='text'>Greek alphabet, again</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_1ET8wy93LPw/R108GUFmNoI/AAAAAAAAANc/3M5bk77Cfo0/s1600-h/1.JPG"&gt;&lt;img style="cursor: pointer; width: 698px; height: 176px;" src="http://bp2.blogger.com/_1ET8wy93LPw/R108GUFmNoI/AAAAAAAAANc/3M5bk77Cfo0/s400/1.JPG" alt="" id="BLOGGER_PHOTO_ID_5142332428696041090" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-8071453525758157433?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/8071453525758157433/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=8071453525758157433' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/8071453525758157433'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/8071453525758157433'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/12/greek-alphabet-again.html' title='Greek alphabet, again'/><author><name>Xiong  (Johnny) 
Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp2.blogger.com/_1ET8wy93LPw/R108GUFmNoI/AAAAAAAAANc/3M5bk77Cfo0/s72-c/1.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-5216278726911992473</id><published>2007-11-29T00:27:00.001Z</published><updated>2007-11-29T00:27:28.536Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Data'/><title type='text'>Dummy Variables</title><content type='html'>&lt;h2&gt;Why use dummies?&lt;/h2&gt; &lt;p&gt; Regression analysis is used with numerical variables. Results only have a valid interpretation if it makes sense to assume that having a value of 2 on some variable does indeed mean having twice as much of something as a 1, and having a 50 means 50 times as much as a 1. &lt;/p&gt;&lt;p&gt; However, social scientists often need to work with categorical variables in which the different values have no real numerical relationship with each other. Examples include variables for race, political affiliation, or marital status. If you have a variable for political affiliation with possible responses including Democrat, Independent, and Republican, it obviously doesn't make sense to assign values of 1 - 3 and interpret that as meaning that a Republican is somehow three times as politically affiliated as a Democrat. &lt;/p&gt;&lt;p&gt;The solution is to use dummy variables - variables with only two values, zero and one. It does make sense to create a variable called "Republican" and interpret it as meaning that someone assigned a 1 on this variable is Republican and someone with a 0 is not. 
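&lt;/p&gt;&lt;p&gt;As a rough illustration of the idea, here is a minimal Python sketch. The helper name dummy_code and the four survey responses are hypothetical and were invented for this example.&lt;/p&gt;
&lt;pre&gt;
```python
def dummy_code(values, category):
    """Return 1 where a value equals the category and 0 elsewhere."""
    return [1 if value == category else 0 for value in values]


# Hypothetical survey responses, used only for illustration.
affiliation = ["Democrat", "Republican", "Independent", "Republican"]

republican = dummy_code(affiliation, "Republican")
democrat = dummy_code(affiliation, "Democrat")
print(republican)  # [0, 1, 0, 1]
print(democrat)    # [1, 0, 0, 0]
```
&lt;/pre&gt;
&lt;p&gt;Each resulting column only records membership in one category, so no ordering or distance between the categories is implied.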
&lt;a name="nominal"&gt;&lt;/a&gt;&lt;/p&gt;&lt;h2&gt;Nominal variables with multiple levels&lt;/h2&gt; &lt;p&gt; If you have a nominal  variable that has more than two levels, you need to  create multiple dummy variables to "take the place of" the original nominal variable. For  example, imagine that you wanted to predict depression from year in school:   freshman, sophomore, junior, or senior. Obviously, "year in school" has more  than two levels.   &lt;/p&gt;&lt;p&gt;  What you need to do is to recode "year in school" into a set of  dummy variables, each of which has two levels. The first step in this process is  to decide the number of dummy variables. This is easy; it's simply k-1, where k  is the number of levels of the original variable.  &lt;/p&gt;&lt;p&gt;You could also create dummy variables for all levels in the original variable, and simply drop one from each analysis.   &lt;/p&gt;&lt;p&gt; In this instance, we would  need to create 4-1=3 dummy variables. In order to create these variables, we are  going to take 3 of the levels of "year in school", and create a variable  corresponding to each level, which will have the value of yes or no (i.e., 1 or  0). In this instance, we can create variables called "sophomore," "junior," and  "senior." Each instance of "year in school" would then be recoded into a value  for "sophomore," "junior," and "senior."  If a person were a junior, then  "sophomore" would be equal to 0, "junior" would be equal to 1, and "senior"  would be equal to 0.   &lt;a name="interpreting"&gt;&lt;/a&gt;&lt;/p&gt;&lt;h2&gt;Interpreting results&lt;/h2&gt; &lt;p&gt;The decision as to which level is not coded is often arbitrary. The level which is not coded is the category to which all other categories will be compared. As such, often the biggest group will be the not-coded category. For example, often "Caucasian" will be the not-coded group if that is the race of the majority of participants in the sample. 
In that case, if you have a variable called "Asian", the coefficient on the "Asian" variable in your regression will show the effect that being Asian rather than Caucasian has on your dependent variable. &lt;/p&gt;  In our example,  "freshman" was not coded so that we could determine if being a sophomore,  junior, or senior predicts a different depressive level than being a freshman.  Consequently, if the variable "junior" was significant in our regression, with  a positive beta coefficient, this would mean that juniors are significantly more  depressed than freshmen. Alternatively, we could have decided to not code  "senior," if we thought that being a senior is qualitatively different from  being of another year.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-5216278726911992473?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/5216278726911992473/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=5216278726911992473' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/5216278726911992473'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/5216278726911992473'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/11/dummy-variables.html' title='Dummy Variables'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-1273550460685520018</id><published>2007-11-29T00:07:00.000Z</published><updated>2007-11-29T00:08:16.821Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Data'/><title type='text'>Levels of Measurement</title><content type='html'>&lt;h1&gt;&lt;br /&gt;&lt;/h1&gt;              &lt;!-- InstanceBeginEditable name="content" --&gt;&lt;p&gt;The level of measurement refers to the relationship among the values that are assigned to the attributes for a variable. What does that mean? Begin with the idea of the variable, in this example "party affiliation." &lt;img src="http://www.socialresearchmethods.net/kb/Assets/images/measlevl.gif" align="right" height="270" hspace="10" vspace="10" width="500" /&gt; That variable has a number of attributes. Let's assume that in this particular election context the only relevant attributes are "republican", "democrat", and "independent". For purposes of analyzing the results of this variable, we arbitrarily assign the values 1, 2 and 3 to the three attributes. The &lt;em&gt;&lt;strong&gt;level of measurement&lt;/strong&gt;&lt;/em&gt; describes the relationship among these three values. In this case, we are simply using the numbers as shorter placeholders for the lengthier text terms. We don't assume that higher values mean "more" of something and lower numbers signify "less". We don't assume that the value of 2 means that democrats are twice something that republicans are. We don't assume that republicans are in first place or have the highest priority just because they have the value of 1. In this case, we only use the values as a shorter name for the attribute. 
Here, we would describe the level of measurement as "nominal".&lt;/p&gt;  &lt;h2&gt;Why is Level of Measurement Important?&lt;/h2&gt;  &lt;p&gt;First, knowing the level of measurement helps you decide how to interpret the data from that variable. When you know that a measure is nominal (like the one just described), then you know that the numerical values are just short codes for the longer names. Second, knowing the level of measurement helps you decide what statistical analysis is appropriate on the values that were assigned. If a measure is nominal, then you know that you would never average the data values or do a t-test on the data.&lt;/p&gt;  &lt;p&gt;There are typically four levels of measurement that are defined: &lt;/p&gt;  &lt;ul&gt;&lt;li&gt;Nominal &lt;/li&gt;&lt;li&gt;Ordinal &lt;/li&gt;&lt;li&gt;Interval &lt;/li&gt;&lt;li&gt;Ratio &lt;/li&gt;&lt;/ul&gt;  &lt;p&gt;In &lt;strong&gt;nominal &lt;/strong&gt;measurement the numerical values just "name" the attribute uniquely. No ordering of the cases is implied. For example, jersey numbers in basketball are measures at the nominal level. A player with number 30 is not more of anything than a player with number 15, and is certainly not twice whatever number 15 is.&lt;/p&gt;  &lt;p&gt;In &lt;strong&gt;ordinal &lt;/strong&gt;measurement the attributes can be rank-ordered. Here, distances between attributes do not have any meaning. For example, on a survey you might code Educational Attainment as 0=less than H.S.; 1=some H.S.; 2=H.S. degree; 3=some college; 4=college degree; 5=post college. In this measure, higher numbers mean &lt;em&gt;more &lt;/em&gt;education. But is the distance from 0 to 1 the same as the distance from 3 to 4? Of course not. 
The interval between values is not interpretable in an ordinal measure.&lt;/p&gt;  &lt;p&gt;&lt;img src="http://www.socialresearchmethods.net/kb/Assets/images/measlev2.gif" align="left" height="268" hspace="10" vspace="10" width="400" /&gt; In &lt;strong&gt;interval &lt;/strong&gt;measurement the distance between attributes &lt;em&gt;does&lt;/em&gt; have meaning. For example, when we measure temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable. Because of this, it makes sense to compute an average of an interval variable, whereas it doesn't make sense to do so for ordinal scales. But note that in interval measurement ratios don't make any sense: 80 degrees is not twice as hot as 40 degrees (although the attribute value is twice as large).&lt;/p&gt;  &lt;p&gt;Finally, in &lt;strong&gt;ratio &lt;/strong&gt;measurement there is always an absolute zero that is meaningful. This means that you can construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable. In applied social research most "count" variables are ratio, for example, the number of clients in the past six months. Why? Because you can have zero clients and because it is meaningful to say that "...we had twice as many clients in the past six months as we did in the previous six months."&lt;/p&gt;  &lt;p&gt;It's important to recognize that there is a hierarchy implied in the level of measurement idea. At lower levels of measurement, assumptions tend to be less restrictive and data analyses tend to be less sensitive. At each level up the hierarchy, the current level includes all of the qualities of the one below it and adds something new. In general, it is desirable to have a higher level of measurement (e.g., interval or ratio) rather than a lower one (nominal or ordinal). 
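&lt;/p&gt;&lt;p&gt;As a rough illustration of this hierarchy, here is a minimal Python sketch. It is not part of the original text: the names LEVELS, ADDED and permissible are hypothetical, and the summaries attached to each level (mode, median, mean, ratios) follow the usual textbook convention, which is an assumption layered on top of the description above.&lt;/p&gt;&lt;pre&gt;
```python
# Hypothetical illustration: names and the exact summaries per level are
# assumptions, chosen to mirror the four levels described in the text.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# What each level adds on top of the level below it.
ADDED = {
    "nominal": {"mode", "frequency counts"},
    "ordinal": {"median", "rank order"},
    "interval": {"mean", "differences"},
    "ratio": {"ratios"},
}


def permissible(level):
    """Return every summary that is meaningful at the given level.

    Each level inherits the summaries of all levels below it,
    which is the hierarchy the text describes.
    """
    idx = LEVELS.index(level)
    allowed = set()
    for name in LEVELS[: idx + 1]:
        allowed |= ADDED[name]
    return allowed


if __name__ == "__main__":
    for level in LEVELS:
        print(level, sorted(permissible(level)))
```
&lt;/pre&gt;&lt;p&gt;Under these assumptions, permissible("ordinal") includes the median but not the mean, while permissible("ratio") includes every summary in the table.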
&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-1273550460685520018?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/1273550460685520018/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=1273550460685520018' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/1273550460685520018'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/1273550460685520018'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/11/levels-of-measurement.html' title='Levels of Measurement'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-2847106347731354763</id><published>2007-11-28T22:39:00.001Z</published><updated>2007-11-28T22:50:45.409Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Graph'/><title type='text'>AUC: Z distribution</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_1ET8wy93LPw/R03wkW_oQ7I/AAAAAAAAANM/bYtgv2S75tg/s1600-h/auc.JPG"&gt;&lt;img style="cursor: pointer;" src="http://bp0.blogger.com/_1ET8wy93LPw/R03wkW_oQ7I/AAAAAAAAANM/bYtgv2S75tg/s320/auc.JPG" alt="" id="BLOGGER_PHOTO_ID_5138027257338020786" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a onblur="try 
{parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_1ET8wy93LPw/R03uOW_oQ6I/AAAAAAAAANE/5JFgIPDTrmo/s1600-h/auc.JPG"&gt;&lt;br /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-2847106347731354763?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/2847106347731354763/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=2847106347731354763' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/2847106347731354763'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/2847106347731354763'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/11/auc-z-distribution.html' title='AUC: Z distribution'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp0.blogger.com/_1ET8wy93LPw/R03wkW_oQ7I/AAAAAAAAANM/bYtgv2S75tg/s72-c/auc.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-8452367353775322954</id><published>2007-11-28T19:27:00.000Z</published><updated>2007-11-28T19:30:50.719Z</updated><title type='text'>What is overfitting and how can I avoid it?</title><content type='html'>&lt;hr noshade="noshade" size="2" width="600"&gt;  &lt;pre&gt;The critical issue in developing a neural network is generalization: 
how&lt;br /&gt;well will the network make predictions for cases that are not in the&lt;br /&gt;training set? NNs, like other flexible nonlinear estimation methods such as&lt;br /&gt;kernel regression and smoothing splines, can suffer from either underfitting&lt;br /&gt;or overfitting. A network that is not sufficiently complex can fail to&lt;br /&gt;detect fully the signal in a complicated data set, leading to underfitting.&lt;br /&gt;A network that is too complex may fit the noise, not just the signal,&lt;br /&gt;leading to overfitting. Overfitting is especially dangerous because it can&lt;br /&gt;easily lead to predictions that are far beyond the range of the training&lt;br /&gt;data with many of the common types of NNs. Overfitting can also produce wild&lt;br /&gt;predictions in multilayer perceptrons even with noise-free data.&lt;br /&gt;&lt;br /&gt;For an elementary discussion of overfitting, see Smith (1996). For a more&lt;br /&gt;rigorous approach, see the article by Geman, Bienenstock, and Doursat (1992)&lt;br /&gt;on the bias/variance trade-off (it's not really a dilemma). We are talking&lt;br /&gt;about statistical bias here: the difference between the average value of an&lt;br /&gt;estimator and the correct value. Underfitting produces excessive bias in the&lt;br /&gt;outputs, whereas overfitting produces excessive variance. There are&lt;br /&gt;graphical examples of overfitting and underfitting in Sarle (1995, 1999).&lt;br /&gt;&lt;br /&gt;The best way to avoid overfitting is to use lots of training data. If you&lt;br /&gt;have at least 30 times as many training cases as there are weights in the&lt;br /&gt;network, you are unlikely to suffer from much overfitting, although you may&lt;br /&gt;get some slight overfitting no matter how large the training set is. For&lt;br /&gt;noise-free data, 5 times as many training cases as weights may be&lt;br /&gt;sufficient. 
But you can't arbitrarily reduce the number of weights for fear&lt;br /&gt;of underfitting.&lt;br /&gt;&lt;br /&gt;Given a fixed amount of training data, there are at least six approaches to&lt;br /&gt;avoiding underfitting and overfitting, and hence getting good&lt;br /&gt;generalization:&lt;br /&gt;&lt;br /&gt;o Model selection&lt;br /&gt;o Jittering&lt;br /&gt;o Early stopping&lt;br /&gt;o Weight decay&lt;br /&gt;o Bayesian learning&lt;br /&gt;o Combining networks&lt;br /&gt;&lt;br /&gt;The first five approaches are based on well-understood theory. Methods for&lt;br /&gt;combining networks do not have such a sound theoretical basis but are the&lt;br /&gt;subject of current research. These six approaches are discussed in more&lt;br /&gt;detail under subsequent questions.&lt;br /&gt;&lt;br /&gt;The complexity of a network is related to both the number of weights and the&lt;br /&gt;size of the weights. Model selection is concerned with the number of&lt;br /&gt;weights, and hence the number of hidden units and layers. The more weights&lt;br /&gt;there are, relative to the number of training cases, the more overfitting&lt;br /&gt;amplifies noise in the targets (Moody 1992). The other approaches listed&lt;br /&gt;above are concerned, directly or indirectly, with the size of the weights.&lt;br /&gt;Reducing the size of the weights reduces the "effective" number of&lt;br /&gt;weights--see Moody (1992) regarding weight decay and Weigend (1994)&lt;br /&gt;regarding early stopping. Bartlett (1997) obtained learning-theory results&lt;br /&gt;in which generalization error is related to the L_1 norm of the weights&lt;br /&gt;instead of the VC dimension.&lt;br /&gt;&lt;br /&gt;Overfitting is not confined to NNs with hidden units. Overfitting can occur&lt;br /&gt;in generalized linear models (networks with no hidden units) if either or&lt;br /&gt;both of the following conditions hold:&lt;br /&gt;&lt;br /&gt;1. 
The number of input variables (and hence the number of weights) is large&lt;br /&gt; with respect to the number of training cases. Typically you would want at&lt;br /&gt; least 10 times as many training cases as input variables, but with&lt;br /&gt; noise-free targets, twice as many training cases as input variables would&lt;br /&gt; be more than adequate. These requirements are smaller than those stated&lt;br /&gt; above for networks with hidden layers, because hidden layers are prone to&lt;br /&gt; creating ill-conditioning and other pathologies.&lt;br /&gt;&lt;br /&gt;2. The input variables are highly correlated with each other. This condition&lt;br /&gt; is called "multicollinearity" in the statistical literature.&lt;br /&gt; Multicollinearity can cause the weights to become extremely large because&lt;br /&gt; of numerical ill-conditioning--see "How does ill-conditioning affect NN&lt;br /&gt; training?"&lt;br /&gt;&lt;br /&gt;Methods for dealing with these problems in the statistical literature&lt;br /&gt;include ridge regression (similar to weight decay), partial least squares&lt;br /&gt;(similar to early stopping), and various methods with even stranger names,&lt;br /&gt;such as the lasso and garotte (van Houwelingen and le Cessie, ????).&lt;br /&gt;&lt;br /&gt;References:&lt;br /&gt;&lt;br /&gt; Bartlett, P.L. (1997), "For valid generalization, the size of the weights&lt;br /&gt; is more important than the size of the network," in Mozer, M.C., Jordan,&lt;br /&gt; M.I., and Petsche, T., (eds.) Advances in Neural Information Processing&lt;br /&gt; Systems 9, Cambridge, MA: The MIT Press, pp. 134-140.&lt;br /&gt;&lt;br /&gt; Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and&lt;br /&gt; the Bias/Variance Dilemma", Neural Computation, 4, 1-58.&lt;br /&gt;&lt;br /&gt; Moody, J.E. 
(1992), "The Effective Number of Parameters: An Analysis of&lt;br /&gt; Generalization and Regularization in Nonlinear Learning Systems", in&lt;br /&gt; Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural&lt;br /&gt; Information Processing Systems 4, 847-854.&lt;br /&gt;&lt;br /&gt; Sarle, W.S. (1995), "Stopped Training and Other Remedies for&lt;br /&gt; Overfitting," Proceedings of the 27th Symposium on the Interface of&lt;br /&gt; Computing Science and Statistics, 352-360,&lt;br /&gt; &lt;a href="ftp://ftp.sas.com/pub/neural/inter95.ps.Z"&gt;ftp://ftp.sas.com/pub/neural/inter95.ps.Z&lt;/a&gt; (this is a very large&lt;br /&gt; compressed postscript file, 747K, 10 pages)&lt;br /&gt;&lt;br /&gt; Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results,"&lt;br /&gt; &lt;a href="ftp://ftp.sas.com/pub/neural/dojo/dojo.html"&gt;ftp://ftp.sas.com/pub/neural/dojo/dojo.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; Smith, M. (1996). Neural Networks for Statistical Modeling, Boston:&lt;br /&gt; International Thomson Computer Press, ISBN 1-850-32842-0.&lt;br /&gt;&lt;br /&gt; van Houwelingen,H.C., and le Cessie, S. (????), "Shrinkage and penalized&lt;br /&gt; likelihood as methods to improve predictive accuracy,"&lt;br /&gt; &lt;a linkindex="15" href="http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/shrinkage.pdf"&gt;http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/shrinkage.pdf&lt;/a&gt; and&lt;br /&gt; &lt;a linkindex="15" href="http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/figures.pdf"&gt;http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/figures.pdf&lt;/a&gt;&lt;br /&gt;&lt;br /&gt; Weigend, A. (1994), "On overfitting and the effective number of hidden&lt;br /&gt; units," Proceedings of the 1993 Connectionist Models Summer School,&lt;br /&gt; 335-342.&lt;br /&gt;&lt;br /&gt;Copyright 1997, 1998, 1999, 2000, 2001, 2002 by Warren S. Sarle, Cary, NC,&lt;br /&gt;USA. 
Answers provided by other authors as cited below are copyrighted by&lt;br /&gt;those authors, who by submitting the answers for the FAQ give permission for&lt;br /&gt;the answer to be reproduced as part of the FAQ in any of the ways specified&lt;br /&gt;in part 1 of the FAQ.&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-8452367353775322954?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/8452367353775322954/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=8452367353775322954' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/8452367353775322954'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/8452367353775322954'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/11/what-is-overfitting-and-how-can-i-avoid.html' title='What is overfitting and how can I avoid it?'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-3551054859128574963</id><published>2007-11-28T14:24:00.000Z</published><updated>2007-11-28T14:28:45.772Z</updated><title type='text'>Overfitting and Underfitting</title><content type='html'>&lt;p:colorscheme colors="#ffffff,#000000,#808080,#000000,#bbe0e3,#333399,#009999,#99cc00"&gt;&lt;/p:colorscheme&gt;    &lt;span style="color: 
 rgb(51, 102, 0);font-size:133;" &gt;&lt;/span&gt;Overfitting and Underfitting&lt;br /&gt;&lt;br /&gt;&lt;div shape="_x0000_s1026"&gt;  &lt;div class="O1" style=""&gt;&lt;span style="font-size: 16pt; color: rgb(153, 51, 0);" lang="EN-US"&gt;&lt;b&gt;When fitting a model to noisy data (and real data are always noisy), we assume that the data were generated from some “true” model by making predictions at given values of the inputs and then adding some amount of noise to each point, where the noise is drawn from a normal distribution with an unknown variance.&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
  &lt;div class="O1" style=""&gt;&lt;span style="font-size: 16pt; color: rgb(153, 51, 0);" lang="EN-US"&gt;&lt;b&gt;&lt;u&gt;Our task is to discover both this model and the width of the noise distribution&lt;/u&gt;. In doing so, we aim for a &lt;u&gt;compromise between bias&lt;/u&gt;, where our model does not follow the right trend in the data (and so does not match the underlying truth well), &lt;u&gt;and variance&lt;/u&gt;, where our model fits the data points too closely, fitting the noise rather than capturing the true distribution. &lt;u&gt;These two extremes are known as underfitting and overfitting&lt;/u&gt;.&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
  &lt;div class="O1" style=""&gt;&lt;span style="font-size: 16pt; color: rgb(153, 51, 0);" lang="EN-US"&gt;&lt;b&gt;IMPORTANT: the higher the number of parameters in a model, the more complex the fits it can produce. If our model has more parameters than the “true” one, we risk &lt;u&gt;overfitting&lt;/u&gt;; if it has fewer parameters than the truth, we could &lt;u&gt;underfit&lt;/u&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;div shape="_x0000_s1026"&gt;  &lt;div class="O1" style=""&gt;&lt;span style="font-size: 16pt; color: rgb(153, 51, 0);" lang="EN-US"&gt;&lt;b&gt;The illustration shows how increasing the number of parameters in the model can result in overfitting. &lt;u&gt;The 9 data points were generated from a cubic polynomial, which has 4 parameters&lt;/u&gt; (the true model), with noise added. We can see that by selecting candidate models with more parameters than the truth, we can reduce, and even eliminate, any mismatch between the data points and our model. The mismatch vanishes when the number of parameters is the same as the number of data points (an 8th-order polynomial has 9 parameters).&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;    &lt;/div&gt;&lt;/div&gt;    &lt;/div&gt;&lt;div shape="_x0000_s1026"&gt;&lt;div class="O1" style=""&gt;&lt;span style="color: rgb(153, 51, 0);font-size:16;" lang="EN-US" &gt;&lt;b&gt;&lt;br /&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="http://bp1.blogger.com/_1ET8wy93LPw/R016sm_oQ5I/AAAAAAAAAM8/SLSkiR9O1jg/s1600-h/overfitting_underfittiong.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp1.blogger.com/_1ET8wy93LPw/R016sm_oQ5I/AAAAAAAAAM8/SLSkiR9O1jg/s320/overfitting_underfittiong.png" alt="" id="BLOGGER_PHOTO_ID_5137897656699863954" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-3551054859128574963?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/3551054859128574963/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=3551054859128574963' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/3551054859128574963'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/3551054859128574963'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/11/overfitting-and-underfitting.html' title='Overfitting and Underfitting'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp1.blogger.com/_1ET8wy93LPw/R016sm_oQ5I/AAAAAAAAAM8/SLSkiR9O1jg/s72-c/overfitting_underfittiong.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-5809434346056913863</id><published>2007-11-24T16:11:00.000Z</published><updated>2007-11-24T16:12:56.699Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Concept'/><title type='text'>Data Mining Techniques</title><content type='html'>&lt;center&gt;&lt;span style="font-size:180%;color:#aa0000;"&gt;&lt;b&gt;Data Mining Techniques&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/center&gt; &lt;hr size="1"&gt;&lt;a name="index"&gt;&lt;/a&gt; referenced from  http://www.statsoft.com/textbook/stdatmin.html&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a linkindex="0" href="http://www.statsoft.com/textbook/stdatmin.html#mining"&gt;Data Mining&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a linkindex="1" href="http://www.statsoft.com/textbook/stdatmin.html#concepts"&gt;Crucial Concepts in Data Mining&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a linkindex="2" href="http://www.statsoft.com/textbook/stdatmin.html#warehousing"&gt;Data Warehousing&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a linkindex="3" href="http://www.statsoft.com/textbook/stdatmin.html#olap"&gt;On-Line Analytic Processing (OLAP)&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a linkindex="4" href="http://www.statsoft.com/textbook/stdatmin.html#eda"&gt;Exploratory Data Analysis (EDA) and Data Mining Techniques&lt;/a&gt;   &lt;ul&gt;&lt;li&gt;&lt;a linkindex="5" href="http://www.statsoft.com/textbook/stdatmin.html#eda1"&gt;EDA vs. 
Hypothesis Testing&lt;/a&gt;   &lt;/li&gt;&lt;li&gt;&lt;a linkindex="6" href="http://www.statsoft.com/textbook/stdatmin.html#eda2"&gt;Computational EDA Techniques&lt;/a&gt;   &lt;/li&gt;&lt;li&gt;&lt;a set="yes" linkindex="7" href="http://www.statsoft.com/textbook/stdatmin.html#eda3"&gt;Graphical (data visualization) EDA techniques&lt;/a&gt;   &lt;/li&gt;&lt;li&gt;&lt;a set="yes" linkindex="8" href="http://www.statsoft.com/textbook/stdatmin.html#eda4"&gt;Verification of results of EDA&lt;/a&gt;   &lt;/li&gt;&lt;/ul&gt; &lt;/li&gt;&lt;li&gt;&lt;a linkindex="9" href="http://www.statsoft.com/textbook/stdatmin.html#neural"&gt;Neural Networks&lt;/a&gt; &lt;/li&gt;&lt;/ul&gt;&lt;hr size="1"&gt;  &lt;a name="mining"&gt;&lt;/a&gt; &lt;p&gt;&lt;span style="font-size:180%;color:navy;"&gt;Data Mining&lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;i&gt;Data Mining&lt;/i&gt; is an analytic process designed to explore data (usually large amounts of data - typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and &lt;a linkindex="10" href="http://www.statsoft.com/textbook/stdatmin.html#pdm"&gt;predictive data mining&lt;/a&gt; is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with &lt;a linkindex="11" href="http://www.statsoft.com/textbook/glosc.html#Cross-Validation"&gt;validation/veri&lt;wbr&gt;fication&lt;/a&gt;, and (3) &lt;a linkindex="12" href="http://www.statsoft.com/textbook/stdatmin.html#deploy"&gt;deployment&lt;/a&gt; (i.e., the application of the model to new data in order to generate predictions).   
&lt;img src="http://www.statsoft.com/textbook/graphics/stadatm1.gif" alt="" align="left" border="0" height="257" hspace="10" vspace="10" width="326" /&gt; &lt;/p&gt;&lt;p&gt; &lt;b&gt;Stage 1:  Exploration.&lt;/b&gt;  This stage usually starts with data preparation which may involve cleaning data, data transformations&lt;wbr&gt;, selecting subsets of records and - in case of data sets with large numbers of variables ("fields") - performing some preliminary &lt;a linkindex="13" href="http://www.statsoft.com/textbook/stdatmin.html#feature"&gt;feature selection&lt;/a&gt; operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of the analytic problem, this first stage of the process of data mining may involve anywhere between a simple choice of straightforward&lt;wbr&gt; predictors for a regression model, to elaborate exploratory analyses using a wide variety of graphical and statistical methods (see &lt;a set="yes" linkindex="14" href="http://www.statsoft.com/textbook/stdatmin.html#eda"&gt;&lt;i&gt;Exploratory Data Analysis (EDA)&lt;/i&gt;&lt;/a&gt;) in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage. &lt;/p&gt;&lt;p&gt;  &lt;b&gt;Stage 2:  Model building and validation.&lt;/b&gt; This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal - many of which are based on so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best. 
These techniques - which are often considered the core of &lt;a linkindex="15" href="http://www.statsoft.com/textbook/stdatmin.html#pdm"&gt;predictive data mining&lt;/a&gt; - include: &lt;a linkindex="16" href="http://www.statsoft.com/textbook/stdatmin.html#bagging"&gt;Bagging&lt;/a&gt; (Voting, Averaging), &lt;a linkindex="17" href="http://www.statsoft.com/textbook/stdatmin.html#boosting"&gt;Boosting&lt;/a&gt;, &lt;a linkindex="18" href="http://www.statsoft.com/textbook/stdatmin.html#stackedgen"&gt;Stacking (Stacked Generalizations&lt;wbr&gt;)&lt;/a&gt;, and &lt;a linkindex="19" href="http://www.statsoft.com/textbook/stdatmin.html#meta"&gt;Meta-Learning&lt;/a&gt;. &lt;/p&gt;&lt;p&gt;  &lt;b&gt;Stage 3:  Deployment.&lt;/b&gt; That final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.&lt;/p&gt;&lt;p&gt;  The concept of &lt;i&gt;Data Mining&lt;/i&gt; is becoming increasingly popular as a business information management tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business &lt;i&gt;Data Mining&lt;/i&gt; (e.g., &lt;a linkindex="20" href="http://www.statsoft.com/textbook/stclatre.html"&gt;&lt;i&gt;Classification Trees&lt;/i&gt;&lt;/a&gt;), but Data Mining is still based on the conceptual principles of statistics including the traditional &lt;a linkindex="21" href="http://www.statsoft.com/textbook/stdatmin.html#eda"&gt;&lt;i&gt;Exploratory Data Analysis (EDA)&lt;/i&gt;&lt;/a&gt; and modeling and it shares with them both some components of its general approaches and specific techniques. 
&lt;/p&gt;&lt;p&gt;  However, an important general difference in the focus and purpose between &lt;i&gt;Data Mining&lt;/i&gt; and the traditional &lt;a set="yes" linkindex="22" href="http://www.statsoft.com/textbook/stdatmin.html#eda"&gt;&lt;i&gt;Exploratory Data Analysis (EDA)&lt;/i&gt;&lt;/a&gt; is that &lt;i&gt;Data Mining&lt;/i&gt; is more oriented towards applications than the basic nature of the underlying phenomena. In other words, &lt;i&gt;Data Mining &lt;/i&gt;is relatively less concerned with identifying the specific relations between the involved variables. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables are not the main goal of &lt;i&gt;Data Mining&lt;/i&gt;. Instead, the focus is on producing a solution that can generate useful predictions. Therefore, &lt;i&gt;Data Mining&lt;/i&gt; accepts among others a "black box" approach to data exploration or knowledge discovery and uses not only the traditional &lt;a set="yes" linkindex="23" href="http://www.statsoft.com/textbook/stdatmin.html#eda"&gt;&lt;i&gt;Exploratory Data Analysis (EDA)&lt;/i&gt;&lt;/a&gt; techniques, but also such techniques as &lt;a linkindex="24" href="http://www.statsoft.com/textbook/stdatmin.html#neural"&gt;&lt;i&gt;Neural Networks&lt;/i&gt;&lt;/a&gt; which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based. &lt;/p&gt;&lt;p&gt;  &lt;i&gt;Data Mining&lt;/i&gt; is often considered to be "&lt;i&gt;a blend of statistics, AI [artificial intelligence], and data base research&lt;/i&gt;" (Pregibon, 1997, p. 8), which until very recently was not commonly recognized as a field of interest for statisticians, and was even considered by some "&lt;i&gt;a dirty word in Statistics&lt;/i&gt;" (Pregibon, 1997, p. 8). 
Due to its applied importance, however, the field emerges as a rapidly growing and major area (also in statistics) where important theoretical advances are being made (see, for example, the recent annual &lt;i&gt;International Conferences on Knowledge Discovery and Data Mining&lt;/i&gt;, co-hosted by the &lt;i&gt;American Statistical Association&lt;/i&gt;). &lt;/p&gt;&lt;p&gt; For information on &lt;i&gt;Data Mining&lt;/i&gt; techniques, please review the summary topics included below in this chapter of the &lt;i&gt;Electronic Statistics Textbook&lt;/i&gt;. There are numerous books that review the theory and practice of data mining; the following books offer a representative sample of recent general books on data mining, representing a variety of approaches and perspectives:&lt;br /&gt;&lt;br /&gt;Berry, M. J. A., &amp;amp; Linoff, G. S. (2000). &lt;i&gt;Mastering data mining&lt;/i&gt;. New York: Wiley.&lt;br /&gt;&lt;br /&gt;Edelstein, H. A. (1999). &lt;i&gt;Introduction to data mining and knowledge discovery (3rd ed)&lt;/i&gt;. Potomac, MD: Two Crows Corp.&lt;br /&gt;&lt;br /&gt;Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., &amp;amp; Uthurusamy, R. (1996). &lt;i&gt;Advances in knowledge discovery &amp;amp; data mining&lt;/i&gt;. Cambridge, MA: MIT Press.&lt;br /&gt;&lt;br /&gt;Han, J., &amp;amp; Kamber, M. (2000). &lt;i&gt;Data mining: Concepts and Techniques&lt;/i&gt;. New York: Morgan Kaufmann.&lt;br /&gt;&lt;br /&gt;Hastie, T., Tibshirani, R., &amp;amp; Friedman, J. H. (2001). &lt;i&gt;The elements of statistical learning: Data mining, inference, and prediction&lt;/i&gt;. New York: Springer.&lt;br /&gt;&lt;br /&gt;Pregibon, D. (1997). &lt;i&gt;Data Mining&lt;/i&gt;. Statistical Computing and Graphics, 7, 8. &lt;br /&gt;&lt;br /&gt;Weiss, S. M., &amp;amp; Indurkhya, N. (1997). &lt;i&gt;Predictive data mining: A practical guide&lt;/i&gt;. New York: Morgan Kaufmann.&lt;br /&gt;&lt;br /&gt;Westphal, C., &amp;amp; Blaxton, T. (1998). 
&lt;i&gt;Data mining solutions&lt;/i&gt;. New York: Wiley.&lt;br /&gt;&lt;br /&gt;Witten, I. H., &amp;amp; Frank, E. (2000). &lt;i&gt;Data mining&lt;/i&gt;. New York: Morgan-Kaufmann&lt;wbr&gt;.&lt;br /&gt;&lt;br /&gt; &lt;a name="concepts"&gt;&lt;span style="font-size:180%;color:navy;"&gt;Crucial Concepts in Data Mining&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;   &lt;a name="bagging"&gt;&lt;b&gt;Bagging (Voting, Averaging)&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;The concept of bagging (voting for classification,&lt;wbr&gt; averaging for regression-type&lt;wbr&gt; problems with continuous dependent variables of interest) applies to the area of &lt;a linkindex="25" href="http://www.statsoft.com/textbook/stdatmin.html#pdm"&gt;predictive data mining&lt;/a&gt;, to combine the predicted classifications&lt;wbr&gt; (prediction) from multiple models, or from the same type of model for different learning data. It is also used to address the inherent instability of results when applying complex models to relatively small data sets. Suppose your data mining task is to build a model for predictive classification,&lt;wbr&gt; and the dataset from which to train the model (learning data set, which contains observed classifications&lt;wbr&gt;) is relatively small. You could repeatedly sub-sample (with replacement) from the dataset, and apply, for example, a tree classifier (e.g., &lt;a linkindex="26" href="http://www.statsoft.com/textbook/glosc.html#Cart"&gt;C&amp;amp;RT&lt;/a&gt; and  &lt;a linkindex="27" href="http://www.statsoft.com/textbook/glosc.html#Chaid"&gt;CHAID&lt;/a&gt;) to the successive samples. In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small datasets. One method of deriving a single prediction (for new observations) is to use all trees found in the different samples, and to apply some simple voting: The final classification is the one most often predicted by the different trees. 
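&lt;/p&gt;&lt;p&gt;As a rough illustration of the voting scheme just described, the following Python sketch draws bootstrap samples (with replacement) from a tiny learning sample, fits a one-split "stump" to each, and takes the majority vote. The stump is a simplified stand-in for a full tree classifier, and the function names, the toy data, and the choice of 25 models are assumptions made for this example only.&lt;/p&gt;&lt;pre&gt;
```python
import random
from collections import Counter


def majority(labels, default):
    """Most frequent label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(sample):
    """Fit a one-split classifier on (x, label) pairs.

    Tries every observed x as a threshold and keeps the split
    with the fewest misclassified observations.
    """
    overall = majority([y for _, y in sample], None)
    best_errors, best_rule = None, None
    for t in sorted({x for x, _ in sample}):
        lo = majority([y for x, y in sample if t >= x], overall)
        hi = majority([y for x, y in sample if x > t], overall)
        errors = sum(1 for x, y in sample if (hi if x > t else lo) != y)
        if best_errors is None or best_errors > errors:
            best_errors, best_rule = errors, (t, lo, hi)
    return best_rule


def predict_stump(rule, x):
    t, lo, hi = rule
    return hi if x > t else lo


def bagged_predict(data, x, n_models=25, seed=0):
    """Bagging: fit one stump per bootstrap sample, then take a simple vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = [rng.choice(data) for _ in data]
        votes.append(predict_stump(fit_stump(sample), x))
    return Counter(votes).most_common(1)[0][0]


# Hypothetical learning sample: small x values are class "a", large ones "b".
learning_sample = [(1, "a"), (2, "a"), (3, "a"), (7, "b"), (8, "b"), (9, "b")]
print(bagged_predict(learning_sample, 0.5), bagged_predict(learning_sample, 9.5))
```
&lt;/pre&gt;&lt;p&gt;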
Note that some weighted combination of predictions (weighted vote, weighted average) is also possible, and commonly used. A sophisticated (&lt;a linkindex="28" href="http://www.statsoft.com/textbook/stdatmin.html#machinelearn"&gt;machine learning&lt;/a&gt;) algorithm for generating weights for weighted prediction or voting is the &lt;a linkindex="29" href="http://www.statsoft.com/textbook/stdatmin.html#boosting"&gt;Boosting procedure&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;  &lt;a name="boosting"&gt;&lt;b&gt;Boosting&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;The concept of boosting applies to the area of &lt;a linkindex="30" href="http://www.statsoft.com/textbook/stdatmin.html#pdm"&gt;predictive data mining&lt;/a&gt;, to generate multiple models or classifiers (for prediction or classification)&lt;wbr&gt;, and to derive weights to combine the predictions from those models into a single prediction or predicted classification (see also &lt;a linkindex="31" href="http://www.statsoft.com/textbook/stdatmin.html#bagging"&gt;Bagging&lt;/a&gt;).&lt;/p&gt;&lt;p&gt;  A simple algorithm for boosting works like this: Start by applying some method (e.g., a tree classifier such as &lt;a linkindex="32" href="http://www.statsoft.com/textbook/glosc.html#Cart"&gt;C&amp;amp;RT&lt;/a&gt; or &lt;a linkindex="33" href="http://www.statsoft.com/textbook/glosc.html#Chaid"&gt;CHAID&lt;/a&gt;) to the learning data, where each observation is assigned an equal weight. Compute the predicted classifications&lt;wbr&gt;, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification.&lt;wbr&gt; In other words, assign greater weight to those observations that were difficult to classify (where the misclassificati&lt;wbr&gt;on rate was high), and lower weights to those that were easy to classify (where the misclassificati&lt;wbr&gt;on rate was low). 
In the context of &lt;a linkindex="34" href="http://www.statsoft.com/textbook/glosc.html#Cart"&gt;C&amp;amp;RT&lt;/a&gt; for example, different misclassificati&lt;wbr&gt;on costs (for the different classes) can be applied, inversely proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted data (or with different misclassificati&lt;wbr&gt;on costs), and continue with the next iteration (application of the analysis method for classification to the re-weighted data).&lt;/p&gt;&lt;p&gt; Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an "expert" in classifying observations that were not well classified by those preceding it. During &lt;a linkindex="35" href="http://www.statsoft.com/textbook/stdatmin.html#deploy"&gt;deployment&lt;/a&gt; (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best prediction or classification.&lt;wbr&gt; &lt;/p&gt;&lt;p&gt;  Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassificati&lt;wbr&gt;on costs. 
In that case, random sub-sampling can be applied to the learning data in the successive steps of the iterative boosting procedure, where the probability for selection of an observation into the subsample is inversely proportional to the accuracy of the prediction for that observation in the previous iteration (in the sequence of iterations of the boosting procedure).&lt;/p&gt;&lt;p&gt;  &lt;a name="CRISP"&gt;&lt;b&gt;CRISP&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;See &lt;a linkindex="36" href="http://www.statsoft.com/textbook/stdatmin.html#Models%20for%20Data%20Mining"&gt;Models for Data Mining&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;  &lt;a name="dpidm"&gt;&lt;b&gt;Data Preparation (in Data Mining)&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying "garbage-in-garbage-out" is particularly applicable to typical data mining projects, where large data sets collected via some automatic methods (e.g., via the Web) serve as the input into the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), and the like. Analyzing data that have not been carefully screened for such problems can produce highly misleading results, in particular in &lt;a linkindex="37" href="http://www.statsoft.com/textbook/stdatmin.html#pdm"&gt;predictive data mining&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;  &lt;a name="drfdm"&gt;&lt;b&gt;Data Reduction (for Data Mining)&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;The term Data Reduction in the context of data mining is usually applied to projects where the goal is to aggregate or amalgamate the information contained in large datasets into manageable (smaller) information nuggets. 
Data reduction methods can include simple tabulation, aggregation (computing descriptive statistics) or more sophisticated techniques like &lt;a linkindex="38" href="http://www.statsoft.com/textbook/glosc.html#Cluster%20Analysis"&gt;clustering&lt;/a&gt;, &lt;a linkindex="39" href="http://www.statsoft.com/textbook/glosp.html#Principal%20Components%20Analysis"&gt;principal components analysis&lt;/a&gt;, etc.&lt;/p&gt;&lt;p&gt; See also &lt;a linkindex="40" href="http://www.statsoft.com/textbook/stdatmin.html#pdm"&gt;predictive data mining&lt;/a&gt;, &lt;a linkindex="41" href="http://www.statsoft.com/textbook/stdatmin.html#ddanal"&gt;drill-down analysis&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;  &lt;a name="deploy"&gt;&lt;b&gt;Deployment&lt;/b&gt;&lt;br /&gt;The concept of deployment in &lt;/a&gt;&lt;a linkindex="42" href="http://www.statsoft.com/textbook/stdatmin.html#pdm"&gt;predictive data mining&lt;/a&gt; refers to the application of a model for prediction or classification to new data. After a satisfactory model or set of models has been identified (trained) for a particular application, one usually wants to deploy those models so that predictions or predicted classifications&lt;wbr&gt; can quickly be obtained for new data. For example, a credit card company may want to deploy a trained model or set of models (e.g., neural networks, &lt;a linkindex="43" href="http://www.statsoft.com/textbook/stdatmin.html#meta"&gt;meta-learner&lt;/a&gt;) to quickly identify transactions which have a high probability of being fraudulent.&lt;/p&gt;&lt;p&gt;  &lt;a name="ddanal"&gt;&lt;b&gt;Drill-Down Analysis&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;The concept of drill-down analysis applies to the area of data mining, to denote the interactive exploration of data, in particular of large databases. The process of drill-down analyses begins by considering some simple break-downs of the data by a few variables of interest (e.g., Gender, geographic region, etc.). 
Various statistics, tables, histograms, and other graphical summaries can be computed for each group. Next, one may want to "drill-down" to expose and further analyze the data "underneath" one of the categorizations; for example, one might want to further review the data for males from the Midwest. Again, various statistical and graphical summaries can be computed for those cases only, which might suggest further break-downs by other variables (e.g., income, age, etc.). At the lowest ("bottom") level are the raw data: For example, you may want to review the addresses of male customers from one region, for a certain income group, etc., and to offer to those customers services of particular utility to that group. &lt;/p&gt;&lt;p&gt;  &lt;a name="feature"&gt;&lt;b&gt;Feature Selection&lt;/b&gt;&lt;br /&gt;One of the preliminary stages in &lt;/a&gt;&lt;a linkindex="44" href="http://www.statsoft.com/textbook/glosp.html#Predictive%20Data%20Mining"&gt;predictive data mining&lt;/a&gt;, when the data set includes more variables than could be included (or would be efficient to include) in the actual model building phase (or even in initial exploratory operations), is to select predictors from a large list of candidates. For example, when data are collected via automated (computerized) methods, it is not uncommon that measurements are recorded for thousands or hundreds of thousands (or more) of predictors. 
The standard analytic methods for predictive data mining, such as &lt;a set="yes" linkindex="45" href="http://www.statsoft.com/textbook/glosn.html#Neural%20Networks"&gt;neural network&lt;/a&gt; analyses, &lt;a linkindex="46" href="http://www.statsoft.com/textbook/stcart.html"&gt;classification and regression trees&lt;/a&gt;, &lt;a linkindex="47" href="http://www.statsoft.com/textbook/glosf.html#Generalized%20Linear%20Model"&gt;generalized linear models&lt;/a&gt;, or &lt;a linkindex="48" href="http://www.statsoft.com/textbook/glosf.html#General%20Linear%20Model"&gt;general linear models&lt;/a&gt; become impractical when the number of predictors exceeds a few hundred.  &lt;/p&gt;&lt;p&gt;Feature selection selects a subset of predictors from a large list of candidate predictors without assuming that the relationships between the predictors and the &lt;a linkindex="49" href="http://www.statsoft.com/textbook/glosi.html#Independent%20vs.%20Dependent%20Variables"&gt;dependent&lt;/a&gt; or outcome variables of interest are linear, or even monotone. Therefore, feature selection is used as a pre-processor for predictive data mining, to select manageable sets of predictors that are likely related to the dependent (outcome) variables of interest, for further analyses with any of the other methods for regression and classification.  &lt;/p&gt;&lt;p&gt;  &lt;a name="machinelearn"&gt;&lt;b&gt;Machine Learning&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;Machine learning, computational learning theory, and similar terms are often used in the context of &lt;a linkindex="50" href="http://www.statsoft.com/textbook/glosd.html#Data%20Mining"&gt;&lt;i&gt;Data Mining&lt;/i&gt;&lt;/a&gt;, to denote the application of generic model-fitting or classification algorithms for &lt;a linkindex="51" href="http://www.statsoft.com/textbook/stdatmin.html#pdm"&gt;predictive data mining&lt;/a&gt;. 
Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the "models" or techniques that are used to generate the prediction are interpretable or open to simple explanation. Good examples of this type of technique often applied to &lt;a linkindex="52" href="http://www.statsoft.com/textbook/stdatmin.html#pdm"&gt;predictive data mining&lt;/a&gt; are neural networks or &lt;a set="yes" linkindex="53" href="http://www.statsoft.com/textbook/stdatmin.html#meta"&gt;meta-learning techniques&lt;/a&gt; such as &lt;a linkindex="54" href="http://www.statsoft.com/textbook/stdatmin.html#boosting"&gt;boosting&lt;/a&gt;, etc. These methods usually involve the fitting of very complex "generic" models that are not related to any reasoning or theoretical understanding of underlying causal processes; instead, these techniques can be shown to generate accurate predictions or classifications in &lt;a linkindex="55" href="http://www.statsoft.com/textbook/glosc.html#Cross-Validation"&gt;cross-validation&lt;/a&gt; samples. &lt;/p&gt;&lt;p&gt; &lt;a name="meta"&gt;&lt;b&gt;Meta-Learning&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;The concept of meta-learning applies to the area of &lt;a set="yes" linkindex="56" href="http://www.statsoft.com/textbook/stdatmin.html#pdm"&gt;predictive data mining&lt;/a&gt;, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as &lt;a name="stackinggen"&gt;Stacking (Stacked Generalization)&lt;/a&gt;. 
&lt;/p&gt;&lt;p&gt; Suppose your data mining project includes tree classifiers, such as &lt;a linkindex="57" href="http://www.statsoft.com/textbook/glosc.html#Cart"&gt;C&amp;amp;RT&lt;/a&gt; and &lt;a linkindex="58" href="http://www.statsoft.com/textbook/glosc.html#Chaid"&gt;CHAID&lt;/a&gt;, linear discriminant analysis (e.g., see &lt;a linkindex="59" href="http://www.statsoft.com/textbook/glosd.html#Discriminant%20Function%20Analysis"&gt;GDA&lt;/a&gt;), and &lt;a linkindex="60" href="http://www.statsoft.com/textbook/stdatmin.html#neural"&gt;&lt;i&gt;Neural Networks&lt;/i&gt;&lt;/a&gt;. Each computes predicted classifications&lt;wbr&gt; for a &lt;a linkindex="61" href="http://www.statsoft.com/textbook/glosc.html#Cross-Validation"&gt;crossvalidation&lt;wbr&gt;&lt;/a&gt; sample, from which overall goodness-of-fit&lt;wbr&gt; statistics (e.g., misclassificati&lt;wbr&gt;on rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). The predictions from different classifiers can be used as input into a meta-learner, which will attempt to combine the predictions to create a final best predicted classification.&lt;wbr&gt; So, for example, the predicted classifications&lt;wbr&gt; from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier&lt;wbr&gt;, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy. 
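&lt;/p&gt;&lt;p&gt;The idea of feeding base-model predictions into a meta-learner can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two base "classifiers" are hypothetical hand-written rules rather than trained trees or neural networks, and the meta-learner is a simple frequency table (not the neural network meta-classifier mentioned above) that records which true class most often accompanies each pattern of base predictions in a toy cross-validation sample.&lt;/p&gt;&lt;pre&gt;
```python
from collections import Counter, defaultdict


def fit_meta_learner(base_models, holdout):
    """Return a classifier that maps each pattern of base predictions
    to the true label seen most often with that pattern in `holdout`."""
    table = defaultdict(Counter)
    for x, label in holdout:
        pattern = tuple(model(x) for model in base_models)
        table[pattern][label] += 1
    combos = {p: c.most_common(1)[0][0] for p, c in table.items()}
    # Used only for prediction patterns never seen in the holdout sample.
    fallback = Counter(label for _, label in holdout).most_common(1)[0][0]

    def stacked(x):
        pattern = tuple(model(x) for model in base_models)
        return combos.get(pattern, fallback)

    return stacked


# Two hypothetical, deliberately imperfect base classifiers.
def rule_a(x):
    return "pos" if x > 2 else "neg"


def rule_b(x):
    return "pos" if x > 6 else "neg"


# Toy cross-validation sample: the true class is "pos" only for x in 3..6.
holdout = [(1, "neg"), (2, "neg"), (3, "pos"), (4, "pos"),
           (5, "pos"), (6, "pos"), (7, "neg"), (8, "neg")]

stacked = fit_meta_learner([rule_a, rule_b], holdout)
print([stacked(x) for x in (1, 4, 7)])
```
&lt;/pre&gt;&lt;p&gt;Neither rule is correct on its own for the whole range, but the table learns that "rule_a says pos, rule_b says neg" signals the positive class, so the combined classifier labels 1, 4, and 7 as neg, pos, and neg. In a real project the base models would be trained on separate learning data before their cross-validation predictions are passed to the meta-learner.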
&lt;/p&gt;&lt;p&gt; One can apply meta-learners to the results from different meta-learners to create "meta-meta"-learners, and so on; however, in practice such an exponential increase in the amount of data processing, in order to derive an accurate prediction, will yield less and less marginal utility.&lt;/p&gt;&lt;p&gt;  &lt;a name="Models for Data Mining"&gt;&lt;b&gt;Models for Data Mining&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;In the business environment, complex &lt;a linkindex="62" href="http://www.statsoft.com/textbook/glosd.html#Data%20Mining"&gt;data mining&lt;/a&gt; projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization. In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements. &lt;/p&gt;&lt;p&gt;One such model, CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining. This general approach postulates the following (perhaps not particularly controversial) general sequence of steps for data mining projects:&lt;/p&gt;  &lt;p&gt;&lt;img src="http://www.statsoft.com/textbook/graphics/ModelsDM1.gif" ratio="TRUE" style="border: medium none ; width: 333px; height: 185px; float: none;" border="0" height="185" width="333" /&gt;&lt;/p&gt;  &lt;p&gt;Another approach - the &lt;a linkindex="63" href="http://www.statsoft.com/textbook/gloss.html#Six%20Sigma%20DMAIC"&gt;Six Sigma&lt;/a&gt; methodology - is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities. 
This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to be gaining favor worldwide. It postulates a sequence of so-called DMAIC steps - &lt;/p&gt;  &lt;p&gt;&lt;img src="http://www.statsoft.com/textbook/graphics/ModelsDM2.gif" ratio="TRUE" style="border: medium none ; width: 359px; height: 24px; float: none;" border="0" height="24" width="359" /&gt;&lt;/p&gt;  &lt;p&gt;- that grew out of the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," i.e., service industries). &lt;/p&gt;  &lt;p&gt;Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by SAS Institute called SEMMA -&lt;/p&gt;  &lt;p&gt;&lt;img src="http://www.statsoft.com/textbook/graphics/ModelsDM3.gif" ratio="TRUE" style="border: medium none ; width: 340px; height: 26px; float: none;" border="0" height="26" width="340" /&gt; &lt;/p&gt;  &lt;p&gt;- which focuses more on the technical activities typically involved in a data mining project. &lt;/p&gt;  &lt;p&gt;All of these models are concerned with the process of how to integrate data mining methodology into an organization, how to "convert data into information," how to involve important stakeholders, and how to disseminate the information in a form that can easily be converted by stakeholders into resources for strategic decision making.&lt;/p&gt;  &lt;p&gt;Some software tools for data mining are specifically designed and documented to fit into one of these specific frameworks. 
&lt;/p&gt;  &lt;p&gt;The general underlying philosophy of StatSoft's &lt;a linkindex="64" href="http://www.statsoft.com/textbook/gloss.html#Data%20Miner"&gt;&lt;i&gt;STATISTICA&lt;/i&gt; Data Miner&lt;/a&gt; is to provide a flexible data mining workbench that can be integrated into any organization, industry, or organizational culture, regardless of the general data mining process-model that the organization chooses to adopt. For example, &lt;i&gt;STATISTICA&lt;/i&gt; Data Miner can include the complete set of (specific) necessary tools for ongoing company-wide Six Sigma quality control efforts, and users can take advantage of its (still optional) DMAIC-centric user interface for industrial data mining tools. It can equally well be integrated into ongoing marketing research, CRM (Customer Relationship Management) projects, etc. that follow either the CRISP or SEMMA approach - it fits both of them perfectly well without favoring either one. Also, &lt;i&gt;STATISTICA&lt;/i&gt; Data Miner offers all the advantages of a general data mining oriented "development kit" that includes easy-to-use tools for incorporating into your projects not only such components as custom database gateway solutions, prompted interactive queries, or proprietary algorithms, but also systems of access privileges, workgroup management, and other collaborative work tools that allow you to design large-scale, enterprise-wide systems (e.g., following the CRISP, SEMMA, or a combination of both models) that involve your entire organization.&lt;/p&gt;&lt;p&gt;  &lt;a name="pdm"&gt;&lt;b&gt;Predictive Data Mining&lt;/b&gt;&lt;br /&gt;The term Predictive Data Mining is usually applied to data mining projects whose goal is to identify a statistical or neural network model or set of models that can be used to predict some response of interest. 
For example, a credit card company may want to engage in predictive data mining, to derive a (trained) model or set of models (e.g., neural networks, &lt;/a&gt;&lt;a linkindex="65" href="http://www.statsoft.com/textbook/stdatmin.html#meta"&gt;meta-learner&lt;/a&gt;) that can quickly identify transactions which have a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g., to identify clusters or segments of customers), in which case &lt;a set="yes" linkindex="66" href="http://www.statsoft.com/textbook/stdatmin.html#ddanal"&gt;drill-down&lt;/a&gt; descriptive and exploratory methods would be applied. &lt;a linkindex="67" href="http://www.statsoft.com/textbook/stdatmin.html#drfdm"&gt;Data reduction&lt;/a&gt; is another possible objective for data mining (e.g., to aggregate or amalgamate the information in very large data sets into useful and manageable chunks).&lt;/p&gt;&lt;p&gt;  &lt;a name="SEMMA"&gt;&lt;b&gt;SEMMA&lt;/b&gt;&lt;/a&gt;&lt;br /&gt; See &lt;a linkindex="68" href="http://www.statsoft.com/textbook/stdatmin.html#Models%20for%20Data%20Mining"&gt;Models for Data Mining&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;  &lt;b&gt;Stacked Generalization  &lt;/b&gt;&lt;br /&gt;See &lt;a linkindex="69" href="http://www.statsoft.com/textbook/stdatmin.html#stackedgen"&gt;Stacking&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;  &lt;a name="stackedgen"&gt;&lt;b&gt;Stacking (Stacked Generalization)&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;The concept of stacking (short for Stacked Generalization) applies to the area of &lt;a linkindex="70" href="http://www.statsoft.com/textbook/stdatmin.html#pdm"&gt;predictive data mining&lt;/a&gt;, to combine the predictions from multiple models. 
It is particularly useful when the types of models included in the project are very different.&lt;/p&gt;&lt;p&gt;  Suppose your data mining project includes tree classifiers, such as &lt;a linkindex="71" href="http://www.statsoft.com/textbook/glosc.html#Cart"&gt;C&amp;amp;RT&lt;/a&gt; or &lt;a linkindex="72" href="http://www.statsoft.com/textbook/glosc.html#Chaid"&gt;CHAID&lt;/a&gt;, linear discriminant analysis (e.g., see &lt;a linkindex="73" href="http://www.statsoft.com/textbook/glosd.html#Discriminant%20Function%20Analysis"&gt;GDA&lt;/a&gt;), and &lt;a linkindex="74" href="http://www.statsoft.com/textbook/stdatmin.html#neural"&gt;&lt;i&gt;Neural Networks&lt;/i&gt;&lt;/a&gt;. Each computes predicted classifications&lt;wbr&gt; for a &lt;a linkindex="75" href="http://www.statsoft.com/textbook/glosc.html#Cross-Validation"&gt;crossvalidation&lt;wbr&gt;&lt;/a&gt; sample, from which overall goodness-of-fit&lt;wbr&gt; statistics (e.g., misclassificati&lt;wbr&gt;on rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). 
In stacking, the predictions from different classifiers are used as input into a &lt;a linkindex="76" href="http://www.statsoft.com/textbook/stdatmin.html#meta"&gt;meta-learner&lt;/a&gt;, which attempts to combine the predictions to create a final best predicted classification.&lt;wbr&gt; So, for example, the predicted classifications&lt;wbr&gt; from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier&lt;wbr&gt;, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.&lt;/p&gt;&lt;p&gt;  Other methods for combining the prediction from multiple models or methods (e.g., from multiple datasets used for learning) are &lt;a linkindex="77" href="http://www.statsoft.com/textbook/stdatmin.html#boosting"&gt;Boosting&lt;/a&gt; and &lt;a linkindex="78" href="http://www.statsoft.com/textbook/stdatmin.html#bagging"&gt;Bagging&lt;/a&gt; (&lt;a linkindex="79" href="http://www.statsoft.com/textbook/stdatmin.html#voting"&gt;Voting&lt;/a&gt;).&lt;/p&gt;&lt;p&gt;   &lt;a name="textmining"&gt;&lt;b&gt;Text Mining&lt;/b&gt;&lt;/a&gt;&lt;br /&gt;While &lt;a linkindex="80" href="http://www.statsoft.com/textbook/glosd.html#Data%20Mining"&gt;&lt;i&gt;Data Mining&lt;/i&gt;&lt;/a&gt; is typically concerned with the detection of patterns in numeric data, very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data, text is often amorphous, and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, etc. and the preparation of the text processed in that manner for further analyses with numeric data mining techniques (e.g., to determine co-occurrences of concepts, key phrases, names, addresses, product names, etc.). 
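&lt;/p&gt;&lt;p&gt;To make the idea of co-occurrence counting concrete, here is a small Python sketch. It is an assumption-laden toy rather than a real text mining pipeline: the documents and the list of key terms are invented for the example, tokenization is a bare regular expression, and a term pair is counted once per document in which both terms appear.&lt;/p&gt;&lt;pre&gt;
```python
import re
from collections import Counter
from itertools import combinations


def cooccurrences(documents, terms):
    """Count, for each pair of key terms, the documents containing both."""
    wanted = {t.lower() for t in terms}
    counts = Counter()
    for doc in documents:
        # Crude tokenization: lowercase runs of letters only.
        tokens = re.findall(r"[a-z]+", doc.lower())
        present = sorted(wanted.intersection(tokens))
        counts.update(combinations(present, 2))
    return counts


# Hypothetical customer messages and key terms.
documents = [
    "Late delivery and refund requested",
    "Refund issued after late delivery",
    "Great product",
]
pair_counts = cooccurrences(documents, ["late", "refund", "product"])
print(pair_counts.most_common())
```
&lt;/pre&gt;&lt;p&gt;The resulting table of pair counts is ordinary numeric data, so it can be passed on to the numeric data mining techniques described elsewhere in this chapter.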
&lt;/p&gt;&lt;p&gt;   &lt;a name="voting"&gt;&lt;b&gt;Voting&lt;/b&gt;&lt;br /&gt;&lt;/a&gt; See &lt;a linkindex="81" href="http://www.statsoft.com/textbook/stdatmin.html#bagging"&gt;Bagging.&lt;/a&gt;&lt;/p&gt;&lt;p&gt;   &lt;/p&gt;&lt;p&gt;  &lt;table align="right"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt; &lt;a linkindex="82" href="http://www.statsoft.com/textbook/stdatmin.html#index"&gt; &lt;span style="font-size:78%;"&gt;To index&lt;/span&gt;&lt;/a&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt; &lt;/p&gt;&lt;p&gt; &lt;a name="warehousing"&gt;&lt;/a&gt; &lt;/p&gt;&lt;p&gt;&lt;span style="font-size:180%;color:navy;"&gt;Data Warehousing&lt;/span&gt;&lt;/p&gt;&lt;p&gt; StatSoft defines &lt;i&gt;data warehousing&lt;/i&gt; as a process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes. &lt;/p&gt;&lt;p&gt;&lt;img src="http://www.statsoft.com/textbook/graphics/stadatmquery.gif" alt="" align="left" border="0" height="244" hspace="10" vspace="10" width="326" /&gt;&lt;/p&gt;&lt;p&gt; The most efficient data warehousing architecture will be capable of incorporating or at least referencing all data available in the relevant enterprise-wide information management systems, using designated technology suitable for corporate database management (e.g., &lt;i&gt;Oracle&lt;/i&gt;, &lt;i&gt;Sybase&lt;/i&gt;, &lt;i&gt;MS SQL Server&lt;/i&gt;).  
Also, a flexible, high-performanc&lt;wbr&gt;e (see the &lt;a linkindex="83" href="http://www.statsoft.com/textbook/glosi.html#IDP"&gt;IDP technology&lt;/a&gt;), open architecture approach to data warehousing - that flexibly integrates with the existing corporate systems and allows the users to organize and efficiently reference for analytic purposes enterprise repositories of data of practically any complexity - is offered in StatSoft &lt;a linkindex="84" href="http://www.statsoft.com/textbook/glose.html#Enterprise-Wide%20Software%20Systems"&gt;enterprise systems&lt;/a&gt; such as &lt;a linkindex="85" href="http://www.statsoft.com/textbook/gloss.html#sedas"&gt;&lt;i&gt;SEDAS&lt;/i&gt;&lt;/a&gt; (&lt;i&gt;STATISTICA Enterprise-wide&lt;wbr&gt; Data Analysis System&lt;/i&gt;) and &lt;a linkindex="86" href="http://www.statsoft.com/textbook/gloss.html#sewss"&gt;&lt;i&gt;SEWSS&lt;/i&gt;&lt;/a&gt; (&lt;i&gt;STATISTICA Enterprise-wide&lt;wbr&gt; SPC System&lt;/i&gt;), which can also work in conjunction with &lt;a linkindex="87" href="http://www.statsoft.com/textbook/gloss.html#Data%20Miner"&gt;&lt;i&gt;STATISTICA Data Miner&lt;/i&gt;&lt;/a&gt; and &lt;a linkindex="88" href="http://www.statsoft.com/textbook/glosu.html#WebSTATISTICA"&gt;&lt;i&gt;WebSTATISTICA Server Applications&lt;/i&gt;&lt;/a&gt;. 
&lt;br /&gt;&lt;table align="right"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt; &lt;a linkindex="89" href="http://www.statsoft.com/textbook/stdatmin.html#index"&gt; &lt;span style="font-size:78%;"&gt;To index&lt;/span&gt;&lt;/a&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt; &lt;/p&gt;&lt;p&gt; &lt;a name="olap"&gt;&lt;/a&gt; &lt;span style="font-size:180%;color:navy;"&gt;On-Line Analytic Processing (OLAP)&lt;/span&gt;&lt;/p&gt;&lt;p&gt; The term &lt;i&gt;On-Line Analytic Processing&lt;/i&gt; - &lt;i&gt;OLAP&lt;/i&gt; (or &lt;i&gt;Fast Analysis of Shared Multidimensiona&lt;wbr&gt;l Information&lt;/i&gt; - &lt;i&gt;FASMI&lt;/i&gt;) refers to technology that allows users of multidimensiona&lt;wbr&gt;l databases to generate on-line descriptive or comparative summaries ("views") of data and other analytic queries. Note that despite its name, analyses referred to as &lt;i&gt;OLAP&lt;/i&gt; do not need to be performed truly "on-line" (or in real-time); the term applies to analyses of multidimensiona&lt;wbr&gt;l databases (that may, obviously, contain dynamically updated information) through efficient "multidimension&lt;wbr&gt;al" queries that reference various types of data.  &lt;i&gt;OLAP&lt;/i&gt; facilities can be integrated into corporate (enterprise-wid&lt;wbr&gt;e) database systems and they allow analysts and managers to monitor the performance of the business (e.g., such as various aspects of the manufacturing process or numbers and types of completed transactions at different locations) or the market. The final result of &lt;i&gt;OLAP&lt;/i&gt; techniques can be very simple (e.g., frequency tables, descriptive statistics, simple cross-tabulatio&lt;wbr&gt;ns) or more complex (e.g., they may involve seasonal adjustments, removal of outliers, and other forms of cleaning the data). 
Although &lt;a linkindex="90" href="http://www.statsoft.com/textbook/stdatmin.html#mining"&gt;Data Mining&lt;/a&gt; techniques can operate on any kind of unprocessed or even unstructured information, they can also be applied to the data views and summaries generated by &lt;i&gt;OLAP&lt;/i&gt; to provide more in-depth and often more multidimensiona&lt;wbr&gt;l knowledge. In this sense, &lt;a linkindex="91" href="http://www.statsoft.com/textbook/stdatmin.html#eda"&gt;Data Mining&lt;/a&gt; techniques could be considered to represent either a different analytic approach (serving different purposes than &lt;i&gt;OLAP&lt;/i&gt;) or as an analytic extension of &lt;i&gt;OLAP&lt;/i&gt;. &lt;table align="right"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt; &lt;a linkindex="92" href="http://www.statsoft.com/textbook/stdatmin.html#index"&gt; &lt;span style="font-size:78%;"&gt;To index&lt;/span&gt;&lt;/a&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt; &lt;/p&gt;&lt;p&gt; &lt;a name="eda"&gt;&lt;/a&gt; &lt;/p&gt;&lt;p&gt;&lt;span style="font-size:180%;color:navy;"&gt;Exploratory Data Analysis (EDA)&lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;a name="eda1"&gt;&lt;/a&gt; &lt;span style="font-size:130%;color:navy;"&gt;EDA vs. Hypothesis Testing&lt;/span&gt;&lt;/p&gt;&lt;p&gt;   As opposed to traditional &lt;i&gt;hypothesis testing&lt;/i&gt; designed to verify &lt;i&gt;a priori&lt;/i&gt; hypotheses about relations between variables (e.g., &lt;i&gt;"There is a positive correlation between the AGE of a person and his/her RISK TAKING disposition"&lt;/i&gt;), &lt;i&gt;exploratory data analysis (EDA)&lt;/i&gt; is used to identify systematic relations between variables when there are no (or not complete) &lt;i&gt;a priori&lt;/i&gt; expectations as to the nature of those relations. In a typical exploratory data analysis process, many variables are taken into account and compared, using a variety of techniques in the search for systematic patterns. 
&lt;/p&gt;&lt;p&gt;&lt;a name="eda2"&gt;&lt;/a&gt; &lt;span style="font-size:130%;color:navy;"&gt;Computational EDA techniques&lt;/span&gt;&lt;/p&gt;&lt;p&gt;Computational exploratory data analysis methods include both simple basic statistics and more advanced, designated multivariate exploratory techniques designed to identify patterns in multivariate data sets.&lt;/p&gt;&lt;p&gt;  &lt;b&gt;Basic statistical exploratory methods.&lt;/b&gt;  The &lt;a linkindex="93" href="http://www.statsoft.com/textbook/stbasic.html"&gt;basic statistical exploratory methods&lt;/a&gt; include such techniques as &lt;a linkindex="94" href="http://www.statsoft.com/textbook/stbasic.html#Descriptive%20statisticsb"&gt;examining distributions of variables&lt;/a&gt; (e.g., to identify highly skewed or non-normal, such as bi-modal patterns), reviewing large &lt;a linkindex="95" href="http://www.statsoft.com/textbook/stbasic.html#Correlations"&gt;correlation matrices&lt;/a&gt; for coefficients that meet certain thresholds (see example above), or examining &lt;a linkindex="96" href="http://www.statsoft.com/textbook/stbasic.html#frequency%20tables"&gt;multi-way frequency tables&lt;/a&gt; (e.g., "slice by slice" systematically reviewing combinations of levels of control variables). 
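The review of a correlation matrix for coefficients that meet a threshold can be sketched in a few lines of Python. The variable names, the made-up data values, and the 0.9 cut-off below are assumptions chosen only for illustration; the function names `pearson` and `strong_correlations` are likewise hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strong_correlations(data, threshold):
    """Return (name_a, name_b, r) for every pair whose abs(r) meets the threshold."""
    hits = []
    for a, b in combinations(sorted(data), 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            hits.append((a, b, r))
    return hits


# Made-up observations for four variables (illustration only).
data = {
    "age":    [25, 32, 41, 47, 55, 63],
    "income": [30, 38, 52, 58, 70, 75],
    "risk":   [8, 7, 5, 5, 3, 2],
    "shoe":   [42, 38, 44, 40, 39, 43],
}
for a, b, r in strong_correlations(data, threshold=0.9):
    print(a, b, round(r, 3))
```

Only the pairs that pass the threshold are reported, so a large matrix reduces to a short list worth inspecting by eye.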
&lt;/p&gt;&lt;p&gt;&lt;img src="http://www.statsoft.com/textbook/graphics/p204.gif" alt="[Correlations Screenshot]" height="246" width="326" /&gt;&lt;/p&gt;&lt;p&gt; &lt;b&gt;Multivariate exploratory techniques.&lt;/b&gt; Multivariate exploratory techniques designed specifically to identify patterns in multivariate (or univariate, such as sequences of measurements) data sets include: &lt;a linkindex="97" href="http://www.statsoft.com/textbook/stcluan.html"&gt;Cluster Analysis&lt;/a&gt;, &lt;a linkindex="98" href="http://www.statsoft.com/textbook/stfacan.html"&gt;Factor Analysis&lt;/a&gt;, &lt;a linkindex="99" href="http://www.statsoft.com/textbook/stdiscan.html"&gt;Discriminant Function Analysis&lt;/a&gt;, &lt;a linkindex="100" href="http://www.statsoft.com/textbook/stmulsca.html"&gt;Multidimensiona&lt;wbr&gt;l Scaling&lt;/a&gt;, &lt;a linkindex="101" href="http://www.statsoft.com/textbook/stloglin.html"&gt;Log-linear Analysis&lt;/a&gt;, &lt;a linkindex="102" href="http://www.statsoft.com/textbook/stcanan.html"&gt;Canonical Correlation&lt;/a&gt;, &lt;a linkindex="103" href="http://www.statsoft.com/textbook/stmulreg.html"&gt;Stepwise Linear&lt;/a&gt; and &lt;a linkindex="104" href="http://www.statsoft.com/textbook/stnonlin.html"&gt;Nonlinear (e.g., Logit) Regression&lt;/a&gt;, &lt;a linkindex="105" href="http://www.statsoft.com/textbook/stcoran.html"&gt;Correspondence Analysis&lt;/a&gt;, &lt;a linkindex="106" href="http://www.statsoft.com/textbook/sttimser.html"&gt;Time Series Analysis&lt;/a&gt;, and &lt;a linkindex="107" href="http://www.statsoft.com/textbook/stclatre.html"&gt;Classification Trees&lt;/a&gt;. 
&lt;/p&gt;&lt;p&gt;&lt;img src="http://www.statsoft.com/textbook/graphics/p241.gif" alt="[Cluster Analysis Screenshot]" height="246" width="326" /&gt;&lt;/p&gt;&lt;p&gt; &lt;b&gt;Neural Networks.&lt;/b&gt;  &lt;i&gt;Neural Networks&lt;/i&gt; are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data. &lt;/p&gt;&lt;p&gt; &lt;img src="http://www.statsoft.com/textbook/graphics/snn_iris.gif" alt="[Neural Network Example]" height="117" width="199" /&gt; &lt;/p&gt;&lt;p&gt; For more information, see &lt;a linkindex="108" href="http://www.statsoft.com/textbook/stdatmin.html#neural"&gt;Neural Networks&lt;/a&gt;; see also &lt;a linkindex="109" href="http://www.statsoft.com/textbook/gloss.html#SNN"&gt;&lt;i&gt;STATISTICA Neural Networks&lt;/i&gt;&lt;/a&gt;. &lt;/p&gt;&lt;p&gt;  &lt;/p&gt;&lt;p&gt;&lt;a name="eda3"&gt;&lt;/a&gt; &lt;span style="font-size:130%;color:navy;"&gt;Graphical (data visualization) EDA techniques&lt;/span&gt;&lt;/p&gt;&lt;p&gt; A large selection of powerful exploratory data analytic techniques is also offered by &lt;a set="yes" linkindex="110" href="http://www.statsoft.com/textbook/stgraph.html"&gt;graphical data visualization methods&lt;/a&gt; that can identify relations, trends, and biases "hidden"&lt;i&gt; &lt;/i&gt;in unstructured data sets. 
&lt;/p&gt;&lt;p&gt;&lt;br /&gt; &lt;img src="http://www.statsoft.com/textbook/graphics/p155.gif" align="right" border="0" height="245" width="326" /&gt;&lt;b&gt;Brushing.&lt;/b&gt;  Perhaps the most common and historically first widely used technique explicitly identified as &lt;i&gt;graphical exploratory data analysis&lt;/i&gt; is &lt;a linkindex="111" href="http://www.statsoft.com/textbook/stgraph.html#brushing"&gt;&lt;i&gt;brushing&lt;/i&gt;&lt;/a&gt;, an interactive method allowing one to select on-screen specific data points or subsets of data and identify their (e.g., common) characteristics, or to examine their effects on relations between relevant variables. Those relations between variables can be visualized by fitted functions (e.g., 2D lines or 3D surfaces) and their confidence intervals; thus, for example, one can examine changes in those functions by interactively (temporarily) removing or adding specific subsets of data. For example, one of many applications of the brushing technique is to select (i.e., highlight) in a matrix scatterplot all data points that belong to a certain category (e.g., a "medium" income level, see the highlighted subset in the fourth component graph of the first row in the illustration left) in order to examine how those specific observations contribute to relations between other variables in the same data set (e.g., the correlation between the "debt" and "assets" in the current example). If the brushing facility supports features like "animated brushing" or "automatic function re-fitting", one can define a dynamic brush that would move over the consecutive ranges of a criterion variable (e.g., "income" measured on a continuous scale or a discrete [3-level] scale as on the illustration above) and examine the dynamics of the contribution of the criterion variable to the relations between other relevant variables in the same data set. 
&lt;br /&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;img src="http://www.statsoft.com/textbook/graphics/stdatmin2Dbrush_sm.gif" alt="[2D Animated Brushing]" align="left" border="0" height="215" width="286" /&gt;&lt;/td&gt;&lt;td&gt; &lt;img src="http://www.statsoft.com/textbook/graphics/stdatmin3Dbrush_sm.gif" alt="[3D Animated Brushing]" align="left" border="0" height="215" width="286" /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt; &lt;b&gt;Other graphical EDA techniques.&lt;/b&gt;  Other graphical exploratory analytic techniques include function fitting and plotting, &lt;a linkindex="112" href="http://www.statsoft.com/textbook/stgraph.html#smoothing"&gt;data smoothing&lt;/a&gt;, overlaying and merging of multiple displays, categorizing data, splitting/mergi&lt;wbr&gt;ng subsets of data in graphs, aggregating data in graphs, &lt;a linkindex="113" href="http://www.statsoft.com/textbook/stgraph.html#icon7"&gt;identifying and marking subsets of data that meet specific conditions&lt;/a&gt;, &lt;a linkindex="114" href="http://www.statsoft.com/textbook/stgraph.html#icon%20plots"&gt;icon plots&lt;/a&gt;, &lt;/p&gt;&lt;p&gt; &lt;img src="http://www.statsoft.com/textbook/popups/popup163.gif" border="0" height="240" width="340" /&gt; &lt;/p&gt;&lt;p&gt; shading, plotting confidence intervals and confidence areas (e.g., &lt;a linkindex="115" href="http://www.statsoft.com/textbook/glose.html#Ellipse,%20%28Confidence%29"&gt;ellipses&lt;/a&gt;), &lt;/p&gt;&lt;p&gt; &lt;img src="http://www.statsoft.com/textbook/popups/popup155.gif" height="213" width="211" /&gt; &lt;/p&gt;&lt;p&gt;  generating &lt;a linkindex="116" href="http://www.statsoft.com/textbook/glosu.html#Voronoi"&gt;tessellations&lt;/a&gt;, &lt;a linkindex="117" href="http://www.statsoft.com/textbook/gloss.html#Spectral%20Plot"&gt;spectral planes&lt;/a&gt;, &lt;/p&gt;&lt;p&gt; &lt;img src="http://www.statsoft.com/textbook/popups/popup121.gif" border="0" height="312" width="343" /&gt; 
&lt;/p&gt;&lt;p&gt; integrated &lt;a linkindex="118" href="http://www.statsoft.com/textbook/stgraph.html#layered"&gt;layered compressions&lt;/a&gt;, &lt;/p&gt;&lt;p&gt;&lt;img src="http://www.statsoft.com/textbook/graphics/p153.gif" alt="[Layered Compression Screenshot]" height="246" width="326" /&gt; &lt;/p&gt;&lt;p&gt;&lt;a name="eda4"&gt;&lt;/a&gt; &lt;/p&gt;&lt;p&gt; and &lt;a linkindex="119" href="http://www.statsoft.com/textbook/stgraph.html#projections"&gt;projected contours&lt;/a&gt;, &lt;a linkindex="120" href="http://www.statsoft.com/textbook/stgraph.html#reduction"&gt;data image reduction techniques&lt;/a&gt;, &lt;a linkindex="121" href="http://www.statsoft.com/textbook/stgraph.html#rotation"&gt;interactive (and continuous) rotation&lt;/a&gt; &lt;/p&gt;&lt;p&gt;&lt;img src="http://www.statsoft.com/textbook/graphics/an_rotate.gif" alt="[Data Rotation Animation]" height="185" width="185" /&gt;&lt;/p&gt;&lt;p&gt; with animated stratification (cross-sections) of 3D displays, and selective highlighting of specific series and blocks of data. &lt;/p&gt;&lt;p&gt;&lt;a name="eda4"&gt;&lt;/a&gt; &lt;span style="font-size:130%;color:navy;"&gt;Verification of results of EDA&lt;/span&gt;&lt;/p&gt;&lt;p&gt;The exploration of data can only serve as the first stage of data analysis, and its results can be treated as tentative at best as long as they are not confirmed, e.g., &lt;a linkindex="122" href="http://www.statsoft.com/textbook/glosc.html#Cross-Validation"&gt;crossvalidated&lt;/a&gt;, using a different data set (or an independent subset). If the result of the exploratory stage suggests a particular model, then its validity can be verified by applying it to a new data set and testing its fit (e.g., testing its &lt;i&gt;predictive validity&lt;/i&gt;). Case selection conditions can be used to quickly define subsets of data (e.g., for estimation and verification), and for testing the robustness of results. 
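One way to picture the estimation/verification split described above is the short Python sketch below. It assumes a straight-line model, made-up cases, and a simple even/odd case selection condition; none of these come from the StatSoft text, and `fit_line` and `mean_squared_error` are hypothetical helper names.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared residual of the fitted line on the given cases."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


# Made-up cases: y is roughly 2x + 1 with a small deterministic disturbance.
cases = [(x, 2 * x + 1 + (0.5 if x % 3 == 0 else -0.25)) for x in range(20)]

# Case selection conditions: even-numbered cases for estimation,
# odd-numbered cases held back for verification.
estimation = [c for i, c in enumerate(cases) if i % 2 == 0]
verification = [c for i, c in enumerate(cases) if i % 2 == 1]

slope, intercept = fit_line(estimation)
print("estimation MSE:  ", mean_squared_error(estimation, slope, intercept))
print("verification MSE:", mean_squared_error(verification, slope, intercept))
```

If the verification error were much larger than the estimation error, the pattern found in the exploratory stage would have to be treated as unconfirmed.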
&lt;table align="right"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt; &lt;a linkindex="123" href="http://www.statsoft.com/textbook/stdatmin.html#index"&gt; &lt;span style="font-size:78%;"&gt;To index&lt;/span&gt;&lt;/a&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt; &lt;/p&gt;&lt;p&gt; &lt;a name="neural"&gt;&lt;/a&gt; &lt;span style="font-size:180%;color:navy;"&gt;Neural Networks&lt;/span&gt;&lt;br /&gt;(see also &lt;a linkindex="124" href="http://www.statsoft.com/textbook/stneunet.html"&gt;Neural Networks&lt;/a&gt; chapter) &lt;/p&gt;&lt;p&gt; &lt;i&gt;Neural Networks&lt;/i&gt; are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called &lt;i&gt;learning&lt;/i&gt; from existing data.  Neural networks are one of the &lt;a linkindex="125" href="http://www.statsoft.com/textbook/stdatmin.html#mining"&gt;Data Mining&lt;/a&gt; techniques. &lt;/p&gt;&lt;p&gt; &lt;img src="http://www.statsoft.com/textbook/graphics/neuralnw.gif" alt="[Neural Network]" height="200" width="301" /&gt; &lt;/p&gt;&lt;p&gt; The first step is to design a specific network architecture (that includes a specific number of "layers" each consisting of a certain number of "neurons"). The size and structure of the network need to match the nature (e.g., the formal complexity) of the investigated phenomenon. Because the latter is obviously not known very well at this early stage, this task is not easy and often involves multiple "trials and errors." (There is now, however, neural network software that applies artificial intelligence techniques to aid in that tedious task and finds "the best" network architecture.)&lt;/p&gt;&lt;p&gt; The new network is then subjected to the process of "training." 
In that phase, neurons apply an iterative process to the number of inputs (variables) to adjust the weights of the network in order to optimally predict (in traditional terms one could say, find a "fit" to) the sample data on which the "training" is performed. After the phase of learning from an existing data set, the new network is ready and it can then be used to generate predictions. &lt;/p&gt;&lt;p&gt; &lt;img src="http://www.statsoft.com/textbook/graphics/snn_1.gif" alt="[STATISTICA Neural Networks Example]" height="246" width="326" /&gt; &lt;/p&gt;&lt;p&gt; The resulting &lt;i&gt;"network"&lt;/i&gt; developed in the process of &lt;i&gt;"learning"&lt;/i&gt; represents a pattern detected in the data.  Thus, in this approach, the &lt;i&gt;"network"&lt;/i&gt; is the functional equivalent of a model of relations between variables in the traditional &lt;i&gt;model building&lt;/i&gt; approach.  However, unlike in the traditional &lt;i&gt;models&lt;/i&gt;, in the &lt;i&gt;"network,"&lt;/i&gt; those relations cannot be articulated in the usual terms used in statistics or methodology to describe relations between variables (such as, for example, &lt;i&gt;"A is positively correlated with B but only for observations where the value of C is low and D is high"&lt;/i&gt;).  Some &lt;i&gt;neural networks&lt;/i&gt; can produce highly accurate predictions; they represent, however, a typical a-theoretical (one can say, "a black box") research approach. 
That approach is concerned only with practical considerations,&lt;wbr&gt; that is, with the predictive validity of the solution and its applied relevance and not with the nature of the underlying mechanism or its relevance for any "theory" of the underlying phenomena.&lt;/p&gt;&lt;p&gt; However, it should be mentioned that &lt;i&gt;Neural Network&lt;/i&gt; techniques can also be used as a component of analyses designed to build explanatory models because &lt;i&gt;Neural Networks&lt;/i&gt; can help explore data sets in search for relevant variables or groups of variables; the results of such explorations can then facilitate the process of model building. Moreover, now there is neural network software that uses sophisticated algorithms to search for the most relevant input variables, thus potentially contributing directly to the model building process. &lt;/p&gt;&lt;p&gt; One of the major advantages of &lt;i&gt;neural networks&lt;/i&gt; is that, theoretically, they are capable of approximating any continuous function, and thus the researcher does not need to have any hypotheses about the underlying model, or even to some extent, which variables matter. An important disadvantage, however, is that the final solution depends on the initial conditions of the network, and, as stated before, it is virtually impossible to "interpret" the solution in traditional, analytic terms, such as those used to build theories that explain phenomena. &lt;/p&gt;&lt;p&gt; &lt;img src="http://www.statsoft.com/textbook/graphics/snn_2.gif" alt="[STATISTICA Neural Networks Example]" height="246" width="326" /&gt; &lt;/p&gt;&lt;p&gt; Some authors stress the fact that &lt;i&gt;neural networks&lt;/i&gt; use, or one should say, are expected to use, massively parallel computation models.  
For example, Haykin (1994) defines a &lt;i&gt;neural network&lt;/i&gt; as: &lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;blockquote&gt; &lt;i&gt;"a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: (1) Knowledge is acquired by the network through a learning process, and (2) Interneuron connection strengths known as synaptic weights are used to store the knowledge."&lt;/i&gt; (p. 2). &lt;/blockquote&gt; &lt;p&gt; &lt;img src="http://www.statsoft.com/textbook/graphics/snn_3.gif" alt="[STATISTICA Neural Networks Example]" height="246" width="326" /&gt; &lt;/p&gt;&lt;p&gt; However, as Ripley (1996) points out, the vast majority of contemporary neural network applications run on single-processor computers, and he argues that a large speed-up can be achieved not only by developing software that will take advantage of multiprocessor hardware but also by designing better (more efficient) learning &lt;a linkindex="126" href="http://www.statsoft.com/textbook/glosa.html#Algorithm"&gt;algorithms&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;  &lt;i&gt;Neural networks&lt;/i&gt; are one of the methods used in &lt;a linkindex="127" href="http://www.statsoft.com/textbook/stdatmin.html#mining"&gt;Data Mining&lt;/a&gt;; see also &lt;a linkindex="128" href="http://www.statsoft.com/textbook/stdatmin.html#eda"&gt;Exploratory Data Analysis&lt;/a&gt;. For more information on &lt;i&gt;neural networks&lt;/i&gt;, see Haykin (1994), Masters (1995), Ripley (1996), and Welstead (1994).  For a discussion of &lt;i&gt;neural networks&lt;/i&gt; as statistical tools, see Warner and Misra (1996).  See also &lt;a set="yes" linkindex="129" href="http://www.statsoft.com/textbook/gloss.html#SNN"&gt;&lt;i&gt;STATISTICA Neural Networks&lt;/i&gt;&lt;/a&gt;. 
&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-5809434346056913863?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/5809434346056913863/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=5809434346056913863' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/5809434346056913863'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/5809434346056913863'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/11/data-mining-techniques.html' title='Data Mining Techniques'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-1964616646115928020</id><published>2007-11-23T16:02:00.000Z</published><updated>2007-11-23T16:04:03.518Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Concept'/><title type='text'>Supervised learning</title><content type='html'>&lt;h1 class="firstHeading"&gt;Supervised learning&lt;/h1&gt;       &lt;h3 id="siteSub"&gt;From Wikipedia, the free encyclopedia&lt;/h3&gt;                 &lt;!-- start content --&gt;    &lt;p&gt;&lt;b&gt;Supervised learning&lt;/b&gt; is a &lt;a linkindex="9" href="http://en.wikipedia.org/wiki/Machine_learning" title="Machine learning"&gt;machine learning&lt;/a&gt; technique for creating a function from training 
data. The &lt;a linkindex="10" href="http://en.wikipedia.org/wiki/Training_set" title="Training set"&gt;training data&lt;/a&gt; consist of pairs of input objects (typically vectors), and desired outputs. The output of the function can be a continuous value (called &lt;a linkindex="11" href="http://en.wikipedia.org/wiki/Regression_analysis" title="Regression analysis"&gt;regression&lt;/a&gt;), or can predict a class label of the input object (called &lt;a set="yes" linkindex="12" href="http://en.wikipedia.org/wiki/Statistical_classification" title="Statistical classification"&gt;classification&lt;/a&gt;). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a "reasonable" way (see &lt;a linkindex="13" href="http://en.wikipedia.org/wiki/Inductive_bias" title="Inductive bias"&gt;inductive bias&lt;/a&gt;). (Compare with &lt;a linkindex="14" href="http://en.wikipedia.org/wiki/Unsupervised_learning" title="Unsupervised learning"&gt;unsupervised learning&lt;/a&gt;.) 
The parallel task in human and animal psychology is often referred to as &lt;a set="yes" linkindex="15" href="http://en.wikipedia.org/wiki/Concept_learning" title="Concept learning"&gt;concept learning&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-1964616646115928020?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/1964616646115928020/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=1964616646115928020' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/1964616646115928020'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/1964616646115928020'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/11/supervised-learning.html' title='Supervised learning'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-1958917342534998796</id><published>2007-11-18T16:25:00.000Z</published><updated>2007-11-18T16:27:51.448Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Theory'/><title type='text'>Overfitting</title><content type='html'>&lt;a name="top"&gt;&lt;/a&gt;&lt;p&gt;we will look at some techniques for preventing our model becoming too powerful (overfitting).  
In the next section, we address the related question of selecting an appropriate architecture with just the right number of trainable parameters.  &lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;h2&gt;Bias-Variance trade-off&lt;/h2&gt;  Consider the two fitted functions below.  The data points (circles) have all been generated from a smooth function, &lt;i&gt;h(x)&lt;/i&gt;, with some added noise.  Obviously, we want to end up with a model which approximates &lt;i&gt;h(x)&lt;/i&gt;, given a specific set of data &lt;i&gt;y(x)&lt;/i&gt; generated as: &lt;p&gt; &lt;/p&gt;&lt;center&gt; &lt;table align="center" width="60%"&gt; &lt;tbody&gt;&lt;tr valign="middle"&gt;&lt;td&gt;&lt;img src="http://www.willamette.edu/%7Egorr/classes/cs449/equations/overfit/img1.gif" /&gt;&lt;/td&gt; &lt;td align="right" width="10"&gt; (1)&lt;/td&gt;&lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; &lt;/center&gt;  &lt;p&gt; In the left-hand panel we try to fit the points using a function &lt;i&gt;g(x)&lt;/i&gt; which has too few parameters: a straight line.  The model has the virtue of being simple; there are only two free parameters.  However, it does not do a good job of fitting the data, and would not do well in predicting new data points.  We say that the simpler model has a high &lt;b&gt;bias&lt;/b&gt;. &lt;/p&gt;&lt;p&gt;  &lt;/p&gt;&lt;center&gt; &lt;img src="http://www.willamette.edu/%7Egorr/classes/cs449/figs/overfit.gif" alt="under and overfitting data" width="80%" /&gt; &lt;/center&gt;  &lt;p&gt; The right-hand panel shows a model which has been fitted using too many free parameters.  It does an excellent job of fitting the data points, as the error at the data points is close to zero. However, it would not do a good job of predicting &lt;i&gt;h(x)&lt;/i&gt; for new values of &lt;i&gt;x&lt;/i&gt;.  We say that the model has a high &lt;b&gt;variance&lt;/b&gt;.  The model does not reflect the structure which we expect to be present in &lt;i&gt;any&lt;/i&gt; data set generated by equation (1) above. 
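The contrast between the two panels can be reproduced with a small Python sketch. It is only an illustration under stated assumptions: h(x) is taken to be one period of a sine wave, the "noise" is a deterministic alternating +/-0.3 so the run is reproducible, and the over-parameterised model is a Lagrange polynomial through every point. None of these choices come from the original figure.

```python
from math import pi, sin


def h(x):
    """Assumed smooth generating function (the source does not specify h)."""
    return sin(2 * pi * x)


def fit_line(xs, ys):
    """Least-squares straight line: the two-parameter, high-bias model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: slope * x + (my - slope * mx)


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every point: one free parameter per data point."""
    def g(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return g


def mse(g, xs, targets):
    """Mean squared difference between g(x) and the target values."""
    return sum((g(x) - t) ** 2 for x, t in zip(xs, targets)) / len(xs)


# Training data y = h(x) + noise, with deterministic alternating noise.
train_x = [i / 9 for i in range(10)]
train_y = [h(x) + (0.3 if i % 2 == 0 else -0.3) for i, x in enumerate(train_x)]

# New inputs between the training points, scored against the true h(x).
new_x = [(i + 0.5) / 9 for i in range(9)]
new_h = [h(x) for x in new_x]

line = fit_line(train_x, train_y)
interpolant = fit_interpolant(train_x, train_y)
for name, g in [("line", line), ("interpolant", interpolant)]:
    print(name, "train MSE:", mse(g, train_x, train_y),
          "new-point MSE:", mse(g, new_x, new_h))
```

The straight line leaves a clearly non-zero error even on the training points (bias), whereas the interpolant reproduces the training points exactly yet misses h(x) at the new inputs (variance).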
&lt;/p&gt;&lt;p&gt;  Clearly what we want is something in between: a model which is powerful enough to represent the underlying structure of the data (&lt;i&gt;h(x)&lt;/i&gt;), but not so powerful that it faithfully models the noise associated with this particular data sample. &lt;/p&gt;&lt;p&gt;  The bias-variance trade-off is most likely to become a problem if we have relatively few data points.  In the opposite case, where we have essentially an infinite number of data points (as in continuous online learning), we are not usually in danger of overfitting the data, as the noise associated with any single data point plays a vanishingly small role in our overall fit.  The following techniques therefore apply to situations in which we have a finite data set, and, typically, where we wish to train in batch mode.  &lt;/p&gt;&lt;p&gt;  &lt;/p&gt;&lt;h2&gt;Preventing overfitting&lt;/h2&gt; &lt;p&gt;  &lt;/p&gt;&lt;h3&gt;Early stopping&lt;/h3&gt;  One of the simplest and most widely used means of avoiding overfitting is to divide the data into two sets: a training set and a &lt;b&gt;validation&lt;/b&gt; set.  We train using only the training data.  Every now and then, however, we stop training, and test network performance on the independent validation set.  No weight updates are made during this test!  As the validation data is independent of the training data, network performance is a good measure of generalization,&lt;wbr&gt; and as long as the network is learning the underlying structure of the data (&lt;i&gt;h(x)&lt;/i&gt; above), performance on the validation set will improve with training.  Once the network stops learning things which are expected to be true of any data sample and learns things which are true only of this sample (epsilon in Eqn 1 above), performance on the validation set will stop improving, and will typically get worse. Schematic learning curves showing error on the training and validation sets are shown below.  
To avoid overfitting, we simply stop training at time &lt;i&gt;t&lt;/i&gt;, where performance on the validation set is optimal.  &lt;p&gt;  &lt;/p&gt;&lt;center&gt; &lt;img src="http://www.willamette.edu/%7Egorr/classes/cs449/figs/earlystop.gif" alt="early stopping" width="80%" /&gt; &lt;/center&gt;  &lt;p&gt; One detail of note when using early stopping:  if we wish to test the trained network on a set of independent data to measure its ability to generalize, we need a third, independent, test set.  This is because we used the validation set to decide when to stop training, and thus our trained network is no longer entirely independent of the validation set.  The requirement for independent training, validation and test sets means that early stopping can only be used in a data-rich situation.  &lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;h3&gt;Weight decay&lt;/h3&gt;  The over-fitted function above shows a high degree of curvature, while the linear function is maximally smooth.  &lt;b&gt;Regularization&lt;/b&gt; refers to a set of techniques which help to ensure that the function computed by the network is no more curved than necessary.  This is achieved by adding a penalty to the error function, giving:  &lt;p&gt; &lt;/p&gt;&lt;center&gt; &lt;table align="center" width="60%"&gt; &lt;tbody&gt;&lt;tr valign="middle"&gt;&lt;td&gt;&lt;img src="http://www.willamette.edu/%7Egorr/classes/cs449/equations/overfit/img2.gif" /&gt;&lt;/td&gt; &lt;td align="right" width="10"&gt; (2)&lt;/td&gt;&lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; &lt;/center&gt;  &lt;p&gt; One possible form of the regularizer comes from the informal observation that an over-fitted mapping with regions of large curvature requires large weights.  
We thus penalize large weights by choosing  &lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;center&gt; &lt;table align="center" width="60%"&gt; &lt;tbody&gt;&lt;tr valign="middle"&gt;&lt;td&gt;&lt;img src="http://www.willamette.edu/%7Egorr/classes/cs449/equations/overfit/img3.gif" /&gt;&lt;/td&gt; &lt;td align="right" width="10"&gt; (3)&lt;/td&gt;&lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; &lt;/center&gt; &lt;p&gt; Using this modified error function, the weights are now updated as &lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;center&gt; &lt;table align="center" width="60%"&gt; &lt;tbody&gt;&lt;tr valign="middle"&gt;&lt;td&gt;&lt;img src="http://www.willamette.edu/%7Egorr/classes/cs449/equations/overfit/img4.gif" /&gt;&lt;/td&gt; &lt;td align="right" width="10"&gt; (4)&lt;/td&gt;&lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; &lt;/center&gt; &lt;p&gt; where the right hand term causes the weight to decrease as a function of its own size.  In the absence of any input, all weights will tend to decrease exponentially, hence the term "weight decay".  &lt;/p&gt;&lt;h3&gt;Training with noise&lt;/h3&gt;  A final method which can often help to reduce the importance of the specific noise characteristics associated with a particular data sample is to add an extra small amount of noise (a small random value with a mean of zero) to each input.  Each time a specific input pattern &lt;i&gt;x&lt;/i&gt; is presented, we add a different random number, and use &lt;img src="http://www.willamette.edu/%7Egorr/classes/cs449/equations/overfit/img5.gif" align="top" /&gt; instead. &lt;p&gt;  At first, this may seem a rather odd thing to do: to deliberately corrupt one's own data. However, perhaps you can see that it will now be difficult for the network to approximate any specific data point too closely.  In practice, training with added noise has indeed been shown to reduce overfitting and thus improve generalization in some situations. 
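&lt;/p&gt;&lt;p&gt; The idea of presenting a freshly corrupted copy of the same pattern each time can be sketched as follows. This is only a minimal illustration: the text asks for a small zero-mean random value, and the Gaussian distribution, the noise level of 0.05, the example pattern and the function name are all assumptions made here. &lt;/p&gt;
&lt;pre&gt;
```python
import random


def noisy_presentations(pattern, noise_std, count, rng):
    """Return `count` copies of `pattern`, each with fresh zero-mean Gaussian noise."""
    return [
        [value + rng.gauss(0.0, noise_std) for value in pattern]
        for _ in range(count)
    ]


rng = random.Random(42)
x = [0.2, -1.0, 0.7]  # one hypothetical input pattern
copies = noisy_presentations(x, 0.05, 1000, rng)

# Every presentation differs, but the noise averages out towards the clean input.
mean_input = [sum(column) / len(copies) for column in zip(*copies)]
print(copies[0])
print(copies[1])
print(mean_input)
```
&lt;/pre&gt;
&lt;p&gt; Because the noise has zero mean, the average over many presentations stays close to the clean pattern, while no single presentation can be matched exactly by the network.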
&lt;/p&gt;&lt;p&gt;  If we have a finite training set, another way of introducing noise into the training process is to use online training, that is, updating weights after every pattern presentation, and to randomly reorder the patterns at the end of each training epoch.  In this manner, each weight update is based on a noisy estimate of the true gradient.      &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-1958917342534998796?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/1958917342534998796/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=1958917342534998796' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/1958917342534998796'/><link 
rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/1958917342534998796'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/11/overfitting.html' title='Overfitting'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-7717925477710785038</id><published>2007-10-23T16:16:00.000+01:00</published><updated>2007-10-23T16:17:38.199+01:00</updated><title type='text'>Algebra - Imperial College</title><content type='html'>&lt;h1&gt;&lt;span style="font-size: 16pt;"&gt;M2P2&lt;/span&gt;&lt;span style="font-size: 16pt;"&gt;&lt;o:p&gt; Algebra&lt;/o:p&gt;&lt;/span&gt;&lt;/h1&gt;&lt;h1&gt;&lt;span style="font-size: 16pt;"&gt;Lecturer: Prof M W Liebeck&lt;/span&gt;&lt;/h1&gt;  &lt;h1&gt;&lt;br /&gt;&lt;span style="font-size: 14pt;"&gt;&lt;/span&gt;&lt;/h1&gt;&lt;h1&gt;&lt;span style="font-size: 14pt;"&gt;Recommended books&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/h1&gt;    &lt;p class="MsoNormal"&gt;&lt;span style="font-size: 14pt;"&gt;&lt;o:p&gt; &lt;/o:p&gt;For group theory, there are many good introductory books. Here are a few suggestions:&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;&lt;span style="font-size: 14pt;"&gt;J.B. 
&lt;span class="SpellE"&gt;Fraleigh&lt;/span&gt;, &lt;span class="GramE"&gt;A&lt;/span&gt; First Course in Abstract Algebra, Addison Wesley&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;  &lt;p class="MsoNormal"&gt;&lt;span class="GramE"&gt;&lt;span style="font-size: 14pt;"&gt;R.&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size: 14pt;"&gt; &lt;span class="SpellE"&gt;Allenby&lt;/span&gt;, Rings, Fields and Groups, Arnold&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style="font-size: 14pt;"&gt;I.N. &lt;span class="SpellE"&gt;Herstein&lt;/span&gt;, Topics in Algebra&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style="font-size: 14pt;"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style="font-size: 14pt;"&gt;For linear algebra, here are a few good ones:&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style="font-size: 14pt;"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style="font-size: 14pt;"&gt;J. &lt;span class="SpellE"&gt;Fraleigh&lt;/span&gt; and R. Beauregard, Linear Algebra, Addison Wesley&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;span style="font-size: 14pt;"&gt;S. &lt;span class="SpellE"&gt;Lipschutz&lt;/span&gt; and M. 
Lipson, Linear Algebra, &lt;span class="SpellE"&gt;Schaum&lt;/span&gt; Outline Series, McGraw Hill&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-7717925477710785038?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/7717925477710785038/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=7717925477710785038' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/7717925477710785038'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/7717925477710785038'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/10/algebra-imperial-college.html' title='Algebra - Imperial College'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-7337108793385463283</id><published>2007-10-23T16:00:00.000+01:00</published><updated>2007-10-23T16:01:07.607+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='Probability Theory'/><title type='text'>M2S1 PROBABILITY AND STATISTICS II - Imperial College</title><content type='html'>Recommended Texts&lt;br /&gt;G. R. Grimmett and D. R. 
Stirzaker, Probability and Random Processes (2nd Edition/3rd Edition). [Very useful for probability material of the course].&lt;br /&gt;W. Feller, An Introduction to Probability Theory and Its Applications. Vols 1 and 2. [A classical reference text].&lt;br /&gt;G. Casella and R.L. Berger, Statistical Inference. [A very useful text, which covers statistical ideas as well as probability material].&lt;br /&gt;There are many such introductory texts in the Mathematics library. Other books relating to specific parts of the course will be recommended when relevant.&lt;br /&gt;Also, there will be a course WWW page accessible from http://stats.ma.ic.ac.uk/ayoung. It will contain links to course handouts, exercises and solutions.&lt;br /&gt;Professor A. Young (room 529, email alastair.young@imperial.ac.uk)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-7337108793385463283?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/7337108793385463283/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=7337108793385463283' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/7337108793385463283'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/7337108793385463283'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/10/m2s1-probability-and-statistics-ii.html' title='M2S1 PROBABILITY AND STATISTICS II - Imperial College'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-2385514731017952339</id><published>2007-10-19T22:09:00.001+01:00</published><updated>2007-10-19T23:37:46.829+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Graph'/><title type='text'>Confusion Matrix</title><content type='html'>A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications made by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier.  &lt;p&gt;The entries in the confusion matrix have the following meaning in the context of our study:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;&lt;i&gt;a&lt;/i&gt; is the number of &lt;b&gt;correct&lt;/b&gt; predictions that an instance is &lt;b&gt;negative&lt;/b&gt;,&lt;/li&gt;&lt;li&gt;&lt;i&gt;b&lt;/i&gt; is the number of &lt;b&gt;incorrect&lt;/b&gt; predictions that an instance is &lt;b&gt;positive&lt;/b&gt;,&lt;/li&gt;&lt;li&gt;&lt;i&gt;c&lt;/i&gt; is the number of &lt;b&gt;incorrect&lt;/b&gt; predictions that an instance is &lt;b&gt;negative&lt;/b&gt;, and&lt;/li&gt;&lt;li&gt;&lt;i&gt;d&lt;/i&gt; is the number of &lt;b&gt;correct&lt;/b&gt; predictions that an instance is &lt;b&gt;positive&lt;/b&gt;.&lt;/li&gt;&lt;/ul&gt;  &lt;table align="center" border="1" cellspacing="0" cols="4" width="50%"&gt;   &lt;tbody&gt;&lt;tr&gt;     &lt;td&gt;&lt;br /&gt;&lt;/td&gt;     &lt;td&gt;&lt;br /&gt;&lt;/td&gt;     &lt;td colspan="2" align="center"&gt;Predicted&lt;/td&gt;   &lt;/tr&gt;   &lt;tr&gt;     &lt;td&gt;&lt;br /&gt;&lt;/td&gt;     &lt;td&gt;&lt;br /&gt;&lt;/td&gt;     &lt;td align="center"&gt;Negative&lt;/td&gt;     &lt;td align="center"&gt;Positive&lt;/td&gt;   &lt;/tr&gt;   &lt;tr&gt;     &lt;td 
rowspan="2" align="center"&gt;Actual&lt;/td&gt;     &lt;td align="center"&gt;Negative&lt;/td&gt;     &lt;td align="center"&gt;&lt;b&gt;a&lt;/b&gt;&lt;/td&gt;     &lt;td align="center"&gt;&lt;b&gt;b&lt;/b&gt;&lt;/td&gt;   &lt;/tr&gt;   &lt;tr&gt;     &lt;td&gt;&lt;center&gt;Positive&lt;/center&gt;&lt;/td&gt;     &lt;td&gt;&lt;center&gt;&lt;b&gt;c&lt;/b&gt;&lt;/center&gt;&lt;/td&gt;     &lt;td&gt;&lt;center&gt;&lt;b&gt;d&lt;/b&gt;&lt;/center&gt;&lt;/td&gt;   &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt; &lt;/p&gt;  &lt;p&gt;Several standard terms have been defined for the two-class matrix: &lt;/p&gt;&lt;ul&gt;&lt;li&gt; The &lt;i&gt;accuracy&lt;/i&gt; (&lt;i&gt;AC&lt;/i&gt;) is the proportion of the total number of predictions that were correct. It is determined using the equation:&lt;/li&gt;&lt;/ul&gt;  &lt;center&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/confusion_matrix/cm1.gif" height="34" width="128" /&gt; [1]&lt;/center&gt;  &lt;ul&gt;&lt;li&gt; The &lt;i&gt;recall&lt;/i&gt; or &lt;i&gt;hit rate&lt;/i&gt; or &lt;i&gt;true positive rate&lt;/i&gt; (&lt;i&gt;TP&lt;/i&gt;) is the proportion of positive cases that were correctly identified, as calculated using the equation:&lt;/li&gt;&lt;/ul&gt;  &lt;center&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/confusion_matrix/cm2.gif" height="34" width="76" /&gt; [2]&lt;/center&gt;  &lt;ul&gt;&lt;li&gt; The &lt;i&gt;false alarm rate&lt;/i&gt; or &lt;i&gt;false positive rate&lt;/i&gt; (&lt;i&gt;FP&lt;/i&gt;) is the proportion of negative cases that were incorrectly classified as positive, as calculated using the equation:&lt;/li&gt;&lt;/ul&gt;  &lt;center&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/confusion_matrix/cm3.gif" height="34" width="77" /&gt; [3]&lt;/center&gt;  &lt;ul&gt;&lt;li&gt; The &lt;i&gt;true negative rate&lt;/i&gt; (&lt;i&gt;TN&lt;/i&gt;) is defined as the proportion of negative cases that were classified correctly, as calculated 
using the equation:&lt;/li&gt;&lt;/ul&gt;  &lt;center&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/confusion_matrix/cm4.gif" height="34" width="77" /&gt; [4]&lt;/center&gt;  &lt;ul&gt;&lt;li&gt; The &lt;i&gt;false negative rate&lt;/i&gt; (&lt;i&gt;FN&lt;/i&gt;) is the proportion of positive cases that were incorrectly classified as negative, as calculated using the equation:&lt;/li&gt;&lt;/ul&gt;  &lt;center&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/confusion_matrix/cm5.gif" height="34" width="81" /&gt; [5]&lt;/center&gt;  &lt;ul&gt;&lt;li&gt; Finally, &lt;i&gt;precision&lt;/i&gt; (&lt;i&gt;P&lt;/i&gt;) is the proportion of the predicted positive cases that were correct, as calculated using the equation:&lt;/li&gt;&lt;/ul&gt;  &lt;center&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/confusion_matrix/cm6.gif" height="34" width="69" /&gt; [6]&lt;/center&gt;  &lt;p&gt; The accuracy determined using equation 1 may not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases (Kubat et al., 1998).  Suppose there are 1000 cases, 995 of which are negative cases and 5 of which are positive cases. If the system classifies them all as negative, the accuracy would be 99.5%, even though the classifier missed all positive cases. Other performance measures account for this by including &lt;i&gt;TP&lt;/i&gt; in a product: for example, &lt;i&gt;geometric mean&lt;/i&gt; (&lt;i&gt;g-mean&lt;/i&gt;) (Kubat et al., 1998), as defined in equations 7 and 8, and &lt;i&gt;F-Measure&lt;/i&gt; (Lewis and Gale, 1994), as defined in equation 9. 
&lt;/p&gt;&lt;center&gt; &lt;p&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/confusion_matrix/cm7.gif" height="24" width="138" /&gt;&lt;span style="font-family:Arial;"&gt; [7]&lt;/span&gt; &lt;/p&gt;&lt;p&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/confusion_matrix/cm8.gif" height="24" width="150" /&gt;&lt;span style="font-family:Arial;"&gt; [8]&lt;/span&gt; &lt;/p&gt;&lt;p&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/confusion_matrix/cm9.gif" height="45" width="140" /&gt;&lt;span style="font-family:Arial;"&gt; [9]&lt;/span&gt;&lt;/p&gt;&lt;/center&gt;  &lt;p&gt;In equation 9, β has a value from 0 to infinity and is used to control the relative weight assigned to &lt;i&gt;TP&lt;/i&gt; and &lt;i&gt;P&lt;/i&gt;. Any classifier evaluated using equations 7, 8 or 9 will have a measure value of 0 if all positive cases are classified incorrectly. &lt;/p&gt;&lt;p&gt;Another way to examine the performance of classifiers is to use a &lt;a linkindex="3" href="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/ROC/ROC.html"&gt;ROC graph&lt;/a&gt;, described on the next page. 
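&lt;/p&gt;&lt;p&gt; The measures above can be computed directly from the four counts, as in the following sketch. The ratios for equations 1 to 6 follow the verbal definitions given in the list. The forms used for equations 7 to 9 (g-mean1 as the square root of TP times P, g-mean2 as the square root of TP times TN, and F as (β&lt;sup&gt;2&lt;/sup&gt; + 1) P TP / (β&lt;sup&gt;2&lt;/sup&gt; P + TP)) are assumptions, since those equations appear only as images here, and returning 0 for an undefined ratio is a convention chosen for this example. &lt;/p&gt;
&lt;pre&gt;
```python
import math


def ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumed convention)."""
    if denominator == 0:
        return 0.0
    return numerator / denominator


def confusion_metrics(a, b, c, d, beta=1.0):
    """Compute the measures from the counts a, b, c, d of the two-class table."""
    ac = ratio(a + d, a + b + c + d)   # accuracy
    tp = ratio(d, c + d)               # recall / true positive rate
    fp = ratio(b, a + b)               # false positive rate
    tn = ratio(a, a + b)               # true negative rate
    fn = ratio(c, c + d)               # false negative rate
    p = ratio(d, b + d)                # precision
    return {
        "AC": ac, "TP": tp, "FP": fp, "TN": tn, "FN": fn, "P": p,
        "g_mean1": math.sqrt(tp * p),
        "g_mean2": math.sqrt(tp * tn),
        "F": ratio((beta ** 2 + 1) * p * tp, beta ** 2 * p + tp),
    }


# The example from the text: 995 negatives, 5 positives, all predicted negative.
all_negative = confusion_metrics(a=995, b=0, c=5, d=0)
print(all_negative["AC"])       # 0.995
print(all_negative["TP"])       # 0.0
print(all_negative["g_mean2"])  # 0.0
```
&lt;/pre&gt;
&lt;p&gt; For the all-negative classifier the accuracy is 99.5% while &lt;i&gt;TP&lt;/i&gt;, both geometric means and the F-measure are 0, which is the behaviour the text describes.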
&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-2385514731017952339?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/2385514731017952339/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=2385514731017952339' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/2385514731017952339'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/2385514731017952339'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/10/confusion-matrix.html' title='Confusion Matrix'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-6307889372278372108</id><published>2007-10-19T20:04:00.001+01:00</published><updated>2007-10-19T20:04:30.655+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Course'/><title type='text'>Timetable II</title><content type='html'>&lt;h3&gt; &lt;span style="font-family:Arial;color:#000000;"&gt;Autumn 2007&lt;/span&gt; &lt;/h3&gt; &lt;h3&gt; &lt;span style="font-family:Arial;color:#000000;"&gt;MSc Advanced Computing (Weeks 2 - 11)&lt;/span&gt; &lt;/h3&gt; &lt;span style="font-family:Arial;color:#000000;"&gt;Week 2 start date: Monday 8 October, 2007&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Date Published: 19 October 2007&lt;/i&gt;&lt;br 
/&gt;&lt;br /&gt;&lt;table bg border="1" cellpadding="4" cellspacing="0" style="color:#ffffff;"&gt; &lt;thead&gt; &lt;tr&gt; &lt;th border bg style="color:#c0c0c0;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Time&lt;/span&gt; &lt;/th&gt; &lt;th border bg style="color:#c0c0c0;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Monday&lt;/span&gt; &lt;/th&gt; &lt;th border bg style="color:#c0c0c0;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Tuesday&lt;/span&gt; &lt;/th&gt; &lt;th border bg style="color:#c0c0c0;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Wednesday&lt;/span&gt; &lt;/th&gt; &lt;th border bg style="color:#c0c0c0;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Thursday&lt;/span&gt; &lt;/th&gt; &lt;th border bg style="color:#c0c0c0;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Friday&lt;/span&gt; &lt;/th&gt; &lt;/tr&gt; &lt;/thead&gt; &lt;tbody&gt; &lt;tr bg valign="top" style="color:#ffffff;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;0900&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Advanced Topics in Software Engineering&lt;br /&gt;LEC (2-10) / jnm (2-10),sue (2-10) / 311&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Modal and Temporal Logic&lt;br /&gt;LEC (2-10) / imh (2-10),mjs (2-10) / 144&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Commemoration Day (No Teaching Week 4 - 24.10.07)&lt;br /&gt;Wks (4-4) / None (4-4) /&lt;br /&gt;&lt;br /&gt;Intelligent Data and Probabilistic Inference&lt;br /&gt;LEC (2-10) / dfg (2-10) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border 
style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Advanced Issues in Object Oriented Programming&lt;br /&gt;LEC (2-10) / scd (2-10) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Network Security&lt;br /&gt;LEC (2-10) / ecl1 (2-10),mrh (2-10) / 308&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffff99;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1000&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Advanced Topics in Software Engineering&lt;br /&gt;TUT (2-10) / jnm (2-10),sue (2-10) / 311&lt;br /&gt;&lt;br /&gt;Laboratory Prolog&lt;br /&gt;LAB (2-10) / nr600 (2-10) / 202,206&lt;br /&gt;&lt;br /&gt;Intelligent Data and Probabilistic Inference&lt;br /&gt;LEC (11-11) / dfg (11-11) / 308&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Modal and Temporal Logic&lt;br /&gt;TUT (2-10) / imh (2-10),mjs (2-10) / 144&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Commemoration Day (No Teaching Week 4 - 24.10.07)&lt;br /&gt;Wks (4-4) / None (4-4) /&lt;br /&gt;&lt;br /&gt;Intelligent Data and Probabilistic Inference&lt;br /&gt;LEC (2-10) / dfg (2-10) / 308&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Advanced Issues in Object Oriented Programming&lt;br /&gt;TUT (2-10) / scd (2-10) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Lexis Preparation&lt;br /&gt;Wks (11-11) / nr600 (11-11) / 202,206&lt;br /&gt;&lt;br /&gt;Network 
Security&lt;br /&gt;TUT (2-10) / ecl1 (2-10),mrh (2-10) / 308&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffffff;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1100&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Laboratory Prolog&lt;br /&gt;LAB (2-10) / nr600 (2-10) / 202,206&lt;br /&gt;&lt;br /&gt;Advanced Topics in Software Engineering&lt;br /&gt;LEC (2-10) / jnm (2-10),sue (2-10) / 311&lt;br /&gt;&lt;br /&gt;Intelligent Data and Probabilistic Inference&lt;br /&gt;LEC (11-11) / dfg (11-11) / 308&lt;br /&gt;&lt;br /&gt;Lexis Preparation&lt;br /&gt;Wks (5-5) / nr600 (5-5) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Models of Concurrent Computation&lt;br /&gt;LEC (2-10) / dirk (2-10),pg (2-10) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Commemoration Day (No Teaching Week 4 - 24.10.07)&lt;br /&gt;Wks (4-4) / None (4-4) /&lt;br /&gt;&lt;br /&gt;Laboratory Prolog&lt;br /&gt;LAB (2-10) / nr600 (2-10) / 202,206&lt;br /&gt;&lt;br /&gt;Intelligent Data and Probabilistic Inference&lt;br /&gt;TUT (2-10) / dfg (2-10) / 344&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Machine Learning&lt;br /&gt;LAB (2-10) / maja (2-10),shm (2-10) / 219&lt;br /&gt;&lt;br /&gt;Computing for Optimal Decisions&lt;br /&gt;LEC (2-10) / br (2-10) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Machine Learning&lt;br /&gt;LAB (2-10) / maja (2-10),shm (2-10) / 219&lt;br /&gt;&lt;br /&gt;Lexis Prolog&lt;br /&gt;LAB (11-11) / nr600 (11-11) / 202,206&lt;br 
/&gt;&lt;br /&gt;Modal and Temporal Logic&lt;br /&gt;LEC (2-10) / imh (2-10),mjs (2-10) / 144&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffff99;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1200&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Laboratory Workshop (MAC &amp;amp; MSc CS in depth pathway)&lt;br /&gt;LEC (2-11) / nr600 (2-11) / 311&lt;br /&gt;&lt;br /&gt;Lexis Prolog&lt;br /&gt;LAB (5-5) / nr600 (5-5) / 202,206&lt;br /&gt;&lt;br /&gt;Intelligent Data and Probabilistic Inference&lt;br /&gt;TUT (11-11) / dfg (11-11) / 308&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Prolog Support Lectures&lt;br /&gt;LEC (2-10) / cjh (2-10),klc (2-10) / 144&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Commemoration Day (No Teaching Week 4 - 24.10.07)&lt;br /&gt;Wks (4-4) / None (4-4) /&lt;br /&gt;&lt;br /&gt;Laboratory Prolog&lt;br /&gt;LAB (2-10) / nr600 (2-10) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Machine Learning Alternative Lab&lt;br /&gt;LAB (2-10) / maja (2-10),shm (2-10) / 219&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Machine Learning Alternative Lab&lt;br /&gt;LAB (2-10) / maja (2-10),shm (2-10) / 219&lt;br /&gt;&lt;br /&gt;Lexis Prolog&lt;br /&gt;LAB (11-11) / nr600 (11-11) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffffff;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1300&lt;/span&gt; &lt;/td&gt; &lt;td border 
style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Lexis Prolog&lt;br /&gt;LAB (5-5) / nr600 (5-5) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffff99;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1400&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Advanced Issues in Object Oriented Programming&lt;br /&gt;LEC (2-10) / scd (2-10) / 144&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Network Security&lt;br /&gt;LEC (2-10) / ecl1 (2-10),mrh (2-10) / 144&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Automated Reasoning&lt;br /&gt;LEC (2-10) / kb (2-10) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Automated Reasoning&lt;br /&gt;LEC (2-10) / kb (2-10) / 144&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffffff;"&gt; &lt;td border 
style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1500&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Models of Concurrent Computation&lt;br /&gt;TUT (2-10) / dirk (2-10),pg (2-10) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Computer Vision&lt;br /&gt;LEC (2-10) / gzy (2-10) / 144&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Automated Reasoning&lt;br /&gt;TUT (2-10) / kb (2-10) / 144&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffff99;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1600&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Machine Learning&lt;br /&gt;LEC (2-10) / maja (2-10),shm (2-10) / 311&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Models of Concurrent Computation&lt;br /&gt;LEC (2-10) / dirk (2-10),pg (2-10) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Computer Vision&lt;br /&gt;TUT (2-10) / gzy (2-10) / 144&lt;/span&gt; &lt;/td&gt; &lt;td border 
style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Computing for Optimal Decisions&lt;br /&gt;LEC (2-10) / br (2-10) / 145&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffffff;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1700&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Machine Learning&lt;br /&gt;LEC (2-10) / maja (2-10),shm (2-10) / 311&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Project Lecture&lt;br /&gt;LEC (8-8) /  / 308&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Computer Vision&lt;br /&gt;LEC (2-10) / gzy (2-10) / 144&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Computing for Optimal Decisions&lt;br /&gt;TUT (2-10) / br (2-10) / 145&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt; &lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-6307889372278372108?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/6307889372278372108/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=6307889372278372108' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/2556963392609813338/posts/default/6307889372278372108'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/6307889372278372108'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/10/timetable-ii.html' title='Timetable II'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-1845861816147005364</id><published>2007-10-19T20:03:00.001+01:00</published><updated>2007-10-19T20:03:55.630+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Course'/><title type='text'>Timetable</title><content type='html'>&lt;h3&gt; &lt;span style="font-family:Arial;color:#000000;"&gt;Autumn 2007&lt;/span&gt; &lt;/h3&gt; &lt;h3&gt; &lt;span style="font-family:Arial;color:#000000;"&gt;MSc Computing Science (Weeks 2 - 11)&lt;/span&gt; &lt;/h3&gt; &lt;span style="font-family:Arial;color:#000000;"&gt;Week 2 start date: Monday 8 October, 2007&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Date Published: 19 October 2007&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;table bg border="1" cellpadding="4" cellspacing="0" style="color:#ffffff;"&gt; &lt;thead&gt; &lt;tr&gt; &lt;th border bg style="color:#c0c0c0;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Time&lt;/span&gt; &lt;/th&gt; &lt;th border bg style="color:#c0c0c0;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Monday&lt;/span&gt; &lt;/th&gt; &lt;th border bg style="color:#c0c0c0;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Tuesday&lt;/span&gt; &lt;/th&gt; &lt;th border bg style="color:#c0c0c0;"&gt; 
&lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Wednesday&lt;/span&gt; &lt;/th&gt; &lt;th border bg style="color:#c0c0c0;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Thursday&lt;/span&gt; &lt;/th&gt; &lt;th border bg style="color:#c0c0c0;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;Friday&lt;/span&gt; &lt;/th&gt; &lt;/tr&gt; &lt;/thead&gt; &lt;tbody&gt; &lt;tr bg valign="top" style="color:#ffffff;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;0900&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Programming in C and C++&lt;br /&gt;LEC (2-2) / wjk (2-2) / 311&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Commemoration Day (No Teaching Week 4 - 24.10.07)&lt;br /&gt;Wks (4-4) / None (4-4) /&lt;br /&gt;&lt;br /&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-10) / ap7 (2-10) / 202,206&lt;br /&gt;&lt;br /&gt;Programming in C and C++&lt;br /&gt;LEC (2-2) / wjk (2-2) / 144&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Programming in C and C++&lt;br /&gt;LEC (2-2) / wjk (2-2) / 308&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffff99;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1000&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span 
style="font-family:Arial;font-size:78%;color:#000000;"&gt;Computer Systems&lt;br /&gt;LEC (3-11) / dirk (3-11),jamm (3-11) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 202&lt;br /&gt;&lt;br /&gt;Object Oriented Design &amp;amp; Programming.&lt;br /&gt;LEC (3-11) / scd (3-11),wjk (3-11) / 343&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-10) / ap7 (2-10) / 202&lt;br /&gt;&lt;br /&gt;Commemoration Day (No Teaching Week 4 - 24.10.07)&lt;br /&gt;Wks (4-4) / None (4-4) / &lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Programming in C and C++&lt;br /&gt;LEC (2-2) / wjk (2-2) / 144&lt;br /&gt;&lt;br /&gt;Object Oriented Design &amp;amp; Programming.&lt;br /&gt;LEC (3-11) / scd (3-11),wjk (3-11) / 343&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffffff;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1100&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Computer Systems&lt;br /&gt;TUT (3-11) / dirk (3-11),jamm (3-11) / 144&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 202&lt;br /&gt;&lt;br /&gt;Logic and AI Programming&lt;br /&gt;LEC (6-8) / klc (6-8) / 
343&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-10) / ap7 (2-10) / 202&lt;br /&gt;&lt;br /&gt;Commemoration Day (No Teaching Week 4 - 24.10.07)&lt;br /&gt;Wks (4-4) / None (4-4) / &lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Computer Systems&lt;br /&gt;LEC (3-9) / ajd (3-9) / 144&lt;br /&gt;&lt;br /&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Object Oriented Design &amp;amp; Programming.&lt;br /&gt;TUT (3-11) / scd (3-11),wjk (3-11) / 343&lt;br /&gt;&lt;br /&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (1-2) / ap7 (1-2) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffff99;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1200&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Computer Systems&lt;br /&gt;LEC (3-11) / dirk (3-11),jamm (3-11) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Programming in C and C++&lt;br /&gt;LEC (2-2) / wjk (2-2) / 311&lt;br /&gt;&lt;br /&gt;Logic and AI Programming&lt;br /&gt;TUT (6-8) / klc (6-8) / 343,202,206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-10) / ap7 (2-10) / 202,206&lt;br /&gt;&lt;br /&gt;Commemoration Day (No Teaching Week 4 - 24.10.07)&lt;br /&gt;Wks (4-4) / None (4-4) / &lt;/span&gt; &lt;/td&gt; &lt;td border 
style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Programming in C and C++&lt;br /&gt;LEC (2-2) / wjk (2-2) / 308&lt;br /&gt;&lt;br /&gt;Computer Systems&lt;br /&gt;TUT (3-9) / ajd (3-9) / 144,202,206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Computer Systems&lt;br /&gt;LEC (3-9) / ajd (3-9) / 144&lt;br /&gt;&lt;br /&gt;Programming in C and C++&lt;br /&gt;TUT (2-2) / wjk (2-2) / 308&lt;br /&gt;&lt;br /&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffffff;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1300&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffff99;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1400&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Logic and AI Programming&lt;br /&gt;LEC (3-11) / fs (3-11),klc (3-11) / 145&lt;br /&gt;&lt;br /&gt;Programming in C 
and C++&lt;br /&gt;LEC (2-2) / wjk (2-2) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Logic and AI Programming&lt;br /&gt;LEC (3-11) / fs (3-11),klc (3-11) / 343&lt;br /&gt;&lt;br /&gt;Programming in C and C++&lt;br /&gt;TUT (2-2) / wjk (2-2) / 344&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Programming in C and C++&lt;br /&gt;LEC (2-2) / wjk (2-2) / 144&lt;br /&gt;&lt;br /&gt;Computer Systems&lt;br /&gt;LEC (3-9) / ajd (3-9) / 144&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Programming in C and C++&lt;br /&gt;TUT (2-2) / wjk (2-2) / 344&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffffff;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1500&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Programming in C and C++&lt;br /&gt;LEC (2-2) / wjk (2-2) / 145&lt;br /&gt;&lt;br /&gt;Logic and AI Programming&lt;br /&gt;LEC (6-8) / klc (6-8) / 145&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Logic and AI Programming&lt;br /&gt;TUT (3-11) / fs (3-11),klc (3-11) / 343&lt;br /&gt;&lt;br /&gt;Logic and AI Programming&lt;br /&gt;LAB (9-10) / klc (9-10) / 202,206&lt;br /&gt;&lt;br /&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br 
/&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 202&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 202&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffff99;"&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1600&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 202,206&lt;br /&gt;&lt;br /&gt;Lab Workshop (MSc CS)&lt;br /&gt;LAB (3-10) / ap7 (3-10) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (7-11) / ap7 (7-11) / 202&lt;br /&gt;&lt;br /&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-4) / ap7 (2-4) / 202&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;MSc in Computing Group Project Presentations&lt;br /&gt;Wks (8-8) / ap7 (8-8) / 308&lt;br /&gt;&lt;br /&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 202&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;tr bg valign="top" style="color:#ffffff;"&gt; &lt;td border 
style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:85%;color:#000000;"&gt;1700&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-2) / ap7 (2-2) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;Integrated Programming Laboratory&lt;br /&gt;LAB (2-4) / ap7 (2-4) / 202,206&lt;/span&gt; &lt;/td&gt; &lt;td border style="color:#000000;"&gt; &lt;span style="font-family:Arial;font-size:78%;color:#000000;"&gt;MSc in Computing Group Project Presentations&lt;br /&gt;Wks (8-8) / ap7 (8-8) / 308&lt;/span&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt; &lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-1845861816147005364?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/1845861816147005364/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=1845861816147005364' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/1845861816147005364'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/1845861816147005364'/><link rel='alternate' type='text/html' 
href='http://dbigbear.blogspot.com/2007/10/timetable.html' title='Timetable'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-9078764218164658263</id><published>2007-10-19T20:01:00.001+01:00</published><updated>2007-10-19T23:02:15.264+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Graph'/><title type='text'>ROC Graph</title><content type='html'>&lt;p&gt;ROC graphs are another way besides &lt;a href="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/confusion_matrix/confusion_matrix.html"&gt;confusion matrices&lt;/a&gt; to examine the performance of classifiers (Swets, 1988). An ROC graph is a plot with the false positive rate on the &lt;i&gt;X&lt;/i&gt; axis and the true positive rate on the &lt;i&gt;Y&lt;/i&gt; axis. &lt;span style="color: rgb(255, 0, 0);"&gt;The point (0,1) is the perfect classifier: it classifies all positive cases and negative cases correctly. It is (0,1) because the false positive rate is 0 (none), and the true positive rate is 1 (all). The point (0,0) represents a classifier that predicts all cases to be negative, while the point (1,1) corresponds to a classifier that predicts every case to be positive. 
Point (1,0) is the classifier that is incorrect for all classifications.&lt;/span&gt; In many cases, &lt;span style="color: rgb(51, 102, 255);"&gt;a classifier has a parameter that can be adjusted to increase &lt;/span&gt;&lt;i style="color: rgb(51, 102, 255);"&gt;TP&lt;/i&gt;&lt;span style="color: rgb(51, 102, 255);"&gt; at the cost of an increased &lt;/span&gt;&lt;i style="color: rgb(51, 102, 255);"&gt;FP&lt;/i&gt;&lt;span style="color: rgb(51, 102, 255);"&gt; or decrease &lt;/span&gt;&lt;i style="color: rgb(51, 102, 255);"&gt;FP&lt;/i&gt;&lt;span style="color: rgb(51, 102, 255);"&gt; at the cost of a decrease in &lt;/span&gt;&lt;i style="color: rgb(51, 102, 255);"&gt;TP&lt;/i&gt;. Each parameter setting provides a  (&lt;i&gt;FP&lt;/i&gt;, &lt;i&gt;TP&lt;/i&gt;) pair and a series of such pairs can be used to plot an ROC curve. A non-parametric classifier is represented by a single ROC point, corresponding to its (&lt;i&gt;FP&lt;/i&gt;,&lt;i&gt;TP&lt;/i&gt;) pair. &lt;/p&gt;&lt;center&gt; &lt;p&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/ROC/roc1.jpg" border="1" height="341" width="536" /&gt;&lt;/p&gt;&lt;/center&gt;  &lt;p&gt;The above figure shows an example of an ROC graph with two ROC curves labeled C1 and C2, and two ROC points labeled P1 and P2. Non-parametric algorithms produce a single ROC point for a particular data set. 
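&lt;/p&gt;&lt;p&gt;As a rough illustration of how a series of (&lt;i&gt;FP&lt;/i&gt;, &lt;i&gt;TP&lt;/i&gt;) pairs can arise from one adjustable parameter, the following Python sketch sweeps a decision threshold over a set of scored cases. The function name roc_points, the scores, the labels and the thresholds are all hypothetical and are not taken from the notes above; both rates are computed as fractions of the actual negative and actual positive cases.&lt;/p&gt;
&lt;pre&gt;
```python
def roc_points(scores, labels, thresholds):
    """Return one (FP rate, TP rate) pair per threshold setting."""
    positives = sum(labels)              # labels are 1 (positive) or 0 (negative)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # A case is predicted positive when its score reaches the threshold.
        predicted = [1 if s >= t else 0 for s in scores]
        tp = sum(p * y for p, y in zip(predicted, labels))
        fp = sum(p * (1 - y) for p, y in zip(predicted, labels))
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical classifier scores and true labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

# Lowering the threshold raises TP at the cost of a higher FP.
for fp_rate, tp_rate in roc_points(scores, labels, [1.0, 0.75, 0.5, 0.0]):
    print(f"FP rate = {fp_rate:.2f}, TP rate = {tp_rate:.2f}")
```
&lt;/pre&gt;
&lt;p&gt;In this sketch the highest threshold predicts every case to be negative and yields the point (0,0), while the lowest predicts every case to be positive and yields (1,1); the settings in between trace out the curve.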
&lt;/p&gt;&lt;p&gt;&lt;b&gt;Features of ROC Graphs&lt;/b&gt; &lt;/p&gt;&lt;ul&gt;&lt;li&gt; An ROC curve or point is independent of class distribution or error costs (Provost et al., 1998).&lt;/li&gt;&lt;li&gt; An ROC graph encapsulates all information contained in the &lt;a href="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/confusion_matrix/confusion_matrix.html"&gt;confusion matrix&lt;/a&gt;, since &lt;i&gt;FN&lt;/i&gt; is the complement of &lt;i&gt;TP&lt;/i&gt; and &lt;i&gt;TN&lt;/i&gt; is the complement of &lt;i&gt;FP&lt;/i&gt; (Swets, 1988).&lt;/li&gt;&lt;li&gt; ROC curves provide a visual tool for examining the tradeoff between the ability of a classifier to correctly identify positive cases and the number of negative cases that are incorrectly classified.&lt;/li&gt;&lt;li style="color: rgb(51, 204, 0);"&gt;The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.&lt;/li&gt;&lt;/ul&gt;  &lt;p&gt;&lt;br /&gt;&lt;b&gt;Area-based Accuracy Measure&lt;/b&gt; &lt;/p&gt;&lt;p&gt;It has been suggested that the area beneath an ROC curve can be used as a measure of accuracy in many applications (Swets, 1988). Provost and Fawcett (1997) argue that using classification accuracy to compare classifiers is not adequate unless cost and class distributions are completely unknown and a single classifier must be chosen to handle any situation. They propose a method of evaluating classifiers using an ROC graph and imprecise cost and class distribution information. &lt;/p&gt;&lt;p&gt;&lt;b&gt;Euclidean Distance Comparison&lt;/b&gt; &lt;/p&gt;&lt;p&gt;Another way of comparing ROC points is by using an equation that equates accuracy with the Euclidean distance from the perfect classifier, point (0,1) on the graph. 
We include a weight factor that allows us to define relative misclassification costs, if such information is available. We define &lt;i&gt;AC&lt;sub&gt;d&lt;/sub&gt;&lt;/i&gt; as a distance-based performance measure for an ROC point and calculate it using the equation: &lt;/p&gt;&lt;center&gt; &lt;p&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/ROC/roc2.gif" height="28" width="264" /&gt;&lt;span style="font-family:Arial;"&gt; [22]&lt;/span&gt;&lt;/p&gt;&lt;/center&gt;  &lt;p&gt;where &lt;i&gt;W&lt;/i&gt; is a factor, ranging from 0 to 1, that is used to assign relative importance to false positives and false negatives.  &lt;i&gt;AC&lt;sub&gt;d&lt;/sub&gt;&lt;/i&gt; ranges from 0 for the perfect classifier to &lt;i&gt;sqrt(2)&lt;/i&gt;  for a classifier that classifies all cases incorrectly, point (1,0) on the ROC graph. &lt;i&gt;AC&lt;sub&gt;d&lt;/sub&gt;&lt;/i&gt; differs from &lt;i&gt;g-mean&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;, &lt;i&gt;g-mean&lt;/i&gt;&lt;sub&gt;2&lt;/sub&gt; and &lt;i&gt;F&lt;/i&gt;-&lt;i&gt;measure&lt;/i&gt; in that it is equal to zero only if all cases are classified correctly. In other words, a classifier evaluated using &lt;i&gt;AC&lt;sub&gt;d&lt;/sub&gt;&lt;/i&gt; gets some credit for correct classification of negative cases, regardless of its accuracy in correctly identifying positive cases. &lt;/p&gt;&lt;p&gt;&lt;b&gt;Example&lt;/b&gt; &lt;/p&gt;&lt;p&gt;Consider two algorithms A and B that perform adequately against most data sets. However, assume both A and B misclassify all positive cases in a particular data set, and A flags 10 times as many infrequent itemsets as potentially frequent as B does. Algorithm B is the better algorithm in this case because it wastes less effort counting infrequent itemsets. 
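&lt;/p&gt;&lt;p&gt;The distance-based comparison can be sketched in Python as follows. This is only an approximation of the idea: it computes the plain, unweighted Euclidean distance from an ROC point to the perfect classifier at (0,1) and leaves out the weight factor &lt;i&gt;W&lt;/i&gt;, because the exact weighted form is given only in the equation image. The function name distance_to_perfect and the two ROC points for A and B are hypothetical values chosen to mirror the example, not figures from the notes.&lt;/p&gt;
&lt;pre&gt;
```python
import math


def distance_to_perfect(fp_rate, tp_rate):
    """Unweighted Euclidean distance from an ROC point to the point (0, 1)."""
    return math.hypot(fp_rate - 0.0, tp_rate - 1.0)


# Hypothetical ROC points: both algorithms miss every positive case
# (TP rate 0), and the false positive rate of A is ten times that of B.
point_a = (0.50, 0.0)
point_b = (0.05, 0.0)

for name, (fp_rate, tp_rate) in [("A", point_a), ("B", point_b)]:
    print(name, round(distance_to_perfect(fp_rate, tp_rate), 4))
```
&lt;/pre&gt;
&lt;p&gt;Under these assumptions the point for B lies closer to (0,1) than the point for A, so B still earns some credit for its handling of negative cases even though neither algorithm identifies any positive case. The unweighted distance is 0 at (0,1) and &lt;i&gt;sqrt(2)&lt;/i&gt; at (1,0).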
&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-9078764218164658263?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/9078764218164658263/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=9078764218164658263' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/9078764218164658263'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/9078764218164658263'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/10/roc-graph.html' title='ROC Graph'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-3164370385219965162</id><published>2007-10-19T19:59:00.000+01:00</published><updated>2007-10-19T20:00:24.258+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Graph'/><title type='text'>Cumulative Gains and Lift Charts</title><content type='html'>&lt;ul&gt;&lt;li&gt; &lt;b&gt;Lift&lt;/b&gt; is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model.&lt;/li&gt;&lt;li&gt; Cumulative gains and lift charts are visual aids for measuring model performance&lt;/li&gt;&lt;li&gt; Both charts consist of a lift curve and a baseline&lt;/li&gt;&lt;li&gt; The greater the area between the 
lift curve and the baseline, the better the model&lt;/li&gt;&lt;/ul&gt;  &lt;h3&gt; &lt;b&gt;Example Problem 1&lt;/b&gt;&lt;/h3&gt; A company wants to do a mail marketing campaign. It costs the company $1 for each item mailed. They have information on 100,000 customers. Create a cumulative gains and a lift chart from the following data. &lt;ul&gt;&lt;li&gt; &lt;b&gt;Overall Response Rate:&lt;/b&gt; If we assume we have no model other than the prediction of the overall response rate, then we can predict the number of positive responses as a fraction of the total customers contacted. Suppose the response rate is 20%. If all 100,000 customers are contacted we will receive around 20,000 positive responses.&lt;/li&gt;&lt;/ul&gt;  &lt;center&gt;&lt;table bgcolor="#fdfbe8" border="1" cellspacing="0" cols="3" width="350"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;th&gt;Cost ($)&lt;/th&gt;  &lt;th&gt;Total Customers Contacted&lt;/th&gt;  &lt;th&gt;Positive Responses&lt;/th&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;100000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;100000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;20000&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;/center&gt;  &lt;ul&gt;&lt;li&gt; &lt;b&gt;Prediction of Response Model: &lt;/b&gt;A response model predicts who will respond to a marketing campaign. If we have a response model, we can make more detailed predictions. 
For example, we use the response model to assign a score to all 100,000 customers and predict the results of contacting only the top 10,000 customers, the top 20,000 customers, etc.&lt;/li&gt;&lt;/ul&gt;  &lt;center&gt;&lt;table bgcolor="#fdfbe8" border="1" cellspacing="0" cols="3" width="350"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;th&gt;Cost ($)&lt;/th&gt;  &lt;th&gt;Total Customers Contacted&lt;/th&gt;  &lt;th&gt;Positive Responses&lt;/th&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;10000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;10000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;6000&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;20000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;20000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;10000&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;30000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;30000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;13000&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;40000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;40000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;15800&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;50000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;50000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;17000&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;60000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;60000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;18000&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;70000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;70000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;18800&lt;/div&gt; 
&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;80000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;80000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;19400&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;90000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;90000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;19800&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;100000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;100000&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;20000&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;/center&gt;  &lt;p&gt;&lt;b&gt;Cumulative Gains Chart:&lt;/b&gt; &lt;/p&gt;&lt;ul&gt;&lt;li&gt; The &lt;i&gt;y&lt;/i&gt;-axis shows the percentage of positive responses. This is a percentage of the total possible positive responses (20,000 as the overall response rate shows).&lt;/li&gt;&lt;li&gt; The &lt;i&gt;x&lt;/i&gt;-axis shows the percentage of customers contacted, which is a fraction of the 100,000 total customers.&lt;/li&gt;&lt;li&gt; &lt;b&gt;Baseline (overall response rate):&lt;/b&gt; If we contact &lt;i&gt;X&lt;/i&gt;% of customers then we will receive &lt;i&gt;X&lt;/i&gt;% of the total positive responses.&lt;/li&gt;&lt;li&gt; &lt;b&gt;Lift Curve:&lt;/b&gt; Using the predictions of the response model, calculate the percentage of positive responses for the percent of customers contacted and map these points to create the lift curve.&lt;/li&gt;&lt;/ul&gt;  &lt;center&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/lift_chart/cumulative_gains.gif" border="1" height="294" width="443" /&gt;&lt;/center&gt;  &lt;p&gt;&lt;br /&gt;&lt;b&gt;Lift Chart:&lt;/b&gt; &lt;/p&gt;&lt;ul&gt;&lt;li&gt; Shows the actual lift.&lt;/li&gt;&lt;li&gt; To plot the chart: Calculate the points on the lift curve by determining the ratio between the result 
predicted by our model and the result using no model.&lt;/li&gt;&lt;li&gt; Example: For contacting 10% of customers, using no model we should get 10% of responders and using the given model we should get 30% of responders. The &lt;i&gt;y&lt;/i&gt;-value of the lift curve at 10% is 30 / 10 = 3.&lt;/li&gt;&lt;/ul&gt;  &lt;center&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/lift_chart/lift_chart.gif" border="1" height="271" width="432" /&gt;&lt;/center&gt;  &lt;p&gt;&lt;br /&gt;&lt;b&gt;Analyzing the Charts: &lt;/b&gt;Cumulative gains and lift charts are a graphical representation of the advantage of using a predictive model to choose which customers to contact. The lift chart shows how much more likely we are to receive respondents than if we contact a random sample of customers. For example, by contacting only 10% of customers based on the predictive model we will reach 3 times as many respondents as if we use no model.&lt;br /&gt;  &lt;/p&gt;&lt;h3&gt; &lt;b&gt;Evaluating a Predictive Model&lt;/b&gt;&lt;/h3&gt; We can assess the value of a predictive model by using the model to score a set of customers and then contacting them in this order. The actual response rates are recorded for each cutoff point, such as the first 10% contacted, the first 20% contacted, etc. We create cumulative gains and lift charts using the actual response rates to see how much the predictive model would have helped in this situation. The information can be used to determine whether we should use this model or one similar to it in the future.&lt;br /&gt;  &lt;h3&gt; &lt;b&gt;Example Problem 2&lt;/b&gt;&lt;/h3&gt; Using the response model P(&lt;i&gt;x&lt;/i&gt;)=100-AGE(&lt;i&gt;x&lt;/i&gt;) for customer &lt;i&gt;x&lt;/i&gt; and the data table shown below, construct the cumulative gains and lift charts. 
Ties in ranking should be broken arbitrarily by assigning the higher rank to whoever appears first in the table.&lt;br /&gt;  &lt;center&gt;&lt;table bgcolor="#fdfbe8" border="1" cellspacing="0" cols="4"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td width="120"&gt; &lt;center&gt;&lt;b&gt;Customer Name&lt;/b&gt;&lt;/center&gt; &lt;/td&gt;  &lt;td width="60"&gt; &lt;center&gt;&lt;b&gt;Height&lt;/b&gt;&lt;/center&gt; &lt;/td&gt;  &lt;td width="60"&gt; &lt;center&gt;&lt;b&gt;Age&lt;/b&gt;&lt;/center&gt; &lt;/td&gt;  &lt;td width="120"&gt; &lt;center&gt;&lt;b&gt;Actual Response&lt;/b&gt;&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Alan&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;70&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;39&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;N&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Bob&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;72&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;21&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;Y&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Jessica&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;65&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;25&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;Y&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Elizabeth&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;62&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;30&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;Y&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Hilary&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;67&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;19&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;Y&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Fred&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;69&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; 
&lt;center&gt;48&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;N&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Alex&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;65&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;12&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;Y&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Margot&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;63&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;51&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;N&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Sean&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;71&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;65&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;Y&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Chris&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;73&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;42&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;N&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Philip&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;75&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;20&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;Y&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Catherine&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;70&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;23&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;N&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Amy&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;69&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;13&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;N&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Erin&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;68&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; 
&lt;center&gt;35&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;Y&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Trent&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;72&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;55&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;N&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Preston&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;68&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;25&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;N&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;John&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;64&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;76&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;N&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Nancy&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;64&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;24&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;Y&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Kim&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;72&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;31&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;N&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;center&gt;Laura&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;62&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;29&lt;/center&gt; &lt;/td&gt;  &lt;td&gt; &lt;center&gt;Y&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;/center&gt;  &lt;p&gt;&lt;b&gt;1.  Calculate P(&lt;i&gt;x&lt;/i&gt;) for each person &lt;i&gt;x&lt;/i&gt;&lt;/b&gt; &lt;/p&gt;&lt;p&gt;&lt;b&gt;2.  
Order the people according to rank P(&lt;i&gt;x&lt;/i&gt;)&lt;/b&gt; &lt;/p&gt;&lt;blockquote&gt; &lt;center&gt;&lt;table bgcolor="#fdfbe8" border="1" cellspacing="0" cols="3"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td width="120"&gt;&lt;b&gt;Customer Name&lt;/b&gt;&lt;/td&gt;  &lt;td width="60"&gt;&lt;b&gt;P(&lt;i&gt;x&lt;/i&gt;)&lt;/b&gt;&lt;/td&gt;  &lt;td width="60"&gt;&lt;b&gt;Actual Response&lt;/b&gt;&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Alex&lt;/td&gt;  &lt;td&gt;88&lt;/td&gt;  &lt;td&gt;Y&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Amy&lt;/td&gt;  &lt;td&gt;87&lt;/td&gt;  &lt;td&gt;N&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Hilary&lt;/td&gt;  &lt;td&gt;81&lt;/td&gt;  &lt;td&gt;Y&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Philip&lt;/td&gt;  &lt;td&gt;80&lt;/td&gt;  &lt;td&gt;Y&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Bob&lt;/td&gt;  &lt;td&gt;79&lt;/td&gt;  &lt;td&gt;Y&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Catherine&lt;/td&gt;  &lt;td&gt;77&lt;/td&gt;  &lt;td&gt;N&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Nancy&lt;/td&gt;  &lt;td&gt;76&lt;/td&gt;  &lt;td&gt;Y&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Jessica&lt;/td&gt;  &lt;td&gt;75&lt;/td&gt;  &lt;td&gt;Y&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Preston&lt;/td&gt;  &lt;td&gt;75&lt;/td&gt;  &lt;td&gt;N&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Laura&lt;/td&gt;  &lt;td&gt;71&lt;/td&gt;  &lt;td&gt;Y&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Elizabeth&lt;/td&gt;  &lt;td&gt;70&lt;/td&gt;  &lt;td&gt;Y&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Kim&lt;/td&gt;  &lt;td&gt;69&lt;/td&gt;  &lt;td&gt;N&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Erin&lt;/td&gt;  &lt;td&gt;65&lt;/td&gt;  &lt;td&gt;Y&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Alan&lt;/td&gt;  &lt;td&gt;61&lt;/td&gt;  &lt;td&gt;N&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Chris&lt;/td&gt;  &lt;td&gt;58&lt;/td&gt;  &lt;td&gt;N&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Fred&lt;/td&gt;  &lt;td&gt;52&lt;/td&gt;  
&lt;td&gt;N&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Margot&lt;/td&gt;  &lt;td&gt;49&lt;/td&gt;  &lt;td&gt;N&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Trent&lt;/td&gt;  &lt;td&gt;45&lt;/td&gt;  &lt;td&gt;N&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;Sean&lt;/td&gt;  &lt;td&gt;35&lt;/td&gt;  &lt;td&gt;Y&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;John&lt;/td&gt;  &lt;td&gt;24&lt;/td&gt;  &lt;td&gt;N&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;/center&gt; &lt;/blockquote&gt; &lt;b&gt;3.  Calculate the percentage of total responses for each cutoff point&lt;/b&gt; &lt;ul&gt;&lt;li&gt; Response Rate = Number of Responses  / Total Number of Responses (10)&lt;/li&gt;&lt;/ul&gt;  &lt;blockquote&gt; &lt;center&gt;&lt;table bgcolor="#fdfbe8" border="1" cellspacing="0" cols="3"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td width="120"&gt; &lt;center&gt;&lt;b&gt;Total Customers Contacted&lt;/b&gt;&lt;/center&gt; &lt;/td&gt;  &lt;td width="120"&gt; &lt;center&gt;&lt;b&gt;Number of Responses&lt;/b&gt;&lt;/center&gt; &lt;/td&gt;  &lt;td width="120"&gt; &lt;center&gt;&lt;b&gt;Response Rate&lt;/b&gt;&lt;/center&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;2&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;1&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;10%&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;4&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;3&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;30%&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;6&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;4&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;40%&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;8&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;6&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;60%&lt;/div&gt; &lt;/td&gt; 
&lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;10&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;7&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;70%&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;12&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;8&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;80%&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;14&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;9&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;90%&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;16&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;9&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;90%&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;18&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;9&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;90%&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt; &lt;div align="right"&gt;20&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;10&lt;/div&gt; &lt;/td&gt;  &lt;td&gt; &lt;div align="right"&gt;100%&lt;/div&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;/center&gt; &lt;/blockquote&gt; &lt;b&gt;4.  Create the cumulative gains chart:&lt;/b&gt; &lt;ul&gt;&lt;li&gt; The lift curve and the baseline have the same values at 10% and at 90%-100% of customers contacted; everywhere in between the lift curve lies above the baseline.&lt;/li&gt;&lt;/ul&gt;  &lt;blockquote&gt; &lt;center&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/lift_chart/cumulative_gains2.gif" border="1" height="301" width="427" /&gt;&lt;/center&gt; &lt;/blockquote&gt; &lt;b&gt;5.  
Create the lift chart:&lt;/b&gt; &lt;blockquote&gt; &lt;center&gt;&lt;img src="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/lift_chart/lift_chart2.gif" border="1" height="320" width="455" /&gt;&lt;/center&gt; &lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-3164370385219965162?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/3164370385219965162/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=3164370385219965162' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/3164370385219965162'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/3164370385219965162'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/10/cumulative-gains-and-lift-charts.html' title='Cumulative Gains and Lift Charts'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-8214468355995580912</id><published>2007-10-19T19:52:00.000+01:00</published><updated>2007-10-20T21:55:36.136+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Graph'/><title type='text'>Introduction to ROC Curves</title><content type='html'>&lt;h2&gt;  &lt;/h2&gt;&lt;center&gt; &lt;p&gt;&lt;a name="roccurves"&gt;&lt;span style="font-size:130%;color:RED;"&gt;About ROC curves&lt;/span&gt;&lt;/a&gt;  
&lt;/p&gt;&lt;p&gt;An ROC curve provides a graphical representation of the relationship between the true-positive and false-positive prediction rates of a model.  The y-axis corresponds to the sensitivity of the model, i.e. how well the model is able to distinguish true positives (real cleavages) from sites that are not cleaved, and the y-coordinates are calculated as:  &lt;/p&gt;&lt;center&gt;&lt;img src="http://pops.csse.monash.edu.au/equ2_Ypoint_sens.png" alt="Calculation of sensitivity." align="middle" vspace="10" /&gt;&lt;/center&gt;  &lt;p&gt;The x-axis corresponds to the specificity (expressed on the curve as 1-specificity), i.e. the ability of the model to identify true negatives. An increase in specificity (i.e. a move to the left along the x-axis) comes at the cost of a decrease in sensitivity.  The x-coordinates are calculated as:  &lt;/p&gt;&lt;center&gt;&lt;img src="http://pops.csse.monash.edu.au/equ1_Xpoint_spec.png" alt="Calculation of specificity." align="middle" vspace="10" /&gt;&lt;/center&gt;  &lt;p&gt;The greater the sensitivity at high specificity values (i.e. high y-axis values at low x-axis values) the better the model.  A numerical measure of the accuracy of the model can be obtained from the area under the curve, where an area of 1.0 signifies perfect accuracy, while an area of less than 0.5 indicates that the model is worse than random guessing.  
The quantitative-qualitative relationship between area and accuracy follows a fairly linear pattern, such that the following could be used as a guide:  &lt;/p&gt;&lt;ul&gt;&lt;li&gt;0.9-1: Excellent&lt;/li&gt;&lt;li&gt;0.8-0.9: Very good&lt;/li&gt;&lt;li&gt;0.7-0.8: Good&lt;/li&gt;&lt;li&gt;0.6-0.7: Average&lt;/li&gt;&lt;li&gt;0.5-0.6: Poor&lt;/li&gt;&lt;/ul&gt;&lt;h2&gt; &lt;/h2&gt;&lt;h2&gt;&lt;br /&gt;&lt;/h2&gt;&lt;h2&gt;&lt;span style="color: rgb(0, 128, 128);"&gt;Introduction to ROC Curves&lt;/span&gt;&lt;/h2&gt;&lt;/center&gt;  &lt;center&gt;&lt;b&gt;&lt;span style="font-family:Arial;"&gt;&lt;span style="color: rgb(0, 128, 128);"&gt;&lt;span style=""&gt;&lt;br /&gt;&lt;a href="http://gim.unmc.edu/dxtests/roc2.htm"&gt;&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/center&gt;   &lt;p&gt;&lt;img src="http://gim.unmc.edu/dxtests/distrib.jpg" align="right" height="300" width="300" /&gt; The sensitivity and specificity of a diagnostic test depend on more than just the "quality" of the test--they also depend on the definition of what constitutes an abnormal test.  Look at the idealized graph at right showing the number of patients with and without a disease arranged according to the value of a diagnostic test.  The distributions overlap--the test (like most) does not distinguish normal from disease with 100% accuracy.  The area of overlap indicates where the test cannot distinguish normal from disease.  In practice, we choose a cutpoint  (indicated by the vertical black line) above which we consider the test to be abnormal and below which we consider the test to be normal.  The position of the cutpoint will determine the number of true positives, true negatives, false positives and false negatives.  
We may wish to  use different cutpoints for different clinical situations if we wish to minimize one of the erroneous  types of test results.&lt;br /&gt;&lt;/p&gt;&lt;p&gt; We can use the hypothyroidism data from the &lt;a href="http://gim.unmc.edu/dxtests/LR.htm"&gt;likelihood ratio section &lt;/a&gt;to illustrate how sensitivity and specificity change depending on the choice of T4 level that defines hypothyroidism.  Recall the data on patients with suspected hypothyroidism reported by Goldstein and Mushlin (J Gen Intern Med 1987;2:20-24.). The data on T4 values  in hypothyroid and euthyroid patients are shown graphically (below left) and in  a simplified tabular form (below right). &lt;img src="http://gim.unmc.edu/dxtests/t4dist.jpg" align="left" height="420" width="389" /&gt; &lt;table align="right" border="1"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td&gt;&lt;b&gt;T4 value&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Hypothyroid&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Euthyroid&lt;/b&gt;&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;5 or less&lt;/td&gt;  &lt;td align="right"&gt;18&lt;/td&gt;  &lt;td align="right"&gt;1&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;5.1 - 7&lt;/td&gt;  &lt;td align="right"&gt;7&lt;/td&gt;  &lt;td align="right"&gt;17&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;7.1 - 9&lt;/td&gt;  &lt;td align="right"&gt;4&lt;/td&gt;  &lt;td align="right"&gt;36&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;9 or more&lt;/td&gt;  &lt;td align="right"&gt;3&lt;/td&gt;  &lt;td align="right"&gt;39&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="right"&gt;&lt;b&gt;Totals:&lt;/b&gt;&lt;/td&gt;  &lt;td align="right"&gt;32&lt;/td&gt;  &lt;td align="right"&gt;93&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br 
/&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Suppose that patients with T4 values of 5 or less are considered to be hypothyroid.  The data display then reduces to:&lt;br /&gt;&lt;table border="1" cols="3" height="100" width="273"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td&gt;&lt;b&gt;T4 value&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Hypothyroid&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Euthyroid&lt;/b&gt;&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;5 or less&lt;/td&gt;  &lt;td align="right"&gt;18&lt;/td&gt;  &lt;td align="right"&gt;1&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;&gt; 5&lt;/td&gt;  &lt;td align="right"&gt;14&lt;/td&gt;  &lt;td align="right"&gt;92&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="right"&gt;&lt;b&gt;Totals&lt;/b&gt;:&lt;/td&gt;  &lt;td align="right"&gt;32&lt;/td&gt;  &lt;td align="right"&gt;93&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; You should be able to verify that the sensitivity is &lt;b&gt;0.56&lt;/b&gt; (18 / 32) and the specificity is &lt;b&gt;0.99&lt;/b&gt; (92 / 93).  &lt;/p&gt;&lt;p&gt;Now, suppose we decide to make the definition of hypothyroidism less stringent and now consider patients with T4 values of 7 or less to be hypothyroid.  
The data display will now look like this:&lt;br /&gt;&lt;table border="1" cols="3" height="100" width="273"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td&gt;&lt;b&gt;T4 value&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Hypothyroid&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Euthyroid&lt;/b&gt;&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;7 or less&lt;/td&gt;  &lt;td align="right"&gt;25&lt;/td&gt;  &lt;td align="right"&gt;18&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;&gt; 7&lt;/td&gt;  &lt;td align="right"&gt;7&lt;/td&gt;  &lt;td align="right"&gt;75&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="right"&gt;&lt;b&gt;Totals&lt;/b&gt;:&lt;/td&gt;  &lt;td align="right"&gt;32&lt;/td&gt;  &lt;td align="right"&gt;93&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; You should be able to verify that the sensitivity is &lt;b&gt;0.78&lt;/b&gt; and the specificity is &lt;b&gt;0.81&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;Let's move the cutpoint for hypothyroidism one more time: &lt;table border="1" cols="3" height="100" width="273"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td&gt;&lt;b&gt;T4 value&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Hypothyroid&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Euthyroid&lt;/b&gt;&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;&amp;lt; 9&lt;/td&gt;  &lt;td align="right"&gt;29&lt;/td&gt;  &lt;td align="right"&gt;54&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;9 or more&lt;/td&gt;  &lt;td align="right"&gt;3&lt;/td&gt;  &lt;td align="right"&gt;39&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="right"&gt;&lt;b&gt;Totals&lt;/b&gt;:&lt;/td&gt;  &lt;td align="right"&gt;32&lt;/td&gt;  &lt;td align="right"&gt;93&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; You should be able to verify that the sensitivity is &lt;b&gt;0.91&lt;/b&gt; and the specificity is &lt;b&gt;0.42&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;Now, take the sensitivity and specificity values above and put them into a table: &lt;table border="1" cols="3" height="100" width="273"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td 
align="center"&gt;&lt;b&gt;Cutpoint&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Sensitivity&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Specificity&lt;/b&gt;&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="center"&gt;5&lt;/td&gt;  &lt;td align="right"&gt;0.56&lt;/td&gt;  &lt;td align="right"&gt;0.99&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="center"&gt;7&lt;/td&gt;  &lt;td align="right"&gt;0.78&lt;/td&gt;  &lt;td align="right"&gt;0.81&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="center"&gt;9&lt;/td&gt;  &lt;td align="right"&gt;0.91&lt;/td&gt;  &lt;td align="right"&gt;0.42&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; Notice that you can improve the sensitivity by moving the cutpoint to a &lt;i&gt;higher &lt;/i&gt;T4 value--that is, you can make the criterion for a positive test &lt;i&gt;less &lt;/i&gt;strict.  You can improve the specificity by moving the cutpoint to a &lt;i&gt;lower &lt;/i&gt;T4 value--that is, you can make the criterion for a positive test &lt;i&gt;more &lt;/i&gt;strict.  Thus, there is a tradeoff between sensitivity and specificity. You can change the definition of a positive test to improve one but the other will decline.  &lt;/p&gt;&lt;p&gt;The next section covers how to use the numbers we just calculated to draw and interpret an ROC curve.&lt;/p&gt;&lt;center&gt;&lt;b&gt;&lt;span style="font-family:Arial;"&gt;&lt;span style="color: rgb(0, 128, 128);"&gt;&lt;span style="font-size:0;"&gt;&lt;br /&gt;|&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;  &lt;center&gt; &lt;h2&gt;&lt;br /&gt;&lt;/h2&gt;&lt;h2&gt;&lt;span style="color: rgb(0, 128, 128);"&gt;Plotting and Interpreting an ROC Curve&lt;/span&gt;&lt;/h2&gt;&lt;/center&gt;  &lt;center&gt;&lt;b&gt;&lt;span style="font-family:Arial;"&gt;&lt;span style="color: rgb(0, 128, 128);"&gt;&lt;span style=""&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/center&gt; This section continues the hypothyroidism example started in the previous section. 
We showed that the table at left can be summarized by the operating characteristics at right:&lt;br /&gt;&lt;table align="left" border="1"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td&gt;&lt;b&gt;T4 value&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Hypothyroid&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Euthyroid&lt;/b&gt;&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;5 or less&lt;/td&gt;  &lt;td align="right"&gt;18&lt;/td&gt;  &lt;td align="right"&gt;1&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;5.1 - 7&lt;/td&gt;  &lt;td align="right"&gt;7&lt;/td&gt;  &lt;td align="right"&gt;17&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;7.1 - 9&lt;/td&gt;  &lt;td align="right"&gt;4&lt;/td&gt;  &lt;td align="right"&gt;36&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td&gt;9 or more&lt;/td&gt;  &lt;td align="right"&gt;3&lt;/td&gt;  &lt;td align="right"&gt;39&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="right"&gt;&lt;b&gt;Totals:&lt;/b&gt;&lt;/td&gt;  &lt;td align="right"&gt;32&lt;/td&gt;  &lt;td align="right"&gt;93&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;table align="right" border="1" cols="3" height="100" width="239"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td align="center"&gt;&lt;b&gt;Cutpoint&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Sensitivity&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;Specificity&lt;/b&gt;&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="center"&gt;5&lt;/td&gt;  &lt;td align="right"&gt;0.56&lt;/td&gt;  &lt;td align="right"&gt;0.99&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="center"&gt;7&lt;/td&gt;  &lt;td align="right"&gt;0.78&lt;/td&gt;  &lt;td align="right"&gt;0.81&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="center"&gt;9&lt;/td&gt;  &lt;td align="right"&gt;0.91&lt;/td&gt;  &lt;td align="right"&gt;0.42&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;p&gt;The operating characteristics (above right) can be reformulated slightly and then presented graphically as shown below to the right:&lt;/p&gt; &lt;table align="left" border="1" cols="3" 
height="100" width="239"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td align="center"&gt;&lt;b&gt;Cutpoint&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;True Positives&lt;/b&gt;&lt;/td&gt;  &lt;td&gt;&lt;b&gt;False Positives&lt;/b&gt;&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="center"&gt;5&lt;/td&gt;  &lt;td align="right"&gt;0.56&lt;/td&gt;  &lt;td align="right"&gt;0.01&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="center"&gt;7&lt;/td&gt;  &lt;td align="right"&gt;0.78&lt;/td&gt;  &lt;td align="right"&gt;0.19&lt;/td&gt; &lt;/tr&gt;  &lt;tr&gt; &lt;td align="center"&gt;9&lt;/td&gt;  &lt;td align="right"&gt;0.91&lt;/td&gt;  &lt;td align="right"&gt;0.58&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; &lt;img src="http://gim.unmc.edu/dxtests/t4roc.jpg" align="right" height="300" width="300" /&gt;&lt;br /&gt;&lt;p&gt;This type of graph is called a &lt;b&gt;Receiver Operating Characteristic curve&lt;/b&gt; (or ROC curve).   It is a plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity) for the different possible cutpoints of a diagnostic test. &lt;/p&gt;&lt;p&gt; An ROC curve demonstrates several things:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity). &lt;/li&gt;&lt;li&gt;The closer the curve follows the left-hand border and then the top border of the ROC space,  the more accurate the test. &lt;/li&gt;&lt;li&gt;The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test. &lt;/li&gt;&lt;li&gt;The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test.  You can check this out on the graph above.  Recall that the LR for T4 &gt; 9 is 0.2.  This corresponds to the far right, nearly horizontal portion of the curve. &lt;/li&gt;&lt;li&gt;The area under the curve is a measure of test accuracy.  
This is discussed further in the  &lt;a linkindex="3" href="http://gim.unmc.edu/dxtests/roc3.htm"&gt;next section&lt;/a&gt;. &lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;center&gt; &lt;h2&gt; &lt;span style="color: rgb(0, 128, 128);"&gt;The Area Under an ROC Curve&lt;/span&gt;&lt;/h2&gt;&lt;/center&gt;  &lt;p&gt;&lt;img src="http://gim.unmc.edu/dxtests/roccomp.jpg" align="right" height="300" width="300" /&gt; The graph at right shows three ROC curves representing excellent, good, and worthless tests  plotted on the same graph.  The accuracy of the test depends on how well the test separates the group being tested into those with and without the disease in question.  Accuracy is measured by the area under the ROC curve.  An area of 1 represents a perfect test; an area of .5 represents a worthless test.    A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system: &lt;/p&gt;&lt;ul&gt;&lt;li&gt;.90-1 = excellent (A) &lt;/li&gt;&lt;li&gt;.80-.90 = good (B) &lt;/li&gt;&lt;li&gt;.70-.80 = fair (C) &lt;/li&gt;&lt;li&gt;.60-.70 = poor (D) &lt;/li&gt;&lt;li&gt;.50-.60 = fail (F)&lt;/li&gt;&lt;/ul&gt; Recall the T4 data from the &lt;a linkindex="3" href="http://gim.unmc.edu/dxtests/roc2.htm"&gt;previous section&lt;/a&gt;.  The area under the T4 ROC curve is .86.  The T4 test would therefore be considered "good" at separating hypothyroid from euthyroid patients.    
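&lt;p&gt;As a rough check on the numbers above, the sketch below rebuilds the ROC points from the simplified T4 table and joins them with straight lines (the trapezoid rule). The function names &lt;code&gt;roc_points&lt;/code&gt; and &lt;code&gt;trapezoid_area&lt;/code&gt; are illustrative choices rather than anything from the original material, and the endpoints (0, 0) and (1, 1) are added by assumption. With only three cutpoints the area comes out at about 0.85, close to but not identical to the .86 quoted above, which was presumably computed from the full data.&lt;/p&gt;
&lt;pre&gt;
```python
# Counts per T4 band from the simplified table, lowest band first:
# "5 or less", "5.1 - 7", "7.1 - 9", "9 or more".
hypothyroid = [18, 7, 4, 3]
euthyroid = [1, 17, 36, 39]


def roc_points(diseased, healthy):
    """Return (false positive rate, true positive rate) pairs.

    A test is called positive when T4 falls in a band at or below the
    cutpoint. The point (0, 0) is the cutpoint below every band; the last
    band automatically yields (1, 1).
    """
    total_d, total_h = sum(diseased), sum(healthy)
    points = [(0.0, 0.0)]
    true_pos = 0
    false_pos = 0
    for d, h in zip(diseased, healthy):
        true_pos += d
        false_pos += h
        points.append((false_pos / total_h, true_pos / total_d))
    return points


def trapezoid_area(points):
    """Area under straight-line segments joining consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


points = roc_points(hypothyroid, euthyroid)
# The three interior points correspond to the cutpoints 5, 7 and 9.
for cutpoint, (fpr, tpr) in zip([5, 7, 9], points[1:4]):
    print(f"cutpoint {cutpoint}: sensitivity={tpr:.2f} specificity={1 - fpr:.2f}")
print(f"trapezoid area = {trapezoid_area(points):.2f}")
```
&lt;/pre&gt;
&lt;p&gt;The printed sensitivity and specificity values should match the operating-characteristics table from the previous section.&lt;/p&gt;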
&lt;p&gt;&lt;img src="http://gim.unmc.edu/dxtests/strep.jpg" align="right" height="300" width="300" /&gt; ROC curves can also be constructed from clinical prediction rules.  The graphs at right  come from a study of how clinical findings predict strep throat (Wigton RS, Connor JL, Centor RM. Transportability of  a decision rule for the diagnosis of streptococcal pharyngitis. Arch Intern Med. 1986;146:81-83.)  In that study, the presence of tonsillar exudate, fever, adenopathy and the absence of cough all predicted strep.  The curves were constructed by computing the sensitivity and specificity of increasing numbers of clinical findings (from 0 to 4) in  predicting strep.  The study compared patients in Virginia and Nebraska and found that the rule performed more  accurately in Virginia (area under the curve = .78) compared to Nebraska (area under the curve = .73).  These differences turn out not to be statistically significant, however.  &lt;/p&gt;&lt;p&gt;At this point, you may be wondering what this area number really means and how it is computed. The area measures &lt;b&gt;discrimination&lt;/b&gt;, that is, the ability of the test to correctly classify  those with and without the disease.  Consider the situation in which patients are already correctly classified into two groups.  You randomly pick one from the disease group and one from the no-disease group and do the test on both.  The patient with the more abnormal test result should be the one from the disease group.  The area under the curve is the percentage of randomly drawn pairs for which this is true (that is, the test correctly classifies the two patients in the random pair). &lt;/p&gt;&lt;p&gt;Computing the area is more difficult to explain and beyond the scope of this introductory material. 
Two methods are commonly used: a non-parametric method based on constructing trapezoids under the curve as an approximation of area and a parametric method using a maximum likelihood estimator to fit a smooth curve to the data points. Both methods are available as computer programs and give an estimate of area and standard error that can be used to compare different tests or the same test in different patient populations. For more on quantitative ROC analysis, see Metz CE. Basic principles of ROC analysis. Sem Nuc Med. 1978;8:283-298. &lt;/p&gt;&lt;b&gt;A final note of historical interest&lt;/b&gt;&lt;br /&gt;You may be wondering where the name "Receiver Operating Characteristic" came from.  ROC analysis  is part of a field called "Signal Detection Theory" developed during World War II for the analysis of radar images.  Radar operators had to decide whether a blip on the screen represented an enemy target, a friendly ship, or just noise.  Signal detection theory measures the ability of radar receiver operators to make these important distinctions.  Their ability to do so was called the Receiver Operating Characteristics.  It was not until the 1970s that signal detection theory was recognized as useful for interpreting medical test results. 
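&lt;p&gt;Returning to the pairwise reading of the area given earlier, it can be checked directly on the simplified T4 table. The sketch below counts, over all hypothyroid/euthyroid pairs, how often the hypothyroid patient falls in the lower (more abnormal) T4 band, scoring a tie within a band as one half. The helper name &lt;code&gt;concordance&lt;/code&gt; and the half-credit rule for ties are assumptions made for illustration. On these counts it gives about 0.85, which is also the value obtained by joining the three cutpoints of the ROC curve with straight lines.&lt;/p&gt;
&lt;pre&gt;
```python
# Counts per T4 band from the simplified table, lowest band first:
# "5 or less", "5.1 - 7", "7.1 - 9", "9 or more".
hypothyroid = [18, 7, 4, 3]
euthyroid = [1, 17, 36, 39]


def concordance(diseased, healthy):
    """Share of (diseased, healthy) pairs ranked correctly by the test.

    A pair is ranked correctly when the diseased patient sits in a lower
    T4 band than the healthy one; a tie within a band counts one half.
    """
    wins = 0
    ties = 0
    for band, d in enumerate(diseased):
        wins += d * sum(healthy[band + 1:])  # healthy patients in higher bands
        ties += d * healthy[band]            # healthy patients in the same band
    pairs = sum(diseased) * sum(healthy)
    return (wins + ties / 2) / pairs


print(f"{concordance(hypothyroid, euthyroid):.3f}")
```
&lt;/pre&gt;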
&lt;br /&gt;&lt;center&gt;&lt;b&gt;&lt;span style="font-family:Arial;"&gt;&lt;span style="color: rgb(0, 128, 128);"&gt;&lt;span style="font-size:0;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/center&gt;  &lt;/center&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-8214468355995580912?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/8214468355995580912/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=8214468355995580912' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/8214468355995580912'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/8214468355995580912'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/10/introduction-to-roc-curves.html' title='Introduction to ROC Curves'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-4297943029645179012</id><published>2007-10-19T19:50:00.000+01:00</published><updated>2007-10-19T19:51:43.021+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Regression'/><title type='text'>Logistic Regression</title><content type='html'>&lt;p align="center"&gt;&lt;b&gt;&lt;span style="font-size:180%;"&gt;Logistic Regression&lt;/span&gt;&lt;/b&gt;&lt;/p&gt; &lt;p&gt;What is the logistic curve? 
What is the base of the natural logarithm? Why do statisticians prefer logistic regression to ordinary linear regression when the DV is binary? How are probabilities, odds and logits related? What is an odds ratio? How can logistic regression be considered a linear regression? What is a loss function? What is a maximum likelihood estimate? How is the &lt;i&gt;b&lt;/i&gt; weight in logistic regression for a categorical variable related to the odds ratio of its constituent categories? &lt;/p&gt; &lt;p&gt;This chapter is difficult because there are many new concepts in it. Studying this may bring back feelings that you had in the first third of the course, when there were many new concepts each week. &lt;/p&gt; &lt;p&gt;For this chapter only, we are going to deal with a dependent variable that is binary (a categorical variable that has two values such as "yes" and "no") rather than continuous. &lt;/p&gt; &lt;p&gt;[Technical note: Logistic regression can also be applied to ordered categories (ordinal data), that is, variables with more than two ordered categories, such as what you find in many surveys. However, we won't be dealing with that in this course and you probably will never be taught it. If our dependent variable has several unordered categories (e.g., suppose our DV was state of origin in the U.S.), then we can use something called discriminant analysis, which will be taught to you in a course on multivariate statistics.]&lt;/p&gt; &lt;p&gt;It is customary to code a binary DV either 0 or 1. For example, we might code a successfully kicked field goal as 1 and a missed field goal as 0 or we might code yes as 1 and no as 0 or admitted as 1 and rejected as 0 or Cherry Garcia flavor ice cream as 1 and all other flavors as zero. If we code like this, then the mean of the distribution is equal to the proportion of 1s in the distribution. 
For example if there are 100 people in the distribution and 30 of them are coded 1, then the mean of the distribution is .30, which is the proportion of 1s. The mean of the distribution is also the &lt;i&gt;probability&lt;/i&gt; of drawing a person labeled as 1 at random from the distribution. That is, if we grab a person at random from our sample of 100 that I just described, the probability that the person will be a 1 is .30. Therefore, proportion and probability of 1 are the same in such cases. The mean of a binary distribution so coded is denoted as P, the proportion of 1s. The proportion of zeros is (1-P), which is sometimes denoted as Q. The variance of such a distribution is PQ, and the standard deviation is Sqrt(PQ). {Why can't all of stats be this easy?}&lt;/p&gt; &lt;p&gt;Suppose we want to predict whether someone is male or female (DV, M=1, F=0) using height in inches (IV). We could plot the relations between the two variables as we customarily do in regression. The plot might look something like this:&lt;/p&gt; &lt;p&gt; &lt;/p&gt; &lt;p&gt;&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo1.gif" height="412" width="551" /&gt;&lt;/p&gt; &lt;p&gt;Points to notice about the graph (data are fictional):&lt;/p&gt; &lt;ol&gt;&lt;li&gt;The regression line is a rolling average, just as in linear regression. The Y-axis is P, which indicates the proportion of 1s at any given value of height. (review graph)&lt;/li&gt;&lt;li&gt;The regression line is nonlinear. (review graph)&lt;/li&gt;&lt;li&gt;None of the observations --the raw data points-- actually fall on the regression line. They all fall on zero or one. (review graph)&lt;/li&gt;&lt;/ol&gt;  &lt;b&gt;&lt;p&gt;Why use logistic regression rather than ordinary linear regression?&lt;/p&gt; &lt;/b&gt;&lt;p&gt;When I was in graduate school, people didn't use logistic regression with a binary DV. They just used ordinary linear regression instead. 
Statisticians won the day, however, and now most psychologists use logistic regression with a binary DV for the following reasons:&lt;/p&gt; &lt;ol&gt;&lt;li&gt;If you use linear regression, the predicted values will become greater than one or less than zero if you move far enough on the X-axis. Such values are theoretically inadmissible.&lt;/li&gt;&lt;li&gt;One of the assumptions of regression is that the variance of Y is constant across values of X (homoscedasticity). This cannot be the case with a binary variable, because the variance is PQ. When 50 percent of the people are 1s, then the variance is .25, its maximum value. As we move to more extreme values, the variance decreases. When P=.10, the variance is .1*.9 = .09, so as P approaches 1 or zero, the variance approaches zero.&lt;/li&gt;&lt;li&gt;The significance testing of the &lt;i&gt;b&lt;/i&gt; weights rests upon the assumption that errors of prediction (Y-Y') are normally distributed. Because Y only takes the values 0 and 1, this assumption is pretty hard to justify, even approximately. Therefore, the tests of the regression weights are suspect if you use linear regression with a binary DV.&lt;/li&gt;&lt;/ol&gt;  &lt;p&gt;&lt;b&gt;&lt;span style="font-size:180%;"&gt;The Logistic Curve&lt;/span&gt;&lt;/b&gt;&lt;/p&gt; &lt;p&gt;The logistic curve relates the independent variable, X, to the rolling mean of the DV, P (&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo2.gif" height="23" width="19" /&gt;). 
The formula to do so may be written either&lt;/p&gt; &lt;p&gt;&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo3.gif" height="150" width="300" /&gt;&lt;/p&gt; &lt;p&gt;Or&lt;/p&gt; &lt;p&gt;&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo4.gif" height="150" width="300" /&gt;&lt;/p&gt; &lt;p&gt;where P is the probability of a 1 (the proportion of 1s, the mean of Y), &lt;i&gt;e &lt;/i&gt;is the base of the natural logarithm (about 2.718) and &lt;i&gt;a&lt;/i&gt; and &lt;i&gt;b&lt;/i&gt; are the parameters of the model. The value of &lt;i&gt;a&lt;/i&gt; yields P when X is zero, and &lt;i&gt;b&lt;/i&gt; adjusts how quickly the probability changes when X changes by a single unit (we can have standardized and unstandardized b weights in logistic regression, just as in ordinary linear regression). Because the relation between X and P is nonlinear, &lt;i&gt;b&lt;/i&gt; does not have a straightforward interpretation in this model as it does in ordinary linear regression.&lt;/p&gt; &lt;p&gt;&lt;b&gt;&lt;span style="font-size:180%;"&gt;Loss Function&lt;/span&gt;&lt;/b&gt;&lt;/p&gt; &lt;p&gt;A loss function is a measure of fit between a mathematical model of data and the actual data. We choose the parameters of our model to minimize the badness-of-fit or to maximize the goodness-of-fit of the model to the data. With least squares (the only loss function we have used thus far), we minimize SS&lt;sub&gt;res&lt;/sub&gt;, the sum of squares residual. This also happens to maximize SS&lt;sub&gt;reg&lt;/sub&gt;, the sum of squares due to regression. 
With linear or curvilinear models, there is a mathematical solution to the problem that will minimize the sum of squares, that is, &lt;/p&gt; &lt;b&gt;&lt;p&gt;b = (X'X)&lt;sup&gt;-1&lt;/sup&gt;X'y&lt;/p&gt; &lt;/b&gt;&lt;p&gt;Or&lt;/p&gt; &lt;b&gt;&lt;p&gt;&lt;span style="font-family:Symbol;"&gt;b&lt;/span&gt; = R&lt;sup&gt;-1&lt;/sup&gt;r&lt;/p&gt; &lt;/b&gt;&lt;p&gt;With some models, like the logistic curve, there is no mathematical solution that will produce least squares estimates of the parameters. For many of these models, the loss function chosen is called &lt;i&gt;maximum likelihood&lt;/i&gt;. A &lt;i&gt;likelihood&lt;/i&gt; is a conditional probability (e.g., P(Y|X), the probability of Y given X). We can pick the parameters of the model (&lt;i&gt;a&lt;/i&gt; and &lt;i&gt;b &lt;/i&gt;of the logistic curve) at random or by trial-and-error and then compute the likelihood of the data given those parameters (actually, we do better than trial-and-error, but not perfectly). We choose as our parameters those that result in the greatest likelihood computed. The estimates are called maximum likelihood because the parameters are chosen to maximize the likelihood (conditional probability of the data given parameter estimates) of the sample data. The techniques actually employed to find the maximum likelihood estimates fall under the general label &lt;i&gt;numerical analysis&lt;/i&gt;. There are several methods of numerical analysis, but they all follow a similar series of steps. First, the computer picks some initial estimates of the parameters. Then it will compute the likelihood of the data given these parameter estimates. Then it will improve the parameter estimates slightly and recalculate the likelihood of the data. It will do this forever until we tell it to stop, which we usually do when the parameter estimates do not change much (usually a change of .01 or .001 is small enough to tell the computer to stop). 
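&lt;/p&gt; &lt;p&gt;The series of steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses Newton-Raphson updates (one common choice; nothing here says which method a particular package uses), a single made-up binary predictor, and hypothetical function names (log_likelihood, fit_logistic). It stops when neither parameter changes by more than .001, or after 20 iterations.&lt;/p&gt;

```python
import math


def log_likelihood(a, b, xs, ys):
    """Log of the probability of the observed 0/1 data given a and b."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, tol=0.001, max_iter=20):
    """Newton-Raphson search for the maximum likelihood a and b."""
    a, b = 0.0, 0.0                      # initial estimates
    for _ in range(max_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g_a += y - p                 # slope of the log-likelihood in a
            g_b += (y - p) * x           # slope of the log-likelihood in b
            h_aa += w                    # curvature terms
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a, b = a + step_a, b + step_b    # improved estimates
        if tol > max(abs(step_a), abs(step_b)):
            break                        # estimates barely changed: stop
    return a, b


# Made-up data: 2 of 8 people are 1s when x = 0, and 6 of 8 when x = 1.
xs = [0] * 8 + [1] * 8
ys = [1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 0, 0]
a_hat, b_hat = fit_logistic(xs, ys)
print(a_hat, b_hat, log_likelihood(a_hat, b_hat, xs, ys))
```

&lt;p&gt;With these made-up data the estimates settle near a = -1.10 and b = 2.20, which are the log odds in the x = 0 group and the difference in log odds between the two groups.&lt;/p&gt; &lt;p&gt;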
[Sometimes we tell the computer to stop after a certain number of tries or iterations, e.g., 20 or 250. This usually indicates a problem in estimation.]&lt;/p&gt; &lt;p&gt;&lt;b&gt;&lt;span style="font-size:180%;"&gt;Where on Earth Did This Stuff Come From?&lt;/span&gt;&lt;/b&gt;&lt;/p&gt; &lt;p&gt;Suppose we only know a person's height and we want to predict whether that person is male or female. We can talk about the probability of being male or female, or we can talk about the odds of being male or female. Let's say that the probability of being male at a given height is .90. Then the odds of being male would be &lt;/p&gt; &lt;p&gt;&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo5.gif" height="90" width="180" /&gt;.&lt;/p&gt; &lt;p&gt;(Odds can also be found by counting the number of people in each group and dividing one number by the other. Clearly, the probability is not the same as the odds.) In our example, the odds would be .90/.10 or 9 to one. Now the odds of being female would be .10/.90 or 1/9 or .11. This asymmetry is unappealing, because the odds of being a male should be the opposite of the odds of being a female. We can take care of this asymmetry through the natural logarithm, ln. The natural log of 9 is 2.197 (ln(.9/.1)=2.197). The natural log of 1/9 is -2.197 (ln(.1/.9)=-2.197), so the log odds of being male is exactly opposite to the log odds of being female. The natural log function looks like this:&lt;/p&gt; &lt;p&gt;&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo6.gif" height="390" width="523" /&gt;&lt;/p&gt; &lt;p&gt; &lt;/p&gt; &lt;p&gt;Note that the natural log is zero when X is 1. When X is larger than one, the log curves up slowly. When X is less than one, the natural log is less than zero, and decreases rapidly as X approaches zero. When P = .50, the odds are .50/.50 or 1, and ln(1) =0. If P is greater than .50, ln(P/(1-P)) is positive; if P is less than .50, ln(odds) is negative. 
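&lt;/p&gt; &lt;p&gt;The relation between probability, odds and log odds described above can be checked with a short Python sketch. The helper names odds and logit are chosen here for illustration, and the probability of .90 is the one used in the example above.&lt;/p&gt;

```python
import math


def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)


def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))


p_male = 0.90
p_female = 1.0 - p_male

# The odds are asymmetric: about 9 for male, about 0.11 for female.
print(odds(p_male), odds(p_female))

# The log odds are symmetric: same size, opposite sign.
print(logit(p_male), logit(p_female))

# When P = .50 the odds are 1 and the log odds are 0.
print(logit(0.50))
```

&lt;p&gt;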
[A number taken to a negative power is one divided by that number, e.g. e&lt;sup&gt;-10&lt;/sup&gt; = 1/e&lt;sup&gt;10. &lt;/sup&gt;A logarithm is an exponent from a given base, for example ln(e&lt;sup&gt;10&lt;/sup&gt;) = 10.]&lt;/p&gt; &lt;p&gt;Back to logistic regression.&lt;/p&gt; &lt;p&gt;In logistic regression, the dependent variable is a &lt;i&gt;logit&lt;/i&gt;, which is the natural log of the odds, that is, &lt;/p&gt; &lt;p&gt;&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo7.gif" height="95" width="450" /&gt;&lt;/p&gt; &lt;p&gt;So a logit is a log of odds and odds are a function of P, the probability of a 1. In logistic regression, we find&lt;/p&gt; &lt;p&gt;logit(P) = a + bX,&lt;/p&gt; &lt;p&gt;Which is assumed to be linear, that is, the log odds (logit) is assumed to be linearly related to X, our IV. So there's an ordinary regression hidden in there. We could in theory do ordinary regression with logits as our DV, but of course, we don't have logits in there, we have 1s and 0s. Then, too, people have a hard time understanding logits. We could talk about odds instead. Of course, people like to talk about probabilities more than odds. To get there (from logits to probabilities), we first have to take the log out of both sides of the equation. Then we have to convert odds to a simple probability:&lt;/p&gt; &lt;p&gt;&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo8.gif" height="350" width="324" /&gt;&lt;/p&gt; &lt;p&gt;The simple probability is this ugly equation that you saw earlier. If log odds are linearly related to X, then the relation between X and P is nonlinear, and has the form of the S-shaped curve you saw in the graph and the function form (equation) shown immediately above.&lt;/p&gt; &lt;p&gt;&lt;b&gt;&lt;span style="font-size:180%;"&gt;An Example&lt;/span&gt;&lt;/b&gt;&lt;/p&gt; &lt;p&gt;Suppose that we are working with some doctors on heart attack patients. 
The dependent variable is whether the patient has had a second heart attack within 1 year (yes = 1). We have two independent variables. One is whether the patient completed a treatment consisting of anger control practices (yes=1). The other IV is a score on a trait anxiety scale (a higher score means more anxious).&lt;/p&gt; &lt;p&gt; Our data:&lt;/p&gt; &lt;table border="1" border cellpadding="7" cellspacing="1" width="590" style="color:#000000;"&gt; &lt;tbody&gt;&lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;Person &lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;2&lt;sup&gt;nd&lt;/sup&gt; Heart Attack&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;Treatment of Anger&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;Trait Anxiety&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;70&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;2&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;80&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;3&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;50&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;4&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; 
&lt;p&gt;60&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;5&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;40&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;6&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;65&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;7&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;75&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;8&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;80&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;9&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;70&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;10&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;60&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;11&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; 
&lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;65&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;12&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;50&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;13&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;45&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;14&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;35&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;15&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;40&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;16&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;50&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;17&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;55&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;18&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" 
width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;45&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;19&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;50&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;20&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;0&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;60&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt; &lt;/p&gt; &lt;p&gt;  Our correlation matrix:&lt;/p&gt; &lt;table border="1" border cellpadding="7" cellspacing="1" width="590" style="color:#000000;"&gt; &lt;tbody&gt;&lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt; &lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;Heart&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;Treat&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;Anx&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;Heart&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt; &lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt; &lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;Treat&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;-.30&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; 
&lt;p&gt;&lt;span style="font-size:78%;"&gt; &lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;Anx&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;.59**&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;-.23&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;1&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;Mean&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;.50&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;.45&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;57.25&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;SD&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;.51&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;.51&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;13.42&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;Note that half of our patients have had a second heart attack. Knowing nothing else about a patient, and following the best in current medical practice, we would flip a coin to predict whether they will have a second attack within 1 year. According to our correlation coefficients, those in the anger treatment group are less likely to have another attack, but the result is not significant. Greater anxiety is associated with a higher probability of another attack, and the result is significant (according to &lt;i&gt;r&lt;/i&gt;).&lt;/p&gt; &lt;p&gt;Now let's look at the logistic regression, for the moment examining the treatment of anger by itself, ignoring the anxiety test scores. 
SAS prints this:&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt;Response Variable: HEART&lt;/span&gt;&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt;Response Levels: 2&lt;/span&gt;&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt;Number of Observations: 20&lt;/span&gt;&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt;Link Function: Logit&lt;/span&gt;&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt;Response Profile&lt;/span&gt;&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt; &lt;/span&gt;&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt;Ordered&lt;/span&gt;&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt;Value HEART Count&lt;/span&gt;&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt; &lt;/span&gt;&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt;1 0 10&lt;/span&gt;&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt;2 1 10&lt;/span&gt;&lt;/p&gt; &lt;p&gt;&lt;span style="font-family:SAS Monospace;"&gt; &lt;/span&gt;&lt;/p&gt; &lt;p&gt;SAS tells us what it understands us to model, including the name of the DV, and its distribution.&lt;/p&gt; &lt;p&gt;Then we calculate probabilities with and without including the treatment variable.&lt;/p&gt; &lt;span style="font-family:SAS Monospace;"&gt;&lt;p&gt; &lt;/p&gt; &lt;p&gt;Model Fitting Information and Testing Global Null Hypothesis BETA=0&lt;/p&gt; &lt;p&gt; &lt;/p&gt; &lt;p&gt;Criterion Intercept Intercept Chi-sq&lt;/p&gt; &lt;p&gt;Only and &lt;/p&gt; &lt;p&gt;Covariates&lt;/p&gt; &lt;p&gt;-2 LOG L 27.726 25.878 1.848&lt;/p&gt; &lt;p&gt;1df (p=.17)&lt;/p&gt; &lt;/span&gt;&lt;p&gt; The computer calculates the likelihood of the data. Because there are equal numbers of people in the two groups, the probability of group membership initially (without considering anger treatment) is .50 for each person. 
Because the people are independent, the probability of the entire set of people is .50&lt;sup&gt;20&lt;/sup&gt;, a very small number. Because the number is so small, it is customary to first take the natural log of the probability and then multiply the result by -2. The latter step makes the result positive. The statistic -2LogL (minus 2 times the log of the likelihood) is a badness-of-fit indicator, that is, large numbers mean poor fit of the model to the data. SAS prints the result as -2 LOG L. For the initial model (intercept only), our result is the value 27.726. This is a baseline number indicating model fit. This number has no direct analog in linear regression. It is roughly analogous to generating some random numbers and finding R&lt;sup&gt;2&lt;/sup&gt; for these numbers as a baseline measure of fit in ordinary linear regression. By including a term for treatment, the loss function reduces to 25.878, a difference of 1.848, shown in the chi-square column. The difference between the two values of -2LogL is known as the likelihood ratio test.&lt;/p&gt; &lt;p&gt;When taken from large samples, the difference between two values of -2LogL is distributed as chi-square:&lt;/p&gt; &lt;p&gt;&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo9.gif" height="95" width="550" /&gt;&lt;/p&gt; &lt;p&gt;Recall that multiplying numbers is equivalent to adding their logarithms, and dividing numbers is equivalent to subtracting their logarithms.&lt;/p&gt; &lt;p&gt;This says that (-2LogL) for a restricted (smaller) model minus (-2LogL) for a full (larger) model is the same as -2 times the log of the ratio of the two likelihoods, which is distributed as chi-square. The full or larger model has all the parameters of interest in it. The restricted is said to be &lt;i&gt;nested&lt;/i&gt; in the larger model. The restricted model has one or more of the parameters in the full model restricted to some value (usually zero). 
The parameters in the nested model must be a proper subset of the parameters in the full model. For example, suppose we have two IVs, one categorical and one continuous, and we are looking at an ATI design. A full model could have included terms for the continuous variable, the categorical variable and their interaction (3 terms). Restricted models could delete the interaction or one or more main effects (e.g., we could have a model with only the categorical variable). A nested model cannot have as an IV some other categorical or continuous variable not contained in the full model. If it does, then it is no longer nested, and we cannot compare the two values of -2LogL to get a chi-square value. The chi-square is used to statistically test whether including a variable reduces the badness-of-fit measure. This is analogous to producing an increment in R-square in hierarchical regression. If chi-square is significant, the variable is considered to be a significant predictor in the equation, analogous to the significance of the &lt;i&gt;b&lt;/i&gt; weight in simultaneous regression. 
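&lt;/p&gt; &lt;p&gt;The two values of -2LogL and their difference can be reproduced by hand from the data table above. The sketch below is a check on the arithmetic, not a description of how SAS computes it, and it relies on a shortcut that should be read as an assumption of this sketch: with an intercept plus one binary predictor, the fitted probability in each treatment group equals the observed proportion in that group. In the data table, 3 of the 9 treated patients and 7 of the 11 untreated patients had a second heart attack. The variable and function names are made up for illustration.&lt;/p&gt;

```python
import math

# Counts read off the data table above.
treated_attack, treated_total = 3, 9
untreated_attack, untreated_total = 7, 11
n_attack = treated_attack + untreated_attack
n = treated_total + untreated_total


def group_log_lik(ones, total):
    """Log-likelihood of a group whose fitted probability is ones / total."""
    p = ones / total
    return ones * math.log(p) + (total - ones) * math.log(1.0 - p)


# Restricted model: everyone gets the same probability (intercept only).
m2ll_intercept_only = -2.0 * group_log_lik(n_attack, n)

# Full model: each treatment group gets its own probability.
m2ll_with_treatment = -2.0 * (
    group_log_lik(treated_attack, treated_total)
    + group_log_lik(untreated_attack, untreated_total)
)

chi_square = m2ll_intercept_only - m2ll_with_treatment

# Upper-tail probability of chi-square with 1 df, via the normal distribution.
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print(round(m2ll_intercept_only, 3), round(m2ll_with_treatment, 3))
print(round(chi_square, 3), round(p_value, 2))
```

&lt;p&gt;The rounded results agree with the SAS output shown earlier: 27.726 and 25.878 for -2 LOG L, a chi-square of 1.848, and p = .17.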
&lt;/p&gt; &lt;p&gt;For our example with anger treatment only, SAS produces the following:&lt;/p&gt; &lt;table border="1" bordercolor="#000000" cellpadding="7" cellspacing="1" width="583"&gt; &lt;tbody&gt;&lt;tr&gt;&lt;td colspan="8" valign="top"&gt; &lt;p&gt;Analysis of Maximum Likelihood Estimates&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="16%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;Variable&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="10%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;DF&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;Par Est&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;Std Err&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;Wald Chisq&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;Pr &gt; Chi- sq&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="12%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;Stand. 
Est&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="12%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;Odds Ratio&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="16%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;Intercept&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="10%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;1&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;-.5596&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;.6268&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;.7972&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;.3719&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="12%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="12%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="16%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;Treatment&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="10%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;1&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;1.2528&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;.9449&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;1.7566&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="13%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;.1849&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="12%"&gt; &lt;p&gt;&lt;span 
style="font-size:78%;"&gt;.3525&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="12%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt;3.50&lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;The intercept is the value of &lt;i&gt;a&lt;/i&gt;, in this case -.5596. As usual, we are not terribly interested in whether &lt;i&gt;a&lt;/i&gt; is equal to zero. The value of &lt;i&gt;b&lt;/i&gt; given for Anger Treatment is 1.2528. The chi-square associated with this &lt;i&gt;b&lt;/i&gt; is not significant, just as the chi-square for covariates was not significant. Therefore we cannot reject the hypothesis that &lt;i&gt;b&lt;/i&gt; is zero in the population. Our equation can be written as either:&lt;/p&gt; &lt;p&gt;Logit(P) = -.5596+1.2528X &lt;/p&gt; &lt;p&gt;Or &lt;/p&gt; &lt;p&gt;&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo10.gif" height="150" width="510" /&gt;&lt;/p&gt; &lt;p&gt;The main interpretation of logistic regression results is to find the significant predictors of Y. However, other things can sometimes be done with the results.&lt;/p&gt; &lt;span style="font-size:180%;"&gt;&lt;p&gt;The Odds Ratio&lt;/p&gt; &lt;/span&gt;&lt;p&gt;Recall that the odds for a group are:&lt;/p&gt; &lt;p&gt;&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo11.gif" height="90" width="180" /&gt;&lt;/p&gt; &lt;p&gt;Now the odds for another group would also be P/(1-P) for that group. 
Suppose we arrange our data in the following way:&lt;/p&gt; &lt;table border="1" bordercolor="#000000" cellpadding="7" cellspacing="1" width="590"&gt; &lt;tbody&gt;&lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt; &lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;td colspan="2" valign="top" width="50%"&gt; &lt;p align="center"&gt;Anger Treatment&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;&lt;span style="font-size:78%;"&gt; &lt;/span&gt;&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;Heart Attack&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;Yes (1)&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;No (0)&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;Total&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;Yes (1)&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;3 (a)&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;7 (b)&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;10 (a+b)&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;No (0)&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;6 (c)&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;4 (d)&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;10 (c+d)&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt;&lt;td valign="top" width="25%"&gt; &lt;p&gt;Total&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;9 (a+c)&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;11 (b+d)&lt;/p&gt;&lt;/td&gt; &lt;td valign="top" width="25%"&gt; &lt;p&gt;20 (a+b+c+d)&lt;/p&gt;&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;p&gt;Now we can compute the odds of having a heart attack for the treatment group and the no treatment group. For the treatment group, the odds are 3/6 = 1/2. 
The probability of a heart attack is 3/(3+6) = 3/9 = .33. The odds from this probability are .33/(1-.33) = .33/.67 = 1/2. The odds for the no treatment group are 7/4 or 1.75. The odds ratio is calculated to compare the odds across groups. &lt;/p&gt; &lt;p&gt;&lt;img src="http://luna.cas.usf.edu/%7Embrannic/files/regression/gifs/lo12.gif" height="105" width="270" /&gt;&lt;/p&gt; &lt;p&gt;If the odds are the same across groups, the odds ratio (OR) will be 1.0. If not, the OR will be larger or smaller than one. People like to see the ratio phrased in the larger direction. In our case, this would be 1.75/.5 or 1.75*2 = 3.50.&lt;/p&gt; &lt;p&gt;Now if we go back up to the last column of the printout, where it says Odds Ratio, and look at the Treatment row, you will see that the odds ratio is 3.50, which is what we got by finding the odds ratio for the odds from the two treatment conditions. It also happens that &lt;i&gt;e&lt;/i&gt;&lt;sup&gt;1.2528 &lt;/sup&gt;= 3.50. Note that the exponent is our value of &lt;i&gt;b&lt;/i&gt; for the logistic curve.&lt;/p&gt; &lt;span style="font-size:180%;"&gt;&lt;p&gt; &lt;/p&gt;&lt;/span&gt;  &lt;h2 style="text-align: center;" align="center"&gt;  &lt;/h2&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-4297943029645179012?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/4297943029645179012/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=4297943029645179012' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/4297943029645179012'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/2556963392609813338/posts/default/4297943029645179012'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/10/logistic-regression.html' title='Logistic Regression'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-833131232865413473</id><published>2007-10-04T12:26:00.000+01:00</published><updated>2007-10-04T12:51:26.277+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Sampling'/><title type='text'>K-fold Cross Validation</title><content type='html'>&lt;h1&gt; &lt;/h1&gt;  &lt;p&gt; Cross validation is a method for estimating the true error of a model. When a model is built from training data, the error on the training data is a rather optimistic estimate of the error rates the model will achieve on unseen data.  The aim of building a model is usually to apply the model to new, unseen data: we expect the model to generalise to data other than the training data on which it was built.  Thus, we would like to have some method for better approximating the error that might occur in general. Cross validation provides such a method.  &lt;/p&gt;&lt;p&gt; Cross validation is also used to evaluate a model in deciding which algorithm to deploy for learning, when choosing from amongst a number of learning algorithms. It can also provide a guide as to the effect of parameter tuning in building a model from a specific algorithm.  &lt;/p&gt;&lt;p&gt; Test sample cross-validation is often a preferred method when there is plenty of data available. A model is built from a training set and its predictive accuracy is measured by applying the model to a test set. 
A good rule of thumb is that a dataset is partitioned into a training set (66%) and a test set (33%).  &lt;/p&gt;&lt;p&gt; To measure error rates you might build multiple models with the one algorithm, using variations of the same training data for each model. The average performance is then the measure of how well this algorithm works in building models from the data.   &lt;/p&gt;&lt;p&gt; The basic idea is to use, say, 90% of the dataset to build a model. The data that was removed (the 10%) is then used to test the performance of the model on "new" data (usually by calculating the &lt;span class="textbf"&gt;mean squared error&lt;/span&gt;).  This simplest of cross validation approaches is referred to as the &lt;a name="7535"&gt;&lt;/a&gt;&lt;span class="textbf"&gt;holdout   method&lt;/span&gt;.  &lt;/p&gt;&lt;p&gt; For the &lt;span class="textbf"&gt;holdout method&lt;/span&gt; the two datasets are referred to as the &lt;a name="7538"&gt;&lt;/a&gt;&lt;a name="7539"&gt;&lt;/a&gt;&lt;a name="tex2html208" href="http://en.wikipedia.org/wiki/training_set"&gt;training set&lt;/a&gt; and the &lt;a name="7542"&gt;&lt;/a&gt;&lt;a name="7543"&gt;&lt;/a&gt;&lt;a name="tex2html209" href="http://en.wikipedia.org/wiki/test_set"&gt;test set&lt;/a&gt;.  With just a single evaluation, though, the variance can be high, since the evaluation depends on which data points happen to end up in the training set and which in the test set. Different partitions might lead to different results.  &lt;/p&gt;&lt;p&gt; A solution to this problem is to have multiple subsets, and each time build the model based on all but one of these subsets. This is repeated for all possible combinations and the result is reported as the average error over all models.&lt;/p&gt;&lt;br /&gt;In k-fold cross-validation, sometimes called rotation estimation, the dataset D is randomly split into k mutually exclusive subsets (the folds) D&lt;sub&gt;1&lt;/sub&gt;, D&lt;sub&gt;2&lt;/sub&gt;, ..., D&lt;sub&gt;k&lt;/sub&gt; of approximately equal size. 
The inducer is trained and tested k times; each time t in {1, 2, ..., k}, it is trained on D - D&lt;sub&gt;t&lt;/sub&gt; and tested on D&lt;sub&gt;t&lt;/sub&gt;. The cross-validation estimate of accuracy is the overall number of correct classifications, divided by the number of instances in the dataset. Formally, let D&lt;sub&gt;(i)&lt;/sub&gt; be the test set that includes instance x&lt;sub&gt;i&lt;/sub&gt; = (v&lt;sub&gt;i&lt;/sub&gt;, y&lt;sub&gt;i&lt;/sub&gt;); then the cross-validation estimate of accuracy is&lt;br /&gt;&lt;div style="text-align: left;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_1ET8wy93LPw/RwTRx0py0FI/AAAAAAAAAMs/yZwxYum7C58/s1600-h/1.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp3.blogger.com/_1ET8wy93LPw/RwTRx0py0FI/AAAAAAAAAMs/yZwxYum7C58/s320/1.JPG" alt="" id="BLOGGER_PHOTO_ID_5117445730477461586" border="0" /&gt;&lt;/a&gt;The cross-validation estimate is a random number that depends on the division into folds. Complete cross-validation is the average of all &lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_1ET8wy93LPw/RwTSi0py0GI/AAAAAAAAAM0/eushWiujMyw/s1600-h/2.JPG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp3.blogger.com/_1ET8wy93LPw/RwTSi0py0GI/AAAAAAAAAM0/eushWiujMyw/s320/2.JPG" alt="" id="BLOGGER_PHOTO_ID_5117446572291051618" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;h1&gt;&lt;a name="SECTION06830000000000000000"&gt;&lt;/a&gt;&lt;/h1&gt;possibilities for choosing m/k instances out of m, but it is usually too expensive. Except for leave-one-out (n-fold cross-validation), which is always complete, k-fold cross-validation estimates complete k-fold cross-validation using a single split of the data into the folds. 
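&lt;p&gt;As a rough illustration of the procedure described so far, the following Python sketch splits a dataset into k folds, trains on everything except one fold, tests on that fold, and divides the total number of correct classifications by the number of instances. The names k_fold_indices, majority_inducer and cross_val_accuracy are hypothetical, the data is made up, and the majority-class inducer is only a stand-in for a real learning algorithm.&lt;/p&gt;
&lt;pre&gt;
```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Randomly split indices 0..n-1 into k mutually exclusive folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[t::k] for t in range(k)]


def majority_inducer(train_labels):
    """Toy inducer (an assumption for this sketch): always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: most_common


def cross_val_accuracy(instances, labels, k, seed=0):
    """Correct classifications over all k test folds, divided by the number of instances."""
    folds = k_fold_indices(len(instances), k, seed)
    correct = 0
    for test_fold in folds:
        held_out = set(test_fold)
        # Train on D - D_t: every label whose index is not in the held-out fold.
        train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
        model = majority_inducer(train_labels)
        # Test on D_t and accumulate the number of correct predictions.
        correct += sum(1 for i in test_fold if model(instances[i]) == labels[i])
    return correct / len(instances)


# Made-up dataset: 14 instances labelled "a" and 6 labelled "b".
instances = list(range(20))
labels = ["a"] * 14 + ["b"] * 6
print(cross_val_accuracy(instances, labels, k=5))
```
&lt;/pre&gt;
&lt;p&gt;With this made-up data the majority label in every training split is "a", so the estimate is 14/20 = 0.7 whatever the random split. A real inducer would usually give a different estimate for each division into folds.&lt;/p&gt;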
Repeating cross-validation multiple times using different splits into folds provides a better Monte-Carlo estimate to the complete cross-validation at an added cost. In stratified cross-validation, the folds are stratified so that they contain approximately the same proportions of labels as the original dataset.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-833131232865413473?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/833131232865413473/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=833131232865413473' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/833131232865413473'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/833131232865413473'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/10/k-fold-cross-validation.html' title='K-fold Cross Validation'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp3.blogger.com/_1ET8wy93LPw/RwTRx0py0FI/AAAAAAAAAMs/yZwxYum7C58/s72-c/1.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-7144507589209572262</id><published>2007-09-03T13:33:00.000+01:00</published><updated>2007-09-03T14:28:09.671+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: 
Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Regression'/><title type='text'>Linear regression</title><content type='html'>&lt;p&gt;In &lt;a title="Statistics" href="/wiki/Statistics"&gt;statistics&lt;/a&gt;, &lt;b&gt;linear  regression&lt;/b&gt; is a &lt;a title="Regression analysis" href="/wiki/Regression_analysis"&gt;regression method&lt;/a&gt; that models the  relationship between a dependent variable &lt;i&gt;Y&lt;/i&gt;, independent variables  &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;, &lt;i&gt;i&lt;/i&gt; = 1, ..., &lt;i&gt;p&lt;/i&gt;, and a random term ε.  The model can be written as&lt;/p&gt; &lt;div class="thumb tright"&gt; &lt;div class="thumbinner" style="width: 182px;"&gt;&lt;a class="image" title="Example of linear regression with one dependent and one independent variable." href="/wiki/Image:LinearRegression.svg"&gt;&lt;img class="thumbimage" alt="Example of linear regression with one dependent and one independent variable." src="http://upload.wikimedia.org/wikipedia/commons/thumb/4/41/LinearRegression.svg/180px-LinearRegression.svg.png" border="0" height="144" width="180" /&gt;&lt;/a&gt;  &lt;div class="thumbcaption"&gt; &lt;div class="magnify" style="float: right;"&gt;&lt;a class="internal" title="Enlarge" href="/wiki/Image:LinearRegression.svg"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;Example of  linear regression with one dependent and one independent  variable.&lt;/div&gt;&lt;/div&gt;&lt;/div&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="Y = \beta_1  + \beta_2 X_2 +  \cdots +\beta_p X_p + \varepsilon" src="http://upload.wikimedia.org/math/e/e/0/ee02e5001acb8d4b1fd95b2855853ffd.png" /&gt;  &lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;where &lt;span class="texhtml"&gt;β&lt;sub&gt;1&lt;/sub&gt;&lt;/span&gt; is the intercept ("constant"  term), the &lt;span class="texhtml"&gt;β&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;s are the respective  parameters of independent variables, and &lt;span 
class="texhtml"&gt;&lt;i&gt;p&lt;/i&gt;&lt;/span&gt; is  the number of parameters to be estimated in the linear regression. Linear  regression can be contrasted with &lt;a title="Nonlinear regression" href="/wiki/Nonlinear_regression"&gt;nonlinear regression&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;This method is called "linear" because the relation of the response (the  dependent variable &lt;span class="texhtml"&gt;&lt;i&gt;Y&lt;/i&gt;&lt;/span&gt;) to the independent  variables is assumed to be a &lt;a title="Linear function" href="/wiki/Linear_function"&gt;linear function&lt;/a&gt; of the parameters. It is often  erroneously thought that the reason the technique is called "linear regression"  is that the graph of &lt;span class="texhtml"&gt;&lt;i&gt;Y&lt;/i&gt; = β&lt;sub&gt;0&lt;/sub&gt; +  β&lt;i&gt;x&lt;/i&gt;&lt;/span&gt; is a straight line or that &lt;span class="texhtml"&gt;&lt;i&gt;Y&lt;/i&gt;&lt;/span&gt;  is a linear function of the &lt;span class="texhtml"&gt;&lt;i&gt;X&lt;/i&gt;&lt;/span&gt; variables. 
But  if the model is (for example)&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="Y = \alpha + \beta x + \gamma x^2 + \varepsilon" src="http://upload.wikimedia.org/math/1/1/7/117319a65c1b06d3713f926c19ef084e.png" /&gt;  &lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;the problem is still one of &lt;b&gt;linear&lt;/b&gt; regression, that is, linear in  &lt;span class="texhtml"&gt;&lt;i&gt;x&lt;/i&gt;&lt;/span&gt; and &lt;span class="texhtml"&gt;&lt;i&gt;x&lt;/i&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/span&gt; respectively, even though the graph on  &lt;span class="texhtml"&gt;&lt;i&gt;x&lt;/i&gt;&lt;/span&gt; by itself is not a straight line.&lt;/p&gt;&lt;br /&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;    &lt;span style="font-size:130%;"&gt;Sample data.&lt;/span&gt;  &lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt; &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/lrdata.gif" align="left" /&gt;  Say we have a set of data,  &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/xiyi.gif" align="absmiddle" /&gt;,  shown at the left.  If we have reason to believe that  there exists a &lt;b&gt;linear relationship&lt;/b&gt; between the variables &lt;b&gt;x&lt;/b&gt; and &lt;b&gt;y&lt;/b&gt;,  we can plot the data and draw a "best-fit" &lt;i&gt;straight line&lt;/i&gt; through the data.  Of course, this relationship is governed by the familiar equation  &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/ymxb.gif" align="absbottom" /&gt;.  We can then find the &lt;b&gt;slope, &lt;i&gt;m&lt;/i&gt;,&lt;/b&gt; and &lt;b&gt;y-intercept, &lt;i&gt;b&lt;/i&gt;,&lt;/b&gt;   for the data, which are shown in the figure below.  
&lt;/span&gt;&lt;/p&gt;&lt;center&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;  &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/lrplot.gif" /&gt;  &lt;/span&gt;&lt;/center&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;   Let's enter the above data into an Excel spreadsheet,   &lt;a href="http://phoenix.phys.clemson.edu/tutorials/excel/graph.html"&gt;plot the data&lt;/a&gt;, &lt;a href="http://phoenix.phys.clemson.edu/tutorials/excel/graph.html#5"&gt;create a  trendline&lt;/a&gt; and &lt;a href="http://phoenix.phys.clemson.edu/tutorials/excel/graph.html#6"&gt;display&lt;/a&gt; its slope, y-intercept   and R-squared value.  Recall that the R-squared value is the square of the correlation coefficient.  (Most  statistical texts show the correlation coefficient as "&lt;i&gt;r&lt;/i&gt;", but  Excel shows the coefficient as "&lt;i&gt;R&lt;/i&gt;". Whether you write it as &lt;i&gt;r&lt;/i&gt;  or &lt;i&gt;R&lt;/i&gt;, the correlation coefficient gives us a measure of the reliability of   the linear relationship between the &lt;i&gt;x&lt;/i&gt; and &lt;i&gt;y&lt;/i&gt; values.  (Values  close to 1 indicate excellent linear reliability.))  &lt;/span&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt; Enter your data as we did in columns B and C.  The reason for this is strictly  cosmetic as you will soon see.  
&lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt; &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/lrxy.gif" align="absmiddle" /&gt;  &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/lrxyplot.gif" /&gt;      &lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;    &lt;a name="2"&gt;&lt;/a&gt;     &lt;/span&gt;&lt;/p&gt;&lt;hr noshade="noshade"&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;    &lt;/span&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;    &lt;span style="font-size:130%;"&gt;Linear regression equations.&lt;/span&gt;     &lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;   If we expect a set of data to have a linear correlation,     &lt;b&gt;it is not necessary for us to plot the data&lt;/b&gt; in order to     determine the constants &lt;i&gt;m&lt;/i&gt; (slope) and     &lt;i&gt;b&lt;/i&gt; (y-intercept) of     the equation &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/ymxb.gif" align="absbottom" /&gt;.     Instead, we can apply a statistical     treatment known as &lt;a href="http://phoenix.phys.clemson.edu/tutorials/regression/index.html"&gt;&lt;b&gt;linear regression&lt;/b&gt;&lt;/a&gt;  to the data and determine these constants.  
&lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt; Given a set of data &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/xiyi.gif" align="absbottom" /&gt;     with &lt;i&gt;n&lt;/i&gt; data points, the slope, y-intercept and correlation coefficient, &lt;i&gt;r&lt;/i&gt;,  can be determined using the following:    &lt;br /&gt;    &lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;    &lt;/span&gt;&lt;/p&gt;&lt;center&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;        &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/slope.gif" /&gt;         &lt;/span&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;        &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/yint.gif" /&gt;         &lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;  &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/regression.gif" /&gt;     &lt;/span&gt;&lt;/p&gt;&lt;/center&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;       &lt;br /&gt;    (Note that the limits of the summation, which are &lt;i&gt;i&lt;/i&gt; = 1 to &lt;i&gt;n&lt;/i&gt;,     and the summation indices on &lt;i&gt;x&lt;/i&gt; and &lt;i&gt;y&lt;/i&gt; have been omitted.)     
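&lt;br /&gt;&lt;br /&gt;For readers who prefer code to formula images, here is a minimal Python sketch of the same calculation. It assumes the images show the standard least-squares expressions written in terms of raw sums, which is an assumption since the images are not reproduced here. The function name linear_regression and the sample values are made up for illustration and are not the data from the figures.
&lt;pre&gt;
```python
import math


def linear_regression(xs, ys):
    """Return slope m, y-intercept b and correlation coefficient r from raw sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    # Slope: (n*S(xy) - Sx*Sy) / (n*S(x^2) - (Sx)^2)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: (Sy - m*Sx) / n
    b = (sum_y - m * sum_x) / n
    # Correlation: same numerator as the slope, normalised by both spreads.
    r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return m, b, r


# Made-up data lying exactly on y = 2x + 1 (not the data from the figures).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b, r = linear_regression(xs, ys)
print(m, b, r)
```
&lt;/pre&gt;
For this made-up data the script prints 2.0 1.0 1.0, that is, a slope of 2, a y-intercept of 1 and a perfect correlation, as expected for points that lie exactly on a line.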
&lt;/span&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;       &lt;a name="3"&gt;&lt;/a&gt;     &lt;/span&gt;&lt;/p&gt;&lt;hr noshade="noshade"&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;    &lt;/span&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt;    &lt;span style="font-size:130%;"&gt;Implicitly applying regression to the sample data.&lt;/span&gt;     &lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:arial, helvetica;font-size:85%;"&gt; It may appear that the above equations are quite complicated; however, upon inspection,  we see that their components are nothing more than simple algebraic manipulations of the  raw data.  We can expand our spreadsheet to include these components.    &lt;/span&gt;&lt;/p&gt;&lt;ol&gt;&lt;span style="font-family:arial, helvetica;font-size:85%;"&gt; &lt;li&gt;First, add   three columns that will be used to determine the quantities &lt;b&gt;xy&lt;/b&gt;,    &lt;b&gt;x&lt;sup&gt;2&lt;/sup&gt;&lt;/b&gt; and &lt;b&gt;y&lt;sup&gt;2&lt;/sup&gt;&lt;/b&gt;, for each data point.  &lt;p&gt;  &lt;/p&gt;&lt;/li&gt;&lt;li&gt;Next, use   Excel to evaluate the following:    &lt;b&gt;Σx&lt;/b&gt;,   &lt;b&gt;Σy&lt;/b&gt;,   &lt;b&gt;Σ(xy)&lt;/b&gt;,   &lt;b&gt;Σ(x&lt;sup&gt;2&lt;/sup&gt;)&lt;/b&gt;,   &lt;b&gt;Σ(y&lt;sup&gt;2&lt;/sup&gt;)&lt;/b&gt;,   &lt;b&gt;(Σx)&lt;sup&gt;2&lt;/sup&gt;&lt;/b&gt;,   &lt;b&gt;(Σy)&lt;sup&gt;2&lt;/sup&gt;&lt;/b&gt;.   Recall that the symbol, &lt;b&gt;Σ&lt;/b&gt;, means   "summation".  
Additionally, the term &lt;b&gt;xy&lt;/b&gt; is the product of &lt;b&gt;x&lt;/b&gt; and   &lt;b&gt;y&lt;/b&gt;, that is: &lt;b&gt;x * y&lt;/b&gt;.  Also, the term   &lt;b&gt;Σ(x&lt;sup&gt;2&lt;/sup&gt;)&lt;/b&gt; is very different from the term   &lt;b&gt;(Σx)&lt;sup&gt;2&lt;/sup&gt;&lt;/b&gt;.  Be careful with your order   of operations!  &lt;p&gt;  &lt;/p&gt;&lt;/li&gt;&lt;li&gt;Now use Excel to count the number of data points, &lt;b&gt;n&lt;/b&gt;.  (To do this,   use the Excel COUNT() function.  The syntax for COUNT() in this example is:   =COUNT(B3:B8) and is shown in the formula bar in the screen shot below.)  &lt;p&gt;  &lt;/p&gt;&lt;/li&gt;&lt;li&gt;Finally, use the above components and the linear regression equations    given in the previous section to calculate    the &lt;b&gt;slope (m)&lt;/b&gt;, &lt;b&gt;y-intercept (b)&lt;/b&gt; and   &lt;b&gt;correlation coefficient (r)&lt;/b&gt; of the data.  If you are careful, your   spreadsheet should look like ours.  Note that our equations for the slope,   y-intercept and correlation coefficient are highlighted in yellow.   
&lt;p&gt;   &lt;/p&gt;&lt;center&gt;    &lt;img src="http://phoenix.phys.clemson.edu/tutorials/excel/lrsheet.gif" /&gt;   &lt;/center&gt;  &lt;/li&gt;&lt;/span&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-7144507589209572262?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/7144507589209572262/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=7144507589209572262' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/7144507589209572262'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/7144507589209572262'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/09/linear-regression.html' title='Linear regression'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-3168626891640737512</id><published>2007-09-03T13:32:00.000+01:00</published><updated>2007-09-03T13:33:19.103+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Regression'/><title type='text'></title><content type='html'>&lt;h1 class="firstHeading"&gt;Linear function&lt;/h1&gt; &lt;div id="bodyContent"&gt;&lt;!-- start content --&gt; &lt;p&gt;In &lt;a title="Mathematics" href="/wiki/Mathematics"&gt;mathematics&lt;/a&gt;, 
the term  &lt;b&gt;linear function&lt;/b&gt; can refer to either of two different but related  concepts.&lt;/p&gt; &lt;p&gt;&lt;a id="Usage_in_elementary_mathematics" name="Usage_in_elementary_mathematics"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h2&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Usage in elementary mathematics&lt;/span&gt;&lt;/h2&gt; &lt;p&gt;In elementary &lt;a title="Algebra" href="/wiki/Algebra"&gt;algebra&lt;/a&gt; and &lt;a title="Analytic geometry" href="/wiki/Analytic_geometry"&gt;analytic geometry&lt;/a&gt;,  the term &lt;i&gt;linear function&lt;/i&gt; is often used to mean a first degree &lt;a title="Polynomial" href="/wiki/Polynomial"&gt;polynomial&lt;/a&gt; &lt;a title="Function (mathematics)" href="/wiki/Function_%28mathematics%29"&gt;function&lt;/a&gt; of one &lt;a title="Variable" href="/wiki/Variable"&gt;variable&lt;/a&gt;. These functions are called "linear" because  they are precisely the functions whose &lt;a title="Graph of a function" href="/wiki/Graph_of_a_function"&gt;graph&lt;/a&gt; in the &lt;a title="Cartesian coordinate plane" href="/wiki/Cartesian_coordinate_plane"&gt;Cartesian coordinate plane&lt;/a&gt; is a  straight line.&lt;/p&gt; &lt;p&gt;Such a function can be written as&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;span class="texhtml"&gt;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;) = &lt;i&gt;m&lt;/i&gt;&lt;i&gt;x&lt;/i&gt; + &lt;i&gt;b&lt;/i&gt;&lt;/span&gt;,  &lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;where &lt;span class="texhtml"&gt;&lt;i&gt;m&lt;/i&gt;&lt;/span&gt; and &lt;span class="texhtml"&gt;&lt;i&gt;b&lt;/i&gt;&lt;/span&gt; are &lt;a title="Real number" href="/wiki/Real_number"&gt;real&lt;/a&gt; &lt;a title="Constant" href="/wiki/Constant"&gt;constants&lt;/a&gt; and &lt;span class="texhtml"&gt;&lt;i&gt;x&lt;/i&gt;&lt;/span&gt; is a  real variable. 
The constant &lt;span class="texhtml"&gt;&lt;i&gt;m&lt;/i&gt;&lt;/span&gt; is often called  the &lt;a title="Slope" href="/wiki/Slope"&gt;slope&lt;/a&gt; while &lt;span class="texhtml"&gt;&lt;i&gt;b&lt;/i&gt;&lt;/span&gt; is the &lt;a title="Y-intercept" href="/wiki/Y-intercept"&gt;y-intercept&lt;/a&gt;, which gives the point of intersection  between the graph of the function and the &lt;span class="texhtml"&gt;&lt;i&gt;y&lt;/i&gt;&lt;/span&gt;-axis. Changing &lt;span class="texhtml"&gt;&lt;i&gt;m&lt;/i&gt;&lt;/span&gt;  makes the line steeper or shallower, while changing &lt;span class="texhtml"&gt;&lt;i&gt;b&lt;/i&gt;&lt;/span&gt; moves the line up or down.&lt;/p&gt; &lt;div class="thumb tright"&gt; &lt;div class="thumbinner" style="width: 302px;"&gt;&lt;a class="internal" title="Three geometric linear functions — the red and blue ones have same slope (m), while red and green ones have same y-intercept (b)." href="/wiki/Image:Linear_functions2.PNG"&gt;&lt;img class="thumbimage" alt="Three geometric linear functions — the red and blue ones have same slope (m), while red and green ones have same y-intercept (b)." 
src="http://upload.wikimedia.org/wikipedia/commons/thumb/8/80/Linear_functions2.PNG/300px-Linear_functions2.PNG" longdesc="/wiki/Image:Linear_functions2.PNG" height="301" width="300" /&gt;&lt;/a&gt;  &lt;div class="thumbcaption"&gt; &lt;div class="magnify" style="float: right;"&gt;&lt;a class="internal" title="Enlarge" href="/wiki/Image:Linear_functions2.PNG"&gt;&lt;br /&gt;&lt;/a&gt;&lt;/div&gt;Three  geometric linear functions — the red and blue ones have same slope (&lt;i&gt;m&lt;/i&gt;),  while red and green ones have same y-intercept (&lt;i&gt;b&lt;/i&gt;).&lt;/div&gt;&lt;/div&gt;&lt;/div&gt; &lt;p&gt;Examples of functions whose graph is a line include the following:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;&lt;span class="texhtml"&gt;&lt;i&gt;f&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;(&lt;i&gt;x&lt;/i&gt;) = 2&lt;i&gt;x&lt;/i&gt; + 1&lt;/span&gt;  &lt;/li&gt;&lt;li&gt;&lt;span class="texhtml"&gt;&lt;i&gt;f&lt;/i&gt;&lt;sub&gt;2&lt;/sub&gt;(&lt;i&gt;x&lt;/i&gt;) = &lt;i&gt;x&lt;/i&gt; / 2 + 1&lt;/span&gt;   &lt;/li&gt;&lt;li&gt;&lt;span class="texhtml"&gt;&lt;i&gt;f&lt;/i&gt;&lt;sub&gt;3&lt;/sub&gt;(&lt;i&gt;x&lt;/i&gt;) = &lt;i&gt;x&lt;/i&gt; / 2 − 1&lt;/span&gt;  &lt;/li&gt;&lt;/ul&gt; &lt;p&gt;The graphs of these are shown in the image at right.&lt;/p&gt; &lt;p&gt;&lt;a id="Usage_in_advanced_mathematics" name="Usage_in_advanced_mathematics"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h2&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Usage in advanced mathematics&lt;/span&gt;&lt;/h2&gt; &lt;p&gt;In advanced mathematics, a &lt;i&gt;linear function&lt;/i&gt; often means a &lt;a title="Function (mathematics)" href="/wiki/Function_%28mathematics%29"&gt;function&lt;/a&gt; that is a &lt;a title="Linear map" href="/wiki/Linear_map"&gt;linear map&lt;/a&gt;, that is, a map  between two &lt;a title="Vector space" href="/wiki/Vector_space"&gt;vector spaces&lt;/a&gt;  that preserves vector addition and &lt;a title="Scalar multiplication" 
href="/wiki/Scalar_multiplication"&gt;scalar multiplication&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;A function &lt;span class="texhtml"&gt;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;) = &lt;i&gt;m&lt;/i&gt;&lt;i&gt;x&lt;/i&gt; +  &lt;i&gt;b&lt;/i&gt;&lt;/span&gt; is a linear map if and only if &lt;span class="texhtml"&gt;&lt;i&gt;b&lt;/i&gt; =  0&lt;/span&gt;. For other values of &lt;span class="texhtml"&gt;&lt;i&gt;b&lt;/i&gt;&lt;/span&gt; this falls in  the more general class of &lt;a title="Affine map" href="/wiki/Affine_map"&gt;affine  maps&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-3168626891640737512?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/3168626891640737512/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=3168626891640737512' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/3168626891640737512'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/3168626891640737512'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/09/linear-function-in-mathematics-term.html' title=''/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-3903596348046580516</id><published>2007-09-03T13:07:00.000+01:00</published><updated>2007-09-03T13:08:19.360+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Spearman Correlation - Example</title><content type='html'>&lt;h2&gt;Example of Statistical Test using the Spearman Rank-Difference Correlation  Coefficient&lt;/h2&gt; &lt;p&gt;Is there a significant positive correlation between the rankings of 10  children on a reading test and their teacher's ranking of their reading ability?  In this problem we are relating a set of scores (interval level of measurement)  with the teacher's ranking of the children in reading (ordinal level of  measurement). To do this we first convert the reading test scores to ranks by  assigning the highest score a rank of 1, the next highest a rank of 2, etc. Now  we are looking at rankings on two variables and can use the Spearman  Rank-Difference Correlation Coefficient to test the significance of the  relationship. 
The two sets of ranks, as well as the differences between the pairs of ranks (D) and the differences squared (D&lt;sup&gt;2&lt;/sup&gt;), are shown in the following table.&lt;/p&gt;  &lt;table border="1"&gt; &lt;caption&gt;&lt;strong&gt;Worksheet to Calculate the Correlation between Students' Ranks on a Reading Test and Teacher's Ranking of the Students on Reading&lt;/strong&gt;&lt;/caption&gt; &lt;tbody&gt; &lt;tr&gt; &lt;th&gt;Reading Test&lt;br /&gt;Score Rank  &lt;/th&gt;&lt;th&gt;Teacher's Ranking&lt;br /&gt;on Reading  &lt;/th&gt;&lt;th&gt;D &lt;/th&gt; &lt;th&gt;D&lt;sup&gt;2&lt;/sup&gt; &lt;/th&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;-2&lt;/td&gt; &lt;td&gt;4&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;0&lt;/td&gt; &lt;td&gt;0&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;4&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;0&lt;/td&gt; &lt;td&gt;0&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;5&lt;/td&gt; &lt;td&gt;5&lt;/td&gt; &lt;td&gt;0&lt;/td&gt; &lt;td&gt;0&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;6&lt;/td&gt; &lt;td&gt;6&lt;/td&gt; &lt;td&gt;0&lt;/td&gt; &lt;td&gt;0&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;7&lt;/td&gt; &lt;td&gt;8&lt;/td&gt; &lt;td&gt;-1&lt;/td&gt; &lt;td&gt;1&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;8&lt;/td&gt; &lt;td&gt;7&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;1&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;9&lt;/td&gt; &lt;td&gt;10&lt;/td&gt; &lt;td&gt;-1&lt;/td&gt; &lt;td&gt;1&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;10&lt;/td&gt; &lt;td&gt;9&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;1&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th&gt;Total&lt;/th&gt; &lt;td&gt;&lt;br /&gt;&lt;/td&gt; &lt;td&gt;&lt;br /&gt;&lt;/td&gt; &lt;td&gt;12&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;p&gt;From the table we 
can see that:&lt;/p&gt; &lt;p&gt;&lt;img src="srprob1.jpeg" alt="r_S = 1 - 6(sum of D^2) / (N(N^2 - 1)) = 1 - 6(12) / (10(10^2 - 1))" height="66" width="335" /&gt;&lt;/p&gt; &lt;p&gt;&lt;img src="srprob2.jpeg" alt="r_S = 1 - 72/990 = .93" height="53" width="242" /&gt;&lt;/p&gt; &lt;p&gt;df = N - 2 = 10 - 2 = 8&lt;/p&gt; &lt;p&gt;We now have the information we need to complete the six-step process for testing statistical hypotheses for our research problem.&lt;/p&gt; &lt;ol&gt;&lt;li&gt;&lt;strong&gt;State the null hypothesis and the alternative hypothesis based on your research question&lt;/strong&gt;.  &lt;p&gt;H&lt;sub&gt;0&lt;/sub&gt;: r&lt;sub&gt;S&lt;/sub&gt; = 0&lt;/p&gt; &lt;p&gt;H&lt;sub&gt;1&lt;/sub&gt;: r&lt;sub&gt;S&lt;/sub&gt; &gt; 0&lt;/p&gt; &lt;p&gt;Note: Our null hypothesis states that there is no significant relationship between the two variables. The alternative hypothesis states that there is a significant positive correlation between the two variables. &lt;/p&gt;  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Set the alpha level&lt;/strong&gt;.&lt;br /&gt;&lt;img src="alpha05.jpeg" alt="alpha = .05" height="12" width="39" /&gt;&lt;br /&gt;Note: As usual, we will set our alpha level at .05, so we have 5 chances in 100 of making a type I error.   &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Calculate the value of the appropriate statistic. Also indicate the degrees of freedom for the statistical test if necessary.&lt;/strong&gt;  &lt;p&gt;r&lt;sub&gt;S&lt;/sub&gt; = .93&lt;/p&gt; &lt;p&gt;df = N - 2 = 10 - 2 = 8&lt;/p&gt;  &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Write the decision rule for rejecting the null hypothesis.&lt;/strong&gt;  &lt;p&gt;Reject H&lt;sub&gt;0&lt;/sub&gt; if r&lt;sub&gt;S&lt;/sub&gt; &gt;= .549&lt;/p&gt; &lt;p&gt;Note: To write the decision rule we had to know the critical value for r&lt;sub&gt;S&lt;/sub&gt;, with an alpha level of .05, and 8 degrees of freedom. 
We can do this by looking at Appendix Table B (the same table we used for the Pearson r) and noting the tabled value in the .10 column and the row for 8 df (.549).&lt;/p&gt; &lt;p&gt;Note: We used the .10 column because we are doing a one-tailed test with an alpha of .05. As noted in our problem above for the Pearson r, in the table of critical values for r, the .10 column is used for alpha = .10 (two-tailed test) and for alpha = .05 (one-tailed test).  &lt;/p&gt; &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Write a summary statement based on the decision.&lt;/strong&gt;&lt;br /&gt;Reject  H&lt;sub&gt;0&lt;/sub&gt;, p &lt; .05, one-tailed&lt;br /&gt;Note: Since our calculated value of r&lt;sub&gt;S&lt;/sub&gt; (.93) is greater than .549, we reject the null hypothesis and accept the alternative hypothesis.   &lt;/li&gt;&lt;li&gt;&lt;strong&gt;Write a statement of results in standard English.&lt;/strong&gt;&lt;br /&gt;There is a significant positive correlation between the children's ranks on a reading test and their teacher's ranking of them on reading.&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-3903596348046580516?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/3903596348046580516/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=3903596348046580516' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/3903596348046580516'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/3903596348046580516'/><link rel='alternate' type='text/html' 
href='http://dbigbear.blogspot.com/2007/09/spearman-correlation-example.html' title='Spearman Correlation - Example'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-8379974359569218312</id><published>2007-09-03T13:04:00.000+01:00</published><updated>2007-09-03T13:05:15.613+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Choosing the Proper Statistical Test</title><content type='html'>&lt;h2&gt;&lt;br /&gt;&lt;/h2&gt; &lt;p&gt;Let's finish our discussion of inferential statistics with a summary of all  the inferential statistics we have discussed and look at the conditions under  which we would use each of these statistics. Generally if we know the number of  groups or samples in our research design and the level of measurement of the  dependent variable we will know which inferential statistic to use.&lt;/p&gt;&lt;p&gt;First  let us look at statistical hypotheses in research designs where the dependent  variable is at the interval or ratio level. These statistics are known as  parametric statistics and we have used the following:&lt;/p&gt; &lt;ul&gt;&lt;li&gt;If we are testing a statistical hypothesis, involving a single score (we are  comparing the score with the population mean) we will use the &lt;strong&gt;z-score  test&lt;/strong&gt; (see lesson 9).  
&lt;/li&gt;&lt;li&gt;If we are testing a statistical hypothesis involving a single group (we are comparing the mean of the group with the population mean) and the standard deviation of the population is known, we use the &lt;strong&gt;z test&lt;/strong&gt; (see lesson 10).  &lt;/li&gt;&lt;li&gt;If we are testing a statistical hypothesis involving a single group (we are comparing the mean of the group with the population mean) and the standard deviation of the population is not known, we use the &lt;strong&gt;single sample t-test&lt;/strong&gt; (see lesson 10).  &lt;/li&gt;&lt;li&gt;If we are testing a statistical hypothesis involving two groups of subjects (we are comparing the means of the two groups) and the two groups are independent of one another, we use the &lt;strong&gt;independent t-test&lt;/strong&gt; (see lesson 11).  &lt;/li&gt;&lt;li&gt;If we are testing a statistical hypothesis involving two groups of subjects (we are comparing the means of the two groups) and the two groups are dependent on one another (pretest/posttest or matched samples), we use the &lt;strong&gt;dependent t-test&lt;/strong&gt; (see lesson 12).  &lt;/li&gt;&lt;li&gt;If we are testing a statistical hypothesis involving three or more groups of subjects (we are comparing the means of three or more groups) and there is a single dependent variable in the study, we use &lt;strong&gt;one-way analysis of variance&lt;/strong&gt; (see lesson 13).  &lt;/li&gt;&lt;li&gt;If we are testing a statistical hypothesis involving the relationship between two variables for one sample (we are measuring the relationship between the two variables) and the data is at the interval or ratio level of measurement, we use the &lt;strong&gt;Pearson product moment correlation coefficient&lt;/strong&gt;. &lt;/li&gt;&lt;/ul&gt; &lt;p&gt;We also looked at two other statistics we could use with data that is not at the interval or ratio level of measurement. 
These statistics are called non-parametric statistics.&lt;/p&gt; &lt;ul&gt;&lt;li&gt;If we are testing a statistical hypothesis for one, two, or more groups with one or two variables where the data is categorical (frequencies), that is, at the nominal level of measurement, we use &lt;strong&gt;chi-square&lt;/strong&gt; (see lesson 14). We have discussed three different variants of the chi-square statistic.  &lt;ol&gt;&lt;li&gt;one variable chi-square with equal expected frequencies  &lt;/li&gt;&lt;li&gt;one variable chi-square with unequal (predetermined) expected frequencies  &lt;/li&gt;&lt;li&gt;two variable chi-square &lt;/li&gt;&lt;/ol&gt; &lt;/li&gt;&lt;li&gt;If we are testing a statistical hypothesis involving the relationship between two variables for one sample (we are measuring the relationship between the two variables) and the data is at the ordinal level of measurement (ranks), we use the &lt;strong&gt;Spearman rank-difference correlation coefficient&lt;/strong&gt; (see lesson 15). &lt;/li&gt;&lt;/ul&gt; &lt;p&gt;The information we have discussed above can be put into the following table. The table also includes other statistics that we have not covered in this course. If you think you may need one of the statistics we did not cover in your research design, please send e-mail to the instructor and I will give you a reference to the calculation and interpretation of that statistic. 
I wish you the best as you complete the final examination for this course and as you apply the information from this course to your own research design.&lt;/p&gt; &lt;table border="1"&gt; &lt;caption&gt;&lt;strong&gt;Selecting a Statistical Test&lt;/strong&gt;&lt;/caption&gt; &lt;tbody&gt; &lt;tr&gt; &lt;th rowspan="3"&gt;Level of&lt;br /&gt;Measurement&lt;/th&gt; &lt;th colspan="5"&gt;Sample Characteristics&lt;/th&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th rowspan="2"&gt;One-Sample&lt;br /&gt;Statistical&lt;br /&gt;Tests&lt;/th&gt; &lt;th colspan="2"&gt;Two-Sample&lt;br /&gt;Statistical&lt;br /&gt;Tests&lt;/th&gt; &lt;th rowspan="2"&gt;Multiple Sample&lt;br /&gt;Statistical&lt;br /&gt;Tests&lt;/th&gt; &lt;th rowspan="2"&gt;Measures of&lt;br /&gt;Association&lt;br /&gt;(one-sample, more&lt;br /&gt;than one  variable)&lt;/th&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th&gt;Independent&lt;br /&gt;Samples&lt;/th&gt; &lt;th&gt;Non-independent&lt;br /&gt;Samples&lt;/th&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th&gt;Nominal or&lt;br /&gt;Categorical&lt;br /&gt;(frequencies)&lt;/th&gt; &lt;th&gt;Chi-Square&lt;/th&gt; &lt;th&gt;Chi-Square&lt;/th&gt; &lt;th&gt;McNemar&lt;br /&gt;Change Test&lt;/th&gt; &lt;th&gt;Chi-Square&lt;/th&gt; &lt;th&gt;Phi Coefficient&lt;/th&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th&gt;Ordinal&lt;br /&gt;(Ranks)&lt;/th&gt; &lt;th&gt;Kolmogorov-Smirnov&lt;br /&gt;One-Sample&lt;br /&gt;Test&lt;/th&gt; &lt;th&gt;Mann-Whitney&lt;br /&gt;U-Test&lt;/th&gt; &lt;th&gt;Wilcoxon&lt;br /&gt;Matched Pairs&lt;br /&gt;Signed-Rank&lt;br /&gt;Test&lt;/th&gt; &lt;th&gt;Kruskal-Wallis&lt;br /&gt;One-Way&lt;br /&gt;Analysis of&lt;br /&gt;Variance&lt;/th&gt; &lt;th&gt;Spearman rho&lt;br /&gt;r&lt;sub&gt;S&lt;/sub&gt;&lt;/th&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th&gt;Interval&lt;br /&gt;or Ratio&lt;/th&gt; &lt;th&gt;Z test&lt;br /&gt;&lt;br /&gt;One-Sample&lt;br /&gt;t-Test&lt;/th&gt; &lt;th&gt;Independent&lt;br /&gt;t-test&lt;/th&gt; &lt;th&gt;Dependent&lt;br /&gt;t-test&lt;/th&gt; 
&lt;th&gt;Simple&lt;br /&gt;Analysis of Variance&lt;br /&gt;&lt;br /&gt;Factorial&lt;br /&gt;Analysis of  Variance&lt;br /&gt;&lt;br /&gt;Scheffe Tests&lt;br /&gt;&lt;br /&gt;Analysis of Covariance&lt;/th&gt; &lt;th&gt;Pearson r&lt;br /&gt;&lt;br /&gt;Multiple&lt;br /&gt;Regression&lt;/th&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-8379974359569218312?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/8379974359569218312/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=8379974359569218312' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/8379974359569218312'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/8379974359569218312'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/09/choosing-proper-statistical-test.html' title='Choosing the Proper Statistical Test'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-4779551698643443465</id><published>2007-09-01T19:00:00.000+01:00</published><updated>2007-09-01T19:08:43.776+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Spearman Correlation</title><content 
type='html'>&lt;b&gt;Spearman's rank correlation coefficient&lt;/b&gt;, named after &lt;a href="http://en.wikipedia.org/wiki/Charles_Spearman" title="Charles Spearman"&gt;Charles Spearman&lt;/a&gt; and often denoted by the Greek letter &lt;a href="http://en.wikipedia.org/wiki/Rho" title="Rho"&gt;ρ&lt;/a&gt; (rho), is a &lt;a href="http://en.wikipedia.org/wiki/Non-parametric_statistics" title="Non-parametric statistics"&gt;non-parametric&lt;/a&gt; measure of &lt;a href="http://en.wikipedia.org/wiki/Correlation" title="Correlation"&gt;correlation&lt;/a&gt; – that is, it assesses how well an arbitrary &lt;a href="http://en.wikipedia.org/wiki/Monotonic" title="Monotonic"&gt;monotonic&lt;/a&gt; function could describe the relationship between two &lt;a href="http://en.wikipedia.org/wiki/Variable" title="Variable"&gt;variables&lt;/a&gt;, without making any assumptions about the &lt;a href="http://en.wikipedia.org/wiki/Frequency_distribution" title="Frequency distribution"&gt;frequency distribution&lt;/a&gt; of the variables. Unlike the &lt;a href="http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient" title="Pearson product-moment correlation coefficient"&gt;Pearson product-moment correlation coefficient&lt;/a&gt;, it does not require the assumption that the relationship between the variables is &lt;a href="http://en.wikipedia.org/wiki/Linear_equation" title="Linear equation"&gt;linear&lt;/a&gt;, nor does it require the variables to be measured on &lt;a href="http://en.wikipedia.org/wiki/Interval_measurement" title="Interval measurement"&gt;interval scales&lt;/a&gt;; it can be used for variables measured at the &lt;a href="http://en.wikipedia.org/wiki/Ordinal_measurement" title="Ordinal measurement"&gt;ordinal&lt;/a&gt; level. 
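&lt;p&gt;To make the contrast with the Pearson coefficient concrete, here is a minimal Python sketch; it is an illustration, not part of the original article. The helper names &lt;code&gt;average_ranks&lt;/code&gt;, &lt;code&gt;pearson&lt;/code&gt; and &lt;code&gt;spearman&lt;/code&gt; are assumptions made for this example. It assumes ranks are assigned in ascending order, tied values share the mean of their positions, and ρ is computed as the Pearson coefficient of the ranks. For a monotonic but non-linear relationship (y equal to x cubed), ρ is 1 while the Pearson coefficient of the raw values stays below 1.&lt;/p&gt;
&lt;pre&gt;
```python
from math import sqrt


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_position = (start + end) / 2 + 1  # positions are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_position
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho computed as Pearson's r on the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]   # monotonic, but not linear
print(spearman(x, y))     # 1.0
print(pearson(x, y))      # roughly 0.94
```
&lt;/pre&gt;
&lt;p&gt;Because the sketch goes through the ranks rather than a rank-difference shortcut, it also handles tied values.&lt;/p&gt;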
&lt;p&gt;In principle, ρ is simply a special case of the Pearson product-moment coefficient in which the data are converted to &lt;a href="http://en.wikipedia.org/wiki/Ranking" title="Ranking"&gt;rankings&lt;/a&gt; before calculating the coefficient. In practice, however, a simpler procedure is normally used to calculate ρ. The &lt;a href="http://en.wikipedia.org/wiki/Raw_score" title="Raw score"&gt;raw scores&lt;/a&gt; are converted to ranks, and the differences &lt;i&gt;d&lt;/i&gt; between the ranks of each observation on the two variables are calculated.&lt;/p&gt; &lt;p&gt;If there are no tied ranks, i.e. &lt;img class="tex" alt="\neg\exists_{i,j} i\ne j \wedge (x_i=x_j \vee y_i=y_j)" src="http://upload.wikimedia.org/math/5/6/8/56857339544aab555e669c230f11822e.png" /&gt;&lt;/p&gt; &lt;p&gt;then ρ is given by:&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt=" \rho = 1- {\frac {6 \sum d_i^2}{n(n^2 - 1)}}" src="http://upload.wikimedia.org/math/5/a/a/5aa69d11142e11a759432e16f7d21970.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;where:&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;i&gt;&lt;span class="texhtml"&gt;&lt;i&gt;d&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;&lt;/i&gt; = the difference between each rank of corresponding values of &lt;i&gt;x&lt;/i&gt; and &lt;i&gt;y&lt;/i&gt;, and&lt;/dd&gt;&lt;/dl&gt; &lt;dl&gt;&lt;dd&gt;&lt;i&gt;&lt;span class="texhtml"&gt;&lt;i&gt;n&lt;/i&gt;&lt;/span&gt;&lt;/i&gt; = the number of pairs of values.&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;If tied ranks exist, classic Pearson's &lt;a href="http://en.wikipedia.org/wiki/Correlation_coefficient" title="Correlation coefficient"&gt;correlation coefficient&lt;/a&gt; between ranks has to be used instead of this formula. You have to assign the same rank to each of the equal values. 
It is the average of their positions in the sorted order of the values (descending order in the example below):&lt;/p&gt; &lt;p&gt;&lt;br /&gt;&lt;b&gt;An Example of Averaging Ranks&lt;/b&gt;&lt;/p&gt; &lt;table class="wikitable"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;th&gt;Variable&lt;/th&gt; &lt;th&gt;Position in the descending order&lt;/th&gt; &lt;th&gt;Rank&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;0.8&lt;/td&gt; &lt;td&gt;5&lt;/td&gt; &lt;td&gt;5&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;1.2&lt;/td&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;&lt;img class="tex" alt="\frac{4+3}{2}=3.5\ " src="http://upload.wikimedia.org/math/c/1/a/c1a18727e19acf06316f5ffd6266c944.png" /&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;1.2&lt;/td&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;&lt;img class="tex" alt="\frac{4+3}{2}=3.5\ " src="http://upload.wikimedia.org/math/c/1/a/c1a18727e19acf06316f5ffd6266c944.png" /&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;2.3&lt;/td&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;2&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;18&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; &lt;p&gt;Spearman's rank correlation coefficient is equivalent to the Pearson correlation computed on ranks. The formula above is a shortcut for its product-moment form that assumes there are no ties. The product-moment form can be used in both tied and untied cases.&lt;/p&gt; &lt;p&gt;A version of this correlation is called &lt;b&gt;&lt;a href="http://en.wikipedia.org/wiki/Spearman%27s_rho" title="Spearman's rho"&gt;Spearman's rho&lt;/a&gt;&lt;/b&gt;. 
In this case ranks are calculated as above, but in the formula of Pearson's correlation the standard deviation is taken as if there were no ties.&lt;/p&gt; &lt;p&gt;Another popular method for computing rank correlation is the &lt;a href="http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient" title="Kendall tau rank correlation coefficient"&gt;Kendall tau rank correlation coefficient&lt;/a&gt;.&lt;/p&gt;&lt;h2&gt;&lt;span class="mw-headline"&gt;Example&lt;/span&gt;&lt;/h2&gt; &lt;p&gt;The raw data used in this example are shown below.&lt;/p&gt; &lt;table class="wikitable"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td&gt;IQ&lt;/td&gt; &lt;td&gt;Hours of TV per week&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;106&lt;/td&gt; &lt;td&gt;7&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;86&lt;/td&gt; &lt;td&gt;0&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;100&lt;/td&gt; &lt;td&gt;27&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;101&lt;/td&gt; &lt;td&gt;50&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;99&lt;/td&gt; &lt;td&gt;28&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;103&lt;/td&gt; &lt;td&gt;29&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;97&lt;/td&gt; &lt;td&gt;20&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;113&lt;/td&gt; &lt;td&gt;12&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;112&lt;/td&gt; &lt;td&gt;6&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;110&lt;/td&gt; &lt;td&gt;17&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; &lt;p&gt;The first step is to sort this data by the first column. Next, two more columns are created, one holding the ranks of each of the first two columns. Notice how the rank of values that are the same is the mean of what their ranks would otherwise be. Then a column "d" is created to hold the differences between the two rank columns. Finally another column "d&lt;sup&gt;2&lt;/sup&gt;" should be created. 
This is just column d squared.&lt;/p&gt; &lt;p&gt;After doing this process with the example data you should end up with something like:&lt;/p&gt; &lt;table class="wikitable"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;td&gt;IQ (i)&lt;/td&gt; &lt;td&gt;Hours of TV per week (t)&lt;/td&gt; &lt;td&gt;rank (i)&lt;/td&gt; &lt;td&gt;rank (t)&lt;/td&gt; &lt;td&gt;d&lt;/td&gt; &lt;td&gt;d&lt;sup&gt;2&lt;/sup&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;86&lt;/td&gt; &lt;td&gt;0&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;1&lt;/td&gt; &lt;td&gt;0&lt;/td&gt; &lt;td&gt;0&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;97&lt;/td&gt; &lt;td&gt;20&lt;/td&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;6&lt;/td&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;16&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;99&lt;/td&gt; &lt;td&gt;28&lt;/td&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;8&lt;/td&gt; &lt;td&gt;5&lt;/td&gt; &lt;td&gt;25&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;100&lt;/td&gt; &lt;td&gt;27&lt;/td&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;7&lt;/td&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;9&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;101&lt;/td&gt; &lt;td&gt;50&lt;/td&gt; &lt;td&gt;5&lt;/td&gt; &lt;td&gt;10&lt;/td&gt; &lt;td&gt;5&lt;/td&gt; &lt;td&gt;25&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;103&lt;/td&gt; &lt;td&gt;29&lt;/td&gt; &lt;td&gt;6&lt;/td&gt; &lt;td&gt;9&lt;/td&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;9&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;106&lt;/td&gt; &lt;td&gt;7&lt;/td&gt; &lt;td&gt;7&lt;/td&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;4&lt;/td&gt; &lt;td&gt;16&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;110&lt;/td&gt; &lt;td&gt;17&lt;/td&gt; &lt;td&gt;8&lt;/td&gt; &lt;td&gt;5&lt;/td&gt; &lt;td&gt;3&lt;/td&gt; &lt;td&gt;9&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;112&lt;/td&gt; &lt;td&gt;6&lt;/td&gt; &lt;td&gt;9&lt;/td&gt; &lt;td&gt;2&lt;/td&gt; &lt;td&gt;7&lt;/td&gt; &lt;td&gt;49&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;113&lt;/td&gt; &lt;td&gt;12&lt;/td&gt; &lt;td&gt;10&lt;/td&gt; 
&lt;td&gt;4&lt;/td&gt; &lt;td&gt;6&lt;/td&gt; &lt;td&gt;36&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; &lt;p&gt;The values in the d&lt;sup&gt;2&lt;/sup&gt; column can now be added to find &lt;img class="tex" alt="\sum d_i^2 = 194" src="http://upload.wikimedia.org/math/b/2/c/b2c1f7b2a9bbbd418e93f807a5d57d9d.png" /&gt;. The value of n is 10. These values can now be substituted back into the equation,&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt=" \rho = 1- {\frac {6\times194}{10(10^2 - 1)}}" src="http://upload.wikimedia.org/math/f/f/f/fffe33920a3beae88df198f877f13467.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;which evaluates to &lt;span class="texhtml"&gt;ρ = −0.175758&lt;/span&gt;. If there are ties in the original values, this formula should not be used. Instead, the Pearson correlation coefficient should be calculated on the ranks (where tied values are given averaged ranks, as described above).&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-4779551698643443465?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/4779551698643443465/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=4779551698643443465' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/4779551698643443465'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/4779551698643443465'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/09/spearman-correlation.html' title='Spearman Correlation'/><author><name>Xiong  (Johnny) 
Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-8110418151540707988</id><published>2007-09-01T18:58:00.000+01:00</published><updated>2007-09-02T12:53:13.402+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Syllabus'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Statistics: Another Syllabus</title><content type='html'>&lt;table class="navbox collapsible autocollapse nowraplinks" id="collapsibleTable0" style="margin: auto;"&gt;&lt;tbody&gt;&lt;tr&gt; &lt;th style="width: 100%; text-align: center;" colspan="2"&gt;&lt;span style="font-weight: normal; float: right; width: 6em; text-align: right;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;div style="float: left; width: 6em; text-align: left;"&gt; &lt;div class="noprint plainlinksneverexpand" style="padding: 0px; font-weight: normal; font-size: xx-small; color: rgb(0, 0, 0); white-space: nowrap; background-color: transparent;"&gt;&lt;a title="Template:Statistics" href="http://www.blogger.com/wiki/Template:Statistics"&gt;&lt;span title="View this template"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/a&gt;&lt;a class="external text" title="http://en.wikipedia.org/w/index.php?title=Template:Statistics&amp;action=edit" href="http://en.wikipedia.org/w/index.php?title=Template:Statistics&amp;amp;action=edit" rel="nofollow"&gt;&lt;span title="You can edit this template. Please use the preview button before saving." 
style="color: rgb(0, 43, 184);"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;span style="font-size:110;"&gt;&lt;a title="Statistics" href="http://www.blogger.com/wiki/Statistics"&gt;Statistics&lt;/a&gt;&lt;/span&gt;&lt;/th&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th style="white-space: nowrap;"&gt;&lt;a title="Descriptive statistics" href="http://www.blogger.com/wiki/Descriptive_statistics"&gt;Descriptive statistics&lt;/a&gt;&lt;/th&gt; &lt;td style="width: 100%;"&gt;&lt;a title="Mean" href="http://www.blogger.com/wiki/Mean"&gt;Mean&lt;/a&gt; (&lt;a title="Arithmetic mean" href="http://www.blogger.com/wiki/Arithmetic_mean"&gt;Arithmetic&lt;/a&gt;, &lt;a title="Geometric mean" href="http://www.blogger.com/wiki/Geometric_mean"&gt;Geometric&lt;/a&gt;) - &lt;a title="Median" href="http://www.blogger.com/wiki/Median"&gt;Median&lt;/a&gt; - &lt;a title="Mode (statistics)" href="http://www.blogger.com/wiki/Mode_%28statistics%29"&gt;Mode&lt;/a&gt; - &lt;a title="Statistical power" href="http://www.blogger.com/wiki/Statistical_power"&gt;Power&lt;/a&gt; - &lt;a title="Variance" href="http://www.blogger.com/wiki/Variance"&gt;Variance&lt;/a&gt; - &lt;a title="Standard deviation" href="http://www.blogger.com/wiki/Standard_deviation"&gt;Standard deviation&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th style="white-space: nowrap;"&gt;&lt;a title="Statistical inference" href="http://www.blogger.com/wiki/Statistical_inference"&gt;Inferential statistics&lt;/a&gt;&lt;/th&gt; &lt;td style="background: rgb(238, 238, 238) none repeat scroll 0% 50%; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; width: 100%;"&gt;&lt;a title="Statistical hypothesis testing" href="http://www.blogger.com/wiki/Statistical_hypothesis_testing"&gt;Hypothesis testing&lt;/a&gt; - &lt;a title="Statistical significance" href="http://www.blogger.com/wiki/Statistical_significance"&gt;Significance&lt;/a&gt; - &lt;a title="Null hypothesis" 
href="http://www.blogger.com/wiki/Null_hypothesis"&gt;Null hypothesis&lt;/a&gt;/&lt;a title="Alternate hypothesis" href="http://www.blogger.com/wiki/Alternate_hypothesis"&gt;Alternate  hypothesis&lt;/a&gt; - &lt;a title="Type I and type II errors" href="http://www.blogger.com/wiki/Type_I_and_type_II_errors"&gt;Error&lt;/a&gt; - &lt;a title="Z-test" href="http://www.blogger.com/wiki/Z-test"&gt;Z-test&lt;/a&gt; - &lt;a title="Student's t-test" href="http://www.blogger.com/wiki/Student%27s_t-test"&gt;Student's t-test&lt;/a&gt; - &lt;a title="Maximum likelihood" href="http://www.blogger.com/wiki/Maximum_likelihood"&gt;Maximum  likelihood&lt;/a&gt; - &lt;a title="Standard score" href="http://www.blogger.com/wiki/Standard_score"&gt;Standard  score/Z score&lt;/a&gt; - &lt;a title="P-value" href="http://www.blogger.com/wiki/P-value"&gt;P-value&lt;/a&gt; - &lt;a title="Analysis of variance" href="http://www.blogger.com/wiki/Analysis_of_variance"&gt;Analysis of  variance&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th style="white-space: nowrap;"&gt;&lt;a title="Survival analysis" href="http://www.blogger.com/wiki/Survival_analysis"&gt;Survival analysis&lt;/a&gt;&lt;/th&gt; &lt;td style="width: 100%;"&gt;&lt;a title="Survival function" href="http://www.blogger.com/wiki/Survival_function"&gt;Survival function&lt;/a&gt; - &lt;a title="Kaplan-Meier estimator" href="http://www.blogger.com/wiki/Kaplan-Meier_estimator"&gt;Kaplan-Meier&lt;/a&gt; - &lt;a title="Logrank test" href="http://www.blogger.com/wiki/Logrank_test"&gt;Logrank test&lt;/a&gt; - &lt;a title="Failure rate" href="http://www.blogger.com/wiki/Failure_rate"&gt;Failure rate&lt;/a&gt; - &lt;a title="Proportional hazards models" href="http://www.blogger.com/wiki/Proportional_hazards_models"&gt;Proportional hazards  models&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th style="white-space: nowrap;"&gt;&lt;a title="Probability distribution" href="http://www.blogger.com/wiki/Probability_distribution"&gt;Probability 
distributions&lt;/a&gt;&lt;/th&gt; &lt;td style="background: rgb(238, 238, 238) none repeat scroll 0% 50%; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; width: 100%;"&gt;&lt;a title="Normal distribution" href="http://www.blogger.com/wiki/Normal_distribution"&gt;Normal (bell curve)&lt;/a&gt; - &lt;a title="Poisson distribution" href="http://www.blogger.com/wiki/Poisson_distribution"&gt;Poisson&lt;/a&gt; - &lt;a title="Bernoulli distribution" href="http://www.blogger.com/wiki/Bernoulli_distribution"&gt;Bernoulli&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th style="white-space: nowrap;"&gt;&lt;strong class="selflink"&gt;Correlation&lt;/strong&gt;&lt;/th&gt; &lt;td style="width: 100%;"&gt;&lt;a title="Pearson product-moment correlation coefficient" href="http://www.blogger.com/wiki/Pearson_product-moment_correlation_coefficient"&gt;Pearson  product-moment correlation coefficient&lt;/a&gt; - &lt;a title="Rank correlation" href="http://www.blogger.com/wiki/Rank_correlation"&gt;Rank correlation&lt;/a&gt; (&lt;a title="Spearman's rank correlation coefficient" href="http://www.blogger.com/wiki/Spearman%27s_rank_correlation_coefficient"&gt;Spearman's rank  correlation coefficient&lt;/a&gt;, &lt;a title="Kendall tau rank correlation coefficient" href="http://www.blogger.com/wiki/Kendall_tau_rank_correlation_coefficient"&gt;Kendall tau rank  correlation coefficient&lt;/a&gt;)&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;th style="white-space: nowrap;"&gt;&lt;a title="Regression analysis" href="http://www.blogger.com/wiki/Regression_analysis"&gt;Regression analysis&lt;/a&gt;&lt;/th&gt; &lt;td style="background: rgb(238, 238, 238) none repeat scroll 0% 50%; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; width: 100%;"&gt;&lt;a title="Linear regression" href="http://www.blogger.com/wiki/Linear_regression"&gt;Linear regression&lt;/a&gt; - 
&lt;a title="Nonlinear regression" href="http://www.blogger.com/wiki/Nonlinear_regression"&gt;Nonlinear  regression&lt;/a&gt; - &lt;a title="Logistic regression" href="http://www.blogger.com/wiki/Logistic_regression"&gt;Logistic regression&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-8110418151540707988?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/8110418151540707988/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=8110418151540707988' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/8110418151540707988'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/8110418151540707988'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/09/statistics-another-syllabus.html' title='Statistics: Another Syllabus'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-1546690679032298555</id><published>2007-09-01T18:00:00.000+01:00</published><updated>2007-09-01T18:02:39.139+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>STATISTICAL METHODS FOR RESEARCH WORKERS</title><content type='html'>&lt;p 
style="text-align: left;"&gt;&lt;span style="font-weight: bold;"&gt;Referenced from &lt;/span&gt;&lt;i&gt;&lt;a href="mailto:christo@yorku.ca"&gt;&lt;i&gt;Christopher D. Green&lt;/i&gt;&lt;/a&gt;'s Website: &lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/"&gt;http://psychclassics.yorku.ca/Fisher/Methods/&lt;/a&gt;&lt;br /&gt;&lt;/i&gt;&lt;/p&gt;&lt;p align="center"&gt;&lt;br /&gt;&lt;/p&gt;&lt;p align="center"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;p align="center"&gt;&lt;b&gt;STATISTICAL METHODS FOR RESEARCH WORKERS&lt;/b&gt;&lt;/p&gt; &lt;p align="center"&gt;&lt;b&gt;By Ronald A. Fisher (1925)&lt;/b&gt;&lt;br /&gt;Originally published in Edinburgh by Oliver and Boyd.&lt;/p&gt; &lt;p align="right"&gt;&lt;i&gt;&lt;span style="font-size:85%;"&gt;Posted March 2000&lt;br /&gt;Modified Sept. 2005&lt;/span&gt;&lt;/i&gt;&lt;/p&gt; &lt;hr /&gt;&lt;dir&gt; &lt;dir&gt; &lt;dir&gt; &lt;dir&gt; &lt;dir&gt; &lt;dir&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/TitlePage.jpg"&gt;TITLE PAGE&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/preface.htm"&gt;PREFACES&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/ToC.jpg"&gt;Image of original TABLE OF CONTENTS&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/chap1.htm"&gt;I. INTRODUCTION&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/chap2.htm"&gt;II. DIAGRAMS&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/chap3.htm"&gt;III. DISTRIBUTIONS&lt;/a&gt;&lt;/p&gt; &lt;p&gt;Image of TECHNICAL APPENDIX: &lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/p74.jpg"&gt;p. 74&lt;/a&gt;, &lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/p75.jpg"&gt;p. 75&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/chap4.htm"&gt;IV. 
TESTS OF GOODNESS OF FIT, INDEPENDENCE AND HOMOGENEITY; WITH TABLE OF χ&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/chap5.htm"&gt;V. TESTS OF SIGNIFICANCE OF &lt;i&gt;MEANS&lt;/i&gt;, DIFFERENCE OF MEANS, AND REGRESSION COEFFICIENTS&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/chap6.htm"&gt;VI. THE CORRELATION COEFFICIENT&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/chap7.htm"&gt;VII. INTRACLASS CORRELATIONS AND THE ANALYSIS OF VARIANCE&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/chap8.htm"&gt;VIII. FURTHER APPLICATIONS OF THE ANALYSIS OF VARIANCE&lt;/a&gt; &lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/sources.htm"&gt;SOURCES USED FOR DATA AND METHODS INDEX&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/findex.htm"&gt;INDEX&lt;/a&gt;&lt;/p&gt; &lt;/dir&gt; &lt;/dir&gt; &lt;/dir&gt; &lt;/dir&gt; &lt;/dir&gt; &lt;/dir&gt;  &lt;p align="center"&gt;&lt;b&gt;TABLES&lt;/b&gt;&lt;/p&gt;&lt;dir&gt; &lt;dir&gt; &lt;dir&gt; &lt;dir&gt; &lt;dir&gt; &lt;dir&gt;   &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/tabI-II.gif"&gt;I. and II. NORMAL DISTRIBUTION&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/tabIII.gif"&gt;III. TABLE OF χ&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; &lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/tabIV.gif"&gt;IV. TABLE OF &lt;i&gt;t&lt;/i&gt;&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/tabVa.gif"&gt;V.&lt;span style="font-size:85%;"&gt;A&lt;/span&gt;. 
CORRELATION COEFFICIENT -- SIGNIFICANT VALUES&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/tabVb.gif"&gt;V.&lt;span style="font-size:85%;"&gt;B&lt;/span&gt;. CORRELATION COEFFICIENT -- TRANSFORMED VALUES&lt;/a&gt;&lt;/p&gt; &lt;p&gt;&lt;a href="http://psychclassics.yorku.ca/Fisher/Methods/tabVI.gif"&gt;VI. TABLE OF &lt;i&gt;z&lt;/i&gt;&lt;/a&gt; &lt;/p&gt;&lt;/dir&gt; &lt;/dir&gt; &lt;/dir&gt; &lt;/dir&gt; &lt;/dir&gt; &lt;/dir&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-1546690679032298555?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/1546690679032298555/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=1546690679032298555' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/1546690679032298555'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/1546690679032298555'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/09/statistical-methods-for-research.html' title='STATISTICAL METHODS FOR RESEARCH WORKERS'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-7070943366603082236</id><published>2007-09-01T17:11:00.000+01:00</published><updated>2007-09-01T18:00:46.966+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Correlation ratio</title><content type='html'>&lt;p&gt;In &lt;a title="Statistics" href="/wiki/Statistics"&gt;statistics&lt;/a&gt;, the  &lt;b&gt;correlation ratio&lt;/b&gt; is a measure of the relationship between the &lt;a title="Statistical dispersion" href="/wiki/Statistical_dispersion"&gt;statistical  dispersion&lt;/a&gt; within individual categories and the dispersion across the whole  population or sample.&lt;/p&gt; &lt;p&gt;Suppose each observation is &lt;i&gt;y&lt;sub&gt;xi&lt;/sub&gt;&lt;/i&gt; where &lt;i&gt;x&lt;/i&gt; indicates  the category that observation is in and &lt;i&gt;i&lt;/i&gt; is the label of the particular  observation. We will write &lt;i&gt;n&lt;sub&gt;x&lt;/sub&gt;&lt;/i&gt; for the number of observations  in category &lt;i&gt;x&lt;/i&gt; (not necessarily the same for different values of &lt;i&gt;x&lt;/i&gt;)  and&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="\overline{y}_x=\frac{\sum_i y_{xi}}{n_x}" src="http://upload.wikimedia.org/math/e/e/5/ee59cf85abfbf51b0daa5434b5336278.png" /&gt;  and &lt;img class="tex" alt="\overline{y}=\frac{\sum_x n_x \overline{y}_x}{\sum_x n_x}" src="http://upload.wikimedia.org/math/8/5/c/85cdba32e6db511e53fd6c93180921eb.png" /&gt;  &lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;then the correlation ratio η (&lt;a title="Eta (letter)" href="/wiki/Eta_%28letter%29"&gt;eta&lt;/a&gt;) is defined so as to satisfy&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="\eta^2 = \frac{\sum_x n_x (\overline{y}_x-\overline{y})^2}{\sum_{xi} (y_{xi}-\overline{y})^2}" src="http://upload.wikimedia.org/math/5/a/5/5a52596090c9550e3e088b474cf2c8ec.png" /&gt;  &lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;which might be written as&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="\frac{{\sigma_{\overline{y}}}^2}{{\sigma_{y}}^2}." 
src="http://upload.wikimedia.org/math/6/7/f/67fa9d139d9c12b10e1b3bd36453094e.png" /&gt;  &lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;It is worth noting that if the relationship between values of &lt;img class="tex" alt="x \;\ " src="http://upload.wikimedia.org/math/d/8/1/d811816626f6498ac3931fa1b6dc87fd.png" /&gt;  and values of &lt;img class="tex" alt="\overline{y}_x" src="http://upload.wikimedia.org/math/9/4/4/9444cc806d0d9967242aeb152cbd7d3d.png" /&gt;  is linear (which is certainly true when there are only two possibilities for  &lt;i&gt;x&lt;/i&gt;), η&lt;sup&gt;2&lt;/sup&gt; equals the square of the &lt;a title="Correlation coefficient" href="/wiki/Correlation_coefficient"&gt;correlation  coefficient&lt;/a&gt;; otherwise the correlation ratio is larger in magnitude than the correlation coefficient,  though still no more than 1. It can therefore be used for judging  non-linear relationships.&lt;/p&gt; &lt;div class="printfooter"&gt;Retrieved from "&lt;a href="http://en.wikipedia.org/wiki/Correlation_ratio"&gt;http://en.wikipedia.org/wiki/Correlation_ratio&lt;/a&gt;"&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-7070943366603082236?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/7070943366603082236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=7070943366603082236' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/7070943366603082236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/7070943366603082236'/><link 
rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/09/correlation-ratio.html' title='Correlation ratio'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-6174438248247291789</id><published>2007-09-01T17:07:00.000+01:00</published><updated>2007-09-01T19:23:33.537+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Correlation, Pearson Correlation</title><content type='html'>&lt;p&gt;In &lt;a href="http://en.wikipedia.org/wiki/Probability_theory" title="Probability theory"&gt;probability theory&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Statistics" title="Statistics"&gt;statistics&lt;/a&gt;, &lt;b&gt;correlation&lt;/b&gt;, also called &lt;b&gt;correlation coefficient&lt;/b&gt;, indicates the strength and direction of a linear relationship between two &lt;a href="http://en.wikipedia.org/wiki/Random_variables" title="Random variables"&gt;random variables&lt;/a&gt;. In general statistical usage, &lt;i&gt;correlation&lt;/i&gt; or co-relation refers to the departure of two variables from independence. In this broad sense there are several coefficients, measuring the degree of correlation, adapted to the nature of data.&lt;/p&gt; &lt;p&gt;A number of different coefficients are used for different situations. 
The best known is the &lt;a href="http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient" title="Pearson product-moment correlation coefficient"&gt;Pearson product-moment correlation coefficient&lt;/a&gt;, which is obtained by dividing the &lt;a href="http://en.wikipedia.org/wiki/Covariance" title="Covariance"&gt;covariance&lt;/a&gt; of the two variables by the product of their &lt;a href="http://en.wikipedia.org/wiki/Standard_deviation" title="Standard deviation"&gt;standard deviations&lt;/a&gt;. Despite its name, it was first introduced by &lt;a href="http://en.wikipedia.org/wiki/Francis_Galton" title="Francis Galton"&gt;Francis Galton&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;&lt;a name="Pearson.27s_product-moment_coefficient" id="Pearson.27s_product-moment_coefficient"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h2&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Pearson's product-moment coefficient&lt;/span&gt;&lt;/h2&gt; &lt;dl&gt;&lt;dd&gt; &lt;div class="noprint"&gt;&lt;i&gt;Main article: &lt;a href="http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient" title="Pearson product-moment correlation coefficient"&gt;Pearson product-moment correlation coefficient&lt;/a&gt;&lt;/i&gt;&lt;/div&gt; &lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;&lt;a name="Mathematical_properties" id="Mathematical_properties"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h3&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Mathematical properties&lt;/span&gt;&lt;/h3&gt; &lt;p&gt;The correlation coefficient ρ&lt;sub&gt;&lt;i&gt;X, Y&lt;/i&gt;&lt;/sub&gt; between two &lt;a href="http://en.wikipedia.org/wiki/Random_variables" title="Random variables"&gt;random variables&lt;/a&gt; &lt;i&gt;X&lt;/i&gt; and &lt;i&gt;Y&lt;/i&gt; with &lt;a 
href="http://en.wikipedia.org/wiki/Expected_value" title="Expected value"&gt;expected values&lt;/a&gt; μ&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt; and μ&lt;sub&gt;&lt;i&gt;Y&lt;/i&gt;&lt;/sub&gt; and &lt;a href="http://en.wikipedia.org/wiki/Standard_deviation" title="Standard deviation"&gt;standard deviations&lt;/a&gt; σ&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt; and σ&lt;sub&gt;&lt;i&gt;Y&lt;/i&gt;&lt;/sub&gt; is defined as:&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="\rho_{X,Y}={\mathrm{cov}(X,Y) \over \sigma_X \sigma_Y} ={E((X-\mu_X)(Y-\mu_Y)) \over \sigma_X\sigma_Y}," src="http://upload.wikimedia.org/math/7/6/b/76b133327e4c98581ddb42a557522731.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;where &lt;i&gt;E&lt;/i&gt; is the &lt;a href="http://en.wikipedia.org/wiki/Expected_value" title="Expected value"&gt;expected value&lt;/a&gt; operator and cov means &lt;a href="http://en.wikipedia.org/wiki/Covariance" title="Covariance"&gt;covariance&lt;/a&gt;. Since μ&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt; = E(&lt;i&gt;X&lt;/i&gt;), σ&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt;&lt;sup&gt;2&lt;/sup&gt; = E(&lt;i&gt;X&lt;/i&gt;&lt;sup&gt;2&lt;/sup&gt;) − E&lt;sup&gt;2&lt;/sup&gt;(&lt;i&gt;X&lt;/i&gt;) and likewise for &lt;i&gt;Y&lt;/i&gt;, we may also write&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="\rho_{X,Y}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-E^2(X)}~\sqrt{E(Y^2)-E^2(Y)}}." src="http://upload.wikimedia.org/math/2/0/6/2061fa9538c1c341e87863dcdba91233.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;The correlation is defined only if both of the standard deviations are finite and both of them are nonzero. 
It is a corollary of the &lt;a href="http://en.wikipedia.org/wiki/Cauchy-Schwarz_inequality" title="Cauchy-Schwarz inequality"&gt;Cauchy-Schwarz inequality&lt;/a&gt; that the correlation cannot exceed 1 in &lt;a href="http://en.wikipedia.org/wiki/Absolute_value" title="Absolute value"&gt;absolute value&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The correlation is 1 in the case of an exact increasing linear relationship, −1 in the case of an exact decreasing linear relationship, and some value in between in all other cases, indicating the degree of &lt;a href="http://en.wikipedia.org/wiki/Linear_dependence" title="Linear dependence"&gt;linear dependence&lt;/a&gt; between the variables. The closer the coefficient is to either −1 or 1, the stronger the linear relationship between the variables.&lt;/p&gt; &lt;p&gt;If the variables are &lt;a href="http://en.wikipedia.org/wiki/Statistical_independence" title="Statistical independence"&gt;independent&lt;/a&gt; then the correlation is 0, but the converse is not true because the correlation coefficient detects only linear dependencies between two variables. Here is an example: Suppose the random variable &lt;i&gt;X&lt;/i&gt; is uniformly distributed on the interval from −1 to 1, and &lt;i&gt;Y&lt;/i&gt; = &lt;i&gt;X&lt;/i&gt;&lt;sup&gt;2&lt;/sup&gt;. Then &lt;i&gt;Y&lt;/i&gt; is completely determined by &lt;i&gt;X&lt;/i&gt;, so that &lt;i&gt;X&lt;/i&gt; and &lt;i&gt;Y&lt;/i&gt; are dependent, but their correlation is zero, because cov(&lt;i&gt;X&lt;/i&gt;, &lt;i&gt;Y&lt;/i&gt;) = E(&lt;i&gt;X&lt;/i&gt;&lt;sup&gt;3&lt;/sup&gt;) − E(&lt;i&gt;X&lt;/i&gt;)E(&lt;i&gt;X&lt;/i&gt;&lt;sup&gt;2&lt;/sup&gt;) = 0 when the distribution of &lt;i&gt;X&lt;/i&gt; is symmetric about zero; they are &lt;a href="http://en.wikipedia.org/wiki/Uncorrelated" title="Uncorrelated"&gt;uncorrelated&lt;/a&gt;. 
However, in the special case when &lt;i&gt;X&lt;/i&gt; and &lt;i&gt;Y&lt;/i&gt; are &lt;a href="http://en.wikipedia.org/wiki/Bivariate_Gaussian_distribution" title="Bivariate Gaussian distribution"&gt;jointly normal&lt;/a&gt;, independence is equivalent to uncorrelatedness.&lt;/p&gt; &lt;p&gt;A correlation between two variables is diluted in the presence of measurement error around estimates of one or both variables, in which case &lt;a href="http://en.wikipedia.org/wiki/Disattenuation" title="Disattenuation"&gt;disattenuation&lt;/a&gt; provides a more accurate coefficient.&lt;/p&gt; &lt;div class="thumb tright"&gt; &lt;div class="thumbinner" style="width: 182px;"&gt;&lt;a href="http://en.wikipedia.org/wiki/Image:Corr-example.png" class="internal" title="Positive linear correlations between 1000 pairs of numbers.  The data are graphed on the lower left and their correlation coefficients listed on the upper right.  Each square in the upper right corresponds to its mirror-image square in the lower left, the &amp;quot;mirror&amp;quot; being the diagonal of the whole array.  Each set of points correlates maximally with itself, as shown on the diagonal (all correlations = +1)."&gt;&lt;img alt="Positive linear correlations between 1000 pairs of numbers.  The data are graphed on the lower left and their correlation coefficients listed on the upper right.  Each square in the upper right corresponds to its mirror-image square in the lower left, the &amp;quot;mirror&amp;quot; being the diagonal of the whole array.  Each set of points correlates maximally with itself, as shown on the diagonal (all correlations = +1)." 
longdesc="/wiki/Image:Corr-example.png" class="thumbimage" src="http://upload.wikimedia.org/wikipedia/en/thumb/c/c4/Corr-example.png/180px-Corr-example.png" height="180" width="180" /&gt;&lt;/a&gt; &lt;div class="thumbcaption"&gt; Positive linear correlations between 1000 pairs of numbers. The data are graphed on the lower left and their correlation coefficients listed on the upper right. Each square in the upper right corresponds to its mirror-image square in the lower left, the "mirror" being the diagonal of the whole array. Each set of points correlates maximally with itself, as shown on the diagonal (all correlations = +1).&lt;/div&gt; &lt;/div&gt; &lt;/div&gt;&lt;p&gt;&lt;a name="The_sample_correlation" id="The_sample_correlation"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h3&gt;&lt;span class="editsection"&gt;&lt;/span&gt; &lt;span class="mw-headline"&gt;The sample correlation&lt;/span&gt;&lt;/h3&gt; &lt;p&gt;If we have a series of &lt;i&gt;n&lt;/i&gt; measurements of &lt;i&gt;X&lt;/i&gt; and &lt;i&gt;Y&lt;/i&gt; written as &lt;i&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; and &lt;i&gt;y&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt;, where &lt;i&gt;i&lt;/i&gt; = 1, 2, ..., &lt;i&gt;n&lt;/i&gt;, then the &lt;a href="http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient" title="Pearson product-moment correlation coefficient"&gt;Pearson product-moment correlation coefficient&lt;/a&gt; can be used to estimate the correlation of &lt;i&gt;X&lt;/i&gt; and &lt;i&gt;Y&lt;/i&gt;. The Pearson coefficient is also known as the "sample correlation coefficient". 
It is especially important if &lt;i&gt;X&lt;/i&gt; and &lt;i&gt;Y&lt;/i&gt; are both &lt;a href="http://en.wikipedia.org/wiki/Normal_distribution" title="Normal distribution"&gt;normally distributed&lt;/a&gt;. The Pearson correlation coefficient is then the best estimate of the correlation of &lt;i&gt;X&lt;/i&gt; and &lt;i&gt;Y&lt;/i&gt;. The Pearson correlation coefficient is written:&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="r_{xy}=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{(n-1) s_x s_y}," src="http://upload.wikimedia.org/math/7/9/7/79757f6f48de27ae2260b3989ef7726f.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;where &lt;img class="tex" alt="\bar{x}" src="http://upload.wikimedia.org/math/8/4/7/84790e2b15a305120bc3fbeb4a4eeb4f.png" /&gt; and &lt;img class="tex" alt="\bar{y}" src="http://upload.wikimedia.org/math/1/0/b/10b9fdacffcecc3574e9306610427486.png" /&gt; are the sample &lt;a href="http://en.wikipedia.org/wiki/Arithmetic_mean" title="Arithmetic mean"&gt;means&lt;/a&gt; of &lt;i&gt;X&lt;/i&gt; and &lt;i&gt;Y&lt;/i&gt;, &lt;i&gt;s&lt;/i&gt;&lt;sub&gt;&lt;i&gt;x&lt;/i&gt;&lt;/sub&gt; and &lt;i&gt;s&lt;/i&gt;&lt;sub&gt;&lt;i&gt;y&lt;/i&gt;&lt;/sub&gt; are the sample &lt;a href="http://en.wikipedia.org/wiki/Standard_deviation" title="Standard deviation"&gt;standard deviations&lt;/a&gt; of &lt;i&gt;X&lt;/i&gt; and &lt;i&gt;Y&lt;/i&gt;, and the sum is from &lt;i&gt;i&lt;/i&gt; = 1 to &lt;i&gt;n&lt;/i&gt;. As with the population correlation, we may rewrite this as&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="r_{xy}=\frac{n\sum x_iy_i-\sum x_i\sum y_i} {\sqrt{n\sum x_i^2-(\sum x_i)^2}~\sqrt{n\sum y_i^2-(\sum y_i)^2}}." src="http://upload.wikimedia.org/math/0/d/8/0d8118f654bb2e789dd23f490859cf0d.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;Again, as is true with the population correlation, the absolute value of the sample correlation must be less than or equal to 1. 
Though the above formula conveniently suggests a single-pass algorithm for calculating sample correlations, it is notorious for its numerical instability, because it subtracts sums that can be large and nearly equal (see below for something more accurate).&lt;/p&gt; &lt;p&gt;The square of the sample correlation coefficient, which is also known as the &lt;a href="http://en.wikipedia.org/wiki/Coefficient_of_determination" title="Coefficient of determination"&gt;coefficient of determination&lt;/a&gt;, is the fraction of the variance in &lt;i&gt;y&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; that is accounted for by a linear fit that predicts &lt;i&gt;y&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; from &lt;i&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt;. This is written&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="r_{xy}^2=1-\frac{s_{y|x}^2}{s_y^2}," src="http://upload.wikimedia.org/math/2/0/5/205bc8655668dbbf8aab53321699f1b3.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;where &lt;i&gt;s&lt;/i&gt;&lt;sub&gt;&lt;i&gt;y&lt;/i&gt;|&lt;i&gt;x&lt;/i&gt;&lt;/sub&gt;&lt;sup&gt;2&lt;/sup&gt; is the squared error of a &lt;a href="http://en.wikipedia.org/wiki/Linear_regression" title="Linear regression"&gt;linear regression&lt;/a&gt; of &lt;i&gt;y&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; on &lt;i&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; by the &lt;a href="http://en.wikipedia.org/wiki/Equation" title="Equation"&gt;equation&lt;/a&gt; &lt;i&gt;y = a + bx&lt;/i&gt;:&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="s_{y|x}^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-a-bx_i)^2," src="http://upload.wikimedia.org/math/0/2/5/025c4c8bd3c0c74a7424682efc345244.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;and &lt;i&gt;s&lt;/i&gt;&lt;sub&gt;&lt;i&gt;y&lt;/i&gt;&lt;/sub&gt;&lt;sup&gt;2&lt;/sup&gt; is just the variance of &lt;i&gt;y&lt;/i&gt;:&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="s_y^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2." 
src="http://upload.wikimedia.org/math/1/e/5/1e5432524dbf2ec7fec390acd8caf3e7.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;Note that since the sample correlation coefficient is symmetric in &lt;i&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; and &lt;i&gt;y&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt;, we will get the same value for a fit that predicts &lt;i&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; from &lt;i&gt;y&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt;:&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="r_{xy}^2=1-\frac{s_{x|y}^2}{s_x^2}." src="http://upload.wikimedia.org/math/8/3/6/836461cbf3b7c56196f149b9aa5d9980.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;This equation also gives an intuitive idea of the correlation coefficient for higher &lt;a href="http://en.wikipedia.org/wiki/Dimension" title="Dimension"&gt;dimensions&lt;/a&gt;. Just as the square of the sample correlation coefficient described above is the fraction of variance accounted for by the fit of a 1-dimensional &lt;a href="http://en.wikipedia.org/wiki/Euclidean_space" title="Euclidean space"&gt;linear submanifold&lt;/a&gt; to a set of 2-dimensional vectors (&lt;i&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt;, &lt;i&gt;y&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt;), so we can define a correlation coefficient for a fit of an &lt;i&gt;m&lt;/i&gt;-dimensional linear submanifold to a set of &lt;i&gt;n&lt;/i&gt;-dimensional vectors. For example, if we fit a plane &lt;i&gt;z = a + bx + cy&lt;/i&gt; to a set of data (&lt;i&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt;, &lt;i&gt;y&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt;, &lt;i&gt;z&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt;) then the squared correlation coefficient of &lt;i&gt;z&lt;/i&gt; with &lt;i&gt;x&lt;/i&gt; and &lt;i&gt;y&lt;/i&gt; is&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="r^2=1-\frac{\sigma_{z|xy}^2}{s_z^2}." 
src="http://upload.wikimedia.org/math/b/9/f/b9f99d6c8c6f0d6fbde2e774d92dc129.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;&lt;a name="Geometric_Interpretation_of_correlation" id="Geometric_Interpretation_of_correlation"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h3&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Geometric Interpretation of correlation&lt;/span&gt;&lt;/h3&gt; &lt;p&gt;The correlation coefficient can also be viewed as the &lt;a href="http://en.wikipedia.org/wiki/Cosine" title="Cosine"&gt;cosine&lt;/a&gt; of the &lt;a href="http://en.wikipedia.org/wiki/Angle" title="Angle"&gt;angle&lt;/a&gt; between the two &lt;a href="http://en.wikipedia.org/wiki/Vector_%28spatial%29" title="Vector (spatial)"&gt;vectors&lt;/a&gt; of samples drawn from the two random variables.&lt;/p&gt; &lt;p&gt;Caution: This method only works with centered data, i.e., data which have been shifted by the sample mean so as to have an average of zero. Some practitioners prefer an uncentered (non-Pearson-compliant) correlation coefficient. See the example below for a comparison.&lt;/p&gt; &lt;p&gt;As an example, suppose five countries are found to have gross national products of 1, 2, 3, 5, and 8 billion dollars, respectively. Suppose these same five countries (in the same order) are found to have 11%, 12%, 13%, 15%, and 18% poverty. 
Then let &lt;b&gt;x&lt;/b&gt; and &lt;b&gt;y&lt;/b&gt; be ordered 5-element vectors containing the above data: &lt;b&gt;x&lt;/b&gt; = (1, 2, 3, 5, 8) and &lt;b&gt;y&lt;/b&gt; = (0.11, 0.12, 0.13, 0.15, 0.18).&lt;/p&gt; &lt;p&gt;By the usual procedure for finding the angle between two vectors (see &lt;a href="http://en.wikipedia.org/wiki/Dot_product" title="Dot product"&gt;dot product&lt;/a&gt;), the &lt;i&gt;uncentered&lt;/i&gt; correlation coefficient is:&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt=" \cos \theta = \frac { \bold{x} \cdot \bold{y} } { \left\| \bold{x} \right\| \left\| \bold{y} \right\| } = \frac { 2.93 } { \sqrt { 103 } \sqrt { 0.0983 } } = 0.920814711. " src="http://upload.wikimedia.org/math/f/8/a/f8a942d1979bfb6d6ba350abe88559ca.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;Note that the above data were deliberately chosen to be perfectly correlated: &lt;i&gt;y&lt;/i&gt; = 0.10 + 0.01 &lt;i&gt;x&lt;/i&gt;. The Pearson correlation coefficient must therefore be exactly one. 
Centering the data (shifting &lt;b&gt;x&lt;/b&gt; by E(&lt;b&gt;x&lt;/b&gt;) = 3.8 and &lt;b&gt;y&lt;/b&gt; by E(&lt;b&gt;y&lt;/b&gt;) = 0.138) yields &lt;b&gt;x&lt;/b&gt; = (-2.8, -1.8, -0.8, 1.2, 4.2) and &lt;b&gt;y&lt;/b&gt; = (-0.028, -0.018, -0.008, 0.012, 0.042), from which&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt=" \cos \theta = \frac { \bold{x} \cdot \bold{y} } { \left\| \bold{x} \right\| \left\| \bold{y} \right\| } = \frac { 0.308 } { \sqrt { 30.8 } \sqrt { 0.00308 } } = 1, " src="http://upload.wikimedia.org/math/2/b/3/2b38aef20a069f7cc2a0a2b587567db0.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;as expected.&lt;/p&gt; &lt;p&gt;&lt;a name="Interpretation_of_the_size_of_a_correlation" id="Interpretation_of_the_size_of_a_correlation"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h3&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Interpretation of the size of a correlation&lt;/span&gt;&lt;/h3&gt; &lt;table class="wikitable" align="right"&gt; &lt;tbody&gt;&lt;tr&gt; &lt;th&gt;Correlation&lt;/th&gt; &lt;th&gt;Negative&lt;/th&gt; &lt;th&gt;Positive&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;Small&lt;/td&gt; &lt;td&gt;−0.29 to −0.10&lt;/td&gt; &lt;td&gt;0.10 to 0.29&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;Medium&lt;/td&gt; &lt;td&gt;−0.49 to −0.30&lt;/td&gt; &lt;td&gt;0.30 to 0.49&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;Large&lt;/td&gt; &lt;td&gt;−1.00 to −0.50&lt;/td&gt; &lt;td&gt;0.50 to 1.00&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; &lt;p&gt;Several authors have offered guidelines for the interpretation of a correlation coefficient. 
Cohen (1988),&lt;sup id="_ref-Cohen88_0" class="reference"&gt;&lt;a href="http://en.wikipedia.org/wiki/Correlation#_note-Cohen88" title=""&gt;[1]&lt;/a&gt;&lt;/sup&gt; for example, has suggested the following interpretations for correlations in psychological research, in the table on the right.&lt;/p&gt; &lt;p&gt;As Cohen himself has observed, however, all such criteria are in some ways arbitrary and should not be observed too strictly. This is because the interpretation of a correlation coefficient depends on the context and purposes. A correlation of 0.9 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences where there may be a greater contribution from complicating factors.&lt;/p&gt; &lt;p&gt;&lt;a name="Non-parametric_correlation_coefficients" id="Non-parametric_correlation_coefficients"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h2&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Non-parametric correlation coefficients&lt;/span&gt;&lt;/h2&gt; &lt;p&gt;Pearson's correlation coefficient is a &lt;a href="http://en.wikipedia.org/wiki/Parametric_statistics" title="Parametric statistics"&gt;parametric statistic&lt;/a&gt;, and it may be less useful if the underlying assumption of normality is violated. 
&lt;a href="http://en.wikipedia.org/wiki/Non-parametric_statistics" title="Non-parametric statistics"&gt;Non-parametric&lt;/a&gt; correlation methods, such as &lt;a href="http://en.wikipedia.org/wiki/Chi-square_test" title="Chi-square test"&gt;chi-square&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient" title="Point-biserial correlation coefficient"&gt;point-biserial correlation&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient" title="Spearman's rank correlation coefficient"&gt;Spearman's ρ&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Kendall%27s_tau" title="Kendall's tau"&gt;Kendall's τ&lt;/a&gt; may be useful when distributions are not normal; they are a little less powerful than parametric methods if the assumptions underlying the latter are met, but are less likely to give distorted results when the assumptions fail.&lt;/p&gt; &lt;p&gt;&lt;a name="Other_measures_of_dependence_among_random_variables" id="Other_measures_of_dependence_among_random_variables"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h2&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Other measures of dependence among random variables&lt;/span&gt;&lt;/h2&gt; &lt;p&gt;To measure more general dependencies in the data (including nonlinear ones), it is better to use the &lt;a href="http://en.wikipedia.org/wiki/Correlation_ratio" title="Correlation ratio"&gt;correlation ratio&lt;/a&gt;, which is able to detect almost any functional dependency, or &lt;a href="http://en.wikipedia.org/wiki/Mutual_information" title="Mutual information"&gt;mutual information&lt;/a&gt;/&lt;a href="http://en.wikipedia.org/wiki/Total_correlation" title="Total correlation"&gt;total correlation&lt;/a&gt;, which is capable of detecting even more general dependencies.&lt;/p&gt; &lt;p&gt;&lt;a name="Copulas_and_correlation" id="Copulas_and_correlation"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h2&gt;&lt;span 
class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Copulas and correlation&lt;/span&gt;&lt;/h2&gt; &lt;p&gt;The information given by a correlation coefficient is not enough to define the dependence structure between random variables; to fully capture it we must consider a &lt;a href="http://en.wikipedia.org/wiki/Copula_%28statistics%29" title="Copula (statistics)"&gt;copula&lt;/a&gt; between them. The correlation coefficient completely defines the dependence structure only in very particular cases, for example when the &lt;a href="http://en.wikipedia.org/wiki/Cumulative_distribution_function" title="Cumulative distribution function"&gt;cumulative distribution functions&lt;/a&gt; are those of &lt;a href="http://en.wikipedia.org/wiki/Multivariate_normal_distribution" title="Multivariate normal distribution"&gt;multivariate normal distributions&lt;/a&gt;. In the case of elliptic distributions it characterizes the (hyper-)ellipses of equal density; however, it does not completely characterize the dependence structure (for example, a multivariate t-distribution's degrees of freedom determine the level of tail dependence).&lt;/p&gt; &lt;p&gt;&lt;a name="Correlation_matrices" id="Correlation_matrices"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h2&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Correlation matrices&lt;/span&gt;&lt;/h2&gt; &lt;p&gt;The correlation matrix of &lt;i&gt;n&lt;/i&gt; random variables &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;, ..., &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;n&lt;/i&gt;&lt;/sub&gt; is the &lt;i&gt;n&lt;/i&gt;  ×  &lt;i&gt;n&lt;/i&gt; matrix whose &lt;i&gt;i&lt;/i&gt;,&lt;i&gt;j&lt;/i&gt; entry is corr(&lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;, &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;). 
If the measures of correlation used are product-moment coefficients, the correlation matrix is the same as the &lt;a href="http://en.wikipedia.org/wiki/Covariance_matrix" title="Covariance matrix"&gt;covariance matrix&lt;/a&gt; of the standardized random variables &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt; /SD(&lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;) for &lt;i&gt;i&lt;/i&gt; = 1, ..., &lt;i&gt;n&lt;/i&gt;. Consequently it is necessarily a &lt;a href="http://en.wikipedia.org/wiki/Positive-semidefinite_matrix" title="Positive-semidefinite matrix"&gt;positive-semidefinite matrix&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The correlation matrix is symmetric because the correlation between &lt;span class="texhtml"&gt;&lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt; and &lt;span class="texhtml"&gt;&lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt; is the same as the correlation between &lt;span class="texhtml"&gt;&lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt; and &lt;span class="texhtml"&gt;&lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt;.&lt;/p&gt; &lt;p&gt;&lt;a name="Removing_correlation" id="Removing_correlation"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h2&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Removing correlation&lt;/span&gt;&lt;/h2&gt; &lt;p&gt;It is always possible to remove the correlation between zero-mean random variables with a linear transform, even if the relationship between the variables is nonlinear. Suppose a vector of &lt;i&gt;n&lt;/i&gt; random variables is sampled &lt;i&gt;m&lt;/i&gt; times. Let &lt;i&gt;X&lt;/i&gt; be a matrix where &lt;span class="texhtml"&gt;&lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;,&lt;i&gt;j&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt; is the &lt;i&gt;j&lt;/i&gt;th variable of sample &lt;i&gt;i&lt;/i&gt;. 
Let &lt;span class="texhtml"&gt;&lt;i&gt;Z&lt;/i&gt;&lt;sub&gt;&lt;i&gt;r&lt;/i&gt;,&lt;i&gt;c&lt;/i&gt;&lt;/sub&gt;&lt;/span&gt; be an &lt;i&gt;r&lt;/i&gt; by &lt;i&gt;c&lt;/i&gt; matrix with every element 1. Then &lt;i&gt;D&lt;/i&gt; is the data transformed so every random variable has zero mean, and &lt;i&gt;T&lt;/i&gt; is the data transformed so all variables have zero mean, unit variance, and zero correlation with all other variables.&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="D = X -\frac{1}{m} Z_{m,m} X" src="http://upload.wikimedia.org/math/6/7/1/6715b8d24e96a6f20f63c4dcb61018cc.png" /&gt;&lt;/dd&gt;&lt;dd&gt;&lt;img class="tex" alt="T = D (D^T D)^{-\frac{1}{2}}" src="http://upload.wikimedia.org/math/c/1/4/c1428a3ab02a5c99ca0f1cd7852f2171.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;where an exponent of -1/2 represents the &lt;a href="http://en.wikipedia.org/wiki/Matrix_square_root" title="Matrix square root"&gt;matrix square root&lt;/a&gt; of the &lt;a href="http://en.wikipedia.org/wiki/Matrix_inverse" title="Matrix inverse"&gt;inverse&lt;/a&gt; of a matrix. The covariance matrix of &lt;i&gt;T&lt;/i&gt; will be the identity matrix. If a new data sample &lt;i&gt;x&lt;/i&gt; is a row vector of &lt;i&gt;n&lt;/i&gt; elements, then the same transform can be applied to &lt;i&gt;x&lt;/i&gt; to get the transformed vectors &lt;i&gt;d&lt;/i&gt; and &lt;i&gt;t&lt;/i&gt;:&lt;/p&gt; &lt;dl&gt;&lt;dd&gt;&lt;img class="tex" alt="d = x - \frac{1}{m} Z_{1,m} X" src="http://upload.wikimedia.org/math/3/1/1/311997b7f9deb1cdc43dcf040df09fcf.png" /&gt;&lt;/dd&gt;&lt;dd&gt;&lt;img class="tex" alt="t = d (D^T D)^{-\frac{1}{2}}." 
src="http://upload.wikimedia.org/math/4/6/5/465ba75f99be9cf2a5c1f21206a5ff70.png" /&gt;&lt;/dd&gt;&lt;/dl&gt; &lt;p&gt;&lt;a name="Common_misconceptions_about_correlation" id="Common_misconceptions_about_correlation"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h2&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Common misconceptions about correlation&lt;/span&gt;&lt;/h2&gt; &lt;p&gt;&lt;a name="Correlation_and_causality" id="Correlation_and_causality"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h3&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Correlation and causality&lt;/span&gt;&lt;/h3&gt; &lt;p&gt;The conventional dictum that "&lt;a href="http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation" title="Correlation does not imply causation"&gt;correlation does not imply causation&lt;/a&gt;" means that correlation cannot be validly used to infer a causal relationship between the variables. This dictum should not be taken to mean that correlations cannot indicate causal relations. However, the causes underlying the correlation, if any, may be indirect and unknown. Consequently, establishing a correlation between two variables is not a sufficient condition to establish a causal relationship (in either direction).&lt;/p&gt; &lt;p&gt;Here is a simple example: hot weather may cause both crime and ice-cream purchases. Therefore crime is correlated with ice-cream purchases. But crime does not cause ice-cream purchases and ice-cream purchases do not cause crime.&lt;/p&gt; &lt;p&gt;A correlation between age and height in children is fairly causally transparent, but a correlation between mood and health in people is less so. Does improved mood lead to improved health? Or does good health lead to good mood? Or does some other factor underlie both? Or is it pure coincidence? 
In other words, a correlation can be taken as evidence for a possible causal relationship, but cannot indicate what the causal relationship, if any, might be.&lt;/p&gt; &lt;p&gt;&lt;a name="Correlation_and_linearity" id="Correlation_and_linearity"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h3&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Correlation and linearity&lt;/span&gt;&lt;/h3&gt; &lt;div class="thumb tright"&gt; &lt;div class="thumbinner" style="width: 327px;"&gt;&lt;a href="http://en.wikipedia.org/wiki/Image:Anscombe.svg" class="internal" title="Four sets of data with the same correlation of 0.81"&gt;&lt;img alt="Four sets of data with the same correlation of 0.81" longdesc="/wiki/Image:Anscombe.svg" class="thumbimage" src="http://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Anscombe.svg/325px-Anscombe.svg.png" height="222" width="325" /&gt;&lt;/a&gt; &lt;div class="thumbcaption"&gt; &lt;div class="magnify" style="float: right;"&gt;&lt;a href="http://en.wikipedia.org/wiki/Image:Anscombe.svg" class="internal" title="Enlarge"&gt;&lt;img src="http://en.wikipedia.org/skins-1.5/common/images/magnify-clip.png" alt="" height="11" width="15" /&gt;&lt;/a&gt;&lt;/div&gt; Four sets of data with the same correlation of 0.81&lt;/div&gt; &lt;/div&gt; &lt;/div&gt; &lt;p&gt;While Pearson correlation indicates the strength of a linear relationship between two variables, its value alone may not be sufficient to evaluate this relationship, especially in the case where the assumption of normality is incorrect.&lt;/p&gt; &lt;p&gt;The image on the right shows &lt;a href="http://en.wikipedia.org/wiki/Scatterplot" title="Scatterplot"&gt;scatterplots&lt;/a&gt; of &lt;a href="http://en.wikipedia.org/wiki/Anscombe%27s_quartet" title="Anscombe's quartet"&gt;Anscombe's quartet&lt;/a&gt;, a set of four different pairs of variables created by &lt;a href="http://en.wikipedia.org/w/index.php?title=Francis_Anscombe&amp;action=edit" class="new" title="Francis 
Anscombe"&gt;Francis Anscombe&lt;/a&gt;.&lt;sup id="_ref-0" class="reference"&gt;&lt;a href="http://en.wikipedia.org/wiki/Correlation#_note-0" title=""&gt;[2]&lt;/a&gt;&lt;/sup&gt; The four &lt;span class="texhtml"&gt;&lt;i&gt;y&lt;/i&gt;&lt;/span&gt; variables have the same mean (7.5), variance (4.12), correlation (0.81) and regression line (&lt;span class="texhtml"&gt;&lt;i&gt;y&lt;/i&gt; = 3 + 0.5&lt;i&gt;x&lt;/i&gt;&lt;/span&gt;). However, as can be seen on the plots, the distribution of the variables is very different. The first one (top left) seems to be distributed normally, and corresponds to what one would expect when considering two variables correlated and following the assumption of normality. The second one (top right) is not distributed normally; while an obvious relationship between the two variables can be observed, it is not linear, and the Pearson correlation coefficient is not relevant. In the third case (bottom left), the linear relationship is perfect, except for one &lt;a href="http://en.wikipedia.org/wiki/Outlier" title="Outlier"&gt;outlier&lt;/a&gt; which exerts enough influence to lower the correlation coefficient from 1 to 0.81. 
Finally, the fourth example (bottom right) shows another case in which one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.&lt;/p&gt; &lt;p&gt;These examples indicate that the correlation coefficient, as a summary statistic, cannot replace the individual examination of the data.&lt;/p&gt; &lt;p&gt;&lt;a name="Computing_correlation_accurately_in_a_single_pass" id="Computing_correlation_accurately_in_a_single_pass"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h2&gt;&lt;span class="editsection"&gt;&lt;/span&gt; &lt;span class="mw-headline"&gt;Computing correlation accurately in a single pass&lt;/span&gt;&lt;/h2&gt; &lt;p&gt;The following algorithm (in &lt;a href="http://en.wikipedia.org/wiki/Pseudocode" title="Pseudocode"&gt;pseudocode&lt;/a&gt;) will estimate correlation with good numerical stability:&lt;/p&gt; &lt;pre&gt;sum_sq_x = 0&lt;br /&gt;sum_sq_y = 0&lt;br /&gt;sum_coproduct = 0&lt;br /&gt;mean_x = x[1]&lt;br /&gt;mean_y = y[1]&lt;br /&gt;for i in 2 to N:&lt;br /&gt;  sweep = (i - 1.0) / i&lt;br /&gt;  delta_x = x[i] - mean_x&lt;br /&gt;  delta_y = y[i] - mean_y&lt;br /&gt;  sum_sq_x += delta_x * delta_x * sweep&lt;br /&gt;  sum_sq_y += delta_y * delta_y * sweep&lt;br /&gt;  sum_coproduct += delta_x * delta_y * sweep&lt;br /&gt;  mean_x += delta_x / i&lt;br /&gt;  mean_y += delta_y / i&lt;br /&gt;pop_sd_x = sqrt( sum_sq_x / N )&lt;br /&gt;pop_sd_y = sqrt( sum_sq_y / N )&lt;br /&gt;cov_x_y = sum_coproduct / N&lt;br /&gt;correlation = cov_x_y / (pop_sd_x * pop_sd_y)&lt;br /&gt;&lt;/pre&gt; &lt;p&gt;For an enlightening experiment, check the correlation of {900,000,000 + i for i=1...100} with {900,000,000 - i for i=1...100}, perhaps with a few values modified. 
Poor algorithms will fail.&lt;/p&gt; &lt;p&gt;&lt;a name="Currency_correlation" id="Currency_correlation"&gt;&lt;/a&gt;&lt;/p&gt; &lt;h2&gt;&lt;span class="editsection"&gt;&lt;/span&gt;&lt;span class="mw-headline"&gt;Currency correlation&lt;/span&gt;&lt;/h2&gt; &lt;p&gt;Currency correlation is correlation between two &lt;a href="http://en.wikipedia.org/wiki/Currency_pairs" title="Currency pairs"&gt;currency pairs&lt;/a&gt;, or more generally, correlations between values of &lt;a href="http://en.wikipedia.org/wiki/Commodities" title="Commodities"&gt;commodities&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Stocks" title="Stocks"&gt;stocks&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Bonds" title="Bonds"&gt;bonds&lt;/a&gt; &lt;a href="http://en.wikipedia.org/wiki/Financial_market" title="Financial market"&gt;markets&lt;/a&gt;. It is used as a tool to predict changes in market value.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-6174438248247291789?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/6174438248247291789/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=6174438248247291789' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/6174438248247291789'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/6174438248247291789'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/09/correlation-pearson-correlation.html' title='Correlation, Pearson Correlation'/><author><name>Xiong  (Johnny) 
Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-218809664027129494</id><published>2007-08-30T16:07:00.000+01:00</published><updated>2007-08-30T16:08:49.169+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Overview of Statistical Test [***recommended***]</title><content type='html'>&lt;table border="0" cellpadding="6" cellspacing="0" width="600"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align="left" valign="top"&gt;&lt;p class="subctr"&gt;&lt;span style="font-size:+1;"&gt; Choosing a statistical test&lt;/span&gt;&lt;/p&gt;      &lt;p&gt;This is chapter 37 of &lt;a set="yes" linkindex="2" href="http://www.graphpad.com/www/book/book.htm"&gt;Intuitive Biostatistics&lt;/a&gt; (ISBN 0-19-508607-4) by Harvey Motulsky. Copyright © 1995 by Oxford University Press Inc. All rights reserved. 
You may order the book from GraphPad Software with a software purchase, from any academic bookstore, or from &lt;a linkindex="3" href="http://www.amazon.com/exec/obidos/ISBN%3D0195086074/GraphpadSoftwareA/102-7198890-9734560"&gt;amazon.com&lt;/a&gt;.&lt;/p&gt;      &lt;p&gt;Learn how to &lt;a linkindex="4" href="http://www.graphpad.com/articles/interpret/tableaofacontents.htm"&gt;interpret the results of statistical tests&lt;/a&gt; and about our programs &lt;a set="yes" linkindex="5" href="http://www.graphpad.com/instat3/instat.htm"&gt;GraphPad InStat&lt;/a&gt; and &lt;a linkindex="6" href="http://www.graphpad.com/prism/Prism.htm"&gt;GraphPad Prism&lt;/a&gt;.&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr&gt;     &lt;td align="left" valign="top"&gt;&lt;span class="sublft"&gt;REVIEW OF AVAILABLE STATISTICAL TESTS&lt;/span&gt;      &lt;p&gt;This book has discussed many different statistical tests. To select the right test, ask yourself two questions: What kind of data have you collected? What is your goal? Then refer to Table 37.1.&lt;/p&gt;      &lt;p&gt;All tests are described in this book and are performed by InStat, except for tests marked with asterisks. Tests labeled with a single asterisk are briefly mentioned in this book, and tests labeled with two asterisks are not mentioned at all.&lt;/p&gt;      &lt;p&gt;Table 37.1. 
Selecting a statistical test&lt;/p&gt;            &lt;table border="2" cellpadding="3" cellspacing="0" width="580"&gt;       &lt;tbody&gt;&lt;tr&gt;        &lt;td align="left" valign="top"&gt;&lt;br /&gt;&lt;/td&gt;        &lt;td colspan="4" align="left" valign="top"&gt;&lt;a name="UQHTML3"&gt;&lt;/a&gt;&lt;span class="sublft"&gt;Type of Data&lt;/span&gt;&lt;/td&gt;       &lt;/tr&gt;       &lt;tr&gt;        &lt;td align="left" valign="top"&gt;&lt;span class="sublft"&gt;Goal&lt;/span&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Measurement (from Gaussian Population)&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Rank, Score, or Measurement (from Non-Gaussian Population)&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Binomial&lt;br /&gt;         (Two Possible Outcomes)&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Survival Time&lt;/b&gt;&lt;/td&gt;       &lt;/tr&gt;       &lt;tr&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Describe one group&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Mean, SD&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Median, interquartile range&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Proportion&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Kaplan-Meier survival curve&lt;/td&gt;       &lt;/tr&gt;       &lt;tr&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Compare one group to a hypothetical value&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;One-sample &lt;i&gt;t&lt;/i&gt; test&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Wilcoxon test&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Chi-square&lt;br /&gt;        or&lt;br /&gt;        Binomial test **&lt;/td&gt;        &lt;td align="left" valign="top"&gt;&lt;br /&gt;&lt;/td&gt;       &lt;/tr&gt;       &lt;tr&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Compare two unpaired&lt;i&gt; 
&lt;/i&gt;groups&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Unpaired &lt;i&gt;t&lt;/i&gt; test&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Mann-Whitney test&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Fisher's test&lt;br /&gt;        (chi-square for large samples)&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Log-rank test or Mantel-Haenszel*&lt;/td&gt;       &lt;/tr&gt;       &lt;tr&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Compare two paired groups&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Paired &lt;i&gt;t&lt;/i&gt; test&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Wilcoxon test&lt;/td&gt;        &lt;td align="left" valign="top"&gt;McNemar's test&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Conditional proportional hazards regression*&lt;/td&gt;       &lt;/tr&gt;       &lt;tr&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Compare three or more unmatched groups&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;One-way ANOVA&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Kruskal-Wallis test&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Chi-square test&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Cox proportional hazard regression**&lt;/td&gt;       &lt;/tr&gt;       &lt;tr&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Compare three or more matched groups&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Repeated-measures ANOVA&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Friedman test&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Cochran Q**&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Conditional proportional hazards regression**&lt;/td&gt;       &lt;/tr&gt;       &lt;tr&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Quantify association between two variables&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Pearson 
correlation&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Spearman correlation&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Contingency coefficients**&lt;/td&gt;        &lt;td align="left" valign="top"&gt;&lt;br /&gt;&lt;/td&gt;       &lt;/tr&gt;       &lt;tr&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Predict value from another measured variable&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Simple linear regression&lt;br /&gt;        or&lt;br /&gt;        Nonlinear regression&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Nonparametric regression**&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Simple logistic regression*&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Cox proportional hazard regression*&lt;/td&gt;       &lt;/tr&gt;       &lt;tr&gt;        &lt;td align="left" valign="top"&gt;&lt;b&gt;Predict value from several measured or binomial variables&lt;/b&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Multiple linear regression*&lt;br /&gt;        or&lt;br /&gt;        Multiple nonlinear regression**&lt;/td&gt;        &lt;td align="left" valign="top"&gt;&lt;br /&gt;&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Multiple logistic regression*&lt;/td&gt;        &lt;td align="left" valign="top"&gt;Cox proportional hazard regression*&lt;/td&gt;       &lt;/tr&gt;      &lt;/tbody&gt;&lt;/table&gt;            &lt;p&gt;&lt;span class="sublft"&gt;REVIEW OF NONPARAMETRIC TESTS &lt;/span&gt;&lt;/p&gt;      &lt;p&gt;Choosing the right test to compare measurements is a bit tricky, as you must choose between two families of tests: parametric and nonparametric. Many statistical tests are based upon the assumption that the data are sampled from a Gaussian distribution. These tests are referred to as parametric tests. 
Commonly used parametric tests are listed in the first column of the table and include the t test and analysis of variance.&lt;/p&gt;      &lt;p&gt;Tests that do not make assumptions about the population distribution are referred to as nonparametric tests. You've already learned a bit about nonparametric tests in previous chapters. All commonly used nonparametric tests rank the outcome variable from low to high and then analyze the ranks. These tests are listed in the second column of the table and include the Wilcoxon, Mann-Whitney, and Kruskal-Wallis tests. These tests are also called distribution-free tests.&lt;/p&gt;      &lt;p&gt;&lt;span class="sublft"&gt;CHOOSING BETWEEN PARAMETRIC AND NONPARAMETRIC TESTS: THE EASY CASES &lt;/span&gt;&lt;/p&gt;      &lt;p&gt;Choosing between parametric and nonparametric tests is sometimes easy. You should definitely choose a parametric test if you are sure that your data are sampled from a population that follows a Gaussian distribution (at least approximately). You should definitely select a nonparametric test in three situations:&lt;/p&gt;      &lt;ul&gt;&lt;p&gt;• The outcome is a rank or a score and the population is clearly not Gaussian. Examples include class ranking of students, the Apgar score for the health of newborn babies (measured on a scale of 0 to 10 and where all scores are integers), the visual analogue score for pain (measured on a continuous scale where 0 is no pain and 10 is unbearable pain), and the star scale commonly used by movie and restaurant critics (* is OK, ***** is fantastic).&lt;br /&gt;• Some values are "off the scale," that is, too high or too low to measure. Even if the population is Gaussian, it is impossible to analyze such data with a parametric test since you don't know all of the values. Using a nonparametric test with these data is simple. 
Assign values too low to measure an arbitrary very low value and assign values too high to measure an arbitrary very high value. Then perform a nonparametric test. Since the nonparametric test only knows about the relative ranks of the values, it won't matter that you didn't know all the values exactly.&lt;br /&gt;• The data are measurements, and you are sure that the population is not distributed in a Gaussian manner. If the data are not sampled from a Gaussian distribution, consider whether you can transform the values to make the distribution become Gaussian. For example, you might take the logarithm or reciprocal of all values. There are often biological or chemical reasons (as well as statistical ones) for performing a particular transform.&lt;/p&gt;&lt;/ul&gt;      &lt;p&gt;&lt;span class="sublft"&gt;CHOOSING BETWEEN PARAMETRIC AND NONPARAMETRIC TESTS: THE HARD CASES &lt;/span&gt;&lt;/p&gt;      &lt;p&gt;It is not always easy to decide whether a sample comes from a Gaussian population. Consider these points:&lt;/p&gt;      &lt;ul&gt;&lt;p&gt;• If you collect many data points (over a hundred or so), you can look at the distribution of data and it will be fairly obvious whether the distribution is approximately bell shaped. A formal statistical test (Kolmogorov-Smirnov test, not explained in this book) can be used to test whether the distribution of the data differs significantly from a Gaussian distribution. With few data points, it is difficult to tell whether the data are Gaussian by inspection, and the formal test has little power to discriminate between Gaussian and non-Gaussian distributions.&lt;br /&gt;• You should look at previous data as well. Remember, what matters is the distribution of the overall population, not the distribution of your sample. In deciding whether a population is Gaussian, look at all available data, not just data in the current experiment.&lt;br /&gt;• Consider the source of scatter. 
When the scatter comes from the sum of numerous sources (with no one source contributing most of the scatter), you expect to find a roughly Gaussian distribution.&lt;br /&gt;When in doubt, some people choose a parametric test (because they aren't sure the Gaussian assumption is violated), and others choose a nonparametric test (because they aren't sure the Gaussian assumption is met).&lt;/p&gt;&lt;/ul&gt;      &lt;p&gt;&lt;span class="sublft"&gt;CHOOSING BETWEEN PARAMETRIC AND NONPARAMETRIC TESTS: DOES IT MATTER? &lt;/span&gt;&lt;/p&gt;      &lt;p&gt;Does it matter whether you choose a parametric or nonparametric test? The answer depends on sample size. There are four cases to think about:&lt;/p&gt;      &lt;ul&gt;&lt;p&gt;• Large sample. What happens when you use a parametric test with data from a non-Gaussian population? The central limit theorem (discussed in Chapter 5) ensures that parametric tests work well with large samples even if the population is non-Gaussian. In other words, parametric tests are robust to deviations from Gaussian distributions, so long as the samples are large. The snag is that it is impossible to say how large is large enough, as it depends on the nature of the particular non-Gaussian distribution. Unless the population distribution is really weird, you are probably safe choosing a parametric test when there are at least two dozen data points in each group.&lt;br /&gt;• Large sample. What happens when you use a nonparametric test with data from a Gaussian population? Nonparametric tests work well with large samples from Gaussian populations. The P values tend to be a bit too large, but the discrepancy is small. In other words, nonparametric tests are only slightly less powerful than parametric tests with large samples.&lt;br /&gt;• Small samples. What happens when you use a parametric test with data from non-Gaussian populations? You can't rely on the central limit theorem, so the P value may be inaccurate.&lt;br /&gt;• Small samples. 
When you use a nonparametric test with data from a Gaussian population, the P values tend to be too high. The nonparametric tests lack statistical power with small samples.&lt;/p&gt;&lt;/ul&gt;      &lt;p&gt;Thus, large data sets present no problems. It is usually easy to tell if the data come from a Gaussian population, but it doesn't really matter because the nonparametric tests are so powerful and the parametric tests are so robust. Small data sets present a dilemma. It is difficult to tell if the data come from a Gaussian population, but it matters a lot. The nonparametric tests are not powerful and the parametric tests are not robust.&lt;/p&gt;      &lt;p&gt;&lt;span class="sublft"&gt;ONE- OR TWO-SIDED P VALUE? &lt;/span&gt;&lt;/p&gt;      &lt;p&gt;With many tests, you must choose whether you wish to calculate a one- or two-sided P value (same as one- or two-tailed P value). The difference between one- and two-sided P values was discussed in Chapter 10. Let's review the difference in the context of a t test. The P value is calculated for the null hypothesis that the two population means are equal, and any discrepancy between the two sample means is due to chance. If this null hypothesis is true, the one-sided P value is the probability that two sample means would differ as much as was observed (or further) in the direction specified by the hypothesis just by chance, even though the means of the overall populations are actually equal. The two-sided P value also includes the probability that the sample means would differ that much in the opposite direction (i.e., the other group has the larger mean). 
For a symmetric test such as the t test, the two-sided P value is twice the one-sided P value.&lt;/p&gt;      &lt;p&gt;A one-sided P value is appropriate when you can state with certainty (and before collecting any data) either that there will be no difference between the means or that the difference will go in a direction you can specify in advance (i.e., you have specified which group will have the larger mean). If you cannot specify the direction of any difference before collecting data, then a two-sided P value is more appropriate. If in doubt, select a two-sided P value.&lt;/p&gt;      &lt;p&gt;If you select a one-sided test, you should do so before collecting any data and you need to state the direction of your experimental hypothesis. If the data go the other way, you must be willing to attribute that difference (or association or correlation) to chance, no matter how striking the data. If you would be intrigued, even a little, by data that go in the "wrong" direction, then you should use a two-sided P value. For reasons discussed in Chapter 10, I recommend that you always calculate a two-sided P value.&lt;/p&gt;      &lt;p&gt;&lt;span class="sublft"&gt;PAIRED OR UNPAIRED TEST? &lt;/span&gt;&lt;/p&gt;      &lt;p&gt;When comparing two groups, you need to decide whether to use a paired test. When comparing three or more groups, the term paired is not apt and the term repeated measures is used instead.&lt;/p&gt;      &lt;p&gt;Use an unpaired test to compare groups when the individual values are not paired or matched with one another. Select a paired or repeated-measures test when values represent repeated measurements on one subject (before and after an intervention) or measurements on matched subjects. 
The paired or repeated-measures tests are also appropriate for repeated laboratory experiments run at different times, each with its own control.&lt;/p&gt;      &lt;p&gt;You should select a paired test when values in one group are more closely correlated with a specific value in the other group than with random values in the other group. It is only appropriate to select a paired test when the subjects were matched or paired before the data were collected. You cannot base the pairing on the data you are analyzing.&lt;/p&gt;      &lt;p&gt;&lt;span class="sublft"&gt;FISHER'S TEST OR THE CHI-SQUARE TEST? &lt;/span&gt;&lt;/p&gt;      &lt;p&gt;When analyzing contingency tables with two rows and two columns, you can use either Fisher's exact test or the chi-square test. Fisher's test is the best choice as it always gives the exact P value. The chi-square test is simpler to calculate but yields only an approximate P value. If a computer is doing the calculations, you should choose Fisher's test unless you prefer the familiarity of the chi-square test. You should definitely avoid the chi-square test when the numbers in the contingency table are very small (any number less than about six). When the numbers are larger, the P values reported by the chi-square and Fisher's tests will be very similar.&lt;/p&gt;      &lt;p&gt;The chi-square test calculates approximate P values, and the Yates' continuity correction is designed to make the approximation better. Without the Yates' correction, the P values are too low. However, the correction goes too far, and the resulting P value is too high. Statisticians give different recommendations regarding Yates' correction. With large sample sizes, the Yates' correction makes little difference. If you select Fisher's test, the P value is exact and Yates' correction is not needed and is not available.&lt;/p&gt;      &lt;p&gt;&lt;span class="sublft"&gt;REGRESSION OR CORRELATION? 
&lt;/span&gt;&lt;/p&gt;      &lt;p&gt;Linear regression and correlation are similar and easily confused. In some situations it makes sense to perform both calculations. Calculate linear correlation if you measured both X and Y in each subject and wish to quantify how well they are associated. Select the Pearson (parametric) correlation coefficient if you can assume that both X and Y are sampled from Gaussian populations. Otherwise choose the Spearman nonparametric correlation coefficient. Don't calculate the correlation coefficient (or its confidence interval) if you manipulated the X variable.&lt;/p&gt;      &lt;p&gt;Calculate linear regression only if one of the variables (X) is likely to precede or cause the other variable (Y). Definitely choose linear regression if you manipulated the X variable. It makes a big difference which variable is called X and which is called Y, as linear regression calculations are not symmetrical with respect to X and Y. If you swap the two variables, you will obtain a different regression line. In contrast, linear correlation calculations are symmetrical with respect to X and Y. 
If you swap the labels X and Y, you will still get the same correlation coefficient.&lt;/p&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-218809664027129494?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/218809664027129494/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=218809664027129494' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/218809664027129494'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/218809664027129494'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/08/overview-of-statistical-test.html' title='Overview of Statistical Test [***recommended***]'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-5722527630295894727</id><published>2007-08-30T15:17:00.000+01:00</published><updated>2007-08-30T15:18:41.063+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Statistical Data Analysis: Elementary Concepts</title><content type='html'>&lt;span style="font-weight: bold;font-size:100%;" &gt;Understanding Statistical Inference&lt;/span&gt;  &lt;p&gt;Statistical inference is based upon 
mathematical laws of probability. The  following example will give you the basic ideas.&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-weight: bold;"&gt;Statistical Inference &amp; The Coin Toss&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;Suppose we want to do a few coin tosses (sample) so that we can decide if a particular coin is equally likely to land head or tail over an infinite number of tosses (population). &lt;/p&gt;&lt;p&gt;If we toss the coin ten times and get 6 heads and 4 tails, we might suspect the coin is biased towards heads, but we wouldn't be very confident about this, because it's not that unusual (not that improbable) to get 6 heads out of 10. &lt;/p&gt;&lt;p&gt;On the other hand, if we toss the coin ten times and get 10 heads - we would be more confident that the coin is biased towards heads, because it is very unusual (not very probable at all) that we would get this result from an unbiased coin. &lt;/p&gt;&lt;h3&gt;Statistical Data Analysis: Hypothesis Testing&lt;/h3&gt; &lt;p&gt;The most common kind of statistical inference is hypothesis testing. Statistical data analysis allows us to use mathematical principles to decide how likely it is that our sample results match our hypothesis about a population. For example, if our research hypothesis is that the coin is not fair, but is actually biased towards heads - we can use principles of statistics to tell us how likely it is that we could get our sample results even if the coin were fair after all (null hypothesis). &lt;/p&gt;&lt;p&gt;If the probability of getting our sample results from a fair coin is very low, we feel confident in rejecting the null hypothesis (that the coin is fair). Even though we can't say for sure (because even a fair coin could produce our sample results), we can say that the results of our study &lt;b&gt;support the hypothesis&lt;/b&gt; that the coin is indeed biased.  
&lt;/p&gt;&lt;p&gt;When we make this decision based on statistical data analysis, this is statistical inference.   &lt;/p&gt;&lt;h3&gt;Statistical Data Analysis:  p-value&lt;/h3&gt; &lt;p&gt;In statistical hypothesis testing we use a p-value (probability value) to decide whether we have enough evidence to reject the null hypothesis and say our research hypothesis is supported by the data. &lt;/p&gt;&lt;p&gt;The p-value is a numerical statement of how likely it is that we could have  gotten our sample data (e.g., 10 heads) even if the null hypothesis is true  (e.g., fair coin). By convention, if the p-value is less than 0.05 (p &lt;&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-5722527630295894727?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/5722527630295894727/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=5722527630295894727' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/5722527630295894727'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/5722527630295894727'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/08/statistical-data-analysis-elementary.html' title='Statistical Data Analysis: Elementary Concepts'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-4290431849435578609</id><published>2007-08-30T14:10:00.001+01:00</published><updated>2007-08-30T14:24:54.047+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>KOLMOGOROV SMIRNOV Test (TWO SAMPLE) II</title><content type='html'>&lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;    Purpose:     &lt;/span&gt;&lt;ul&gt;&lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;      Perform a Kolmogorov-Smirnov two sample test of whether two data samples     come from the same distribution.  Note that we are not specifying     what that common distribution is.     &lt;/span&gt;&lt;/ul&gt;&lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;    Description:     &lt;/span&gt;&lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;      The one sample Kolmogorov-Smirnov (K-S) test is based on the     empirical distribution function (ECDF).  Given N data points     Y&lt;sub&gt;1&lt;/sub&gt;, Y&lt;sub&gt;2&lt;/sub&gt;, ..., Y&lt;sub&gt;N&lt;/sub&gt;, the ECDF is defined     as     &lt;/span&gt;&lt;p&gt; &lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;    &lt;/span&gt;&lt;/p&gt;&lt;ul&gt;&lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;         &lt;img src="http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/eqns/en.gif" alt="E(i) = n(i)/N" /&gt;     &lt;/span&gt;&lt;/ul&gt;&lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;      &lt;/span&gt;&lt;p&gt; &lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;    where n(i) is the number of points less than Y&lt;sub&gt;i&lt;/sub&gt;.  This     is a step function that increases by 1/N at the value of each data     point.  
We can graph a plot of the empirical distribution     function with a cumulative distribution function for a given     distribution.  The one sample K-S test is based on the maximum     distance between these two curves.  That is,     &lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;    &lt;/span&gt;&lt;/p&gt;&lt;ul&gt;&lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;         &lt;img src="http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/eqns/ks.gif" alt="D = max |F(Y(i)) - E(i)|" /&gt;     &lt;/span&gt;&lt;/ul&gt;&lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;      &lt;/span&gt;&lt;p&gt; &lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;    where F is the theoretical cumulative distribution function.     &lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;    The two sample K-S test is a variation of this.  However, instead     of comparing an empirical distribution function to a theoretical     distribution function, we compare the two empirical distribution     functions.  That is,     &lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;    &lt;/span&gt;&lt;/p&gt;&lt;ul&gt;&lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;         &lt;img src="http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/eqns/ks2samp.gif" alt="D = max |E1(i) - E2(i)|" /&gt;     &lt;/span&gt;&lt;/ul&gt;&lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;      &lt;/span&gt;&lt;p&gt; &lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;    where E&lt;sub&gt;1&lt;/sub&gt; and E&lt;sub&gt;2&lt;/sub&gt; are the empirical distribution     functions for the two samples.  
Note that we compute E&lt;sub&gt;1&lt;/sub&gt;     and E&lt;sub&gt;2&lt;/sub&gt; at each point in both samples (that is both     E&lt;sub&gt;1&lt;/sub&gt; and E&lt;sub&gt;2&lt;/sub&gt; are computed at each point     in each sample).     &lt;/span&gt;&lt;/p&gt;&lt;p&gt; &lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;    More formally, the Kolmogorov-Smir&lt;wbr&gt;nov two sample test statistic     can be defined as follows.     &lt;/span&gt;&lt;/p&gt; &lt;span style="font-family:Arial,Helvetica,sans-serif;"&gt;    &lt;table noborder=""&gt;        &lt;tbody&gt;&lt;tr&gt;            &lt;td valign="top"&gt;               H&lt;sub&gt;0&lt;/sub&gt;:           &lt;/td&gt;            &lt;td valign="top"&gt;               The two samples come from a common distribution.           &lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;            &lt;td valign="top"&gt;               H&lt;sub&gt;a&lt;/sub&gt;:           &lt;/td&gt;            &lt;td valign="top"&gt;               The two samples do not come from a common distribution.           &lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;            &lt;td valign="top"&gt;               Test Statistic:           &lt;/td&gt;            &lt;td valign="top"&gt;               The Kolmogorov-Smir&lt;wbr&gt;nov two sample test statistic is              defined as              &lt;p&gt;              &lt;/p&gt;&lt;ul&gt;&lt;img src="http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/eqns/ks2samp.gif" alt="D = max |E1(i) - E2(i)|" /&gt;&lt;/ul&gt;               &lt;p&gt;              where E&lt;sub&gt;1&lt;/sub&gt; and E&lt;sub&gt;2&lt;/sub&gt; are the empirical              distribution functions for the two samples.           
&lt;/p&gt;&lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;            &lt;td valign="top"&gt;               Significance Level:           &lt;/td&gt;            &lt;td valign="top"&gt;               &lt;img src="http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/eqns/alpha.gif" alt="alpha" /&gt;           &lt;/td&gt;         &lt;/tr&gt;         &lt;tr&gt;            &lt;td valign="top"&gt;               Critical Region:           &lt;/td&gt;            &lt;td valign="top"&gt;               The hypothesis regarding the distributional form is              rejected if the test statistic, D, is greater than              the critical value obtained from a table.  There              are several variations of these tables in the              literature that use somewhat different scalings              for the K-S test statistic and critical regions.              These alternative formulations should be equivalent,              but it is necessary to ensure that the test statistic              is calculated in a way that is consistent with how              the critical values were tabulated.              
&lt;p&gt;             &lt;/p&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-4290431849435578609?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/4290431849435578609/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=4290431849435578609' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/4290431849435578609'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/4290431849435578609'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/08/kolmogorov-smirnov-two-sample.html' title='KOLMOGOROV SMIRNOV Test (TWO SAMPLE) II'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-7434259472037368109</id><published>2007-08-30T12:42:00.000+01:00</published><updated>2007-08-30T12:52:44.158+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Math'/><category scheme='http://www.blogger.com/atom/ns#' term='nonparametric Test'/><title type='text'>Nonparametric Methods II (Brief Overview)</title><content type='html'>Basically, there is at least one nonparametric equivalent for each parametric general type of test. 
In general, these tests fall into the following categories: &lt;ul&gt;&lt;li&gt;Tests of differences between groups (independent samples); &lt;/li&gt;&lt;li&gt;Tests of differences between variables (dependent samples); &lt;/li&gt;&lt;li&gt;Tests of relationships between variables. &lt;/li&gt;&lt;/ul&gt;  &lt;b&gt;Differences between independent groups. &lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Usually, when we have two samples that we want to compare concerning their mean value for some variable of interest, we would use the &lt;i&gt;t&lt;/i&gt;-test for independent samples; nonparametric alternatives for this test are the &lt;i&gt;&lt;span style="font-weight: bold;"&gt;Wald-Wolfowitz &lt;/span&gt;runs test&lt;/i&gt;, the &lt;i style="font-weight: bold;"&gt;Mann-Whitney U test&lt;/i&gt;, and the&lt;span style="font-weight: bold;"&gt; &lt;/span&gt;&lt;i style="font-weight: bold;"&gt;Kolmogorov-Smirnov two-sample test&lt;/i&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;If we have multiple groups, we would use analysis of variance (see &lt;a linkindex="10" href="http://www.statsoft.com/textbook/stanman.html"&gt;&lt;i&gt;ANOVA/MANOVA&lt;/i&gt;&lt;/a&gt;); the nonparametric equivalents to this method are the &lt;i style="font-weight: bold;"&gt;Kruskal-Wallis analysis of ranks&lt;/i&gt; and the &lt;i style="font-weight: bold;"&gt;Median test&lt;/i&gt;.  &lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;b&gt;Differences between dependent groups.&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt; If we want to compare two variables measured in the same sample we would customarily use the &lt;a set="yes" linkindex="11" href="http://www.statsoft.com/textbook/stbasic.html#t-test%20for%20dependent%20samples"&gt;&lt;i&gt;t-test for dependent samples&lt;/i&gt;&lt;/a&gt; (for example, if we wanted to compare students' math skills at the beginning of the semester with their skills at the end of the semester). 
Nonparametric alternatives to this test are the &lt;i style="font-weight: bold;"&gt;Sign&lt;/i&gt;&lt;span style="font-weight: bold;"&gt; test &lt;/span&gt;and&lt;span style="font-weight: bold;"&gt; &lt;/span&gt;&lt;i style="font-weight: bold;"&gt;Wilcoxon's matched pairs&lt;/i&gt; test.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;If the variables of interest are dichotomous in nature (i.e., "pass" vs. "no pass") then &lt;i style="font-weight: bold;"&gt;McNemar's &lt;a linkindex="13" href="http://www.statsoft.com/textbook/glosc.html#Chi-square%20Distribution"&gt;Chi-square&lt;/a&gt;&lt;/i&gt;&lt;span style="font-weight: bold;"&gt; test&lt;/span&gt; is appropriate.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;If there are more than two variables that were measured in the same sample, then we would customarily use repeated measures ANOVA. Nonparametric alternatives to this method are &lt;i style="font-weight: bold;"&gt;Friedman's two-way analysis of variance&lt;/i&gt;&lt;span style="font-weight: bold;"&gt; &lt;/span&gt;and &lt;i style="font-weight: bold;"&gt;Cochran Q&lt;/i&gt;&lt;span style="font-weight: bold;"&gt; test&lt;/span&gt; (if the variable was measured in terms of categories, e.g., "passed" vs. "failed"). &lt;span style="font-weight: bold;"&gt;Cochran Q &lt;/span&gt;is particularly useful for measuring changes in frequencies (proportions) across time. &lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;b&gt;Relationships between variables. &lt;/b&gt; To express a relationship between two variables one usually computes the correlation coefficient. 
Nonparametric equivalents to the standard correlation coefficient are &lt;a style="font-weight: bold;" linkindex="14" href="http://www.statsoft.com/textbook/gloss.html#Spearman%20R"&gt;&lt;i&gt;Spearman R&lt;/i&gt;&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;, &lt;/span&gt;&lt;a style="font-weight: bold;" set="yes" linkindex="15" href="http://www.statsoft.com/textbook/glosi.html#Kendall%20Tau"&gt;&lt;i&gt;Kendall Tau&lt;/i&gt;&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;, and &lt;/span&gt;&lt;a style="font-weight: bold;" linkindex="16" href="http://www.statsoft.com/textbook/glosf.html#Gamma%20coefficient"&gt;coefficient Gamma&lt;/a&gt;&lt;span style="font-weight: bold;"&gt; &lt;/span&gt;(see &lt;a set="yes" linkindex="17" href="http://www.statsoft.com/textbook/stnonpar.html#correlations"&gt;&lt;i&gt;Nonparametric correlations&lt;/i&gt;&lt;/a&gt;).&lt;br /&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;If the two variables of interest are categorical in nature (e.g., "passed" vs. "failed" by "male" vs. 
"female") appropriate nonparametric statistics for testing the relationship between the two variables are &lt;span style="font-weight: bold;"&gt;the &lt;/span&gt;&lt;a style="font-weight: bold;" linkindex="18" href="http://www.statsoft.com/textbook/glosc.html#Chi-square%20Distribution"&gt;&lt;i&gt;Chi-square&lt;/i&gt;&lt;/a&gt;&lt;span style="font-weight: bold;"&gt; test&lt;/span&gt;, the &lt;i style="font-weight: bold;"&gt;Phi&lt;/i&gt;&lt;span style="font-weight: bold;"&gt; coefficient&lt;/span&gt;, and &lt;span style="font-weight: bold;"&gt;the &lt;/span&gt;&lt;i style="font-weight: bold;"&gt;Fisher exact&lt;/i&gt;&lt;span style="font-weight: bold;"&gt; test&lt;/span&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;In addition, a simultaneous test for relationships between multiple cases is available:  &lt;i style="font-weight: bold;"&gt;Kendall coefficient of concordance&lt;/i&gt;&lt;span style="font-weight: bold;"&gt;.&lt;/span&gt; This test is often used for expressing inter-rater agreement among independent judges who are rating (ranking) the same stimuli. &lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Descriptive statistics. &lt;/b&gt; When one's data are not normally distributed, and the measurements at best contain rank order information, then computing the standard descriptive statistics (e.g., &lt;span style="font-weight: bold;"&gt;mean, standard deviation&lt;/span&gt;) is sometimes not the most informative way to summarize the data. For example, in the area of psychometrics it is well known that the &lt;i&gt;rated&lt;/i&gt; intensity of a stimulus (e.g., perceived brightness of a light) is often a logarithmic function of the actual intensity of the stimulus (brightness as measured in objective units of &lt;i&gt;Lux&lt;/i&gt;). In this example, the simple mean rating (sum of ratings divided by the number of stimuli) is not an adequate summary of the average actual intensity of the stimuli. 
(In this example, one would probably rather compute the &lt;a linkindex="19" href="http://www.statsoft.com/textbook/glosf.html#Geometric%20Mean"&gt;geometric mean&lt;/a&gt;.) &lt;br /&gt;&lt;ul&gt;&lt;li&gt;Nonparametrics and Distributions will compute a wide variety of measures of &lt;span style="font-weight: bold;"&gt;location&lt;/span&gt; (&lt;a style="font-weight: bold;" set="yes" linkindex="20" href="http://www.statsoft.com/textbook/glosm.html#Mean"&gt;mean&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;, &lt;/span&gt;&lt;a style="font-weight: bold;" linkindex="21" href="http://www.statsoft.com/textbook/glosm.html#Median"&gt;median&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;, &lt;/span&gt;&lt;a style="font-weight: bold;" linkindex="22" href="http://www.statsoft.com/textbook/glosm.html#Mode"&gt;mode&lt;/a&gt;, etc.) and &lt;span style="font-weight: bold;"&gt;dispersion&lt;/span&gt; (&lt;a style="font-weight: bold;" linkindex="23" href="http://www.statsoft.com/textbook/glosu.html#Variance"&gt;variance&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;, average deviation, quartile range&lt;/span&gt;, etc.) to provide the "complete picture" of one's data.  
&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2556963392609813338-7434259472037368109?l=dbigbear.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dbigbear.blogspot.com/feeds/7434259472037368109/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=2556963392609813338&amp;postID=7434259472037368109' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/7434259472037368109'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2556963392609813338/posts/default/7434259472037368109'/><link rel='alternate' type='text/html' href='http://dbigbear.blogspot.com/2007/08/nonparametric-methods-ii-brief-overview.html' title='Nonparametric Methods II (Brief Overview)'/><author><name>Xiong  (Johnny) Deng</name><uri>http://www.blogger.com/profile/08140441552728271376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2556963392609813338.post-5763278678468016066</id><published>2007-08-29T18:49:00.000+01:00</published><updated>2007-08-30T10:41:53.021+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining: Syllabus'/><title type='text'>A Partial Syllabus of Data Analysis</title><content type='html'>&lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Probability&lt;/span&gt;&lt;/p&gt;  &lt;table class="MsoNormalTable" style="width: 543.75pt;" border="0" cellpadding="0" cellspacing="0" width="725"&gt;  &lt;tbody&gt;&lt;tr style="height: 27.75pt;"&gt;   &lt;td style="padding: 0.75pt; background: rgb(204, 255, 255) none repeat scroll 0% 
50%; width: 63pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; height: 27.75pt;" width="84"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 423pt; height: 27.75pt;" width="564"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;PROBABILITY THEORY :&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 53.25pt; height: 27.75pt;" width="71"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;p class="MsoNormal"&gt;&lt;a name="DISTRIBUTIONS"&gt;&lt;/a&gt;&lt;span lang="EN-US"&gt;Distributions&lt;/span&gt;&lt;/p&gt;  &lt;table class="MsoNormalTable" style="width: 543.75pt;" border="0" cellpadding="0" cellspacing="0" width="725"&gt;  &lt;tbody&gt;&lt;tr style="height: 27.75pt;"&gt;   &lt;td style="padding: 0.75pt; background: rgb(204, 255, 255) none repeat scroll 0% 50%; width: 63pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; height: 27.75pt;" width="84"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 423pt; height: 27.75pt;" width="564"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;CONTINUOUS DISTRIBUTIONS :&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 53.25pt; height: 27.75pt;" width="71"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 27.75pt;"&gt;   &lt;td style="padding: 0.75pt; background: rgb(204, 255, 255) none repeat scroll 0% 50%; width: 63pt; -moz-background-clip: -moz-initial; -moz-background-origin: 
-moz-initial; -moz-background-inline-policy: -moz-initial; height: 27.75pt;" width="84"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 423pt; height: 27.75pt;" width="564"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;DISCRETE DISTRIBUTIONS :&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 53.25pt; height: 27.75pt;" width="71"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;p class="MsoNormal"&gt;&lt;a name="Linear_Regression"&gt;&lt;/a&gt;&lt;span lang="EN-US"&gt;Linear Regression&lt;/span&gt;&lt;/p&gt;  &lt;table class="MsoNormalTable" style="width: 543.75pt;" border="0" cellpadding="0" cellspacing="0" width="725"&gt;  &lt;tbody&gt;&lt;tr style="height: 27.75pt;"&gt;   &lt;td style="padding: 0.75pt; background: rgb(204, 255, 255) none repeat scroll 0% 50%; width: 63pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; height: 27.75pt;" width="84"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 423pt; height: 27.75pt;" width="564"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;SIMPLE LINEAR REGRESSION&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; background: rgb(204, 255, 255) none repeat scroll 0% 50%; width: 53.25pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; height: 27.75pt;" width="71"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 27.75pt;"&gt;   &lt;td style="padding: 0.75pt; background: rgb(204, 
255, 255) none repeat scroll 0% 50%; width: 63pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; height: 27.75pt;" width="84"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 423pt; height: 27.75pt;" width="564"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;MULTIPLE LINEAR REGRESSION &lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; background: rgb(204, 255, 255) none repeat scroll 0% 50%; width: 53.25pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; height: 27.75pt;" width="71"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Estimation&lt;/span&gt;&lt;/p&gt;  &lt;table class="MsoNormalTable" style="width: 657pt;" border="0" cellpadding="0" cellspacing="0" width="876"&gt;  &lt;tbody&gt;&lt;tr style=""&gt;   &lt;td style="padding: 0.75pt; width: 22.5pt;" width="30"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 631.5pt;" width="842"&gt;   &lt;div align="center"&gt;   &lt;table class="MsoNormalTable" style="width: 591.75pt;" border="1" cellpadding="0" cellspacing="0" width="789"&gt;    &lt;tbody&gt;&lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span class="MsoHyperlink"&gt;&lt;span lang="EN-US"&gt;Confidence     
intervals&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; width: 258pt;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span class="MsoHyperlink"&gt;&lt;span lang="EN-US"&gt;Confidence     intervals for means of normal distributions&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;One-sample confidence intervals.&lt;br /&gt;    Two-sample confidence intervals : paired samples, independent samples     (variances known, unknown but equal, unknown and not equal).&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span class="MsoHyperlink"&gt;&lt;span lang="EN-US"&gt;Approximate confidence     intervals for means&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Asymptotic interval (no proof).&lt;br /&gt;    Welch's approximation.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;&lt;a name="MSE"&gt;&lt;/a&gt;     &lt;p
class="MsoNormal"&gt;&lt;span class="MsoHyperlink"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt;&lt;span style="text-decoration: none;"&gt; &lt;/span&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span class="MsoHyperlink"&gt;&lt;span lang="EN-US"&gt;Mean Square     Error (MSE)&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span class="MsoHyperlink"&gt;&lt;span lang="EN-US"&gt;Mean Square     Error (MSE)&lt;br /&gt;Minimum Mean Square     Error (MMSE) estimators&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;MSE of a parameter estimator.&lt;br /&gt;    Best estimate of a random variable X.&lt;br /&gt;    Best estimate of X when a second random variable
Y is available.&lt;br /&gt;    Properties of Minimum Mean Square Error estimators.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;&lt;a name="Sufficient_statistic"&gt;&lt;/a&gt;     &lt;p class="MsoNormal"&gt;&lt;span class="MsoHyperlink"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt;&lt;span style="text-decoration: none;"&gt; &lt;/span&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span class="MsoHyperlink"&gt;&lt;span lang="EN-US"&gt;Sufficient     statistic&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span class="MsoHyperlink"&gt;&lt;span lang="EN-US"&gt;First examples     of sufficient statistics&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td 
style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Sufficient statistics for :&lt;br /&gt;    * The Bernoulli distribution b(p),&lt;br /&gt;    * The uniform distribution U[0, θ],&lt;br /&gt;    * The Poisson distribution P(λ),&lt;br /&gt;    from the definition only.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span class="MsoHyperlink"&gt;&lt;span lang="EN-US"&gt;The     factorization theorem and applications&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;A necessary and sufficient condition     for a statistic to be sufficient.&lt;br /&gt;    Examples : Bernoulli, uniform, Poisson, normal (two methods), Gamma,     exponential.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;   &lt;/div&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Tests&lt;/span&gt;&lt;/p&gt;  &lt;table class="MsoNormalTable" style="width: 657pt;" border="0" cellpadding="0" cellspacing="0" width="876"&gt;  &lt;tbody&gt;&lt;tr style=""&gt;   &lt;td style="padding: 0.75pt; width: 22.5pt;" width="30"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 631.5pt;" width="842"&gt;   &lt;div align="center"&gt;   &lt;table class="MsoNormalTable" style="width: 591.75pt;" border="1"
cellpadding="0" cellspacing="0" width="789"&gt;    &lt;tbody&gt;&lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="ANOVA_OneWay"&gt;&lt;/a&gt;&lt;span lang="EN-US"&gt;ANOVA (One     way)&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Overview of ANOVA &lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;General principle of ANOVA&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Variance decomposition&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Total Sum of Squares, Factorial and     Residual sums of Squares.&lt;br /&gt;    A purely geometrical step.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: 
rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Distributions of the Sums of Squares&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Sums of Squares as random variables.     Distributions, independence. Properties as estimators of the common     variance.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;ANOVA's F test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;ANOVA is an F test.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;&lt;a name="dunnett_eng"&gt;&lt;/a&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin:
-moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Dunnett's test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Dunnett's test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Comparing group means to the mean of a     reference group.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;&lt;a name="t-test"&gt;&lt;/a&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;t-test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 
0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;What are t-tests ? &lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Is a sample average trustworthy ?&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;One-sample t-test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Is the sample mean significantly     different from expected ?&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Student's t&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Distribution of the mean when the     variance is unknown&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    
&lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;"Two dependent samples" t-test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Are the means of 2 dependent samples     equal ?&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;"Two Independent samples" t-test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Are the means of 2 independent samples     equal ?&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;t-test results&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;How do I read software results of t-tests     ?&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: 
-moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;&lt;a name="Chi-square_tests"&gt;&lt;/a&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Chi-square tests&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="The_basic_Chi-square_test"&gt;&lt;span lang="EN-US"&gt;The     basic Chi-square test&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Does a sample match a multinomial     distribution ?&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style="height: 19.5pt;"&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; height: 
19.5pt;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Continuous reference distribution&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt; height: 19.5pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Adapting the test to a continuous     variable&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style="height: 19.5pt;"&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; height: 19.5pt;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="Estimated_parameters"&gt;&lt;span lang="EN-US"&gt;Estimated     parameters&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt; height: 19.5pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;If some parameters of the reference     distribution are unknown&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style="height: 19.5pt;"&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial; height: 19.5pt;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="Chi-square_equality"&gt;&lt;span lang="EN-US"&gt;The     Chi-square test of equality&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt; height: 19.5pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Do several samples originate from the same     distribution ?&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; 
-moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="Chi-square_independence"&gt;&lt;span lang="EN-US"&gt;The     Chi-square test of independence&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Are two categorical variables     independent ?&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Complements on the Chi-square of     independence&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Largest value, contributions, alternate     coefficients.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;&lt;a name="Fisher-Irwin"&gt;&lt;/a&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: 
-moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The Fisher-Irwin test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The Fisher-Irwin test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Are these two coins identically biased     ?&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;&lt;a name="Kolmogorov-Smirnov"&gt;&lt;/a&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The Kolmogorov-Smirnov test&lt;/span&gt;&lt;/p&gt;     
&lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The Kolmogorov statistic&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Its definition, distribution function,     and the ensuing test.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Complements on the Kolmogorov test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Very short on : K-test or Chi-2 test ?     Estimated parameters. 
Normality test.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;&lt;a name="wmw"&gt;&lt;/a&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The Mann-Whitney test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The Mann-Whitney statistic&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Its definition, distribution function,     and the ensuing test.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; 
width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Complements on the Mann-Whitney test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Very short on : Why ranks ?     Location-shift test.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;&lt;a name="Newman-Keuls"&gt;&lt;/a&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Newman-Keuls test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p 
class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The Newman-Keuls test&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Pairwise comparisons of group means     that avoid "paradoxical" conclusions.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;   &lt;/div&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;p class="MsoNormal"&gt;&lt;a name="CLASSIFICATION"&gt;&lt;/a&gt;&lt;span lang="EN-US"&gt;Classification&lt;/span&gt;&lt;/p&gt;  &lt;table class="MsoNormalTable" style="width: 657pt;" border="0" cellpadding="0" cellspacing="0" width="876"&gt;  &lt;tbody&gt;&lt;tr style=""&gt;   &lt;td style="padding: 0.75pt; width: 22.5pt;" width="30"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 631.5pt;" width="842"&gt;   &lt;div align="center"&gt;   &lt;table class="MsoNormalTable" style="width: 591.75pt;" border="1" cellpadding="0" cellspacing="0" width="789"&gt;    &lt;tbody&gt;&lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="Fisher_linear_discriminant"&gt;&lt;span lang="EN-US"&gt;Fisher's     linear discriminant&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 315.75pt;" width="421"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat 
scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Fisher's criterion and Fisher's vector&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 315.75pt;" width="421"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Definition and justification of     Fisher's criterion.&lt;br /&gt;    Maximizing Fisher's criterion (2 classes).&lt;br /&gt;    Fisher's discriminant.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Maximizing the generalized Fisher's     criterion&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 315.75pt;" width="421"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Maximizing the ratio of two quadratic     forms.&lt;br /&gt;    Maximizing the generalized Fisher's criterion.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 315.75pt;" width="421"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat 
scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="Discriminant_Analysis"&gt;&lt;span lang="EN-US"&gt;Discriminant     Analysis&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;What is Discriminant Analysis ? &lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The most basic classification     technique.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="DA_2"&gt;&lt;span lang="EN-US"&gt;Discriminant Function     Analysis&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Finding new variables that are good at     separating classes.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; 
-moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Building a classifier&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Creating linear or quadratic     Classification Functions.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Complements on DA&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Just a little bit of maths.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="Logistic_Regression"&gt;&lt;span lang="EN-US"&gt;Logistic     
Regression&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;What is Logistic Regression ?&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;LR is a powerful generalization of     Discriminant Analysis.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;What is the "logit" ?&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The information needed to build a     score.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Linear logit beyond DA&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span 
lang="EN-US"&gt;Getting rid of the normality     assumption.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Estimating the coefficients of the     model&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Likelihood, and how it is maximized.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="Decision_Trees"&gt;&lt;span lang="EN-US"&gt;Decision Trees&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: 
rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;What are Decision Trees ? &lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Heuristic, yet powerful classifiers.     Can do Regression too.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Growing a Tree&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Node splitting, Tree growth and Tree     use.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Three types of predictors&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Handling categorical, ordinal and     numerical predictors&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; 
-moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Splitting a node&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Misclassification, Gini index, Entropy,     Chi-square, Twoing&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Priors and costs&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Weighting the observations to favorably     bias the Tree.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Stopping rules and Pruning&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Getting the right size Tree to avoid     overfitting&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; 
&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;   &lt;/tbody&gt;&lt;/table&gt;   &lt;/div&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;p class="MsoNormal"&gt;&lt;a name="EXPLORATORY"&gt;&lt;/a&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Exploratory Data Analysis&lt;/span&gt;&lt;/p&gt;  &lt;table class="MsoNormalTable" style="width: 657pt;" border="0" cellpadding="0" cellspacing="0" width="876"&gt;  &lt;tbody&gt;&lt;tr style=""&gt;   &lt;td style="padding: 0.75pt; width: 22.5pt;" width="30"&gt;   &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;   &lt;/td&gt;   &lt;td style="padding: 0.75pt; width: 631.5pt;" width="842"&gt;   &lt;div align="center"&gt;   &lt;table class="MsoNormalTable" style="width: 591.75pt;" border="1" cellpadding="0" cellspacing="0" width="789"&gt;    &lt;tbody&gt;&lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="PCA"&gt;&lt;span lang="EN-US"&gt;Principal Components     Analysis (PCA)&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; 
width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;What is PCA ? &lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;An optimal way to display data on a plane,     and more.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="What_are_Principal_Components"&gt;&lt;span lang="EN-US"&gt;What are Principal Components&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The most efficient synthetic variables     for representing data.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Finding the Principal Components&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Maximizing the inertia of projected     observations.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; 
-moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Projection of the observations&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The best projection of data on a plane.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;a name="PCA_2"&gt;&lt;span lang="EN-US"&gt;Projection of the     variables&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Visualizing correlation between     variables.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Interpreting PCA results&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Interpreting the Principal Components     and data distribution.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     
&lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Other applications of PCA&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Data Compression and Dimensionality     reduction.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;&lt;a name="Correspondence_Analysis"&gt;&lt;/a&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: white none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Correspondence Analysis&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Overview of Correspondence Analysis&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" 
width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Visualizing the interaction of two     categorical variables.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Reformatting data&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Contingency tables, frequencies,     profiles.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The Chi-square distance&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;...is more appropriate than Euclidean     distance.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;The two PCAs&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;How many dimensions, barycenters, total     inertia.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;  
  &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) none repeat scroll 0% 50%; width: 258pt; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;" width="344"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;General principles of interpretation of     CA&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;     &lt;td style="padding: 0.75pt; width: 309.75pt;" width="413"&gt;     &lt;p class="MsoNormal"&gt;&lt;span lang="EN-US"&gt;Factors, weights, inertias, plots,     quality of representation.&lt;/span&gt;&lt;/p&gt;     &lt;/td&gt;    &lt;/tr&gt;    &lt;tr style=""&gt;     &lt;td style="padding: 0.75pt; background: rgb(255, 255, 204) non
