Hi all, this blog is an English archive of my PhD experience at Imperial College London, mainly logging my research and working process, as well as some visual records.

Friday 29 June 2007

Principle of Indifference

The principle of indifference (also called the principle of insufficient reason) is a rule for assigning epistemic probabilities. Suppose that there are n > 1 mutually exclusive and collectively exhaustive possibilities. The principle of indifference states that if the n possibilities are indistinguishable except for their names, then each possibility should be assigned a probability equal to 1/n.
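
As a small illustration of the rule above, here is a minimal Python sketch that assigns 1/n to each of n named possibilities. The function name `indifference_prior` is made up for this example; it only covers the finite case described in the paragraph.

```python
from fractions import Fraction


def indifference_prior(possibilities):
    """Assign probability 1/n to each of n > 1 distinct possibilities.

    `indifference_prior` is a hypothetical helper name for this sketch.
    """
    outcomes = list(possibilities)
    n = len(outcomes)
    if n < 2:
        raise ValueError("the principle needs n > 1 possibilities")
    if len(set(outcomes)) != n:
        raise ValueError("possibilities must be distinct (mutually exclusive)")
    # Exact fractions avoid floating-point rounding when checking the sum.
    return {outcome: Fraction(1, n) for outcome in outcomes}


coin = indifference_prior(["heads", "tails"])
die = indifference_prior(range(1, 7))
```

Exact fractions are used so that the probabilities can be checked to sum to exactly 1.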

The principle of indifference is meaningless under the frequency interpretation of probability, in which probabilities are relative frequencies rather than degrees of belief in uncertain propositions, conditional upon a state of information.

Bayes' Theorem

http://en.advantacell.com/wiki/Bayesian_probability

Tuesday 19 June 2007

[Data Mining] CRoss Industry Standard Process for Data Mining (CRISP-DM)

Process Model

The current process model for data mining provides an overview of the life cycle of a data mining project. It contains the corresponding phases of a project, their respective tasks, and the relationships between these tasks. At this description level, it is not possible to identify all relationships. Relationships can exist between any data mining tasks, depending on the goals, background and interest of the user and, most importantly, on the data. An electronic copy of the CRISP-DM Version 1.0 Process Guide and User Manual is available free of charge. It contains step-by-step directions, tasks and objectives for each phase of the data mining process.


Figure: Phases of the CRISP-DM Process Model

The life cycle of a data mining project consists of six phases. The sequence of the phases is not strict. Moving back and forth between different phases is always required. Which phase, or which particular task within a phase, has to be performed next depends on the outcome of each phase. The arrows indicate the most important and frequent dependencies between phases.

The outer circle in the figure symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions. Subsequent data mining processes will benefit from the experiences of previous ones.

Below follows a brief outline of the phases:

Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.

Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

Data Preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.

Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. However, even if the analyst will not carry out the deployment effort it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models.
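
The six phases and the arrows between them can be sketched as a small directed graph. This is only an illustration: the transition table below is my reading of the arrows in the usual CRISP-DM diagram (an assumption, since the figure itself is not reproduced here), and the names `FREQUENT_NEXT` and `follows_frequent_arrows` are made up. Because the sequence of phases is not strict, the check flags unusual moves rather than forbidding them.

```python
PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

# Assumed "most important and frequent" dependencies between phases.
FREQUENT_NEXT = {
    "Business Understanding": {"Data Understanding"},
    "Data Understanding": {"Business Understanding", "Data Preparation"},
    "Data Preparation": {"Modeling"},
    "Modeling": {"Data Preparation", "Evaluation"},
    "Evaluation": {"Business Understanding", "Deployment"},
    "Deployment": set(),
}


def follows_frequent_arrows(path):
    """Return True if every step in `path` follows one of the assumed arrows."""
    for current, following in zip(path, path[1:]):
        if following not in FREQUENT_NEXT[current]:
            return False
    return True


# A project that steps back from Modeling to Data Preparation once.
project = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]
print(follows_frequent_arrows(project))
```

The outer circle of the figure (a new project starting after deployment) is deliberately left out of the table; it would be modelled as starting a fresh path at Business Understanding.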

Friday 1 June 2007

Perfect Collection of Data Mining Related Classic Monographs

http://www.wekacn.org/ebook/

Its home site, http://www.wekacn.org/, actually takes data mining as its main topic.

ACM Doctoral Dissertation Award

Presented annually to the author(s) of the best doctoral dissertation(s) in computer science and engineering. The amount of the award is $5,000. The winning dissertation is published by Springer.

2006
Ng, Yi-Ren
Honorable Mention Agarwala, Aseem

2005
Liblit, Ben
Honorable Mention Dousse, Olivier

2004
Barak, Boaz
Honorable Mention Johari, Ramesh
Honorable Mention Witchel, Emmett

2003
Doan, AnHai
Honorable Mention Katabi, Dina
Honorable Mention Khot, Subhash

2002
Guruswami, Venkatesan
Honorable Mention Miller, Robert C.
Honorable Mention Roughgarden, Tim

2001
Stoica, Ion
Honorable Mention O'Callahan, Robert
Honorable Mention Wagner, David

2000
Vadhan, Salil
Honorable Mention Chan, William
Honorable Mention Ernst, Michael D.

1999
van Melkebeek, Dieter

1998
Balakrishnan, Hari

1997
McCanne, Steven R.

1996
Tu, Xiaoyuan
Waldspurger, Carl

1995
Arora, Sanjeev
Spielman, Dan

1994
Karger, David
Raman, T.V.

1993
Sudan, Madhu
Honorable Mention Kistler, James J.
Honorable Mention Nayak, Pandu

1992
McMillan, Kenneth
Rosenblum, Mendel

1991
Schapire, Robert
Series Winner Gibson, Garth
Series Winner Lund, Carsten
Honorable Mention Dan, Asit

1990
Geffner, Hector
Heckerman, David
Series Winner Nisan, Noam

1989
Saraswat, Vijay
Series Winner Kilian, Joe
Series Winner Kearns, Michael J.

1988
Karchmer, Mauricio
Series Winner Condon, Anne
Series Winner Dill, David

1987
Canny, John
Series Winner Brown, Marc H.
Series Winner Greengard, Leslie

1986
Mulmuley, Ketan D.
Håstad, Johan Torkel
Series Winner Ebeling, Carl
Series Winner Ungar, David

1985
Ellis, John R.
Series Winner Chor, Ben-Zion
Series Winner Hillis, Daniel

1984
Katevenis, Manolis G.H.
Series Winner Bach, Carl E.
Series Winner Baird, Henry
Series Winner Korein, James

1983
Reps, Thomas W.
Series Winner Hildreth, Ellen
Series Winner Johnson, Steven

1982
Leiserson, Charles E.

1980
Cook, Douglas
Davis, Ruth E.
Larson, Lawrence Edwin
Slomin, Jacob

1978
Cattell, Roderic G.
Urban, Joseph



For details, see: http://bbs.taisha.org/thread-532846-1-20.html

Information Retrieval Research Source

Everything below is taken from http://net.pku.edu.cn/~webg/

Contents
Books
+ Finding Out About: Search Engine Technology from a Cognitive
Perspective (Belew, R.K., 2000)
http://www-cse.ucsd.edu/~rik/foa/
+ Foundations of Statistical Natural Language Processing
(C. Manning and H. Schutze, 1999)
+ Information Retrieval, 2nd edition (C.J. van Rijsbergen, 1979)
(full text)
http://www.dcs.gla.ac.uk/Keith/Preface.html
+ Information Retrieval: A Survey (Ed Greengrass, 2000)
http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
+ Information Retrieval: Data Structures & Algorithms
(Frakes, W. and Baeza-Yates, R., 1992)
http://www.dcc.uchile.cl/~rbaeza/iradsbook/irbook.html
+ Information Retrieval Interaction (Ingwersen, P., Taylor Graham, 1992)
http://www.db.dk/pi/iri/
+ Managing Gigabytes: Compressing and Indexing Documents and Images,
2nd edition (Ian H. Witten, Alistair Moffat, and Timothy Bell, 1999)
+ Mining the Web: Discovering Knowledge from Hypertext Data
(Soumen Chakrabarti, 2003)
+ Modeling the Internet and the Web:
Probabilistic Methods and Algorithms
(Pierre Baldi, Paolo Frasconi and Padhraic Smyth, 2003)
+ Modern Information Retrieval
(Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2000)
+ Readings in Information Retrieval.
(Sparck-Jones, K. and Willett, P., 1997)
+ Search Engine: Principle, Technology and Systems
(Xiaoming Li, et al., 2005), (full text)
http://sewm.pku.edu.cn/book/dlbook.html
+ The Geometry of Information Retrieval
(C.J. van Rijsbergen, 2004)
http://ir.dcs.gla.ac.uk/GeometryOfIR/
+ The Turn: Integration of Information Seeking and Retrieval in Context
(Ingwersen, P., and Jarvelin, K., 2005)
+ TREC: Experiment and Evaluation in Information Retrieval
(Voorhees, E.M., and Harman, D.K., 2005)
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=10667

Conferences and Workshops
+ CIKM: Conference on Information and Knowledge Management
http://www.csee.umbc.edu/cikm/
+ SIGIR: Special Interest Group on Information Retrieval
http://www.sigir.org/
+ World Wide Web
http://www.iw3c2.org/
+ SEWM: Symposium on Search Engine and Web Mining
(National Symposium on Search Engines and Web Information Mining, China)
http://net.pku.edu.cn/~sewm/

Courses
+ CMU Information Retrieval
http://nyc.lti.cs.cmu.edu/classes/11-741/ (Spring 2006)
Instructors: Jamie Callan and Yiming Yang
+ Cornell University The Structure of Information Networks (Spring 2006)
http://www.cs.cornell.edu/courses/cs685/2006sp/
Instructor: Jon Kleinberg
+ Peking University Web Based Information Architectures (Fall 2005)
http://net.pku.edu.cn/~wbia/
Instructors: Xiaoming Li, Jimin Wang and Bo Peng
+ Stanford Univ. Text Information Retrieval and Web Mining (Autumn 2005)
http://www.stanford.edu/class/cs276/
Instructors: Christopher Manning and Prabhakar Raghavan
+ UIUC Introduction to Text Information Systems (Spring 2006)
http://sifaka.cs.uiuc.edu/course/498cxz06s/
Instructor: ChengXiang Zhai
+ UMass Univ. Information retrieval course (Spring 2005)
http://ciir.cs.umass.edu/cmpsci646/
Instructor: James Allan
+ Washington Univ. Search Engines course
http://courses.washington.edu/lis544/

Evaluation Resources
+ CLEF: Cross-Language Evaluation Forum
http://clef.iei.pi.cnr.it/
+ CWIRF: Chinese Web Information Retrieval Forum
http://www.cwirf.org/
+ DUC: Document Understanding Conferences
http://duc.nist.gov/
+ INEX: INitiative for the Evaluation of XML Retrieval
http://inex.is.informatik.uni-duisburg.de/
+ NTCIR: NII-NACSIS Test Collection for IR Systems
http://research.nii.ac.jp/ntcir/
+ TREC: Text REtrieval Conference
http://trec.nist.gov/

Journals
+ Briefings in Bioinformatics (full text)
http://bib.oxfordjournals.org/archive/
+ Computational Linguistics, The MIT Press
http://mitpress.mit.edu/catalog/item/default.asp?ttype=4&tid=10
+ Data & Knowledge Engineering (DKE), Elsevier
http://www.elsevier.com/wps/find/journaldescription.cws_home/505608/description?navopenmenu=-2
+ D-Lib Magazine
http://www.dlib.org/
+ Information Processing Letters, Elsevier
http://www.elsevier.com/locate/issn/00200190
+ Information Processing and Management (IP&M), Elsevier
http://www.elsevier.com/locate/infoproman
+ Information Retrieval, Springer
http://www.springer.com/sgw/cda/frontpage/0,11855,3-0-70-35744790-detailsPage%253Djournal%257Cdescription%257Cdescription,00.html
+ Information Research
http://informationr.net/ir
+ International Journal on Digital Libraries, Springer
http://link.springer.de/link/service/journals/00799/index.htm
+ International Journal of Cooperative Information Systems (IJCIS),
World Scientific
http://ejournals.wspc.com.sg/ijcis/ijcis.shtml
+ International Journal on Document Analysis and Recognition, Springer
http://link.springer.de/link/service/journals/10032/index.htm
+ International Journal of Intelligent Systems, Wiley
http://www3.interscience.wiley.com/cgi-bin/jhome/36062
+ International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (IJUFKS), World Scientific
http://ejournals.wspc.com.sg/ijufks/ijufks.shtml
+ Journal of the American Society for Information Science and Technology (JASIST), Wiley
http://www3.interscience.wiley.com/cgi-bin/jhome/76501873
+ Journal of Documentation (JDoc), Emerald
http://www.emeraldinsight.com/0022-0418.htm
+ Journal of Intelligent Information Systems (JIIS), Springer
http://www.wkap.nl/journalhome.htm/0925-9902
+ Knowledge and Information Systems (KAIS), Springer
http://link.springer.de/link/service/journals/10115/index.htm
+ Natural Language Engineering, Cambridge University Press
http://www.cambridge.org/journals/journal_catalogue.asp?mnemonic=NLE
+ Transactions On Information Systems (TOIS), ACM
http://www.acm.org/tois/
+ Transactions on Knowledge and Data Engineering (TKDE), IEEE
http://www.computer.org/tkde/

List Archives
+ SIG-IRList, http://www.sigir.org/sigirlist/index.html

Organizations and Special Interest Groups
+ Cambridge NLIP, http://www.cl.cam.ac.uk/Research/NL/
+ CMU LTI, http://www.lti.cs.cmu.edu/
+ DEC laboratories in Palo Alto, Calif.
+ Glasgow Information Retrieval Group, http://www.dcs.gla.ac.uk/ir/
+ Google Labs, http://labs.google.com/
+ LTI, http://www.lti.cs.cmu.edu/
+ Massachusetts CIIR, http://ciir.cs.umass.edu/
+ MSR Asia, Web Search & Data Mining Group
http://research.microsoft.com/wsm/
+ Stanford InfoLab, http://infolab.stanford.edu/
+ UIUC Information Retrieval Group, http://sifaka.cs.uiuc.edu/ir/
+ PKU Tianwang Group, http://sewm.pku.edu.cn/
+ Institute of Computational Linguistics, Peking University, http://icl.pku.edu.cn/
+ Fudan University Information Retrieval and Natural Language Processing Group,
http://www.cs.fudan.edu.cn/mcwil/irnlp/
+ HIT Information Retrieval Group, http://ir.hit.edu.cn/
#+ State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, (URL could not be reached)
# http://www.csai.tsinghua.edu.cn/
+ CAS Large-scale Content Computing Group, http://159.226.40.18/

Researchers
+ ChengXiang Zhai, developing Lemur
http://www-faculty.cs.uiuc.edu/~czhai/
+ Gerard Salton
http://www.cs.cornell.edu/Info/Department/Annual95/Faculty/Salton.html
+ Karen Sparck Jones, developing IDF
http://www.cl.cam.ac.uk/users/ksj/
+ Keith van Rijsbergen
http://www.dcs.gla.ac.uk/~keith/
+ Jamie Callan
http://www.cs.cmu.edu/~callan/
+ Jon Kleinberg, developing HITS
http://www.cs.cornell.edu/home/kleinber/
+ Li Xiaoming, developing Tianwang & Infomall
+ Nick Craswell, developing Terabyte Track
http://research.microsoft.com/~nickcr
+ Susan Dumais, developing LSI
http://research.microsoft.com/~sdumais/
+ Yiming Yang, developing text categorization
http://www.cs.cmu.edu/~yiming/
+ Stephen Robertson
http://research.microsoft.com/users/robertson/
+ Tefko Saracevic
http://www.scils.rutgers.edu/~tefko/
+ W. Bruce Croft
http://ciir.cs.umass.edu/personnel/croft.html

Research-related Resources
+ http://www-faculty.cs.uiuc.edu/~czhai/research.html

Software
+ Apache Lucene: a full-featured text search engine library
http://lucene.apache.org/java/docs/index.html
+ Gate: a general architecture for text engineering
http://gate.ac.uk/
+ Lemur: A full-text search engine
http://www.lemurproject.org/
+ MG: A full-text search engine
http://www.math.utah.edu/pub/mg/
+ Porter Stemmer: English stemming algorithm
http://www.tartarus.org/martin/PorterStemmer/
+ Nutch: an open source web search engine
http://sourceforge.net/projects/nutch/
+ TSE: A Tiny Search Engine
http://sewm.pku.edu.cn/src/TSE/

---------------------
References:
[1] Information Retrieval Resources, http://www.sigir.org/resources.html
[2] http://ir.dcs.gla.ac.uk/resources.html
[3] http://www.cs.cmu.edu/~callan/Teaching/Resources.html
[4] Diekemar, Information Retrieval Links, Jan. 28, 1999.
http://web.syr.edu/~diekemar/ir.html
[5] Chen Hongbiao, Studying Information Retrieval Online, Nov. 1999.
http://159.226.40.18/freshman/resources/网上研习信息检索.doc
[6] Data Mining Research Institute, http://www.dmresearch.net/
[7] Speech and Natural Language Online, http://www.snlpinfo.com/index.php
[8] PKU SEWM Group, http://sewm.pku.edu.cn/
[9] http://www.cs.cmu.edu/~callan/Teaching/Resources.html
[10] http://icl.pku.edu.cn/member/lisujian/maincontent.htm
[11] http://www.cs.fudan.edu.cn/mcwil/irnlp/link.htm
[12] Robert Krovetz, A Guide to the Literature of Information Retrieval,
http://159.226.40.18/freshman/resources/guide-to-ir-lit.ps
[13] ACM Digital Library,
http://portal.acm.org/portal.cfm
http://acm.lib.tsinghua.edu.cn/acm/
[14] http://www.sigir.org/proceedings/Proc-Browse.html
[15] SIGIR,
http://portal.acm.org/browse_dl.cfm?linked=1&part=series&idx=SERIES278&coll=portal&dl=ACM&CFID=72474811&CFTOKEN=69288563
[16] WWW, International World Wide Web Conference
http://portal.acm.org/browse_dl.cfm?linked=1&part=series&idx=SERIES968&coll=portal&dl=ACM&CFID=72474811&CFTOKEN=69288563
[17] China Digital Journal Community, http://wanfang.calis.edu.cn/wf/szhqk/index.html



---------------------

More details are listed as follows
====================
CIIR
(The Center for Intelligent Information Retrieval,
University of Massachusetts, USA)
http://ciir.cs.umass.edu/

The Center for Intelligent Information Retrieval, a National Science
Foundation-created S/IUCRC Center, is one of the leading information retrieval
research labs in the world. The CIIR develops tools that provide effective
and efficient access to large, heterogeneous, distributed, text and
multimedia databases.

CIIR accomplishments include significant research advances in the areas of
distributed information retrieval, information filtering, topic detection,
multimedia indexing and retrieval, document image processing, terabyte
collections, data mining, summarization, resource discovery, interfaces
and visualization, and cross-lingual information retrieval.

The Center for Intelligent Information Retrieval continues to support the
emerging information infrastructure, both through research and technology
transfer. The goal of the CIIR is to develop tools that provide effective
and efficient access to large, heterogeneous, distributed, text and
multimedia databases.

====================
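Glasgow Information Retrieval Group
http://www.dcs.gla.ac.uk/ir/
The information retrieval research group at the University of Glasgow, UK,
led by Keith van Rijsbergen. The group gives equal weight to theory and
practice and aims to build efficient, novel and successful multimedia
information retrieval systems that serve end users.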

The Information Retrieval Group led by Professor Keith van Rijsbergen has a
vigorous programme of research, based on both theory and experiment, aimed at
giving end-users novel, effective, and efficient access to the world of
multi-media information. The group, part of the Department of Computing Science,
University of Glasgow, has a strong research history in a wide area of
information retrieval research from theoretical modelling of the retrieval
process to advanced system building and to the user-oriented evaluation of
information retrieval systems. The group's interests also include many areas
of Web information retrieval such as link analysis, summarisation and the
development of novel interaction techniques (e.g., ostension, implicit feedback
and graphical visualisation). Our research preserves a strong emphasis on
the evaluation of interactive IR systems, and the group maintains strong links
with researchers in Human-Computer Interaction and Psychology.

------
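Keith van Rijsbergen, http://www.dcs.gla.ac.uk/~keith/
University of Glasgow, UK. A leading representative of the logical-inference
school of probabilistic IR, and author of the well-known classic IR textbook
Information Retrieval, which focuses on probabilistic approaches to
information retrieval.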

=====================
Cambridge NLIP Group
(Natural Language and Information Processing Group)
http://www.cl.cam.ac.uk/Research/NL/

Research in NLIP has been done in the Computer Laboratory for nearly fifty years.
The earliest work, by Roger Needham and Karen Sparck Jones, was on automatic
thesaurus construction, in the context of document retrieval and machine translation.
Subsequent research by Karen Sparck Jones during the 1960s and 70s focused on
statistical approaches to retrieval and included innovative work on term
weighting. From the later 1970s research in language processing developed,
with work on syntax, semantics and discourse processing,

------
Karen Sparck Jones, http://www.cl.cam.ac.uk/users/ksj/
Karen Sparck Jones has been one of the most influential figures in Computing
since the 1950s. Her work on Information Retrieval and Natural Language Processing
has never been so central as it is today, with its implications for
search engine technology, the semantic web and even bioinformatics.

In 1972, Karen Sparck Jones published in the Journal of Documentation the paper
which defined the term weighting scheme now known as inverse document frequency (IDF).
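
To make the idea concrete, here is a minimal sketch of inverse document frequency in its commonly quoted form, idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. This is an assumption for illustration and not necessarily the exact formula used in the 1972 paper; the function name `idf` and the toy collection are made up.

```python
import math


def idf(term, documents):
    """Inverse document frequency of `term` over a list of token sets.

    Uses the textbook form log(N / df); real systems often smooth this.
    """
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(n_docs / doc_freq)


# Toy collection: each document is a set of its tokens.
collection = [
    {"information", "retrieval", "system"},
    {"information", "theory"},
    {"retrieval", "evaluation"},
    {"information", "seeking"},
]

print(idf("information", collection))  # common term, low weight
print(idf("evaluation", collection))   # rare term, higher weight
```

A term that appears in most documents gets a weight near zero, while a term that appears in only a few gets a large weight, which is the intuition behind the scheme.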

Karen Sparck Jones is emeritus Professor of Computers and Information at the
Computer Laboratory, University of Cambridge. She has worked in automatic
language and information processing research since the late fifties,
and has many publications including several books, most recently `Evaluating
Natural Language Processing Systems' with Julia Galliers, and `Readings in
Information Retrieval', edited with Peter Willett.

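Winner of the 1988 Salton Award. Another founder of the modern probabilistic
IR model. She has made substantial contributions to NLP, IR and related fields,
and has also done a great deal of organisational work. She currently works at
the Computer Laboratory, University of Cambridge, UK.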

====================
LTI
CMU (Carnegie Mellon University) Language Technologies Institute,
http://www.lti.cs.cmu.edu/

The Language Technologies Institute (LTI) of the School of Computer Science at
Carnegie Mellon University conducts research and provides graduate education
in all aspects of language technology and information management. The LTI was
established in 1996, as an expansion of the Center for Machine Translation
(CMT).

The Center for Machine Translation (CMT) was a research branch of the School
of Computer Science devoted to basic and applied research in all aspects of
natural language processing, with a primary focus on machine translation,
speech processing, and information retrieval. Containing a unique mix of
academic and industrial researchers specializing in various aspects of
computer science, artificial intelligence, computational linguistics and
theoretical linguistics, the CMT provided a rich and diverse environment for
collaboration among faculty, staff, visiting scholars, and qualified students.

------
Lemur Toolkit
Lemur is a collection of search engine algorithms and information retrieval
applications used for IR research, development and education. Lemur provides a
rich query language that supports search against simple texts, structured
(XML) texts, and texts annotated with part-of-speech, named-entity, and other
annotations used in NLP and text-mining applications. Lemur's search engines
comfortably support collections ranging from a few gigabytes to a few
terabytes of text. The software is distributed under an open-source license,
and is used widely in the IR research community.

====================
Stanford InfoLab
http://infolab.stanford.edu/

The Stanford WebBase Project
http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/

The Stanford WebBase project is investigating various issues in crawling,
storage, indexing, and querying of large collections of Web pages. The project
builds on the previous Google activity that was part of the DLI1 initiative.
The DLI2 WebBase project aims to build the necessary infrastructure to
facilitate the development and testing of new algorithms for clustering,
searching, mining, and classification of Web content.
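
====================
PKU Tianwang Group, http://sewm.pku.edu.cn/

The Network Laboratory of Peking University has been engaged in search engine
research and system development since 1997. It has accumulated deep technical
expertise, and its overall strength and academic influence have consistently
been at the forefront in China. The "Tianwang" search engine system we
developed is the most influential campus-born search engine in the country and
has been running continuously since October 1997. Tianwang has strong
advantages in incremental search, fast retrieval and massive information
storage. Its continued development has trained successive groups of students
with hands-on experience in processing massive amounts of web text, who are
widely welcomed by Chinese and foreign IT companies.
Since 2001, building on its search engine technology, the group has been
collecting and archiving the history of information on the Chinese Internet,
forming the "Chinese Internet Information Museum". To date it holds 2 billion
Chinese web pages that appeared at different times, and it is currently the
largest historical web page collection and replay system in China. We have
also tried interdisciplinary research on this basis.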

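====================
CAS Large-scale Content Computing Group
http://159.226.40.18/

The information retrieval team mainly studies the retrieval of text
information. It has taken part in TREC many times and achieved very good
research results. The Tianluo retrieval system developed by the team is widely
used in many important national information departments. Its current main
research directions include Web information acquisition and Web information
retrieval.
The information analysis team focuses on the analysis and mining of
large-scale, multi-source, heterogeneous information, mainly text
classification and clustering, information filtering, personalized services,
natural-language question answering and shallow natural language processing.
The team has built a series of experimental platforms for text information
processing, which can currently be tried out through the "Demos" section of
the home page. Also worth mentioning is the team's open-source programme: its
high-performance word segmentation system ICTCLAS has been widely recognised
and used by researchers.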

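====================
Fudan University Information Retrieval and Natural Language Processing Group,
http://www.cs.fudan.edu.cn/mcwil/irnlp/

Large-scale text processing mainly studies techniques and methods for
processing natural language (especially Chinese-language information) and
covers two areas. The first is foundational work, mainly basic theory and
algorithms, including automatic word segmentation, unknown-word recognition,
part-of-speech and concept tagging, syntactic analysis and semantic analysis,
as well as the collection and curation of corpora. The second is applied
technology for Chinese information processing, including automatic indexing,
text retrieval, text summarization, text classification and text filtering,
and especially the application of these techniques in a network environment.
This part of the work is the research focus of the text direction.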

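====================
HIT-IRLab, http://ir.hit.edu.cn/

The Information Retrieval Laboratory at Harbin Institute of Technology
(HIT-IRLab) was founded in March 2001. Its research areas include text
retrieval, question answering, automatic summarization, text mining and
language analysis. The lab treats language analysis as its basic research and
text filtering as its applied research. It sees information extraction as the
extension of language analysis from sentence understanding to discourse
understanding, and sentence retrieval as an intelligent, precise retrieval
technique supported by language analysis and discourse understanding.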

====================
SIGIR (ACM Special Interest Group on Information Retrieval)
TREC (Text REtrieval Conference)
MUC (Message Understanding Conference)
TIPSTER (the IR testbed programme of the US Defense Advanced Research Projects Agency)

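====================
Institute of Computational Linguistics, Peking University
http://icl.pku.edu.cn/

The Institute of Computational Linguistics at Peking University was founded in
1986. It is devoted to research in three areas: computational linguistics
theory, basic resources for language information processing, and applied
technology.
Centred on computational linguistics and natural language processing, its work
has three main directions. The first is the study and construction of basic
resources: computational lexicography and machine dictionaries, comprehensive
language knowledge bases, corpus linguistics and corpus processing techniques,
terminology, automatic term extraction and term standardisation. The second is
basic theory and NLP models and methods: foundations of computational
linguistics, core natural language processing techniques, modern Chinese
grammar, lexical/syntactic/semantic analysis of Chinese, statistical NLP
models and information-theoretic methods for language processing. The third is
applied technology: methods, techniques and system implementation for machine
translation, information retrieval and extraction, evaluation methods and
techniques for natural language information processing systems, controlled
Chinese and its writing-assistance systems, and computer-aided research on
classical Chinese poetry.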

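====================
#State Key Laboratory of Intelligent Technology and Systems, Tsinghua University (URL could not be reached)
#http://www.csai.tsinghua.edu.cn/

The State Key Laboratory of Intelligent Technology and Systems is hosted by
Tsinghua University. The laboratory opened to outside researchers in February
1990. It mainly carries out basic and applied-basic research on the
fundamental principles and methods of artificial intelligence, including
intelligent information processing, machine learning, intelligent control and
neural network theory. It also studies AI-related applied technology and
system integration, chiefly intelligent robots and the processing of sound,
graphics, images, text and language.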

================
Susan Dumais,
http://research.microsoft.com/~sdumais/

I am interested in algorithms and interfaces for improved information
retrieval, as well as general issues in human-computer interaction. I
joined Microsoft Research in July 1997. I work on a wide variety of
information access and management issues, including: personal information
management, web search, question answering, information retrieval, text
categorization, collaborative filtering, interfaces for improved search and
navigation, and user/task modeling.

Prior to coming to Microsoft, I worked on a statistical method for
concept-based retrieval known as Latent Semantic Indexing. You can find
pointers to this work on the Bellcore (now Telcordia) LSI page.

===============
UIUC Information Retrieval Group
http://sifaka.cs.uiuc.edu/ir/

The Information Retrieval (IR) group is part of the Database and Information
Systems (DAIS) Lab of the Computer Science Department at University of
Illinois at Urbana-Champaign. We work on a wide spectrum of problems in the
general area of text information management, including retrieval,
organization, filtering , and mining of textual information, aiming at
developing advanced text information management techniques and systems that
help people make better use of text information.

------
ChengXiang Zhai,
http://www-faculty.cs.uiuc.edu/~czhai/

Research Interests: Information Retrieval, Text Mining, Natural Language
Processing, Bioinformatics

ChengXiang Zhai, of the University of Illinois at Urbana-Champaign, is
recognized for his work on user-centered, adaptive intelligent information
access. His techniques are expected to improve search-engine performance,
support better information organization and enable understanding of large
volumes of information. Zhai's work in information retrieval is expected to
enhance curricula and provide new educational tools for the growing
information technology workforce.

===============
Stephen Robertson,
http://research.microsoft.com/users/robertson/

Stephen Robertson joined Microsoft Research Cambridge in April 1998.

In 1998, he was awarded the Tony Kent STRIX award by the Institute of
Information Scientists. In 2000, he was awarded the Salton Award by ACM SIGIR.
He is a Fellow of Girton College, Cambridge.

At Microsoft, he runs a group called Information Retrieval and Analysis, which
is concerned with core search processes such as term weighting, document
scoring and ranking algorithms, and combination of evidence from different
sources. These are studied theoretically through the use of formal models,
mainly statistical, and statistical methods including machine learning
methods, and experimentally, through activities such as the Text Retrieval
Conference (TREC) and with internally generated evaluation sets. The group
(with its Keenbow evaluation environment) has had some excellent results at
TREC. The group works closely with product groups to transfer ideas and
techniques.

His main research interests are in the design and evaluation of retrieval
systems. He is the author, jointly with Karen Sparck Jones, of a probabilistic
theory of information retrieval, which has been moderately influential. A
further development of that model, with Stephen Walker, led to the term
weighting and document ranking function known as Okapi BM25, which is used in
many experimental text retrieval systems.
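
For readers who have not met it, the following is a minimal sketch of a BM25-style ranking function in plain Python. It is an illustration under stated assumptions, not the Okapi system's own code: the parameter defaults k1 = 1.2 and b = 0.75 are conventional choices, the IDF uses the smoothed variant log(1 + (N - n + 0.5) / (n + 0.5)) that stays non-negative, and the name `bm25_scores` and the toy documents are made up.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against the `query` tokens."""
    if not docs:
        raise ValueError("need at least one document")
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # Length normalisation: long documents are penalised via b.
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    ["okapi", "retrieval", "system"],
    ["text", "retrieval"],
    ["weather", "report"],
]
scores = bm25_scores(["okapi", "retrieval"], docs)
print(scores)
```

The first document matches both query terms, the second matches only the more common one, and the third matches neither, so the scores come out in that order.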

Prior to joining Microsoft, he was at City University London, where he retains
a part-time position as Professor of Information Systems in the Department of
Information Science (homepage). He was Head of Department for eight years,
during which time it achieved the highest possible rating in two successive
research assessment exercises. He also started the Centre for Interactive
Systems Research, the main research vehicle of which is the Okapi text
retrieval system, which has also done well at TREC.

Before joining City, he was a research fellow at University College London,
where he took his PhD in the School of Library Archive and Information
Studies. Before that he was in the research department at Aslib. He has an MSc
in Information Science from City and a first degree in mathematics from
Cambridge.

===================
Nick Craswell
http://research.microsoft.com/~nickcr

I am an associate researcher at Microsoft Research Cambridge, in the
Information Retrieval and Analysis Group.

Research Overview

I am interested in Web search evaluation, mostly on enterprise-scale webs but
also the World Wide Web. I built the VLC, VLC2, WT2g and .GOV test
collections, which have been made available to research groups around the
world. David Hawking and I coordinated the TREC Web Track experiments. I am
currently involved in the TREC Terabyte Track and Enterprise Track. Some
publications: Book chapter preprint (pdf), IR'01 (citeseer) and CSIRO'01
(pdf).

I also work on effective Web search, which means making use of information in
pages, link structure and URL structure to generate more useful Web search
results. Some papers: SIGIR'05 (pdf), SIGIR'01 (pdf), TOIS'03 (pdf) (copying
is by permission of ACM, Inc.) and ADCS'03 (pdf).

My PhD was in distributed information retrieval (thesis pdf) which means
building a system on top of multiple engines/databases that already exist. My
recent work in the area has considered whether (or when) DIR is really
practical. Some papers: ADC'99 (ps), DL'00 (pdf), ADC'03 (pdf) and ADC'04
(pdf).

===============
Web Search & Data Mining Group of MSR Asia
http://research.microsoft.com/wsm/

The goal of the Web Search & Data Mining Group of MSR Asia is to drive the
next generation of Web search by leveraging data mining, machine learning, and
knowledge discovery techniques for information analysis, organization,
retrieval, and visualization. In addition, in contrast with current Web search
methods, which essentially do document-level ranking and retrieval, the Web
Search & Data Mining Group has created search at the object level to bring
increased knowledge and intelligence to users.

A Glimpse at Several Core Innovations:

Large-scale Experimental Web Search Platform

The Web Search & Data Mining Group is creating a large-scale search platform
to efficiently store, parse, index and search billions of Web pages and other
types of documents. The search platform is flexible enough to allow for
testing of various state-of-the-art search techniques that have been created
at the lab using new technologies.

Structuralizing the Web

The biggest challenge facing both users and search engines over the next
several decades is the continued unstructured growth of the Internet. As such,
search functions that can effectively and efficiently dig out
machine-understandable information and knowledge layers from unorganized and
unstructured Web data will be the key to supporting relevant search results.
To meet this challenge, the group is exploring technologies, namely Web
information extraction, deep Web mining, and Web structure mining, that can
automatically classify structures and extract objects from the Web. The
information and knowledge gathered using these new techniques greatly improve
the performance of current Web search and even facilitate the creation of
more sophisticated next-generation search technologies.

Vertical Search

Today's conventional search engines can be described as page-level search
engines whose main function is to rank web pages according to their relevance
to a given query. Driving the future of the search industry are functions that
delve deeper into vertical domains to provide knowledge and intelligence to
query results. At MSR Asia, the Web Search & Data Mining Group is addressing
the greatest challenges faced by vertical search, including large-scale Web
classification, object-level information extraction, object identification and
integration, and object relationship mining and ranking. The results of these
efforts are leading to more advanced search engines that deliver intelligence
and insight to search results.

Mobile Search

The explosive growth of new computing devices such as handheld computers,
Windows Mobile-based PocketPCs, and SmartPhones is driving demand for greater
and more efficient information access. These devices, which leverage the power
of the Web and allow greater access to information than ever before, are still
not capable of performing at the level of a desktop PC. At MSR Asia, the Web
Search & Data Mining Group is inventing new technologies to improve the mobile
search and browsing experience and deliver the capabilities of a PC to users
of these new devices. Project initiatives include developing innovative
presentation schemes and user interfaces to facilitate search and browsing
tasks on mobile devices and developing context-aware search technologies to
address the special information needs of mobile users.

Multimedia Search

The Web Search & Data Mining Group is conducting research into new
technologies that index multimedia content such as images, videos, and audio.
Through content analysis and advanced visualization techniques, the group is
transforming today's conventional text-based search engines to include
multimedia content, thus delivering more intelligent search results to users.
For example, the group recently developed a new multimedia news reader that
mines large archival news databases and presents text, map information, images,
and background music within a unique user interface, providing readers with a
more efficient news search engine and a more enjoyable reading experience.

------
Wei-Ying Ma
http://research.microsoft.com/users/wyma/

Senior Researcher, Research Manager, Microsoft Research Asia

Dr. Wei-Ying Ma received the B.S. degree in electrical engineering from the
National Tsing Hua University in Taiwan in 1990, and the M.S. and Ph.D.
degrees in electrical and computer engineering from the University of
California at Santa Barbara in 1994 and 1997, respectively. From 1994 to 1997
he was engaged in the Alexandria Digital Library (ADL) project at UCSB while
completing his Ph.D. He developed a web-based image retrieval system called
Netra which has been frequently cited by other researchers and is regarded as
one of the most representative image retrieval systems. From 1997 to 2001, he
was with HP Labs where he worked in the field of multimedia adaptation and
distributed media services infrastructure. He joined Microsoft Research Asia
in 2001. Since then, he has been leading a research group to conduct research
in the areas of information retrieval, web search, data mining, mobile
browsing, and multimedia management. He currently serves as an Editor for the
ACM/Springer Multimedia Systems Journal and Associate Editor for ACM
Transactions on Information Systems (TOIS). He has served on the organizing and
program committees of many international conferences including ACM Multimedia,
ACM SIGIR, ACM CIKM, WWW, ICME, CVPR, SPIE Multimedia Storage and Archiving
Systems, and SPIE Multimedia Communication and Networking. He is also the
general co-chair of the International Multimedia Modeling (MMM) Conference 2005
and the International Conference on Image and Video Retrieval (CIVR) 2005. He has
published 5 book chapters and over 100 international journal and conference
papers.

====================
Google Labs
http://labs.google.com/

Google Labs is a playground for Google engineers and adventurous Google users.
Google staffers with wild and crazy ideas post their prototypes on Google Labs
and solicit feedback on how the technology could be used or improved. None of
these experiments are guaranteed to make it onto Google.com, as this is really
the first phase in the development process. Google users with a desire to jump
over the cutting edge are invited to check out any or all of the posted
prototypes and send their comments directly to the Googlers who developed
them. Please, remember to wear your safety goggles while using this site.

Labs.google.com, Google's technology playground.
Google Labs showcases a few of our favorite ideas that aren't quite ready for
prime time. Your feedback can help us improve them. Please play with these
prototypes and send your comments directly to the Googlers who developed them.

Want to learn more about Google technology? Here are some papers.
http://labs.google.com/papers/index.html

Passionate about these topics? You should work at Google.
algorithms, artificial intelligence, compiler optimization,
computer architecture, computer graphics,
data compression, data mining, file system design,
genetic algorithms, information retrieval,
machine learning, natural language processing, operating systems,
profiling, robotics,
text processing, user interface design,
web information retrieval, and more!

http://www.google.com/press/podium.html
Google Press Center: The Google Podium
Here you'll find a selection of public presentations made by Google
executives. From time to time, we will continue to add transcripts, audio or
video clips and links to presentations hosted elsewhere.

====================
Jon Kleinberg
http://www.cs.cornell.edu/home/kleinber/

Professor of Computer Science, Cornell University

My research is concerned with algorithms that exploit the combinatorial
structure of networks and information. My recent work has included
* link analysis and modeling of the World Wide Web and related information networks;
* discrete optimization and network algorithms; and
* algorithmic approaches to clustering, indexing, and data mining.
====================

Major Conference Proceedings on Data Mining

ACM SIGKDD (Knowledge Discovery and Data Mining)
ACM SIGIR Conference on Information Retrieval
ACM SIGMOD (Management of Data)
VLDB (Very Large Data Bases)
IEEE ICDM (IEEE International Conference on Data Mining)
SIAM SDM (SIAM Data Mining conference)
IEEE ICDE (International Conference on Data Engineering)
ICML (International Conference on Machine Learning)
WWW (International World Wide Web Conference)

and other related conferences.

Recommended Data Mining related Monographs

The following texts are recommended for reference. Numerous other books and online resources on data mining are also available.

1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006. See the book's home page for errata, course slides, and other reference materials.

2. Soumen Chakrabarti, Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data, Morgan Kaufmann, 2002.

3. R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2001.

4. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.

5. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, The MIT Press, 1996.

6. U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001.

7. D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001.

8. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001.

9. T. M. Mitchell, Machine Learning, McGraw Hill, 1997.

10. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2006. ISBN: 0-321-32136-7

11. S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998.

12. I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, 2005. ISBN: 0-12-088407-0