CSCI 5833 -- Data Mining Tools and Techniques
STAT 5931 -- Research Topics in Statistics
Office and Addresses
Delta 171
Phone 281.283.3805
Class Hours (Face-to-Face or Online)
This class is a hybrid class. The only face-to-face sessions are the orientation, midterm and final.
Orientation: Monday, June 3 at noon in Delta 241
Office Hours
Email me to set up an appointment or to arrange a Zoom session. My Zoom details can be found in the Google folder.
Teaching Assistant
Ms. Srutha Dasjou
June: Monday 5 to 9 PM; Tuesday 5 to 9 PM; Wednesday 9 to Noon; Thursday 9 to Noon
July: Monday 5 to 9 PM; Tuesday 5 to 9 PM; Wednesday 9 to Noon; Thursday 9 to Noon
Course Description
Data mining has emerged as one of the most exciting and dynamic fields in computer science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the data mining software market alone is expected to exceed $10 billion by the end of this year. Although the theoretical underpinnings of data mining have existed for a while (e.g., pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. Data mining seeks to detect 'interesting' and significant nuggets of relationships/knowledge buried within data. It seeks to discover association rules, episode rules, sequential rules, etc., and it is concerned with efficient data structures and algorithms for data examination that scale well.
There have been several success stories in this relatively young area: the SKICAT system for automatic cataloguing of sky surveys (JPL), the Advanced Scout system for mining NBA data (IBM), the QuakeFinder system for geoscientific data mining (UCLA/JPL) and the PYTHIA system for mining information from performance evaluation of scientific software (Purdue). Case studies from various domains (financial, bioinformatics, etc.) will be presented.
In a 15-week semester (Fall, Spring) you are expected to commit 15 to 20 hours per week to this course!
Course Goals
By the end of the course, you will
Prerequisites
A course in artificial intelligence, machine learning, pattern recognition, algorithms, or statistics would be helpful, but is not required. Programming experience (or at least one course) in C, C++, C#, Delphi, Java, PASCAL, or VB (using Visual Studio) is required.
If you do not meet the prerequisites, then you need to drop this course!
Methodology
Lecture, seminar, case studies, and interactive problem solving.
Appraisal:
Grades will be based solely on criteria listed above. No other factors will be considered.
Grading Scale
93+ = A; 90+ = A-; 87+ = B+; 83+ = B; 80+ = B-; 77+ = C+; 73+ = C; 70+ = C-; 67+ = D+; 63+ = D; 60+ = D-; 0+ = F
My motto: Foster disciplined, altruistic passion.
Required Textbook
None.
Reference Books
1. Aggarwal, Charu C. Data mining: the textbook. Springer, 2015.
2. Berry and Linoff, Data Mining Techniques, Wiley, 2000.
3. Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011.
4. Mitchell, Machine Learning, McGraw-Hill, Boston, 1997.
5. Witten, Ian H., and Eibe Frank. "Data mining: practical machine learning tools and techniques with Java implementations." ACM Sigmod Record 31.1 (2002): 76-77.
Other Reference Materials
Conferences, Journals, and Organizations
Data Resources
Bioinformatic and biological databases:
Santa Fe dataset
Data Mining Software
Schedule (Tentative)
Jun 03
************************************************************************
*** All course materials are located in the Google Drive folder. ***
*** Send me your Gmail address and I will add it to the Google Drive folder. ***
Assign Homework 1
Point value: 100 points
Due date: Thursday, June 13th, 5 PM
FOR THIS WEEK (IF NOT SOONER)
Blue Color = Available in the Google Folder
It is the student's responsibility to download and print the notes.
· Read: Syllabus
· Read documents in: WK00 Notes - Orientation Data Mining - 20240507.pdf
· Read documents in: WK01 Notes - What is Data Mining and the DM Process.zip
FOR NEXT WEEK (IF NOT SOONER)
· Read documents in: WK02 Notes - The Data in Data Mining.zip
Jun 05
Assign Homework 2
Point value: 100 points
FOR NEXT WEEK (IF NOT SOONER)
Jun 10
FOR NEXT WEEK (IF NOT SOONER)
Jun 12
FOR NEXT WEEK (IF NOT SOONER) · Read documents in:
J
FOR NEXT WEEK (IF NOT SOONER)
FOR NEXT WEEK (IF NOT SOONER)
FOR NEXT WEEK (IF NOT SOONER)
· Submit: Midterm questions by Monday, July 1st, 2 PM or sooner
  Use the template found on the Google Drive
  Leave out any identifying information (Your name, Student ID number)
  Specify whether you want me to share your questions with your fellow students.
· Study!
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
· Read: No assigned readings for next week
· Review: The following SVM applet: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
Background reading (not required)
Aha, D., Kibler, D., Albert, M., Instance-Based Learning Algorithms, Machine Learning, Kluwer Publishers, 6, 1991, Pp. 37-66.
Jul
Assign Homework 3
Point value: 100 points
Due date: Thursday, July 11th, 5 PM
· Review: The following tutorial on Genetic Algorithms: http://www.obitko.com/tutorials/genetic-algorithms/
Background reading (not required)
Koza, John, Genetic Programming, Dept. of CS, Stanford University, 1997, Pp. 1-26.
Koza, John, Future Work and Practical Applications of Genetic Programming, Handbook of Evolutionary Computation, June, 1996, Pp. 1-7.
Koza, John, Riccardo Poli, A Genetic Programming Tutorial, Stanford University.
Mitchell, Tom M., Machine Learning and Data Mining, Communications of the ACM, 42(11), November 1999.
Whitley, Darrell, A Genetic Algorithm Tutorial, TR CS-93-103, Dept. of CS, Colorado State University, Pp. 1-38.
Point value: 100 points
Due date: Tuesday, July 23rd, 5 PM
FOR NEXT WEEK (IF NOT SOONER)
· Review: One (or more) of the following online neural network tutorials:
· Run: WK12 R Example - Self Organizing Maps
· Review: Download, install, and try GDB_Net (a neural network software package)
· Review: Try out this Kohonen Self-Organizing Map applet. The zipped code of this applet is also available.
Background reading (not required)
Gerstner, Wulfram, Supervised Learning for Neural Networks: A Tutorial with Java exercises. The corresponding Java applets for this tutorial are available at: http://diwww.epfl.ch/mantra/tutorial/english/
Jul 22 -
Background reading (not required)
Anand, Sarbjot, et al., The Role of Domain Knowledge in Data Mining, CIKM, Baltimore, Maryland, 1995.
Clark, Glymour, et al., Statistical Inference and Data Mining, Communications of the ACM, 39(11), November 1996.
Elder, John, et al., A Statistical Perspective on Knowledge Discovery in Databases
Friedman, Jerome H., Data Mining and Statistics: What's the Connection?, Department of Statistics, Stanford
· Submit: Final questions by Monday, Jul 22nd, 2 PM or earlier. This is optional.
  Use the template found on the Google Drive
  Strip out any identifying information (Your name, Student ID number)
  Specify whether you want me to share your questions with your fellow students.
· Study!
Jul 23
Jul 24 Final Exam - Noon Central Time
Other Policies
This class has 6 simple rules:
1) Be respectful of others.
2) Be very passionate about your learning and do your best.
3) Be fearless - ask lots of questions in class.
4) Don't be late on anything.
5) Don't ever cheat.
6) Have fun!
Miscellaneous
© 2002-2024 Boetticher: Data Mining Course, All Rights Reserved.