CSCI 5833 -- Data Mining Tools and Techniques
STAT 5931 -- Research Topics in Statistics
Office and Addresses: Delta 171
Phone: 281.283.3805
Class Hours (Face-to-Face or Online): Thursday 10:00 - 12:50, Room: Delta 237, or via Zoom (if necessary)
Office Hours: Wed 12 - 4 PM, Thurs 9 - 10 AM, or by appointment. Students with appointments have priority over those without. If the suite door is locked, call my extension (x3805) using the phone in the hallway. A Zoom session is also possible.
Teaching Assistant: Mr. Naga Sai Venkatesh Perumalla
TA Hours: Monday 7 - 10 PM; Tuesday 10 AM - 3 PM and 7 - 10 PM; Wednesday 7 - 10 PM
If you are not willing to learn, no one can help you. If you are determined to learn, no one can stop you. -- Zig Ziglar
Course Description
Data mining has emerged as one of the most exciting and dynamic fields in computer science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the software market for data mining alone has grown to the point where it is expected to exceed $10 billion by the end of this year. Although the theoretical underpinnings of data mining have existed for a while (e.g., pattern recognition, statistics, data analysis, and machine learning), the practice and use of these techniques have been largely ad hoc. With the availability of large databases to store, manage, and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence, and algorithms that efficiently analyze data. Data mining seeks to detect "interesting" and significant nuggets of relationships/knowledge buried within data. It seeks to discover association rules, episode rules, sequential rules, etc., and it is concerned with efficient data structures and algorithms for data examination that possess good scaling properties.
There have been several success stories in this relatively young area: the SKICAT system for automatic cataloguing of sky surveys (JPL), the Advanced Scout system for mining NBA data (IBM), the QuakeFinder system for geoscientific data mining (UCLA/JPL), and the PYTHIA system for mining information from performance evaluation of scientific software (Purdue). Case studies from various domains (financial, bioinformatics, etc.) will be presented.
The traditional graduate student load is 3 courses. Be prepared to commit 15 to 20 hours per week to this course!
Course Goals
By the end of the course, you will
Prerequisites
A course in artificial intelligence, machine learning, pattern recognition, algorithms, or statistics would be helpful, but is not required. Programming experience (or at least one course) in C, C++, C#, Delphi, Java, PASCAL, or VB (using Visual Studio) is required. If you do not meet the prerequisites, then you need to drop this course!
Methodology
Lecture, seminar, case studies, and interactive problem solving.
Appraisal:
Grades will be based solely on criteria listed above. No other factors will be considered.
Grading Scale
93+ = A; 90+ = A-; 87+ = B+; 83+ = B; 80+ = B-; 77+ = C+; 73+ = C; 70+ = C-; 67+ = D+; 63+ = D; 60+ = D-; 0+ = F
My motto: Foster disciplined, altruistic passion.
Required Textbook
None.
Reference Books
1. Aggarwal, Charu C. Data mining: the textbook. Springer, 2015.
2. Berry and Linoff, Data Mining Techniques, Wiley, 2000.
3. Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011.
4. Mitchell, Machine Learning, McGraw-Hill, Boston, 1997.
5. Witten, Ian H., and Eibe Frank. "Data mining: practical machine learning tools and techniques with Java implementations." ACM Sigmod Record 31.1 (2002): 76-77.
Other Reference Materials
Conferences, Journals, and Organizations
Data Resources
Bioinformatic and biological databases:
Santa Fe dataset
Data Mining Software
Schedule (Tentative)
Jan 18
*** All course materials are located in the Google Drive folder. ***
*** I strongly recommend you place the notes in a 3-ring binder. ***
Assign Homework 1
Point value: 100 points
Due date: Thursday, February 8th, 10:00 AM
FOR THIS WEEK (IF NOT SOONER)
Blue Color = Available on the Google Drive
It is the student's responsibility to download the notes, print the notes, and bring them to class.
· Read: Syllabus
· Read documents in: WK01 Notes - What is Data Mining and the DM Process.zip
FOR NEXT WEEK (IF NOT SOONER)
· Read documents in: WK02 Notes - The Data in Data Mining.zip
Jan
Assign Homework 2
Point value: 100 points
FOR NEXT WEEK (IF NOT SOONER)
Feb
FOR NEXT WEEK (IF NOT SOONER)
Feb
Assignment 1 is due.
FOR NEXT WEEK (IF NOT SOONER)
· Read documents in: WK05 Notes - Data Preprocessing - Attribute and Dimension Reduction.zip
Feb 15
Assign Homework 3
Point value: 100 points
Due date:
FOR NEXT WEEK (IF NOT SOONER)
Feb 22 Decision Trees
FOR NEXT WEEK (IF NOT SOONER)
Feb 29
FOR NEXT WEEK (IF NOT SOONER)
· Submit: Midterm questions by Wednesday, March 6th, 7 PM. Use the template found on the Google Drive. Strip out any identifying information (your name, student ID number). Specify whether you want me to post your questions on the Google Drive.
· Study!
Mar 07 Midterm
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
· Read: No assigned readings for next week
· Review: The following SVM applet: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
Mar 14 ************ Spring Break *************
Mar 21 Ensemble Learning, Random Forests, Clustering, Instance-Based Learning, SVM. Part 1
Mar 28 Ensemble Learning, Random Forests, Clustering, Instance-Based Learning, SVM. Part 2
· Review: The following tutorial on Genetic Algorithms: http://www.obitko.com/tutorials/genetic-algorithms/
Apr 04 Genetic Algorithms, Genetic Programs
Point value: 100 points
Due date:
FOR NEXT WEEK (IF NOT SOONER)
· Review: One (or more) of the following online neural network tutorials:
· Run: WK11 R Example - Self Organizing Map.R
· Review: Download, install, and try GDB_Net (a neural network software package)
· Review: Try out this Kohonen Self-Organizing Map applet. Click here for the zipped code of this applet.
Apr 11
Point value: 100 points
Due date: Sunday, April 28th, 4:00 PM
· Read:
· Read:
· Read:
· Read:
W
Apr 18
· Read: WK12 Notes - Evaluating Results.pdf
Apr 25
· Submit: Final questions by Tuesday, May 1st, 7 PM. This is optional. Use the template found on the Google Drive. Strip out any identifying information (your name, student ID number). Specify whether you want me to post your questions on the Google Drive.
· Study!
Apr 28
May 02 - Final Exam
Other Policies
This class has 6 simple rules:
1) Be respectful of others. (3% Penalty/Cellphone or text infraction)
2) Be very passionate about your learning and do your best.
3) Be fearless - ask lots of questions in class.
4) Don't be late on anything. (10% Penalty/Day)
5) Don't ever cheat.
6) Have fun!
Miscellaneous
© 2002-2024 Boetticher: Data Mining Course, All Rights Reserved.