Dr. Gary D. Boetticher

 

CSCI 5833 -- Data Mining Tools and Techniques

STAT 5931 -- Research Topics in Statistics
Updated January 16, 2024

 

Office and Addresses

Delta 171 Phone 281.283.3805
email: boetticher@uhcl.edu
Secretary: Ms. Caroline Johnson, Delta 161 281.283.3860

Class Hours (Face-to-Face or Online)

Thursday 10:00 - 12:50, Room: Delta 237, or via Zoom (if necessary)
Zoom information may be found in the Google folder for this course.

Office Hours

Wed 12 - 4 PM, Thurs 9 - 10 AM, or by appointment. Students with appointments have priority over those without. If the suite door is locked, call my extension (x3805) using the phone in the hallway. A Zoom session is also possible.

Teaching Assistant

Mr. Naga Sai Venkatesh Perumalla
Email: perumallan1637@uhcl.edu

TA Hours: Monday 7 - 10 PM; Tuesday 10 AM - 3 PM and 7 - 10 PM; Wednesday 7 - 10 PM

 

    

If you are not willing to learn, no one can help you.

If you are determined to learn, no one can stop you. -- Zig Ziglar

 

Course Description

Data mining has emerged as one of the most exciting and dynamic fields in computer science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, after only a few years, the data mining software market alone is expected to exceed $10 billion by the end of this year.

Although the theoretical underpinnings of data mining have existed for a while (e.g., pattern recognition, statistics, data analysis, and machine learning), the practice and use of these techniques have been largely ad hoc. With the availability of large databases to store, manage, and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence, and algorithms that efficiently analyze data. Data mining seeks to detect 'interesting' and significant nuggets of relationships and knowledge buried within data. It seeks to discover association rules, episode rules, sequential rules, etc., and it is concerned with efficient data structures and algorithms for data examination that scale well.

There have been several success stories in this relatively young area: the SKICAT system for automatic cataloguing of sky surveys (JPL), the Advanced Scout system for mining NBA data (IBM), the QuakeFinder system for geoscientific data mining (UCLA/JPL) and the PYTHIA system for mining information from performance evaluation of scientific software (Purdue). Case studies from various domains (financial, bioinformatics, etc.) will be presented.

The traditional graduate student load is 3 courses. Be prepared to commit 15 to 20 hours per week to this course!

Course Goals

 

By the end of the course, you will:

  • Understand the data mining process.

  • Have a working knowledge of different data mining tools and techniques.  

  • Have an understanding of various Machine Learners (ML).

  • Have a working knowledge of some of the more significant current research in the area of data mining and ML.

  • Be aware of various data repositories for the study of data mining.

  • Be able to effectively apply a number of data mining algorithms (e.g., neural networks, genetic algorithms) to solve data mining problems from various problem domains including Financial and Bioinformatics.

  • Be familiar with several successful applications of data mining.

Prerequisites

A course in artificial intelligence, machine learning, pattern recognition, algorithms, or statistics would be helpful, but is not required. Programming experience (or at least one course) in C, C++, C#, Delphi, Java, Pascal, or VB (using Visual Studio) is required. If you do not meet the prerequisites, then you need to drop this course!

Methodology

Lecture, seminar, case studies, and interactive problem solving.

Appraisal:

 Assignments:  15% of the total
 Quizzes and Participation:  5% of the total
 Midterm:  40% of the total
 Final:  40% of the total

Grades will be based solely on criteria listed above. No other factors will be considered.

Grading Scale

    93+ = A; 90+ = A-; 87+ = B+; 83+ = B; 80+ = B-;

    77+ = C+; 73+ = C; 70+ = C-; 67+ = D+; 63+ = D; 60+ = D-; 0+ = F
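
As a rough illustration of how the Appraisal weights and the Grading Scale combine, the following Python sketch computes an overall percentage and maps it to a letter. It is not an official grade calculator: the function and variable names are made up for this example, every cutoff is assumed to be an inclusive lower bound, and the syllabus does not state how fractional scores are rounded, so no rounding is applied before the lookup.

```python
# Component weights taken from the Appraisal section (fractions of the total).
WEIGHTS = {
    "assignments": 0.15,
    "quizzes_participation": 0.05,
    "midterm": 0.40,
    "final": 0.40,
}

# Letter cutoffs taken from the Grading Scale, highest first.
# Assumption: each cutoff is an inclusive lower bound.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"), (60, "D-"),
    (0, "F"),
]


def weighted_score(scores):
    """Combine per-component percentages (0-100) into one overall percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Return the first letter whose cutoff the score meets or exceeds."""
    for cutoff, letter in CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


# Hypothetical component scores, for illustration only.
example = {
    "assignments": 95,
    "quizzes_participation": 100,
    "midterm": 84,
    "final": 88,
}
overall = weighted_score(example)
print(round(overall, 2), letter_grade(overall))
```

Because the midterm and final together carry 80% of the total, a strong assignment average moves the overall percentage far less than either exam does.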

My motto:

Foster disciplined, altruistic passion.

Required Textbook

     None.

Reference Books

1. Aggarwal, Charu C. Data mining: the textbook. Springer, 2015.

2. Berry and Linoff, Data Mining Techniques, Wiley, 2000.

3. Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011.

4. Mitchell, Machine Learning, McGraw-Hill, Boston, 1997.

5. Witten, Ian H., and Eibe Frank. "Data mining: practical machine learning tools and techniques with Java implementations." ACM Sigmod Record 31.1 (2002): 76-77.

Other Reference Materials

Conferences, Journals, and Organizations

Data Resources

 

Bioinformatic and biological databases:  

Santa Fe dataset  

 

Data Mining Software

  • R version 4.0.3 (Windows): A programming language and software environment for statistical computing and graphics. Third-party packages support machine learners. It is available on the Google Drive.
  • RStudio (Desktop version 1.3.1093): GUI for R. It is available on the Google Drive.
  • Python 3.9.1: Available for various platforms. The link is for a Windows environment.
  • WEKA 3.9: Waikato Environment for Knowledge Analysis, contains many different learners. Versions 3.6.8 through 3.9 are available in the Google Folder. It is also installed in the NT lab in the Delta building.
  • HeuristicLab 3.3.16: A framework for heuristic and evolutionary algorithms that has been developed by members of the Heuristic and Evolutionary Algorithms Laboratory (HEAL) since 2002.
  • GDB Net: A Backpropagation Neural Network Program. It is available on the Google Drive.
  • GDB GP: A Genetic Program Software Package. It is available on the Google Drive.
  • RapidMiner Studio: With RapidMiner Studio, you can access, load and analyze any type of data – both traditional structured data and unstructured data like text, images, and media. It can also extract information from these types of data and transform unstructured data into structured.
  • Orange: You will fall in love with this tool’s visual programming and Python scripting. It also has components for machine learning, add-ons for bioinformatics and text mining. It’s packed with features for data analytics.
  • KNIME: Data preprocessing has three main components:  extraction, transformation and loading. KNIME does all three. It gives you a graphical user interface to allow for the assembly of nodes for data processing. It is an open source data analytics, reporting and integration platform. KNIME also integrates various components for machine learning and data mining through its modular data pipelining concept and has caught the eye of business intelligence and financial data analysis.

    Written in Java and based on Eclipse, KNIME is easy to extend and to add plugins. Additional functionalities can be added on the go. Plenty of data integration modules are already included in the core version.

  • NLTK: When it comes to language processing tasks, nothing can beat NLTK. NLTK provides a pool of language processing tools including data mining, machine learning, data scraping, sentiment analysis, and various other language processing tasks. All you need to do is install NLTK, pull a package for your favorite task, and you are ready to go. Because it’s written in Python, you can build applications on top of it, customizing it for small tasks.
  • Software for Data Mining: List of software and tools maintained by KDnuggets.
  • IBM Data Warehouse Edition: Leading-edge DM software.
  • Teradata Warehouse Miner
  • Oracle Data Mining: FAQ on Oracle 12c data mining tools.
  • Microsoft Data Analytics
  • SPSS Data Mining
  • Data Mining Products: A list of companies and their products in the area of data mining.
  • 43 Top Data Mining Software Programs
  • Six of the Best Open Source Data Mining Tools
  • Free Data Mining Tools
  • List of data mining and learning analytics tools

 

Schedule (Tentative)

Jan 18 – Course overview, What is Data Mining? The Data Mining Process

************************************************************************

***   All course materials are located in the Google Drive folder.   ***

***   You are expected to bring a copy of the notes to all lectures. ***

***   I strongly recommend you place the notes in a 3-ring binder.   ***

************************************************************************

    

Assign Homework 1

Point value: 100 points

Due date:  Thursday, February 8th, 10:00 AM

 

FOR THIS WEEK (IF NOT SOONER)       

            Blue Color = Available on the Google Drive

                It is the student's responsibility to download the notes, print the notes, and bring them to class.

·   Read:  Syllabus

·    Read documents in:  WK01 Notes - What is Data Mining and the DM Process.zip

FOR NEXT WEEK (IF NOT SOONER)  

·   Read documents in:  WK02 Notes - The Data in Data Mining.zip

 

Jan 25 – The Data in Data Mining, General Types of Data Mining Problems

  

Assign Homework 2

Point value: 100 points

Due date:  Thursday, February 22nd, 10:00 AM

 

FOR NEXT WEEK (IF NOT SOONER)

·   Read documents in:  WK03 Notes - The Experimental Process.zip

 

Feb 01 – The Experimental Process

FOR NEXT WEEK (IF NOT SOONER)

·   Read documents in:  WK03 Notes - Data Preprocessing - Introduction.zip

 

Feb 08 –  Data Preprocessing - Introduction

 

Assignment 1 is due.

FOR NEXT WEEK (IF NOT SOONER)

·   Read documents in:  WK05 Notes - Data Preprocessing - Attribute and Dimension Reduction.zip 

 

Feb 15 – Data Preprocessing - Attribute and Dimension Reduction

 

Assign Homework 3

Point value: 100 points

Due date:  Thursday, March 7th, 10:00 AM

 

FOR NEXT WEEK (IF NOT SOONER)

·   Read documents in: WK06 and WK07 Notes - Decision Trees.zip

·   Bring a laptop with R, Python, and WEKA installed

 

Feb 22 –  Decision Trees

 

Assignment 2 is due.

FOR NEXT WEEK (IF NOT SOONER)

·   Read: WK07 Notes - Data Mining for Very Busy People.PDF

·   Bring a laptop with R, Python, and WEKA installed

 

Feb 29 – Decision trees continued, Midterm Review

FOR NEXT WEEK (IF NOT SOONER)  

·  Submit:  Midterm questions by Wednesday, March 6th, 7 PM.

                Use the template found on the Google Drive

                Strip out any identifying information (Your name, Student ID number)

                Specify whether you want me to post your questions on the Google Drive.

·   Study!

 

Mar 07 – Midterm

        

Assignment 3 is due.

 

FOR NEXT WEEK (IF NOT SOONER)

·   Read documents in: WK08 Notes - Clustering, Instance Based Learning, SVM, Ensemble Learning.zip

·   Review the following K-Means applet:

      http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

 ·   Read:  No assigned readings for next week

 ·   Review the following SVM applet:

      http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml

 ·   Bring a laptop with R, Python, and WEKA installed

 

Mar 14 ************ Spring Break *************

 

 

Mar 21 – Ensemble Learning, Random Forests, Clustering, Instance-Based Learning, SVM. Part 1

 

 

Mar 28 – Ensemble Learning, Random Forests, Clustering, Instance-Based Learning, SVM. Part 2

 

FOR NEXT WEEK (IF NOT SOONER)

·  Review the following tutorial on Genetic Algorithms:

            http://www.obitko.com/tutorials/genetic-algorithms/

·   Read documents in: WK10 Notes - Genetic Algorithms.zip

Apr 04 – Genetic Algorithms, Genetic Programs

 

Assign Homework 4

Point value: 100 points

Due date:  Thursday, April 18th, 10:00 AM

 

FOR NEXT WEEK (IF NOT SOONER)

·   Review one (or more) of the following online neural network tutorials:

NN Tutorial 1

NN Tutorial 2

NN Tutorial 3

·   Read documents in: WK11 Notes and Papers - Neural Networks.zip

·   Run:    WK11 R Example - Self Organizing Map.R

 

·   Download, install, and try GDB_Net (a neural network software package).

·   Try out this Kohonen Self-Organizing Map applet

                 Click here for the zipped code of this applet.

 

******** April 9 – Last day to withdraw ********

 

Apr 11 – Neural Networks including Perceptron, Backprop., Self-Organizing Map (SOM)

 

Assign Homework 5

Point value: 100 points

Due date:  Sunday, April 28th, 4:00 PM

 

FOR NEXT WEEK (IF NOT SOONER)  

·   Read:  WK12 Notes - Neural Networks - CNNs and GANs

·   Read:  WK12 NotesB - The Rectified Linear Unit (ReLU) Activation Function

·   Read:  WK12 NotesC - A Gentle Introduction to the Progressive Growing GAN

·   Read:  WK12 NotesD - A Gentle Introduction to StyleGAN the Style Generative Adversarial Network 

 

Apr 18 – Neural Networks - Convolutional Neural Networks,  Generative Adversarial Networks (GANs)

       

Assignment 4 is due.

FOR NEXT WEEK (IF NOT SOONER)  

·   Read:  WK12 Notes - Evaluating Results.pdf

Apr 25 – Evaluating Data Mining results, Final Exam Review

FOR NEXT WEEK (IF NOT SOONER)

·   Submit:   Final questions by Wednesday, May 1st, 7 PM. This is optional.

                Use the template found on the Google Drive

                Strip out any identifying information (Your name, Student ID number)

                Specify whether you want me to post your questions on the Google Drive.

 

·   Study!

 

Apr 28 – Assignment 5 is due (Sunday 4 PM).

 

May 02 - Final Exam

 

Other Policies

This class has 6 simple rules:

 1) Be respectful of others. (3% Penalty/Cellphone or text infraction)

 2) Be very passionate about your learning and do your best.

 3) Be fearless - ask lots of questions in class.

 4) Don't be late on anything. (10% Penalty/Day)

 5) Don't ever cheat.

 6) Have fun!

 

Miscellaneous

  • Any person with a disability who requires a special accommodation should inform me and contact the Disability services office or call 281 283 2627 as soon as possible.

  • You are expected to come fully prepared to every class!

  • If there is any religious observance that may interfere with any scheduled exam, homework due date, or attending class, please notify me of the situation during the first 2 weeks of class so that adjustments can be made at that time.

  • Please turn off all cell phones and pagers prior to the start of class.

  • I am willing to provide letters of recommendation/references only if you have attained an 'A' in one of my classes, or an 'A-' in two of my classes.

  • I highly recommend that you seek out your advisor and complete your Candidate Plan of Study (CPS) as soon as possible. I am normally not available for advising during the summer months.




2700 Bay Area Boulevard
Delta Building. Office 171
Houston, Texas 77058
Voice: 281-283-3805
Fax: 281-283-3869
boetticher@uhcl.edu


© 2002-2024 Boetticher: Data Mining Course, All Rights Reserved.
