Data Mining

Rakesh Agrawal
IBM Almaden Research Center
San Jose, CA 95120, U.S.A.
ragrawal@almaden.ibm.com


Abstract

Data mining is the efficient discovery of previously unknown patterns in large databases, and is emerging as a major application area for databases. Many data mining problems have been motivated by the practical decision support problems faced by most large retail organizations.

The Quest project on data mining at the IBM Almaden Research Center has developed innovative technology to discover useful patterns in gigabytes of data in a short amount of time. This software can be used to solve the following customer problems:

The Quest software has been tested on several customer datasets. Several customers are actively developing various applications of Quest in retail, finance, and other industries. The software currently runs on RS/6000 workstations under AIX on flat files, DB2/CS, and DB2/MVS databases. Parallel algorithms for finding association rules also run on the IBM Power Parallel System. The software has been designed to be easily portable to other platforms and to run on multiple data repositories.

In this tutorial, I will draw upon my Quest experience to present my perspective of data mining, describe current work, and present some open problems.

 
 

Bibliography

Rakesh Agrawal, Tomasz Imielinski, and Arun Swami, ``Database Mining: A Performance Perspective'', IEEE Transactions on Knowledge and Data Engineering, Special Issue on Learning and Discovery in Knowledge-Based Databases, December 1993.

Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, and Arun Swami, ``An Interval Classifier for Database Mining Applications'', VLDB-92, Vancouver, British Columbia, Canada, 1992, pp. 560-573.

Rakesh Agrawal, Tomasz Imielinski and Arun Swami, ``Mining Association Rules between Sets of Items in Large Databases'', SIGMOD-93, Washington D.C., May 1993.

R. Agrawal, C. Faloutsos, and A. Swami, ``Efficient Similarity Search in Sequence Databases'', 4th Int'l Conf. on Foundations of Data Organization and Algorithms (FODO), Chicago, Oct. 1993.

Rakesh Agrawal and Ramakrishnan Srikant, ``Fast Algorithms for Mining Association Rules in Large Databases'', VLDB-94, Santiago, Chile, Sept. 1994. Expanded version available as IBM Research Report RJ9839, June 1994.

Rakesh Agrawal and Ramakrishnan Srikant, ``Mining Sequential Patterns'', 11th Int'l Conf. on Data Engineering, Taipei, Taiwan, March 1995.

R. Agrawal, G. Psaila, E.L. Wimmers, and M. Zait, ``Querying Shapes of Histories'', VLDB-95, Zurich, Switzerland, Sept. 1995.

R. Agrawal, K.I. Lin, H.S. Sawhney, and K. Shim, ``Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases'', VLDB-95, Zurich, Switzerland, Sept. 1995.

R. Srikant and R. Agrawal, ``Mining Generalized Association Rules'', VLDB-95, Zurich, Switzerland, Sept. 1995.

R. Agrawal and G. Psaila, ``Active Data Mining'', 1st Int'l Conf. on Knowledge Discovery and Data Mining (KDD-95), Montreal, August 1995.

M. Mehta, J. Rissanen, and R. Agrawal, ``MDL-based Decision Tree Pruning'', 1st Int'l Conf. on Knowledge Discovery and Data Mining (KDD-95), Montreal, August 1995.
 
 


 



Integrating Information Retrieval
with Database Systems

W. Bruce Croft
NSF Center for Intelligent Information Retrieval
Computer Science Department
University of Massachusetts, Amherst


Abstract

The number of possible applications that involve both structured and unstructured data has increased enormously in recent years. The importance of text databases, in particular, has meant that there is significant interest in integrated platforms that combine the modeling and management capabilities of a database system with the effective retrieval algorithms and efficient storage techniques used in information retrieval systems.

In this course, I will give an overview of the important functionality provided by an information retrieval system, and then discuss the issues and techniques involved in producing an integrated system. The course will emphasize support for the full range of functionality required in a text-based system, including retrieval, routing, filtering, distribution, feedback, interfaces, and browsing. Areas such as probabilistic retrieval models, persistent object management, indexing, query optimization, and query languages will be covered in detail.



 



Spatial Database Systems

Ralf Hartmut Gueting
University of Hagen
Germany


Abstract

The purpose of a spatial database system is the management of data about some space in two or more dimensions. Spatial data management imposes a number of requirements on database systems. Concepts are needed to describe the shape and position of objects in space as well as the relationships between them. This need can be met by introducing spatial data types (algebras) that provide types such as point, line, or region in 2D. To efficiently access objects in a particular region of the space, specialized spatial access methods are needed. To efficiently follow spatial relationships between objects, one needs spatial join methods. The user interface must support graphical representation as well as the overlay of different query results. Since extensions are needed at all levels of the system architecture, extensible database systems are the prime architectural framework for the implementation of spatial database systems.
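
As a rough illustration of the requirements listed above, the following Python sketch defines two toy spatial data types and the two operations the abstract mentions: access to the objects inside a region, and a join on a spatial relationship. It is not taken from the tutorial. The names `Point`, `Rect`, `window_query`, and `spatial_join` are hypothetical, the region type is simplified to an axis-aligned rectangle, and both operations use plain scans, which is exactly the cost that specialized spatial access methods and spatial join methods are meant to avoid.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Point:
    """A 2D point value (hypothetical spatial data type)."""
    x: float
    y: float


@dataclass(frozen=True)
class Rect:
    """An axis-aligned rectangle standing in for a general 2D region."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, p: Point) -> bool:
        # True if the point lies inside the rectangle or on its boundary.
        return self.xmin <= p.x <= self.xmax and self.ymin <= p.y <= self.ymax

    def intersects(self, other: "Rect") -> bool:
        # Two rectangles intersect if their extents overlap on both axes.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


def window_query(points: Iterable[Point], window: Rect) -> List[Point]:
    """Return the points inside the window, using a linear scan (no index)."""
    return [p for p in points if window.contains(p)]


def spatial_join(left: Iterable[Rect], right: Iterable[Rect]) -> List[Tuple[Rect, Rect]]:
    """Return all intersecting pairs, using a naive nested-loop join."""
    right_list = list(right)
    return [(a, b) for a in left for b in right_list if a.intersects(b)]


if __name__ == "__main__":
    cities = [Point(1, 1), Point(5, 5), Point(9, 2)]
    print(window_query(cities, Rect(0, 0, 6, 6)))

    parcels = [Rect(0, 0, 2, 2), Rect(5, 5, 7, 7)]
    zones = [Rect(1, 1, 3, 3), Rect(8, 8, 9, 9)]
    print(spatial_join(parcels, zones))
```

The scan-based versions cost time proportional to the number of objects (or to the product of the two input sizes for the join), which is why an index or a dedicated join algorithm matters once the data grows.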

The tutorial aims at giving a coherent picture of the main research results obtained so far in the areas of modeling, querying, data structures and algorithms for system implementation, and system architecture.
 
 


 



Databases for telecommunications

H.V. Jagadish
AT&T Bell Laboratories


Abstract

The communications industry is on the brink of a revolution, as evidenced by almost daily newspaper articles on cable TV, wireless phones, and ``information highways''. Just as CAD/CASE applications had a profound effect on the database community in the 80s, network applications are likely to have one in the 90s. This tutorial explores the crucial role that databases play in telecommunications networks.

Wide area communications networks have been around for a long time, and ad hoc, application-specific solutions have been adopted for data management. With the rapid changes the communications industry is currently undergoing, and the exponential growth in traffic volume, such ``hard-wired'' solutions are no longer acceptable, and generic database software is desired. However, traditional databases do not provide all the features required in networks, as we will discuss.

There are three major layers at which it is relevant to consider database needs in a large network: network operation, network management, and network services. Database requirements differ among the three layers, and we will consider each in turn.
 
 


 



Querying and visualizing the Internet

Alberto O. Mendelzon
University of Toronto 


Abstract

The most significant recent developments enabling the creation of global information resources are the incredible growth of the Internet and the success of the World Wide Web.

Everyone who has used a WWW browser such as Mosaic or Netscape knows the frustrations of trying to find information that is definitely out there somewhere, if we only knew where. The ``lost in hyperspace'' syndrome, well known from the early implementations of hypertext, has become more severe with the enormously larger scale and lack of any coherent structure in distributed environments like the Web.

We will survey two approaches to this problem:

  1. tools for visualizing and keeping track of sub-networks of interest; and
  2. query interfaces that support declarative requests for information, and evaluation mechanisms to fill these requests efficiently.


 



Data Models in Spatial Databases

Jan Paredaens
Universiteit Antwerpen


Abstract

Over the last few decades, the number of computer applications that use two- or three-dimensional spatial information has steadily increased. Various spatial database models have been proposed and developed to support these applications. The main problem these models face is that they have to combine finite thematic information with infinite geometric information in a way that is not only elegant and sound but also allows an efficient implementation.

We will discuss a number of such data models and we will show that the development of a solid theory for spatial databases depends on a variety of disciplines: database system theory, geography, computational geometry and topology.

We will then focus on two particular data models: the linear model, in which exact spatial information is available, and the topological data model, in which only the relative positions of spatial objects are considered.

This course has been prepared and will be presented in cooperation with two Ph.D. students, Bart Kuijpers and Luc Vandeurzen.

 
 


 



Persistent polymorphic systems

Joachim W. Schmidt and Florian Matthes
Universitaet Hamburg


Abstract

Recent progress in database technology has significantly improved the process of constructing and evolving large-scale, long-lived applications.

In this tutorial, Persistent Polymorphic Systems are presented as a generalization of DBMSs which offer substantially improved generic services like orthogonal persistence, bulk data storage, iteration abstraction, and multi-user access. The generic services of Persistent Polymorphic Systems can be instantiated and customized to specific applications by high-level declarations and statements.

Moreover, database applications make heavy use of additional generic services like data visualization, behavior modeling, network communication, or workflow management, which are offered by external servers such as GUI toolkits, fourth-generation languages, distributed object managers, or workflow tools. Therefore, the future success of database environments depends crucially on their ability to be integrated smoothly, and at a high level of abstraction, into a larger heterogeneous information infrastructure.

We outline recent progress towards Persistent Polymorphic Systems along the lines of

The tutorial concludes by summarizing recent advances towards the vision of scalable Persistent Polymorphic Systems based on the orthogonal persistence of data, code and threads, on polymorphic type systems for generalized bulk data types and on models for code, data and thread migration in distributed environments.
