Efficient decision tree construction on streaming data

Ruoming Jin; Gagan Agrawal

doi:10.1145/956750.956821

Research output: Contribution to conference › Paper › peer-review

126 Scopus citations

Abstract

Decision tree construction is a well-studied problem in data mining. Recently, there has been much interest in mining streaming data. Domingos and Hulten have presented a one-pass algorithm for decision tree construction. Their work uses the Hoeffding inequality to achieve a probabilistic bound on the accuracy of the tree constructed. In this paper, we revisit this problem. We make the following two contributions: 1) We present a numerical interval pruning (NIP) approach for efficiently processing numerical attributes. Our results show an average of 39% reduction in execution times. 2) We exploit the properties of the gain function entropy (and gini) to reduce the sample size required for obtaining a given bound on the accuracy. Our experimental results show a 37% reduction in the number of data instances required.
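
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of the standard one-sided Hoeffding bound and the sample size it implies, together with the two gain functions named above. It is not the authors' method: it does not implement NIP or the reduced sample-size bound of the paper. The function names (`hoeffding_epsilon`, `samples_needed`, `entropy`, `gini`) and the example parameters are illustrative assumptions.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """One-sided Hoeffding bound: with probability at least 1 - delta, the true
    mean of a variable with the given range exceeds the mean of n independent
    observations by at most the returned epsilon."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def samples_needed(value_range, delta, epsilon):
    """Smallest n for which hoeffding_epsilon(value_range, delta, n) <= epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


def entropy(class_counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)


def gini(class_counts):
    """Gini impurity of a class distribution given as raw counts."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


if __name__ == "__main__":
    # Assumed example: two classes, so entropy ranges over [0, 1] bits.
    n = samples_needed(value_range=1.0, delta=0.05, epsilon=0.05)
    print("instances needed:", n)
    print("epsilon at that n:", hoeffding_epsilon(1.0, 0.05, n))
    print("entropy of [5, 5]:", entropy([5, 5]))
    print("gini of [5, 5]:", gini([5, 5]))
```

Because the required sample size grows with the square of the range and the inverse square of epsilon, any tighter bound on the gain function translates directly into fewer instances per split decision, which is the kind of saving the abstract reports.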

Original language	English (US)
Pages	571-576
Number of pages	6
DOIs	https://doi.org/10.1145/956750.956821
State	Published - 2003
Externally published	Yes
Event	9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03 - Washington, DC, United States
Duration	Aug 24 2003 → Aug 27 2003

Conference

Conference	9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03
Country/Territory	United States
City	Washington, DC
Period	8/24/03 → 8/27/03

Keywords

Decision tree
Sampling
Streaming data

ASJC Scopus subject areas

Software
Information Systems

Cite this

@conference{1be55c284e33403da2e5a793471615da,
title = "Efficient decision tree construction on streaming data",
abstract = "Decision tree construction is a well-studied problem in data mining. Recently, there has been much interest in mining streaming data. Domingos and Hulten have presented a one-pass algorithm for decision tree construction. Their work uses the Hoeffding inequality to achieve a probabilistic bound on the accuracy of the tree constructed. In this paper, we revisit this problem. We make the following two contributions: 1) We present a numerical interval pruning (NIP) approach for efficiently processing numerical attributes. Our results show an average of 39% reduction in execution times. 2) We exploit the properties of the gain function entropy (and gini) to reduce the sample size required for obtaining a given bound on the accuracy. Our experimental results show a 37% reduction in the number of data instances required.",
keywords = "Decision tree, Sampling, Streaming data",
author = "Ruoming Jin and Gagan Agrawal",
note = "Copyright: Copyright 2010 Elsevier B.V., All rights reserved.; 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03 ; Conference date: 24-08-2003 Through 27-08-2003",
year = "2003",
doi = "10.1145/956750.956821",
language = "English (US)",
pages = "571--576",
}

TY - CONF
T1 - Efficient decision tree construction on streaming data
AU - Jin, Ruoming
AU - Agrawal, Gagan
PY - 2003
Y1 - 2003
AB - Decision tree construction is a well-studied problem in data mining. Recently, there has been much interest in mining streaming data. Domingos and Hulten have presented a one-pass algorithm for decision tree construction. Their work uses the Hoeffding inequality to achieve a probabilistic bound on the accuracy of the tree constructed. In this paper, we revisit this problem. We make the following two contributions: 1) We present a numerical interval pruning (NIP) approach for efficiently processing numerical attributes. Our results show an average of 39% reduction in execution times. 2) We exploit the properties of the gain function entropy (and gini) to reduce the sample size required for obtaining a given bound on the accuracy. Our experimental results show a 37% reduction in the number of data instances required.
KW - Decision tree
KW - Sampling
KW - Streaming data
UR - http://www.scopus.com/inward/record.url?scp=77952325551&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77952325551&partnerID=8YFLogxK
U2 - 10.1145/956750.956821
DO - 10.1145/956750.956821
M3 - Paper
AN - SCOPUS:77952325551
SP - 571
EP - 576
T2 - 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03
Y2 - 24 August 2003 through 27 August 2003
ER -
