Silberschatz Korth Sudarshan: Database System Concepts, Fourth Edition
VII Other Topics
22 Advanced Querying and Information Retrieval
The McGraw Hill Companies, 2001
respectively. As an optimization, since the class for the range 25,000 to 50,000 and the range 50,000 to 75,000 is the same under the node degree = masters, the two ranges have been merged into a single range 25,000 to 75,000.

Best Splits

Intuitively, by choosing a sequence of partitioning attributes, we start with the set of all training instances, which is impure in the sense that it contains instances from many classes, and end up with leaves which are pure in the sense that at each leaf all training instances belong to only one class. We shall see shortly how to measure purity quantitatively. To judge the benefit of picking a particular attribute and condition for partitioning of the data at a node, we measure the purity of the data at the children resulting from partitioning by that attribute. The attribute and condition that result in the maximum purity are chosen.

The purity of a set S of training instances can be measured quantitatively in several ways. Suppose there are k classes, and of the instances in S the fraction of instances in class i is $p_i$. One measure of purity, the Gini measure, is defined as

$$\text{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2$$
When all instances are in a single class, the Gini value is 0, while it reaches its maximum (of $1 - 1/k$) if each class has the same number of instances. Another measure of purity is the entropy measure, which is defined as

$$\text{Entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$$
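The Gini and entropy measures defined above can be computed directly from the class labels of a set of training instances. The following is a minimal Python sketch, not taken from the text: the function names and the example labels are hypothetical, and the input list is assumed to be non-empty. Classes that do not occur in the set are simply absent from the counts, so no term of the form 0 log 0 is ever evaluated.

```python
import math
from collections import Counter


def class_fractions(labels):
    """Return the fraction p_i of instances in each class that occurs in labels.

    Assumption: labels is a non-empty list of class labels.
    """
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini measure: 1 minus the sum of the squared class fractions."""
    return 1.0 - sum(p * p for p in class_fractions(labels))


def entropy(labels):
    """Entropy measure: sum over classes of -p_i * log2(p_i)."""
    return sum(-p * math.log2(p) for p in class_fractions(labels))


if __name__ == "__main__":
    pure = ["good", "good", "good", "good"]    # hypothetical labels, one class
    mixed = ["good", "good", "bad", "bad"]     # two classes, equally frequent
    print(gini(pure), entropy(pure))
    print(gini(mixed), entropy(mixed))
```

Under both measures a value of 0 indicates a set drawn from a single class, and larger values indicate a more even mix of classes.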
The entropy value is 0 if all instances are in a single class, and reaches its maximum when each class has the same number of instances. The entropy measure derives from information theory.

When a set S is split into multiple sets $S_i$, $i = 1, 2, \ldots, r$, we can measure the purity of the resultant set of sets as:

$$\text{Purity}(S_1, S_2, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\, \text{purity}(S_i)$$
That is, the purity is the weighted average of the purity of the sets $S_i$. The above formula can be used with both the Gini measure and the entropy measure of purity. The information gain due to a particular split of S into $S_i$, $i = 1, 2, \ldots, r$ is then

$$\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\}) = \text{purity}(S) - \text{purity}(S_1, S_2, \ldots, S_r)$$

Splits into fewer sets are preferable to splits into many sets, since they lead to simpler and more meaningful decision trees. The number of elements in each of the sets $S_i$ may also be taken into account; otherwise, whether a set $S_i$ has 0 elements or 1 element would make a big difference in the number of sets, although the split is the same for almost all the elements. The information content of a particular split can be defined in terms of entropy as

$$\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\}) = -\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$
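The weighted purity of a split, the information gain, and the information content defined above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not an implementation from the text: the function names and example data are hypothetical, entropy is used as the default purity measure, and an empty partition is assumed to contribute nothing to any sum.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a non-empty list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())


def weighted_purity(partitions, measure=entropy):
    """Average of measure(S_i), weighted by |S_i| / |S|.

    Assumption: empty partitions are skipped (they carry zero weight).
    """
    total = sum(len(part) for part in partitions)
    return sum(len(part) / total * measure(part)
               for part in partitions if part)


def information_gain(labels, partitions, measure=entropy):
    """purity(S) minus the weighted purity of the sets S_1, ..., S_r."""
    return measure(labels) - weighted_purity(partitions, measure)


def information_content(partitions):
    """Entropy of the partition sizes: -sum (|S_i|/|S|) log2 (|S_i|/|S|)."""
    total = sum(len(part) for part in partitions)
    return sum(-(len(part) / total) * math.log2(len(part) / total)
               for part in partitions if part)


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]            # hypothetical training labels
    clean_split = [["a", "a"], ["b", "b"]]   # each child holds a single class
    uneven_split = [["a", "a", "b"], ["b"]]  # one child still mixes classes
    print(information_gain(labels, clean_split), information_content(clean_split))
    print(information_gain(labels, uneven_split), information_content(uneven_split))
```

The split whose children each contain a single class yields the larger gain, because the weighted measure over its children drops to 0.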
All of this leads to a definition: The best split for an attribute is the one that gives the maximum information gain ratio, defined as

$$\frac{\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\})}{\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\})}$$

Finding Best Splits

How do we find the best split for an attribute? How to split an attribute depends on the type of the attribute. Attributes can be either continuous valued, that is, the values can be ordered in a fashion meaningful to classification, such as age or income, or can be categorical, that is, they have no meaningful order, such as department names or country names. We do not expect the sort order of department names or country names to have any significance to classification. Usually attributes that are numbers (integers/reals) are treated as continuous valued while character string attributes are treated as categorical, but this may be controlled by the user of the system. In our example, we have treated the attribute degree as categorical, and the attribute income as continuous valued.

We first consider how to find best splits for continuous-valued attributes. For simplicity, we shall only consider binary splits of continuous-valued attributes, that is, splits that result in two children. The case of multiway splits is more complicated; see the bibliographical notes for references on the subject.

To find the best binary split of a continuous-valued attribute, we first sort the attribute values in the training instances. We then compute the information gain obtained by splitting at each value. For example, if the training instances have values 1, 10, 15, and 25 for an attribute, the split points considered are 1, 10, and 15; in each case values less than or equal to the split point form one partition and the rest of the values form the other partition. The best binary split for the attribute is the split that gives the maximum information gain.

For a categorical attribute, we can have a multiway split, with a child for each value of the attribute. This works fine for categorical attributes with only a few distinct values, such as degree or gender. However, if the attribute has many distinct values, such as department names in a large company, creating a child for each value is not a good idea. In such cases, we would try to combine multiple values into each child, to create a smaller number of children. See the bibliographical notes for references on how to do so.

Decision-Tree Construction Algorithm

The main idea of decision tree construction is to evaluate different attributes and different partitioning conditions, and pick the attribute and partitioning condition that results in the maximum information gain ratio. The same procedure works recur-