Errata for Mining of Massive Datasets - First Edition
    This is the errata sheet for the first edition. It is no longer being maintained. For errata in the second edition, please see The Errata Sheet for the Second Edition. Page numbers refer to the pages in the book's hardcopy edition, not the downloads. We shall endeavor to keep the downloads up-to-date.

    Section Location Problem Reported By Date Reported
    1.1.5 p. 4, l. 13 "orignal" should be "original". Ed Knorr 3/5/12
    1.4 p. 16, 3 lines above Sect. 1.5 Delete "what". Rok Sosic 3/18/13
    1.5 p. 16, l. -10 "many" should be "many instances". Rok Sosic 3/18/13
    2.2.2 p. 23, l. 1, 2 "Grouping and aggregation" should better be referred to as "grouping by key". Rok Sosic 3/18/13
    2.3.2 p. 27, Fig. 2.4 caption "fives" should be "five". Ed Knorr 3/5/12
    2.3.10 p. 35, l. 13 Right parenthesis missing at the end. Anastasios Gounaris 7/10/13
    2.3.11 p. 36, l. 12 "However, it" should be "However, in". Waleed Hameid 5/1/12
    2.4.1 p. 37, l. -7 "ts" should be "its". Aris Anagnostopoulos and Rok Sosic 3/2/13
    2.4.2 p. 39, l. 20 R should be P twice in the displayed expression. Aris Anagnostopoulos 3/2/13
    2.5.1 p. 43, l. 14 "gigabit" should be "one gigabit per second". Ed Knorr 3/5/12
    2.5.1 p. 44, l. 15 "of" should be "on". Anastasios Gounaris 7/10/13
    2.5.2 p. 44, l. -12 "not use" should be "not use it". Rok Sosic 3/18/13
    2.5.3 p. 46, Fig. 2.8 "h(T.C)=1" should be "g(T.C)=1". Waleed Hameid 4/29/12
    2.5.3 p. 48, l. -2 of the box Delete "inversely". Rok Sosic 3/18/13
    3.4.2 p. 69, l. 9 The "threshold" should be defined as the value of s for which the probability of being a candidate reaches 1/2. Rok Sosic 3/18/13
    3.4.2 p. 69, l. -4 0.328 should be 0.672. Robert West 5/2/12
    3.6.3 p. 81, l. -11 "0.2 and 0.6" should be "0.8 and 0.4". Amitabh Chaudhary 4/6/14
    3.6.3 p. 82, l. 6 below figure The figures given there are actually for a use of Example 3.19 followed by Example 3.18. If we use Example 3.18 first, and then Example 3.19, we get (0.2, 0.8, 0.9991285, 0.0000004). Zhou Jingbo 7/11/13
    3.7.2 p. 85, l. 1 d2/180 should be (180-d2)/180. Nicholas Zhao 1/21/13
    3.7.4 p. 86, l. -3 Remove d from d cos θ. Wang Bin 6/7/12
    3.9.6 p. 101, l. 16, 17 "prefix" should be "suffix" in both lines. Weng Zhen-Bin 11/7/13
    3.9.6 p. 102, l. 13 Right parenthesis needed after 9+j. Weng Zhen-Bin 11/7/13
    4.2.1 p. 113, l. -15 capitalize "for". Rok Sosic 3/18/13
    4.2.2 p. 113, l. -3, -2 "URL's" should be "IP addresses". Rok Sosic 3/18/13
    4.4.1 p. 119, l. 7, 8, and 10 "URL" should be "IP address". Aris Anagnostopoulos and Rok Sosic 3/2/13
    4.5.1 p. 122, l. 21 Delete one "the". Rok Sosic 3/18/13
    4.5.3 p. 124 All the occurrences of X.value in this section should be 2*X.value - 1. Ge Qi, Greg Lee 11/4/14
    4.5.6 p. 126, l. -2 "Exercise 4.7" should be "Exercise 4.5.3". Wang Bin 6/7/12
    4.5.6 p. 127, l. 3 "induction on n" should be "induction on m". Wang Bin 6/7/12
    4.6.2 p. 128, bottom There are actually six rules needed. The sixth is that every position with a 1 is in some bucket. Aris Anagnostopoulos 3/2/13
    4.6.3 p. 129, l. -14 "log n" should be "log N". Wang Bin 6/7/12
    4.6.4 p. 129, l. -4 "highest" should be "earliest" or "lowest". Robert West 5/2/12
    4.6.6 p. 131, l. -3 The condition must also be relaxed for the buckets of size 1. There may be any number between 1 and r of these too. Aris Anagnostopoulos 3/2/13
    5.4.2 p. 165, l. 11-12 These lines are better expressed: "of the fraction of the Web, m/n, that is in the spam farm." Aris Anagnostopoulos 3/2/13
    5.4.5 p. 166, l. -15 "measure" should be "measure for each page". Rok Sosic 3/18/13
    5.5.2 p. 171, l. -1 A should be L. Wang Bin 6/7/12
    6.1.1 p. 178, l. 5-6 In Fig. 6.2, the entry for {cat, a} should be 2,3,7 and the entry for {and, a} should be {2,7}. As a result, 4-5 lines below the figure, there should be 5 frequent pairs, including {cat, a}. Further, in the paragraph below that, we should discover that {dog, cat, a} is a frequent triple. Robin Bennett 3/19/13
    6.1.2 p. 179, l. 3 of box "of" between "pairs" and "items". Ed Knorr 3/5/12
    6.1.2 p. 179, l. -2 of box One too many left quotes. Ed Knorr 3/5/12
    6.1.3 p. 180, l. -14, -13 The confidence is actually 3/5, since "and" also appears in basket (7). Ed Knorr 3/5/12
    6.1.3 p. 181, l. 1 "That its" should be "That is". Robert West 5/2/12
    6.1.3 p. 181, l. 8 "bear" should be "beer". Ed Knorr 3/5/12
    6.2.5 p. 187, l. -4 "exceed" should be "are equal to or greater than". Anastasios Gounaris 7/10/13
    6.2.5 p. 188, l. -14 The units are bytes for both space counts. Aris Anagnostopoulos 3/2/13
    6.2.5 p. 188, l. -3, -2 "frequent pairs" should be "pairs of frequent items". Anastasios Gounaris 7/10/13
    6.3 p. 193, 195, 197 In each figure, "strucrure" should be "structure". Ed Knorr 3/5/12
    6.3.1 p. 194, l. -7, -16 "frequent pairs" should be "pairs of frequent items". Robert West 5/2/12
    6.3.2 p. 194, l. -9 "multistage" should be capitalized. Ed Knorr 3/5/12
    6.3.2 p. 196, l. 3 of box "hash" should be "hashes". Rok Sosic 3/18/13
    6.4.2 p. 201, l. -20 "though" should be "through". Anastasios Gounaris 7/10/13
    6.4.5 p. 203, l. 22 "finite" should be "nonzero". Aris Anagnostopoulos 3/2/13
    6.5.2 p. 207, l. 21 "for that element" should be "for that item". Wang Bin 6/7/12
    6.5.3 p. 208, l. -15 Replace "this standard" by "this algorithm". Anastasios Gounaris 7/10/13
    7.1 p. 213, l. -13 "agglomerative" should be "point-assignment". Aris Anagnostopoulos 3/2/13
    7.1.3 p. 216, formula at bottom of page The sum in the numerator should start at i = 1, not 0. Anastasios Gounaris 7/10/13
    7.1.4 p. 217, Exercise 7.1.2, l. 2 "Eclidean" should be "Euclidean". Angad Singh 3/5/12
    7.3.2 p. 228, l. 5 Period should be a comma in (12,3). Anastasios Gounaris 7/10/13
    7.3.4 p. 231, l. 7-8 of box "we can add the squares of the components of the vector to SUMSQ to get the new SUMSQ." Anastasios Gounaris 7/10/13
    7.5.2 p. 238, l. -2 "close as" should be "as close as". Aris Anagnostopoulos 3/2/13
    7.5.4 p. 241, l. 9 Add "the clustroid of" after "and then to". Anastasios Gounaris 7/10/13
    7.6.4 p. 245, l. -10 "centroids" should be "clustroids". Anastasios Gounaris 7/10/13
    7.7 p. 250, l. 18 "DBMO" should be "BDMO". Wang Bin 6/7/12
    8.4.7 p. 268, l. 21 The exponent on e should be -fi (minus sign is missing). Zack Taylor 8/14/12
    9.1.1 p. 279, l. 1 "those" should be "all". Rok Sosic 3/18/13
    9.1.2 p. 280, l. 7 below the box "all the many" should be "each of the". Rok Sosic 3/18/13
    9.2.2 p. 282, l. -12 "item" should be "items". Rok Sosic 3/18/13
    9.2.2 p. 282, l. -10 "yields" should be "yield". Rok Sosic 3/18/13
    9.2.4 p. 285, l. 12 "feature" should be "features". Rok Sosic 3/18/13
    9.3.1 p. 292, l. 11 .386 should be .380. Oscar Wu 8/2/12
    9.3.1 p. 292, l. 14-15 The conclusion is backward. In fact, a higher (positive) cosine means a smaller angle and therefore greater similarity. In this case, cosine distance suggests A is more similar to B than to C. Oscar Wu 8/2/12
    9.4.2 p. 300, l. 3 Right parenthesis missing before = sign. Dennis Sidharta 3/5/12
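
    The figures in the entry for 3.6.3, p. 82 can be checked numerically. The sketch below is only an illustration and rests on assumptions this sheet does not state: that Example 3.18 is a 4-way AND followed by a 4-way OR, that Example 3.19 is a 4-way OR followed by a 4-way AND, and that the starting family has p1 = 0.8 and p2 = 0.2. The function names are hypothetical.

```python
def and_then_or(p, r=4, b=4):
    """r-way AND followed by b-way OR: 1 - (1 - p^r)^b.
    Assumed here to be the construction of Example 3.18."""
    return 1 - (1 - p ** r) ** b


def or_then_and(p, b=4, r=4):
    """b-way OR followed by r-way AND: (1 - (1 - p)^b)^r.
    Assumed here to be the construction of Example 3.19."""
    return (1 - (1 - p) ** b) ** r


# Assumed starting probabilities (not given in this errata sheet).
p1, p2 = 0.8, 0.2

# Example 3.18 first, then Example 3.19 (the order the erratum describes).
first_order = tuple(or_then_and(and_then_or(p)) for p in (p1, p2))

# Example 3.19 first, then Example 3.18 (the order the erratum attributes
# to the printed figures).
second_order = tuple(and_then_or(or_then_and(p)) for p in (p1, p2))

print("3.18 then 3.19: p1 = %.7f, p2 = %.7f" % first_order)
print("3.19 then 3.18: p1 = %.7f, p2 = %.7f" % second_order)
```

    Under these assumptions the first ordering rounds to 0.9991285 and 0.0000004, which agrees with the corrected values in the entry above.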