MIT 6.5830 simpleDB Lab3

Exercise 1

需要完成的是:

  • src/java/simpledb/optimizer/IntHistogram.java

实验手册介绍了基于桶的方法来计算直方图,以便进行选择估计。直方图是一种统计报告,表示数据分布的频率。这里要完成的就是记录表的信息,实现基于桶的方法。

IntHistogram.java代码:

package simpledb.optimizer;

import simpledb.execution.Predicate;

/**
 * A class to represent a fixed-width histogram over a single integer-based field.
 */
public class IntHistogram {

    private int[] buckets;
    private int min;
    private int max;
    private double width;
    private int ntups;

    /**
     * Create a new IntHistogram.
     * <p>
     * This IntHistogram should maintain a histogram of integer values that it receives.
     * It should split the histogram into "buckets" buckets.
     * <p>
     * The values that are being histogrammed will be provided one-at-a-time through the "addValue()" function.
     * <p>
     * Your implementation should use space and have execution time that are both
     * constant with respect to the number of values being histogrammed.  For example, you shouldn't
     * simply store every value that you see in a sorted list.
     *
     * @param buckets The number of buckets to split the input value into.
     * @param min     The minimum integer value that will ever be passed to this class for histogramming
     * @param max     The maximum integer value that will ever be passed to this class for histogramming
     */
    public IntHistogram(int buckets, int min, int max) {
        // TODO: some code goes here
        this.buckets = new int[buckets];
        this.min = min;
        this.max = max;
        this.width = Math.max(1, (max - min + 1.0) / buckets);
        this.ntups = 0;
    }

    /**
     * Add a value to the set of values that you are keeping a histogram of.
     *
     * @param v Value to add to the histogram
     */
    public void addValue(int v) {
        // TODO: some code goes here
        if (v < min || v > max) {
            return;
        }
        buckets[getBucketId(v)]++;
        ntups++;
    }

    public int getBucketId(int v) {
        return (int) ((v - min) / width);
    }

    /**
     * Estimate the selectivity of a particular predicate and operand on this table.
     * <p>
     * For example, if "op" is "GREATER_THAN" and "v" is 5,
     * return your estimate of the fraction of elements that are greater than 5.
     *
     * @param op Operator
     * @param v  Value
     * @return Predicted selectivity of this particular operator and value
     */
    public double estimateSelectivity(Predicate.Op op, int v) {

        // TODO: some code goes here
        switch (op) {
            case EQUALS:
                return estimateSelectivity(Predicate.Op.GREATER_THAN_OR_EQ, v) - estimateSelectivity(Predicate.Op.GREATER_THAN, v);
            case GREATER_THAN:
                if (v > max) {
                    return 0.0;
                } else if (v <= min) {
                    return 1.0;
                } else {
                    int bucketId = getBucketId(v);
                    double tuples = 0.0;
                    for (int i = bucketId + 1; i < buckets.length; i++) {
                        tuples += buckets[i];
                    }
                    tuples += (min + (bucketId + 1) * width - v - 1) * (1.0 * buckets[bucketId] / width);
                    return tuples / ntups;
                }
            case LESS_THAN:
                return estimateSelectivity(Predicate.Op.LESS_THAN_OR_EQ, v - 1);
            case LESS_THAN_OR_EQ:
                return 1.0 - estimateSelectivity(Predicate.Op.GREATER_THAN, v);
            case GREATER_THAN_OR_EQ:
                return estimateSelectivity(Predicate.Op.GREATER_THAN, v - 1);
            case NOT_EQUALS:
                return 1.0 - estimateSelectivity(Predicate.Op.EQUALS, v);
            default:
                throw new IllegalArgumentException("Unsupported operator " + op);
        }
    }

    /**
     * @return the average selectivity of this histogram.
     *         <p>
     *         This is not an indispensable method to implement the basic
     *         join optimization. It may be needed if you want to
     *         implement a more efficient optimization
     */
    public double avgSelectivity() {
        // TODO: some code goes here
        double sum = 0.0;
        for (int bucket : buckets) {
            sum += 1.0 * bucket / ntups;
        }
        return sum / buckets.length;
    }

    /**
     * @return A string describing this histogram, for debugging purposes
     */
    public String toString() {
        // TODO: some code goes here
        return "IntHistogram{" +
                "buckets=" + java.util.Arrays.toString(buckets) +
                ", min=" + min +
                ", max=" + max +
                ", width=" + width +
                ", ntups=" + ntups +
                '}';
    }
}

这里的width取的是1(max - min + 1.0) / buckets的最大值,是因为width如果小于1,会导致桶比数据多,这里所做的事情就没有意义了。

此外,estimateSelectivity方法只实现了GREATER_THAN,剩下的Op都用GREATER_THAN推出来,GREATER_THAN中把比v所在的桶更大的桶中的数据数都加起来,再把v所在的桶假设数据均匀,min + (bucketId + 1) * width得到这个桶的最大值,再减去v,再减去1(因为下标从0开始),再除以buckets[bucketId] / width,估算这个桶内大于v的数据数,得到估计结果。

执行单元测试:

ant runtest -Dtest=IntHistogramTest

应该是successful的。

Exercise 2

需要完成的是:

  • src/java/simpledb/optimizer/TableStats.java

查询解析器会为每个表创建一个TableStats实例,传入tableId和每页IO开销,构建基于桶的直方图并求选择性估算。

TableStats.java代码:

package simpledb.optimizer;

import simpledb.common.Database;
import simpledb.common.Type;
import simpledb.execution.Predicate;
import simpledb.execution.SeqScan;
import simpledb.storage.*;
import simpledb.transaction.Transaction;

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * TableStats represents statistics (e.g., histograms) about base tables in a
 * query.
 * <p>
 * This class is not needed in implementing lab1 and lab2.
 */
public class TableStats {

    private static final ConcurrentMap<String, TableStats> statsMap = new ConcurrentHashMap<>();

    static final int IOCOSTPERPAGE = 1000;

    private int icCostPerPage;
    private ConcurrentHashMap<Integer, IntHistogram> intHistograms;
    private ConcurrentHashMap<Integer, StringHistogram> stringHistograms;
    private HeapFile dbFile;
    private TupleDesc td;
    private int totalTuples;

    public static TableStats getTableStats(String tablename) {
        return statsMap.get(tablename);
    }

    public static void setTableStats(String tablename, TableStats stats) {
        statsMap.put(tablename, stats);
    }

    public static void setStatsMap(Map<String, TableStats> s) {
        try {
            java.lang.reflect.Field statsMapF = TableStats.class.getDeclaredField("statsMap");
            statsMapF.setAccessible(true);
            statsMapF.set(null, s);
        } catch (NoSuchFieldException | IllegalAccessException | IllegalArgumentException | SecurityException e) {
            e.printStackTrace();
        }

    }

    public static Map<String, TableStats> getStatsMap() {
        return statsMap;
    }

    public static void computeStatistics() {
        Iterator<Integer> tableIt = Database.getCatalog().tableIdIterator();

        System.out.println("Computing table stats.");
        while (tableIt.hasNext()) {
            int tableid = tableIt.next();
            TableStats s = new TableStats(tableid, IOCOSTPERPAGE);
            setTableStats(Database.getCatalog().getTableName(tableid), s);
        }
        System.out.println("Done.");
    }

    /**
     * Number of bins for the histogram. Feel free to increase this value over
     * 100, though our tests assume that you have at least 100 bins in your
     * histograms.
     */
    static final int NUM_HIST_BINS = 100;

    /**
     * Create a new TableStats object, that keeps track of statistics on each
     * column of a table
     *
     * @param tableid       The table over which to compute statistics
     * @param ioCostPerPage The cost per page of IO. This doesn't differentiate between
     *                      sequential-scan IO and disk seeks.
     */
    public TableStats(int tableid, int ioCostPerPage) {
        // For this function, you'll have to get the
        // DbFile for the table in question,
        // then scan through its tuples and calculate
        // the values that you need.
        // You should try to do this reasonably efficiently, but you don't
        // necessarily have to (for example) do everything
        // in a single scan of the table.
        // TODO: some code goes here
        Map<Integer, Integer> minMap = new HashMap<>();
        Map<Integer, Integer> maxMap = new HashMap<>();
        this.intHistograms = new ConcurrentHashMap<>();
        this.stringHistograms = new ConcurrentHashMap<>();
        this.dbFile = (HeapFile)Database.getCatalog().getDatabaseFile(tableid);
        this.icCostPerPage = ioCostPerPage;
        this.td = this.dbFile.getTupleDesc();

        Transaction tx = new Transaction();
        tx.start();
        DbFileIterator it = dbFile.iterator(tx.getId());
        try {
            it.open();
            while (it.hasNext()) {
                this.totalTuples++;
                Tuple tuple = it.next();
                for (int i = 0; i < td.numFields(); i++) {
                    if (td.getFieldType(i).equals(Type.INT_TYPE)) {
                        IntField field = (IntField)tuple.getField(i);
                        minMap.put(i, Math.min(minMap.getOrDefault(i, Integer.MAX_VALUE), field.getValue()));
                        maxMap.put(i, Math.max(maxMap.getOrDefault(i, Integer.MIN_VALUE), field.getValue()));
                    } else if (td.getFieldName(i).equals(Type.STRING_TYPE)) {
                        StringHistogram histogram = this.stringHistograms.getOrDefault(i, new StringHistogram(NUM_HIST_BINS));
                        StringField field = (StringField)tuple.getField(i);
                        histogram.addValue(field.getValue());
                        this.stringHistograms.put(i, histogram);
                    }
                }
            }

            for (int i = 0; i < td.numFields(); ++i) {
                if (minMap.get(i) != null) {
                    this.intHistograms.put(i, new IntHistogram(NUM_HIST_BINS, minMap.get(i), maxMap.get(i)));
                }
            }

            it.rewind();
            while (it.hasNext()) {
                Tuple tuple = it.next();
                for (int i = 0; i < td.numFields(); ++i) {
                    if (td.getFieldType(i).equals(Type.INT_TYPE)) {
                        IntField f = (IntField)tuple.getField(i);
                        IntHistogram intHistogram = this.intHistograms.get(i);
                        if (intHistogram == null) throw new IllegalArgumentException();
                        intHistogram.addValue(f.getValue());
                        this.intHistograms.put(i, intHistogram);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            it.close();
            try {
                tx.commit();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * Estimates the cost of sequentially scanning the file, given that the cost
     * to read a page is costPerPageIO. You can assume that there are no seeks
     * and that no pages are in the buffer pool.
     * <p>
     * Also, assume that your hard drive can only read entire pages at once, so
     * if the last page of the table only has one tuple on it, it's just as
     * expensive to read as a full page. (Most real hard drives can't
     * efficiently address regions smaller than a page at a time.)
     *
     * @return The estimated cost of scanning the table.
     */
    public double estimateScanCost() {
        // TODO: some code goes here
        return dbFile.numPages() * this.icCostPerPage * 2;
    }

    /**
     * This method returns the number of tuples in the relation, given that a
     * predicate with selectivity selectivityFactor is applied.
     *
     * @param selectivityFactor The selectivity of any predicates over the table
     * @return The estimated cardinality of the scan with the specified
     *         selectivityFactor
     */
    public int estimateTableCardinality(double selectivityFactor) {
        // TODO: some code goes here
        return (int) (this.totalTuples * selectivityFactor);
    }

    /**
     * The average selectivity of the field under op.
     *
     * @param field the index of the field
     * @param op    the operator in the predicate
     *              The semantic of the method is that, given the table, and then given a
     *              tuple, of which we do not know the value of the field, return the
     *              expected selectivity. You may estimate this value from the histograms.
     */
    public double avgSelectivity(int field, Predicate.Op op) {
        // TODO: some code goes here
        if (td.getFieldType(field).equals(Type.INT_TYPE)) {
            return this.intHistograms.get(field).avgSelectivity();
        } else if (td.getFieldType(field).equals(Type.STRING_TYPE)) {
            return this.stringHistograms.get(field).avgSelectivity();
        }
        return -1.0;
    }

    /**
     * Estimate the selectivity of predicate <tt>field op constant</tt> on the
     * table.
     *
     * @param field    The field over which the predicate ranges
     * @param op       The logical operation in the predicate
     * @param constant The value against which the field is compared
     * @return The estimated selectivity (fraction of tuples that satisfy) the
     *         predicate
     */
    public double estimateSelectivity(int field, Predicate.Op op, Field constant) {
        // TODO: some code goes here
        if (td.getFieldType(field).equals(Type.INT_TYPE)) {
            return this.intHistograms.get(field).estimateSelectivity(op, ((IntField)constant).getValue());
        } else if (td.getFieldType(field).equals(Type.STRING_TYPE)) {
            return this.stringHistograms.get(field).estimateSelectivity(op, ((StringField)constant).getValue());
        }
        return -1.0;
    }

    /**
     * return the total number of tuples in this table
     */
    public int totalTuples() {
        // TODO: some code goes here
        return this.totalTuples;
    }

}

构造方法主要进行3个部分,第一部分是第一次扫描,统计各个字段的最大值和最小值;第二部分是给每个字段构建直方图,第三部分是给每个字段的直方图添加数据。

执行单元测试:

ant runtest -Dtest=TableStatsTest

应该是successful的。

Exercise 3

JoinOptimizer.java中包含了排序和计算成本的方法,需要实现的是estimateJoinCostestimateJoinCardinality方法。

estimateJoinCost代码:

    public double estimateJoinCost(LogicalJoinNode j, int card1, int card2,
                                   double cost1, double cost2) {
        if (j instanceof LogicalSubplanJoinNode) {
            // A LogicalSubplanJoinNode represents a subquery.
            // You do not need to implement proper support for these for Lab 3.
            return card1 + cost1 + cost2;
        } else {
            // Insert your code here.
            // HINT: You may need to use the variable "j" if you implemented
            // a join algorithm that's more complicated than a basic
            // nested-loops join.
            return cost1 + card1 * cost2 + card1 * card2;
        }
    }

这个方法的目标是估计一个连接操作的代价。如果是子查询,就用实验代码给定的简单估计(左表基数+左表扫描成本+右表扫描成本);如果不是子查询,就用左表的扫描成本+左表基数x右表扫描成本+左表基数x右表基数,因为左表需要全部扫描,而左表的每一条记录都需要扫描右表,再加上CPU开销。

estimateJoinCardinality代码:

    /**
     * Estimate the join cardinality of two tables.
     */
    public static int estimateTableJoinCardinality(Predicate.Op joinOp,
                                                   String table1Alias, String table2Alias, String field1PureName,
                                                   String field2PureName, int card1, int card2, boolean t1pkey,
                                                   boolean t2pkey, Map<String, TableStats> stats,
                                                   Map<String, Integer> tableAliasToId) {
        int card = 1;
        // TODO: some code goes here
        if (joinOp == Predicate.Op.EQUALS) {
            if (t1pkey && !t2pkey) {
                card = card2;
            } else if (!t1pkey && t2pkey) {
                card = card1;
            } else if (t1pkey && t2pkey) {
                card = Math.min(card1, card2);
            } else {
                card = Math.max(card1, card2);
            }
        } else {
            card = (int) (0.3 * card1 * card2);
        }
        return card <= 0 ? 1 : card;
    }

这个方法的目标是估计一个连接操作的基数。对于相等连接,其中一个属性是主键时连接产生的元组数量取非主键的基数;对于两个都是主键则取最小值;对于两个都是非主键时取最大值;对于范围扫描,选取两个基数交叉积的固定比例(30%)。

Exercise 4

需要实现JoinOptimizer.java中的orderJoins方法。

实验代码已经提供了enumerateSubsets方法枚举一个列表中所有指定大小的子集;computeCostAndCardOfSubplan方法在查询优化中,通过比较不同的连接顺序和方法,找出成本最低的查询计划;printJoins方法当优化器有explain选项时图形显示连接计划。

orderJoins代码:

    public List<LogicalJoinNode> orderJoins(
            Map<String, TableStats> stats,
            Map<String, Double> filterSelectivities, boolean explain)
            throws ParsingException {
        // Not necessary for labs 1 and 2.

        // TODO: some code goes here
        CostCard bestCostCard = new CostCard();
        PlanCache planCache = new PlanCache();
        for (int i = 1; i <= joins.size(); ++i) {
            Set<Set<LogicalJoinNode>> subsets = enumerateSubsets(joins, i);
            for (Set<LogicalJoinNode> subset : subsets) {
                double bestCostSoFar = Double.MAX_VALUE;
                bestCostCard = new CostCard();
                for (LogicalJoinNode join : subset) {
                    CostCard costCard = computeCostAndCardOfSubplan(stats, filterSelectivities, join, subset, bestCostSoFar, planCache);
                    if (costCard != null) {
                        bestCostSoFar = costCard.cost;
                        bestCostCard = costCard;
                    }
                }
                if (bestCostSoFar != Double.MAX_VALUE) {
                    planCache.addPlan(subset, bestCostCard.cost, bestCostCard.card, bestCostCard.plan);
                }
            }
        }
        if (explain) {
            printJoins(bestCostCard.plan, planCache, stats, filterSelectivities);;
        }
        if (bestCostCard.plan != null) {
            return bestCostCard.plan;
        }
        return joins;
    }

这个方法的目标是计算一个逻辑上合理且效率较高的表连接顺序,使用了动态规划的思想来寻找最优的连接顺序。

执行单元测试和系统测试:

ant runtest -Dtest=JoinOptimizerTest
ant runsystest -Dtest=QueryTest

应该是successful的。

Extra Credit

实验代码实现的enumerateSubsets方法通过暴力枚举求子集,但是实际上并不需要所有大小的子集,只要指定size大小的子集,所以可以改为深度优先搜索(DFS)会更好。

enumerateSubsets代码:

    public <T> Set<Set<T>> enumerateSubsets(List<T> v, int size) {
        Set<Set<T>> els = new HashSet<>();
        dfs(els, v, size, 0, new LinkedList<>());
        return els;
    }

    private <T> void dfs(Set<Set<T>> els, List<T> v, int size, int curIdx, Deque<T> path) {
        if (path.size() == size) {
            els.add(new HashSet<>(path));
            return;
        }
        if (curIdx == v.size()) {
            return;
        }
        path.addFirst(v.get(curIdx));
        dfs(els, v, size, curIdx + 1, path);
        path.removeFirst();
        dfs(els, v, size, curIdx + 1, path);
    }
posted on 2024-04-25 15:49  未连接到互联网  阅读(16)  评论(0编辑  收藏  举报