MIT 6.5830 SimpleDB Lab 3
Exercise 1
What needs to be completed:
- src/java/simpledb/optimizer/IntHistogram.java
The lab handout describes a bucket-based approach to computing histograms for selectivity estimation. A histogram is a statistical summary of how frequently values occur in the data. The task here is to record each table's statistics by implementing that bucket-based approach.
IntHistogram.java code:
package simpledb.optimizer;
import simpledb.execution.Predicate;
/**
* A class to represent a fixed-width histogram over a single integer-based field.
*/
public class IntHistogram {
private int[] buckets;
private int min;
private int max;
private double width;
private int ntups;
/**
* Create a new IntHistogram.
* <p>
* This IntHistogram should maintain a histogram of integer values that it receives.
* It should split the histogram into "buckets" buckets.
* <p>
* The values that are being histogrammed will be provided one-at-a-time through the "addValue()" function.
* <p>
* Your implementation should use space and have execution time that are both
* constant with respect to the number of values being histogrammed. For example, you shouldn't
* simply store every value that you see in a sorted list.
*
* @param buckets The number of buckets to split the input value into.
* @param min The minimum integer value that will ever be passed to this class for histogramming
* @param max The maximum integer value that will ever be passed to this class for histogramming
*/
public IntHistogram(int buckets, int min, int max) {
this.buckets = new int[buckets];
this.min = min;
this.max = max;
this.width = Math.max(1, (max - min + 1.0) / buckets);
this.ntups = 0;
}
/**
* Add a value to the set of values that you are keeping a histogram of.
*
* @param v Value to add to the histogram
*/
public void addValue(int v) {
if (v < min || v > max) {
return;
}
buckets[getBucketId(v)]++;
ntups++;
}
public int getBucketId(int v) {
return (int) ((v - min) / width);
}
/**
* Estimate the selectivity of a particular predicate and operand on this table.
* <p>
* For example, if "op" is "GREATER_THAN" and "v" is 5,
* return your estimate of the fraction of elements that are greater than 5.
*
* @param op Operator
* @param v Value
* @return Predicted selectivity of this particular operator and value
*/
public double estimateSelectivity(Predicate.Op op, int v) {
switch (op) {
case EQUALS:
return estimateSelectivity(Predicate.Op.GREATER_THAN_OR_EQ, v) - estimateSelectivity(Predicate.Op.GREATER_THAN, v);
case GREATER_THAN:
if (v > max) {
return 0.0;
} else if (v <= min) {
return 1.0;
} else {
int bucketId = getBucketId(v);
double tuples = 0.0;
for (int i = bucketId + 1; i < buckets.length; i++) {
tuples += buckets[i];
}
tuples += (min + (bucketId + 1) * width - v - 1) * (1.0 * buckets[bucketId] / width);
return tuples / ntups;
}
case LESS_THAN:
return estimateSelectivity(Predicate.Op.LESS_THAN_OR_EQ, v - 1);
case LESS_THAN_OR_EQ:
return 1.0 - estimateSelectivity(Predicate.Op.GREATER_THAN, v);
case GREATER_THAN_OR_EQ:
return estimateSelectivity(Predicate.Op.GREATER_THAN, v - 1);
case NOT_EQUALS:
return 1.0 - estimateSelectivity(Predicate.Op.EQUALS, v);
default:
throw new IllegalArgumentException("Unsupported operator " + op);
}
}
/**
* @return the average selectivity of this histogram.
* <p>
* This is not an indispensable method to implement the basic
* join optimization. It may be needed if you want to
* implement a more efficient optimization
*/
public double avgSelectivity() {
double sum = 0.0;
for (int bucket : buckets) {
sum += 1.0 * bucket / ntups;
}
return sum / buckets.length;
}
/**
* @return A string describing this histogram, for debugging purposes
*/
public String toString() {
return "IntHistogram{" +
"buckets=" + java.util.Arrays.toString(buckets) +
", min=" + min +
", max=" + max +
", width=" + width +
", ntups=" + ntups +
'}';
}
}
Here width is the larger of 1 and (max - min + 1.0) / buckets. If width were allowed to drop below 1, there would be more buckets than distinct values, and the bucketing would accomplish nothing.
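For example (hypothetical numbers): min = 1, max = 100, buckets = 10 gives width = 10.0, while min = 0, max = 5, buckets = 100 would give a raw width of 0.06, so it is clamped to 1 and only 6 of the 100 buckets are ever used.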
In addition, estimateSelectivity implements only GREATER_THAN directly; every other Op is derived from it. For GREATER_THAN, the counts of all buckets strictly above the one containing v are summed first. Then, assuming the values inside v's bucket are uniformly distributed, min + (bucketId + 1) * width - 1 is the bucket's inclusive upper bound, so the bucket holds (upper bound - v) integer values above v; multiplying by the bucket's density, buckets[bucketId] / width (tuples per unit of width), estimates how many tuples in that bucket exceed v. Dividing the total by ntups yields the selectivity.
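As a quick sanity check of these estimates, here is a hypothetical demo class (IntHistogramDemo is not part of the lab; it only assumes the simpledb classes from this lab are on the classpath):
import simpledb.execution.Predicate;
import simpledb.optimizer.IntHistogram;

public class IntHistogramDemo {
    public static void main(String[] args) {
        // 10 buckets over [1, 100], so width = 10.0 and bucket 4 covers [41, 50].
        IntHistogram h = new IntHistogram(10, 1, 100);
        for (int v = 1; v <= 100; v++) {
            h.addValue(v); // one tuple per value -> perfectly uniform data
        }
        // Buckets 5..9 hold 50 of the 100 tuples and nothing in bucket 4 exceeds 50,
        // so this prints 0.5.
        System.out.println(h.estimateSelectivity(Predicate.Op.GREATER_THAN, 50));
        // EQUALS is derived as GREATER_THAN_OR_EQ(50) - GREATER_THAN(50) = 0.51 - 0.50 = 0.01.
        System.out.println(h.estimateSelectivity(Predicate.Op.EQUALS, 50));
    }
}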
Run the unit test:
ant runtest -Dtest=IntHistogramTest
It should pass.
Exercise 2
What needs to be completed:
- src/java/simpledb/optimizer/TableStats.java
The query parser creates a TableStats instance for each table, passing in the tableId and the per-page IO cost; the instance builds bucket-based histograms over the table's columns and uses them to answer selectivity estimates.
TableStats.java code:
package simpledb.optimizer;
import simpledb.common.Database;
import simpledb.common.Type;
import simpledb.execution.Predicate;
import simpledb.execution.SeqScan;
import simpledb.storage.*;
import simpledb.transaction.Transaction;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
/**
* TableStats represents statistics (e.g., histograms) about base tables in a
* query.
* <p>
* This class is not needed in implementing lab1 and lab2.
*/
public class TableStats {
private static final ConcurrentMap<String, TableStats> statsMap = new ConcurrentHashMap<>();
static final int IOCOSTPERPAGE = 1000;
private int ioCostPerPage;
private ConcurrentHashMap<Integer, IntHistogram> intHistograms;
private ConcurrentHashMap<Integer, StringHistogram> stringHistograms;
private HeapFile dbFile;
private TupleDesc td;
private int totalTuples;
public static TableStats getTableStats(String tablename) {
return statsMap.get(tablename);
}
public static void setTableStats(String tablename, TableStats stats) {
statsMap.put(tablename, stats);
}
public static void setStatsMap(Map<String, TableStats> s) {
try {
java.lang.reflect.Field statsMapF = TableStats.class.getDeclaredField("statsMap");
statsMapF.setAccessible(true);
statsMapF.set(null, s);
} catch (NoSuchFieldException | IllegalAccessException | IllegalArgumentException | SecurityException e) {
e.printStackTrace();
}
}
public static Map<String, TableStats> getStatsMap() {
return statsMap;
}
public static void computeStatistics() {
Iterator<Integer> tableIt = Database.getCatalog().tableIdIterator();
System.out.println("Computing table stats.");
while (tableIt.hasNext()) {
int tableid = tableIt.next();
TableStats s = new TableStats(tableid, IOCOSTPERPAGE);
setTableStats(Database.getCatalog().getTableName(tableid), s);
}
System.out.println("Done.");
}
/**
* Number of bins for the histogram. Feel free to increase this value over
* 100, though our tests assume that you have at least 100 bins in your
* histograms.
*/
static final int NUM_HIST_BINS = 100;
/**
* Create a new TableStats object, that keeps track of statistics on each
* column of a table
*
* @param tableid The table over which to compute statistics
* @param ioCostPerPage The cost per page of IO. This doesn't differentiate between
* sequential-scan IO and disk seeks.
*/
public TableStats(int tableid, int ioCostPerPage) {
// For this function, you'll have to get the
// DbFile for the table in question,
// then scan through its tuples and calculate
// the values that you need.
// You should try to do this reasonably efficiently, but you don't
// necessarily have to (for example) do everything
// in a single scan of the table.
Map<Integer, Integer> minMap = new HashMap<>();
Map<Integer, Integer> maxMap = new HashMap<>();
this.intHistograms = new ConcurrentHashMap<>();
this.stringHistograms = new ConcurrentHashMap<>();
this.dbFile = (HeapFile)Database.getCatalog().getDatabaseFile(tableid);
this.ioCostPerPage = ioCostPerPage;
this.td = this.dbFile.getTupleDesc();
Transaction tx = new Transaction();
tx.start();
DbFileIterator it = dbFile.iterator(tx.getId());
try {
it.open();
while (it.hasNext()) {
this.totalTuples++;
Tuple tuple = it.next();
for (int i = 0; i < td.numFields(); i++) {
if (td.getFieldType(i).equals(Type.INT_TYPE)) {
IntField field = (IntField)tuple.getField(i);
minMap.put(i, Math.min(minMap.getOrDefault(i, Integer.MAX_VALUE), field.getValue()));
maxMap.put(i, Math.max(maxMap.getOrDefault(i, Integer.MIN_VALUE), field.getValue()));
} else if (td.getFieldType(i).equals(Type.STRING_TYPE)) {
StringHistogram histogram = this.stringHistograms.getOrDefault(i, new StringHistogram(NUM_HIST_BINS));
StringField field = (StringField)tuple.getField(i);
histogram.addValue(field.getValue());
this.stringHistograms.put(i, histogram);
}
}
}
for (int i = 0; i < td.numFields(); ++i) {
if (minMap.get(i) != null) {
this.intHistograms.put(i, new IntHistogram(NUM_HIST_BINS, minMap.get(i), maxMap.get(i)));
}
}
it.rewind();
while (it.hasNext()) {
Tuple tuple = it.next();
for (int i = 0; i < td.numFields(); ++i) {
if (td.getFieldType(i).equals(Type.INT_TYPE)) {
IntField f = (IntField)tuple.getField(i);
IntHistogram intHistogram = this.intHistograms.get(i);
if (intHistogram == null) throw new IllegalArgumentException();
intHistogram.addValue(f.getValue());
this.intHistograms.put(i, intHistogram);
}
}
}
} catch (Exception e) {
e.printStackTrace();
} finally {
it.close();
try {
tx.commit();
} catch (IOException e) {
e.printStackTrace();
}
}
}
/**
* Estimates the cost of sequentially scanning the file, given that the cost
* to read a page is costPerPageIO. You can assume that there are no seeks
* and that no pages are in the buffer pool.
* <p>
* Also, assume that your hard drive can only read entire pages at once, so
* if the last page of the table only has one tuple on it, it's just as
* expensive to read as a full page. (Most real hard drives can't
* efficiently address regions smaller than a page at a time.)
*
* @return The estimated cost of scanning the table.
*/
public double estimateScanCost() {
return dbFile.numPages() * this.ioCostPerPage;
}
/**
* This method returns the number of tuples in the relation, given that a
* predicate with selectivity selectivityFactor is applied.
*
* @param selectivityFactor The selectivity of any predicates over the table
* @return The estimated cardinality of the scan with the specified
* selectivityFactor
*/
public int estimateTableCardinality(double selectivityFactor) {
return (int) (this.totalTuples * selectivityFactor);
}
/**
* The average selectivity of the field under op.
*
* @param field the index of the field
* @param op the operator in the predicate
* The semantic of the method is that, given the table, and then given a
* tuple, of which we do not know the value of the field, return the
* expected selectivity. You may estimate this value from the histograms.
*/
public double avgSelectivity(int field, Predicate.Op op) {
if (td.getFieldType(field).equals(Type.INT_TYPE)) {
return this.intHistograms.get(field).avgSelectivity();
} else if (td.getFieldType(field).equals(Type.STRING_TYPE)) {
return this.stringHistograms.get(field).avgSelectivity();
}
return -1.0;
}
/**
* Estimate the selectivity of predicate <tt>field op constant</tt> on the
* table.
*
* @param field The field over which the predicate ranges
* @param op The logical operation in the predicate
* @param constant The value against which the field is compared
* @return The estimated selectivity (fraction of tuples that satisfy) the
* predicate
*/
public double estimateSelectivity(int field, Predicate.Op op, Field constant) {
if (td.getFieldType(field).equals(Type.INT_TYPE)) {
return this.intHistograms.get(field).estimateSelectivity(op, ((IntField)constant).getValue());
} else if (td.getFieldType(field).equals(Type.STRING_TYPE)) {
return this.stringHistograms.get(field).estimateSelectivity(op, ((StringField)constant).getValue());
}
return -1.0;
}
/**
* return the total number of tuples in this table
*/
public int totalTuples() {
return this.totalTuples;
}
}
The constructor has three parts: the first scan over the table records each integer field's minimum and maximum (string fields are fed directly into StringHistograms, which need no bounds); the second part builds an IntHistogram for each integer field from those bounds; the third part rescans the table and adds every integer value to its field's histogram.
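A minimal usage sketch of the finished class (hypothetical: it assumes a schema has already been loaded into the catalog, e.g. via Database.getCatalog().loadSchema(...), and that it contains a table named "t"):
// Build statistics for every table in the catalog, then query one of them.
TableStats.computeStatistics();
TableStats ts = TableStats.getTableStats("t");
System.out.println(ts.totalTuples());                 // tuple count gathered during the first scan
System.out.println(ts.estimateScanCost());            // numPages * ioCostPerPage
System.out.println(ts.estimateTableCardinality(0.5)); // tuples surviving a 50%-selective predicate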
Run the unit test:
ant runtest -Dtest=TableStatsTest
It should pass.
Exercise 3
JoinOptimizer.java contains the methods for ordering joins and computing their costs; the ones to implement are estimateJoinCost and estimateJoinCardinality.
estimateJoinCost code:
public double estimateJoinCost(LogicalJoinNode j, int card1, int card2,
double cost1, double cost2) {
if (j instanceof LogicalSubplanJoinNode) {
// A LogicalSubplanJoinNode represents a subquery.
// You do not need to implement proper support for these for Lab 3.
return card1 + cost1 + cost2;
} else {
// Insert your code here.
// HINT: You may need to use the variable "j" if you implemented
// a join algorithm that's more complicated than a basic
// nested-loops join.
return cost1 + card1 * cost2 + card1 * card2;
}
}
The goal of this method is to estimate the cost of a join. For a subquery (a LogicalSubplanJoinNode), the simple estimate provided by the lab code is used (left-table cardinality + left scan cost + right scan cost). Otherwise the cost of a nested-loops join is cost1 + card1 * cost2 + card1 * card2: the left table is scanned once in full, the right table is rescanned once for every left tuple, and the last term accounts for the CPU cost of processing each pair of tuples.
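For example (hypothetical numbers): with cost1 = 1000, cost2 = 2000, card1 = 100, and card2 = 50, the estimate is 1000 + 100 * 2000 + 100 * 50 = 206,000.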
estimateJoinCardinality code (it delegates to the static estimateTableJoinCardinality helper shown below):
/**
* Estimate the join cardinality of two tables.
*/
public static int estimateTableJoinCardinality(Predicate.Op joinOp,
String table1Alias, String table2Alias, String field1PureName,
String field2PureName, int card1, int card2, boolean t1pkey,
boolean t2pkey, Map<String, TableStats> stats,
Map<String, Integer> tableAliasToId) {
int card = 1;
if (joinOp == Predicate.Op.EQUALS) {
if (t1pkey && !t2pkey) {
card = card2;
} else if (!t1pkey && t2pkey) {
card = card1;
} else if (t1pkey && t2pkey) {
card = Math.min(card1, card2);
} else {
card = Math.max(card1, card2);
}
} else {
card = (int) (0.3 * card1 * card2);
}
return card <= 0 ? 1 : card;
}
The goal of this method is to estimate the cardinality of the join. For an equality join where exactly one side is a primary key, the result takes the cardinality of the non-key side, since each of its tuples matches at most one key tuple; when both sides are primary keys, the minimum of the two cardinalities; when neither side is a key, the maximum. For range predicates, a fixed 30% of the size of the cross product is used.
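For example (hypothetical numbers): an equality join of a 10,000-tuple table on its primary key against a 3,000-tuple non-key column is estimated at 3,000 tuples; if neither column is a key, max(10000, 3000) = 10,000; a range join is estimated at 0.3 * 10,000 * 3,000 = 9,000,000.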
Exercise 4
What needs to be implemented is the orderJoins method in JoinOptimizer.java.
The lab code already provides the enumerateSubsets method, which enumerates all subsets of a given size of a list; the computeCostAndCardOfSubplan method, which compares the candidate join orders and methods for a subplan and returns the cheapest plan; and the printJoins method, which displays the join plan graphically when the optimizer is run with the explain option.
orderJoins code:
public List<LogicalJoinNode> orderJoins(
Map<String, TableStats> stats,
Map<String, Double> filterSelectivities, boolean explain)
throws ParsingException {
// Not necessary for labs 1 and 2.
CostCard bestCostCard = new CostCard();
PlanCache planCache = new PlanCache();
for (int i = 1; i <= joins.size(); ++i) {
Set<Set<LogicalJoinNode>> subsets = enumerateSubsets(joins, i);
for (Set<LogicalJoinNode> subset : subsets) {
double bestCostSoFar = Double.MAX_VALUE;
bestCostCard = new CostCard();
for (LogicalJoinNode join : subset) {
CostCard costCard = computeCostAndCardOfSubplan(stats, filterSelectivities, join, subset, bestCostSoFar, planCache);
if (costCard != null) {
bestCostSoFar = costCard.cost;
bestCostCard = costCard;
}
}
if (bestCostSoFar != Double.MAX_VALUE) {
planCache.addPlan(subset, bestCostCard.cost, bestCostCard.card, bestCostCard.plan);
}
}
}
if (explain) {
printJoins(bestCostCard.plan, planCache, stats, filterSelectivities);
}
if (bestCostCard.plan != null) {
return bestCostCard.plan;
}
return joins;
}
The goal of this method is to compute a logically valid and reasonably efficient join order, using dynamic programming (the Selinger-style algorithm): for every subset size i from 1 to joins.size(), it enumerates all subsets of that size, finds the cheapest plan for each subset by trying every join node in it as the last one added (computeCostAndCardOfSubplan reuses the cached plans of smaller subsets), and stores the winner in the PlanCache. In the final iteration the only subset is the full join set, so bestCostCard then holds the overall best plan.
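As a concrete trace (hypothetical tables): with joins {A⋈B, B⋈C}, the size-1 pass caches the best plan for each single join; the size-2 pass then tries removing A⋈B and removing B⋈C from the full set in turn, extends the cached one-join plan with the removed node, and keeps whichever combination is cheaper.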
Run the unit test and the system test:
ant runtest -Dtest=JoinOptimizerTest
ant runsystest -Dtest=QueryTest
They should pass.
Extra Credit
The enumerateSubsets method supplied by the lab code builds subsets by brute-force enumeration, but only subsets of exactly the requested size are actually needed, so rewriting it as a depth-first search (DFS) that generates just those subsets is an improvement.
enumerateSubsets code:
public <T> Set<Set<T>> enumerateSubsets(List<T> v, int size) {
Set<Set<T>> els = new HashSet<>();
dfs(els, v, size, 0, new LinkedList<>());
return els;
}
private <T> void dfs(Set<Set<T>> els, List<T> v, int size, int curIdx, Deque<T> path) {
    if (path.size() == size) {
        // Collected a full subset of the requested size; snapshot it.
        els.add(new HashSet<>(path));
        return;
    }
    if (curIdx == v.size()) {
        // Ran out of elements before reaching the requested size.
        return;
    }
    // Branch 1: include v.get(curIdx) in the subset.
    path.addFirst(v.get(curIdx));
    dfs(els, v, size, curIdx + 1, path);
    path.removeFirst();
    // Branch 2: skip v.get(curIdx).
    dfs(els, v, size, curIdx + 1, path);
}
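Note that the DFS still walks the include/exclude tree, but it only allocates a HashSet when a subset of exactly the requested size is completed, instead of materializing and copying intermediate subsets of every smaller size the way the original brute-force version does. For v = [A, B, C] and size = 2, it emits {A, B}, {A, C}, {B, C}.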