git diff如何确定差异所在函数context

问题

在使用git diff 展示c/c++文件修改内容时,除了显示修改上下文外,输出还贴心的展示了修改所在的函数。尽管这个展示并不总是准确,但是能够做到大部分情况下准确也已经相当不错:是不是git内置了c语言这种高级语言的语法分析器?另外,git的这种分析在什么情况下会不准确?

例如,在下面的例子中

tsecer@harry: cat -n tsecer.cpp 
     1  int harry()
     2  {
     3
     4
     5
     6
     7
     8
     9
    10
    11      return -1;
    12  }
tsecer@harry: git diff tsecer.cpp 
diff --git a/tsecer.cpp b/tsecer.cpp
index 7f143fd..037a629 100644
--- a/tsecer.cpp
+++ b/tsecer.cpp
@@ -8,5 +8,5 @@ int harry()
 
 
 
-    return 0;
+    return -1;
 }

把版本库中源文件11行的

return 0

修改为

return -1;

在git diff的输出中可以看到有函数上下文信息,也就是在@@引导的行号信息之后,有额外的疑似上下文的字符串,而这个"int harry()"也正是修改所在的函数

@@ -8,5 +8,5 @@ int harry()

git对于diff的配置

function-context

这个选项只是影响输出的上下文信息,但是不影响函数识别规则

   -W, --function-context   
     Show whole surrounding functions of changes.  

更多的配置

git对于diff更多个配置

Defining a custom hunk-header
Each group of changes (called a "hunk") in the textual diff output is prefixed with a line of the form:

@@ -k,l +n,m @@ TEXT
This is called a hunk header. The "TEXT" portion is by default a line that begins with an alphabet, an underscore or a dollar sign; this matches what GNU diff -p output uses. This default selection however is not suited for some contents, and you can use a customized pattern to make a selection.

First, in .gitattributes, you would assign the diff attribute for paths.

*.tex	diff=tex
Then, you would define a "diff.tex.xfuncname" configuration to specify a regular expression that matches a line that you would want to appear as the hunk header "TEXT". Add a section to your $GIT_DIR/config file (or $HOME/.gitconfig file) like this:

[diff "tex"]
	xfuncname = "^(\\\\(sub)*section\\{.*)$"
Note. A single level of backslashes are eaten by the configuration file parser, so you would need to double the backslashes; the pattern above picks a line that begins with a backslash, and zero or more occurrences of sub followed by section followed by open brace, to the end of line.

There are a few built-in patterns to make this easier, and tex is one of them, so you do not have to write the above in your configuration file (you still need to enable this with the attribute mechanism, via .gitattributes). The following built in patterns are available:

ada suitable for source code in the Ada language.

bash suitable for source code in the Bourne-Again SHell language. Covers a superset of POSIX shell function definitions.

bibtex suitable for files with BibTeX coded references.

cpp suitable for source code in the C and C++ languages.

csharp suitable for source code in the C# language.

css suitable for cascading style sheets.

dts suitable for devicetree (DTS) files.

elixir suitable for source code in the Elixir language.

fortran suitable for source code in the Fortran language.

fountain suitable for Fountain documents.

golang suitable for source code in the Go language.

html suitable for HTML/XHTML documents.

java suitable for source code in the Java language.

kotlin suitable for source code in the Kotlin language.

markdown suitable for Markdown documents.

matlab suitable for source code in the MATLAB and Octave languages.

objc suitable for source code in the Objective-C language.

pascal suitable for source code in the Pascal/Delphi language.

perl suitable for source code in the Perl language.

php suitable for source code in the PHP language.

python suitable for source code in the Python language.

ruby suitable for source code in the Ruby language.

rust suitable for source code in the Rust language.

scheme suitable for source code in the Scheme language.

tex suitable for source code for LaTeX documents.

从代码上看,从文件名找到使用的driver

static void diff_filespec_load_driver(struct diff_filespec *one,
				      struct index_state *istate)
{
	/* Use already-loaded driver */
	if (one->driver)
		return;

	if (S_ISREG(one->mode))
		one->driver = userdiff_find_by_path(istate, one->path);

	/* Fallback to default settings */
	if (!one->driver)
		one->driver = userdiff_find_by_name("default");
}

根据driver找到diff配置项的代码

/// @file: git-master\compat\regex\regex.h
/* If this bit is set, then use extended regular expression syntax.
   If not set, then use basic regular expression syntax.  */
#define REG_EXTENDED 1

int userdiff_config(const char *k, const char *v)
{
	struct userdiff_driver *drv;
	const char *name, *type;
	size_t namelen;

	if (parse_config_key(k, "diff", &name, &namelen, &type) || !name)
		return 0;

	drv = userdiff_find_by_namelen(name, namelen);
	if (!drv) {
		ALLOC_GROW(drivers, ndrivers+1, drivers_alloc);
		drv = &drivers[ndrivers++];
		memset(drv, 0, sizeof(*drv));
		drv->name = xmemdupz(name, namelen);
		drv->binary = -1;
	}

	if (!strcmp(type, "funcname"))
		return parse_funcname(&drv->funcname, k, v, 0);
	if (!strcmp(type, "xfuncname"))
		return parse_funcname(&drv->funcname, k, v, REG_EXTENDED);
	if (!strcmp(type, "binary"))
		return parse_tristate(&drv->binary, k, v);
	if (!strcmp(type, "command"))
		return git_config_string(&drv->external, k, v);
	if (!strcmp(type, "textconv"))
		return git_config_string(&drv->textconv, k, v);
	if (!strcmp(type, "cachetextconv"))
		return parse_bool(&drv->textconv_want_cache, k, v);
	if (!strcmp(type, "wordregex"))
		return git_config_string(&drv->word_regex, k, v);

	return 0;
}

context len

通过git diff的帮助手册可以看到,在默认情况下,diff输出的是修改内容上下文的长度为3(行)。在这个例子中也可以看到,前面3行是空行,完整无误的输出在了结果中。

从帮助文档也可以看到,还有其它的方法可以修改diff上下文信息,但是这个并不是关键,我们现在主要关心diff如何确定函数名。

-U<n>
--unified=<n>
Generate diffs with <n> lines of context instead of the usual three. Implies --patch.
diff.context
Generate diffs with <n> lines of context instead of the default of 3. This value is overridden by the -U option.

如何确定函数名

其实查看git的代码可以发现,git内部判断函数名的方法异常简单,根本没有什么语法分析。这个函数的主要逻辑就是判断一行的第一个字符串是字母(isalpha)字符即可,注意:空白,左大括号都不是字母字符。

回过头来看代码内容可以发现,虽然修改在第9行,但是由于前面都是空行,所以不满足函数名。

/// @file: git-master\xdiff\xemit.c
static long def_ff(const char *rec, long len, char *buf, long sz, void *priv)
{
	if (len > 0 &&
			(isalpha((unsigned char)*rec) || /* identifier? */
			 *rec == '_' || /* also identifier? */
			 *rec == '$')) { /* identifiers from VMS and other esoterico */
		if (len > sz)
			len = sz;
		while (0 < len && isspace((unsigned char)rec[len - 1]))
			len--;
		memcpy(buf, rec, len);
		return len;
	}
	return -1;
}

如何突破默认context长度为3的限制

正如前面所说,context len只有3行,如何找到3行之外的函数呢?

下面是对于每个差异输出的主体控制代码,其中有一个初始值为-1的变量funclineprev,这个其实表示了从差异向前搜索“函数名”的边界。也就是说,对于某个差异,git是从差异上下文开始位置向前搜索前面定义的"函数名"。由于初始值是-1,所以第一个差异会一直搜索到文件的开始。

int xdl_emit_diff(xdfenv_t *xe, xdchange_t *xscr, xdemitcb_t *ecb,
		  xdemitconf_t const *xecfg) {
	long s1, s2, e1, e2, lctx;
	xdchange_t *xch, *xche;
	long funclineprev = -1;
	struct func_line func_line = { 0 };

	for (xch = xscr; xch; xch = xche->next) {
		xdchange_t *xchp = xch;
		xche = xdl_get_hunk(&xch, xecfg);
		if (!xch)
			break;

pre_context_calculation:
		s1 = XDL_MAX(xch->i1 - xecfg->ctxlen, 0);
		s2 = XDL_MAX(xch->i2 - xecfg->ctxlen, 0);
/// ...
		/*
		 * Emit current hunk header.
		 */

		if (xecfg->flags & XDL_EMIT_FUNCNAMES) {
			get_func_line(xe, xecfg, &func_line,
				      s1 - 1, funclineprev);
			funclineprev = s1 - 1;
		}
/// ...
}

static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
			  struct func_line *func_line, long start, long limit)
{
	long l, size, step = (start > limit) ? -1 : 1;
	char *buf, dummy[1];

	buf = func_line ? func_line->buf : dummy;
	size = func_line ? sizeof(func_line->buf) : sizeof(dummy);

	for (l = start; l != limit && 0 <= l && l < xe->xdf1.nrec; l += step) {
		long len = match_func_rec(&xe->xdf1, xecfg, l, buf, size);
		if (len >= 0) {
			if (func_line)
				func_line->len = len;
			return l;
		}
	}
	return -1;
}

识别错误的情况

例如,下面的例子git输出的“疑似”函数就是错误的

tsecer@harry: cat -n tsecer.cpp 
     1  int x;
     2  int y;
     3  int 1;
     4  int 2;
     5  int 3;
     6  int Z;
tsecer@harry: git diff tsecer.cpp 
diff --git a/tsecer.cpp b/tsecer.cpp
index b2540c2..df187dc 100644
--- a/tsecer.cpp
+++ b/tsecer.cpp
@@ -3,4 +3,4 @@ int y;
 int 1;
 int 2;
 int 3;
-int z;
+int Z;
tsecer@harry: 

git使用的diff算法

从注释可以看到,git内部的diff主要使用这个文档描述的算法。

/// @file: git-master\xdiff\xdiffi.c
/*
 * See "An O(ND) Difference Algorithm and its Variations", by Eugene Myers.
 * Basically considers a "box" (off1, off2, lim1, lim2) and scan from both
 * the forward diagonal starting from (off1, off2) and the backward diagonal
 * starting from (lim1, lim2). If the K values on the same diagonal crosses
 * returns the furthest point of reach. We might encounter expensive edge cases
 * using this algorithm, so a little bit of heuristic is needed to cut the
 * search and to return a suboptimal point.
 */
static long xdl_split(unsigned long const *ha1, long off1, long lim1,
		      unsigned long const *ha2, long off2, long lim2,
		      long *kvdf, long *kvdb, int need_min, xdpsplit_t *spl,
		      xdalgoenv_t *xenv)

其实,svn的比较使用的也是类似的算法

/// @file: subversion-1.14.2\subversion\libsvn_diff\lcs.c
/*
 * Calculate the Longest Common Subsequence (LCS) between two datasources.
 * This function is what makes the diff code tick.
 *
 * The LCS algorithm implemented here is based on the approach described
 * by Sun Wu, Udi Manber and Gene Meyers in "An O(NP) Sequence Comparison
 * Algorithm", but has been modified for better performance.
 *
 * Let M and N be the lengths (number of tokens) of the two sources
 * ('files'). The goal is to reach the end of both sources (files) with the
 * minimum number of insertions + deletions. Since there is a known length
 * difference N-M between the files, that is equivalent to just the minimum
 * number of deletions, or equivalently the minimum number of insertions.
 * For symmetry, we use the lesser number - deletions if M<N, insertions if
 * M>N.
 *
 * Let 'k' be the difference in remaining length between the files, i.e.
 * if we're at the beginning of both files, k=N-M, whereas k=0 for the
 * 'end state', at the end of both files. An insertion will increase k by
 * one, while a deletion decreases k by one. If k<0, then insertions are
 * 'free' - we need those to reach the end state k=0 anyway - but deletions
 * are costly: Adding a deletion means that we will have to add an additional
 * insertion later to reach the end state, so it doesn't matter if we count
 * deletions or insertions. Similarly, deletions are free for k>0.
 *
 * Let a 'state' be a given position in each file {pos1, pos2}. An array
 * 'fp' keeps track of the best possible state (largest values of
 * {pos1, pos2}) that can be achieved for a given cost 'p' (# moves away
 * from k=0), as well as a linked list of what matches were used to reach
 * that state. For each new value of p, we find for each value of k the
 * best achievable state for that k - either by doing a costly operation
 * (deletion if k<0) from a state achieved at a lower p, or doing a free
 * operation (insertion if k<0) from a state achieved at the same p -
 * and in both cases advancing past any matching regions found. This is
 * handled by running loops over k in order of descending absolute value.
 *
 * A recent improvement of the algorithm is to ignore tokens that are unique
 * to one file or the other, as those are known from the start to be
 * impossible to match.
 */

算法的一些个人理解

引理

文章有两个简单的(lemma),引理1主要论证了如何使用之前已经计算过的D值也迭代计算新的D值(也就是对于D的计算可以通过D,D-2, D-4…… -D的值来计算,而这些值在在计算D时都已经计算过了,可以在这些已知值的基础上继续计算)
引理2主要是论证这种方法获得路径是最优的。

V数组的意义

文章中已明确说明

Consequently, V is an array of integers where V[k] contains the row index of the endpoint of a furthest reaching path in diagonal k

数组V[i] = x,表示y - x = i这条对角线上,差值为D-1(其中D为当前外层循环迭代变量的值)的最大值。由于i和x都是已知的,所以根据y - x = i可以计算出y的坐标值,这样就获得了本次迭代的一个起始坐标点。

另外,由于D和D+1次迭代,它们互相错开,所以在存储的时候可以复用一个数组。

figure2算法中取舍的意义

这个由lemma2前定义的“furthest reaching”决定

A D-path is furthest reaching in diagonal k if and only if it is one of the D-paths ending on diagonal k whose end point has the greatest possible row (column) number of all such paths. Informally, of all D-paths ending in diagonal k, it ends furthest from the origin, (0,0)

直观上说,这个最远的意思是“距离原点”最远,因为在同一条对角线y-x=k上,x值最大也意味y值最大,进而意味着x+y(大致看做距离原点的曼哈顿距离)最大,而x越大也意味着已经匹配的字符串最长(相同差异量的前提下)。

对于对角线k来说,它有两个相邻的对角线k-1和k+1都可以到达,此时选择哪一个呢?算法给出了取v值较大的那个,由于v数组中存储的是x轴坐标,所以这个选择也等价于取x值较大的那个。

  1. If k = -D or k ≠ D and V[k - 1] < V[k + 1] Then
  2. x ← V[k + 1]
  3. Else
  4. x ← V[k - 1]+1

如何生成完成路径

文章说明,原始的算法只是计算了D的值,而没有办法生成完整路径(solution graph),如果希望还原路径需要D^2量级的额外存储空间,存储每轮迭代中的完整路径。

The search of the greedy algorithm traces the optimal D-paths among others. But only the current set of furthest reaching endpoints are retained in V. Consequently, only the length of an SES/LCS can be reported in Line 12. To explicitly generate a solution path, O(D^2) space is used to store a copy of V after each iteration of the outer loop. Let Vd be the copy of V kept after the dth iteration.

posted on 2022-12-08 20:33  tsecer  阅读(585)  评论(0编辑  收藏  举报

导航