问题

在使用git diff 展示c/c++文件修改内容时，除了显示修改上下文外，输出还贴心的展示了修改所在的函数。尽管这个展示并不总是准确，但是能够做到大部分情况下准确也已经相当不错：是不是git内置了c语言这种高级语言的语法分析器？另外，git的这种分析在什么情况下会不准确？

例如，在下面的例子中

tsecer@harry: cat -n tsecer.cpp

     1  int harry()

     2  {

     3

     4

     5

     6

     7

     8

     9

    10

    11      return -1;

    12  }

tsecer@harry: git diff tsecer.cpp

diff --git a/tsecer.cpp b/tsecer.cpp

index 7f143fd..037a629 100644

--- a/tsecer.cpp

+++ b/tsecer.cpp

@@ -8,5 +8,5 @@ int harry()

-    return 0;

+    return -1;

 }

把版本库中源文件11行的

return 0

修改为

return -1;

在git diff的输出中可以看到有函数上下文信息，也就是在@@引导的行号信息之后，有额外的疑似上下文的字符串，而这个"int harry()"也正是修改所在的函数

@@ -8,5 +8,5 @@ int harry()

git对于diff的配置

function-context

这个选项只是影响输出的上下文信息，但是不影响函数识别规则

   -W, --function-context

     Show whole surrounding functions of changes.

更多的配置

git对于diff更多个配置

Defining a custom hunk-header

Each group of changes (called a "hunk") in the textual diff output is prefixed with a line of the form:

@@ -k,l +n,m @@ TEXT

This is called a hunk header. The "TEXT" portion is by default a line that begins with an alphabet, an underscore or a dollar sign; this matches what GNU diff -p output uses. This default selection however is not suited for some contents, and you can use a customized pattern to make a selection.

First, in .gitattributes, you would assign the diff attribute for paths.

*.tex	diff=tex

Then, you would define a "diff.tex.xfuncname" configuration to specify a regular expression that matches a line that you would want to appear as the hunk header "TEXT". Add a section to your $GIT_DIR/config file (or $HOME/.gitconfig file) like this:

[diff "tex"]

	xfuncname = "^(\\\\(sub)*section\\{.*)$"

Note. A single level of backslashes are eaten by the configuration file parser, so you would need to double the backslashes; the pattern above picks a line that begins with a backslash, and zero or more occurrences of sub followed by section followed by open brace, to the end of line.

There are a few built-in patterns to make this easier, and tex is one of them, so you do not have to write the above in your configuration file (you still need to enable this with the attribute mechanism, via .gitattributes). The following built in patterns are available:

ada suitable for source code in the Ada language.

bash suitable for source code in the Bourne-Again SHell language. Covers a superset of POSIX shell function definitions.

bibtex suitable for files with BibTeX coded references.

cpp suitable for source code in the C and C++ languages.

csharp suitable for source code in the C# language.

css suitable for cascading style sheets.

dts suitable for devicetree (DTS) files.

elixir suitable for source code in the Elixir language.

fortran suitable for source code in the Fortran language.

fountain suitable for Fountain documents.

golang suitable for source code in the Go language.

html suitable for HTML/XHTML documents.

java suitable for source code in the Java language.

kotlin suitable for source code in the Kotlin language.

markdown suitable for Markdown documents.

matlab suitable for source code in the MATLAB and Octave languages.

objc suitable for source code in the Objective-C language.

pascal suitable for source code in the Pascal/Delphi language.

perl suitable for source code in the Perl language.

php suitable for source code in the PHP language.

python suitable for source code in the Python language.

ruby suitable for source code in the Ruby language.

rust suitable for source code in the Rust language.

scheme suitable for source code in the Scheme language.

tex suitable for source code for LaTeX documents.

从代码上看，从文件名找到使用的driver

static void diff_filespec_load_driver(struct diff_filespec *one,

				      struct index_state *istate)

{

	/* Use already-loaded driver */

	if (one->driver)

		return;

	if (S_ISREG(one->mode))

		one->driver = userdiff_find_by_path(istate, one->path);

	/* Fallback to default settings */

	if (!one->driver)

		one->driver = userdiff_find_by_name("default");

}

根据driver找到diff配置项的代码

/// @file: git-master\compat\regex\regex.h

/* If this bit is set, then use extended regular expression syntax.

   If not set, then use basic regular expression syntax.  */

#define REG_EXTENDED 1

int userdiff_config(const char *k, const char *v)

{

	struct userdiff_driver *drv;

	const char *name, *type;

	size_t namelen;

	if (parse_config_key(k, "diff", &name, &namelen, &type) || !name)

		return 0;

	drv = userdiff_find_by_namelen(name, namelen);

	if (!drv) {

		ALLOC_GROW(drivers, ndrivers+1, drivers_alloc);

		drv = &drivers[ndrivers++];

		memset(drv, 0, sizeof(*drv));

		drv->name = xmemdupz(name, namelen);

		drv->binary = -1;

	}

	if (!strcmp(type, "funcname"))

		return parse_funcname(&drv->funcname, k, v, 0);

	if (!strcmp(type, "xfuncname"))

		return parse_funcname(&drv->funcname, k, v, REG_EXTENDED);

	if (!strcmp(type, "binary"))

		return parse_tristate(&drv->binary, k, v);

	if (!strcmp(type, "command"))

		return git_config_string(&drv->external, k, v);

	if (!strcmp(type, "textconv"))

		return git_config_string(&drv->textconv, k, v);

	if (!strcmp(type, "cachetextconv"))

		return parse_bool(&drv->textconv_want_cache, k, v);

	if (!strcmp(type, "wordregex"))

		return git_config_string(&drv->word_regex, k, v);

	return 0;

}

context len

通过git diff的帮助手册可以看到，在默认情况下，diff输出的是修改内容上下文的长度为3(行)。在这个例子中也可以看到，前面3行是空行，完整无误的输出在了结果中。

从帮助文档也可以看到，还有其它的方法可以修改diff上下文信息，但是这个并不是关键，我们现在主要关心diff如何确定函数名。

-U<n>

--unified=<n>

Generate diffs with <n> lines of context instead of the usual three. Implies --patch.

diff.context

Generate diffs with <n> lines of context instead of the default of 3. This value is overridden by the -U option.

如何确定函数名

其实查看git的代码可以发现，git内部判断函数名的方法异常简单，根本没有什么语法分析。这个函数的主要逻辑就是判断一行的第一个字符串是字母(isalpha)字符即可，注意：空白，左大括号都不是字母字符。

回过头来看代码内容可以发现，虽然修改在第9行，但是由于前面都是空行，所以不满足函数名。

/// @file: git-master\xdiff\xemit.c

static long def_ff(const char *rec, long len, char *buf, long sz, void *priv)

{

	if (len > 0 &&

			(isalpha((unsigned char)*rec) || /* identifier? */

			 *rec == '_' || /* also identifier? */

			 *rec == '$')) { /* identifiers from VMS and other esoterico */

		if (len > sz)

			len = sz;

		while (0 < len && isspace((unsigned char)rec[len - 1]))

			len--;

		memcpy(buf, rec, len);

		return len;

	}

	return -1;

}

如何突破默认context长度为3的限制

正如前面所说，context len只有3行，如何找到3行之外的函数呢？

下面是对于每个差异输出的主体控制代码，其中有一个初始值为-1的变量funclineprev，这个其实表示了从差异向前搜索“函数名”的边界。也就是说，对于某个差异，git是从差异上下文开始位置向前搜索前面定义的"函数名"。由于初始值是-1，所以第一个差异会一直搜索到文件的开始。

int xdl_emit_diff(xdfenv_t *xe, xdchange_t *xscr, xdemitcb_t *ecb,

		  xdemitconf_t const *xecfg) {

	long s1, s2, e1, e2, lctx;

	xdchange_t *xch, *xche;

	long funclineprev = -1;

	struct func_line func_line = { 0 };

	for (xch = xscr; xch; xch = xche->next) {

		xdchange_t *xchp = xch;

		xche = xdl_get_hunk(&xch, xecfg);

		if (!xch)

			break;

pre_context_calculation:

		s1 = XDL_MAX(xch->i1 - xecfg->ctxlen, 0);

		s2 = XDL_MAX(xch->i2 - xecfg->ctxlen, 0);

/// ...

		/*

		 * Emit current hunk header.

		 */

		if (xecfg->flags & XDL_EMIT_FUNCNAMES) {

			get_func_line(xe, xecfg, &func_line,

				      s1 - 1, funclineprev);

			funclineprev = s1 - 1;

		}

/// ...

}

static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,

			  struct func_line *func_line, long start, long limit)

{

	long l, size, step = (start > limit) ? -1 : 1;

	char *buf, dummy[1];

	buf = func_line ? func_line->buf : dummy;

	size = func_line ? sizeof(func_line->buf) : sizeof(dummy);

	for (l = start; l != limit && 0 <= l && l < xe->xdf1.nrec; l += step) {

		long len = match_func_rec(&xe->xdf1, xecfg, l, buf, size);

		if (len >= 0) {

			if (func_line)

				func_line->len = len;

			return l;

		}

	}

	return -1;

}

识别错误的情况

例如，下面的例子git输出的“疑似”函数就是错误的

tsecer@harry: cat -n tsecer.cpp

     1  int x;

     2  int y;

     3  int 1;

     4  int 2;

     5  int 3;

     6  int Z;

tsecer@harry: git diff tsecer.cpp

diff --git a/tsecer.cpp b/tsecer.cpp

index b2540c2..df187dc 100644

--- a/tsecer.cpp

+++ b/tsecer.cpp

@@ -3,4 +3,4 @@ int y;

 int 1;

 int 2;

 int 3;

-int z;

+int Z;

tsecer@harry:

git使用的diff算法

从注释可以看到，git内部的diff主要使用这个文档描述的算法。

/// @file: git-master\xdiff\xdiffi.c

/*

 * See "An O(ND) Difference Algorithm and its Variations", by Eugene Myers.

 * Basically considers a "box" (off1, off2, lim1, lim2) and scan from both

 * the forward diagonal starting from (off1, off2) and the backward diagonal

 * starting from (lim1, lim2). If the K values on the same diagonal crosses

 * returns the furthest point of reach. We might encounter expensive edge cases

 * using this algorithm, so a little bit of heuristic is needed to cut the

 * search and to return a suboptimal point.

 */

static long xdl_split(unsigned long const *ha1, long off1, long lim1,

		      unsigned long const *ha2, long off2, long lim2,

		      long *kvdf, long *kvdb, int need_min, xdpsplit_t *spl,

		      xdalgoenv_t *xenv)

其实，svn的比较使用的也是类似的算法

/// @file: subversion-1.14.2\subversion\libsvn_diff\lcs.c

/*

 * Calculate the Longest Common Subsequence (LCS) between two datasources.

 * This function is what makes the diff code tick.

 *

 * The LCS algorithm implemented here is based on the approach described

 * by Sun Wu, Udi Manber and Gene Meyers in "An O(NP) Sequence Comparison

 * Algorithm", but has been modified for better performance.

 *

 * Let M and N be the lengths (number of tokens) of the two sources

 * ('files'). The goal is to reach the end of both sources (files) with the

 * minimum number of insertions + deletions. Since there is a known length

 * difference N-M between the files, that is equivalent to just the minimum

 * number of deletions, or equivalently the minimum number of insertions.

 * For symmetry, we use the lesser number - deletions if M<N, insertions if

 * M>N.

 *

 * Let 'k' be the difference in remaining length between the files, i.e.

 * if we're at the beginning of both files, k=N-M, whereas k=0 for the

 * 'end state', at the end of both files. An insertion will increase k by

 * one, while a deletion decreases k by one. If k<0, then insertions are

 * 'free' - we need those to reach the end state k=0 anyway - but deletions

 * are costly: Adding a deletion means that we will have to add an additional

 * insertion later to reach the end state, so it doesn't matter if we count

 * deletions or insertions. Similarly, deletions are free for k>0.

 *

 * Let a 'state' be a given position in each file {pos1, pos2}. An array

 * 'fp' keeps track of the best possible state (largest values of

 * {pos1, pos2}) that can be achieved for a given cost 'p' (# moves away

 * from k=0), as well as a linked list of what matches were used to reach

 * that state. For each new value of p, we find for each value of k the

 * best achievable state for that k - either by doing a costly operation

 * (deletion if k<0) from a state achieved at a lower p, or doing a free

 * operation (insertion if k<0) from a state achieved at the same p -

 * and in both cases advancing past any matching regions found. This is

 * handled by running loops over k in order of descending absolute value.

 *

 * A recent improvement of the algorithm is to ignore tokens that are unique

 * to one file or the other, as those are known from the start to be

 * impossible to match.

 */

算法的一些个人理解

引理

文章有两个简单的(lemma)，引理1主要论证了如何使用之前已经计算过的D值也迭代计算新的D值(也就是对于D的计算可以通过D，D-2， D-4…… -D的值来计算，而这些值在在计算D时都已经计算过了，可以在这些已知值的基础上继续计算)

引理2主要是论证这种方法获得路径是最优的。

V数组的意义

文章中已明确说明

Consequently, V is an array of integers where V[k] contains the row index of the endpoint of a furthest reaching path in diagonal k

数组V[i] = x，表示y - x = i这条对角线上，差值为D-1(其中D为当前外层循环迭代变量的值)的最大值。由于i和x都是已知的，所以根据y - x = i可以计算出y的坐标值，这样就获得了本次迭代的一个起始坐标点。

另外，由于D和D+1次迭代，它们互相错开，所以在存储的时候可以复用一个数组。

figure2算法中取舍的意义

这个由lemma2前定义的“furthest reaching”决定

A D-path is furthest reaching in diagonal k if and only if it is one of the D-paths ending on diagonal k whose end point has the greatest possible row (column) number of all such paths. Informally, of all D-paths ending in diagonal k, it ends furthest from the origin, (0,0)

直观上说，这个最远的意思是“距离原点”最远，因为在同一条对角线y-x=k上，x值最大也意味y值最大，进而意味着x+y(大致看做距离原点的曼哈顿距离)最大，而x越大也意味着已经匹配的字符串最长(相同差异量的前提下)。

对于对角线k来说，它有两个相邻的对角线k-1和k+1都可以到达，此时选择哪一个呢？算法给出了取v值较大的那个，由于v数组中存储的是x轴坐标，所以这个选择也等价于取x值较大的那个。

If k = -D or k ≠ D and V[k - 1] < V[k + 1] Then

x ← V[k + 1]

Else

x ← V[k - 1]+1

如何生成完成路径

文章说明，原始的算法只是计算了D的值，而没有办法生成完整路径(solution graph)，如果希望还原路径需要D^2量级的额外存储空间，存储每轮迭代中的完整路径。

The search of the greedy algorithm traces the optimal D-paths among others. But only the current set of furthest reaching endpoints are retained in V. Consequently, only the length of an SES/LCS can be reported in Line 12. To explicitly generate a solution path, O(D^2) space is used to store a copy of V after each iteration of the outer loop. Let Vd be the copy of V kept after the dth iteration.

git diff如何确定差异所在函数context的更多相关文章

git diff比较版本差异(生成补丁)
1.git diff [<options>] <commit> <commit> options 使用--name-only(git diff HEAD cd504 ...
消除Git diff中^M的差异
消除Git diff中^M的差异在Windows上把一个刚commit的文件夹上传到了Ubuntu.在Ubuntu上使用git status查看,发现很多文件都被红色标注,表示刚刚修改未add.在W ...
git学习(五) git diff操作
git diff操作 git diff用于比较差异: git diff 不加任何参数用于比较当前工作区跟暂存区的差异 git diff --cached 或者--staged 对比暂存区(git a ...
git diff 比较差异
说明以下命令可以不指定 <filename>,表示对全部文件操作. 命令涉及和 Git本地仓库对比的,均可指定 commit 的版本. HEAD 最近一次 commit HEAD^ 上次 ...
git diff获取差异文件中文乱码的解决办法
通过git的diff命令对两个commit id的版本进行差异化的对比.中文文件时出现乱码. git diff 6bded8d0c1fe1746c122121217dc0c88667091089 a9 ...
git diff 差异对比
转载原文: http://fsjoy.blog.51cto.com/318484/245465/ 1. 查看当前所有的更改情况.git status 结果有3部分,changes to be comm ...
git diff与linux diff的输出格式之unified format
前言前面有一篇文章<一个有些意思的项目--文件夹对比工具(一)>,里面简单讲了下diff算法之--Myers算法. 既然是算法,就会有实现,比如git diff中有Myers的实现,gi ...
git diff ^M的消除
这是由于换行符在不同的操作系统上定义的区别造成的. Windows用CR LF来定义换行,Linux用LF. CR全称是Carriage Return ,或者表示为\r, 意思是回车. LF全称是Li ...
git diff的用法
在git提交环节,存在三大部分:working tree(工作区), index file(暂存区:stage), commit(分支:master) working tree:就是你所工作在的目录, ...
Git diff 常见用法
Git diff 用于比较两次修改的差异 1.1 比较工作区与暂存区 git diff 比较的是单个仓库的工作区与暂存区的差别,repo diff是对git diff的封装,用来分别显示各个项目 ...

随机推荐

(pymssql._pymssql.OperationalError) (8152, b'String or binary data would be truncated.DB-Lib error message 20 018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n')
(pymssql._pymssql.OperationalError) (8152, b'String or binary data would be truncated.DB-Lib error m ...
Windows Service调试方法小结
方法1:log记录这是一个通用的调试方法,效率比较低,但比较实用,通过查看日志,总能达到调试的目的方法2:附加到进程这是Windows Service程序调试的常用方法,缺点是对Windows环 ...
[CSS]背景图片很大，根据屏幕缩小适配后，div之间有空隙的问题
RT.美术给的素材宽度是1080px的. 在不缩放的情况下,1080px宽度的屏幕显示div之间正常,没有空隙,但使用transform属性之后,div缩小,div之间有空隙(白线) 百度有人说给这些 ...
富文本编辑器第一次正常显示，第二次渲染失败 -----在使用laravel-admin 时
第二次显示解决方法: 在每次获取富文本编辑器实例的时候,先删除一下,避免之前已经实例化造成的渲染失败
ElasticSearch、ElasticSearch-head的安装和问题解决
前言:elasticsearch作为一个基于Lucene的分布式搜索引擎,其搜索功能的强大之处不用多说,而elasticsearch-head作为一个node项目,能够轻松管理elasticsearc ...
MariaDB 搭建主备及主主
一.主备可参考:MariaDB之GTID主从复制二.主主
【Excel】IF条件函数公式怎么用？
版本 Excel 2019 步骤点击插入函数打开文档,点击公式菜单下的插入函数. 双击选择IF函数在函数列表双击选择IF函数. 输入条件测试值在第一个输入框输入条件测试值. 设置输出结果值
react -hook 项目搭建
1.脚手架搭建 2.清除多余文件 3.搭建项目文件列表 4.引入公共样式 5.页面构建LOGIN 6.安装路由 v5 v6有区别 7.搭建路由文件router.js 懒加载配合supence使用 8. ...
Lua文件夹及文件操作
转载于:CSDN-刘长栋 -[[ @引用:require("FileLib") @调用:fileLib.createFolder(path) @功能: 1.创建文件夹 2.连续创建 ...
org.aspectj.lang不存在,引入失败。
问题:添加了依赖,或引入了jar包但是写aspect类时无法引入解决办法:

git diff如何确定差异所在函数context

问题