问题

在使用git diff 展示c/c++文件修改内容时,除了显示修改上下文外,输出还贴心的展示了修改所在的函数。尽管这个展示并不总是准确,但是能够做到大部分情况下准确也已经相当不错:是不是git内置了c语言这种高级语言的语法分析器?另外,git的这种分析在什么情况下会不准确?

例如,在下面的例子中

tsecer@harry: cat -n tsecer.cpp
1 int harry()
2 {
3
4
5
6
7
8
9
10
11 return -1;
12 }
tsecer@harry: git diff tsecer.cpp
diff --git a/tsecer.cpp b/tsecer.cpp
index 7f143fd..037a629 100644
--- a/tsecer.cpp
+++ b/tsecer.cpp
@@ -8,5 +8,5 @@ int harry() - return 0;
+ return -1;
}

把版本库中源文件11行的

return 0

修改为

return -1;

在git diff的输出中可以看到有函数上下文信息,也就是在@@引导的行号信息之后,有额外的疑似上下文的字符串,而这个"int harry()"也正是修改所在的函数

@@ -8,5 +8,5 @@ int harry()

git对于diff的配置

function-context

这个选项只是影响输出的上下文信息,但是不影响函数识别规则

   -W, --function-context
Show whole surrounding functions of changes.

更多的配置

git对于diff更多个配置

Defining a custom hunk-header
Each group of changes (called a "hunk") in the textual diff output is prefixed with a line of the form: @@ -k,l +n,m @@ TEXT
This is called a hunk header. The "TEXT" portion is by default a line that begins with an alphabet, an underscore or a dollar sign; this matches what GNU diff -p output uses. This default selection however is not suited for some contents, and you can use a customized pattern to make a selection. First, in .gitattributes, you would assign the diff attribute for paths. *.tex diff=tex
Then, you would define a "diff.tex.xfuncname" configuration to specify a regular expression that matches a line that you would want to appear as the hunk header "TEXT". Add a section to your $GIT_DIR/config file (or $HOME/.gitconfig file) like this: [diff "tex"]
xfuncname = "^(\\\\(sub)*section\\{.*)$"
Note. A single level of backslashes are eaten by the configuration file parser, so you would need to double the backslashes; the pattern above picks a line that begins with a backslash, and zero or more occurrences of sub followed by section followed by open brace, to the end of line. There are a few built-in patterns to make this easier, and tex is one of them, so you do not have to write the above in your configuration file (you still need to enable this with the attribute mechanism, via .gitattributes). The following built in patterns are available: ada suitable for source code in the Ada language. bash suitable for source code in the Bourne-Again SHell language. Covers a superset of POSIX shell function definitions. bibtex suitable for files with BibTeX coded references. cpp suitable for source code in the C and C++ languages. csharp suitable for source code in the C# language. css suitable for cascading style sheets. dts suitable for devicetree (DTS) files. elixir suitable for source code in the Elixir language. fortran suitable for source code in the Fortran language. fountain suitable for Fountain documents. golang suitable for source code in the Go language. html suitable for HTML/XHTML documents. java suitable for source code in the Java language. kotlin suitable for source code in the Kotlin language. markdown suitable for Markdown documents. matlab suitable for source code in the MATLAB and Octave languages. objc suitable for source code in the Objective-C language. pascal suitable for source code in the Pascal/Delphi language. perl suitable for source code in the Perl language. php suitable for source code in the PHP language. python suitable for source code in the Python language. ruby suitable for source code in the Ruby language. rust suitable for source code in the Rust language. scheme suitable for source code in the Scheme language. tex suitable for source code for LaTeX documents.

从代码上看,从文件名找到使用的driver

static void diff_filespec_load_driver(struct diff_filespec *one,
struct index_state *istate)
{
/* Use already-loaded driver */
if (one->driver)
return; if (S_ISREG(one->mode))
one->driver = userdiff_find_by_path(istate, one->path); /* Fallback to default settings */
if (!one->driver)
one->driver = userdiff_find_by_name("default");
}

根据driver找到diff配置项的代码

/// @file: git-master\compat\regex\regex.h
/* If this bit is set, then use extended regular expression syntax.
If not set, then use basic regular expression syntax. */
#define REG_EXTENDED 1 int userdiff_config(const char *k, const char *v)
{
struct userdiff_driver *drv;
const char *name, *type;
size_t namelen; if (parse_config_key(k, "diff", &name, &namelen, &type) || !name)
return 0; drv = userdiff_find_by_namelen(name, namelen);
if (!drv) {
ALLOC_GROW(drivers, ndrivers+1, drivers_alloc);
drv = &drivers[ndrivers++];
memset(drv, 0, sizeof(*drv));
drv->name = xmemdupz(name, namelen);
drv->binary = -1;
} if (!strcmp(type, "funcname"))
return parse_funcname(&drv->funcname, k, v, 0);
if (!strcmp(type, "xfuncname"))
return parse_funcname(&drv->funcname, k, v, REG_EXTENDED);
if (!strcmp(type, "binary"))
return parse_tristate(&drv->binary, k, v);
if (!strcmp(type, "command"))
return git_config_string(&drv->external, k, v);
if (!strcmp(type, "textconv"))
return git_config_string(&drv->textconv, k, v);
if (!strcmp(type, "cachetextconv"))
return parse_bool(&drv->textconv_want_cache, k, v);
if (!strcmp(type, "wordregex"))
return git_config_string(&drv->word_regex, k, v); return 0;
}

context len

通过git diff的帮助手册可以看到,在默认情况下,diff输出的是修改内容上下文的长度为3(行)。在这个例子中也可以看到,前面3行是空行,完整无误的输出在了结果中。

从帮助文档也可以看到,还有其它的方法可以修改diff上下文信息,但是这个并不是关键,我们现在主要关心diff如何确定函数名。

-U<n>
--unified=<n>
Generate diffs with <n> lines of context instead of the usual three. Implies --patch.
diff.context
Generate diffs with <n> lines of context instead of the default of 3. This value is overridden by the -U option.

如何确定函数名

其实查看git的代码可以发现,git内部判断函数名的方法异常简单,根本没有什么语法分析。这个函数的主要逻辑就是判断一行的第一个字符串是字母(isalpha)字符即可,注意:空白,左大括号都不是字母字符。

回过头来看代码内容可以发现,虽然修改在第9行,但是由于前面都是空行,所以不满足函数名。

/// @file: git-master\xdiff\xemit.c
static long def_ff(const char *rec, long len, char *buf, long sz, void *priv)
{
if (len > 0 &&
(isalpha((unsigned char)*rec) || /* identifier? */
*rec == '_' || /* also identifier? */
*rec == '$')) { /* identifiers from VMS and other esoterico */
if (len > sz)
len = sz;
while (0 < len && isspace((unsigned char)rec[len - 1]))
len--;
memcpy(buf, rec, len);
return len;
}
return -1;
}

如何突破默认context长度为3的限制

正如前面所说,context len只有3行,如何找到3行之外的函数呢?

下面是对于每个差异输出的主体控制代码,其中有一个初始值为-1的变量funclineprev,这个其实表示了从差异向前搜索“函数名”的边界。也就是说,对于某个差异,git是从差异上下文开始位置向前搜索前面定义的"函数名"。由于初始值是-1,所以第一个差异会一直搜索到文件的开始。

int xdl_emit_diff(xdfenv_t *xe, xdchange_t *xscr, xdemitcb_t *ecb,
xdemitconf_t const *xecfg) {
long s1, s2, e1, e2, lctx;
xdchange_t *xch, *xche;
long funclineprev = -1;
struct func_line func_line = { 0 }; for (xch = xscr; xch; xch = xche->next) {
xdchange_t *xchp = xch;
xche = xdl_get_hunk(&xch, xecfg);
if (!xch)
break; pre_context_calculation:
s1 = XDL_MAX(xch->i1 - xecfg->ctxlen, 0);
s2 = XDL_MAX(xch->i2 - xecfg->ctxlen, 0);
/// ...
/*
* Emit current hunk header.
*/ if (xecfg->flags & XDL_EMIT_FUNCNAMES) {
get_func_line(xe, xecfg, &func_line,
s1 - 1, funclineprev);
funclineprev = s1 - 1;
}
/// ...
} static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
struct func_line *func_line, long start, long limit)
{
long l, size, step = (start > limit) ? -1 : 1;
char *buf, dummy[1]; buf = func_line ? func_line->buf : dummy;
size = func_line ? sizeof(func_line->buf) : sizeof(dummy); for (l = start; l != limit && 0 <= l && l < xe->xdf1.nrec; l += step) {
long len = match_func_rec(&xe->xdf1, xecfg, l, buf, size);
if (len >= 0) {
if (func_line)
func_line->len = len;
return l;
}
}
return -1;
}

识别错误的情况

例如,下面的例子git输出的“疑似”函数就是错误的

tsecer@harry: cat -n tsecer.cpp
1 int x;
2 int y;
3 int 1;
4 int 2;
5 int 3;
6 int Z;
tsecer@harry: git diff tsecer.cpp
diff --git a/tsecer.cpp b/tsecer.cpp
index b2540c2..df187dc 100644
--- a/tsecer.cpp
+++ b/tsecer.cpp
@@ -3,4 +3,4 @@ int y;
int 1;
int 2;
int 3;
-int z;
+int Z;
tsecer@harry:

git使用的diff算法

从注释可以看到,git内部的diff主要使用这个文档描述的算法。

/// @file: git-master\xdiff\xdiffi.c
/*
* See "An O(ND) Difference Algorithm and its Variations", by Eugene Myers.
* Basically considers a "box" (off1, off2, lim1, lim2) and scan from both
* the forward diagonal starting from (off1, off2) and the backward diagonal
* starting from (lim1, lim2). If the K values on the same diagonal crosses
* returns the furthest point of reach. We might encounter expensive edge cases
* using this algorithm, so a little bit of heuristic is needed to cut the
* search and to return a suboptimal point.
*/
static long xdl_split(unsigned long const *ha1, long off1, long lim1,
unsigned long const *ha2, long off2, long lim2,
long *kvdf, long *kvdb, int need_min, xdpsplit_t *spl,
xdalgoenv_t *xenv)

其实,svn的比较使用的也是类似的算法

/// @file: subversion-1.14.2\subversion\libsvn_diff\lcs.c
/*
* Calculate the Longest Common Subsequence (LCS) between two datasources.
* This function is what makes the diff code tick.
*
* The LCS algorithm implemented here is based on the approach described
* by Sun Wu, Udi Manber and Gene Meyers in "An O(NP) Sequence Comparison
* Algorithm", but has been modified for better performance.
*
* Let M and N be the lengths (number of tokens) of the two sources
* ('files'). The goal is to reach the end of both sources (files) with the
* minimum number of insertions + deletions. Since there is a known length
* difference N-M between the files, that is equivalent to just the minimum
* number of deletions, or equivalently the minimum number of insertions.
* For symmetry, we use the lesser number - deletions if M<N, insertions if
* M>N.
*
* Let 'k' be the difference in remaining length between the files, i.e.
* if we're at the beginning of both files, k=N-M, whereas k=0 for the
* 'end state', at the end of both files. An insertion will increase k by
* one, while a deletion decreases k by one. If k<0, then insertions are
* 'free' - we need those to reach the end state k=0 anyway - but deletions
* are costly: Adding a deletion means that we will have to add an additional
* insertion later to reach the end state, so it doesn't matter if we count
* deletions or insertions. Similarly, deletions are free for k>0.
*
* Let a 'state' be a given position in each file {pos1, pos2}. An array
* 'fp' keeps track of the best possible state (largest values of
* {pos1, pos2}) that can be achieved for a given cost 'p' (# moves away
* from k=0), as well as a linked list of what matches were used to reach
* that state. For each new value of p, we find for each value of k the
* best achievable state for that k - either by doing a costly operation
* (deletion if k<0) from a state achieved at a lower p, or doing a free
* operation (insertion if k<0) from a state achieved at the same p -
* and in both cases advancing past any matching regions found. This is
* handled by running loops over k in order of descending absolute value.
*
* A recent improvement of the algorithm is to ignore tokens that are unique
* to one file or the other, as those are known from the start to be
* impossible to match.
*/

算法的一些个人理解

引理

文章有两个简单的(lemma),引理1主要论证了如何使用之前已经计算过的D值也迭代计算新的D值(也就是对于D的计算可以通过D,D-2, D-4…… -D的值来计算,而这些值在在计算D时都已经计算过了,可以在这些已知值的基础上继续计算)

引理2主要是论证这种方法获得路径是最优的。

V数组的意义

文章中已明确说明

Consequently, V is an array of integers where V[k] contains the row index of the endpoint of a furthest reaching path in diagonal k

数组V[i] = x,表示y - x = i这条对角线上,差值为D-1(其中D为当前外层循环迭代变量的值)的最大值。由于i和x都是已知的,所以根据y - x = i可以计算出y的坐标值,这样就获得了本次迭代的一个起始坐标点。

另外,由于D和D+1次迭代,它们互相错开,所以在存储的时候可以复用一个数组。

figure2算法中取舍的意义

这个由lemma2前定义的“furthest reaching”决定

A D-path is furthest reaching in diagonal k if and only if it is one of the D-paths ending on diagonal k whose end point has the greatest possible row (column) number of all such paths. Informally, of all D-paths ending in diagonal k, it ends furthest from the origin, (0,0)

直观上说,这个最远的意思是“距离原点”最远,因为在同一条对角线y-x=k上,x值最大也意味y值最大,进而意味着x+y(大致看做距离原点的曼哈顿距离)最大,而x越大也意味着已经匹配的字符串最长(相同差异量的前提下)。

对于对角线k来说,它有两个相邻的对角线k-1和k+1都可以到达,此时选择哪一个呢?算法给出了取v值较大的那个,由于v数组中存储的是x轴坐标,所以这个选择也等价于取x值较大的那个。

  1. If k = -D or k ≠ D and V[k - 1] < V[k + 1] Then
  2. x ← V[k + 1]
  3. Else
  4. x ← V[k - 1]+1

如何生成完成路径

文章说明,原始的算法只是计算了D的值,而没有办法生成完整路径(solution graph),如果希望还原路径需要D^2量级的额外存储空间,存储每轮迭代中的完整路径。

The search of the greedy algorithm traces the optimal D-paths among others. But only the current set of furthest reaching endpoints are retained in V. Consequently, only the length of an SES/LCS can be reported in Line 12. To explicitly generate a solution path, O(D^2) space is used to store a copy of V after each iteration of the outer loop. Let Vd be the copy of V kept after the dth iteration.

git diff如何确定差异所在函数context的更多相关文章

  1. git diff比较版本差异(生成补丁)

    1.git diff [<options>] <commit> <commit> options 使用--name-only(git diff HEAD cd504 ...

  2. 消除Git diff中^M的差异

    消除Git diff中^M的差异 在Windows上把一个刚commit的文件夹上传到了Ubuntu.在Ubuntu上使用git status查看,发现很多文件都被红色标注,表示刚刚修改未add.在W ...

  3. git学习(五) git diff操作

    git diff操作 git diff用于比较差异: git diff 不加任何参数 用于比较当前工作区跟暂存区的差异 git diff --cached 或者--staged 对比暂存区(git a ...

  4. git diff 比较差异

    说明 以下命令可以不指定 <filename>,表示对全部文件操作. 命令涉及和 Git本地仓库对比的,均可指定 commit 的版本. HEAD 最近一次 commit HEAD^ 上次 ...

  5. git diff获取差异文件中文乱码的解决办法

    通过git的diff命令对两个commit id的版本进行差异化的对比.中文文件时出现乱码. git diff 6bded8d0c1fe1746c122121217dc0c88667091089 a9 ...

  6. git diff 差异对比

    转载原文: http://fsjoy.blog.51cto.com/318484/245465/ 1. 查看当前所有的更改情况.git status 结果有3部分,changes to be comm ...

  7. git diff与linux diff的输出格式之unified format

    前言 前面有一篇文章<一个有些意思的项目--文件夹对比工具(一)>,里面简单讲了下diff算法之--Myers算法. 既然是算法,就会有实现,比如git diff中有Myers的实现,gi ...

  8. git diff ^M的消除

    这是由于换行符在不同的操作系统上定义的区别造成的. Windows用CR LF来定义换行,Linux用LF. CR全称是Carriage Return ,或者表示为\r, 意思是回车. LF全称是Li ...

  9. git diff的用法

    在git提交环节,存在三大部分:working tree(工作区), index file(暂存区:stage), commit(分支:master) working tree:就是你所工作在的目录, ...

  10. Git diff 常见用法

      Git diff 用于比较两次修改的差异 1.1 比较工作区与暂存区 git diff  比较的是单个仓库的工作区与暂存区的差别,repo diff是对git diff的封装,用来分别显示各个项目 ...

随机推荐

  1. vue中的Swiper使用slideTo提示no function

    参考官网资料解决:

  2. Mybatis-plus中sql语句各查询条件含义

    lt:less than 小于le:less than or equal to 小于等于eq:equal to 等于ne:not equal to 不等于ge:greater than or equa ...

  3. MSSql 跨服务器查询解决方案

    先确定从服务器是否允许有外部链接的服务器: select * from sys.servers 没有的话,需要添加服务器链接: EXEC sp_addlinkedserver @server='10. ...

  4. react项目--redux封装

    index.ts 1 const store = { 2 state: { 3 num: 20, 4 }, 5 actions: { 6 // 放同步的代码 7 add1(newState: { nu ...

  5. mac使用expect登录跳板机后的机器

    两个文档 #!/usr/bin/expect -f #连接文件名字记录 set ip [lindex $argv 0] catch {spawn ssh 1.1.1.1}## ip地址换成自己的 ex ...

  6. Java基础-类型转换、变量、变量命名规范

    类型转换 强制转换 (类型)变量名 高-->低 自动转换 低-->高 注意点 不能对布尔值进行转换 不能把对象类型转换为不相干的类型 在把高容量转换到低容量的时候,强制转换 转换的时候可能 ...

  7. js获取当天时间,凌晨0点

    凌晨0点 fields['startTime']=new Date(new Date(fields.searchTime2[0]).toLocaleDateString()).getTime() 当天 ...

  8. 一套.NET Core +WebAPI+Vue前后端分离权限框架

    今天给大家推荐一个基于.Net Core开发的企业级的前后端分离权限框架. 项目简介 这是基于.NetCore开发的.构建的简单.跨平台.前后端分离的框架.此项目代码清晰.层级分明.有着完善的权限功能 ...

  9. webpack的加载器兼容配置一览

    "devDependencies": { "css-loader": "^3.2.1", "file-loader": ...

  10. C++和C中的输入输出总结、标准输入/标准输出/标准错误与重定向,>>、>、<、<<

    标准输入/标准输出/标准错误与重定向 0表示标准输入.1表示标准输出.2标准错误.1和2都是默认是输出到屏幕. linux中的>>.>.<.<<:这些符号是Linu ...