sqlserver 调试WINDBG ---troubleshootingsql.com
https://troubleshootingsql.com/tag/stack-dump/
Debugging that latch timeout
My last post of debugging an assertion didn’t have any cool debugging tips since there is not much that you can do with an assertion dump unless you have access to private symbols and sometimes even access to the source code. In this post, I am going to not disappoint and show you some more cool things that the windows debugger can do for you with public symbols for a latch timeout issue.
When you encounter a latch timeout (buffer or non-buffer latch), the first occurrence of it’s type generates a mini-dump. If there are further occurrences of the same latch timeout, then that is reported as an error message in the SQL Errorlog.
Buffer latch timeouts are typically reported using Error: 844 and 845. The common reasons for such errors are documented in a KB Article. For a non-buffer latch timeout, you will get the an 847 error.
Error # | Error message template (from sys.messages) |
844 | Time out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait. |
845 | Time-out occurred while waiting for buffer latch type %d for page %S_PGID, database ID %d. |
846 | A time-out occurred while waiting for buffer latch — type %d, bp %p, page %d:%d, stat %#x, database id: %d, allocation unit Id: %I64d%ls, task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Not continuing to wait. |
847 |
Timeout occurred while waiting for latch: class ‘%ls’, id %p, type %d, Task 0x%p : %d, waittime %d, flags 0x%I64x, owning task 0x%p. Continuing to wait. |
This is what you will see in the SQL Errorlog when a latch timeout occurs.
spid148 Time out occurred while waiting for buffer latch —type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009,database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 300, flags 0x1a, owning task 0x0000000003C129B8. Continuing to wait.
spid148 **Dump thread – spid = 148, PSS = 0x000000044DC17BD0, EC = 0x000000044DC17BE0
spid148 ***Stack Dump being sent to D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.txtspid148 * Latch timeout
spid148 * Input Buffer 84 bytes –
spid148 * DBCC CHECKDB WITH ALL_ERRORMSGS
External dump process returned no errors.
I have only pasted the relevant portion from the Errorlog for brevity. As I have outlined in my previous blog posts on similar topics, that there is a large opportunity for due diligence that can be done with the help of the Windows Event Logs and the SQL Server Errorlogs before you start spawning off windows debugger to analyze the memory dump on your system. The first few obvious things that you will notice is that SPID 148 encountered the issue while performing a CHECKDB on database ID 120. The timeout occurred while waiting for a buffer latch on a page (Page ID is available in the message above). I don’t see a “Timeout waiting for external dump process” message in the SQL Errorlog which means that I have a good chance of extracting useful information from the mini-dump that was generated by SQLDumper.
Latch timeouts are typically victims of either a system related issue (hardware or drivers or operating system or a previous error encountered by SQL Server). So the next obvious action item would be to look into the SQL Errorlogs and find out if there were any additional errors prior to the latch timeout issue. I see a number of OS Error 1450 reported by the same SPID 148 for the same file handle but different offsets.
spid148 The operating system returned error1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x0000156bf36000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.
Additionally, I see prior and post (within 5-15 minutes) the latch timeout issue, multiple other SPIDs reporting the same 1450 error message for different offsets but again on the same file.
spid137 The operating system returned error1450(Insufficient system resources exist to complete the requested service.) to SQL Server during a write at offset 0x000007461f8000 in file with handle 0x0000000000001358. This is usually a temporary condition and the SQL Server will keep retrying the operation. If the condition persists then immediate action must be taken to correct it.
I also see the latch timeout message being reported after every 300 ms for the same page and the database.
spid148 Time out occurred while waiting for buffer latch — type 4, bp 0000000832FE1200, page 3:11234374, stat 0x7c20009,database id: 120, allocation unit id: 72057599731367936, task 0x0000000003C4F2E8 : 0, waittime 82800, flags 0x1a, owning task 0x0000000003C129B8. Continuing to wait.
Notice the waittime above, it has increased from 300 to 82800!! So the next thing I do is look up issues related to CHECKDB and 1450 error messages on the web using Bing (Yes, I do use BING!!). These are relevant posts related to the above issue.
http://blogs.msdn.com/b/psssql/archive/2008/07/10/sql-server-reports-operating-system-error-1450-or-1452-or-665-retries.aspx
http://blogs.msdn.com/b/psssql/archive/2009/03/04/sparse-file-errors-1450-or-665-due-to-file-fragmentation-fixes-and-workarounds.aspx
As of now, it is quite clear that the issue is related to a possible sparse file issue related to file fragmentation. Now it is time for me to check if there are other threads in the dump waiting on SyncWritePreemptivecalls.
Use the location provided in the Errorlog snippet reporting the Latch Timeout message to locate the mini-dump for the issue (in this case SQLDump0001.mdmp).
Now when you load the dump using WinDBG, you will see the following information:
Loading Dump File [D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0001.mdmp]
User Mini Dump File: Only registers, stack and portions of memory are availableComment: ‘Stack Trace’
Comment: ‘Latch timeout’This dump file has an exception of interest stored in it.
The above tells you that this is a mini-dump for a Latch Timeoutcondition and the location from where you loaded the dump. Then I use the command to set my symbol path and direct the symbols downloaded from the Microsoft symbol server to a local symbol file cache on my machine.
.sympath srv*D:\PublicSymbols*http://msdl.microsoft.com/download/symbols
Then I issue a reload command to load the symbols for sqlservr.exe. This can also be done using CTRL+L and providing the complete string above (without .sympath), checking the Reload checkbox and clicking on OK. The only difference here is that the all the public symbols for all loaded modules in the dump will be downloaded from the Microsoft Symbol Server which are available.
.reload /f sqlservr.exe
Next thing is to verify that the symbols were correctly loaded using thelmvm sqlservr command. If the symbols were loaded correctly, you should see the following output. Note the text in green.
0:005> lmvm sqlservr
start end module name
00000000`01000000 00000000`03668000 sqlservr T (pdb symbols) d:\publicsymbols\sqlservr.pdb\2A3969D78EE24FD494837AF090F5EDBC2\sqlservr.pdb
If symbols were not loaded, then you will see an output as shown below.
0:005> lmvm sqlservr
start end module name
00000000`01000000 00000000`03668000 sqlservr (export symbols) sqlservr.exe
I will use the !findstack command to locate all threads which have the function call SyncWritePreemptive on their callstack.
0:070> !findstack sqlservr!FCB::SyncWritePreemptive 0
Thread 069, 1 frame(s) match
Thread 074, 1 frame(s) match
Thread 076, 1 frame(s) match
Thread 079, 1 frame(s) match
Thread 081, 1 frame(s) match
Thread 082, 1 frame(s) match
Thread 086, 1 frame(s) match
Thread 089, 1 frame(s) match
Thread 091, 1 frame(s) match
Thread 095, 1 frame(s) match
Thread 098, 1 frame(s) match
Thread 099, 1 frame(s) match
Thread 104, 1 frame(s) match
Thread 106, 1 frame(s) match
Thread 107, 1 frame(s) match
Thread 131, 1 frame(s) match
Thread 136, 1 frame(s) match0:070> ~81s
ntdll!ZwWaitForSingleObject+0xa:
00000000`77ef0a2a c3 ret
0:081> kL100ntdll!ZwDelayExecution
kernel32!SleepEx
sqlservr!FCB::SyncWritePreemptive
sqlservr!FCB::PullPageToReplica
sqlservr!alloca_probe
sqlservr!BUF::CopyOnWrite
sqlservr!PageRef::PrepareToDirty
sqlservr!RecoveryMgr::DoCOWPreWrites
You could get all the callstacks with the function that you are searching for using the command: !findstack sqlservr!FCB::SyncWritePreemptive 3
If you look at the thread that raised ended up raising the Latch Timeout warning was also performing a CHECKDB.
0:074> .ecxr
0:074> kL100
kernel32!RaiseException
sqlservr!CDmpDump::Dump
sqlservr!CImageHelper::DoMiniDump
sqlservr!stackTrace
sqlservr!LatchBase::DumpOnTimeoutIfNeeded
sqlservr!LatchBase::PrintWarning
sqlservr!alloca_probe
sqlservr!BUF::AcquireLatch
…
…
sqlservr!UtilDbccCreateReplica
sqlservr!UtilDbccCheckDatabase
sqlservr!DbccCheckDB
sqlservr!DbccCommand::Execute
If you cannot find the thread which raised the Latch Timeout warning, you could print out all the callstacks in the dump using ~*kL100 and the searching for the function call in blue above. It is quite clear from the callstack above that the thread was also involved in performing a CHECKDB operation as reported in the SQL Errorlog in the input buffer for the Latch Timeout dump.
If case you were not able to identify the issue right off the bat, you need to check the build that you are on and look for issues that were addressed related to Latch Timeouts for the SQL Server release that you are using. The symptoms section would have sufficient amount of information for you to compare with your current symptoms and scenario and determine if the KB Article that you are looking at is applicable in your case.
Now is the time, when you need to have some context about the operations that were occurring on the server to actually determine what the actual issue is. Based on what I heard from the system administrators that there was a CHECKDB being executed on the database while the application was executing DML operations on the database. Additionally, the volume on which the disk resides on has fragmentation and the database in question is large (>750GB).
Based on the two MSDN blog posts that I mentioned above, it is quite clear that it is possible to run into sparse file limitations when there is high amount of fragmentation on the disks or if there are a large number of concurrent changes occurring on the database when a CHECKDB is executing on it. A number of Windows and SQL Server updates along with workarounds to run CHECKDB on such databases is mentioned in the second blog post mentioned above. On a separate note, this is not an issue with CHECKDB! It is limitation that you are hitting with sparse files on the OS layer. Remember CHECKDB, starting from SQL Server 2005, creates an internal snapshot (makes sparse file) to execute the consistency check. Paul Randal’s tweet made me add this line to call this out explicitly!
As always… if you are still stuck, contact Microsoft CSS with the mini-dump file, SQL Errorlog and the Windows Event Logs. It might be quite possible that CSS might ask you to collect additional data as most Latch Timeout issues are generally an after-effect of a previous issue. In this case, it was the OS Error 1450.
Well… That’s it for today! Hope this is helpful for the next time you encounter a Latch Timeout issue.
sqlserver 调试WINDBG ---troubleshootingsql.com的更多相关文章
- .NET 5 程序高级调试-WinDbg
上周和大家分享了.NET 5开源工作流框架elsa,程序跑起来后,想看一下后台线程的执行情况.抓了个进程Dump后,使用WinDbg调试,加载SOS调试器扩展,结果无法正常使用了: 0:000> ...
- SQLServer调试
1.普通调试 直接点击SSMS客户端上的调试按钮即可 2.存储过程调试 2.1 定义存储过程(以Northwind数据库为例) USE [Northwind] GO /****** Object: S ...
- SQLSERVER LATCH WINDBG
https://mssqlwiki.com/2012/09/07/latch-timeout-and-sql-server-latch/
- 使用Windbg和SoS扩展调试分析.NET程序
在博客堂的不是我舍不得 - High CPU in GC(都是+=惹的祸,为啥不用StringBuilder呢?). 不是我舍不得 - .NET里面的Out Of Memory 看到很多人在问如何分析 ...
- Windbg是windows平台上强大的调试器
基础调试命令 - .dump/.dumpcap/.writemem/!runaway Windbg是windows平台上强大的调试器,它相对于其他常见的IDE集成的调试器有几个重要的优势, Windb ...
- windbg预览版,windbg preview配置win7x64双机调试
目录 一丶简介 二丶步骤 1.下载Windbg Preview (windbg预览版本) 2.配置虚拟机端口 3.虚拟机设置调试湍口 4.windbg preview开始调试. 一丶简介 Windbg ...
- 使用WinDbg内核调试[转]
Technorati 标签: windbg,内核调试 WINDOWS调试工具很强大,但是学习使用它们并不容易.特别对于驱动开发者使用的WinDbg和KD这两个内核调试器(CDB和NTSD是用户态调试器 ...
- WinDBG 调试命令大全
转载收藏于:http://www.cnblogs.com/kekec/archive/2012/12/02/2798020.html #调试命令窗口 ++++++++++++++++++++++++ ...
- 基础调试命令 - .dump/.dumpcap/.writemem/!runaway
Windbg是windows平台上强大的调试器,它相对于其他常见的IDE集成的调试器有几个重要的优势, Windbg可以做内核态调试 Windbg可以脱离源代码进行调试 Windbg可以用来分析dum ...
随机推荐
- MySQL中大于等于小于等于的写法
由于在mybatis框架的xml中<= , >=解析会出现问题,编译报错,所以需要转译 第一种写法: 原符号 < <= > >= & ' " 替换 ...
- 「6月雅礼集训 2017 Day1」说无可说
[题目大意] 给出n个字符串,求有多少组字符串之间编辑距离为1~8. n<=200,∑|S| <= 10^6 [题解] 首先找编辑距离有一个n^2的dp,由于发现只找小于等于8的,所以搜旁 ...
- 以root启动pycharm
在使用scapy模块的时候提示permitted就猜想可能是权限问题.然后换成root启动啥事情都没了. 由于本机是deepin先找到pycharm.sh脚本 然后再执行sudo ./pycharm. ...
- HDU2444(判断是否为二分图,求最大匹配)
The Accomodation of Students Time Limit: 5000/1000 MS (Java/Others) Memory Limit: 32768/32768 K ( ...
- 怎么重启shell ubuntu
sunosfind . -type f | xargs grep count 怎么重启shell ubuntu方法一:退出,重新登录方法二:source /etc/profile
- 使用libssh2库实现支持密码参数的ssh2客户端
使用libssh2库实现支持密码参数的ssh2客户端 http://blog.chinaunix.net/uid-24382173-id-229823.html libssh2的简单应用 http:/ ...
- java 定时器的三种方式
原地址:http://blog.csdn.net/haorengoodman/article/details/23281343/ /** * 普通thread * 这是最常见的,创建一个thread, ...
- django日志的设置
关于django的日志设置详细可以看下官方文档:https://yiyibooks.cn/xx/Django_1.11.6/topics/logging.html 示例: # 日志文件配置 LOGGI ...
- MATLAB二维插值和三维插值
插值问题描述:已知一个函数上的若干点,但函数具体表达式未知,现在要利用已知的若干点求在其他点处的函数值,这个过程就是插值的过程. 1.一维插值 一维插值就是给出y=f(x)上的点(x1,y1),(x2 ...
- 将datatable导出为excel的三种方式(转)
一.使用Microsoft.Office.Interop.Excel.DLL 需要安装Office 代码如下: 2 public static bool ExportExcel(Sy ...