Maximum number of WAL files in the pg_xlog directory (2)
Hi,
As part of our monitoring work for our customers, we stumbled upon an issue with customers' servers that have a wal_keep_segments setting higher than 0.
We have a monitoring script that checks the number of WAL files in the pg_xlog directory against the settings of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula:
greatest(
(2 + checkpoint_completion_target) * checkpoint_segments + 1,
checkpoint_segments + wal_keep_segments + 1
)
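The check described above can be sketched in Python. This is only an illustration of the quoted formula, not the actual monitoring script: the function names and the `margin_pct` parameter (the "percentage" added on top) are hypothetical.

```python
import math


def documented_max_wal_files(checkpoint_segments,
                             checkpoint_completion_target,
                             wal_keep_segments=0):
    """Expected maximum number of files in pg_xlog, per the formula quoted above."""
    by_checkpoints = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    by_keep = checkpoint_segments + wal_keep_segments + 1
    return max(by_checkpoints, by_keep)


def alert_threshold(expected_files, margin_pct=10):
    """Add a safety margin (hypothetical default: 10%) and round up to whole files."""
    # Multiply before dividing so that whole-number results stay exact.
    return math.ceil(expected_files * (100 + margin_pct) / 100)


if __name__ == "__main__":
    expected = documented_max_wal_files(3, 0.5, wal_keep_segments=100)
    print(expected, alert_threshold(expected))
```

A monitoring script would compare the number of files actually present in pg_xlog against the value returned by `alert_threshold`.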
> Monitoring is another matter, and I don't really think a monitoring
> solution should count the WAL files. What actually really matters is the
> database availability, and that is covered with having enough disk space in
> the WALs partition.
If we don't count the WAL files, though, that eliminates the best way to
detect when archiving is failing.
>> > If we don't count the WAL files, though, that eliminates the best way to
>> > detect when archiving is failing.
> WAL files don't give you this directly. You may think it's an issue to get
> a lot of WAL files, but it can just be a spike of changes. Counting .ready
> files makes more sense when you're trying to see if wal archiving is
> failing. And now, using pg_stat_archiver is the way to go (thanks Gabriele
> :) ).
Yeah, a situation where we can't give our users any kind of reasonable
monitoring threshold at all sucks though. Also, it makes it kind of
hard to allocate a wal partition if it could be 10X the minimum size,
you know?
What happened to the work Heikki was doing on making transaction log
disk usage sane?
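The suggestion quoted above, counting .ready files to detect failing archiving, could look roughly like the following Python sketch. It assumes the .ready markers live in the archive_status subdirectory of pg_xlog; the function name and the example path are hypothetical. The pg_stat_archiver view mentioned above is queried through SQL and is not shown here.

```python
from pathlib import Path


def count_ready_files(pg_xlog_dir):
    """Count WAL segments still waiting to be archived (*.ready markers)."""
    status_dir = Path(pg_xlog_dir) / "archive_status"
    if not status_dir.is_dir():
        # No archive_status directory: nothing is waiting to be archived.
        return 0
    return sum(1 for entry in status_dir.iterdir()
               if entry.name.endswith(".ready"))


if __name__ == "__main__":
    # Hypothetical data directory; adjust to the cluster being monitored.
    print(count_ready_files("/var/lib/postgresql/9.4/main/pg_xlog"))
```

A count that keeps growing between two checks suggests that the archive command is failing, whereas a large number of WAL files alone may only reflect a spike of changes.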
On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge <[hidden email]> wrote:
> We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula:
>
> greatest(
> (2 + checkpoint_completion_target) * checkpoint_segments + 1,
> checkpoint_segments + wal_keep_segments + 1
> )
I think the first bug is even having this formula in the documentation to start with, and in trying to use it.
"and will normally not be more than..."
This may be "normal" for a toy system. I think that the normal state for any system worth monitoring is that it has had load spikes at some point in the past.
So it is the next part of the doc, which describes how many segments it climbs back down to upon recovering from a spike, which is the important one. And that doesn't mention wal_keep_segments at all, which surely cannot be correct.
I will try to independently derive the correct formula from the code, as you did, without looking too much at your derivation first, and see if we get the same answer.
On Mon, Nov 3, 2014 at 12:39:26PM -0800, Jeff Janes wrote:
> It looked to me that the formula, when descending from a previously stressed
> state, would be:
>
> greatest((1 + checkpoint_completion_target) * checkpoint_segments,
> wal_keep_segments) + 1 +
> 2 * checkpoint_segments + 1
I don't think we can assume checkpoint_completion_target is at all
reliable enough to base a maximum calculation on, assuming anything
above the maximum is cause for concern and something to inform the admins
about.
Assuming checkpoint_completion_target is 1 for maximum purposes, how
about:
max(2 * checkpoint_segments, wal_keep_segments) + 2 * checkpoint_segments + 2
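To see how the two proposals relate, here is a small Python sketch. It reads the first formula as greatest((1 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments) + 1 + 2 * checkpoint_segments + 1, which is an assumption about the intended parenthesization; the function names are hypothetical.

```python
def descending_formula(checkpoint_segments,
                       checkpoint_completion_target,
                       wal_keep_segments):
    """File count when descending from a previously stressed state (first formula)."""
    retained = max((1 + checkpoint_completion_target) * checkpoint_segments,
                   wal_keep_segments)
    # One +1 sits next to the retained segments, the other next to the
    # 2 * checkpoint_segments term, as in the quoted formula.
    return retained + 1 + 2 * checkpoint_segments + 1


def simplified_maximum(checkpoint_segments, wal_keep_segments):
    """Second formula: the same bound with checkpoint_completion_target taken as 1."""
    return (max(2 * checkpoint_segments, wal_keep_segments)
            + 2 * checkpoint_segments + 2)


if __name__ == "__main__":
    for cs, cct, wks in [(3, 0.5, 0), (3, 1, 0), (32, 0.9, 200)]:
        print(cs, cct, wks,
              descending_formula(cs, cct, wks),
              simplified_maximum(cs, wks))
```

With checkpoint_completion_target set to 1 the two functions return the same value, and for any target between 0 and 1 the simplified maximum is at least as large, which is what makes it usable as an alerting ceiling.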
Sorry for my very late answer. It's been a tough month.
2014-11-27 0:00 GMT+01:00 Bruce Momjian <[hidden email]>:
> Assuming checkpoint_completion_target is 1 for maximum purposes, how
> about:
>
> max(2 * checkpoint_segments, wal_keep_segments) + 2 * checkpoint_segments + 2
Seems like something I could agree on. At least, it makes sense, and it works for my customers. Although I'm wondering why "+ 2", and not "+ 1". It seems Jeff and you agree on this, so I may have misunderstood something.
On Tue, Dec 30, 2014 at 12:35 AM, Guillaume Lelarge <[hidden email]> wrote:
> Seems something I could agree on. At least, it makes sense, and it works for my customers. Although I'm wondering why "+ 2", and not "+ 1". It seems Jeff and you agree on this, so I may have misunderstood something.
From hazy memory, one +1 comes from the currently active WAL file, which exists but is counted neither towards wal_keep_segments nor towards recycled files. The other +1 comes from the formula for how many recycled files to retain, which explicitly has a +1 in it.
>
> (1 + checkpoint_completion_target) * checkpoint_segments + 1 +
> max(wal_keep_segments, checkpoint_segments)
Now that we have min_wal_size and max_wal_size in 9.5, I don't see any
value in figuring out the proper formula for backpatching.
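Notes:
- In the official documentation for PostgreSQL 9.1 through 9.4, the number of WAL files kept in pg_xlog is always given as (2 + checkpoint_completion_target) * checkpoint_segments + 1. As the discussion among the PostgreSQL experts above shows, that formula is problematic. A more accurate one is: max(2 * checkpoint_segments, wal_keep_segments) + 2 * checkpoint_segments + 2
- In addition, PostgreSQL 9.5 adds two new parameters, min_wal_size and max_wal_size. max_wal_size and checkpoint_completion_target control how much XLOG is generated before a checkpoint is triggered, while min_wal_size and max_wal_size control which XLOG files can be recycled. For details, see the article on 德哥's blog.
- This year's June issue of Taobao's database kernel monthly report (数据库内核月报) also mentions this problem. They noticed it because their WAL had grown too large, and the formula they arrived at is essentially the same as the one above, except that checkpoint_completion_target is not fixed at 1: max(wal_keep_segments, checkpoint_segments + checkpoint_segments*checkpoint_completion_target) + 2 * checkpoint_segments + 1 + 1. Interested readers can take a look, although it is far less entertaining than the debate among the experts above.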
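As a rough illustration of the first note, the following Python sketch compares the estimate from the 9.1 to 9.4 documentation with the corrected upper bound and converts the latter to disk space. The 16 MB segment size is an assumption (the default WAL segment size, which can be changed at build time), and the function names are hypothetical.

```python
WAL_SEGMENT_MB = 16  # assumed default WAL segment size


def documented_estimate(checkpoint_segments, checkpoint_completion_target):
    """Formula from the 9.1 to 9.4 documentation (ignores wal_keep_segments)."""
    return (2 + checkpoint_completion_target) * checkpoint_segments + 1


def corrected_bound(checkpoint_segments, wal_keep_segments):
    """Upper bound agreed on in the discussion above."""
    return (max(2 * checkpoint_segments, wal_keep_segments)
            + 2 * checkpoint_segments + 2)


def compare(checkpoint_segments, checkpoint_completion_target, wal_keep_segments):
    """Return (documented estimate, corrected bound, corrected bound in MB)."""
    bound = corrected_bound(checkpoint_segments, wal_keep_segments)
    return (documented_estimate(checkpoint_segments, checkpoint_completion_target),
            bound,
            bound * WAL_SEGMENT_MB)


if __name__ == "__main__":
    for settings in [(3, 0.5, 0), (32, 0.9, 0), (32, 0.9, 200)]:
        print(settings, compare(*settings))
```

The gap between the two numbers widens as soon as wal_keep_segments is larger than 2 * checkpoint_segments, which is the situation that triggered the original report.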