测试环境 PostgreSQL 库连接不上—案例分享
文章目录
- 业务有问题, 业务开发反馈数据库链接不上
- 查看问题
- 查看系统启动日志
业务有问题, 业务开发反馈数据库链接不上
查看问题
命令行当前可以链接, 查看日志
发现日志文件没有生成, 查看最新可看的日志
2025-07-28 14:49:46.065 UTC 124306:PANIC: could not write to file "pg_logical/replorigin_checkpoint.tmp": No space left on device
2025-07-28 14:49:46.069 UTC 124303:LOG: checkpointer process (PID 124306) was terminated by signal 6: Aborted
2025-07-28 14:49:46.069 UTC 124303:LOG: terminating any other active server processes
2025-07-28 14:49:46.105 UTC 124303:LOG: all server processes terminated; reinitializing
2025-07-28 14:49:46.529 UTC 810665:LOG: database system was interrupted; last known up at 2025-07-28 14:34:56 UTC
2025-07-28 14:49:46.533 UTC [unknown] postgres 192.117.15.140(42692) imos 810669:FATAL: the database system is in recovery mode
发现磁盘空间满了
查看 log_directory , 发现值是 log, 与正确的 /pg_log 不一致
postgres=# show log_directory ;log_directory
---------------log
(1 row)
查看 postgresql.conf 文件
[root@vm110 database]# ll /pgdata/postgresql.conf
-rw------- 1 postgres postgres 0 Jul 29 10:23 /pgdata/postgresql.conf
发现文件被修改过了
[root@vm110 pg_log]# stat /pgdata/postgresql.confFile: /pgdata/postgresql.confSize: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 8,83 Inode: 7077966 Links: 1
Access: (0600/-rw-------) Uid: ( 1001/postgres) Gid: ( 1001/postgres)
Access: 2025-07-29 10:23:12.336922653 +0800
Modify: 2025-07-29 10:23:10.000917984 +0800
Change: 2025-07-29 14:12:53.973472792 +0800Birth: 2025-07-17 11:50:26.404484263 +0800
查看系统启动日志
# 查看最近的关机和启动记录
journalctl --list-boots
IDX BOOT ID FIRST ENTRY LAST ENTRY 0 66930e8981b74fa7a7118e05983b950c Tue 2025-07-29 10:22:40 CST Tue 2025-07-29 15:18:03 CST# 查看当前启动的日志
journalctl -b 0
发现 PG 没有启动成功Jul 29 10:23:12 vm110 rc.local[2059]: stopped waiting
Jul 29 10:23:12 vm110 rc.local[2059]: pg_ctl: could not start server
Jul 29 10:23:12 vm110 rc.local[2059]: Examine the log output.
Jul 29 10:23:12 vm110 su[2043]: pam_unix(su-l:session): session closed for user postgres
Jul 29 10:23:12 vm110 smartd[1186]: Device: /dev/sdn [SAT], ST500DM009-2F110A, S/N:ZA47TTGM, WWN:5-000c50-0a46bd034, FW:CC43, 500 GB
Jul 29 10:23:12 vm110 smartd[1186]: Device: /dev/sdn [SAT], found in smartd database 7.3/5319: Seagate BarraCuda 3.5 (CMR)
Jul 29 10:23:12 vm110 rc.local[1639]: [10:23:12 initme] Check DB...
Jul 29 10:23:12 vm110 su[2234]: (to postgres) root on none
Jul 29 10:23:12 vm110 su[2234]: pam_unix(su-l:session): session opened for user postgres(uid=1001) by (uid=0)
Jul 29 10:23:12 vm110 dockerd[2047]: time="2025-07-29T10:23:12.624298340+08:00" level=info msg="Loading containers: start."
Jul 29 10:23:12 vm110 smartd[1186]: Device: /dev/sdn [SAT], is SMART capable. Adding to "monitor" list.# 查看系统日志中是否有 postgresql.conf 配置文件相关信息,
果然发现了,开机启动时候, 会从 /postgresql/script/pgsql/postgresql.conf 拷贝并复制到 /pgdata/postgresql.conf
此时磁盘空间满了, 所以清空正常, 但是写入失败了# journalctl -b 0 |grep postgresql
Jul 29 10:23:10 vm110 rc.local[1896]: cp: error copying '/postgresql/script/pgsql/postgresql.conf' to '/pgdata/postgresql.conf': No space left on device
Jul 29 10:23:11 vm110 rc.local[1639]: [10:23:11 initme] Start postgresql...
Jul 29 10:24:35 vm110 rc.local[1639]: [10:24:35 initme] postgresql started ERR!
紧接着数据库起来了, 但是日志文件没有正常生成.
查看重启信息
[root@vm110 pg_log]# last reboot |head -3
reboot system boot 6.4.0 Tue Jul 29 10:23 still running
reboot system boot 6.4.0 Mon Jul 28 15:24 still running
reboot system boot 6.4.0 Thu Jul 17 11:56 still running
[root@vm110 pg_log]#
可以看到 07.29 10:23 有过 reboot 系统, 同一分钟内, $PGDATA/postgresql.conf 文件被改过, 数据库启动了,但是有点异常
查看 PG 何时启动的 , 可以看到是 11:40 才启动的
[root@vm110 database]# ps -ef |grep 499848 |grep database
postgres 499848 1 0 11:40 ? 00:00:05 /program/bin/postgres -D /pgdata
[root@vm110 database]#
[root@vm110 database]#
[root@vm110 database]# ps -eo 'lstart,cmd,pid' | grep 499848 |grep -v grep
Tue Jul 29 11:40:22 2025 /program/bin/postgres -D /v 499848
回过来看, 为啥业务无法访问数据库, 因为 $PGDATA/postgresql.conf 文件内容被清空了, 所以 listen_address相当于是默认值 127.0.0.1
所以无法正常接受 tcp/ip 访问了
但是为啥 $PGDATA/postgresql.conf文件被清空了, 这个是由于 initme.sh代码中导致的
[root@vm110 pg_log]# grep postgresql.conf /program/bin/initme.sh -in
205: su - postgres -c "export LD_LIBRARY_PATH=/lib:$ROOTPATH/program/lib:/usr/lib/; export PGDATA=$DATABASEDIR; cp $DB_SQLPATH/postgresql.conf $DATABASEDIR/ -rf"
# 大概是如下这样一个过程
昨晚磁盘空间满 --> PG 挂了 --> 今天系统 reboot --> PG 配置文件被清空(initme.sh脚本在空间不够情况下导致) --> PG 以空配置(默认配置)启动(失败)[硬盘空间满了] --> 人为清理过文件(使得空间有一定空闲) ?
–> 重启 server.sh 服务 --> PG 起来了但是不接受连接 --> 业务服务异常 --> 产生大量 core dump 文件 --> DBA 发现 PG 配置异常,然后手动恢复 PG 配置 --> 重启 server.sh restart --> 服务恢复 --> 测试清理冗余文件