##SQL Version :Microsoft SQL Server 2019 (RTM-CU17) (KB5016394) - 15.0.4249.2 (X64)
확인사항]
먼저 DTC실패(Error: 8509)기록된후 Rollback하는 과정에서 Dump가 생성으며, 해당 이슈는 이미 알려진 이슈로 CU21에서 FIX가 되었습니다.
해당 Dump는 in-memory Table사용과정에서 발생할수있으며, 고객사같은 경우는 DTC Transaction이 Rollback 되는 과정에서 해당 known issue로 인한 Dump가 생성었고 이로인한 가용성그룹이 Resolving상태로 빠지게됩니다.
이후 Leasetime out이 발생하나 FailureConditionLevel(2)로 Failover가 자동적으로 Triggered되지 않았습니다.
#ErrorLog.2 2024-04-02 14:01:03.93 spid62 Error: 8509, Severity: 16, State: 1. 2024-04-02 14:01:03.93 spid62 Import of Microsoft Distributed Transaction Coordinator (MS DTC) transaction failed: 0x8004d00e(XACT_E_NOTRANSACTION). 2024-04-02 14:01:35.11 spid156 Error: 8509, Severity: 16, State: 1. 2024-04-02 14:01:35.11 spid156 Import of Microsoft Distributed Transaction Coordinator (MS DTC) transaction failed: 0x8004d00e(XACT_E_NOTRANSACTION). 2024-04-02 14:02:01.15 spid853 Error: 8509, Severity: 16, State: 1. 2024-04-02 14:02:01.15 spid853 Import of Microsoft Distributed Transaction Coordinator (MS DTC) transaction failed: 0x8004d00e(XACT_E_NOTRANSACTION). 2024-04-02 14:02:35.74 spid35s CImageHelper::Init () Version-specific dbghelp.dll is not used 2024-04-02 14:02:35.74 spid35s Using 'dbghelp.dll' version '4.0.5' 2024-04-02 14:02:35.75 spid35s **Dump thread - spid = 0, EC = 0x0000020BD57C1E10 ..중략.. 2024-04-02 14:02:35.75 spid35s ***Stack Dump being sent to \SQLDump0017.txt 2024-04-02 14:02:35.75 spid35s * ******************************************************************************* 2024-04-02 14:02:35.75 spid35s * 2024-04-02 14:02:35.75 spid35s * BEGIN STACK DUMP: 2024-04-02 14:02:35.75 spid35s * 04/02/24 14:02:35 spid 35 2024-04-02 14:02:35.75 spid35s * 2024-04-02 14:02:35.75 spid35s * Location: "sql\\ntdbms\\hekaton\\engine\\core\\tx.cpp":6337 2024-04-02 14:02:35.75 spid35s * Expression: Dependencies.CommitDepCountOut >= 1 2024-04-02 14:02:35.75 spid35s * SPID: 35 2024-04-02 14:02:35.75 spid35s * Process ID: 5652 2024-04-02 14:03:23.88 spid35s Timeout waiting for external dump process 21796. 2024-04-02 14:03:23.88 spid35s Error: 17066, Severity: 16, State: 1. 2024-04-02 14:03:23.88 spid35s SQL Server Assertion: File: <"sql\\ntdbms\\hekaton\\engine\\core\\tx.cpp">, line=6337 Failed Assertion = 'Dependencies.CommitDepCountOut >= 1'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted. ..중략.. 2024-04-02 14:03:23.88 Server The state of the local availability replica in availability group 'L_AVG' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The state changed because the lease between the local availability replica and Windows Server Failover Clustering (WSFC) has expired. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console. 2024-04-02 14:03:23.88 spid189s The availability group database "DATABASE_P03" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required. 2024-04-02 14:03:23.88 spid325s The availability group database " DATABASE_03 _IF" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required. 2024-04-02 14:03:23.88 spid851s The availability group database " DATABASE_03_HMI_2" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required. 2024-04-02 14:03:23.88 spid849s The availability group database " DATABASE_03_P03" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required. #ClusterLog 00002c40.00003588::2024/04/02-14:02:40.450 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'system' health state has been changed from 'clean' to 'warning' at 2024-04-02 14:02:40.447 00002c40.00003588::2024/04/02-14:02:40.453 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'query_processing' health state has been changed from 'clean' to 'warning' at 2024-04-02 14:02:40.447 00002c40.00000844::2024/04/02-14:02:57.490 ERR [RES] SQL Server Availability Group <L_AVG>: [hadrag] Lease renewal failed with timeout error 00002c40.00000844::2024/04/02-14:02:57.490 ERR [RES] SQL Server Availability Group <L_AVG>: [hadrag] The lease is expired. The lease should have been renewed by 2024/04/02-12:02:47.490 00002c40.00003858::2024/04/02-14:02:58.046 ERR [RES] SQL Server Availability Group <L_AVG>: [hadrag] Availability Group lease is no longer valid 00002c40.00003858::2024/04/02-14:02:58.046 ERR [RES] SQL Server Availability Group <L_AVG>: [hadrag] Resource Alive result 0. 00002c40.00003858::2024/04/02-14:02:58.046 ERR [RES] SQL Server Availability Group <L_AVG>: [hadrag] Availability Group lease is no longer valid 00002c40.00003858::2024/04/02-14:02:58.046 ERR [RES] SQL Server Availability Group <L_AVG>: [hadrag] Resource Alive result 0. 00002c40.00003858::2024/04/02-14:02:58.046 WARN [RHS] Resource L_AVG IsAlive has indicated failure. 00001cec.00002838::2024/04/02-14:02:58.046 INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'L_AVG', gen(0) result 1/0. 00002c40.00003858::2024/04/02-14:02:58.046 INFO [RHS-WER] Scheduling WER ERROR report in 10.000. ReportId 2f1efff4-5677-4f46-8c8c-9162e4c155db; 00001cec.00002838::2024/04/02-14:02:58.046 INFO [RCM] Res L_AVG: Online -> ProcessingFailure( StateUnknown ) 00001cec.00002838::2024/04/02-14:02:58.046 INFO [RCM] TransitionToState(L_AVG) Online-->ProcessingFailure. |
해당 이슈가 Failover를 Trigger하는 부차적인 이슈가 있으며, 이로인한 failover를 반복적으로 수행되나 해당 역할을 Online시키지 못함에 따라 반복적으로 dump로 인한 failover가 수행되었습니다.
이과정에서 large transaction rollback및 반복적인 failover시도로인한 DB연결이슈(983)가 발생한 것으로 확인하였습니다.
2024/04/02-14:03:23.916 INFO [RCM] rcm::RcmApi::MoveGroup: (Group:L_AVG Dest:1 Flags:0 MoveType:MoveType::Manual Cur.State:Pending, ContextSize:0) 2024/04/02-14:03:23.917 ERR [RCM] rcm::RcmApi::OnlineResource: (5023)' because of 'The API call is not valid while resource is in the [Terminating to DelayRestartingResource] state.' 2024-04-02 14:03:23.93 spid725s Error: 3624, Severity: 20, State: 1. 2024-04-02 14:03:23.93 spid725s A system assertion check has failed. Check the SQL Server error log for details. Typically, an assertion failure is caused by a software bug or data corruption. To check for database corruption, consider running DBCC CHECKDB. If you agreed to send dumps to Microsoft during setup, a mini dump will be sent to Microsoft. An update might be available from Microsoft in the latest Service Pack or in a Hotfix from Technical Support. 2024-04-02 14:03:23.93 spid725s Error: 3314, Severity: 21, State: 3. 2024-04-02 14:03:23.93 spid725s During undoing of a logged operation in database 'tempdb' (page (1:139) if any), an error occurred at log record ID (17626:408077:603). Typically, the specific failure is logged previously as an error in the operating system error log. Restore the database or file from a backup, or repair the database. ..중략.. 2024-04-02 14:04:22.71 Logon Error: 983, Severity: 14, State: 1. 2024-04-02 14:04:22.71 Logon Unable to access availability database 'DATABASE_P03' because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later. 2024-04-02 14:04:30.07 spid189s Nonqualified transactions are being rolled back in database DATABASE_P03 for an Always On Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required. 2024-04-02 14:04:32.07 spid189s Nonqualified transactions are being rolled back in database DATABASE_P03 for an Always On Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required. 2024-04-02 14:04:32.99 spid653s Remote harden of transaction 'GhostCleanupTask' (ID 0x00000072638fd8cd 0001:3a728b14) started at Apr 2 2024 2:04PM in database 'DATABASE_P03' at LSN (170966:295662:202) failed. 2024-04-02 14:04:33.67 spid325s Attempting to recover in-doubt distributed transactions involving Microsoft Distributed Transaction Coordinator (MS DTC). This is an informational message only. No user action is required. 2024-04-02 14:04:33.67 spid37s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [5777243E-1C2F-4399-8B8F-8F006F9FD4EE]->[CFF305E1-8C94-4DB3-85D7-0EDD5502E88E], database [DATABASE_P03_IF], remote endpoint [TCP://LES.es.net:5022], source operation [8909B701-63D0-4C65-BE06-CE7BC04DD049]: Transitioning from [CATCHUP] to [COMPLETED].' 2024-04-02 14:04:33.67 spid46s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [5777243E-1C2F-4399-8B8F-8F006F9FD4EE]->[7ED01BC6-97DF-41C4-84E9-737FE70CB40B], database [DATABASE_P03_IF], remote endpoint [TCP://LES.es.net:5022], source operation [79597AD0-EB94-4E28-ACC6-0E791C5BB0E0]: Transitioning from [CATCHUP] to [COMPLETED].' 2024-04-02 14:04:33.67 spid974s [DbMgrPartnerCommitPolicy::SetReplicaInfoInAg] DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: CFF305E1-8C94-4DB3-85D7-0EDD5502E88E:10:4 2024-04-02 14:04:34.07 spid189s Nonqualified transactions are being rolled back in database DATABASE_P03 for an Always On Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required. |
이후 고객사에서 LES Server재시작(2024-04-02 14:49)이후 정상적으로 'DATABASE_P03' Online되며 AG가 정상적으로 Online되었습니다.
2024-04-02 14:49:51.76 Server SQL Server is terminating because of a system shutdown. This is an informational message only. No user action is required. 2024-04-02 15:00:39.69 spid46s Always On Availability Groups connection with secondary database established for primary database 'DATABASE_P03' on the availability replica 'LES2' with Replica ID: {cff305e1-8c94-4db3-85d7-0edd5502e88e}. This is an informational message only. No user action is required. 2024-04-02 15:00:39.81 spid139s Always On Availability Groups connection with secondary database established for primary database 'DATABASE_P03' on the availability replica 'LES2' with Replica ID: {cff305e1-8c94-4db3-85d7-0edd5502e88e}. This is an informational message only. No user action is required. 2024-04-02 15:00:52.81 spid79s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [5777243E-1C2F-4399-8B8F-8F006F9FD4EE]->[CFF305E1-8C94-4DB3-85D7-0EDD5502E88E], database [DATABASE_P03], remote endpoint [TCP://LES2.es.net:5022], source operation [719D2611-5566-404F-A02F-B08684AE5A07]: Transitioning from [CATCHUP] to [COMPLETED].' 2024-04-02 15:00:57.02 spid46s [DbMgrPartnerCommitPolicy::SetReplicaInfoInAg] DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: CFF305E1-8C94-4DB3-85D7-0EDD5502E88E:11:4 2024-04-02 15:00:57.02 spid46s DbMgrPartnerCommitPolicy::SetSyncState: CFF305E1-8C94-4DB3-85D7-0EDD5502E88E:11:4 |
[제안사항]
해당 이슈는 CU21에서 FIX되었으며, 최소 CU21로 patch하시는 것을 제안드립니다.
Cumulative update 21 for SQL Server 2019 (KB5025808) - SQL Server | Microsoft Learn
다만, 기존 known issue들에 의 한 잠재적인 위험요소가 있기에 가장 마지막 버젼인 CU25로 업그레이드 하시는 것을 권장드립니다.
'Database > SQL Server' 카테고리의 다른 글
파티션 테이블 압축 (0) | 2024.07.04 |
---|---|
AlwaysOn 구성환경에서 Server Role 체크 후 Job Agent 실행 중지하기 (0) | 2023.02.14 |
임의의 쿼리를 HTML 테이블로 바꾸는 동적 프로시저 (0) | 2022.10.13 |
FREE: Real-time SQL Server Performance Monitor (0) | 2022.09.14 |
database size (0) | 2022.05.25 |