##SQL Version :Microsoft SQL Server 2019 (RTM-CU17) (KB5016394) - 15.0.4249.2 (X64)

 

 

확인사항]

먼저 DTC실패(Error: 8509)기록된후 Rollback하는 과정에서 Dump가 생성으며, 해당 이슈는 이미 알려진 이슈로 CU21에서 FIX가 되었습니다.

해당 Dump in-memory Table사용과정에서 발생할수있으며, 고객사같은 경우는 DTC Transaction Rollback 되는 과정에서 해당 known issue로 인한 Dump가 생성었고 이로인한 가용성그룹이 Resolving상태로 빠지게됩니다.

이후 Leasetime out이 발생하나 FailureConditionLevel(2) Failover가 자동적으로 Triggered되지 않았습니다.

 

#ErrorLog.2
2024-04-02 14:01:03.93 spid62      Error: 8509, Severity: 16, State: 1.
2024-04-02 14:01:03.93 spid62      Import of Microsoft Distributed Transaction Coordinator (MS DTC) transaction failed: 0x8004d00e(XACT_E_NOTRANSACTION).
2024-04-02 14:01:35.11 spid156     Error: 8509, Severity: 16, State: 1.
2024-04-02 14:01:35.11 spid156     Import of Microsoft Distributed Transaction Coordinator (MS DTC) transaction failed: 0x8004d00e(XACT_E_NOTRANSACTION).
2024-04-02 14:02:01.15 spid853     Error: 8509, Severity: 16, State: 1.
2024-04-02 14:02:01.15 spid853     Import of Microsoft Distributed Transaction Coordinator (MS DTC) transaction failed: 0x8004d00e(XACT_E_NOTRANSACTION).
2024-04-02 14:02:35.74 spid35s     CImageHelper::Init () Version-specific dbghelp.dll is not used
2024-04-02 14:02:35.74 spid35s     Using 'dbghelp.dll' version '4.0.5'
2024-04-02 14:02:35.75 spid35s     **Dump thread - spid = 0, EC = 0x0000020BD57C1E10
..중략..
2024-04-02 14:02:35.75 spid35s     ***Stack Dump being sent to \SQLDump0017.txt
2024-04-02 14:02:35.75 spid35s     * *******************************************************************************
2024-04-02 14:02:35.75 spid35s     *
2024-04-02 14:02:35.75 spid35s     * BEGIN STACK DUMP:
2024-04-02 14:02:35.75 spid35s     *   04/02/24 14:02:35 spid 35
2024-04-02 14:02:35.75 spid35s     *
2024-04-02 14:02:35.75 spid35s     * Location:         "sql\\ntdbms\\hekaton\\engine\\core\\tx.cpp":6337
2024-04-02 14:02:35.75 spid35s     * Expression:         Dependencies.CommitDepCountOut >= 1
2024-04-02 14:02:35.75 spid35s     * SPID:                 35
2024-04-02 14:02:35.75 spid35s     * Process ID:         5652
2024-04-02 14:03:23.88 spid35s     Timeout waiting for external dump process 21796.
 
2024-04-02 14:03:23.88 spid35s     Error: 17066, Severity: 16, State: 1.
2024-04-02 14:03:23.88 spid35s     SQL Server Assertion: File: <"sql\\ntdbms\\hekaton\\engine\\core\\tx.cpp">, line=6337 Failed Assertion = 'Dependencies.CommitDepCountOut >= 1'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.
..중략..
2024-04-02 14:03:23.88 Server      The state of the local availability replica in availability group 'L_AVG' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'.  The state changed because the lease between the local availability replica and Windows Server Failover Clustering (WSFC) has expired.  For more information, see the SQL Server error log or cluster log.  If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
2024-04-02 14:03:23.88 spid189s    The availability group database "DATABASE_P03" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2024-04-02 14:03:23.88 spid325s    The availability group database " DATABASE_03 _IF" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2024-04-02 14:03:23.88 spid851s    The availability group database " DATABASE_03_HMI_2" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2024-04-02 14:03:23.88 spid849s    The availability group database " DATABASE_03_P03" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
 
#ClusterLog
00002c40.00003588::2024/04/02-14:02:40.450 INFO  [RES] SQL Server Availability Group: [hadrag] SQL Server component 'system' health state has been changed from 'clean' to 'warning' at 2024-04-02 14:02:40.447
00002c40.00003588::2024/04/02-14:02:40.453 INFO  [RES] SQL Server Availability Group: [hadrag] SQL Server component 'query_processing' health state has been changed from 'clean' to 'warning' at 2024-04-02 14:02:40.447
00002c40.00000844::2024/04/02-14:02:57.490 ERR   [RES] SQL Server Availability Group <L_AVG>: [hadrag] Lease renewal failed with timeout error
00002c40.00000844::2024/04/02-14:02:57.490 ERR   [RES] SQL Server Availability Group <L_AVG>: [hadrag] The lease is expired. The lease should have been renewed by 2024/04/02-12:02:47.490
00002c40.00003858::2024/04/02-14:02:58.046 ERR   [RES] SQL Server Availability Group <L_AVG>: [hadrag] Availability Group lease is no longer valid
00002c40.00003858::2024/04/02-14:02:58.046 ERR   [RES] SQL Server Availability Group <L_AVG>: [hadrag] Resource Alive result 0.
00002c40.00003858::2024/04/02-14:02:58.046 ERR   [RES] SQL Server Availability Group <L_AVG>: [hadrag] Availability Group lease is no longer valid
00002c40.00003858::2024/04/02-14:02:58.046 ERR   [RES] SQL Server Availability Group <L_AVG>: [hadrag] Resource Alive result 0.
00002c40.00003858::2024/04/02-14:02:58.046 WARN  [RHS] Resource L_AVG IsAlive has indicated failure.
00001cec.00002838::2024/04/02-14:02:58.046 INFO  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'L_AVG', gen(0) result 1/0.
00002c40.00003858::2024/04/02-14:02:58.046 INFO  [RHS-WER] Scheduling WER ERROR report in 10.000. ReportId 2f1efff4-5677-4f46-8c8c-9162e4c155db;
00001cec.00002838::2024/04/02-14:02:58.046 INFO  [RCM] Res L_AVG: Online -> ProcessingFailure( StateUnknown )
00001cec.00002838::2024/04/02-14:02:58.046 INFO  [RCM] TransitionToState(L_AVG) Online-->ProcessingFailure.

 

해당 이슈가 Failover Trigger하는 부차적인 이슈가 있으며, 이로인한 failover를 반복적으로 수행되나 해당 역할을 Online시키지 못함에 따라 반복적으로 dump로 인한 failover가 수행되었습니다.

이과정에서 large transaction rollback및 반복적인 failover시도로인한 DB연결이슈(983)가 발생한 것으로 확인하였습니다.

 

2024/04/02-14:03:23.916 INFO  [RCM] rcm::RcmApi::MoveGroup: (Group:L_AVG Dest:1 Flags:0 MoveType:MoveType::Manual Cur.State:Pending, ContextSize:0)
2024/04/02-14:03:23.917 ERR   [RCM] rcm::RcmApi::OnlineResource: (5023)' because of 'The API call is not valid while resource is in the [Terminating to DelayRestartingResource] state.'


2024-04-02 14:03:23.93 spid725s    Error: 3624, Severity: 20, State: 1.
2024-04-02 14:03:23.93 spid725s    A system assertion check has failed. Check the SQL Server error log for details. Typically, an assertion failure is caused by a software bug or data corruption. To check for database corruption, consider running DBCC CHECKDB. If you agreed to send dumps to Microsoft during setup, a mini dump will be sent to Microsoft. An update might be available from Microsoft in the latest Service Pack or in a Hotfix from Technical Support.
2024-04-02 14:03:23.93 spid725s    Error: 3314, Severity: 21, State: 3.
2024-04-02 14:03:23.93 spid725s    During undoing of a logged operation in database 'tempdb' (page (1:139) if any), an error occurred at log record ID (17626:408077:603). Typically, the specific failure is logged previously as an error in the operating system error log. Restore the database or file from a backup, or repair the database.
..중략..
2024-04-02 14:04:22.71 Logon       Error: 983, Severity: 14, State: 1.
2024-04-02 14:04:22.71 Logon       Unable to access availability database 'DATABASE_P03' because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.
2024-04-02 14:04:30.07 spid189s    Nonqualified transactions are being rolled back in database DATABASE_P03 for an Always On Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required.
2024-04-02 14:04:32.07 spid189s    Nonqualified transactions are being rolled back in database DATABASE_P03 for an Always On Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required.
2024-04-02 14:04:32.99 spid653s    Remote harden of transaction 'GhostCleanupTask' (ID 0x00000072638fd8cd 0001:3a728b14) started at Apr  2 2024  2:04PM in database 'DATABASE_P03' at LSN (170966:295662:202) failed.
2024-04-02 14:04:33.67 spid325s    Attempting to recover in-doubt distributed transactions involving Microsoft Distributed Transaction Coordinator (MS DTC). This is an informational message only. No user action is required.
2024-04-02 14:04:33.67 spid37s     Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [5777243E-1C2F-4399-8B8F-8F006F9FD4EE]->[CFF305E1-8C94-4DB3-85D7-0EDD5502E88E], database [DATABASE_P03_IF], remote endpoint [TCP://LES.es.net:5022], source operation [8909B701-63D0-4C65-BE06-CE7BC04DD049]: Transitioning from [CATCHUP] to [COMPLETED].'
2024-04-02 14:04:33.67 spid46s     Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [5777243E-1C2F-4399-8B8F-8F006F9FD4EE]->[7ED01BC6-97DF-41C4-84E9-737FE70CB40B], database [DATABASE_P03_IF], remote endpoint [TCP://LES.es.net:5022], source operation [79597AD0-EB94-4E28-ACC6-0E791C5BB0E0]: Transitioning from [CATCHUP] to [COMPLETED].'
2024-04-02 14:04:33.67 spid974s    [DbMgrPartnerCommitPolicy::SetReplicaInfoInAg] DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: CFF305E1-8C94-4DB3-85D7-0EDD5502E88E:10:4
2024-04-02 14:04:34.07 spid189s    Nonqualified transactions are being rolled back in database DATABASE_P03 for an Always On Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required.

 

이후 고객사에서 LES Server재시작(2024-04-02 14:49)이후 정상적으로 'DATABASE_P03' Online되며 AG가 정상적으로 Online되었습니다.

2024-04-02 14:49:51.76 Server      SQL Server is terminating because of a system shutdown. This is an informational message only. No user action is required.
 
2024-04-02 15:00:39.69 spid46s     Always On Availability Groups connection with secondary database established for primary database 'DATABASE_P03' on the availability replica 'LES2' with Replica ID: {cff305e1-8c94-4db3-85d7-0edd5502e88e}. This is an informational message only. No user action is required.
2024-04-02 15:00:39.81 spid139s    Always On Availability Groups connection with secondary database established for primary database 'DATABASE_P03' on the availability replica 'LES2' with Replica ID: {cff305e1-8c94-4db3-85d7-0edd5502e88e}. This is an informational message only. No user action is required.
2024-04-02 15:00:52.81 spid79s     Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [5777243E-1C2F-4399-8B8F-8F006F9FD4EE]->[CFF305E1-8C94-4DB3-85D7-0EDD5502E88E], database [DATABASE_P03], remote endpoint [TCP://LES2.es.net:5022], source operation [719D2611-5566-404F-A02F-B08684AE5A07]: Transitioning from [CATCHUP] to [COMPLETED].'
2024-04-02 15:00:57.02 spid46s     [DbMgrPartnerCommitPolicy::SetReplicaInfoInAg] DbMgrPartnerCommitPolicy::SetSyncAndRecoveryPoint: CFF305E1-8C94-4DB3-85D7-0EDD5502E88E:11:4
2024-04-02 15:00:57.02 spid46s     DbMgrPartnerCommitPolicy::SetSyncState: CFF305E1-8C94-4DB3-85D7-0EDD5502E88E:11:4

 

[제안사항]

해당 이슈는 CU21에서 FIX되었으며, 최소 CU21 patch하시는 것을 제안드립니다.

Cumulative update 21 for SQL Server 2019 (KB5025808) - SQL Server | Microsoft Learn

 

 

다만, 기존 known issue들에 의 한 잠재적인 위험요소가 있기에 가장 마지막 버젼인 CU25로 업그레이드 하시는 것을 권장드립니다.

Download SQL Server® 2019 for Microsoft® Windows Latest Cumulative Update from Official Microsoft Download Center

+ Recent posts