View Issue Details | ||||||||
ID | Project | Category | View Status | Date Submitted | Last Update | ||||
0000135 | Rosetta | [All Projects] Crash | public | 2012-10-18 13:49 | 2013-02-26 13:41 | ||||
Reporter | delucasl | ||||||||
Assigned To | delucasl | ||||||||
Priority | normal | Severity | crash | Reproducibility | random | ||||
Status | resolved | Resolution | fixed | ||||||
Platform | All platforms | OS | Any | OS Version | Any | ||||
Product Version | Trunk | ||||||||
Fixed in Version | Trunk | ||||||||
Summary | 0000135: MySQL transaction commits can deadlock; they should be retried rather than aborting. | ||||||||
Description | When many MPI processes write to one database simultaneously, the database engine occasionally deadlocks. Rosetta currently handles this with a utility exit; it should instead retry statement.exec() several times before giving up. The annoying part is that the cppdb::mysql::Deadlock error shown below is thrown as a generic cppdb_error exception rather than as a specific exception type. Either cppdb should be modified to throw a dedicated exception for deadlocks, or we can detect the error by string matching in safely_write_to_database(). | ||||||||
Steps To Reproduce | Make a few hundred connections to one database and start writing a few hundred thousand structures; the deadlock will happen eventually. | ||||||||
Additional Information |
ERROR: cppdb::mysql::Deadlock found when trying to get lock; try restarting transaction
ERROR:: Exit from: src/basic/database/sql_utils.cc line: 430
application called MPI_Abort(MPI_COMM_WORLD, 911) - process 0
Assertion failed in file socksm.c at line 1663: (it_plfd->revents & 0x008) == 0
Assertion failed in file socksm.c at line 1663: (it_plfd->revents & 0x008) == 0
Assertion failed in file socksm.c at line 1663: (it_plfd->revents & 0x008) == 0
internal ABORT - process 3
internal ABORT - process 4
internal ABORT - process 5
Assertion failed in file socksm.c at line 1663: (it_plfd->revents & 0x008) == 0
internal ABORT - process 1
Assertion failed in file socksm.c at line 1663: (it_plfd->revents & 0x008) == 0
internal ABORT - process 2
mpiexec: Warning: task 0 exited with status 143.
mpiexec: Warning: tasks 1-5 exited with status 1. | ||||||||
Tags | No tags attached. | ||||||||
Application(s) Affected | any JD2 application | ||||||||
Command Line Used | score_jd2.linuxgccrelease -database $database @path_to_flags -nstruct $large_number | ||||||||
Developer Options | Confirmed As Bug | ||||||||
Fixed in SVN Version | 51751 | ||||||||
Attached Files | |||||||||
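The string-matching option from the Description could look roughly like the sketch below. It is not the code that went into Rosetta: db_error is a hypothetical stand-in for cppdb::cppdb_error, and retry_on_deadlock, is_deadlock, max_attempts and pause are names invented for illustration. One caveat worth keeping in mind: InnoDB rolls back the whole victim transaction on a deadlock, so the callable passed in should re-run the entire transaction, not only the last statement.

```cpp
#include <cassert>
#include <chrono>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for the generic error type thrown by the database
// layer (the role cppdb::cppdb_error plays in this issue).
struct db_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// The deadlock is only distinguishable by its message text, so match on it.
inline bool is_deadlock(const std::exception& e) {
    return std::string(e.what()).find("Deadlock found") != std::string::npos;
}

// Run `op` up to `max_attempts` times, retrying only on deadlock errors.
// Returns the number of attempts used. Any other error, or a deadlock on
// the final attempt, propagates to the caller.
template <typename Op>
int retry_on_deadlock(Op op, int max_attempts = 5,
                      std::chrono::milliseconds pause = std::chrono::milliseconds(10)) {
    assert(max_attempts > 0);
    for (int attempt = 1;; ++attempt) {
        try {
            op();
            return attempt;
        } catch (const db_error& e) {
            if (!is_deadlock(e) || attempt == max_attempts) throw;
            // Linear backoff so competing writers are less likely to collide again.
            std::this_thread::sleep_for(pause * attempt);
        }
    }
}
```

A caller would wrap the whole write, for example retry_on_deadlock([&] { write_one_structure(); }), where write_one_structure is again a placeholder name. Rethrowing after the last attempt keeps the existing abort behavior as the fallback.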
Notes | |
(0000119) momeara (Attentive Developer) 2012-10-18 15:10 |
I'm in favor of refining the cppdb error codes, though our licensing of cppdb requires that we pass any modifications upstream. Sam, we have played with chunking transactions into larger blocks; have you seen or tried this yourself? It seemed to help when we had many more client nodes than server nodes. When you create a transaction, set the transaction_mode. The biggest trick is that the sessions need to live long enough to do the chunking, which I'm not sure happens with the JD2DatabaseJobOutputter. |
(0000120) delucasl (Administrator) 2012-10-18 15:17 |
I'm definitely leaning towards adding a deadlock_error exception. As far as I know, each individual structure is a single transaction in the Job Outputter. Making bigger chunks than that would be problematic because the DatabaseFilter code assumes that the set of structures visible in the database represents all completed models. The documentation for InnoDB (http://dev.mysql.com/doc/refman/5.0/en/innodb-deadlocks.html) says that you can mostly avoid deadlocks by structuring the database and queries such that transactions are serialized. I suspect that there are performance and scaling consequences to doing this, though. These deadlocks are rare enough that simply retrying is probably the best outcome in our case. I'll try that method first before doing anything more complex. |
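For comparison, the dedicated-exception route mentioned in this note might be sketched as follows. None of these types exist in cppdb as far as this issue states: generic_db_error, deadlock_error, throw_mysql_error and classify are hypothetical names for illustration. The only outside fact relied on is that MySQL reports a deadlock with error code 1213 (ER_LOCK_DEADLOCK).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the library's generic error type.
struct generic_db_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical dedicated type. Deriving from the generic error keeps
// existing catch blocks for the generic type working unchanged.
struct deadlock_error : generic_db_error {
    using generic_db_error::generic_db_error;
};

// Sketch of a backend hook that picks the exception type from the MySQL
// error code instead of leaving callers to parse the message.
[[noreturn]] inline void throw_mysql_error(unsigned int code, const std::string& msg) {
    assert(code != 0);  // 0 means "no error" in the MySQL client API
    const unsigned int er_lock_deadlock = 1213;  // ER_LOCK_DEADLOCK
    if (code == er_lock_deadlock) throw deadlock_error(msg);
    throw generic_db_error(msg);
}

enum class outcome { retry, fatal };

// Caller side: the more specific handler must come first, because a
// deadlock_error also matches the generic handler.
inline outcome classify(unsigned int code, const std::string& msg) {
    try {
        throw_mysql_error(code, msg);
    } catch (const deadlock_error&) {
        return outcome::retry;
    } catch (const generic_db_error&) {
        return outcome::fatal;
    }
}
```

The trade-off against string matching is that this requires a change inside the database library itself, which per the note above would have to be passed upstream.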
(0000146) delucasl (Administrator) 2013-02-26 13:41 |
I left this open to confirm that 51751 fixed the problem. It does. Closing bug. |
Issue History | |||
Date Modified | Username | Field | Change |
2012-10-18 13:49 | delucasl | New Issue | |
2012-10-18 13:49 | delucasl | Status | new => assigned |
2012-10-18 13:49 | delucasl | Assigned To | => delucasl |
2012-10-18 15:10 | momeara | Note Added: 0000119 | |
2012-10-18 15:17 | delucasl | Note Added: 0000120 | |
2013-02-26 13:41 | delucasl | Fixed in SVN Version | => 51751 |
2013-02-26 13:41 | delucasl | Note Added: 0000146 | |
2013-02-26 13:41 | delucasl | Status | assigned => resolved |
2013-02-26 13:41 | delucasl | Fixed in Version | => Trunk |
2013-02-26 13:41 | delucasl | Resolution | open => fixed |