MantisBT - Rosetta
View Issue Details
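ID: 0000135 | Project: Rosetta | Category: [All Projects] Crash | View Status: public | Date Submitted: 2012-10-18 13:49 | Last Update: 2013-02-26 13:41
Reporter: delucasl
Assigned To: delucasl
Priority: normal | Severity: crash | Reproducibility: random
Status: resolved | Resolution: fixed
Platform: All platforms | OS: Any | OS Version: Any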
Trunk 
Trunk 
any JD2 application
score_jd2.linuxgccrelease -database $database @path_to_flags -nstruct $large_number
Confirmed As Bug
51751
0000135: MySQL transaction commits can deadlock; they should be retried rather than aborting.
When many MPI processes write to one database simultaneously, the database engine occasionally deadlocks. Right now Rosetta handles this with a utility exit; it should instead retry statement.exec() several times before giving up.

The annoying thing is that the cppdb::mysql::Deadlock error shown below is thrown as a generic cppdb_error exception rather than as a specific exception type. Either cppdb should be modified to throw a special exception for this case, or we can detect it with string matching in safely_write_to_database().
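A minimal sketch of the string-matching option, for illustration only. It is not the code in sql_utils.cc: is_deadlock_message and exec_with_deadlock_retry are hypothetical names, std::runtime_error stands in for cppdb_error, and the attempt count and pause are assumed values. The matched text is taken from the error output in this report.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: true if the error text contains the MySQL deadlock
// message quoted in this report.
inline bool is_deadlock_message(const std::string& what) {
    return what.find("Deadlock found when trying to get lock") != std::string::npos;
}

// Hypothetical wrapper: calls exec_statement until it succeeds, retrying only
// when the error looks like a deadlock. Returns the number of attempts used.
// Any other error, or a deadlock on the final attempt, is rethrown.
inline int exec_with_deadlock_retry(
    const std::function<void()>& exec_statement,
    int max_attempts = 5,  // assumed limit
    std::chrono::milliseconds pause = std::chrono::milliseconds(100)) {  // assumed pause
    for (int attempt = 1;; ++attempt) {
        try {
            exec_statement();
            return attempt;
        } catch (const std::runtime_error& e) {
            if (!is_deadlock_message(e.what()) || attempt >= max_attempts) {
                throw;  // not a deadlock, or out of retries
            }
            std::this_thread::sleep_for(pause);  // give the other transaction time to finish
        }
    }
}
```

In this sketch the caller would pass a lambda that wraps statement.exec(), and the existing exit path would only be reached once the retries are exhausted.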
Make a few hundred connections to one database and start writing a few hundred thousand structures; it will happen eventually.
ERROR: cppdb::mysql::Deadlock found when trying to get lock; try restarting transaction
ERROR:: Exit from: src/basic/database/sql_utils.cc line: 430
application called MPI_Abort(MPI_COMM_WORLD, 911) - process 0
Assertion failed in file socksm.c at line 1663: (it_plfd->revents & 0x008) == 0
Assertion failed in file socksm.c at line 1663: (it_plfd->revents & 0x008) == 0
Assertion failed in file socksm.c at line 1663: (it_plfd->revents & 0x008) == 0
internal ABORT - process 3
internal ABORT - process 4
internal ABORT - process 5
Assertion failed in file socksm.c at line 1663: (it_plfd->revents & 0x008) == 0
internal ABORT - process 1
Assertion failed in file socksm.c at line 1663: (it_plfd->revents & 0x008) == 0
internal ABORT - process 2
mpiexec: Warning: task 0 exited with status 143.
mpiexec: Warning: tasks 1-5 exited with status 1.
No tags attached.
Issue History
2012-10-18 13:49 | delucasl | New Issue
2012-10-18 13:49 | delucasl | Status | new => assigned
2012-10-18 13:49 | delucasl | Assigned To | => delucasl
2012-10-18 15:10 | momeara | Note Added: 0000119
2012-10-18 15:17 | delucasl | Note Added: 0000120
2013-02-26 13:41 | delucasl | Fixed in SVN Version | => 51751
2013-02-26 13:41 | delucasl | Note Added: 0000146
2013-02-26 13:41 | delucasl | Status | assigned => resolved
2013-02-26 13:41 | delucasl | Fixed in Version | => Trunk
2013-02-26 13:41 | delucasl | Resolution | open => fixed

Notes
(0000119)
momeara   
2012-10-18 15:10   
I'm in favor of refining the cppdb error codes, though our licensing of cppdb requires we pass any modifications upstream.

Sam, we have played with chunking transactions into larger blocks; have you seen this or tried it yourself? It seemed to help when we have many more client nodes than server nodes. When you create a transaction, set the transaction_mode. The biggest trick is that the sessions need to live long enough to do the chunking, and I'm not sure that happens with the JD2DatabaseJobOutputter.
(0000120)
delucasl   
2012-10-18 15:17   
I'm definitely leaning towards adding a deadlock_error exception. As far as I know each individual structure is a single transaction in the Job Outputter. Making bigger chunks than that would be problematic because the DatabaseFilter code assumes that the set of structures visible in the database represents all completed models.

The documentation for InnoDB (http://dev.mysql.com/doc/refman/5.0/en/innodb-deadlocks.html) says that you can mostly avoid deadlocks by structuring the database and queries such that transactions are serialized. I suspect that there are performance and scaling consequences to doing this, though. These deadlocks are rare enough that simply retrying is probably the best outcome in our case. I'll try that method first before doing anything more complex.
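For comparison, a rough sketch of the dedicated-exception route, assuming the backend can see the MySQL error number. It is not cppdb's actual code and not necessarily what was committed: db_error stands in for cppdb_error, and deadlock_error, throw_for_mysql_errno and exec_retrying_on_deadlock are hypothetical names. MySQL reports a deadlock as error 1213 (ER_LOCK_DEADLOCK).

```cpp
#include <stdexcept>
#include <string>

// Stand-in for the generic database error type (cppdb_error in cppdb).
class db_error : public std::runtime_error {
public:
    explicit db_error(const std::string& msg) : std::runtime_error(msg) {}
};

// Hypothetical dedicated exception, so callers can catch deadlocks by type
// instead of matching on the message text.
class deadlock_error : public db_error {
public:
    explicit deadlock_error(const std::string& msg) : db_error(msg) {}
};

// Hypothetical backend hook: maps a MySQL error number to an exception type.
inline void throw_for_mysql_errno(unsigned int code, const std::string& msg) {
    if (code == 1213) {  // ER_LOCK_DEADLOCK
        throw deadlock_error(msg);
    }
    throw db_error(msg);
}

// Retries exec only on deadlock_error; every other exception propagates
// immediately. Returns the number of attempts used.
template <typename Exec>
int exec_retrying_on_deadlock(Exec exec, int max_attempts = 5) {  // assumed limit
    for (int attempt = 1;; ++attempt) {
        try {
            exec();
            return attempt;
        } catch (const deadlock_error&) {
            if (attempt >= max_attempts) {
                throw;  // out of retries
            }
        }
    }
}
```

Because deadlock_error derives from the generic type in this sketch, existing handlers that catch the generic error would keep working unchanged.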
(0000146)
delucasl   
2013-02-26 13:41   
I left this open to confirm that 51751 fixed the problem. It does. Closing bug.