Bug #155

Database corruption after creating 1 job and quitting master

Added by Andreas Schröder over 7 years ago. Updated about 7 years ago.

Status:FixedStart date:10/10/2010
Priority:UrgentDue date:
Assignee:Alistair Leslie-Hughes% Done:

100%

Category:all
Target version:0.64.5

Description

1. Create 1 job with Drqman or DQOR
2. Quit master
3. You will get the following output:

$ /usr/local/drqueue/bin/master.Linux.x86_64 -o
Logging on screen.
Sun Oct 10 20:57:44 2010 (PID:16033) : |   Info | -> MSG: Starting master... (Revision: cf2cfd6)
Could not open config file: '/etc/drqueue/master.conf'
Parsing config at: /usr/local/drqueue/etc/master.conf
Could not load database: Could not open file or directory. Initializing.
Waiting for connections...
Sun Oct 10 20:57:44 2010 (PID:16034) : |   Info | -> MSG: master_consistency_checks(): Starting... (PID: 16034)
Sun Oct 10 21:00:20 2010 (PID:16038) : |   Info | -> MSG: Registered on position 0
^CSun Oct 10 21:00:29 2010 (PID:16033) : |   Info | -> MSG: Saving...
Sun Oct 10 21:00:29 2010 (PID:16033) : |   Info | -> MSG: Storing DB into: '/usr/local/drqueue/db/drqueue.db'
Sun Oct 10 21:00:29 2010 (PID:16033) : |  ERROR | Job: name:'abc2' owner:'kaazoo' id:'0' : -> MSG: database_job_save(): error saving job environment variables. (Success)
Sun Oct 10 21:00:29 2010 (PID:16033) : |  ERROR | Job: name:'abc2' owner:'kaazoo' id:'0' : -> MSG: database_save(): error saving job number 0. (Success)
Sun Oct 10 21:00:29 2010 (PID:16033) : |   Info | Job: name:'abc2' owner:'kaazoo' id:'0' : -> MSG: Cleaning jobs from memory...
PID, Signal that caused death: 16033, 2 (Interrupt)

History

#1 Updated by Andreas Schröder over 7 years ago

The bug seems to be in this commit:
https://ssl.drqueue.org/redmine/projects/drqueue/repository/revisions/3d085944b6313a721ef29addf7d5dc5a62ea1612/diff/libdrqueue/computer.c

$ /usr/local/drqueue/bin/master.Darwin.x86_64 -o -l3
Logging on screen.
Logging level set to: 259 (Debug)
Tue Oct 12 00:25:31 2010 (PID:63558) : |   Info | -> MSG: Starting master... (Revision: cf2cfd6)
Could not open config file: '/etc/drqueue/master.conf'
Parsing config at: /usr/local/drqueue/etc/master.conf
Tue Oct 12 00:25:31 2010 (PID:63558) : |  Debug | -> MSG: Using file '/usr/local/drqueue/bin/master' to obtain shared memory id.
Tue Oct 12 00:25:31 2010 (PID:63558) : |  Debug | -> MSG: Using file '/usr/local/drqueue/bin/master' to obtain semaphores id.
Waiting for connections...
Tue Oct 12 00:25:31 2010 (PID:63559) : |   Info | -> MSG: master_consistency_checks(): Starting... (PID: 63559)
Tue Oct 12 00:25:36 2010 (PID:63560) : |  Debug | -> MSG: Entering computer_index_addr
Tue Oct 12 00:25:36 2010 (PID:63560) : |  Debug | -> MSG: Comparing 127.0.0.1 with .
Tue Oct 12 00:25:36 2010 (PID:63560) : |  Debug | -> MSG: Exiting computer_index_addr. Index of computer 127.0.0.1 is 0.
Tue Oct 12 00:25:36 2010 (PID:63560) : |  Debug | -> MSG: Request list of jobs
Tue Oct 12 00:25:36 2010 (PID:63560) : |  Debug | -> MSG: Entering handle_r_r_listjobs
Tue Oct 12 00:25:36 2010 (PID:63560) : |  Debug | -> MSG: Exiting handle_r_r_listjobs
Tue Oct 12 00:25:36 2010 (PID:63562) : |  Debug | -> MSG: Entering computer_index_addr
Tue Oct 12 00:25:36 2010 (PID:63562) : |  Debug | -> MSG: Comparing 127.0.0.1 with .
Tue Oct 12 00:25:36 2010 (PID:63562) : |  Debug | -> MSG: Exiting computer_index_addr. Index of computer 127.0.0.1 is 0.
Tue Oct 12 00:25:36 2010 (PID:63562) : |  Debug | -> MSG: Request list of computers
Tue Oct 12 00:25:47 2010 (PID:63563) : |  Debug | -> MSG: Entering computer_index_addr
Tue Oct 12 00:25:47 2010 (PID:63563) : |  Debug | -> MSG: Comparing 127.0.0.1 with .
Tue Oct 12 00:25:47 2010 (PID:63563) : |  Debug | -> MSG: Exiting computer_index_addr. Index of computer 127.0.0.1 is 0.
Tue Oct 12 00:25:47 2010 (PID:63563) : |  Debug | -> MSG: Registration of new job request
Tue Oct 12 00:25:47 2010 (PID:63563) : |  Debug | -> MSG: Job index 0 free
Tue Oct 12 00:25:47 2010 (PID:63563) : |   Info | -> MSG: Registered on position 0
Tue Oct 12 00:25:47 2010 (PID:63563) : |  Debug | -> MSG: Entering request_all_slaves_job_available
Tue Oct 12 00:25:47 2010 (PID:63563) : |  Debug | -> MSG: Creating priority ordered list of jobs
Tue Oct 12 00:25:47 2010 (PID:63563) : |  Debug | -> MSG: Job available (0) for notification to slaves
Tue Oct 12 00:25:47 2010 (PID:63563) : |  Debug | -> MSG: Exiting request_all_slaves_job_available
Tue Oct 12 00:25:48 2010 (PID:63564) : |  Debug | -> MSG: Entering computer_index_addr
Tue Oct 12 00:25:48 2010 (PID:63564) : |  Debug | -> MSG: Comparing 127.0.0.1 with .
Tue Oct 12 00:25:48 2010 (PID:63564) : |  Debug | -> MSG: Exiting computer_index_addr. Index of computer 127.0.0.1 is 0.
Tue Oct 12 00:25:48 2010 (PID:63564) : |  Debug | -> MSG: Request list of jobs
Tue Oct 12 00:25:48 2010 (PID:63564) : |  Debug | -> MSG: Entering handle_r_r_listjobs
Tue Oct 12 00:25:48 2010 (PID:63564) : |  Debug | -> MSG: send_envvars(): we have 0 environment variables available for the request
Tue Oct 12 00:25:48 2010 (PID:63564) : |  Debug | -> MSG: send_envvars() and just informed the client about that
Tue Oct 12 00:25:48 2010 (PID:63564) : |  Debug | -> MSG: Exiting handle_r_r_listjobs

^CTue Oct 12 00:26:59 2010 (PID:63558) : |   Info | -> MSG: Saving...
Tue Oct 12 00:26:59 2010 (PID:63558) : |   Info | -> MSG: Storing DB into: '/usr/local/drqueue/db/drqueue.db'
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: send_envvars(): we have 0 environment variables available for the request
Tue Oct 12 00:26:59 2010 (PID:63558) : |  ERROR | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: dr_socket_write(): unable to send 2 bytes (Unknown error: 0)
Tue Oct 12 00:26:59 2010 (PID:63558) : |  ERROR | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: database_job_save(): error saving job environment variables. (Unknown error: 0)
Tue Oct 12 00:26:59 2010 (PID:63558) : |  ERROR | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: database_save(): error saving job number 0. (Unknown error: 0)
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: clean_out: Child arrived (63574) with rc = 2
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: clean_out: Child arrived (63573) with rc = 2
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: clean_out: Child arrived (63572) with rc = 2
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: clean_out: Child arrived (63571) with rc = 2
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: clean_out: Child arrived (63569) with rc = 2
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: clean_out: Child arrived (63568) with rc = 2
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: clean_out: Child arrived (63567) with rc = 2
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: clean_out: Child arrived (63566) with rc = 2
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: clean_out: Child arrived (63565) with rc = 2
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: clean_out: Child arrived (63561) with rc = 2
Tue Oct 12 00:26:59 2010 (PID:63558) : |  Debug | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: clean_out: Child arrived (63559) with rc = 2
Tue Oct 12 00:26:59 2010 (PID:63558) : |   Info | Job: name:'ls' owner:'kaazoo' id:'0' : -> MSG: Cleaning jobs from memory...
PID, Signal that caused death: 63558, 2 (Interrupt)

#2 Updated by Andreas Schröder over 7 years ago

  • Status changed from New to In progress
  • Assignee set to Andreas Schröder

Commit 5d383b63c00781f3b49b72787a98d6bc29f43118 seems to fix this problem. The downside is that I had to switch back to dr_read/dr_write. So the main problem could be that dr_socket_read/dr_socket_write was used where dr_file_read/dr_file_write should be used or vice versa.
When a solution is found, the better code could be used again.

#3 Updated by Andreas Schröder over 7 years ago

  • Status changed from In progress to Feedback
  • Assignee changed from Andreas Schröder to Alistair Leslie-Hughes

#4 Updated by Alistair Leslie-Hughes over 7 years ago

I've track the issue down, It will be a little while before I have a solid patch for review.

#5 Updated by Jorge Daza Garcia-Blanes about 7 years ago

  • Status changed from Feedback to Fixed
  • % Done changed from 0 to 100

Alistair Leslie-Hughes wrote:

I've track the issue down, It will be a little while before I have a solid patch for review.

Allisteair, may I know why do you think dr_read and dr_write need to be changed ?

Before I throw my fury on you and start thinking badly about you ?

If you want to make them static... or would you simply like to declare an unused "total" variable ? and then creating a new bug by wrongly placing the operators ?

Come on, clean that up or I'll do it for you... hear the beeping on your left ear ?

And you make this urgent ? Urgent is the removal of those buggy functions.

#6 Updated by Jorge Daza Garcia-Blanes about 7 years ago

Alistair Leslie-Hughes wrote:

I've track the issue down, It will be a little while before I have a solid patch for review.

:) Really, what I meant was, have you tried changing the location of your -= and += operators as in the original dr_read function ?

I guess it might work for you.

#7 Updated by Jorge Daza Garcia-Blanes about 7 years ago

Also, Andreas and Alistair, I considered this bug fixed, as the corruption stopped from happening.

Alistair, please do not modify drqueue core without understanding what you're doing. dr_read and dr_write should never be changed in their previous usage.

Revert any other modification (if any other) you might have committed in that direction, please.

Also available in: Atom PDF