Value of one of the thread-argument variable has changed at some place, which is only detected when the program crashes.
(gdb) p *a $7 = { host = 0x0, port = 0, h_ent = 0x7fff5fbf6f10, request = 0x100005620 "UH??H??? ", reqlen = 140734799803600, timeout = 0x100004e56, s = 0x100281000, cnt = 1606381568, sargs = 0x1001000e0 }
Look at the garbage.
So, watchpoint.
watch a
isn't helpful at all, since there is no write to the variable a. Have to use memory location watch, which is
(gdb) watch -location a Hardware watchpoint 7: *(struct args **) 4297592504
(GDB shipped with OS X doesn't recognize watch -l (as specified in the help message).
Then the debugger kindly pauses my program at
(gdb) c Continuing. Hardware watchpoint 6 deleted because the program has left the block in which its expression is valid. READ: 8192, TOTAL: 8192 Hardware watchpoint 7: *(struct args **) 4297592504 Old value = (struct args *) 0x7fff5fbf6f40 New value = (struct args *) 0x7fff5fbf6f00 0x0000000100002641 in parse_response (buf=0x10027eeb8 "HTTP/1.1 200 OK\r\nVary: Accept-Encoding\r\nContent-Type: image/x-icon\r\nLast-Modified: Tue, 14 Aug 2012 15:19:23 GMT\r\nDate: Wed, 17 Apr 2013 14:07:54 GMT\r\nExpires: Thu, 25 Apr 2013 14:07:54 GMT\r\nX-Content"..., len=8192, a=0x7fff5fbf7000) at md.c:512 512 nextproc[len] = 0; (gdb) bt #0 0x0000000100002641 in parse_response (buf=0x10027eeb8 "HTTP/1.1 200 OK\r\nVary: Accept-Encoding\r\nContent-Type: image/x-icon\r\nLast-Modified: Tue, 14 Aug 2012 15:19:23 GMT\r\nDate: Wed, 17 Apr 2013 14:07:54 GMT\r\nExpires: Thu, 25 Apr 2013 14:07:54 GMT\r\nX-Content"..., len=8192, a=0x7fff5fbf7000) at md.c:512 #1 0x0000000100005c10 in recv_engine (p=0x7fff5fbf6f40) at engine.c:394 #2 0x00007fff88c637a2 in _pthread_start () #3 0x00007fff88c501e1 in thread_start () (gdb) info locals readlen = 8192 sfd = 3 i = 0 n = 3 quit = 0 _i = 0 a = (struct args *) 0x7fff5fbf6f00 buf = "HTTP/1.1 200 OK\r\nVary: Accept-Encoding\r\nContent-Type: image/x-icon\r\nLast-Modified: Tue, 14 Aug 2012 15:19:23 GMT\r\nDate: Wed, 17 Apr 2013 14:07:54 GMT\r\nExpires: Thu, 25 Apr 2013 14:07:54 GMT\r\nX-Content"... go_loop = 1 total = 8192 error_exit = 0 out_rwx.86 = (int (*)[1]) 0x10027ee20
(gdb) l parse_response 502 503 return bytes_processed - bufsz_orig; 504 } 505 506 int // return 0 for normal, 1 for all done, -1 for error 507 parse_response (char *buf, long len, struct session_args *a) { 508 // responsible for parsing the status line 509 // and handle the rest to different parse sub routines 510 char *nextinput = buf; 511 char *nextproc = buf; (gdb) l 512 nextproc[len] = 0;
(gdb) l engine.c:366 361 || a->sargs[i].rtt_timer[0].tv_usec) { 362 // we only pause for the first time 363 timer_pause (a->sargs[i].rtt_timer); 364 } 365 //printf ("Read %d\n", sfd); 366 int readlen = recv (sfd, buf, sizeof(buf) - 1, 0); 367 if (readlen == 0) { 368 dprintf ("READ 0 <%d>\n", a->s[i]); 369 close (a->s[i]); 370 if (a->sargs[i].repeat_recv > 0) {
Everything is just crystal clear now. I did an recv using the whole buffer, at line 366, then passed to parse_response, which expected one more byte at end of buffer so I could append a '\0' for easier string operations (line 512), of which location overruns
into &a, the variable that shouldn't have been changed.
It's weird that valgrind didn't detect this. Maybe there is a stack checking tool in experimental stage I remember but this definitely is not in the memcheck tool.
Anyway, looks like it's working again now.
A few more insights:
1. this bug didn't reveal until I cleared out some recursion code, tests the code with multiple threads, on a line that has delays and drops.
2. GCC built binary on LInux didn't gave me trouble. LLVM on OS X shows up problems 8/10 times.
3. Read the code, make asserts base on assumptions. I did this in the parsing module, not in the boundary of networking and parsing modules.
4. watch -location is useful.
5. Still didn't figure out what is "-fstack-protector" for in gcc / clang
Update 0424:
Using valgrind exp-sgcheck
==25634== Invalid read of size 1 ==25634== at 0x4C2C008: index (h_intercepts.c:89) ==25634== by 0x402BB6: parse_response (md.c:552) ==25634== by 0x403A5F: test_parse (md.c:896) ==25634== by 0x403CAF: main (md.c:951) ==25634== Address 0x7ff000430 expected vs actual: ==25634== Expected: stack array "buf" of size 8,192 in frame 2 back from here ==25634== Actual: unknown ==25634== Actual: is 0 after Expected
Valgrind sgcheck found one occurrence, which is a few accesses after the one bug I found using GDB as shown above. But it could still be useful sometimes maybe.