#milkymist IRC log for Thursday, 2011-08-18

wpwrakand now, live from the arena, the eternal struggle of good versus evil ! today, the champion of the powers of good will be ... adam ! the forces of evil are represented by board 0x32. will the good prevail ? will one more M1rc3 perish in darkness ? watch it live, only on #milkymist !00:58
rohhow's the score atm?01:02
wpwrakroh: so far, adam has just entered the area but hasn't uttered a word yet. perhaps he's meditating, gathering his spiritual forces to defeat the foe :)01:09
rohi am sure he'll get through that pile of stuff to fix. will just take some time.01:17
wpwrakyeah. let's hope there aren't too many of the truly weird problems left in that pile01:26
wpwrakhmm, seems that tonight will be quiet01:48
kristianpauland still you thinking rework 90 pcbs again?02:00
wolfspraulalright, I saw some confusion in questions from Sebastien and Kristian Paul02:06
wolfspraulit's a little difficult to keep overview in the flood of details02:06
kristianpaulwell, i'm asking with no awareness of last day of backlog02:07
wolfspraul1. we cannot say that there is a problem with 'bad nor chips'02:07
wolfspraulall is fine with the nor chips02:07
wolfspraul2. urjtag may have bugs, but right now it seems Xilinx Impact is only as good as urjtag, so we will continue to use urjtag and the jtag-serial board02:07
wolfspraul3. of course we will apply fix2b, or any additional fix we may find to be needed - to all boards that are being sold02:10
wolfspraulall boards that are being sold are sold in the same condition, and that condition is 100% pass and 100% bug-free02:10
wolfspraulquite simple :-)02:10
wolfspraul4. yesterday was a bit messy, or rather slow, because it was not really a production day, but more a fix2b design verification day02:10
wolfspraulnormally we would not spend so much time on one board that shows a NOR problem (_for whatever reason_), but we did because we had to verify fix2b and cannot risk jumping over too many unknowns02:10
wolfspraulafter yesterday, I feel pretty good about fix2b02:10
wolfspraulAdam will now approach the other 15/16 fix2b candidates in a more production style, that is - fast02:10
wolfspraulif something doesn't work - note what went wrong, then next board02:10
wolfspraul5. I am very happy that now it seems we can remove the long wire02:11
wolfspraulthe long wire is an invitation for trouble02:11
kristianpaulaka FM antenna :)02:11
wolfspraulthe one thing special about rc3 is that we mix design verification and production run02:11
wolfspraulalmost since day smt+1 if you remember02:11
wolfspraulwe started fiddling with the reset circuit right at the beginning because boards wouldn't boot02:12
kristianpaulsure, that was messy but still well managed which is good !02:12
wolfspraulso that's causing a big problem02:12
kristianpaulyes i have good memory02:12
wolfspraulbecause design verification and production are so different02:12
wolfspraulproduction is about speed, economics02:12
wolfspraulevery board goes through a predefined and efficient process02:12
wolfsprauland it doesn't matter why it fails, if it fails the failure point is recorded and we go to the next board02:13
wolfspraulbut if we are uncertain about the design (!), then we cannot do that02:13
kristianpaulkeep going !:)02:13
kristianpaulyes, i now understand your point02:13
wolfspraulwhy did sales not start yet?02:13
wolfspraulit's easy02:14
wolfspraulit's _NOT_ because some boards have problems02:14
wolfspraulin a run of 90 there will always be problems02:14
wolfspraulit's because our test procedure, which ended in 10 thirty second render cycles, showed sudden failures (cannot reconfig) on boards that were perfectly rendering fine, but stopped doing so at the 2nd, 5th, 9th render cycle02:15
wolfspraulthat's bad!!!!02:15
wolfspraulthat's the bad thing02:15
wolfspraulnot the schmitt-triggers, usb transceivers, nor chips, urjtag bugs, long wires, etc.02:15
kristianpaulsure, thats uncertain02:15
kristianpaulat least was..02:15
wolfspraulno it still is02:15
wolfspraulso I have 40 boards (for example)02:15
wolfspraul30 pass02:15
wolfspraul10 fail, some at 2nd rendering, some at 5th, some at 9th02:15
wolfspraulcan we sell the 30 pass?02:15
kristianpaulnot working02:16
wolfspraulbecause if we had done 20 render cycles, or 30, then maybe only 20 boards would have 'passed'02:16
wolfspraulor 1502:16
wolfspraulmaybe if we do 100 render cycles, none would pass02:16
wolfspraulgot it?02:16
kristianpaultotally :)02:16
wolfspraulI cannot start selling like this, not _ANY_ board02:16
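The argument above (a pass after 10 render cycles says little if failures keep showing up at later cycles) can be illustrated with a small sketch. It assumes, purely for illustration, that every render cycle fails independently with the same probability; that model and the function names are not from the discussion, only the 40-board / 10-failure example is.

```python
def per_cycle_failure_prob(boards, failed, cycles):
    """Estimate a constant per-cycle failure probability from one test run.

    Assumption (not from the log): each render cycle fails independently
    with the same probability p, so survival after n cycles is (1 - p) ** n.
    """
    survival = (boards - failed) / boards
    return 1.0 - survival ** (1.0 / cycles)


def expected_passes(boards, p, cycles):
    """Expected number of boards still passing after `cycles` render cycles."""
    return boards * (1.0 - p) ** cycles


if __name__ == "__main__":
    # Example from the discussion: 40 boards, 10 fail within 10 render cycles.
    p = per_cycle_failure_prob(boards=40, failed=10, cycles=10)
    for n in (10, 20, 30, 100):
        print(f"{n:3d} cycles: ~{expected_passes(40, p, n):.1f} boards pass")
```

Under this simplified model, the 30 "passing" boards would shrink to roughly 22 after 20 cycles, 17 after 30, and 2 after 100, which is the point being made: the pass count depends on where the test happens to stop.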
wolfspraulso what I want to see with fix2b now is this:02:17
wolfspraul1. Adam works on a lot of boards, one after another, fast02:17
wolfspraulsome pass, some fail02:17
wolfspraulhe will do 10 render cycles on each one02:17
wolfspraulI have two conditions to start sales:02:17
wolfspraul1) at least 50% of boards must pass (otherwise something so big may be covered up somewhere that we are better off pausing for a day or two to study it)02:18
wolfspraul2) from the boards that fail, _NONE_ must fail at any point after running the test software (all peripherals). If they fail before that - fine. But once they boot for the first time after the test software, there must be no failures.02:19
wpwrakyeah, no regressions02:19
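The two sales conditions stated above could be written down as a simple gate. This is only a sketch: the `BoardResult` record, its field names, and the board names in the usage note are hypothetical, and the 50% threshold is the one named in the discussion.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BoardResult:
    board_id: str
    passed_test_software: bool
    # Render cycle (1-based) at which the board failed, or None if no cycle failed.
    failed_at_cycle: Optional[int] = None

    @property
    def passed(self) -> bool:
        # A board passes only if the test software passed and no render cycle failed.
        return self.passed_test_software and self.failed_at_cycle is None

    @property
    def regressed(self) -> bool:
        # Failed *after* the test software had already declared the board good.
        return self.passed_test_software and self.failed_at_cycle is not None


def can_start_sales(results: List[BoardResult], min_pass_ratio: float = 0.5) -> bool:
    """Condition 1: at least min_pass_ratio of boards pass.
    Condition 2: no board fails after passing the test software."""
    if not results:
        return False
    pass_ratio = sum(r.passed for r in results) / len(results)
    no_regressions = not any(r.regressed for r in results)
    return pass_ratio >= min_pass_ratio and no_regressions
```

For example, two passing boards plus one that never got through the test software would satisfy the gate, while adding a single `BoardResult("D", True, failed_at_cycle=2)` would block sales regardless of the pass ratio.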
wolfspraulso I think we need to give Adam one or two full days, where he can just focus on speed and no chatting and no time consuming analysis.02:20
wolfspraulunless something really worrying comes up with fix2b, but we can follow the wiki page as he updates it02:20
wpwrakthat's what we had with 0x3a. maybe one day we'll even know why ;-)02:20
wpwrak(no chatting) hehe ;-)02:20
wolfspraulwpwrak: 0x3A rendered fine, and then stopped?02:20
wolfspraulkristianpaul: is this all clear now?02:21
kristianpaulwolfspraul: yes02:21
kristianpauli lack patience thats all :)02:22
wolfspraulhe, ok. sorry. it's a flood of details I know.02:22
kristianpaulyou think in long term02:22
kristianpaulwhich is GOOD02:22
wpwrak0x3a didn't get that far. but at one point in time correct NOR content could be read back. after that, either writing failed or readback has 100% reproducible corruption.02:22
kristianpauli haven't even thought about that 50% pass02:22
wolfspraulok but that's way before 100% pass of the test software02:22
wolfspraulI don't care about those cases02:22
wolfspraulkristianpaul: that's just to protect us from a potentially still remaining design mistake02:22
wolfspraulif Adam works on 20 boards, and 2 pass - something is wrong :-)02:23
kristianpauloh sure, as those popping up with rc2 :)02:23
wolfspraulwpwrak: it's good that we dug into the nor of 0x3A so much yesterday, but from a failure analysis standpoint, it may be just one of the 5-10 boards with 'various' problems here or there in the end02:23
wolfspraulnow we are sure it's not related to fix2b02:23
wolfspraulkristianpaul: you can also see it this way:02:24
wolfspraulthe test software must catch 100% of failures02:24
wolfspraulif the test software passes, after that the board must work02:24
wpwrakwolfspraul: yeah, i feel good about fix2b. also in 0x3a, besides the actual corruption, the rest of the system behaviour makes sense02:25
wolfspraulif boards fail after the test software has determined they are 100% ok, that's a big problem02:25
wpwrakwolfspraul: this means that we can now do a little better than just "d2/d3 dimly lit" :)02:25
wolfspraulthat means our test software is bad (or a design mistake that the test software cannot detect), and it means we have to re-test the entire batch after that issue is cleared up02:25
wpwrakyeah. these things would suck02:25
wolfspraulwpwrak: yes, we learnt a lot with 0x3a02:26
wolfspraulinteresting board02:26
wolfspraulbut not the time to go deeper there now, and maybe never02:26
wolfspraulmaybe just replace the nor chip and it's all fine02:26
wolfspraulmanufacturing is about economics, not every weird case needs to be analyzed to the total root cause02:27
wolfspraulif we have a strong design, and strong test software, we have a basis to run manufacturing economics on02:27
wolfspraulthe most decisions become economic decisions02:27
wolfspraulbut if we are uncertain about the design, or test software - BAD! :-)02:27
wolfspraulthen it gets messy02:27
wolfspraulbecause then we cannot just make quick economic decisions02:28
wpwrakif we get a cluster of NOR troubles like in 0x3a, it may make sense to write a pattern of 0x0000 or 0xffff and read it back. then probe the bus lines. that would show whether the problem is on reading or on writing.02:28
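The pattern test suggested above could start with a software-side classification before anyone probes the bus. The sketch below is hypothetical: it assumes 16-bit words (suggested by the 0x0000 / 0xffff patterns) and that the same region is read back several times; it cannot by itself tell a failed write from a stable read fault, which is what probing the bus lines would be for.

```python
def classify_readback(pattern, readbacks):
    """Classify repeated readbacks of a region filled with one pattern word.

    pattern   -- the 16-bit word written everywhere (e.g. 0x0000 or 0xFFFF)
    readbacks -- list of read passes, each a list of words from the same region
    """
    if all(word == pattern for rb in readbacks for word in rb):
        return "ok"
    if any(rb != readbacks[0] for rb in readbacks[1:]):
        # Passes disagree with each other: the read path is noisy.
        return "unstable"
    # Every pass returns the same wrong data: write failure or stable read
    # fault. Probing the bus lines is needed to tell these apart.
    return "stable_corruption"


def differing_bits(pattern, readback):
    """Bit positions (0-15) that differ from the pattern in any word read."""
    diff = 0
    for word in readback:
        diff |= word ^ pattern
    return [bit for bit in range(16) if diff & (1 << bit)]
```

A single data line that always differs across both the all-zeros and all-ones patterns would point at connectivity rather than at the chip contents, which is why running both patterns is useful.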
wolfspraulI think it's crazy that we first solder this long wire and diode to 90 boards, and a few weeks later determine that it was not needed :-)02:28
wolfspraulthat shows how we are mixing design and production work02:28
wolfspraulbut it's ok, we go full power forward on everything now02:28
wpwrakyeah, that was the scenic route :)02:29
wolfspraulif more design problems show up, well, sorry, we have to go through the entire batch again...02:29
rohtrue. on the other hand.. better to find and fix that bugs before shipping.. not like some other vendors selling green bananas02:29
wolfspraulno worries.02:29
rohin the end there were not many 'design' errors right? mostly 'bad parts' as far as i understood02:30
wolfsprauloh no02:30
rohlike the schmitt-triggers and now this diodes?02:30
wolfsprauldesign 'error' maybe too much, but definitely design 'uncertainty'02:30
wolfsprauland that is enough to disrupt the normal testing of the run02:30
wolfspraulroh: like for example, it turned out that the way we produced the boards in SMT, none of them would have booted02:31
rohhuh? i thought all design changes were prototyped and 'tested02:31
wolfspraulwould you call that a 'design error'? :-)02:31
rohby reworking a rc202:31
wolfspraulso they went back to SMT for rework multiple times (!)02:31
wolfspraulplus reworks on Adam's side02:31
rohuh.. what was the root cause of that? bad smt params?02:31
wolfspraulroh: no, we made mistakes there02:31
wolfspraulno bad smt params02:31
wolfspraulthe smt shop did everything as told02:32
kristianpaulup to how many reworks are acceptable? (which was my concern when asked first)02:32
wolfspraulI'm telling you - design 'weaknesses'02:32
wolfspraulso we find 'oops'. 'they all don't boot'02:32
wolfspraulthat's how it started02:32
wolfspraulbut it's ok, these things can happen in small runs02:32
wolfspraulkristianpaul: theoretically it's infinite, you can rework many times - why not02:33
wolfspraulbut practically maybe not02:33
rohwolfspraul: well.. design is schematic and pcb layout for me. when it comes to parts and the mechanical mounting.. thats 'craft' for me ;)02:33
wolfspraulbecause every rework means new errors get introduced02:33
wolfspraulroh: we definitely ran into design issues02:33
rohwolfspraul: can you be more precise what assumptions were 'wrong' there?02:33
wolfspraulthat's why you hear about fix1, fix2, fix3, fix4, fix2b, etc.02:33
kristianpaulyeah, that's increasing the mix of possible bugs02:33
wolfspraulroh: phew, hard. the reset circuit is really subtle.02:34
wolfspraullots of details, lots of pins02:34
wolfspraulI have no full electrical overview.02:34
wolfspraulsebastien and werner do02:34
wolfspraulit's actually supposed to be 'simple' (ahem)02:34
wolfspraulbut you know how it is02:34
rohmaybe we should collaboratively write a book after that... list all classes of 'errors' and 'stuff to go wrong' ... to help other people do it better or not do the same errors again ;)02:34
wolfspraula little thing wrong there and the board won't boot02:34
wolfspraulso before the boards even left the SMT shop for the first time, Adam was already on the phone trying to tell them "ahh, can you please do the following reworks before sending out: a, b, c"02:35
wpwraki don't think anyone really has a full overview of the current reset circuit ;-) the very non-ideal diode(s) make it rather complicated.02:35
rohwpwrak: what i dont get.. why does it need to be so complicated?02:36
rohdoesnt xilinx provide a proper, simple reset example?02:36
wolfspraulyeah, we already have more 'improvements' for the reset circuit lined up (gates, second reset ic, etc)02:36
wpwrakroh: now it's relatively simple again. but the problem is that the diodes have relatively large capacitance.02:37
wpwrakroh: the issue is that we (think we) need to hold the NOR in reset while power ramps up02:37
wolfspraulcollectively we must have spent 3 months on the reset circuit now02:37
wpwrakroh: (and we hope we don't really need to hold it in reset while power ramps down ... because rc3 doesn't do that :)02:38
wolfspraulAdam went to Xilinx FAE etc. etc. crazy.02:38
wpwrakheh ;-)02:38
kristianpaul(FAE) oh, what they said i don't remember that..02:38
wolfsprauldon't even ask02:38
rohwolfspraul: yes. and it will be a complete waste of time if xilinx does a new revision of the spartan ;)02:38
wolfspraulI don't know02:38
wolfspraultoo many details02:38
wpwrakwolfspraul: (2nd reset IC) actually, that was a bad idea. we need a gate to act as our "diode"02:39
wolfspraulI think about the platform, Milkymist platform.02:39
wolfspraulso once more boards are out, I think our effectiveness in those things will go up.02:39
wpwrakroh: it may not be an FPGA-specific issue02:39
wolfspraulbecause in the end it is a simple circuit02:39
kristianpauloh yes, now adam should learn to use the /ignore nick  command :)02:39
wolfspraulbut we are struggling because we operate with so few people, so few boards02:39
wolfspraulI just need to stabilize the bloody rc3 run, get reliable 100% pass testing results, and start to ship those monsters out :-)02:40
wolfspraulroh: for sure the value of the m1 board is not in its reset circuit02:42
wolfspraulif anything it's in the Milkymist SoC, Flickernoise, case :-)02:42
wolfspraulso I'm not worried that this is a very spartan-6 specific little circuit, that's fully understood.02:42
wolfspraulthese are no 'investments', just cost and nastiness02:43
wolfspraulkristianpaul: in general you want to avoid reworks completely02:44
wolfspraulrework = heat02:44
wolfspraulheat = bad02:44
wolfspraulmake the production process as determined as possible02:44
wpwrakcrispy chips. yumm :)02:44
wolfspraulthe rework heat can cause a nearly infinite number of side-effects in chips, passive parts, the pcb, etc.02:45
wolfspraulwhy is there all this fuss about precise reflow temperature curves?02:45
wpwrak... diodes ... :)02:45
wolfspraulbecause it's so important, even 1 degree makes a difference02:45
kristianpaulah, thats why wpwrak like freeze boards :)02:45
wolfspraulwhether the top is at 246 celsius or 247 celsius...02:45
wolfspraulso think about that02:45
wolfspraulif that is important, how crude a rework is!02:46
wpwrakkristianpaul: naw, that was actually something else :)02:46
wolfspraulit's like a hammer on delicate china02:46
wpwrakkristianpaul: what mystifies me in 0x3a is that we went from read noise to either write noise or stable read failure02:46
kristianpaulgood, you hammer the monster before leave the cave :)02:47
wolfspraulso the best number of reworks is 0. but theoretically there is nothing wrong with reworks either.02:47
wolfspraulsorry, doesn't get more precise than that...02:47
rohwpwrak: well.. maybe its really a broken nor chip. or even only bad soldering for some reason. there are always flukes making sure you are puzzled about the rules02:47
wolfspraulkristianpaul: in terms of life expectancy of a particular board, I don't think you can say in general that a board with 0 reworks has a longer life expectancy than one with 5 reworks.02:49
wolfspraulthe key is the test software here02:49
wolfspraulif the board with 5 reworks passes the test software, and the test software (or process) is good, we can safely assume the life expectancy of that board to be the same as the one with 0 reworks (and also pass the entire test process)02:49
wolfspraulthat's because there may be a small lingering problem in the one with 0 reworks as well, I don't see how reworks in general increase the number of small lingering problems02:50
wolfspraulI have no such data.02:50
rohwell.. heat can reduce the life expectancy of caps, semiconductors and other parts as well.. but i dont think that reworks have more influence than regular production or weird designs.02:52
rohweird meaning e.g. mis-spec-ing a smps and killing the caps over time in the process02:53
rohhappens from time to time on mainboards (exploding caps are not always a 'bad caps' cause)02:54
wolfspraulin terms of life expectancy, there's some interesting stuff in the solder process.02:56
wolfspraulunfortunately more and more consumer electronics are designed for a 2 year or even less life span02:56
rohthat also. i bet that lead-free mania will bite us in the ass at least once ;)02:56
wolfspraulso the process gets optimized towards that02:56
wolfspraulbut there are dramatic differences in life expectancy (say for example temperature impact over time), so if you want to you can solder in a way that will be dozens of times more robust towards temperature cycles02:57
wolfspraulbut one by one02:57
rohmaybe we shouldnt design like that and use that for marketing ;)02:57
wolfspraulwe cannot work on all these details now02:57
wolfspraulit's not the design, it's the soldering process02:57
wolfspraulevery time any part gets hot or cools down again, the different materials expand differently02:59
wolfspraulif you want to manufacture for 20 or 30 year life expectancy, there's a lot of good stuff you can do02:59
wolfspraulbut increasingly the consumer electronics industry moves away from that02:59
wolfspraulso that's only for aviation, cars, medicine, etc.02:59
wolfspraulanyway we are not at that level yet03:00
wolfsprauljust trying to get m1 rc3 out as a good product, that boots and works at all :-)03:00
wolfspraulbut I looked at some data once of a comparison of different solder processes and techniques, and failure rates of temperature cycles (say up to 60 and back down to 30).03:01
wolfsprauland the differences were huge03:01
wolfspraulso after 500 cycles, 1000 cycles, 2000 cycles. some may have failure rates of 30-40%, and other processes maybe only 1%03:02
wpwrakso, how's the battle going ? cluster got smaller ?10:35
wolfspraulI don't dare to ask :-)10:37
wolfspraulsometimes gotta give people time...10:37
wpwrakah, i thought you had a little window monitoring adam's vital functions, heart rate, blood pressure, transpiration, ... :)10:40
wolfspraulsome things I read on the wiki do not look that great10:41
wolfspraulsearch for fix2b10:42
wolfspraul0x32 is still not right somehow10:42
wolfspraul0x34 and 0x39 are ok10:43
wolfspraul0x3A has the nor problems we observed yesterday10:43
wolfspraul0x3C, hmm10:43
wolfspraul0x40 is good, but then: 0x48 (!) what's that?10:44
wolfspraulcannot configure after 2nd rendering!10:44
wpwrak0x34 and 0x39 were already good yesterday, no ?10:44
wolfspraulthere is more stuff we have to find out about10:45
wolfspraul0x3C - strange (like 0x32?)10:45
wolfspraul0x48 is really bad10:46
wolfspraulbecause that means we still have boards that fall back from rendering to unreconfigurable, even after fix2b10:46
wolfspraul0x54 is good, 0x55 could be something with the nor chip (=ignore)10:47
wolfspraul0x5C is good10:47
wolfspraulthat seems to be all fix2b results so far10:47
wolfspraulwe need to look into 3C and 4810:47
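The fix2b results read off the wiki above can be grouped into clusters with a few lines of code. The board IDs and outcomes are taken from the messages above; the cluster labels ("good", "unclear", "nor", "regressed") are shorthand invented for this sketch.

```python
from collections import defaultdict

# Outcomes as reported in the discussion; labels are this sketch's shorthand.
FIX2B_RESULTS = {
    "0x32": "unclear",    # "still not right somehow"
    "0x34": "good",
    "0x39": "good",
    "0x3A": "nor",        # NOR problems observed the day before
    "0x3C": "unclear",    # "strange (like 0x32?)"
    "0x40": "good",
    "0x48": "regressed",  # cannot configure after 2nd rendering
    "0x54": "good",
    "0x55": "nor",        # possibly the NOR chip
    "0x5C": "good",
}


def cluster(results):
    """Group board IDs by outcome label, keeping the boards sorted."""
    clusters = defaultdict(list)
    for board, label in sorted(results.items()):
        clusters[label].append(board)
    return dict(clusters)


if __name__ == "__main__":
    for label, boards in cluster(FIX2B_RESULTS).items():
        print(f"{label:10s} {len(boards)}: {' '.join(boards)}")
```

Grouped this way, five of the ten boards are good, and 0x48 stands alone as the only board that fell back after rendering.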
lekerneldelays delays delays delays delays delays10:48
wpwrak0x54 is good, no ?10:49
wolfspraulthe test results are very clear, that's good. I'm the eternal optimist.10:49
wolfspraulyes, perfect10:49
wpwrakah, 0x55 was the NOR .. checking ...10:49
wolfspraulI think we can ignore 0x55, not important in our quest for a stable design and reliable test process for 100% pass boards10:49
wolfspraulI would look at 0x48 first, really dig in there. because that board regressed!10:50
wolfsprauland then 0x3C maybe, if needed10:50
wpwrak0x55 seems bad, yes. maybe we have a NOR cluster now. but let's see when adam is through with fix2b10:50
wolfsprauldon't worry about problems with the nor chip per se10:50
wpwrakrc2 used the same NOR chips as rc3 ?10:50
wolfspraulthere are no problems with the nor chips10:51
wolfsprauleven if there are, they are easily replaced and done10:51
wolfspraulwe are not debugging nor chips10:51
wpwraki was thinking of the interaction with the FPGA. it's a fairly complex process. FPGA apparently needs to read the NOR's configuration data, etc.10:51
wolfspraulnah. we have way too many working boards to suspect a design issue there.10:52
wpwrakcan be borderline parameters10:52
wolfspraulif we knew our design and test process was 100% stable, we would replace the nor chip on 0x55 and most likely it would pass then.10:52
wolfspraulI would not look at 0x55, waste of time imho.10:53
wolfspraul0x48 and 0x3C are interesting10:53
wolfspraul(and maybe more later since Adam is not finished yet)10:53
wpwrak0x55 is scary, yes. i'd rather look at 0x3a :)10:53
wpwrak0x3a looks as if one could figure out what's going on. and it somehow almost worked in the past. so if the NOR problems have a common cause, that may provide some clues.10:54
wolfspraul0x48 is my favorite10:55
wolfspraulcrystal clear test path10:55
wolfsprauleverything picture perfect, but then10:55
wolfspraullet me check 0x3A...10:55
wpwrakbut .. the next tests would be harder to make: write synthetic patterns, check them on the bus, read them back, etc. not the things adam usually does. well, when you send me my M1(s) maybe include 0x3a :)10:55
wolfspraulahh. 0x3A never booted before.10:55
wolfspraulI'm not so interested in those (maybe a mistake).10:56
wpwrakyes, it never booted. that's the fly in the ointment :)10:56
wolfspraulI don't suspect a big problem with the design.10:56
wolfspraulwe made rc1, rc2, etc.10:56
wolfspraulthat's all fine10:56
wolfspraulit must be something small, like we already fiddled with the reset circuit 3 times now.10:56
wpwrakthe reset circuit looks good on 0x3a10:56
wolfspraullet me read 0x3A notes carefully10:56
wolfsprauloh wait10:57
wolfspraul0x3A is the one from yesterday!10:57
wolfspraulno - not go back to that :-)10:57
wolfsprauljust replace the nor chip (we have no spares right now so cannot try)10:57
wolfspraulI'm 80% sure after replacing nor chip it works10:58
wolfspraulnot so interesting10:58
wolfspraulhow about 0x3C ?10:58
wolfspraulthat's exactly like 0x32, with the 'pulses' etc.10:58
wpwrakmmh, i think replacing the NOR is too radical. you may just mask a real problem. i wouldn't replace the NOR of 0x3a before a) checking the data that goes in really gets corrupted before it comes out again and b) verifying the signal timing.10:59
wolfspraulthe difference is that 0x32 never booted before, but 0x3A did10:59
wolfspraulsorry I meant 0x3C did10:59
wolfspraulno really, no more time into 0x3A10:59
wolfspraulit's not worth it10:59
wolfspraullook at the difference between 0x3C and 0x3211:00
wpwrak0x3a didn't boot11:00
wolfspraulthey have pretty much the same state now11:00
wpwrakah, 0x3c :)11:00
wolfspraulthose crazy nor bit corruption searches take huge time and don't help us in the big picture with the run11:00
wpwrak0x32 has a long patient's history :)11:00
wolfspraulwe are not fixing every board here11:00
wolfspraulwe are only trying to come up with a stable design and reliable test process (!)11:00
wolfspraulso that we can start sending boards out11:01
wpwrak(nor a waste of time) dunno. i wouldn't be so quick to assume that the chips just go bad randomly.11:01
wolfspraulof course I understand the _real_ bug may hide anywhere...11:01
wolfspraultrue, but we have lower hanging fruits11:01
wolfspraulcompare 0x3C and 0x3211:01
lekernelby the way, have you tried assembling one complete unit already?11:01
wpwrakyeah, looking at 0x3211:01
lekernelwith case, box, etc.11:01
wolfspraulsure everybody has 1 unit, I think Adam too (his own)11:02
lekernelwith the case and the box?11:02
wolfspraulbut not from 0x30 on and higher11:02
wolfspraulno probably not11:02
wolfspraulyou worried it won't fit? :-)11:02
lekernelyes. given that absolutely everything in this run has gone wrong in one way or another, there could be surprises there as well11:02
wolfspraulnah it will fit. I'm not getting distracted on that now.11:03
wolfspraulfix2b is a big step forward, looking at today's results11:03
wolfspraulbut not 100% yet, it seems11:03
wolfspraulI don't want to trample over test results and ignore them etc.11:04
wolfspraulnot good11:04
wpwrakwow. 0x32 is crazy.11:04
wolfspraulwell, read 0x3C now :-)11:04
wolfspraulfrom the boards that we have fix2b results for so far, I would look at 0x48 first11:06
wolfspraultp36/37 is good, but it won't reconfigure currently (after rendering before)11:07
lekernelwolfspraul, maybe you should ship problem boards around (including to a Xilinx FAE) so people can look at them in parallel?11:09
wpwrakfor 0x32 and 0x3c, the next thing to analyze would be to bring R60 back. if that doesn't help, try without D16 (without D16, the board is a likely NOR corruption candidate, though. so not for normal sale)11:09
wpwraklekernel: yes, wolfgang seems to have a few boards he's already given up on. i think he could let these out.11:10
wolfspraulno, I think that will be the ultimate delay producer11:11
wolfspraulthe quality and consistency of our test results would go down11:11
wpwrakworst case: something is found that fixes all these boards but is hard to apply in the field11:11
wolfspraulI made that mistake with rc2, so no way I'm going to make it again :-)11:11
wpwrakwolfspraul: rc2 went to people who didn't even turn it on :)11:11
wolfspraulif we do that we will not sell any rc3, so I won't do it11:11
wpwraklekernel: i think he just doesn't want to spend money on fedex :)11:12
wolfspraulwpwrak: for 32/3c, bring R60 back involves bringing the long wire back as well?11:12
wpwrakwolfspraul: no no. just solder one resistor to an existing footprint11:12
wolfspraulah ok11:12
wpwrakwolfspraul: r60 was removed as part of fix2, to lower the current on the reset chip a little11:13
lekernelby doing that we had mwalle fix the video chip (Adam and I failed), as well as me fixing the intermittent video-in failure and audio output noise11:13
wpwrakwolfspraul: but with fix2b, we're already nicer to the reset chip, so ...11:13
wolfspraullekernel: no it wouldn't work. it would be the end of rc3.11:14
wolfspraulI will not do it.11:14
wolfspraulwe need to be able to look across multiple boards.11:14
wolfsprauland if they are in different locations with different people the consistency will completely break down.11:14
wolfspraulof course all sorts of random results will pop up, and the general quality will go up11:14
wolfspraulthen we can try in rc4 what the results are :-)11:14
wolfspraulthat was what rc2 was for and we didn't do that well in this system11:15
wolfspraulnot again11:15
wolfspraulI am not writing off rc3, no need.11:15
wolfspraulfrom the fix2b results so far, the only one that pops out is 0x4811:16
wolfspraulthat one is not right11:16
lekernelthe time sinks in rc3 are: protection circuit, counterfeit buffers, and now flash/reset circuit11:16
wpwrakmaybe fly sebastien to taipei ? make R&D division do double shifts ;-)11:16
wolfspraulbut it's only one board, so I suggest to wait until Adam finished the entire fix2b plan11:16
lekernelvery little of that can be attributed to our shipping of rc2 boards around11:16
wolfspraulahh :-) you can ask Adam later, without politics for an honest answer. we have seen 'similar' problems like the one we are dealing with here on rc2.11:17
wpwrakwolfspraul: so your plan is, if there aren't a lot more 0x48, just consider them outliers and go ahead ?11:17
wolfspraulbut because of the way we sent rc2 out, we lost focus and consistency to get to the root causes and eliminate them for rc3.11:17
wolfspraulthat's my analysis11:17
wolfspraulwe can make that judgment11:17
lekernelwe did eliminate the video in instability and the audio noise11:17
wolfspraul0x48 is tough though11:18
wolfspraullekernel: yes! :-) so those we don't have to worry about now :-)11:18
wolfspraulwpwrak: would you be willing to ignore 0x48 and assume our design is stable and our test process is reliable?11:18
lekernelalso, a lot of successful improvements between rc1 and rc2 were done in a more 'distributed' way11:18
wpwrakwolfspraul: the thing is that, if something needs deep analysis, the current process is very inefficient. so if you can exclude needing deep analysis, then you're right not to spread the work11:18
wolfspraulsending boards anywhere now will delay rc3 sales by at least a month11:19
wolfsprauljust saying11:19
wpwrakwolfspraul: (0x48) i think i'd want to know if the board responds to environmental parameters11:19
wolfspraulI will simply refuse to sell CRAP.11:19
wpwrakwolfspraul: (1 month) i don't think so. pick problems you don't expect to be able to analyze with the current process. then you can only win (well, minus the shipping cost)11:20
wolfspraulso as long as it's crap, I keep improving on it, until it's not crap anymore :-)11:20
wolfspraulnah there are no such problems11:20
wolfspraulI am reading the test results, not speculating or ranting at whatever targets come to my mind.11:20
wolfspraulthe problem we have that stops rc3 sales is very isolated11:20
wolfsprauland _almost_ eliminated11:21
wpwrakagain, the current process is inefficient for deep analysis. it is efficient, though, for things that need broad rework with non-trivial parts.11:21
wolfspraulany result that would come in from anywhere non-taipei delays sales for over a month11:25
wolfspraulI'm still more optimistic.11:25
wpwrakanother problem with the current process is that, if adam makes any systematic mistakes, they may go undiscovered. debugging his workflow is very time-consuming.11:25
wolfspraulplus any board that goes anywhere quickly falls out of the logic with which we can right now still compare boards and group them11:25
wpwrakyes, that's true11:26
wolfspraulyes. so let's have 90 people produce 1 board each :-)11:26
wolfspraulanyway I can say clearly that I have 100% trust in rc3 and our current approach11:26
wolfspraulthat design is good11:26
wpwraknaw, you don't have so many people :) mwalle is on vacation, so lekernel, me, anyone else ?11:26
wolfspraulhere's the procedure: Adam first finishes the fix2b test plan.11:27
wolfspraulI don't want to interrupt him now.11:27
wolfspraulif 0x48 is super isolated, maybe all is good already11:28
wpwrakfinishing fix2b is good, agreed11:28
wpwrak(trust in the process) there's a thing colloquially called "get-there-ism". it's the determination of following one's current path of action to achieve a certain objective, ignoring evidence that this may not be possible. it's a well-known phenomenon in the aviation industry. makes planes crash a few meters from the runway, with empty fuel tanks, because the pilot didn't want to divert.11:30
wpwrak(get-there-ism) this is something to watch out for. it's easy to get caught up in it :)11:31
wolfspraulthese are the ones missing: 0x61 0x63 0x6B 0x6C 0x77 0x7A 0x7D 0x7F 0x8511:31
wpwrakso .. fix2b completion ~saturday evening11:32
wpwrakthen re-clustering11:32
wpwraki dislike those where the TP36/TP37 voltage goes crazy after fix2b again11:33
wolfspraulwe had that yesterday already, no?11:33
wpwrakwith 0x3a ? no. that was rock solid.11:33
wpwrak(0x3a, the TP36/37 results)11:34
wpwrakah, we never got to having a good look at 0x3211:34
wolfspraulman I just think through sending a board to Werner. so painful. the number of dead-end scenarios is staggering. which one? 0x3A?11:38
wolfspraulon 0x3A, Werner would disappear into nor analysis land11:38
wolfspraulhe would never replace the chip because a) he doesn't have a spare b) he needs to study that board because that's the one he has11:39
wpwrakhe'd run a few more tests, that's for sure :)11:39
wolfspraul0x32 ? total guess. what if after 2 hours we find it's some plain and simple problem somewhere, without any relation to the fix2/fix2b issues?11:39
wolfspraulsee that's the problem11:39
wpwrakwhen do the new NOR reach adam anyway ? i think he has ordered some, no ?11:40
wolfspraulof course it would be great if more people would be where the boards are, in one place11:40
wolfspraulbecause then we could parallelize11:40
wolfspraulbut sending a board out? argh11:40
wpwraki'm still confused about 0x32 and 0x3c. TP36/TP37 is inconsistent with what we know so far.11:40
wolfspraulthat's the end, I really think it through11:40
wpwrakerr, TP36/TP37 going wild11:40
wolfspraulthe impossible to answer question is already the first one: _WHICH_ board? :-)11:40
wolfspraulbest would be all 9011:41
wolfsprauland then overnight magically back to Taipei for assembly and packing11:41
wpwrak(which) 0x3a would be a good start :)11:41
wolfspraulafter we have design and test process stable, we will have plenty of boards11:41
wpwrakhehe :)11:41
wolfspraulyeah I know you like that one11:41
wolfspraulNOR study galore11:41
wpwrakyou have to start somewhere :)11:42
wolfspraulbut not on a one-off NOR chip problem11:42
wolfspraulat least not while the rc3 run is in showstopper mode11:42
wpwrakwhat i'd look for is whether the issue is on the NOR side or the FPGA side. i think we can determine this.11:42
wolfspraulyou will get plenty of boards11:42
wpwrakthen it becomes a question of bad chip or bad connectivity11:42
wolfspraulbut now we have to focus on consistent rc3 quality11:43
wpwrakbad connectivity can be probed on the NOR. not so easily on the FPGA (well, you can flex the board a little, see if it gets worse)11:43
wolfspraulotherwise no rc3 can be sold, and I will continue to work on getting them to be good11:43
wolfspraulabsolutely not exhausted yet, just warming up11:43
wolfspraulok 9 more to go11:44
wpwrakbad connectivity could point to an SMT process issue. of course, you wouldn't like to have any such thing pop up :)11:44
wolfspraulwill be interesting11:44
wolfspraulthe pulse thing is nasty because we think we overlooked something in the reset circuit11:44
wpwrakbad chips are easier. once we're sure they're just bad, swap them, new chip, new luck11:44
wolfsprauland 0x48 is nasty because it falls back from rendering to unreconfigurable11:45
wpwrakpulse thing would be which board ?11:45
wolfspraul0x32 and 0x3C11:45
wolfspraul'bad' chip may come from the process11:45
wolfspraulI think we will see more with pulses11:45
wpwrak0x32/0x3c also show regressions. they're regressing to a pre-fix2b behaviour11:45
wpwrak(bad chip) yes, could just be a bad SMT profile11:46
wolfspraulok, I meant regression as in pass the test software, but then fail afterwards11:46
wolfspraulno - not SMT profile11:46
wpwrakyup, 0x48 has a high-level regression11:46
wolfspraulwe are way past that. the design is mostly good, the process is mostly good.11:46
wolfspraulthere are no fundamental issues.11:46
wolfspraulthe schmitt-trigger was a fundamental issue - fixed.11:47
wolfspraulwhat we have now are statistical and manufacturing issues.11:47
wpwrakor maybe the through-hole pass didn't agree with all the components. no idea what this one actually is.11:47
wpwrakschmitt-trigger was the fake part ?11:47
wolfspraulbut our inability to test for 100% good boards means we cannot sell anything!11:47
wolfspraulsome were irregular, yes11:47
wolfspraulwe replaced all 270, done11:47
wolfspraulone thing is funny in this11:48
wolfspraulI just recently realized we should add a few 'render cycles' after the test program.11:48
wolfsprauldon't know why, intuition11:48
wolfspraulI felt uneasy that we never let the board do what it's supposed to do with our users.11:48
wolfspraulwhich is to - RENDER11:48
wpwrakwell, worst case, you can decide to just sell them with the promise to replace all of them in case something major pops up. better than losing rc3 entirely.11:49
wpwrakyeah, end user testing is missing, too :)11:49
wolfsprauland now these render cycles are what give us the most unsettling feedback about our test process, even our design.11:49
wolfspraulgood catch Wolfgang!11:49
wolfspraulI don't mind the unsettling feedback, I can handle that.11:49
wolfspraulno way, you don't know how expensive support is11:49
wolfspraulthe render cycles are a godsend11:50
wolfspraulvery good11:50
wolfspraulit lifts m1 to the next level11:50
wolfspraulI already want to do 1h render testing on each board :-)11:50
wolfspraulor 24h :-)11:50
wolfspraulI'm sure if we would do that, we would find more issues.11:50
wpwraknext you'll want the temperature chamber :)11:50
wolfspraulisn't it funny. if we remove the render cycle test (which we did not have in rc2), we would already sell now :-)11:50
wolfspraulkeep that in mind when complaining11:50
wolfspraulso I think we should not go berserk, no 24h test etc. but we should do a few render cycles, yes.11:51
wolfsprauland we have to handle the fallout.11:51
wpwraki'm not complaining about the meticulous process. i'm merely suggesting that you could widen your bottleneck :)11:51
wolfspraulfly to Taipei11:52
wolfspraulthat widens the bottleneck11:52
wolfspraulyou can be there Saturday, no? :-)11:52
wolfspraulanyway just kidding. sometimes it just needs a bit of relaxation. I will think more about get-there-ism.11:52
wpwrakof course, there's no guarantee. e.g., further analysis on a problem board could be inconclusive, the board may suffer additional failures on its journey, just the shipping may take too long for the results to be meaningful, etc.11:52
wolfsprauloh sure, it wouldn't help with the rc3 sales showstopper problem at all.11:53
wolfspraulit would be nice for rc4 though11:53
wpwrak(fly to tpe) heh, i'd also want my lab :) some of adam's equipment is pretty marginal.11:53
wolfspraulit would even harm the rc3 showstopper resolution because it takes valuable data from Adam (from a consistent overview)11:53
wolfspraulso I'd rather pick 'safe' boards for this strange exercise, which defeats the purpose already. bottom line: it doesn't work.11:54
wolfspraulI've been in too many runs and tried too many things.11:54
wpwrak(overview) only if he ever plans to return to those boards for analysis.11:54
wolfspraulmost important is consistency (for what we are trying to improve now).11:54
wpwrakmy view is simply that, before you've isolated a problem, you don't know whether it's a one-off or something systemic. the current approach of having lots of boards is good for common problems. you get a lot of data, can do clustering, etc., and you can modify a lot of boards and gather many new results. very useful.11:57
wpwrakhowever, when you run out of these big clusters, then you need to track down seemingly individual problems. and there, the mass analysis approach doesn't scale.11:58
wpwrakso you really have distinct phases: first, get the lay of the land. second, examine the widespread issues and apply the corresponding mass cure. go back to step one and repeat until things settle.11:59
wolfspraulfix2b so far: 0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / 0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / 0x5C good11:59
wolfspraulthat's my grouping11:59
wolfspraulvery good results so far!11:59
wolfspraulremember these are all boards that had problems before12:00
wolfspraulif people don't see the progress, well, sorry, I cannot help12:00
wolfspraulhuge progress12:00
wolfspraulfix2b is fantastic12:00
wpwrakphase 2: hunt down the ones that are weird. then see if this yields anything to apply to the rest. e.g., new, targeted experiments.12:00
wolfspraula life saver12:00
wpwrakyeah, fix2b is good12:00
wpwrakbut ... 0x3a and 0x3c worry me12:00
wolfspraulthose 2 pulse boards are strange12:00
wpwrakerr, 0x32 and 0x3c12:00
wolfspraulthey charge us12:00
wolfsprauland 0x48 is an insult12:01
wolfspraulbut we get to it, really12:01
wpwraki don't like boards to regress to a pre-fix2b state. that's not supposed to happen :)12:01
wolfspraulI don't know which board to send where now, it just doesn't work. I hope for some understanding.12:01
wolfspraullet's wait for the other 9 boards12:01
wolfspraulbut those results are really good. all of those boards didn't work before - keep in mind.12:01
wpwrakwell, you can think a bit about which boards you'd want to send where :)12:01
wolfspraulwe took them all out of the fail pool!12:02
wpwrakpity that the weekend is near. so unless you make a quick decision, you lose 1-2 days of fedex transit. but i can see the scheduling conflict, also with adam.12:02
wpwrakhe really needs an assistant :)12:02
wolfsprauloh of course, we think about that carefully. and there will be enough selection. but I need to suck out anything valuable from the perspective of the entire run and yield first.12:03
wolfspraulI did not do this well on rc2, plain and simple. over-excited I guess.12:03
wolfspraulmy problem, but this time I fix it.12:03
wolfspraulalso if not, rc4 would bankrupt me :-) (I wouldn't do it unless rc3 was under control)12:03
wolfspraulAdam knows this too, we have to catch more cases here, otherwise the next run will totally blow up.12:04
wolfspraulwe should not forget - Adam has far more production experience than we do. many runs of many thousand units, even some runs with millions I think.12:04
wolfspraulso it's not like we tell him - solder here, solder there. our solder monkey. Adam knows what is needed to get the _manufacturing_ quality (yield, efficiency, etc) up.12:04
wpwrakwell, he currently does operate a bit in the solder monkey way. and i agree, that's not a very efficient thing to do.12:05
wpwrakthat's again some cost of the centralized approach. he has all the boards, but he's also got the only hands available for solder monkeying.12:06
wolfspraulyes but I'm saying we (Adam and me) know what we need to do for rc4, and we will do it because we want a successful rc4.12:06
wolfsprauloh sure, that I agree with. but our resources are limited there. no need to argue with me that it would be better to have more people for this, in Taipei.12:07
wolfspraulor wherever the run is. the testing needs to be fast and efficient and in one place.12:07
wpwrakbtw, here's a nice picture from 0x3a: http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x3a_ch1-FLASH_RESET_N_ch2-INIT_B.JPG12:15
wpwrakthose narrow drops on CH2 are probably configuration retries, after hitting a CRC error12:16
wpwrak(CH2 is INIT_B)12:16
wpwrakthis is a fairly distinct pattern. something that can help to classify problems, in case we see this anywhere else.12:17
wolfspraulwpwrak: here's 0x3C 'pulsing' http://downloads.qi-hardware.com/people/adam/m1/pic/rc3_0x3c_ch1-tp37.JPG12:17
wpwrakah :)12:19
wolfspraulhere's the text for it http://en.qi-hardware.com/mmlogs/milkymist_2011-08-16.log.html#t06:1112:19
wpwrakinteresting voltages. 2V :)12:19
wpwrakwould need to see TP36 for a better picture. real trouble is probably there, TP37 just follows it12:20
wolfspraulif you look at the 0x3C testing notes, this is after fix2b applied12:21
wolfspraulmaybe another malfunctioning part in the same circuit12:22
wpwraklooks like one of those fix2b regressions. maybe adam really needs a new soldering iron :)12:22
wolfspraulif it's a malfunctioning part that's no problem at all actually12:22
wolfspraulthen the validity of fix2b still stands12:22
wolfspraulwe don't even need to look into it in that case12:22
wolfspraulquestion is whether we want to make that assumption :-)12:23
wpwraki think fix2b is valid, no matter what else we observe. it undoes an unnecessary extension of the reset circuit.12:23
wolfspraulso I suggest 0x48 first (but before even that wait for full fix2b results)12:23
wolfspraulthings really look good, I am not worried right now.12:23
wolfspraulthat's all I can say and now I get some tasty dinner! :-)12:24
wpwrakwhat bothers me is that we see so many weird effects on D16. here's a "test plan":12:24
wpwrak- if the voltages are "weird", scope TP36 and TP37 and archive the screenshot12:25
wpwrak- inject current from a 3.3 V source into TP36 and measure how much current flows12:25
wpwrak- if the current is low, add D60, and try again12:26
wpwrak- if the current is high (> 1 mA), stop for further headbanging12:26
wpwrakwell, or make this 100 uA even12:28
wpwraksorry, no D60. R60.12:28
wpwrakwell, if the current is high, remove C238, then try again12:30
wpwrakremoving C238 can cause FLASH_RESET_N to PROGRAM_B contamination, but that should be relatively benign12:30
wpwraki.e., if i understood lekernel correctly, what would happen is that, if you try to command a "software reset" (from the GUI or such), the M1 would just shut down.12:31
wpwrakheh, if we remove the "software reset" feature, we could even connect PROGRAM_B to FLASH_RESET_N, throw away C238 and D16 for good ;-)12:32
wpwraksometimes, all a puzzling knot needs is a good sword ;-)12:33
wpwrak(but don't try this - you'd also have to change the use of P22. else, you could drive P22 high into reset out low. not sure what happens then.)12:34
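The TP36 test plan above, including the two corrections (R60 rather than D60, and removing C238 when the current is high), can be read as a small decision procedure. The following is only a sketch of that reading: the `Action` names and the `next_step` function are hypothetical, and the log does not say what to do if the current is still high after C238 has been removed, so stopping for further analysis at that point is an assumption.

```python
from enum import Enum


class Action(Enum):
    NONE = "voltages look normal, nothing to do"
    ADD_R60 = "current is low: add R60 and try again"
    REMOVE_C238 = "current is high: remove C238 and try again"
    INVESTIGATE = "still high without C238: stop for further analysis (assumed)"


# 1 mA is the threshold named in the plan; 100 uA was floated as a stricter option.
HIGH_CURRENT_A = 1e-3


def next_step(voltages_weird, injected_current_a, c238_removed=False,
              threshold_a=HIGH_CURRENT_A):
    """Pick the next bench action for one board.

    voltages_weird     -- True if TP36/TP37 looked abnormal (scope and archive first)
    injected_current_a -- current measured when feeding 3.3 V into TP36, in amperes
    c238_removed       -- True if C238 has already been taken off this board
    """
    if not voltages_weird:
        return Action.NONE
    if injected_current_a > threshold_a:
        return Action.INVESTIGATE if c238_removed else Action.REMOVE_C238
    return Action.ADD_R60


print(next_step(True, 2e-3).value)
```

Passing `threshold_a=100e-6` applies the stricter 100 uA limit mentioned as an alternative.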
lekernelif giving up the software reset on the rc3 boards prevents those already huge delays from growing up even further, i'm for it12:42
wpwraklekernel: i was just waiting for you to say this ;-))12:43
wpwrakanyway, up to and including C238 removal, i think the above test plan looks reasonable, doesn't it ?12:44
wpwrakif we end up with C238 removed, we can then figure out what to do about it12:44
wpwrakwolfspraul: background / memory refresh: C238 protects PROGRAM_B from falling edges on INIT_B or FLASH_RESET_N propagating through the diodes. a falling edge on FLASH_RESET_N can happen (only ?) when a software reset is commanded, e.g., through the GUI. in this case, propagation into PROGRAM_B would also reset the FPGA, which according to lekernel just shuts down the M1. (why does it shut down and not just reconfigure ?)12:47
wpwrakwolfspraul: INIT_B would drop when there's a CRC error. so contamination of PROGRAM_B would create a feedback loop, where each failed try to configure would reset the FPGA. that sounds undesirable. with fix2b, we already remove INIT_B from the equation, leaving only the much friendlier FLASH_RESET_N connection.12:51
lekernelwpwrak: reconfigure to standby bitstream = shutdown12:55
lekernelthe nasty problem we may have here, though, is that the reset pulse may not be long enough for the flash12:56
lekernelso that can become another headache12:56
lekernelbecause as soon as the fpga is deconfigured by program_b, the reset will be deasserted immediately12:56
wpwrakwhy does reconfigure to standby mean shutdown but initial configuration to standby means that the system starts ? how do the two paths diverge ?12:57
wpwrak(glitch on flash reset) yeah, could be tricky. the NOR wants at least 100 ns.13:01
wolfspraulno even in initial configuration, it ends with the standby bitstream and you have to press the middle button to actually boot further (start)13:05
wolfspraulman we need to get one of those boards to you :-)13:06
wolfspraulhow about one of the good ones? including fix2b. 0x34 ?13:07
wolfspraullet me check the history of that one13:07
wolfspraulyeah looks perfect. a typical rc3 story :-)13:07
wolfspraulI had an evil thought on 0x48: nor corruption after first power-down. in that case we may have to try the 4.4v reset ics...13:10
wolfspraulbut we see later what we find13:10
wpwrak(middle button) aah, now it makes sense, thanks :) and now i also understand why it's called "standby" bitstream :)13:10
wpwrak0x48. yes, that would be a possibility. bring it up and read back the NOR. we now know that urjtag works :)13:11
wolfspraulthat'd be the worst case. nor corruption on 0x48 requiring a reset ic rework on the entire run :-)13:12
wolfsprauland then who knows it may not even fix the nor corruption... well, think positive.13:12
wpwrakwolfspraul: actually, you could go to taipei to help adam ;-) alas, it seems that you'd then have to relocate rejon as well13:12
wolfspraulwe did see a nor corruption on rc2 (xiangfu), and also on 0x3A (unexplained, I'm just leaning towards 'replace nor chip' right now)13:13
wpwraki don't like these "replace the chip" operations. at least not without having isolated the fault. otherwise, you just roll the dice and you have no idea where they fall.13:15
wolfspraulyes and no. as long as it's efficient it may scale up well into the thousands of units.13:15
wolfspraulour difficulty is our own uncertainty into the design and our test process13:15
wolfspraulthat complicates things13:15
wolfspraulnow we have too many unknowns13:16
wolfspraulso we cannot effectively kill the bugs13:16
wpwrakand of course, if the problem is anywhere NOR-related, you may very well make it go away. e.g., by eliminating all the parts with tolerances in the region of the bell curve the design doesn't cover :) (and, of course, in the next run, you'll run into more of the same again)13:16
wolfspraulif at the same time you question your design, your test process, and the chips, what then?13:16
wolfspraulwe need to get the design and test process off the table first13:17
wolfspraulno matter what13:17
wolfspraulthe design and test process must be of unquestionable standard13:17
wpwrakthat's when you need systematic analysis :) yes, you may waste your time on random freak accidents. but chances are there's more to these things.13:17
wolfspraulotherwise we can never manufacture effectively13:17
wpwrakthe test process is a separate issue13:17
wpwrakright now, we're still trying to find causes13:18
wolfspraulgood news 0x61 0x63 also good13:30
wolfspraulfix2b so far: 0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / 0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / 0x5C good / 0x61 good / 0x63 good13:31
wolfspraul7 good, 2 pulse, 2 nor (my grouping), 1 render then fail13:32
wolfspraul7 more to go: 0x6B 0x6C 0x77 0x7A 0x7D 0x7F 0x8513:33
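The per-board status line is regular enough to be tallied mechanically. A minimal sketch, assuming the "board status / board status" layout used in the line above (the helper name `parse_status` is made up for illustration):

```python
from collections import Counter

STATUS_LINE = (
    "0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / "
    "0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / 0x5C good / "
    "0x61 good / 0x63 good"
)


def parse_status(line):
    """Map board serial (as int) to its status string."""
    boards = {}
    for entry in line.split("/"):
        board, _, status = entry.strip().partition(" ")
        boards[int(board, 16)] = status
    return boards


boards = parse_status(STATUS_LINE)
tally = Counter(boards.values())
for status, count in tally.most_common():
    print(f"{count:2d} {status}")
```

For the twelve boards listed so far this gives the same grouping as stated: 7 good, 2 pulse, 2 nor, 1 render fail.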
wolfspraulthe good news about 0x48 is also that it failed on the second render cycle, i.e. after the first power cycle13:35
wolfspraulI like that much better than failing on the 6th or 9th one as we saw before13:35
wpwrak(0x61, 0x63) great !13:50
wpwrak(0x48) will be interesting to see the NOR dump13:50
aw_(0x6B, 0x6C) good14:16
wolfspraulaw_: great!14:17
aw_let's dump 0x48 now14:17
wolfspraulthere you go14:20
aw_needs 5 minutes to dump. :)14:23
wolfspraulaw_: yes but the reading seems to work14:23
wolfspraulso 0x48 cannot reconfigure now?14:23
wolfspraulmaybe after the dumping you try to boot, just to see whether it's still stuck somewhere (cannot boot)14:24
aw_it's said that but I've never calculate it14:24
wolfspraulaw_: we followed the wiki a bit today, excellent work!14:24
aw_0x48 is quite similar to 0x3a, which we did yesterday14:24
wolfspraulfix2b so far: 0x32 pulse / 0x34 good / 0x39 good / 0x3A nor / 0x3C pulse / 0x40 good / 0x48 render fail / 0x54 good / 0x55 nor / 0x5C good / 0x61 good / 0x63 good / 0x6b good / 0x6c good14:25
wolfspraul9 good, 2 pulse, 2 nor (my grouping), 1 render then fail14:25
wpwraki like the trend in the last few :)14:25
wolfspraul5 more to go: 0x77 0x7A 0x7D 0x7F 0x8514:25
aw_so even d2/d3 is still dimly lit and make sure tp36 and tp37 is fully pull high , also init_b is okay, then it tried to enter reconfiguration stage14:26
wolfspraulaw_: maybe I misread the 0x48 notes? the 0x48 notes say that this board rendered, and then failed after the first power cycle?14:26
wolfspraulaw_: yes but did 0x48 render before?14:26
wpwrak(5 to go) kewl. that was really a productive day.14:26
larscwpwrak: do you happen to know any rtc chips, which could be used for the milkymist?14:28
aw_wolfspraul, the 0x48 has never rendered successfully before.14:28
wolfspraulthat's good14:28
wolfspraulthen I misunderstood the notes, one sec14:28
wolfspraulaw_: 0x48 notes are saying "5. applied fix2b 6. D16(in-circuit): For.V.=152mV, Rev.V = 1548mV 7. d2/d3 is fully off after power on 8. reflashed successfully 9. cant reconfigure @2nd rendering, tp36/tp37 is 3.3V "14:29
wolfspraulsee 9. can't reconfigure @2nd rendering14:29
wpwrakso "@2nd rendering" really means "@2nd power cycle" ?14:29
aw_sorry, i should say that in the first round test, it never rendered before14:30
wolfsprauldid the test software run?14:30
aw_then after fix2b, can't reconfigure at 2nd power cycle14:30
wolfspraulI don't understand the notes14:30
wolfspraulaw_: did the test software run on 0x48 ?14:30
aw_yes, it's passed in test program14:30
wolfspraulthen after the test software, you power cycle?14:31
wolfsprauland then it doesn't reconfigure?14:31
wolfspraulwell but that's even better than I thought14:31
wolfspraulaw_: how do you feel about fix2b today, and the boards you worked on?14:32
aw_see the last bottom, you can see I've only copied first one time boot up log14:32
Action: wpwrak eagerly awaits sinking his greedy fingers into the dump of 0x48 :)14:32
aw_i have a strange feeling that if a board smoothly and fully passed (including rendering) in the FIRST run (fix2 circuit), then it will also pass the rendering job with fix2b applied14:34
wpwrak(i actually thought of extending my bit error checker to look for algorithmic patterns in the address bits. could be fun.)14:34
wpwrakyeah, if a board was happy with fix2, it should be only happier with fix2b.14:35
wolfspraulaw_: ok I don't fully understand the 0x48 process, but maybe you can update the notes a little with what you remember. maybe like werner said "@2nd power cycle" or "@1st boot to render"14:35
wolfspraulstill strange how 0x48 failed, oh well14:36
aw_good note from werner, I'll do like that14:36
aw_i don't know, just second power-cycle then byebye14:36
wpwrakpassing with fix2 and failing with fix2b could have the following explanations: 1) some rework mistake in fix2b, 2) something was borderline and went over the limit (e.g., a temperature dependency)14:36
wpwrakaw_: how's the 0x48 dump coming along ?14:37
wolfspraulwpwrak: which board?14:37
wolfspraulaw_: I think the results today are super encouraging.14:37
aw_dump done...let me mv14:37
wpwrakwolfspraul: no, in general14:37
wolfspraulwe are on a very good path with fix2b14:38
aw_wolfspraul, yes, no; i still feel something's strange though.14:38
wolfspraulaw_: today, you set 9 boards to 'available' status, that means 90 thirty second rendering cycles. and not a single board failed after the test software, with the exception of 0x48 which failed right after.14:39
wpwrakit seems that the diodes are unreliable. with fix2b alone, we're removing about 50% of the unreliability :) and with the extra testing adam does as part of fix2b, most of the rest as well14:39
aw_tomorrow when I test all cluster batch boards, then back to work with werner to check failed board14:39
wolfspraulyes exactly14:39
wolfspraulbut I am still happy - look at the numbers I just said - because we can now safely distinguish between 100% good and failed boards14:39
wolfsprauland I do trust the ones that are 100% good14:39
wolfspraulthey are stable and good and will stay like that14:39
wpwrakyes, looks pretty good now14:40
wolfspraulwe can do another 10 render cycles on them in 2 days to verify14:40
wolfspraulnever say never14:40
wpwrakheh :)14:40
wpwrakwait for a hot day14:40
aw_wpwrak, it could be the diode, but let's leave this for tomorrow to check, since one terminal of d16 was soldered twice: once for my fix2, and once to take it apart for fix2b.14:40
wolfspraulaw_: yes we notice, sure. 0x3C, 0x48, 0x5514:40
wolfspraulthere are still problems14:40
wolfspraulthis bloody diode has to go in rc4 :-)14:42
wpwrakwolfspraul: ah, and the reflashing may need some clarification: is the reflash script as adam uses it supposed to do a verification (e.g., CRC) ? because it appears that it doesn't do this14:42
wpwrakyeah, the diode is evil14:42
wolfspraulthe problem is that jtag verification is too slow14:43
wolfspraulcrazy slow14:43
Action: wpwrak waiting for the dump14:43
wolfspraulthat's why we added crc checks to the test software14:43
wpwrakwolfspraul: hmm. okay, so writing with urjtag is unreliable. okay.14:43
wolfspraulnot unreliable14:43
wolfspraulverification is too slow to be practical (30 minutes or more)14:43
wpwrakwolfspraul: unchecked = unreliable ;-)14:43
wolfsprauldon't know why, we can easily enable it14:43
wolfspraulbut then it's crazy slow14:43
wolfspraulso we check crc in the test software, which runs right after urjtag14:44
wolfsprauland the results are logged14:44
wpwrakwolfspraul: that doesn't make sense ;-) if write + read is faster than write + verify, something doesn't add up :)14:44
wolfspraulI'm sure there are inefficiencies14:44
wolfspraulbut if we enable 'verification' now in the script, it's crazy slow14:44
wolfspraulso we moved the crc check to the test software instead14:44
wolfspraulread is also slow14:45
wolfspraulas you can see right now14:45
wolfspraulreading the entire 32 megabytes takes over 4 hours14:45
wolfspraulthe only thing urjtag is fast at is unverified writing14:45
wolfspraulas of right now14:45
aw_wpwrak, http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/bitstream/0x48-standby1.bit/14:45
wolfspraulthat's why there is a crc check in the test software14:46
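The idea of checking flash contents by CRC instead of a slow JTAG verify can be sketched as follows. This is not the M1 test software: the CRC variant (zlib's CRC-32), the partition names, and the offsets are all assumptions chosen for illustration, and the "flash" is a synthetic 64-byte buffer.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """CRC-32 as computed by zlib (assumed variant, not confirmed by the log)."""
    return zlib.crc32(data) & 0xFFFFFFFF


def verify_partitions(flash, partitions, expected):
    """Return {name: True/False} for each (offset, length) region of the image."""
    return {
        name: crc32_of(flash[offset:offset + length]) == expected[name]
        for name, (offset, length) in partitions.items()
    }


# Synthetic image with two hypothetical partitions.
golden = bytes(range(64))
partitions = {"standby": (0, 32), "bios": (32, 32)}
expected = {n: crc32_of(golden[o:o + l]) for n, (o, l) in partitions.items()}

# Simulate one 16-bit word being zeroed out inside the first partition.
readback = bytearray(golden)
readback[10:12] = b"\x00\x00"
result = verify_partitions(bytes(readback), partitions, expected)
print(result)
```

A check like this only covers the regions it is given, so anything outside the listed partitions stays unverified.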
wpwrakwhee, six 1 -> 0 transitions !14:46
wpwrakformat is bit: #to0 #to114:47
wpwrakwhere #toX is "number of transitions from !X to X"14:47
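The "bit: #to0 #to1" format can be illustrated with a small comparison of a reference image against a read-back dump. This is a sketch, not the actual bit error checker: the function name, the 16-bit word width, and the big-endian byte order are assumptions, and the data is synthetic.

```python
def bit_transitions(expected: bytes, actual: bytes, word_bytes: int = 2):
    """Count, per bit position, how often a bit went 1 -> 0 and 0 -> 1.

    Returns (to0, to1), two lists indexed by bit number within a word.
    """
    width = 8 * word_bytes
    to0 = [0] * width
    to1 = [0] * width
    n = min(len(expected), len(actual))
    for i in range(0, n - word_bytes + 1, word_bytes):
        e = int.from_bytes(expected[i:i + word_bytes], "big")
        a = int.from_bytes(actual[i:i + word_bytes], "big")
        cleared = e & ~a      # bits that were 1 and read back as 0
        raised = ~e & a       # bits that were 0 and read back as 1
        for bit in range(width):
            to0[bit] += (cleared >> bit) & 1
            to1[bit] += (raised >> bit) & 1
    return to0, to1


# Synthetic example: the middle word 0x00FC (six 1-bits) reads back as 0x0000.
reference = bytes.fromhex("ffff00fc0000")
dump = bytes.fromhex("ffff00000000")
to0, to1 = bit_transitions(reference, dump)
for bit, (n0, n1) in enumerate(zip(to0, to1)):
    if n0 or n1:
        print(f"{bit:2d}: {n0} {n1}")
```

In this made-up case, one zeroed word shows up as six 1 -> 0 transitions and no 0 -> 1 transitions.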
aw_which bit you think?14:48
aw_sorry don't understand your format. :)14:48
wpwrakone word was zeroed out for some reason14:49
wpwrakso this is quite different from 0x3a14:50
wolfspraulthat could finally be a software bug! :-)14:50
wpwraki'd re-write, read back without power-cycling, then power-cycle and see what happens14:50
wolfspraulwe are moving uuupppp!14:50
wpwrakyes :)14:50
wolfspraullekernel: man this is great!14:51
aw_wpwrak, yes, different from 0x3a though14:51
wolfspraulthe nor becomes so stable now that we can see actual software bugs making it all the way back in (well, likely software bugs)14:51
wolfspraulI think that's good news14:51
wolfspraulthe hardware becomes stable...14:51
lekernelwhat kind of software bug?14:52
wpwrakaw_: yes, very different. 0x3a has all the errors on the same bit and scattered over many many addresses14:52
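The two error signatures contrasted here (one zeroed word versus the same bit failing at many addresses) can be told apart automatically. A rough sketch with made-up labels; the 16-bit big-endian word layout is an assumption, and real dumps may well show patterns that fit neither bucket.

```python
def diff_words(expected, actual, word_bytes=2):
    """List (word_index, xor_mask) for every word that differs."""
    diffs = []
    n = min(len(expected), len(actual))
    for i in range(0, n - word_bytes + 1, word_bytes):
        e = int.from_bytes(expected[i:i + word_bytes], "big")
        a = int.from_bytes(actual[i:i + word_bytes], "big")
        if e != a:
            diffs.append((i // word_bytes, e ^ a))
    return diffs


def classify(diffs):
    """Assign a coarse, hypothetical label to a list of word differences."""
    if not diffs:
        return "clean"
    if len(diffs) == 1:
        return "single word"
    masks = {mask for _, mask in diffs}
    if len(masks) == 1 and bin(next(iter(masks))).count("1") == 1:
        return "same bit, many addresses"
    return "mixed"


reference = b"\xff\xff" * 4
print(classify(diff_words(reference, b"\xff\xf7\xff\xff\xff\xf7\xff\xff")))
```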
aw_wpwrak, so you want me to re-write/reflash it again?14:52
wpwrakaw_: if you don't feel too tired, yes please14:52
aw_wpwrak, yes, i noticed that.14:52
wolfspraullekernel: no worries, I was half joking. just extrapolating what it could be...14:52
aw_wpwrak, wait14:52
aw_should we use xilinx tool?14:52
wolfspraulWerner just saw an entire word in nor zeroed out.14:52
wpwrakaw_: and then read back before power-cycling, so that we can see whether the writing was okay14:52
wpwrakaw_: naw, urjtag is fine14:53
wolfspraulaw_: no use urjtag, I trust it14:53
aw_alright...let's use urjtag first. ;-)14:53
wpwrakwolfspraul: it's a bit too early to blame sw. could also be a urjtag glitch for all we know. or powering down.14:54
wolfspraulyes yes sure14:54
wolfspraulI was just expressing my joy14:54
wolfspraula full word!!!14:54
wolfspraulwe are clearly moving upwards14:54
wolfspraulactually, in that theory, it must have happened before flickernoise14:55
wolfspraulbut anyway, just speculation14:55
wolfspraulI don't care much because this was caught by the test process14:56
wolfsprauland safely caught, not at the last second14:56
aw_reflashed done14:59
aw_now dump again.14:59
aw_wpwrak, bad...sorry that I didn't notice that you wanted to dump without power-cycling...15:05
aw_i redo now..sorry15:05
wpwraknaw, take this one then15:05
aw_wpwrak, also okay?15:05
aw_phew~ almost my finger to power off. :)15:06
wpwrakheh :)15:07
wpwrakif 0x48-2 is okay, which is what i'd expect, then the power cycle didn't matter. in case we find an error also in 0x48-2, the power cycle will need investigating.15:08
wolfspraulif it were that easy, we are lucky15:09
wolfspraulwpwrak: it could well be 1 out of 10 power cycles15:09
wolfspraulremember that we are zooming in on troublemakers in a run. whenever you do that your cases get stranger and stranger.15:09
wpwrakwolfspraul: it could be, yes.15:09
wolfsprauldon't forget all the dozens of boards in hundreds of tests that have never shown anything like this15:10
wolfsprauland how we are looking at the one time that we saw this15:10
wpwrakwolfspraul: we also have the risk of an undefined power state in the down ramp in all of rc315:10
wolfspraulyes sure, I know15:10
wolfspraulI am aware of it15:10
wolfspraulbut today, we had 9 boards pass a total of 90 rendering power cycles15:10
wpwraki hope applying locking wherever possible will reduce the risk of the down ramp doing too much damage15:10
wpwrakwould be good to have a CRC check for the unprotected partitions, though15:11
Fallenou /win 1215:11
aw_wpwrak, http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/bitstream/0x48-standby2.bit/15:13
wpwraktry to boot and render :)15:14
wolfspraulaw_: if you do 10 full render cycles + crc checks on 0x48, I think you can add it to avail - fix2b15:15
wolfspraulI wouldn't know why not15:15
aw_d2 is ON..and rendering15:15
wolfspraulbut you can also do that tomorrow, it gets later and later and it was a long day...15:15
aw_yes..i'd go for sleep ...hehe ;-)15:16
wolfspraulI'm thinking whether we should be suspicious about 0x48, but I don't see why15:16
wolfspraulthanks for the excellent work today!15:16
wolfspraulfantastic, really15:17
wolfspraulso many boards15:17
aw_but good that we caught known issue on 0x48 today15:17
wolfspraulvery enlightening fix2b results15:17
wolfspraulwe will do some more thinking15:17
wolfspraulremember this in the notes history...15:17
aw_wpwrak, thanks a lot though.15:17
wolfspraulwe can hold onto 0x48 for a while15:17
wolfspraulbut I am 99% sure there's no problem with 0x4815:18
wpwraka great day indeed !15:19
wpwrakaw_: sweet dreams ! :)15:19
wolfspraulwpwrak: 0x48 is one of those cases that I would/might end up holding back or not selling15:22
wolfspraulI always sell the best things first15:22
wolfspraulbut it's too early to tell. if we find a clear software bug one day then it changes.15:23
wolfspraulI think we can leave 0x48 alone now.15:23
wolfspraulso 0x3C/0x32 are interesting, or maybe the ones that I grouped as 'nor failure' (0x3A/0x55)15:24
wolfspraulwpwrak: do you have any idea for an rtc chip we could add to rc4?15:25
wpwrak0x48 may just be the first one to exhibit a down ramp corruption15:26
wolfspraulvery speculative, almost wishful thinking15:26
wpwrak(rtc chip) no idea :)15:26
wpwraki don't exactly "wish" for down ramp corruption ;-)15:26
wolfspraulno but it's too speculative for me - no reason15:27
wolfspraulwill think more15:27
wolfsprauladam did a great job today, lots of hard data15:27
wolfspraulfix2b looks good, all on track15:27
wpwraki think we may currently have a very low probability of encountering down ramp corruption. maybe it needs a bus access plus the right power drop. a synthetic test may be able to make it happen more often.15:28
wpwrakor maybe down ramp corruption never happens and this was something else15:28
wpwrakmaybe it's a one in a hundred years sw bug :)15:28
wolfspraulhere's an important question: should adam investigate 32/3c or 3a/55 first, or first proceed with fix2b across all 90 boards?15:31
wpwrakhmm, let's give 0x32/0x3c a try first. maybe there's a low-hanging fruit there. in 0x3a, we already know that things are a bit harder.15:32
wolfspraulok, but with time limit probably15:33
wolfspraulI feel good about fix2b across all 90 boards15:34
wolfspraulcalling it a day as well, n8 (reading backlog tmr)15:34
wpwrak0x55 looks worse than 0x3a15:34
wpwrak0x32/0x3c still don't act as a fix2b'ed board should15:34
wolfspraulyes but can 0x55 raise or lower fix2b validity? I doubt it...15:35
wolfspraulsame for 32/3c. just some small problem on those particular boards, nothing to do with fix2b.15:35
wpwrak0x3a and 0x55 don't affect fix2b. 0x32/0x3c might.15:35
wolfspraulthe more reworks we make, the more manual mistakes we introduce into the run, which then have to be fixed again.15:35
wpwrakbut if we find something "interesting" in 0x3a/0x55, it may make sense to include it in the post-fix2b testing, to save time.15:36
wolfspraulok so 32/3c first, I guess15:36
wpwrak(manual errors) yes, that could very well be the problem of 0x32 and 0x3c15:36
wolfspraulonce we can safely assume that, there is no value in looking at them at all, even until after rc3 sales start (not just fix2b verification)15:41
wolfspraulbut let's ping them quickly, see what we find, then decide15:42
wpwrakyeah, we won't know for sure before we've fixed them :)15:45
wpwraki don't expect this to be overly hard15:45
wpwrakC238, most likely15:45
--- Fri Aug 19 201100:00
