#milkymist IRC log for Saturday, 2011-08-13

awlekernel, 0x71: I plugged usb-A port with keyboard and usb-B port with mouse, they can be detected well, but port-A shows 'USB: HC: Transfer start: RX timeout error'; so I swapped keyboard and mouse. that error won't show up, then swapped again, it shows up still. so I let 0x71 be in gui mode, I still can use mouse and keyboard together even swapped them. What does this stand for?07:25
aw0x71: http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/71-results07:26
awI will mark 0x71 as 'X' still though. ;-)07:27
wolfspraulaw: are you trying with the silicone keyboard? try with another keyboard too (either a completely different non-silicone keyboard, or with a second silicone keyboard)07:29
wolfspraulbut definitely X for now07:30
awyes, silicone keyboard07:30
awokay...let me try another silicone keyboard.07:31
wolfsprauland a different keyboard (non-silicone) too, if you have one that works07:33
wolfsprauljust to get some more data07:33
wolfspraulbut the board stays FAIL anyway, so maybe just a waste of time...07:34
awhmmm....the same results after used another silicone keyboard. also swapped in gui 'login' it reacted well responsely to my type though. strange indeed.07:38
awwell...marked 'x' still..next board to test. ;-)07:38
wolfspraulyes, correct. mark 'x' and move on.07:38
wolfsprauldefinitely not pass like this07:39
wpwraktry a shorter cable ? (-:C07:48
aw0x6c interesting histories: 1. TP4 – 174 ohm, 2. current 0.53A normally 3. http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/6C-reflash-results 4. No VGA screen. 5. usb-B 6. d2/d3 always dimly lit after powered-on since finished first test program firstly 7. d2/d3 dimly lit after powered on since replaced u7/u19/u20 couple days later08:22
aw8. reflashed successfully by BEN usb cable: http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/6C-reflash-results-1 9. replaced new u17. 10. reflashed successfully by erase version: http://downloads.qi-hardware.com/hardware/milkymist_one/production/rc3/test_results/6C-reflash-results-e08:22
aw0x6c now is rendering done successfully. ;-) any amazing?08:22
wolfspraulwith this history, you cannot set it to 'available'08:23
awalso no d2/d3 dimly lit although the power-cycle is not too much.08:23
wolfspraulwe need to understand the flash/boot/dimly lit issues first08:23
awsure, i just wanted to say it passed all tests. strange and amazing.08:24
wolfspraulthat's why we need to do more research before starting to sell08:24
awanyway...i just posted here any news i found/tested. ;-)08:26
wpwrakso the USB transceivers are acting up, too ? (U16, U17)12:34
lekernelthe FPGA design has tons of (painful) bugs in the USB "UART", it could just be that part tolerances are tickling them12:36
wolfspraulI don't think I'm worried about those bugs right now12:37
wolfspraulthey are clearly identified, and the boards singled out12:37
wolfspraulwpwrak: you proposed a reset IC with different threshold voltage and supplied from 5V. any candidates?12:37
lekernelby the way, it never ceases to amaze me how people seem happy with the opencores USB "UARTs", for example this one: http://opencores.org/project,usbhostslave12:38
wolfspraulI realize to save time we should order some parts...12:38
lekernel"Works like a champ for me" ... this thing has MORE bugs than my crappy design. I used it at the beginning and had to throw it away because it did not work at all.12:38
wpwrakwolfspraul: (not worried) okay, but what is the trigger for replacing U17 ?12:39
lekernelyou could see that piece of crap obviously misbehaving on trivial corner cases of bit stuffing etc.12:39
lekernelfor example it doesn't bitstuff correctly the last bit of USB packets12:39
lekernelevery USB application note tells you to be careful about that12:39
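The corner case lekernel mentions can be illustrated with a minimal sketch of the USB bit-stuffing rule: the transmitter inserts a 0 after every six consecutive 1 bits, and this applies even when the sixth 1 is the last data bit before end-of-packet. The function names below are illustrative only and are not taken from the Milkymist or OpenCores sources; the sketch works on plain lists of bits and leaves out NRZI encoding.

```python
def bit_stuff(bits):
    """Insert a 0 after every run of six consecutive 1 bits (transmit side)."""
    out = []
    run = 0
    for b in bits:
        out.append(b)
        if b == 1:
            run += 1
            if run == 6:
                # The stuffed 0 is required even if this was the last data bit.
                out.append(0)
                run = 0
        else:
            run = 0
    return out


def bit_unstuff(bits):
    """Remove stuffed bits (receive side); raise on a stuffing violation."""
    out = []
    run = 0
    i = 0
    while i < len(bits):
        b = bits[i]
        out.append(b)
        if b == 1:
            run += 1
            if run == 6:
                # The bit after six 1s must be a stuffed 0, including at packet end.
                if i + 1 >= len(bits) or bits[i + 1] != 0:
                    raise ValueError("bit stuffing violation")
                i += 1  # skip the stuffed bit
                run = 0
        else:
            run = 0
        i += 1
    return out


payload = [0, 1, 1, 1, 1, 1, 1]   # packet ending in six 1 bits
print(bit_stuff(payload))         # stuffed 0 appended after the final 1
```

A transmitter that forgets the trailing stuffed bit would send `payload` unchanged, and a strict receiver like `bit_unstuff` above would reject it.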
wpwrakwolfspraul: for the reset IC, there's a -440 part (4.4 V) of the same chip. that would allow you to operate perfectly within specs.12:40
wolfspraulmaybe Adam should order that right on Monday morning?12:41
wolfspraulwpwrak: [trigger for replacing u17] don't know. when the test fails? :-)12:41
wolfspraulmy feeling from looking at testing results is that (if it's 1 problem only), the root cause is not a simple flash write somewhere12:45
wolfspraulthat wouldn't explain why some boards cannot be reflashed anymore, sometimes for a day or several days, sometimes forever12:46
wolfspraulif it's just 1 problem, it must be some out-of-spec electrical shock/impact on the chip that may sometimes result in corrupt data, sometimes in different kinds of damage12:46
wpwrak(part) U24, instead of A4809E3R-263DN, use A4809E3R-440DN, 4.312-4.488 V12:47
wpwrakyes, there seems to be too much of a connection to power cycling for it to be just some weird writes12:47
wolfspraulalso the boards that end up in "stopped at 'Bitstream length: 1484404' while reflashing" state12:48
wpwrakblock locking may still make the problem "go away", but i wouldn't rely on it12:48
wpwrakthat sounds like USB12:48
wolfspraulnah, too related to prior reconfig or d2/d3 dimly lit problems12:48
wolfsprauland why does it not go away? and the same board could be flashed before?12:48
wpwrakmy guess would be that, if you switch to full-speed, these "stopped at bitstream" things will vanish12:49
wolfspraulok, definitely one of the first tests to do12:50
wpwraki suspect high-speed USB signal integrity issues. you probably have tons of CRC errors you never see. and every once in a while, one slips through and spoils your day.12:50
wolfsprauleven if that is so, it doesn't explain why boards that render eventually experience flash problems and then eventually end up unflashable12:50
wpwraki think it's unrelated12:51
wolfspraulbut that's why I think the high-speed CRC problem, if it exists, is already contained now12:51
wolfspraulbecause once the flash is written, and the crc checks of the test software pass, we are behind this potential failure case12:51
wolfspraulthat doesn't explain why at some later point this same board suddenly and persistently cannot be flashed at all anymore12:52
wpwrakthe design of the reset circuit does not seem to offer protection when powering down. not by design and, if the voltage rail traces are still representative, also not by accident. so if the underlying reason for using the reset circuit in the first place is correct, then that might be the problem.12:53
wpwrakof course, if the reset circuit is actually completely unnecessary, then it's not ;-)12:53
wpwrak(crc contained) maybe. depends a bit on how it's implemented. do you remember how fdformat works ? (from a user's perspective)12:55
wolfspraulah no, that sounds 80's, forgot12:56
wpwraklekernel: there's a lot you can get away with on USB a lot of times. and people are probably just happy they don't have to give evil FTDI their money ;-)12:57
wpwrakwolfspraul: maybe qi-hw's first ASIC should be a completely open USB-to-serial converter ;-)12:58
wolfspraulwpwrak: the reset ic in 4.4v variant would offer protection when powering down as well?12:58
lekerneldo we really want to spend time on something as mundane, overengineered and pesky as USB? :-)12:58
wpwrak(fdformat) hey, that must have been '92 ! :) well, what it does is that it formats tracks 1-N, then seeks back to track 1 and verifies tracks 1-N. unlike the approach the DOS tools use, which format and verify track 1, then format and verify track 2, etc.12:59
wolfspraulbasically I am trying to think whether there are other alternatives and whether we would order more parts, to speedup12:59
wpwrak(fdformat) can you guess why it does this ? hint: i wrote the whole formatting stuff and my floppy drive was a little bit defective :)13:00
wolfspraulbecause whatever we do, Adam will have to do some testing of this and that variant. and if parts are missing we will quickly have another 'couple days' waiting time in between...13:00
wpwraklekernel: i think no matter how stubborn we are, we can't defeat USB ;-)13:00
wpwrak(a809) yeah, dunno how long it takes to get that one. i think they're in taiwan. it's one of those never-to-be-seen-at-digi-key parts :)13:03
wolfspraulok, attack plan seems to be13:06
wolfspraul1. order 4.4v variant of reset ic13:06
wpwrak(fdformat) the problem was that the stepper motor sometimes didn't step. so i could get logical tracks 1-2-3-5-6-7-... on successive physical tracks. a per-track verification would have succeeded. the whole disk verification would spot the problem13:06
wolfspraul2. for a board in 'unflashable' state, try to reseat jtag board, try to force USB to full-speed, try to enable urjtag debug messages, try Xilinx Impact13:07
wolfspraul3. for a board in 'cannot reconfigure' state, run the test software for CRC checking13:07
wolfspraul4. for a board that is in 'available' state right now, try to do 100 thirty second render cycles to see whether the 'cannot reconfigure' problem can be enforced13:08
wpwrak(fdformat) lesson learned: if location is unreliable, separate write and verification phases. (another lesson, implicit in the floppy structure, would be to have location information embedded in the data. alas, that would be difficult in this case. but then, we have a lot of entropy, so i wouldn't be worried about that)13:09
wolfspraul5. if we feel better about reproducing the 'cannot reconfigure' problem, compare the different ways to power cycle - unplug DC, unplug mains, three-button reset13:09
wolfspraul6. once we have the 4.4v reset ic, rework a board and see whether we can reproduce the 'cannot reconfigure' problem still, on that board13:09
wolfspraul7. make some power-down scope measurements to collect more data points?13:10
wolfspraula lot depends on us being able to reproduce the 'cannot reconfigure' state better13:10
wolfspraulwpwrak: [fdformat] but we do have that already. the crc checks are separate, because the test software checks later, completely independent of the flashing operation13:11
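The fdformat anecdote can be reproduced with a small simulation. This is only a toy model under stated assumptions: the `FlakyDrive` class, its methods, and the choice of which step pulse gets lost are all hypothetical, tracks are numbered from 0 rather than 1, and the track-0 recalibration is assumed to be reliable. It compares format-and-verify per track against format everything, seek back, then verify everything.

```python
class FlakyDrive:
    """Toy floppy drive whose stepper silently ignores selected step pulses."""

    def __init__(self, tracks, missed_steps=()):
        self.media = [None] * tracks        # logical track ID found on each physical track
        self.pos = 0                        # physical head position
        self.step_count = 0
        self.missed_steps = set(missed_steps)

    def recalibrate(self):
        self.pos = 0                        # assumption: track-0 sensor always works

    def step_in(self):
        self.step_count += 1
        if self.step_count not in self.missed_steps:
            self.pos += 1                   # a missed step leaves the head in place

    def format_track(self, logical):
        self.media[self.pos] = logical      # writes wherever the head really is

    def verify_track(self, logical):
        return self.media[self.pos] == logical


def format_interleaved(drive, tracks):
    """Format and verify each track before stepping to the next one."""
    drive.recalibrate()
    bad = []
    for t in range(tracks):
        drive.format_track(t)
        if not drive.verify_track(t):
            bad.append(t)
        if t < tracks - 1:
            drive.step_in()
    return bad


def format_then_verify(drive, tracks):
    """Format all tracks, seek back to the start, then verify all tracks."""
    drive.recalibrate()
    for t in range(tracks):
        drive.format_track(t)
        if t < tracks - 1:
            drive.step_in()
    drive.recalibrate()
    bad = []
    for t in range(tracks):
        if not drive.verify_track(t):
            bad.append(t)
        if t < tracks - 1:
            drive.step_in()
    return bad


# The fourth step pulse is lost while formatting, so one track gets overwritten.
print(format_interleaved(FlakyDrive(8, missed_steps={4}), 8))
print(format_then_verify(FlakyDrive(8, missed_steps={4}), 8))
```

In this model the interleaved variant reports no bad tracks, because each verify reads back what was just written at the same (wrong) position, while the two-phase variant flags every track from the missed step onward.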
wolfsprauldoes my attack plan #1 - #7 sound about right?13:11
wolfspraulI will dwell over the testing results a bit more...13:11
wpwrakhmm, for 2., i'd say to estimate the current rate of flash failures / CRC errors with the long cable. then switch to full-speed permanently and see if the error rate drops to a low-probability percentile.13:12
wolfspraulyou mean full-speed with long cable?13:12
wolfspraulI would like to get my head off of usb asap, because like I said the crc checks are already independent, so there is no way _ANY_ jtag flashing issue can still be around that much later13:13
wolfspraulso even if the jtag usb is unreliable like hell, once we managed to write nor properly, it's there because it will be independently verified by the test software later, in a totally different code path13:13
wolfspraulthat's my understanding at least, I do not see how a USB issue can matter then, even if one exists13:14
wolfspraulthe test software is loaded via serial, it checks the crc of the data on nor. if that is ok, any potential usb/jtag flashing bug is behind us.13:15
wolfspraul8. implement locking of standby+rescue partitions13:17
wolfspraulwpwrak: do you understand/agree that flashing and checking are already completely separate?13:23
wolfspraulmaybe I misunderstand our process...13:23
wpwrakback from phone13:24
wpwrak(usb issue) whether it's truly solved or not depends on how the protocol is designed. so i'd rather eliminate the root cause, just to be sure.13:25
wpwrakbut yes, if you do a verification with the test sw afterwards, the NOR is good13:26
wolfspraulI'm looking for the root cause of the 'cannot reconfigure' problem, not the root cause of any jtag/usb flashing issue13:26
wolfspraulbecause the latter one isn't a sales showstopper, but the former one is13:26
wpwrak(locking) i would also lock the regular bitstream. maybe also APP, in case it's mostly read-only. basic rule: lock everything you're not going to write to often.13:27
wolfspraulthose sound like software improvements13:28
wolfspraulthe highest priority is to make a decision whether we have boards (any of the 90) that we believe are electrically good13:28
wpwrakwhat makes me uncomfortable about USB is that it also complicates analysis. so each analysis step needs to include retries to make sure any supposed NOR errors found are not from USB13:28
wolfspraulyes but it's easy to run the test sw13:29
wpwrak(locking) correct. you shouldn't _need_ the locking. it's another safety belt. while chasing the NOR corruption, i wouldn't lock at all. i.e., leave things as they were13:29
wpwrakif you have a single-bit error, that will work13:30
wpwrakif you have multibit errors, it's more work. then you need to implement the crc also on the pc, to verify that the (failing) CRC is the same on both sides13:30
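A minimal sketch of the cross-check wpwrak describes, assuming the board-side test software reports a standard CRC-32 (the zlib polynomial). The log does not say which CRC the test software actually uses, so that choice, the function names, and the result strings are all assumptions. The idea is to compute the CRC on the PC over the data read back from NOR and compare it with the value the board computed itself.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """CRC-32 as implemented by zlib (assumed, not confirmed, to match the board)."""
    return zlib.crc32(data) & 0xFFFFFFFF


def classify(expected_crc: int, board_crc: int, readback: bytes) -> str:
    """Compare the board's own CRC with a CRC computed on the PC from read-back data."""
    pc_crc = crc32_of(readback)
    if board_crc == expected_crc:
        if pc_crc == expected_crc:
            return "NOR good, read-back path good"
        return "NOR good, read-back path corrupted the data"
    if pc_crc == board_crc:
        # Both sides see the same wrong content, so the error is really in NOR.
        return "NOR corrupted (PC and board agree on the failing CRC)"
    return "inconclusive: PC and board disagree, retry the read-back"


good = b"bitstream contents"
bad = b"bitstream c0ntents"
print(classify(crc32_of(good), crc32_of(bad), bad))
```

If the failing CRC matches on both sides, the read-back over USB/JTAG can be trusted and the dumped data can be diffed against the original image to locate the damaged bits.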
wolfspraulI don't think (guess) we are dealing with any 'proper' nor write13:31
wolfspraulI think it's a maltreatment of some wires into the chip that may also express itself in the form of a bad bit13:31
wolfspraulthat's just my uninformed guess of course13:31
wpwrak(proper nor write) locking may also protect against other write actions13:31
wolfspraulmaybe I'm trying to find a root cause to fix all flash issues at once :-)13:31
wpwrakfor now, i would assume that the software is perfect and thus doesn't need NOR locking to survive :)13:32
wolfspraulhow does my attack plan above sound? right direction overall?13:32
wolfspraulyes definitely13:32
wolfspraulonce we are on the software level it's a different thing already13:32
wolfspraulI think we have some issue below software however13:32
wolfspraulthis is not a 'clean' nor write going astray sometimes13:33
wolfspraulthe data doesn't add up to that theory13:33
wpwrak1. sounds good. 2., i would simplify to "force full-speed" (and report if the stopped bitstream ever appears again)13:33
wpwrak3. i would run the test sw always, independent of 'cannot reconfigure'. if NOR corruption happens at random locations, you'll encounter the problem 20x as often.13:34
wpwrak4. does "render cycle" include power-cycling ?13:34
wpwrakif things go as expected, 7. may be unnecessary :)13:35
wpwrakregarding 3, i would switch from "deal with NOR corruption when you happen to observe it" to "specifically look for it"13:37
wpwrakalso, NOR corruption could also cause other upsets than just failure to reconfigure. e.g., do BIOS/RTEMS/flickernoise check their consistency when booting ?13:38
wolfspraulI doubt it13:41
wpwrakthe BIOS is quite small, so it shouldn't be hit very often. FN is more than twice the size of the bitstream. so assuming uniform distribution, for each hit reconfiguration takes, you may have two hits on FN.13:41
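The uniform-distribution argument amounts to saying that each region's share of corruption hits is proportional to its size. A rough sketch, with loudly hypothetical numbers: the bitstream length is the 1484404 quoted earlier in the log (assumed here to be bytes), while the BIOS and flickernoise sizes are placeholders picked only to satisfy "quite small" and "more than twice the size of the bitstream".

```python
# Region sizes. Only the bitstream figure comes from the log; the rest are placeholders.
regions = {
    "bitstream": 1484404,      # 'Bitstream length: 1484404' (unit assumed to be bytes)
    "bios": 100_000,           # hypothetical, "quite small"
    "flickernoise": 3_200_000, # hypothetical, "more than twice the bitstream"
}

total = sum(regions.values())

# Under a uniform distribution, the probability that a random corruption
# lands in a region equals that region's fraction of the total size.
share = {name: size / total for name, size in regions.items()}

# Expected number of flickernoise hits for every hit on the bitstream.
fn_per_bitstream_hit = regions["flickernoise"] / regions["bitstream"]

for name, p in share.items():
    print(f"{name}: {p:.1%} of hits")
print(f"flickernoise hits per bitstream hit: {fn_per_bitstream_hit:.2f}")
```

With real partition sizes substituted, the same ratio gives a first estimate of how many corruptions go unnoticed if only failures to reconfigure are counted.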
wolfspraulmaybe from now on, when Adam tests the 10 render cycles, he should run the test software in between13:41
wpwrakon the other hand, if there's checksumming to protect FN, so that APP NOR corruption would have a clear and distinct indication, then this would tell us something about the distribution of where NOR corruption happens. but let's worry about that later13:42
wpwrakyes, definitely run the test sw in between13:42
wpwrakNOR corruption hitting FN could also cause other failures. failures which will go away if adam then reworks a (perfectly good) chip and reflashes :)13:43
wpwrakhe seems to reflash very often, so that may hide such things13:43
wpwrakso i would reflash only if there's a known corruption (or if the NOR content needs updating for some reason)13:44
wolfspraulgood point [from now on reflash more carefully after a board was flashed knowingly good for the first time]13:46
strangeloopanyone at camp who wants to chat (ie explain) a bit about milkymist to me? :D13:50
wpwrakis lekernel still there ? if yes, he'd be the man to catch :)13:51
wpwrak(lekernel = sebastien)13:51
strangeloopna he said he already left camp13:54
strangeloopof course can always discuss online, but this kind of stuff is more fun face to face :)13:54
wpwrak(left) pity. hmm, dunno if there's anyone else. roh (joachim) knows the M1 a bit, at least mechanically, but i don't know how familiar he is with its software13:56
wpwrakcould be anything from not even having powered up the board to him having a rave with the wildest video effects in town every night ;-)13:57
strangeloophehe i see  :)13:57
strangeloopi'll keep my eyes and ears open then13:57
strangeloop(and obviously have a few more technical questions here as soon as i manage to get my hands on a board :)13:58
wolfspraulstrangeloop: wow, nice to hear from you and definitely, stop back here...14:00
wpwrakah, and the reset chip rework (to 5V) would be as follows: unsolder old chip, bend pin 3 (the one on the side with only one pin) up, solder the two other pins, run a patch wire from pin 3 to 5V, put something isolating between pin 3 and the pad underneath. a bit hackish.14:30
kristianpaul"uCLinux USB driver." l :-)15:37
kristianpaullekernel: as seems you have lot experience with testing the opencores stuff, what about this one http://opencores.org/project,ethmac ?15:39
lekernelsure, that would be cool. get it to work, kristianpaul!15:39
lekernelI used it before; it's bloated15:39
lekernelbut it works15:39
lekernelthat's rare enough for something from opencores, so it's worth being mentioned15:39
lekernelit's about the size of LM3215:39
kristianpaulmay be is bloated to be full IEEE  compliance?15:40
lekernelkristianpaul, if you dislike my choice of not wasting my time implementing all the useless/legacy features of ethernet into minimac, go ahead and fix it. i'm always waiting for your patches. and I think a better job can be done than what the opencores people did, even when sticking to the standard.16:23
kristianpauli'm do _not_ disliking nothing, i respect others work, bloated or not :-)17:13
kristianpauland what's the problem if i dont send patches? i'm not as good as you or others here coding, is that a problem? please tell me--17:15
kristianpaulor i'll better hold my comments, which it seems are not very welcome if a patch is not attached..17:16
lekernelkristianpaul, simply stating the obvious things that do not work or are not implemented in open source FPGA cores is not going to get many things done, so I'm simply gently prodding you in the right direction :-P17:49
--- Sun Aug 14 201100:00
