I'm running a Xilinx XC7Z010-2CLG225I chip. I have this chip on several PCBs, all running the same software, and observe the same problem on all of them. This implies a systemic problem, not a one-off production issue. The problem is also reproducible, implying I should be able to kill it if I know where to look. But I'm still having surprising difficulty debugging the application.
The board under test accepts 24V, which gets stepped down to 5V through a V7805. The chip runs on its internal oscillator, with a 16x PLL, giving an operation speed of ~29.5 MIPS. The relevant code on this board is very simple: wake up, read data from EEPROM, then enter an infinite loop. An interrupt fires every millisecond, samples some environmental data, and writes an updated value to EEPROM. There's other stuff going on, but the problem still occurs even if the unrelated code is commented out, so I can be reasonably certain it's not relevant to the problem at hand.
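For reference, the clock arithmetic behind the ~29.5 MIPS figure can be sketched as below. The 7.37 MHz nominal internal oscillator frequency and the four oscillator clocks per instruction cycle are assumptions, not figures stated above; they are used here only because they reproduce the quoted speed.

```c
#include <assert.h>

/* Assumed nominal internal RC oscillator frequency in Hz (not stated in the post). */
#define FRC_HZ 7370000UL
/* Assumed: one instruction cycle takes four oscillator clocks. */
#define CLOCKS_PER_INSTRUCTION 4UL

/* Instruction rate, in instructions per second, for a given PLL multiplier. */
unsigned long instruction_rate_hz(unsigned long osc_hz, unsigned long pll_mult)
{
    assert(pll_mult > 0);
    return (osc_hz * pll_mult) / CLOCKS_PER_INSTRUCTION;
}
```

With these assumptions, `instruction_rate_hz(FRC_HZ, 16)` gives 29,480,000 instructions per second, i.e. about 29.5 MIPS.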
In general use, 95% of the time the board wakes up with the correct value in memory, and goes on about its business. The other 5% of the time, though, it wakes up with an incorrect value. Specifically, it wakes up with a bit-flipped version of the data it's supposed to have. It's a four-byte unsigned long that I'm watching, and either the upper or lower word of the long can get flipped. For example, 10 becomes 2^16-10, which later becomes 2^32-10. I can reproduce the glitch by manually cycling power several dozen times, but that's not very consistent, and my switch finger gets worn out.
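To make the example values concrete, here is a small sketch that builds them from 10. It takes the decimal values above literally; the helper names are made up for illustration. Written in hex, 2^16-10 is 0x0000FFF6 and 2^32-10 is 0xFFFFFFF6, so each corrupted 16-bit word matches the corresponding word of the 32-bit two's-complement negation of 10. A plain bitwise inversion of the low word would give 0x0000FFF5 instead.

```c
#include <stdint.h>

/* 32-bit two's-complement negation: 10 -> 0xFFFFFFF6 (2^32 - 10). */
uint32_t negate32(uint32_t v)
{
    return (uint32_t)(0u - v);
}

/* Keep the original high word, take the low word from the negated value:
   10 -> 0x0000FFF6 (2^16 - 10). */
uint32_t mix_low_from_negated(uint32_t v)
{
    return (v & 0xFFFF0000u) | (negate32(v) & 0x0000FFFFu);
}

/* Plain bitwise inversion of the low word, for comparison:
   10 -> 0x0000FFF5 (2^16 - 1 - 10). */
uint32_t invert_low_word(uint32_t v)
{
    return v ^ 0x0000FFFFu;
}
```

This only restates the observed numbers in another form; it does not by itself say where in the read or write path the corruption happens.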
In order to reproduce the problem in a controlled fashion, I built a second board which drives the 24V supply to the board under test. (Another dsPIC driving a Darlington optocoupler.) The tester board turns the 24V off for 1.5 seconds (long enough for the 5V rail to drop to essentially 0 and stay there for one second), then turns the 24V on for some configurable length of time. With an on-time of approximately 520 ms, I can reproduce this EEPROM glitch within five power cycles, every time.
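The tester's timing loop amounts to something like the sketch below. It is not the actual tester firmware: `set_supply()` and `delay_ms()` are hypothetical hooks, stubbed out with simulated time so the sketch runs on a PC.

```c
#include <stdint.h>

#define OFF_TIME_MS 1500u   /* from the description: supply held off for 1.5 s */

/* Simulated state standing in for real hardware (hypothetical). */
static uint32_t elapsed_ms = 0;  /* simulated elapsed time */
static int supply_on = 0;        /* simulated optocoupler output state */
static uint32_t power_ups = 0;   /* number of off-to-on transitions */

/* Hypothetical hook: switch the 24V supply to the board under test. */
static void set_supply(int on)
{
    if (on && !supply_on) {
        power_ups++;
    }
    supply_on = on;
}

/* Hypothetical hook: blocking delay, here just advancing simulated time. */
static void delay_ms(uint32_t ms)
{
    elapsed_ms += ms;
}

/* Run `cycles` off/on cycles and return the total simulated time in ms. */
uint32_t run_power_cycles(uint32_t cycles, uint32_t on_time_ms)
{
    uint32_t i;

    elapsed_ms = 0;
    power_ups = 0;
    supply_on = 0;

    for (i = 0; i < cycles; i++) {
        set_supply(0);
        delay_ms(OFF_TIME_MS);
        set_supply(1);
        delay_ms(on_time_ms);
    }
    return elapsed_ms;
}
```

Five cycles at a 520 ms on-time, `run_power_cycles(5, 520)`, take about ten seconds in total, so a failing run is quick to repeat.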
The 5V rail is behaving reasonably. It settles at 5V within 1 ms of turn-on, with perhaps 0.4V of overshoot, assuming I can trust my scope. At turn-off it decays to 0V exponentially, reaching 1V within 50 ms. I have no build warnings that seem relevant, just unused variables and missing newlines at the end of files.
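As a rough cross-check of the turn-off numbers, and assuming the decay really is a single exponential starting from 5V (the real trace may deviate from this), reaching 1V at 50 ms implies a time constant of about 31 ms:

```latex
V(t) = V_0 \, e^{-t/\tau}, \qquad V_0 = 5\,\mathrm{V}

\frac{V(50\,\mathrm{ms})}{V_0} = \frac{1}{5}
\;\Longrightarrow\;
\tau = \frac{50\,\mathrm{ms}}{\ln 5} \approx 31\,\mathrm{ms}
```

On that model the 1.5 second off-time is roughly 48 time constants, which is consistent with the rail sitting at essentially 0V well before power is reapplied.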
I've tried several things:
Enabling/disabling the MCLR
Enabling/disabling the WDT
Enabling/disabling code protection
Enabling/disabling/changing brownout detect voltage
Enabling/disabling/changing the power-on timer
Different PLL settings on the main internal oscillator
Connecting/disconnecting my PICkit 3 programmer
Adding 470 µF of capacitance to the 5V rail
Adding/removing 0.1 µF across the 4.7k pullup on my MCLR pin
Disabling all interrupts in the code and leaving nothing but EEPROM updates in the main loop
Adding a 1.5 second delay to my startup routine before I start reading EEPROM
I've also written separate test code which does nothing but continually write values to EEPROM and then read them back, making sure that the value has not changed. Tens of thousands of iterations gave no errors. All I can conclude is that something goes wrong with EEPROM read or write, specifically at powerup/powerdown.
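The shape of that write/read-back test is roughly as follows. This is a sketch, not the actual test code: `ee_write_word()` and `ee_read_word()` are hypothetical stand-ins backed by a RAM array so it runs on a PC, whereas the real test calls the EEPROM library routines on the chip.

```c
#include <stdint.h>

#define EE_WORDS 16u

/* Simulated EEPROM (hypothetical); on hardware this would be the data EEPROM. */
static uint16_t fake_eeprom[EE_WORDS];

/* Hypothetical stand-in for the library's word write routine. */
static void ee_write_word(uint16_t addr, uint16_t value)
{
    fake_eeprom[addr % EE_WORDS] = value;
}

/* Hypothetical stand-in for the library's read routine. */
static uint16_t ee_read_word(uint16_t addr)
{
    return fake_eeprom[addr % EE_WORDS];
}

/* Write a changing pattern, read it back, and count mismatches. */
uint32_t readback_errors(uint32_t iterations)
{
    uint32_t errors = 0;
    uint32_t i;

    for (i = 0; i < iterations; i++) {
        uint16_t addr = (uint16_t)(i % EE_WORDS);
        uint16_t value = (uint16_t)(i ^ 0xA5A5u);

        ee_write_word(addr, value);
        if (ee_read_word(addr) != value) {
            errors++;
        }
    }
    return errors;
}
```

A loop like this runs with power held steady throughout, so it only exercises steady-state reads and writes and never a write that overlaps power-up or power-down.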
I've been using the same EEPROM libraries since 2007. I've seen occasional glitches, but nothing reproducible. The relevant code can be found here:
http://srange.net/code/eeprom.c
http://srange.net/code/readEEByte.s
http://srange.net/code/eraseEEWord.s
http://srange.net/code/writeEEWord.s
I've seen EEPROM errors before in other applications, but always as one-off glitches, nothing this reproducible or consistent.
Does anyone have any idea what's going on? I'm running out of things to try.