Wandering Adventure Party

In the early days of personal computing CPU bugs were so rare as to be newsworthy.

  • Kim Spence-Jones 🇬🇧😷

    @gabrielesvelto
    There’s also metastability. If a value is sampled halfway through a transition, the output may occasionally be neither one nor zero, but some ‘half’ value. Depending on the circuits using that result, it may be interpreted as either 1 or 0, and different parts of the circuit may interpret it differently. Such intermediate states are only metastable, and will settle to a firm 1 or 0 at some indeterminate time later, possibly propagating the problem.
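
A rough way to put numbers on this is the standard synchronizer MTBF estimate, MTBF = exp(t_resolve / tau) / (t_window * f_clk * f_data). The sketch below only evaluates that formula. The function name and every parameter value are illustrative assumptions, not figures from any real process or from the thread.

```python
import math

def synchronizer_mtbf(t_resolve, tau, t_window, f_clk, f_data):
    """Estimated mean time between metastability failures, in seconds.

    t_resolve: time the flip-flop is given to settle before its output is used
    tau:       regeneration time constant of the flip-flop
    t_window:  width of the vulnerable window around the sampling clock edge
    f_clk:     sampling clock frequency in Hz
    f_data:    rate of asynchronous data transitions in Hz
    """
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)

# Hypothetical values chosen only to show the trend.
short_settle = synchronizer_mtbf(t_resolve=0.5e-9, tau=50e-12,
                                 t_window=20e-12, f_clk=1e9, f_data=100e6)
long_settle = synchronizer_mtbf(t_resolve=1.5e-9, tau=50e-12,
                                t_window=20e-12, f_clk=1e9, f_data=100e6)

print(f"short settling time: {short_settle:.3g} s between failures")
print(f"long settling time:  {long_settle:.3g} s between failures")
```

The exponential term is why adding a second synchronizer stage helps so much: an extra clock period of settling time multiplies the MTBF rather than adding to it.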

    Gabriele Svelto
    wrote last edited by
    #49

    @KimSJ ah yes, very good point. It's been a while since my days in hardware land and I had forgotten about it.

    • Dubious Blur

      @gabrielesvelto fantastic thread thank you 😄

      Gabriele Svelto
      wrote last edited by
      #50

      @dubiousblur glad you liked it!

      • Stu

        @gabrielesvelto Fascinating thread, especially the degradation over time inherent to modern processors. That came up recently in an interesting viral video about a world where we forget how to make new CPUs.

        Bit of an aside, but I assume this affects other architectures? The thread mentioned Intel and AMD, but I assume Arm and RISC-V are similarly prone to these sorts of problems?

        Gabriele Svelto
        wrote last edited by
        #51

        @tehstu yes, absolutely. I've encountered several bugs in AMD CPUs, not many on ARM just yet, but our ARM user-base is very small compared to x86, so it's just less likely for us to stumble upon them. Plus we have some machinery that can detect some hardware bugs automatically but it doesn't work on ARM just yet.

        • Gabriele Svelto

          However, not all bugs can be fixed this way. Bugs within logic that sits on a critical path can rarely be fixed. Additionally, some microcode fixes can only be made to work if the microcode is loaded at boot time, right when the CPU is initialized. If the updated microcode is loaded by the operating system it might be too late to reconfigure the core's operation, so you'll need an updated UEFI firmware for some fixes to work. 20/31

          Marcos Dione
          wrote last edited by
          #52

          @gabrielesvelto but UEFI is already quite complex: it has to find block devices, read their partition tables, read FAT file systems, read directories and files, load data in memory and transfer execution. Wouldn't a patch after all that be too late?

          • Gabriele Svelto

            I can't be sure that this is exactly what's happening on Raptor Lake CPUs; it's just a theory. But a modern CPU core has millions upon millions of these types of circuits, and a timing issue in any of them can lead to these kinds of problems. And that's before considering that voltage delivery across a core is an exquisitely analog problem, with voltage fluctuations that might be caused by all sorts of events: instructions being executed, temperature, etc. 27/31

            krzysdz
            wrote last edited by
            #53

            @gabrielesvelto Intel's officially stated reason is that (too) high voltage (and temperature) caused fast degradation of clock trees inside cores. This degradation resulted in a duty cycle shift (square wave no longer square?), which caused general instability. If they use both posedge and negedge as triggers, then a change in duty cycle will definitely violate timing.

            Intel Core 13th and 14th Gen Desktop Instability Root Cause Update

            Following extensive investigation of the Intel® Core™ 13th and 14th Gen desktop processor Vmin Shift Instability issue, Intel can now confirm the

            (community.intel.com)
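
To see why a duty cycle shift matters when both clock edges are used, here is a small sketch of the timing budget for a path launched on the rising edge and captured on the falling edge. The clock frequency, path delay, setup time and the shifted duty cycle are all made-up numbers, and the direction of the shift is an assumption; Intel's note does not give these details.

```python
def half_cycle_slack(freq_hz, duty_cycle, path_delay_s, setup_s):
    """Slack (seconds) for a posedge-launched, negedge-captured path.

    The signal must travel through the logic and meet setup within the
    high phase of the clock, which lasts period * duty_cycle.
    """
    period = 1.0 / freq_hz
    high_phase = period * duty_cycle
    return high_phase - path_delay_s - setup_s

# Hypothetical 5 GHz core with a path tuned close to the limit.
FREQ_HZ = 5.0e9
PATH_DELAY_S = 80e-12
SETUP_S = 10e-12

fresh = half_cycle_slack(FREQ_HZ, 0.50, PATH_DELAY_S, SETUP_S)
aged = half_cycle_slack(FREQ_HZ, 0.44, PATH_DELAY_S, SETUP_S)

for label, slack in (("fresh clock tree", fresh), ("degraded clock tree", aged)):
    status = "meets timing" if slack >= 0 else "VIOLATES timing"
    print(f"{label}: slack {slack * 1e12:+.1f} ps, {status}")
```

Note that the clock frequency is unchanged in both cases. Only the shape of the waveform moved, which is enough to turn a small positive margin into a negative one.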

            • arclight

              @gabrielesvelto Thank you for this detailed and specific explanation. Chris Hobbs discusses the relative unreliability of popular modern CPUs in "Embedded Systems Development for Safety-Critical Systems" but not to this depth.

              I don't do embedded work but I do safety-related software QA. Our process has three types of test - acceptance tests which determine fitness-for-use, installation tests to ensure the system is in proper working order, and in-service tests which are sort of a mystery. There's no real guidance on what an in-service test is or how it differs from an installation test. Those are typically run when the operating system is updated or there are similar changes to support software. Given the issue of CPU degradation, I wonder if it makes sense to periodically run in-service tests or somehow detect CPU degradation (that's probably something that should be owned by the infrastructure people vs the application people).

              I've mainly thought of CPU failures as design or manufacturing defects, not in terms of "wear" so this has me questioning the assumptions our testing is based on.

              Gabriele Svelto
              wrote last edited by
              #54

              @arclight timing degradation should not be visible outside of the highest-spec desktop CPUs which are really pushing the envelope even when they're new. Embedded systems and even mid-range desktop CPUs will never fail because of it. What might become visible is increased power consumption over time though.

                Gabriele Svelto
                wrote last edited by
                #55

                @arclight on the other hand watch out for memory errors. Those can crop up much sooner than CPU problems due to circuit degradation: https://fosstodon.org/@gabrielesvelto/112407741329145666

                • Gabriele Svelto

                  In the early days of personal computing CPU bugs were so rare as to be newsworthy. The infamous Pentium FDIV bug is remembered by many, and even earlier CPUs had their own issues (the 6502 comes to mind). Nowadays they've become so common that I encounter them routinely while triaging crash reports sent from Firefox users. Given the nature of CPUs you might wonder how these bugs arise, how they manifest and what can and can't be done about them. 🧵 1/31

                  A Flock of Beagles
                  wrote last edited by
                  #56

                  @gabrielesvelto there was also no meaningful computer security nor much need for it in the days of 6502. it's much different when most computers are now connected to the internet and can be infected with malware within seconds of connecting.

                    Gabriele Svelto
                    wrote last edited by
                    #57

                    @mdione yes, it's very complex, but motherboard firmware has a mechanism to load the new microcode right as the CPU is bootstrapped. That is even before the CPU is capable of accessing DRAM. All the rest of the UEFI machinery runs after that. Note that this early bootstrap mechanism usually involves a separate bootstrap CPU, typically an embedded microcontroller whose task is to get the main x86 core up and running.

                    • Gabriele Svelto

                      All in all, modern CPUs are beasts of tremendous complexity and bugs have become inevitable. I wish the industry would spend more resources addressing them, improving design and testing before CPUs ship to users, but alas most of the tech sector seems more keen on playing with unreliable statistical toys than on ensuring that the hardware users pay good money for works correctly. 31/31

                      x0
                      wrote last edited by
                      #58

                      @gabrielesvelto I wonder if they could use said statistical toys as part of a large-scale fuzzing process to detect such bugs?

                        Paul Barnfather
                        wrote last edited by
                        #59

                        Fascinating thread. Do you know if the same issues exist on low power, embedded CPUs like ESP32, or is this something that mostly affects high-end stuff?

                        • Perpetuum Mobile

                          @gabrielesvelto that's the deep nerdy stuff I love about IT! Thanks a ton for sharing this!

                          Alex@rtnVFRmedia Suffolk UK
                          wrote last edited by
                          #60

                          @perpetuum_mobile @gabrielesvelto I even used to code in assembler on 8-bit platforms, yet for years I could not quite get my head round how modern CPUs worked until this thread (and now I know a bit more)

                          • Gabriele Svelto

                            Bonus end-of-thread post: when you encounter these bugs try to cut the hardware designers some slack. They work on increasingly complex stuff, with increasingly pressing deadlines, under upper management that rarely understands what they're doing. Put the blame for these bugs where it's due: on executives who haven't allocated enough time, people and resources to make a quality product.

                            Brett
                            wrote last edited by
                            #61

                            @gabrielesvelto

                            I don’t cut any slack for Intel producing two whole generations of CPUs with manufacturing flaws then trying to cover it up and never really offering full restitution to any customers.

                              Hyaniner
                              wrote last edited by
                              #62

                              @gabrielesvelto It was a very rich, exciting, interesting, and useful post! Thank you very much!

                                Gabriele Svelto
                                wrote last edited by
                                #63

                                @vfrmedia @perpetuum_mobile if you have some free time this is a good deep dive: https://cseweb.ucsd.edu/classes/fa14/cse240A-a/pdf/04/Gonzalez_Processor_Microarchitecture_2010_Claypool.pdf

                                While it doesn't cover some of the most recent advancements, it captures 90% of what you need to know.

                                If you have a lot of free time and want to dive deeper there's this: https://www.agner.org/optimize/microarchitecture.pdf

                                  Neil Moffatt
                                  wrote last edited by
                                  #64

                                  @gabrielesvelto The book 'Silicon' by the Italian who designed the 4004, 8080 and Z80 is a most splendid read. Fascinating that he had to add reverse engineering optical confusions to minimise cloning by rivals.

                                    Perpetuum Mobile
                                    wrote last edited by
                                    #65

                                    @vfrmedia @gabrielesvelto I did code a little bit in x86 asm when I was a teen. It was the only way to turn on SVGA modes in Turbo Pascal and I wanted to make games back then 😉 I wrote a program that simulated a flame in real time by taking a per-pixel average of the surrounding pixels and adding random sparks of value 255 along the bottom to make the flame move and look real
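
For anyone curious, the flame effect described above can be sketched in a few lines. This is a guess at the classic technique in Python rather than the original Turbo Pascal and assembly: the choice of neighbours, the `cooling` term and the `spark_chance` value are assumptions, not details from the post.

```python
import random

def step_fire(grid, rng, spark_chance=0.5, cooling=1):
    """Advance the flame by one frame and return the new grid.

    grid is a list of rows of heat values in 0..255, with row 0 at the top.
    """
    height, width = len(grid), len(grid[0])
    # New bottom row: random sparks at full intensity (255), otherwise cold.
    sparks = [255 if rng.random() < spark_chance else 0 for _ in range(width)]
    src = grid[:-1] + [sparks]
    out = []
    for y in range(height - 1):
        below = src[y + 1]
        row = []
        for x in range(width):
            # Average the pixel with its three neighbours on the row below,
            # so heat drifts upward and fades as it rises.
            total = (src[y][x] + below[(x - 1) % width]
                     + below[x] + below[(x + 1) % width])
            row.append(max(0, total // 4 - cooling))
        out.append(row)
    out.append(sparks)
    return out

def run_fire(width=32, height=16, steps=40, seed=1):
    """Run the simulation from a cold grid for a fixed number of frames."""
    rng = random.Random(seed)
    grid = [[0] * width for _ in range(height)]
    for _ in range(steps):
        grid = step_fire(grid, rng)
    return grid

if __name__ == "__main__":
    shades = " .:*#"
    for row in run_fire():
        print("".join(shades[value * len(shades) // 256] for value in row))
```

Mapping the heat values onto a black, red, yellow, white palette instead of ASCII shades gives the familiar look.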

                                      Gabriele Svelto
                                      wrote last edited by
                                      #66

                                      @x0 hardware design already involves various forms of fuzzing, but the amount of state involved is so large that no amount of testing can be truly exhaustive

                                        x0
                                        wrote last edited by
                                        #67

                                        @gabrielesvelto Makes sense. Especially when things like voltage and temperature fluctuations are involved.

                                        • Grumble 🇺🇸 🇺🇦 🇬🇱

                                          @gabrielesvelto I went to a lecture in the early 1990s by Tim Leonard, the formal methods guy at DEC. His story was that DEC had as-built simulators for every CPU they designed, and they had correct-per-the-spec simulators for these CPUs.

                                          At night, after the engineers went home, their workstations would fire up tools that generated random sequences of instructions, threw those sequences at both simulators, and compared the results. This took *lots* of machines, but, as Tim joked, Equipment was DEC's middle name.

                                          And they'd find bugs - typically with longer sequences, and with weird corner cases of exceptions and interrupts - but real bugs in real products they'd already shipped.

                                          But here was the banger: sure, they'd fix those bugs. But there were still more bugs to find, and it took longer and longer to find them.

                                          Leonard's empirical conclusion is that there is no "last bug" to be found and fixed in real hardware. There's always one more bug out there, and it'll take you longer and longer (and cost more and more) to find it.
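
The nightly process described above is what is now usually called differential testing. Here is a toy sketch of the idea, assuming an invented 8-bit accumulator machine with three instructions. The two `*_step` functions stand in for the per-spec and as-built simulators, and the corner-case bug is injected by hand. None of this reflects DEC's actual tools.

```python
import random

OPS = ("add", "sub", "shl")
MASK = 0xFF  # toy 8-bit accumulator machine

def reference_step(acc, op, operand):
    """Correct-per-the-spec behaviour of one instruction."""
    if op == "add":
        return (acc + operand) & MASK
    if op == "sub":
        return (acc - operand) & MASK
    return (acc << (operand % 8)) & MASK

def as_built_step(acc, op, operand):
    """Model of the shipped design, with a hand-injected corner-case bug."""
    if op == "sub" and acc == 0x80 and operand == 0x7F:
        return 0x00  # the spec says 0x01
    return reference_step(acc, op, operand)

def run(step, program):
    """Execute a whole program on one simulator and return the final state."""
    acc = 0
    for op, operand in program:
        acc = step(acc, op, operand)
    return acc

def random_program(rng, length):
    return [(rng.choice(OPS), rng.randrange(256)) for _ in range(length)]

def find_mismatch(rng, max_programs, length):
    """Throw random programs at both simulators until their results differ."""
    for attempt in range(1, max_programs + 1):
        program = random_program(rng, length)
        if run(reference_step, program) != run(as_built_step, program):
            return attempt, program
    return None

if __name__ == "__main__":
    result = find_mismatch(random.Random(2024), max_programs=20000, length=50)
    if result is None:
        print("no mismatch found within the budget")
    else:
        print(f"mismatch found after {result[0]} random programs")
```

Even in this tiny machine the bug needs one specific accumulator value combined with one specific operand, so most random programs never trigger it. This sketch also compares only the final state, so a wrong intermediate value can be masked by later instructions; comparing after every instruction would catch more.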

                                          Mike Spooner
                                          wrote last edited by
                                          #68

                                          @grumble209 @gabrielesvelto I remember that the UltraSPARC-II (Blackbird) CPU, over its lifetime (and to date, to boot), only had a single erratum, and that was an extremely unlikely timing issue, not a logic bug. Unfortunately, that overall goodness was offset by the widespread off-chip L2 cache issue circa Y2K.
