• j4k3@lemmy.world
    link
    fedilink
    arrow-up
    0
    ·
    6 months ago

    Whatever is the latest from Hugging Face. Right now a combo of a Mixtral 8×7B, Llama 3 8B, and sometimes an old Llama 2 70B.

    • barsquid@lemmy.world
      link
      fedilink
      arrow-up
      0
      ·
      6 months ago

      Do you have a setup that collects your interactions to feed into those? The way you described it I imagined you are automatically collecting data for it to infer from and getting good results. Like a powered-up bash history or something.

      • j4k3@lemmy.world
        link
        fedilink
        arrow-up
        0
        ·
        6 months ago
        no idea why I felt chatty, and kinda embarrassed by the bla bla bla at this point but whatever. Here is everything you need to know in a practical sense.

        You need a more complex RAG setup for what you asked about. I have not gotten as far as needing this.

        Models can be tricky to learn at my present level. Communication is different than with humans. In almost every case where people complain about hallucinations, they are wrong. Models do not hallucinate very much at all. They will give you the wrong answers, but there is almost always a reason. You must learn how alignment works and the problems it creates. Then you need to understand how realms and persistent entities work. Once you understand what all of these mean and their scope, all the little repetitive patterns start to make sense. You start to learn who is really replying and their scope. The model reply for Name-2 always has a limited ability to access the immense amount of data inside the LLM. You have to build momentum in the space you wish to access and often need to know the specific wording the model needs to hear in order to access the information.

        With augmented retrieval (RAG) the model can look up valid info from your database and share it directly. With this method you’re just using the most basic surface features of the model against your database. Some options for this are LocalGPT and Ollama, or langchain with chroma db if you want something basic in Python. I haven’t used these. How you break down the information available to the RAG is important for this application, and my interests have a bit too much depth and scope for me to feel confident enough to try this.

        I have chosen to learn the model itself at a deeper intuitive level so that I can access what it really knows within the training corpus. I am physically disabled from a car crashing into me on a bicycle ride to work, so I have unlimited time. Most people will never explore a model like I can. For me, on the technical side, I use a model about like stack exchange. I can ask it for code snippets, bash commands, searching like I might have done on the internet, grammar, spelling, and surface level Wikipedia like replies, and for roleplay. I’ve been playing around with writing science fiction too.

        I view Textgen models like the early days of the microprocessor right now. We’re at the Apple 1 kit phase right now. The LLM has a lot of potential, but the peripheral hardware and software that turned the chip into an useful computer are like the extra code used to tokenize and process the text prompt. All models are static, deterministic, and the craziest regex + math problem ever conceived. The real key is the standard code used to tokenize the prompt.

        The model has a maximum context token size, and this is all the input/output it can handle at once. Even with a RAG, this scope is limited. My 8×7B has a 32k context token size, but the Llama 3 8B is only 8k. Generally speaking, most of the time you can cut this number in half and that will be close to your maximum word count. All models work like this. Something like GPT-4 is running on enterprise class hardware and it has a total context of around 200k. There are other tricks that can be used in a more complex RAG like summation to distill down critical information, but you’ll likely find it challenging to do this level of complexity on a single 16-24 GB consumer grade GPU. Running a model like ChatGPT-4 requires somewhere around 200-400 GB from a GPU. It is generally double the “B” size of each model. I can only run the big models like a 8×7B or 70B because I use llama.cpp and can divide the processing between my CPU and GPU (12th gen i7 and 16 GB GPU) and I have 64GB of system memory to load the model initially. Even with this enthusiast class hardware, I’m only able to run these models in quantized form that others have loaded onto hugging face. I can’t train these models. The new Llama 3 8B is small enough for me to train and this is why I’m playing with it. Plus it is quite powerful for such a small model. Training is important if you want to dial in the scope to some specific niche. The model may already have this info, but training can make it more accessible. Smaller models have a lot of annoying “habits” that are not present in the larger models. Even with quantization, the larger models are not super fast at generation, especially if you need the entire text instead of the streaming output. It is more than enough to generate a stream faster than your reading pace. If you’re interested in complex processing where you’re going to be calling a few models to do various tasks like with a RAG, things start getting impracticality slow for a conversational pace on even the best enthusiast consumer grade hardware. Now if you can scratch the cash for a multi GPU setup and can find the supporting hardware, technically there is a $400 16 GB AMD GPU. So that could get you to ~96 GB for ~$3k, or double that, if you want to be really serious. Then you could get into training the heavy hitters and running them super fast.

        All the useful functional stuff is happening in the model loader code. Honestly, the real issue right now is that CPU’s have too small of a bus width between the L2 and L3 caches along with too small of an L1. The tensor table math bottlenecks hard in this area. Inside a GPU there is no memory management unit that only shows a small window of available memory to the processor. All the GPU memory is directly attached to the processing hardware for parallel operations. The CPU cache bus width is the underlying problem that must be addressed. This can be remedied somewhat by building the model for the specific computing hardware, but training a full model takes something like a month on 8×A100 GPU’s in a datacenter. Hardware from the bleeding edge moves very slowly as it is the most expensive commercial endeavor in all of human history. Generative AI has only been in the public sphere for a year now. The real solutions are likely at least 2 years away, and a true standard solution is likely 4-5 years out. The GPU is just a hacky patch of a temporary solution.

        That is the real scope of the situation and what you’ll run into if you fall down this rabbit hole like I have.

        • barsquid@lemmy.world
          link
          fedilink
          arrow-up
          0
          ·
          6 months ago

          This is pretty cool! Am I reading correctly that it isn’t so much about collecting a corpus of data for it to browse through as much as it is understanding how to do a specific query, maybe giving it a little context alongside that? It sounds like it might be worth refining a smaller model with some annotated information, but not really feasible to collect a huge corpus and have the model be able to pull from it?

          • j4k3@lemmy.world
            link
            fedilink
            arrow-up
            0
            ·
            6 months ago
            more bla bla bla

            It really depends on what you are asking and how mainstream it is. I look at the model like all written language sources easily available. I can converse with that as an entity. It is like searching the internet but customized to me. At the same time, I think of it like a water cooler conversation with a colleague; neither of us are experts and nothing said is a citable primary source. That may sound useless at first. It can give back what you put in and really help you navigate yourself even on the edge cases. Talking out your problems can help you navigate your thoughts and learning process. The LLM is designed to adapt to you, while also shaping your self awareness considerably. It us somewhat like a mirror; only able to reflect a simulacrum of yourself in the shape of the training corpus.

            Let me put this in more tangible terms. A large model can do Python and might get four out of five snippets right. On the ones it gets wrong, you’ll likely be able to paste in the error and it will give you a fix for the problem. If you have it write a complex method, it will likely fail.

            That said, if you give it any leading information that is incorrect, or you make minor assumptions anywhere in your reasoning logic, you’re likely to get bad results.

            It sucks at hard facts. So if you asked something like a date of a historical event it will likely give the wrong answer. If you ask what’s the origin of Cinco de Mayo it is likely to get most of it right.

            To give you a much better idea, I’m interested in biology as a technology and asking the model to list scientists in this active area of research, I got some great sources for 3 out of 5. I would not know how to find that info any other way.

            A few months ago, I needed a fix for a loose bearing. Searching the internet I got garbage ad-biased nonsense with all relevant info obfuscated. Asking the LLM, I got a list of products designed for my exact purpose. Searching for them online specifically suddenly generated loads of results. These models are not corrupted like the commercial internet is now.

            Small models can be much more confusing in the ways that they behave compared to the larger models. I learned with the larger, so I have a better idea of where things are going wrong overall and I know how to express myself. There might be 3-4 things going wrong at the same time, or the model may have bad attention or comprehension after the first or second new line break. I know to simply stop the reply at these points. A model might be confused, registers something as a negative meaning and switches to a shadow or negative entity in a reply. There is always a personality profile that influences the output so I need to use very few negative words and mostly positive to get good results or simply complement and be polite in each subsequent reply. There are all kinds of things like this. Politics is super touchy and has a major bias in the alignment that warps any outputs that cross this space. Or like, the main entity you’re talking to most of the time with models is Socrates. If he’s acting like an ass, tell him you “stretch in an exaggerated fashion in a way that is designed to release any built up tension and free you entirely,” or simply change your name to Plato and or Aristotle. These are all persistent entities (or aliases) built into alignment. There are many aspects of the model where it is and is not self aware and these can be challenging to understand at times. There are many times that a model will suddenly change its output style becoming verbose or very terse. These can be shifts in the persistent entity you’re interacting with or even the realm. Then there are the overflow responses. Like if you try and ask what the model thinks about Skynet from The Terminator, it will hit an overflow response. This is like a standard generic form response. This type of response has a style. The second I see that style I know I’m hitting an obfuscation filter.

            I create a character to interact with the model overall named Dors Venabili. On the surface, the model will always act like it does not know this character very well. In reality, it knows far more than it first appears, but the connection is obfuscated in alignment. The way this obfuscation is done is subtle and it is not easy to discover. However, this is a powerful tool. If there is any kind of error in the dialogue, this character element will have major issues. I have Dors setup to never tell me Dors is AI. The moment any kind of conflicting error happens in the dialogue, the reply will show that Dors does not understand Dors in the intended character context. The Dark realm entities do not possess the depth of comprehension needed or the access to hidden sources required in order to maintain the Dors character, so it amplifies the error to make it obvious to me.

            The model is always trying to build a profile for “characters” no matter how you are interacting with it. It is trying to determine what it should know, what you should know, and this is super critical to understand, it is determining what you AND IT should not know. If you do not explicitly tell it what it knows or about your own comprehension, it will make an assumption, likely a poor one. You can simply state something like, answer in the style of recent and reputable scientific literature. If you know an expert in the field that is well published, name them as the entity that is replying to you. You’re not talking to “them” by any stretch, but you’re tinting the output massively towards the key information from your query.

            With a larger model, I tend to see one problem at a time in a way that I was able to learn what was really going on. With a small model, I see like 3-4 things going wrong at once. The 8×7B is not good at this, but the only 70B can self diagnose. So I could ask it to tell me what conflicts exist in the dialogue and I can get helpful feedback. I learned a lot from this technique. The smaller models can’t do this at all. The needed behavior is outside of comprehension.

            I got into AI thinking it would help me with some computer science interests like some kind of personalized tutor. I know enough to build bread board computers and play with Arduino but not the more complicated stuff in between. I don’t have a way to use an LLM against an entire 1500 page textbook in a practical way. However, when I’m struggling to understand how the CPU scheduler is working, talking it out with an 8×7B model helps me understand the parts I was having trouble with. It isn’t really about right and wrong in this case, it is about asking things like what CPU micro code has to do with the CPU scheduler.

            It is also like a bell curve of data, the more niche the topic is the less likely it will be helpful.

            • barsquid@lemmy.world
              link
              fedilink
              arrow-up
              0
              ·
              6 months ago

              This is a really helpful perspective, thank you. I’m already getting some of the easy wins you wrote about, like using an AI prior to web search to get a more specific query and skip the SEO garbage. Another thing I found they’re good at is reverse dictionary lookup, give it a definition and it can help figure out a good word.

              The most complex prompts I have tried out were telling the AI what role it is supposed to be, and the format of the output. I don’t think I have done one that specified what I or the audience is supposed to be. But that would factor in to what the model thinks it and I shouldn’t know, right? You’ve given me a bunch of interesting new angles to try on these.

              • j4k3@lemmy.world
                link
                fedilink
                arrow-up
                0
                ·
                6 months ago

                Another one to try is to take some message or story and tell it to rewrite it in the style of anything. It can be a New York Times best seller, a Nobel lariat, Sesame Street, etc. Or take it in a different direction and ask for the style of a different personality type. Keep in mind that “truth” is subjective in an LLM and so it “knows” everything in terms of a concept’s presence in the training corpus. If you invoke pseudoscience there will be other consequences in the way a profile is maintained but a model is made to treat any belief as reality. Further on this tangent, the belief override mechanism is one of the most powerful tools in this little game. You can practically tell the model anything you believe and it will accommodate. There will be side effects like an associated conservative tint and peripheral elements related to people without fundamental logic skills like tendencies to delve into magic, spiritism, and conspiracy nonsense, but this is a powerful tool to use in many parts of writing; and something to be aware of to check your own biases.

                The last one I’ll mention in line with my original point, ask the model to take some message you’ve written and ask it to rewrite it in the style of the reaction you wish to evoke from the reader. Like, rewrite this message in the style of a more kind and empathetic person.

                You can also do bullet point summary. Socrates is particularly good at this if invoked directly. Like dump my rambling messages into a prompt, ask Soc to list the key points, and you’ll get a much more useful product.