A Gentle Intro to Running a Local LLM

How to Run a Chatbot on Your Laptop (and Why You Should)

Open, small LLMs have gotten really good. Good enough that more people – especially non-technical people – should be running models locally. Doing so not only provides you with an offline, privacy-safe, helpful chatbot; it also helps you learn how LLMs work and appreciate the diversity of models being built.

Following the LLM field is complicated. There are plenty of metrics you can judge a model on and plenty of types of models that aren’t easily comparable to one another. Further, the metrics enthusiasts and analysts use to evaluate models might not matter to you.

But there is an overarching story across the field: LLMs are getting smarter and more efficient.

And while we continually hear about LLMs getting smarter, before the DeepSeek kerfuffle we didn’t hear so much about improvements in model efficiency. But models have been getting steadily more efficient for years now. Those who keep tabs on these smaller models know that DeepSeek wasn’t a step-change anomaly, but an incremental step in an ongoing narrative.

These open models are now good enough that you – yes, you – can run a useful, private model for free on your own computer. And I’ll walk you through it.

Installing Software to Run Models

Large language models are, in a nutshell, a collection of probabilities. When you download a model, that’s what you get: a file (or files) full of numbers. To use the model you need software to perform inference: feeding your text into the model and generating output in response to it. There are many options here, but we’re going to pick a simple, free, cross-platform app called Jan.
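To make “a file full of numbers” concrete, here’s a toy sketch in Python. It is not how Jan or any real LLM works under the hood – real models compute probabilities with a neural network over billions of parameters – but it shows the basic shape of inference: look up probabilities, pick a next word, repeat. The tiny `model` table is entirely invented.

```python
import random

# A made-up "model": for each word, the probabilities of the word that follows.
# A real LLM is this idea at enormous scale, with probabilities computed by a
# neural network instead of stored in a lookup table.
model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def generate(prompt: str, max_tokens: int = 5) -> str:
    """Inference: repeatedly sample a next word from the model's probabilities."""
    words = prompt.split()
    for _ in range(max_tokens):
        options = model.get(words[-1])
        if not options:  # the model knows no continuation; stop
            break
        choices, weights = zip(*options.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat down"
```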

Jan is an open-source piece of software that manages and runs models on your machine and talks to hosted models (like the OpenAI or Anthropic APIs). If you’ve used ChatGPT or Claude, its interface will feel familiar:

Jan in use

Go download Jan and come back when it’s installed.

In the screenshot above we’re in the chat tab (see that speech bubble icon in the upper left?). Before we can get started, we’ll need to download a model. Click the “four square” icon under the speech bubble and go to the model “hub”.

Making Sense of Models

In Jan’s model hub we’re presented with a list of models we can download to use on our machine.

Jan's model hub

We can see the model name, the file size of the model, a button to download the model, and a caret to toggle more details for a given model. Like these:

Details about Llama 3.2 1B Instruct Q8

If you are new to this, these model names probably look very confusing. The LLM field moves very fast and has evolved language conventions on the fly, which can appear impenetrable to the unfamiliar (don’t make me tap the sign!).

But we can clear this up. You don’t need to understand the following to run a local model, but knowing the language conventions here will help demystify the domain.

Breaking down the components of a model name

Let’s go from left to right:

  1. Family: Models come in families, which helps you group them by the teams that make them and their intended use (usually). Think of this as the model’s brand name. Here, it’s “Llama”, a family of open models produced by Meta.
  2. Generation: The generation number is like a version number. A larger number means a more recent model, with the value to the left of the decimal indicating major generations. The number to the right of the decimal might indicate an incremental update or signify a variant for a specific use case. Here we’re looking at 3.2, a generation of Llama models which are smaller sized and designed to run “at the edge” (aka, on a device like a phone or PC, not a remote server).
  3. Parameters: As I said, models are essentially collections of probabilities. The parameter count is the number of probabilities contained in a given model. This count loosely correlates to a model’s performance (though much less so these days). A model with more parameters will require a more powerful computer to run. Parameter count also correlates with the amount of space a model takes up on your computer. This model has 1 billion parameters and clocks in at 1.23GB. There is a 3 billion parameter option in the Llama 3.2 generation as well, weighing in at 3.19GB.
  4. Variant: Different models are built for different purposes. The variant describes the task for which a model is built or tuned. This model is made to follow instructions, hence, “Instruct.” You will also see models with “Chat” or “Code” (both self-explanatory). “Base” we tend to see less of these days, but it refers to models that have yet to be tuned for a specific task.
  5. Quantization: Quantization is a form of model compression. We keep the same number of parameters, but we reduce the detail of each. In this case, we’re converting the numbers representing the probabilities in the model from highly detailed numbers with plenty of decimal places to 8-bit integers: whole numbers between -128 and 127. The “Q8” here says the weights in the model have been converted to 8-bit integers. We’re saving plenty of space in exchange for some potential model wonkiness during usage. (The sketch just below this list shows the core idea.)
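Here’s a minimal sketch of that idea in Python with NumPy. Real GGUF quantization is more sophisticated – weights are quantized in blocks, each with its own scale – but the core trade is the same: one byte per parameter instead of four, at the cost of some precision. That byte-per-parameter arithmetic is also why a 1-billion-parameter Q8 model lands near 1GB on disk.

```python
import numpy as np

# Unquantized weights: 32-bit floats, 4 bytes per parameter.
weights = np.random.randn(1_000_000).astype(np.float32)

# Q8-style idea: rescale into the int8 range [-128, 127] and round.
scale = np.abs(weights).max() / 127.0
quantized = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)

# To use a weight, multiply back by the scale. Some detail is lost forever;
# that is the "potential model wonkiness" mentioned above.
restored = quantized.astype(np.float32) * scale
print("worst-case error:", np.abs(weights - restored).max())
print("float32:", weights.nbytes, "bytes")    # 4,000,000
print("int8:   ", quantized.nbytes, "bytes")  # 1,000,000 (4x smaller)
```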

Don’t worry too much about the quantization notation – you don’t need to know it. Jan very helpfully provides us with warnings if a model won’t run well on our machines:

Jan tells us when models will run slow on our PC

When getting started, feel free to stick with the model recommendations I list below. But if you want to explore, download and try anything that can run on your machine. If it’s slow, nonsensical, or you just don’t like it: delete it and move on.

The models listed here are curated by Jan. But Jan can run any text model hosted on Hugging Face (a website for sharing datasets and machine learning models) in the GGUF format (a file format for sharing models). And there are plenty of them.

But let’s put a pin in this.

For now, go back to Jan’s model hub and hit the “Download” button for “Llama 3.2 3B Instruct Q8”. If Jan says this model will be “Slow on your device,” download “Llama 3.2 1B Instruct Q8”. It’s smaller, but still great for its size. Come back when the download is complete.

Chatting with Jan

When Llama 3.2 is downloaded, switch back to the chat pane by clicking the speech bubble in the upper-left. In the text field at the bottom, there’s a model selector. Check to make sure it’s set to the model you just downloaded:

Jan's model selector in action

As you download more models, you can switch between them here.

Typing text in the text field and hitting enter creates a new thread (just like ChatGPT or Claude).

Here’s what the 3B parameter model says when we ask, “Why might someone want to run a LLM on their laptop, rather than using a cloud provider?”

While running a Large Language Model (LLM) on a cloud provider can be convenient and scalable, there are some reasons why someone might prefer to run an LLM on their own laptop:

1. **Cost savings**: Cloud providers charge by the hour or usage, which can add up quickly. Running an LLM locally can save money in the long run, especially if you only need to use it occasionally.
2. **Data ownership and control**: When you host your LLM on a cloud provider, you may have limited control over how your data is stored, processed, and shared. Running it locally ensures that you maintain full control over your data and can ensure its privacy and security.
3. **Customization and flexibility**: By running an LLM locally, you can customize the environment to suit your specific needs without relying on cloud provider configurations or APIs. This can be particularly useful for research or development purposes where you need to experiment with different models or hyperparameters.
4. **Faster training times**: Training a large language model can take significant time and computational resources. Running it locally means that you don't have to worry about network latency, API timeouts, or other cloud-related issues that might slow down the training process.
5. **Offline access**: With a local LLM, you can work on projects without relying on an internet connection, making it ideal for situations where connectivity is limited or unreliable.
6. **Reduced dependency on cloud providers**: By running your own LLM locally, you reduce your dependence on cloud providers and their infrastructure. This can be beneficial if you're concerned about the stability of cloud services or prefer to maintain a more decentralized approach.
7. **Self-hosting and security**: Running an LLM locally allows you to implement additional security measures and ensure that your model is not exposed to external attacks or vulnerabilities.

However, it's worth noting that running a large language model on a laptop can be computationally intensive and may require significant resources (e.g., GPU power). If you're planning to run a large-scale LLM, cloud providers often offer more suitable solutions with optimized infrastructure and economies of scale.

And here’s the 1B parameter model response to the same question:

Running a Large Language Model (LLM) on a laptop can be beneficial for:

1. **Small projects or proof-of-concepts**: For smaller tasks, like experimenting with different models or hyperparameters, running an LLM locally might be sufficient.
2. **Research and development**: Researchers may need to run small-scale experiments or prototype their ideas without being tied to cloud infrastructure.
3. **Personal use cases**: Individuals who only occasionally work on language-related projects can benefit from local hosting, as it provides a cost-effective alternative to cloud providers.

However, running an LLM on a laptop has some limitations:

1. **Limited scalability**: As the model size or complexity increases, local hosting may become impractical due to computational constraints.
2. **Data storage and processing**: Running an LLM locally requires significant storage space for data and intermediate results, which can be challenging in laptops with limited disk capacity.
3. **Network latency and security concerns**: Local connections can introduce latency and compromise the model's security, especially if you're handling sensitive or proprietary data.

To overcome these limitations, researchers and developers often use:

1. **Cloud-based distributed training**: Training large models on a cloud provider, which allows for scalability and parallel processing.
2. **Serverless computing**: Using serverless platforms that provide a pay-as-you-go pricing model, enabling users to scale up or down as needed.
3. **Edge computing**: Deploying computations closer to the source of data, reducing latency and improving security.

Ultimately, whether running an LLM on a laptop or cloud provider depends on your specific needs, project requirements, and preferences.

Both are pretty good for such tiny models! The 3B-sized Llama does a better job, sticking to the question and providing numerous reasons. But this comes with a speed cost: on my Mac with 64GB of RAM, Llama 3B produces ~50 words per minute while Llama 1B produces ~113.

Figuring out which model is right for you takes experimentation. But I have a few recommendations to get you started.

Unless otherwise noted, you can download these models in Jan by copying and pasting the URL from the model name link into the model hub’s search box.

For people with (at least) a Mac with 32GB of RAM or an NVIDIA RTX 4090:

Mistral-Small-24B-Instruct is a bit of a revelation. While there have been open GPT-4o class models that fit on laptops (Llama 3.3 comes to mind), Mistral-Small is the first one I’ve used whose speed is comparable as well. For the last week, I’ve been using it as my first stop – before Claude or ChatGPT – and it’s performed admirably. Rarely do I need to try a hosted model. It’s that good.

If you can run the “Q4” version (it’s the 13.35GB option in Jan, when you paste in the link), I strongly recommend you do. This is the model that inspired me to write this post. Local models are good enough for non-nerds to start using them.

Also good: Microsoft’s Phi-4 model is a bit smaller (~8GB when you select the “Q4” quantized one) but also excellent. It’s great at rephrasing, general knowledge, and light reasoning. It’s designed to be a one-shot model (basically, it outputs more detail for a single question and isn’t designed for follow-ups), and excels as a primer for most subjects.

For people who want to see a model “reason”:

Yeah, yeah, yeah…let’s get to DeepSeek.

DeepSeek is very good, but you’re not likely to be able to run the original model on your machine. However, you likely can run one of the distilled models the DeepSeek team prepared. Distillation is a strategy for creating lightweight versions of large language models that are both efficient and effective: a smaller model ‘learns’ by reviewing the output of a larger model.
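Classic distillation trains the student to match the teacher’s output distribution. Here’s a minimal sketch of that loss in Python, with every number invented for illustration. (DeepSeek’s actual pipeline fine-tuned the smaller models on text generated by R1, but the “learn from the teacher’s outputs” principle is the same.)

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = np.exp((logits - logits.max()) / temperature)
    return z / z.sum()

# Invented next-token scores for the same input, from teacher and student.
teacher_logits = np.array([4.0, 2.5, 1.0, 0.2])
student_logits = np.array([2.0, 2.2, 0.5, 1.0])

# Soften both distributions so the teacher's smaller probabilities still
# carry signal, then measure how far apart they are (KL divergence).
T = 2.0
p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)
loss = np.sum(p_teacher * np.log(p_teacher / p_student))

# Training repeatedly nudges the student's weights to shrink this number.
print("distillation loss:", loss)
```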

In this case, DeepSeek R1 was used to teach Qwen 2.5, an excellent set of smaller models, how to reason. (DeepSeek also distilled Llama models.) Getting this distilled model up and running in Jan requires an extra step, but the results are well worth it.

First, paste the DeepSeek-R1-Distill-Qwen14B model URL into Jan’s model hub search box. If you’re on a Mac, grab the “Q4_1” version, provided you can run it. On Windows, grab the “Q4_K_M”. (If either of those is flagged as unable to run on your machine, try the 7B version. It’s about 5GB.)

Once the model downloads, click the “use” button and return to the chat window. Click the little slider icon next to the model name (marked in red, below). This toggles the thread’s settings. Toggle the “Model Settings” dropdown (marked in blue) so that the “Prompt template” is visible.

Accessing the thread's settings in Jan

Your prompt template won’t look like the one in the image above. But we’re going to change that. Paste the following into the prompt template field, replacing its current contents:

{system_message}
<|User|>
{prompt}
<|end▁of▁sentence|>
<|Assistant|>

Think of prompt templates as wrappers that format the text you enter into the form a given model expects. Templates vary, model by model. Thankfully, Jan will take care of most of this for you if you stick to its model library.
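To see what “wrapper” means, here’s a hypothetical Python stand-in for what inference software does with that template field: it simply substitutes your message (and an optional system message) into the format the model was trained on.

```python
# A hypothetical stand-in for the template machinery: pour your text into
# the wrapper the model expects. The template is the one pasted above.
TEMPLATE = """{system_message}
<|User|>
{prompt}
<|end▁of▁sentence|>
<|Assistant|>"""

def apply_template(prompt: str, system_message: str = "") -> str:
    return TEMPLATE.format(system_message=system_message, prompt=prompt)

print(apply_template("Why run an LLM on a laptop?"))
```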

The change we’ve made, though, will let you see the model “think”. For example, when this model replies to our question about why one might want to use local LLMs, its reasoning is bracketed by <think> and </think> tokens and prefaces the final answer.

Why Bother?

There are giant, leading LLMs available for cheap or free. Why bother setting up a chatbot on your own PC?

Llama 3.2 3B already answered this for us, but…

  1. It’s free: These models work with the PC you have and require no subscriptions. Your usage is only limited by the speed of the model.
  2. It’s 100% privacy-safe: None of your questions or answers leave your PC. Go ahead, turn off your WiFi and start prompting – everything works perfectly.
  3. It works offline: The first time I used a local model to help with a coding task while flying on an airplane without WiFi, it felt like magic. There’s something crazy about the amount of knowledge these models condense into a handful of gigabytes.
  4. It’s customizable: We only scratched the surface here by changing our prompt template. But unfold the “Inference Settings” tab and take a look at the levers waiting to be pulled. Discussing them all is beyond the scope of this article, but here’s a quick tip: the “Temperature” setting effectively controls how much randomness is added during inference (see the sketch just after this list). Try setting it to each extreme and see how it changes your responses.
  5. It’s educational: This is the main reason you should bother with local LLMs. Merely grabbing a few models and trying them out demystifies the field. This exercise is an antidote to the constant hype the AI industry fosters. By getting your hands just slightly dirty, you’ll start to understand the real-world trajectory of these things. And hey, maybe the next DeepSeek won’t be so surprising when it lands.
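If you’re curious what that temperature lever actually does, here’s a small illustrative sketch (invented scores, not Jan’s internals): temperature rescales the model’s next-token scores before they become probabilities, so low values make the likeliest token dominate and high values flatten the field.

```python
import numpy as np

def next_token_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Convert raw scores to probabilities, scaled by temperature."""
    z = np.exp((logits - logits.max()) / temperature)
    return z / z.sum()

logits = np.array([3.0, 1.5, 0.5])  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    print(f"temperature {t}:", np.round(next_token_probs(logits, t), 3))
# temperature 0.1 -> ~[1.0, 0.0, 0.0]: near-deterministic output
# temperature 2.0 -> much flatter distribution: more surprising output
```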

So much of the coverage around LLMs focuses on raising the ceiling: the improved capabilities of the largest models. But beneath this noise, the floor is being raised too. There’s been incredible progress: models you can run on a moderately powered laptop now perform as well as the largest models from this time last year. It’s time to try a local model.



The web is already multiplayer

I’ve been working on editors again: tweaking CodeMirror to make it work better in Val Town. It’s really hard work, unfortunately the kind of hard work that seems like it should be easy. It has me thinking a bit about why front-end engineering is difficult.

In fancy terms, it’s multivariate and extended over time. Frontends, backends, browsers, and users are all variable, unpredictable actors.

There are no single-player web applications: the simplest model of frontend software is a user interacting with a webpage, but the user and the webpage have similar capabilities. Your frontend application can respond to and produce events. It can modify the webpage.

So can the user: they can modify webpages at any point. Maybe they load every page in a new session, breaking your assumptions about how a website will persist state. Or maybe they never create a new session - they have 80+ tabs open that never get closed, so they will continue using versions of your frontend application that you released months ago, which now break when talking to an updated backend server.

But those aren’t the only players. Browsers are active participants in the game: a browser might translate a webpage and modify the DOM in a way that makes React crash. It might disable an API that the application expects, like localStorage, in a way that makes its behavior unpredictable. Ad-blockers might prevent some of the application from loading.

Distribution is probably the single most compelling attribute of the web over native applications. There’s no app store, no mandated release process, no installation step. But just as you as the application developer have more power, so do browsers and users.

The interplay between these actors is what makes things difficult, and difficult in a way that’s underemphasized in both Computer Science dogma and hacker culture. It isn’t some algorithmic challenge or hard performance problem, or anything that can be solved by switching programming languages. It’s an understanding of uncertainty in a deployment target that is constantly shifting and has a lot of unknowns.

After working in this realm, it’s a treat to work on ‘hard’ problems that at least give you more certainty about the environment. Optimizing SQL queries, for example – it can be challenging, but at least I know that I’m writing SQL for Postgres 16. Frontend engineering can feel like optimizing a query and not knowing whether it’ll run in DuckDB, Postgres, or MySQL, and even whether you can trust that the database schema will still be the same when it runs.

It’s hard work. It’s not cool. And I am nowhere near mastering it. I totally get it when people have an aversion to it, or think that the methods used today are wild. But mostly the methods are wild because the problem is, too. A straight-line solution to the problems of the front-end would be lovely, but those problems don’t really permit one.


For 20 years, WordPress existed without much drama—now it's at the center of a big fight that won't be over soon


Meet Andrew Middleton, the map man charting a new course at R.I. business

Through maps, Middleton said, we can think about the places that made us who we are.

“The store is selling stories, primarily,” Middleton said. “And the reason that they can still be current is because they still make people feel things. As long as it still makes you feel a thing, a map is up to date.”

That mentality is one reason Middleton won the business from its previous owner of more than 40 years, Andrew Nosal. After Nosal attempted — and failed — to sell the business, he opened up a nationwide contest to find its next proprietor.

Middleton, a GIS (geographic information system) veteran of Apple and Amtrak, believes he is the only mapmaker to apply among a global pool of more than 100, and is the only one to hold that expertise in the store’s four-owner history. He moved across the country from California, and the store reopened in October 2023.

Since then, Middleton has enacted what he calls a philosophical shift in the store’s operation, revamping the store’s book selection and curating a brainy and eccentric stock of map-based art and supplies.

Recent examples include student Yimiao Wang’s RISD Cry project, which locates where students go to cry on campus. Maps from writer Rebecca Solnit’s atlas projects are in the collection, including one that replaces New York subway stops with the names of women who lived near them. Middleton collaborated with the Decolonial Atlas project to produce tangible copies of Turtle Island, North America reoriented with place names that reflect accurate Indigenous naming. (One entry is Middleton’s own, a scuba diving map of Monterey, Calif., which he said sells better on the West Coast.)

Online sales account for about 20 percent of his business, he said. But the rest happens in person. There is a plush couch, library, and children’s corner.

The old tagline was “maps that help you.” And helpful maps still abound: Middleton physically condensed the travel and trail maps by about 80 percent (in space, though not in number). In a cozy corner, an illustrated poster of New Hampshire’s 4,000-foot peaks adorns a shelf of White Mountain trail guides and other wilderness resources.

But there are more efficient ways of finding the capital of Ohio, said Middleton, and it’s pointless to sell information that’s available for free online.

“Why would I compete with the entire internet when I am one of the very, very, very few people who do what I do and have a physical place that you can go?” he said.

Middleton sees himself as a sort of sommelier of maps. He said anyone in the state could come in and he could probably find a map of their hometown. To do so quickly invokes the daydreamy experience of returning to a place through imagination: one sees the first street they lived on, their school, and other landmarks of memory. That’s one type of story.

Another map on display told a story of redlining, race-based property ownership exclusion in 1930s Providence. A fully illustrated relief map at first suggests a tree-dappled cityscape, but upon further inspection reveals how much real estate is dedicated to car-centric design choices such as parking lots. Cities are a collection of choices, said Middleton, and maps tell those stories.

Middleton has been calibrating his moral compass.

“The lever that I have is small, but I’m thinking about the most beneficial ways I can pull it,” he wrote recently on Bluesky.

His mission caught the internet’s attention. First, Middleton’s collaboration with the Decolonial Atlas project gained traction when he posted about it. Subsequent posts, including a YouTube introduction, reached well beyond the smallest state.

The warm response he received surprised him — the internet can be a dark and awful place, he said.

“You’re not supposed to read the comments,” he said. “I read every comment. Every one was lovely.”

But Middleton isn’t striving to become a cartographic influencer, although he would love to host school groups and give talks. He said he’s just speaking to a common, if low-profile, interest. Maps affect us all. Recently, customers have asked if he will display maps referring to the “Gulf of America,” after President Trump declared via executive order that the Gulf of Mexico be renamed.

Much of geography, said Middleton, simply amounts to an agreement: what is our country, what is occupied territory, and to what degree other countries recognize such claims. Mexico and other countries do not recognize the name change, as the Associated Press noted in its recent guidance. (Middleton called the change “profoundly stupid,” and cartographically misguided given the location of the relevant territorial waters.)

“All maps are wrong,” said Middleton, who believes all maps represent the mapmaker’s perspective, “but some are useful.”

That ambiguity can frustrate people in search of a single authoritative answer, he said. Maps pretend to be objective, but they are not.

“When we look at a map, we have to think about why it was made, who it was made for, for what purpose? How were they making money?” said Middleton. “A lot of maps are made by insurance companies, empires, militaries, [and] advertising executives.”

He peeled out a map of Narragansett Bay made by the British Navy in 1777 for the king of England. Landforms appear as lumpy afterthoughts, but water depth and military fortifications are clearly noted, marking the priority information. The mapmaker was likely offering a coy suggestion to invade.

Maps can be sneaky, persuasive, and believable, said Middleton. They can come in many forms: some songs and sculptures can be maps. And to reject any singular mapmaking authority is liberation.

“It means that your map is as interesting as mine, and I now have the great pleasure of stocking maps that are made by non-professionals,” he said. He is eager for more artist- and community-driven collaboration.

“There is a difference between mapping and being mapped,” he said.

If Middleton has his way, the shelves and files of the Map Center contain pages yet to be charted.

The Map Center is located at 545 Pawtucket Ave., Pawtucket, R.I., and is open Monday through Saturday, 9 a.m. to 6 p.m.

mapcenter.com


DOGE Employees Ordered to Stop Using Slack While Agency Transitions to a Records System Not Subject to FOIA


I want you to do these four things right now


Okay, friends. Here’s what we’re going to do. It’s not going to take long.

Let’s install Signal.

Signal is an open-source, end-to-end encrypted instant messaging app. When you message someone with Signal, nobody can intercept your conversation to learn what you’re saying. It’s very easy to use and completely free.

Unlike WhatsApp (which is owned by Meta) and Telegram (which doesn’t encrypt messages by default), Signal is fully open-source, doesn’t store metadata, and is designed for privacy first.

Navigate to the Get Signal page on the Signal website.

Signal needs to be installed on your phone first. Choose the version that makes sense for you: iPhone or Android.

The cool part is that, once you’re logged in, Signal will tell you which of the people in your contacts are already using it, and as more sign up, they’ll just show up in your Signal contacts list over time.

I recommend also setting up a Signal username. Navigate to your Signal app’s settings pane, click on your profile, and then create a username. Then you don’t need to reveal your phone number to new contacts you want to chat with: you can just tell them your username.

Finally, Signal conversations can be set to auto-delete. I recommend that you do this. Four weeks is comfortable; one week is very safe.

My Signal username is benwerd.01. Once you’re signed up, send me a message to let me know you did it.

Signal

It’s time for a password manager.

Do you use the same password for every service? Or maybe you have an easy-to-remember formula for each one — something like the name of the service with the vowels replaced by numbers?

Those passwords are easy to guess and easy to crack. It’s time to install a password manager.
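For a sense of what a password manager’s generator does on your behalf, here’s an illustrative sketch in Python (not 1Password’s actual code, just the principle): pick characters uniformly at random from a large alphabet using a cryptographically secure source of randomness, so there’s no formula to guess.

```python
import secrets
import string

# 94 possible characters; at 20 characters that's ~131 bits of entropy,
# versus almost none for "name of the service with vowels swapped for numbers".
ALPHABET = string.ascii_letters + string.digits + string.punctuation

def generate_password(length: int = 20) -> str:
    # secrets (not random) draws from the OS's cryptographic randomness.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(generate_password())
```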

1Password is the best-in-class password manager. You can install it on every device you own.

It’s really cheap to sign up. Set up your account, and then install the apps for your desktop, your phone, and your web browser.

Then, when you sign up for a new account, use 1Password’s suggested passwords instead of inventing your own:

1Password suggesting a new password

When you go back to sign in to a service, 1Password will show that you have a login for it, and logging in takes one click:

1Password

So not only are your credentials more secure, it’s actually easier to log in. You don’t need to struggle to remember what your password is anymore.

The passwords are encrypted, so nobody else, including 1Password itself, can ever see them.

A VPN is a great idea.

Do me a favor: whenever you’re on public wifi — that is to say, an internet connection that isn’t your home or your workplace — run your internet connection through an encrypted VPN. This will make your internet activities harder to track and harder to intercept.

A VPN encrypts your internet traffic, which protects you from eavesdropping on public WiFi and makes it harder for advertisers to track you. However, it’s worth saying that it doesn’t make you completely anonymous — your online accounts and browsing habits still matter. (We’ll get to your social media accounts next.)

Mullvad is a great VPN choice for the privacy-conscious, but it can be a little harder to use. (In particular, because it doesn’t ever want to know who you are, it assigns you a numeric account ID and charges on a time-based, pay-as-you-go basis.) ExpressVPN may be easier to use if you’re less technically inclined. In both cases, you sign up, install an app, and simply turn it on and off from the app’s UI.

Mullvad VPN

Let’s make your social media more secure.

Social media is a magnet for harassment, doxing, stalkers and worse. In fact, one of the biggest vectors for attacks of all kinds on the internet is your social media accounts. If you haven’t locked them down in the right ways, you run the risk of sharing more than you intended with strangers, or even losing your account altogether to a hacker. Keeping all the settings straight is a real pain.

Block Party comes as an extension for the browser of your choice. Install it, sign up, and it’ll look at your social media accounts in turn and make informed suggestions about how you can lock them down for better privacy — and better mental wellness. Better yet, it gives you one-click options to make those settings changes itself.

One quick tune-up later, and your social media is safer and better for you. Which can’t be bad.

Block Party

And that’s it for now.

I’ve given you four quick steps that dramatically improve your online security. None of these take long, but they can make a huge difference.

If you found this useful, feel free to share it with a friend who could use a digital security boost. Let’s make the internet safer — one smart step at a time.
