
Implementing an AI Coding Assistant for Free and Without Vendor Lock-In — A Step-by-Step Guide

10 min read · Feb 13, 2025

Lately, I’ve been focusing a lot on integrating AI coding assistants into the software development process. According to various sources, code assistants can speed up coding by up to 30% — a significant boost.

In this article, I want to dispel the myths that AI coding assistants are something distant or impractical. More importantly, their adoption doesn’t necessarily mean being locked into a specific LLM provider or a particular IDE.

Today, we are going to implement an AI coding assistant using only open-source tools, completely for free. The best part is that you can roll out this guide in your company, and it will immediately start saving money and boosting the efficiency of your SWEs. Let’s dive in!

What Are Coding Assistants and How Do They Work?

Coding assistants are applications powered by large language models (LLMs) designed to assist with writing, understanding, reviewing, and explaining code. These assistants are often backed by specialized LLMs that have been fine-tuned on vast datasets of code examples.

Some of the most well-known AI coding assistants today include GitHub Copilot — an extension for IDEs and a chatbot (first version released in 2021 and based on Codex), Cursor — an AI-driven IDE that works with OpenAI/Anthropic and other models, Replit — an end-to-end AI-driven app developer with deployment and scaling features, and many more.

It’s important to note that companies that are serious about protecting their source code typically enter into enterprise agreements with coding assistant providers. If your company doesn’t yet have a strategy for integrating AI into development, now is the time to create one — otherwise, your developers may start using third-party tools on their own, sending your code straight to OpenAI or DeepSeek 🙂

How do coding assistants work? Let’s take a closer look.

Chat Mode

Chat mode is one of the most intuitive ways to interact with a coding assistant. In essence, it works similarly to using ChatGPT or any other LLM-based product. However, there are some key differences.

For example, most coding assistants in chat mode allow you to provide specific context, such as attaching a file related to your question. This helps the assistant generate more relevant and accurate responses.

Asking GitHub Copilot about the contents of a file from one of our projects (image by author)

With more advanced configurations, a coding assistant can be set up to search for answers within your organization’s repositories using, for example, Retrieval-Augmented Generation (RAG). This significantly reduces hallucinations and improves response accuracy.

Additionally, most modern coding assistants optimize chat mode for better code formatting and readability. Beyond that, the differences from standard chatbot interactions are minimal.

Inline Mode

In Inline Mode, you can give short commands to the coding assistant, but unlike chat mode, the output consists only of code, which is inserted directly into the file at your cursor’s position.

For example, you can place your cursor at the beginning of a function and ask the assistant to generate a docstring. Most modern coding assistants handle such tasks effortlessly, making inline mode a highly efficient way to integrate AI assistance into your workflow.

Generating a docstring with GitHub Copilot in inline mode (image by author)

Once the assistant generates code in Inline Mode, you can either approve it or delete it and refine your request with additional instructions. The key advantage of inline mode over chat is that the code is inserted directly where you’re working, eliminating the need to switch between tabs in your IDE.

However, inline mode is less suited for tasks requiring explanations, such as understanding how a file, module, or function works. For those cases, chat mode may be a better choice.

AutoComplete Mode

AutoComplete mode is the most well-known and was the first feature introduced by coding assistants (around 2021). It activates based on predefined internal rules — for example, when the assistant has already generated a suggestion, but you haven’t typed the next letter yet.

At this point, a ghost text (a faint, gray suggestion) appears, indicating that the text is just a suggestion, and you can choose to accept or ignore it.

Autocomplete mode in GitHub Copilot (image by author)

With just one “tab” click, you can generate an entire line — or even multiple lines — of useful code, significantly reducing development time. However, AutoComplete mode has limitations: it often lacks sufficient context, making its suggestions somewhat naive.

On the bright side, it works almost instantly and requires no instructions — the assistant “predicts” what you’re about to type. That said, users accustomed to these assistants sometimes get frustrated when the suggestions aren’t quite right. Yes, that happens!

Implementing Your Own Coding Assistant

Now, let’s dive into how you can integrate a coding assistant into your development workflow — for free — using only open-source tools.

What You’ll Need

To set up your own coding assistant, you need two key components:

  1. An IDE plugin
  2. Access to a large language model (LLM)

Both can be obtained for free without writing code (except for some terminal commands), thanks to the abundance of open-source solutions.

This guide is based on Continue, an open-source IDE plugin that offers full telemetry control, ensuring your code stays private. Another advantage of Continue is that it supports both Visual Studio Code and JetBrains IDEs, covering a wide range of programming languages.

Choosing an LLM

To power your assistant, you’ll need access to an LLM. Here are two options:

  • Option 1: Deploying LLMs on a company server
    If your company has GPU servers, you can use vLLM — a tool for fast language model deployment with an API fully compatible with OpenAI’s interface.
  • Option 2: Running LLMs locally
    If you prefer a completely local setup, consider Ollama. This tool runs LLMs directly on your own machine, meaning your code never leaves your system. While performance is limited by your hardware, you won’t need an additional GPU. Ollama can also run on a GPU server, but its real strength is portability.

Below is the system architecture we aim to build. 🚀

The architecture of our open-source AI coding assistant

Option 1: Deploying an LLM on a GPU Server with vLLM

For this option, you’ll need a GPU server. If your company has one, you can proceed with this guide. If not, you can use cloud services like Runpod or Vast AI.

Setting Up vLLM

Install vLLM and HuggingFace Hub
On your server, you need to install the vLLM library. Use the pip package manager to do this:

pip install vllm huggingface_hub[cli]

Run Your Language Model
After installation, you can run the desired LLM using a simple terminal command. There’s a wide selection of models available — most models published on HuggingFace can be used. Here’s an example for running the Qwen2.5-Coder 14B model:

vllm serve Qwen/Qwen2.5-Coder-14B-Instruct --port 8081

Depending on your GPU’s resources (specifically, VRAM), you can choose a model of varying size. Based on my experience, models around 14 billion parameters (14B) or larger perform quite well and can be used in production. For a 14B model without quantization (16-bit float), ~36GB of VRAM is usually sufficient. With quantization, much less memory is required.
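
If VRAM is limited, you can serve a quantized build of the same model instead. Here is a minimal sketch; the model ID Qwen/Qwen2.5-Coder-14B-Instruct-AWQ and the exact flags are assumptions, so check HuggingFace and the vLLM documentation for what is actually available:

# Serve an AWQ-quantized variant (assumed model ID) to fit into less VRAM
vllm serve Qwen/Qwen2.5-Coder-14B-Instruct-AWQ --quantization awq --port 8081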

Smaller models (8B, 3B, 0.5B) may work as well, though I can’t confirm their performance for writing code with the same level of reliability. They might still offer some productivity improvements.

Testing Your Setup

Once you’ve launched the LLM, you can test it by opening the automatically generated API documentation page on the specified port (in this case, 8081). This confirms the model is up and running correctly and ready to serve your coding assistant.
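
You can also send a quick request to the OpenAI-compatible endpoint from the terminal. A minimal check, assuming the server runs locally on port 8081, might look like this:

# Query the OpenAI-compatible chat endpoint exposed by vLLM
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-Coder-14B-Instruct",
        "messages": [{"role": "user", "content": "Write a one-line Python function that reverses a string."}]
      }'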

OpenAPI interface of vLLM (image by author)

Option 2: Deploying an LLM Locally with Ollama

For this deployment method, you only need a reasonably powerful computer. A GPU isn’t strictly required, but having one certainly helps performance. Installing Ollama is straightforward and can be done with a single command on Linux:

curl -fsSL https://ollama.com/install.sh | sh

For Mac or Windows, visit the installation guide on Ollama’s website — it’s a simple process for all platforms.

Running Ollama

Once installed, you can launch an LLM with just one terminal command. You can choose from the models available on the Ollama website and deploy it locally on your machine. Once you run the model, you can start using it right away.

For example:

ollama run qwen2.5-coder:0.5b
>>> Send a message (/? for help)
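
Ollama also exposes a local HTTP API (on port 11434 by default), which is what the IDE plugin will talk to. A quick sanity check from the terminal could look like this:

# Ask the locally running model for a completion via Ollama's REST API
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:0.5b",
  "prompt": "Write a Python function that checks whether a number is prime.",
  "stream": false
}'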

Now, the language model is ready, but we’ll focus on configuring the assistant for coding tasks specifically.

Installing the Continue Plugin in VS Code

To integrate the Continue.dev plugin into Visual Studio Code, go to the Extensions tab and search for “Continue.” Once you find it, simply install the extension as you would with any other plugin.
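
If you prefer the command line, the VS Code CLI can install it as well. A small sketch (the extension identifier Continue.continue is my assumption; verify it on the Marketplace page):

# Install the Continue extension via the VS Code CLI (extension ID assumed)
code --install-extension Continue.continue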

One thing to keep in mind: if you’re already using other coding assistants, it’s recommended to deactivate them temporarily to avoid any conflicts between the tools. Once the Continue plugin is installed, you’ll notice an icon appearing in your IDE, indicating that the plugin is active.

At this point, almost everything is set up — you just need to create the configuration file, and you’re ready to start using the assistant.

Installing the Continue Plugin in JetBrains Products

To install the Continue.dev plugin in a JetBrains product (I’m demonstrating with PyCharm, but the process is similar for IntelliJ IDEA and other JetBrains IDEs), follow these steps:

  1. Open the Plugins Marketplace in the IDE.
  2. Search for “Continue”.
  3. Install the plugin following the standard procedure.

Just like in VS Code, if you’re already using any other coding assistants, it’s recommended to deactivate them temporarily to avoid potential conflicts between the tools. Once you install the Continue plugin, an icon will appear in your IDE to indicate that it’s active.

Writing the Configuration File for Continue

To open the configuration file (config.json) in the Continue plugin, simply click on the gear icon in the plugin tab within your IDE and select “open configuration file”. Below, I will outline the most important differences between vLLM and Ollama, followed by the complete configuration file.

Configuration for vLLM

"allowAnonymousTelemetry": false, // Don't send anonymous usage data
"models": [
{
"title": "⭐ My Assistant",
"provider": "openai", // openai because vLLM delivers a compatible API
"model": "Qwen/Qwen2.5-Coder-14B-Instruct",
"apiBase": "http://192.168.0.1:8081/v1" // change to yours
}
],
"tabAutocompleteModel": {
"title": "⭐ My Assistant",
"provider": "openai",
"model": "Qwen/Qwen2.5-Coder-14B-Instruct",
"apiBase": "http://192.168.0.1:8081/v1" // change to yours
}

Configuration for Ollama

"allowAnonymousTelemetry": false, // Don't send anonymous usage data
"models": [
{
"title": "⭐ My Assistant",
"provider": "ollama",
"model": "qwen2.5-coder:0.5b" // Model id
}
],
"tabAutocompleteModel": {
"title": "⭐ My Assistant",
"provider": "ollama",
"model": "qwen2.5-coder:0.5b"
}

Usage Examples

Now, I’ll demonstrate how I use a coding assistant in real life. In this case, I will use the Qwen2.5-Coder-14B model, as its performance is quite similar to commercial models. Below, I’ll show some of the most basic examples, but you can achieve even greater results by fine-tuning the plugin based on your existing code and documentation. I’ll dive deeper into that in future articles.

Inline Mode

In this example, I asked the assistant to replace all print statements with logging. While the assistant did a great job, not every line was perfectly placed. However, copying the import and logger configuration to the top is a minor task. Here's the result:

Replacing print statements with logging using Continue.dev inline mode (image by author)

By the way, all interactions with the plugin are saved locally in the folder ~/.continue/, so you can review or reference them later.

Chat Mode

In chat mode, I decided to ask the assistant to explain why using logging is better than print. Here's the response the assistant provided:

Asking about the particular implementation in the chat mode (image by author)

Seems like everything is on point. Just to note, I provided the context of the file I was working with (@PostgresClientDB.py).

AutoComplete Mode

Here’s an example of how the AutoComplete mode worked (on line 70):

Autocomplete mode with Continue.dev (image by author)

Quite trivial, but it definitely speeds up the coding process.

Context Providers and Commands

Context providers are your assistant’s interfaces to the external world. For example, I use the git diff context to generate commit messages in one of my projects. A context provider is invoked with the “@” symbol (more details in the config.json), followed by your prompt.
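
For the git diff example to work, the corresponding context provider has to be enabled in config.json. A minimal sketch of that entry, based on Continue’s context provider configuration (check the plugin documentation for the exact options):

"contextProviders": [
  {
    "name": "diff" // enables the @diff context built from your uncommitted changes
  }
]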

Asking to summarize the changes based on git diff (image by author)

Based on the changes I made, the message was quite accurate, though perhaps a bit lengthy. To achieve the optimal balance between length and detail, you’ll need to experiment with different prompts. Once you find that sweet spot, you can wrap your prompt into a command, which can also be defined in the config.json file.

Commands are simply shortcuts for your prompts. Here’s an example of a command for generating a commit message:

"slashCommands": [
{
"name": "commit",
"description": "Generate an understandable and concise commit message"
}
]

And now, let’s run the command. Here is the result:

Using a slash-command to generate a git commit message (image by author)

What Else Should Be Said

In this article, I’ve covered just the basic implementation of a coding assistant. However, there are many other things that can help you fine-tune it exactly to your team’s needs, such as:

  • Collecting and analyzing development data — interactions between the developer and the assistant.
  • Fine-tuning system prompts — adjusting default prompts for better performance.
  • Customizing autocomplete rules — optimizing suggestions for your specific coding style.
  • Creating your own context provider for chat mode — to integrate with specific external systems.
  • Further training the model or adapting it to preferences — to improve accuracy and context understanding.

I hope this article has been helpful. If you have any questions, whether about the article itself or about more advanced use cases for coding assistants, feel free to reach out to me!

PS. If you’re interested in learning more about advanced AI DevTools, feel free to join the waitlist here. I’m going to release an online course for professionals very soon.

Written by Aleksandr Perevalov

Researcher at Leipzig University of Applied Sciences // Expert in Applied Conversational AI
