12/17/2025
Starting from Scratch
I am writing this from my fully vibe-coded portfolio/blog website. A space that is meant to let anyone who passes through get an idea of how proficient I am in certain technologies (at least the portfolio part of it). I am still unsure whether the fact that it is vibe-coded undercuts what I have spent years researching and am far from mastering, or whether that even matters at all. Maybe I just wanted to get something up and running quickly so I could focus on what actually matters: finding a challenge I am passionate about solving.
This private space is now public. It’s open to friends, family, recruiters, and anyone else I choose to share it with. I want these posts to feel personal and unfiltered. At first, they’ll serve as a way to articulate my goals and track my progress. Over time, I hope they also become a form of accountability. I tend to accumulate ideas and ambitions faster than I execute on them, and writing is my way of reducing that cognitive load, extracting thoughts from my head, and anchoring them somewhere external.
The project I’m committing to here sounds simple once broken down into smaller pieces: I want to build my own self-hosted LLM inference service. The motivation comes from a few places:
- A growing frustration with how many AI services are currently designed, gated, and monetized
- A desire to explore what more thoughtful, user-centric AI usage could look like
- And, more pragmatically, a refusal to keep paying obscene amounts of money for AI APIs
This blog will document what I learn along the way, what works, what doesn’t, and how my thinking evolves as I try to turn this idea into something real.
I want to structure what this journey will look like. For now, the best I can think of are the following phases, starting with a lot of investigation:
Phase 1: Define the non-negotiables
Ethical Constraints:
- Deterministic: the LLM should have "inspectable" behavior
Personal Constraints:
- Have fun
- Shift focus away from the outcomes and into the journey itself
- Keep journaling and posting my progress - The good, the bad and the ugly
Phase 2: Define the Components Needed
1. Model
This is the engine that actually generates the text. It maps input data (in my case, journal entries) to an output based on patterns learned during training. Training here will likely not be from scratch, since it could cost millions to train a model to the level that most models out there operate at, not to mention the petabytes of data needed to get a model to those standards. I might have to outsource some of the processing to providers like OpenAI or Gemini.
2. Inference / Serving Layer
As I understand it, this layer exposes the model as an API endpoint so that it can handle requests efficiently. Several options look viable. I want to research batch processing and async request handling to make sure I am doing the right things.
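To make the batching-plus-async idea concrete to myself, here is a minimal sketch of a request micro-batcher using only Python's asyncio. Everything here is an assumption about the eventual design: `MicroBatcher`, `run_model`, and the batch/wait parameters are hypothetical names, and the "model" is just a stand-in function, since no real backend has been chosen yet.

```python
import asyncio


class MicroBatcher:
    """Collect concurrent requests and run them through the model in batches.

    run_model is a placeholder for whatever backend actually generates text;
    here it just transforms strings, since the real model is not decided yet.
    """

    def __init__(self, run_model, max_batch=8, max_wait=0.01):
        self.run_model = run_model
        self.max_batch = max_batch  # largest batch sent to the model at once
        self.max_wait = max_wait    # seconds to wait for more requests
        self.queue = asyncio.Queue()

    async def submit(self, prompt):
        # Each caller gets a future that resolves when its batch is processed.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def worker(self):
        while True:
            # Block for the first request, then briefly collect more.
            batch = [await self.queue.get()]
            try:
                while len(batch) < self.max_batch:
                    batch.append(await asyncio.wait_for(
                        self.queue.get(), timeout=self.max_wait))
            except asyncio.TimeoutError:
                pass
            prompts = [p for p, _ in batch]
            results = self.run_model(prompts)  # one batched model call
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)


async def main():
    # Stand-in "model": uppercases each prompt in the batch.
    batcher = MicroBatcher(lambda prompts: [p.upper() for p in prompts])
    worker = asyncio.create_task(batcher.worker())
    replies = await asyncio.gather(
        *(batcher.submit(f"req {i}") for i in range(5)))
    worker.cancel()
    return replies


print(asyncio.run(main()))
```

The point of the sketch is the shape of the problem: callers await individual futures while a single worker groups requests, so the expensive model call happens once per batch instead of once per request.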
3. Storage
Since LLMs are stateless by default, I want my server to remember new context. I can look into vector databases and embeddings to accomplish this. At inference time, retrieve the relevant vectors and prepend them to the prompt: basically retrieval-augmented generation (RAG).
Phase 3: Hardware / Environment
This is not critical at the beginning, since I am going to start on my M1 Mac: an 8-core CPU with 16GB of RAM. Not the best, but it will get me started until it's time to upgrade.