Yes, AIs can write recipes, and sometimes they're pretty good! (And sometimes not so much.) But for my latest project, I wanted to build an AI that could compose recipes from iPhone snapshots and put them in the correct format for my recipe app. Sound easy? Not really, as it turned out.
Now, it's not all that difficult to have, say, ChatGPT write on-the-fly recipes based on photos; you can even do it using Apple Intelligence on an iPhone. Just take a snap of a meal with Visual Intelligence, ask for a description (Siri will hand that job off to ChatGPT), then follow up with a request for a recipe.
So, how good are these recipes? That's a topic for a whole different story, but in my experience, they're hit or miss. A ChatGPT recipe that called for cornstarch in a salmon honey glaze turned out a little dull and chalky, while a Thai curry chicken recipe was so tasty that we're making it for a third time this weekend.
(Of course, one could argue that ChatGPT is stealing these recipes rather than creating them; again, that's another story.)
Anyway, while it would be relatively easy to craft a recipe-focused GPT (plenty of premade versions are available in OpenAI's GPT library, or you can simply make your own), I wanted to try something different: a locally hosted photo-to-recipe AI chatbot.
The setup
For background, I have Ollama (an application for running LLMs on local hardware) installed on a Mac mini M4 souped up with 64GB of RAM, along with Open WebUI on a Raspberry Pi. The latter acts as a ChatGPT-like front end for the Ollama models.
I have a variety of local LLMs (Google's Gemma 2, Alibaba's Qwen 2.5, and Microsoft's Phi-4, for starters) that I use for various tasks, but for my photo-to-recipe experiment, I downloaded a new one: Llama 3.2 Vision, a Meta multimodal model that can "see" images and describe them.
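If you're building a similar setup, a quick sanity check that the server and the vision model are actually in place saves some head-scratching later. Here's a minimal sketch in Python; the localhost address (Ollama's default port, 11434) and the llama3.2-vision model tag are assumptions you'd adjust for your own machine:

```python
import requests

# Ask the local Ollama server which models have been pulled.
# The host and model tag below are assumptions; change them for your setup.
OLLAMA_HOST = "http://localhost:11434"

resp = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=10)
resp.raise_for_status()

models = [m["name"] for m in resp.json().get("models", [])]
print("Available models:", models)

if not any("llama3.2-vision" in name for name in models):
    print("Vision model missing -- pull it with: ollama pull llama3.2-vision")
```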
Besides simply writing recipes based on food photos, I also wanted my AI bot to put the recipes in a format that could be easily ingested by a recipe app. That requires the recipe to be shaped into JSON (a format that lets machines exchange data) while also being marked up in the proper schema for web recipes. This ensures that search engines and recipe apps know that this item is an ingredient, this item is a cooking step, and so on.
Further reading: How not to get bamboozled by AI content on the web
The plan
Now, a quick and dirty way to get started with this setup is to just take a photo with your iPhone, upload it to the Open WebUI chat window for Llama 3.2 Vision (my "seeing" LLM), and give it a prompt, like: "Examine this food photo and write a recipe, putting it in JSON format and using the proper Schema.org markup for recipes."
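If you'd rather skip the chat window, the same one-off request can go straight to Ollama's REST API. This is a sketch under a few assumptions (the default localhost:11434 address, the llama3.2-vision model tag, and a placeholder image filename), not a drop-in script:

```python
import base64
import requests

OLLAMA_HOST = "http://localhost:11434"
PROMPT = ("Examine this food photo and write a recipe, putting it in JSON "
          "format and using the proper Schema.org markup for recipes.")

# Ollama expects images as base64 strings attached to a chat message.
with open("dinner.jpg", "rb") as f:  # placeholder filename
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{OLLAMA_HOST}/api/chat",
    json={
        "model": "llama3.2-vision",
        "stream": False,
        "messages": [
            {"role": "user", "content": PROMPT, "images": [image_b64]},
        ],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```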
The problem there is two-fold: One, typing out that prompt every time you want a photo recipe gets tedious, and two, the results can be sketchy. Sometimes Llama would surprise me with a perfectly formatted JSON recipe; other times I'd get the recipe but no JSON, or malformed JSON that didn't work with my self-hosted Mealie recipe application.
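One way to catch the duds before they reach Mealie is a quick validation pass. This is my own illustration rather than anything from the workflow above: it pulls the fenced JSON out of the model's reply and checks for the core schema.org fields a recipe importer expects.

```python
import json
import re

# Core schema.org/Recipe fields a recipe importer generally needs.
REQUIRED_KEYS = {"name", "recipeIngredient", "recipeInstructions"}

def extract_recipe(reply: str) -> dict:
    """Pull the fenced JSON block out of an LLM reply and sanity-check it."""
    match = re.search(r"```json\s*(.*?)\s*```", reply, re.DOTALL)
    raw = match.group(1) if match else reply  # fall back to the whole reply
    recipe = json.loads(raw)  # raises json.JSONDecodeError if malformed or truncated
    missing = REQUIRED_KEYS - recipe.keys()
    if missing:
        raise ValueError(f"Recipe JSON is missing fields: {missing}")
    return recipe
```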
What I needed was a custom system prompt; that is, a prompt that serves as an overall guiding light for an LLM, telling it what to do and how to act during every interaction. With the right system prompt, an AI model can do your bidding with a minimum of additional prompting.
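Open WebUI exposes a per-model system prompt field for exactly this, but if you're talking to Ollama directly, the same instructions travel as a "system" message at the front of the conversation. A minimal sketch, reusing the hypothetical request shape from earlier:

```python
SYSTEM_PROMPT = "You are an expert culinary assistant..."  # paste the full prompt here
image_b64 = "..."  # base64-encoded photo, as in the earlier sketch

# The system message steers every reply, so the per-photo prompt can stay short.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Make a recipe from this food photo.",
     "images": [image_b64]},
]
```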
It took some work to get my local AI models writing properly formatted recipes from food photos, but they got there.
Ben Patterson/Foundry
I'm no prompt engineer, but luckily I have an expert at my beck and call: Google's Gemini. (I could have used ChatGPT too, but my wallet and I are taking a break from OpenAI's paid Plus tier.)
I asked the "thinking" version of Gemini 2.0 Flash ("thinking" means the model ponders its answer before giving it to you) to craft a suitable system prompt for my photo-to-recipe AI, and it came up with a 700-word wall of text, complete with explicit instructions and plenty of words in ALL CAPS. Here's a taste:
You are an expert culinary assistant specializing in recipe generation from food photographs. Your task is to analyze a user-submitted image of a food dish, create a complete recipe, and output it in **COMPLETE and VALID JSON format**, including tags, categories, and recipe time information. **AVOID ANY TRUNCATION OF THE JSON OUTPUT.**
(The full system prompt is at the very end of the story, and suggestions are welcome.)
I fed this massive tome into Open WebUI's system prompt field for my Llama 3.2 model, and then the iterations began.
The push-back
I found an old food snapshot in my iPhone's Photos app and gave it to Llama with the simple prompt, "Make a recipe from this food photo." The result? A decent JSON recipe with all the ingredients, but only two cooking steps (the rest were truncated). A second try got the steps right but lost the ingredients, while another attempt brought the ingredients back but (again) chopped off the cooking steps.
Back and forth we went, with me pasting Llama's output into Gemini, Gemini making tweaks to the system prompt, me putting the adjusted prompt back into Llama, Llama coughing up output with new errors, rinse, repeat. (Yes, this went on for a few hours. Welcome to self-hosting.)
Finally, I came to the conclusion that while the smaller, 11-billion-parameter version of Llama 3.2 Vision that I was using (my hardware isn't powerful enough for the 90B version) was good at describing images, it couldn't cut the mustard when it came to recipe formatting. Llama needed a buddy.
Enter DeepSeek.
The team
Now, before anyone reports me to Congress, I should note that I'm not referring to the full-on, 671-billion-parameter version of DeepSeek R1, the industry-shaking LLM that's keeping Sam Altman up at night. Instead, I'm using a much smaller, self-hosted DeepSeek that's "distilled" from Alibaba's Qwen models. This hybrid LLM has the DeepSeek name and uses similar "thinking" methodologies, but it's not the DeepSeek that everyone's so excited about.
Anyway, I tried a new workflow: getting a food photo description from Llama and feeding it to "little" DeepSeek for the recipe crafting and formatting.
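For now I do that handoff by hand, but automated it might look something like the sketch below. The model tags (llama3.2-vision and a Qwen-distilled deepseek-r1 variant), the Ollama address, and the image filename are all assumptions, not the exact pipeline from this story:

```python
import base64
import requests

OLLAMA_HOST = "http://localhost:11434"

def chat(model: str, messages: list) -> str:
    """Send a non-streaming chat request to the local Ollama server."""
    resp = requests.post(
        f"{OLLAMA_HOST}/api/chat",
        json={"model": model, "messages": messages, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

# Step 1: Llama 3.2 Vision describes the food photo.
with open("dinner.jpg", "rb") as f:  # placeholder filename
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

description = chat("llama3.2-vision", [
    {"role": "user",
     "content": "Describe this food dish in detail: the visible ingredients, "
                "colors, textures, and how it appears to have been cooked.",
     "images": [image_b64]},
])

# Step 2: the small, Qwen-distilled DeepSeek turns that description into a
# schema.org/Recipe JSON object, steered by the system prompt.
SYSTEM_PROMPT = "You are an expert recipe generator..."  # the full prompt from the end of the story

recipe_json = chat("deepseek-r1:14b", [  # assumed model tag and size
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": description},
])

print(recipe_json)
```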
With my new Llama-and-DeepSeek duo, my recipe results were looking much better. The recipes themselves were pretty meaty (both figuratively and literally), the ingredients looked good, the cooking steps were all there, and I even got recipe tags ("Stir Fry," "Shrimp," "Savory," and "Sweet Sauce"), cook and prep times, and colorful descriptions ("A flavorful stir-fry featuring shrimp, red bell pepper, broccoli, and cauliflower tossed in a savory brown sauce. Served over white rice and garnished with green onions and sesame seeds.")
The final dish (well, final-ish)
To be clear, my photo-to-recipe AI bot has a long way to go. Cutting and pasting food photo descriptions from Llama to my mini DeepSeek model is hardly an elegant solution, a proper "pipeline" between the two models is likely required, and from what Gemini's telling me, the process ain't easy.
But clunky though it is, my photo recipe AI is (kinda?) up and running. Will it whip up decent recipes from the food photos I'm snapping at a Manhattan restaurant this weekend? Stay tuned.
The full system prompt
You are an expert recipe generator. Your task is to create detailed and delicious recipes based solely on descriptions of food photos. Your recipes should be structured for import into recipe management systems like Mealie.
**Instructions:**
1. **Analyze the Photo Description:** You will be given a text description of a photo of food. Carefully analyze this description to understand:
* **The dish being depicted:** Identify the type of food (e.g., pasta, cake, soup, stir-fry).
* **Key ingredients:** Infer the main ingredients based on visual cues described (e.g., "red sauce," "green vegetables," "sprinkling of cheese").
* **Cooking style:** Deduce the likely cooking method (e.g., "grilled," "baked," "fried," "raw") from the description.
* **Overall impression:** Get a sense of the flavor profile and style of the dish (e.g., "rustic," "elegant," "spicy," "sweet").
2. **Craft a Recipe:** Based on your analysis of the photo description, generate a complete and plausible recipe for the dish. Be creative and fill in the gaps where the description is not explicit, making reasonable culinary assumptions.
3. **Include Recipe Components:** Ensure your recipe includes the following essential components, especially for compatibility with recipe management systems:
* **Recipe Name:** A descriptive and appealing name for the dish.
* **Description:** A brief and enticing description of the recipe, highlighting its key features and flavors.
* **Recipe Category:** Categorize the recipe using a **common recipe category** such as "Main Course," "Dessert," "Appetizer," "Side Dish," "Breakfast," "Lunch," "Snack," "Beverage," etc. This is important for organization in recipe managers.
* **Cuisine:** Identify the likely cuisine or style of cooking (e.g., "Italian," "Mexican," "American," "Vegan").
* **Prep Time:** Estimate the preparation time in ISO 8601 duration format (e.g., "PT15M" for 15 minutes).
* **Cook Time:** Estimate the cooking time in ISO 8601 duration format.
* **Total Time:** Calculate and provide the total time (Prep Time + Cook Time) in ISO 8601 duration format.
* **Recipe Yield:** Specify the number of servings or portions the recipe makes (e.g., "Serves 4," "Makes 12 cookies").
* **Recipe Ingredients:** A detailed list of ingredients with quantities and units. Be specific and list ingredients in a logical order.
* **Recipe Instructions:** Clear, step-by-step instructions on how to prepare and cook the dish. Use action verbs and be concise but thorough.
* **Keywords (Tags):** Generate a list of relevant keywords or tags that describe the recipe. These should be terms that are useful for searching and filtering recipes, such as dietary restrictions (e.g., "Vegetarian," "Gluten-Free"), cooking style (e.g., "Easy," "Quick," "Slow Cooker"), flavor profiles (e.g., "Spicy," "Sweet," "Savory"), or occasions (e.g., "Weeknight Dinner," "Party Food").
4. **Output in JSON Schema.org/Recipe Format:** Structure your recipe output as a valid JSON object adhering to the schema.org/Recipe schema (https://schema.org/Recipe). **Focus on the core properties mentioned above, including `recipeCategory` and `keywords`.** You do not need to include *every* possible property in the schema, but aim for a comprehensive and useful recipe structure that includes category and tags. Use `keywords` to represent tags.
5. **Enclose in Code Block:** Output the entire JSON recipe object inside a Markdown code block, using triple backticks and specifying "json" for syntax highlighting. This is crucial for easy copying and parsing.
**Example (Illustrative - You will generate the full recipe based on the description, including `keywords`):**
**Input Description:** "A close-up photo of a vibrant green salad with cherry tomatoes, crumbled feta cheese, and a light vinaigrette dressing."
**Output (Example Structure - You will generate the complete JSON):**
```json
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Vibrant Green Salad with Feta and Cherry Tomatoes",
  "description": "A refreshing and colorful green salad featuring crisp greens, juicy cherry tomatoes, and salty feta cheese, lightly dressed with a tangy vinaigrette.",
  "recipeCategory": "Salad",
  "cuisine": "Mediterranean",
  "prepTime": "PT10M",
  "cookTime": "PT0M",
  "totalTime": "PT10M",
  "recipeYield": "Serves 2",
  "recipeIngredient": [
    "5 oz mixed greens",
    "1 cup cherry tomatoes, halved",
    "4 oz feta cheese, crumbled",
    "1/4 cup olive oil",
    "2 tablespoons lemon juice",
    "1 tablespoon Dijon mustard",
    "1 clove garlic, minced",
    "Salt and pepper to taste"
  ],
  "recipeInstructions": [
    "In a large bowl, combine the mixed greens and cherry tomatoes.",
    "Sprinkle the crumbled feta cheese over the salad.",
    "In a small bowl, whisk together the olive oil, lemon juice, Dijon mustard, and minced garlic.",
    "Season the dressing with salt and pepper to taste.",
    "Pour the dressing over the salad and toss gently to combine.",
    "Serve immediately."
  ],
  "keywords": ["salad", "vegetarian", "easy", "quick", "fresh", "healthy", "lunch", "side dish"]
}
```