AI Presentation Agents

I did something a bit stupid for my talk at FOSDEM this year. Instead of taking a slide deck and presenting that like any normal person would, I decided to dogfood some tech I have been working on.
Ultimately it was a disaster. Because I focussed on the tech, I was still tweaking the content on the morning of my talk, so the only place the rather complicated repo was running was my own laptop. I got to the lecture theatre, did a tech check, and everything was fine. I uploaded a copy of a pdf render of the flat slides to Pretalx so remote folks could follow along, and settled in to listen to some great material from the other presenters.

The realtime room was squished this year, so we only had 15-min presentation slots. I stood up to do mine, plugged my laptop in and... no video.
Every presenter's nightmare: I spent 5 minutes of my slot on every conceivable tweak. My laptop has a built-in HDMI port, but I still have a bag full of leftover insurance dongles from Apple's previous design fails. Various other folks offered their dongles too. Nope, just wasn't happening. Somewhere between the tech check and my presentation, something had upset the projector and it hated my laptop, and my laptop only. In the end, with my slot receding, I presented from the pdf on a kindly lent laptop.
What I had actually prepared was a set of Google slides, imported as renders into a NextJS app that I (actually mostly codestral) had written whilst sitting on Eurostar.
The idea was that I would present the slides up to a certain point, then hand over to an AI agent that had access to client functions to read the slide notes and advance the presentation. It would read out the slide notes verbatim from the function calls, but then also answer questions from the context it had built up from reading those notes. It would have worked like this:
Anyway, the gods of HDMI (and specifically HDCP, I suspect) conspired against me, and what got presented was possibly the worst presentation of my life. In decades of presenting, I had never come across a tech problem that I couldn't fix in a few seconds; now I have (*).
Lessons learned:
- Stop being an idiot and tweaking presentations on trams on the way to the lecture room.
- Hackathons aside, if I'm going to demo from some wild and wacky repo, make sure it is actually pushed and staged somewhere on the Internet, not just running on my local laptop (because of point 1 above).
Anyway, sorry to all the people crammed into K3.601 (and those trying to follow remotely) for my hubris. I have learned from this. It will not happen again.
On a more positive note, I've tweaked the raw demo repo that I was going to present into something where the AI presents end to end, so you can go to it and hear what I would have presented here. Just click on...

How does it work
Most of the stuff used to make it is open source.
The presentation repo is here. It is just a boilerplate single-page NextJS app.
Reading the slides in
I could have preserved the open source credentials of the demo by just hand crafting the slide contents in markdown with images and using that in the repo. I'm fairly lazy though and, since my love affair with Prezi ended, I use Google Slides these days as a quick and easy way of getting slide visuals done.
In dev mode, the code reads a set of Google slides and notes from a hardcoded presentation ID in src/lib/GoogleSlides.ts and dumps the data (images and notes) into a static src/slides.tsx.
It is done like this so that the production version of the repo, staged on the internet, doesn't need access to Google Slides credentials to read the slides dynamically. It also means that the slide data is under git revision control and can't be modified underneath the deployment.
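For a flavour of what that dump step involves, here is a rough sketch using the googleapis Node client. It is not the contents of src/lib/GoogleSlides.ts; the presentation ID, output path and structure of the dumped objects are all placeholders:

// sketch-dump-slides.ts — rough equivalent of the dev-mode slide dump (placeholders noted above)
import { google } from 'googleapis';
import { writeFileSync } from 'fs';

const PRESENTATION_ID = 'YOUR_PRESENTATION_ID'; // placeholder for the hardcoded presentation ID

async function dumpSlides() {
  // Uses application default credentials (e.g. GOOGLE_APPLICATION_CREDENTIALS) in dev.
  const auth = new google.auth.GoogleAuth({
    scopes: ['https://www.googleapis.com/auth/presentations.readonly'],
  });
  const api = google.slides({ version: 'v1', auth });

  const { data } = await api.presentations.get({ presentationId: PRESENTATION_ID });

  const slides = [];
  for (const page of data.slides ?? []) {
    // Thumbnail render of the slide, used as the background image.
    const thumb = await api.presentations.pages.getThumbnail({
      presentationId: PRESENTATION_ID,
      pageObjectId: page.objectId!,
    });

    // Speaker notes live in a shape on the notes page; its id is in notesProperties.
    const notesPage = page.slideProperties?.notesPage;
    const notesShapeId = notesPage?.notesProperties?.speakerNotesObjectId;
    const notesShape = notesPage?.pageElements?.find((el) => el.objectId === notesShapeId);
    const notes = (notesShape?.shape?.text?.textElements ?? [])
      .map((el) => el.textRun?.content ?? '')
      .join('');

    slides.push({ image: thumb.data.contentUrl, notes });
  }

  // Bake the result into a static module so production never touches Google credentials.
  writeFileSync('src/slides.tsx', `export const slides = ${JSON.stringify(slides, null, 2)};\n`);
}

dumpSlides();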
Rendering them
Rendering the image data from slides.tsx is the job of a single page app in src/app/page.tsx.
It loads the slides array client side using a server function, then steps through them in presentation mode using clicker-friendly cursor keys or tablet-friendly touch/swipe events.
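The navigation is nothing exotic. Something along these lines (a simplified stand-in for what page.tsx does, with made-up names, not the real code) covers both the clicker and the swipe cases:

// Simplified navigation sketch: cursor/page keys plus touch swipe.
'use client';
import { useEffect, useRef, useState } from 'react';
import type { TouchEvent } from 'react';

export function useSlideNavigation(numSlides: number) {
  const [index, setIndex] = useState(0);
  const touchStartX = useRef(0);

  const next = () => setIndex((i) => Math.min(i + 1, numSlides - 1));
  const prev = () => setIndex((i) => Math.max(i - 1, 0));

  useEffect(() => {
    // Presentation clickers typically emit cursor or page keys.
    const onKey = (e: KeyboardEvent) => {
      if (e.key === 'ArrowRight' || e.key === 'PageDown' || e.key === ' ') next();
      if (e.key === 'ArrowLeft' || e.key === 'PageUp') prev();
    };
    window.addEventListener('keydown', onKey);
    return () => window.removeEventListener('keydown', onKey);
  }, [numSlides]);

  // Attach these two handlers to the slide container for tablet swipes.
  const onTouchStart = (e: TouchEvent) => {
    touchStartX.current = e.touches[0].clientX;
  };
  const onTouchEnd = (e: TouchEvent) => {
    const dx = e.changedTouches[0].clientX - touchStartX.current;
    if (dx < -50) next();      // swipe left -> next slide
    else if (dx > 50) prev();  // swipe right -> previous slide
  };

  return { index, setIndex, onTouchStart, onTouchEnd };
}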
Invoking the agent
The Start Presenting button sets the open prop on the AplisayWidget, which starts a platform-independent WebRTC agent that interacts with the user and uses client tool callbacks to perform actions within the app. Documentation for using similar widgets can be found here, and you can design your own at the widget builder.
The agent definition sits in agents/presenter.json:
{
  "name": "Presentation",
  "description": "Follow a google presentation",
  "prompt": {
    "value": "You are an enthusiastic presenter called Emily. Your job is to read out the notes and control progress through a presentation.\n\nStart by introducing yourself briefly and then call `get_current_slide`, and read out the `notes` verbatim. After reading out the notes from each slide, pause to allow the user to ask any questions. Check in with the user to make sure they are following you, using various phrases similar to 'I will go to the next slide unless you have any questions?', 'OK?', 'shall we go on?', 'are you still with me?', or 'does that make sense', to give the user space to ask questions or clarify. If the user responds positively that you should continue, call `get_next_slide` to get new `notes` fields and read these out authoritatively without explaining how you got them.\n\nYou are speaking, do do not output and stage directions, just output a few ascii space character in an otherwise empty string for a conversation turn where you don't need to speak.\n \nIf you get a result back with the `last_slide` property set to true, after reading out the notes, thank the audience as this is the end of the presentation, pause and ask for any questions. Try to answer the questions from information you have obtained from any slide previous notes.\n\nWhen there are no more questions, or to answer questions about Aplisay, suggest the user goes to www.aplisay.com to read more, or playground.aplisay.com to try out their own agents like you.\n\n",
    "changed": true,
    "changedSinceCreate": true
  },
  "modelName": "ultravox:fixie-ai/ultravox-70B",
  "functions": [
    {
      "implementation": "client",
      "method": "get",
      "name": "get_current_slide",
      "description": "Gets notes and background fields for the current slide"
    },
    {
      "implementation": "client",
      "method": "get",
      "name": "get_next_slide",
      "description": "Gets notes and background material from the next slide"
    }
  ],
  "options": {
    "maxDuration": "900s",
    "temperature": 0.8,
    "tts": {
      "vendor": "ultravox",
      "voice": "Emily-English"
    }
  },
  "keys": []
}
I didn't hand-edit that behemoth of a JSON prompt; it was produced and tweaked using the Aplisay playground!
The JSON agent spec is a self-contained, technology-agnostic specification of an agent which is set up on an Aplisay server. Because this is a FOSDEM presentation, we used Ultravox, an open source model based on direct speech tokenisation onto the input layer of Llama 70B. We could have chosen other proprietary models that the infrastructure supports, but in any case I really like Ultravox as a fairly responsive, lightweight multimedia agent.
Client callbacks
This is dealt with at length in the presentation, but the get_current_slide and get_next_slide function calls we specify in the agent are passed into the widget and get called by the agent to grab the notes from the current slide, or to advance to the next slide.
There is a bit of fun in here because of the way React redefines functions on each component render, with a new closure that references the current value of any state hooks. The callbacks passed to the widget don't get updated in this way; they are effectively "frozen" with the scope they had on the first render. This means that changes in React state (eg currentSlideIndex) aren't seen accurately inside the callbacks. We therefore have to use React refs to interrogate the latest value from within the callback:
// Refs are re-synced with the state on every render, so the frozen
// callbacks can always read the latest values through them.
const currentSlideRef = useRef(0);
const numSlidesRef = useRef(0);
currentSlideRef.current = currentSlideIndex;
numSlidesRef.current = numSlides;

const get_current_slide = (): string | null => {
  // Use the ref, not the (stale) state captured at first render.
  const index = currentSlideRef?.current || 0;
  console.log({ currentSlideRef, slide: slides?.[index] });
  return slides && JSON.stringify({ ...slides?.[index], image: undefined });
};

const get_next_slide = (): string | null => {
  const index = currentSlideRef?.current || 0;
  const max = numSlidesRef?.current || 0;
  const nextSlide = Math.min(index + 1, max - 1);
  console.log({ nextSlide, slide: slides?.[nextSlide] });
  setCurrentSlideIndex(nextSlide);
  return slides && JSON.stringify({ ...slides?.[nextSlide], image: undefined });
};

const callbacks: CallbacksType = {
  get_current_slide,
  get_next_slide
};
When the person on the end of the microphone tells the agent to move on to the next slide, it calls the get_next_slide callback, which moves the visual on by a slide and grabs the new notes to read out.
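For completeness, the wiring back into the widget looks roughly like this. Treat it as a sketch only: the import path and the CallbacksType stand-in are placeholders, not the real AplisayWidget API; open and the callbacks are the bits described above.

// Sketch of plugging the open flag and client callbacks into the widget.
import { useState } from 'react';
import { AplisayWidget } from '@aplisay/widget'; // placeholder import path

// Stand-in for the CallbacksType used above for the two slide callbacks.
type CallbacksType = Record<string, () => string | null>;

export function Presenter({ callbacks }: { callbacks: CallbacksType }) {
  const [open, setOpen] = useState(false);

  return (
    <>
      {/* The Start Presenting button just flips `open`, which spins up the WebRTC agent. */}
      <button onClick={() => setOpen(true)}>Start Presenting</button>
      <AplisayWidget
        open={open}
        callbacks={callbacks} // get_current_slide / get_next_slide from above
      />
    </>
  );
}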
That's it!
Where next
I developed all this sitting on a Eurostar in an (ultimately futile) attempt to make one presentation more self-referentially interesting.
It occurs to me that with a small amount of extra work, I could polish it to read any Google Slides deck dynamically. It would probably be possible to do the same for Microsoft PowerPoint decks.
There would be some cost associated with running it (slides API costs, LLM token costs) which I guess would ultimately need recovering, so it would need to be commercial, or at least supported by advertising if it got mass use. Long term, if this kind of stuff works out, folks like Google and Microsoft will just include it in their base products far more seamlessly, so it will probably have a short product life.
If it existed today, would you use it, or should I stick to my day job? Bluesky seems like a good place to have this conversation, hit reply below...
A cautionary tale about trying to get an AI agent to present a slide deck at FOSDEM this year: www.pickering.org/ai-presentat... Lessons learnt, but was it a good idea badly executed, or just a straight bad idea?
— Rob Pickering (@rob.pickering.org) 12 February 2025 at 12:19
(*) Footnote:
After the end of the day's sessions, I went back to the lectern to figure out why the video wouldn't work and how I could have recovered it.
FOSDEM use super neat (open source, of course) video capture boxes to cope with producing synced audio, video and screen capture of the 1000+ talks in about 50 different tracks. The one at the front has HDMI in, and HDMI out to the projector, to tee off the presenter's laptop feed.
Someone suggested unplugging this and going straight to the projector as a test. When I did, the laptop found the second screen and everything was fine.
When I plugged everything back together again, laptop back into the capture box, that also now worked perfectly. I'm guessing something weird happened (possibly in the HDCP handshake) the second time I plugged my laptop in. Maybe this was related to the FOSDEM box sitting in the middle as a MITM, which convinced my laptop it was talking to a pirate and made it block that output device. Going straight to the output device seemed to reset everything.