AI Services and Tools (2)#

In this chapter, we briefly introduce the APIs provided by OpenAI, along with short explanations and example code for each. Even if you’re not a developer, as long as you understand the basic idea of what an API is, this chapter should give you a rough sense of what you can build and help you set a direction if you want to create a service on top of OpenAI’s APIs.


About the APIs provided by OpenAI#

AI providers such as OpenAI offer consumer-facing services like ChatGPT, but they also provide APIs to third-party developers, expanding the ecosystem. For startups, using an API instead of training your own model can be a very practical way to ship high-quality AI features quickly.

As a side note: when non-developers ask me where to start learning programming, I’ve usually recommended small goals with visible outcomes—building a dynamic web page with HTML and JavaScript, or scraping a web page with Python to extract useful information. But after integrating OpenAI’s and Google Gemini’s APIs, my personal “top recommendation” changed: connecting an AI API is, in my experience, one of the most approachable and motivating starting points.

Below is a quick overview of what OpenAI’s APIs are and what you can do with them. For details, see the official documentation. The sample code is also included here because it’s relatively short and easy to follow.

Text generation#

The most basic APIs take text input and produce text output. As of January 2026, with the OpenAI Python SDK, the Responses API (client.responses.create) is the usual primary entry point. instructions carries the setup / system guidance, and input carries the user input (or the conversation history).

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
  model="gpt-5-mini",
  instructions="You are a helpful assistant.",
  input=[
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"},
  ],
)

print(response.output_text)

One key detail is role. role="user" means “what the user said,” and role="assistant" means “what the model previously answered (conversation history).”
What people often call system instructions are the shared rules you give the AI before the conversation starts (e.g., role / tone / forbidden content / output format). In this example, we keep those rules in instructions.
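
If you want the model to remember earlier turns in a follow-up request, one option is to chain requests with previous_response_id instead of re-sending the whole history yourself. This is a minimal sketch building on the example above; the follow-up question is made up for illustration.

follow_up = client.responses.create(
  model="gpt-5-mini",
  previous_response_id=response.id,  # link this request to the earlier response
  input=[{"role": "user", "content": "How many games did that series go?"}],
)
print(follow_up.output_text)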

Function calling#

As mentioned in the RAG and GPTs sections, LLMs primarily answer based on what they’ve learned during training, so they have limitations when dealing with information they didn’t originally learn (e.g., up-to-date information or internal documents). To address this, techniques such as fine-tuning, RAG, and function calling (often also called “tool calling”) are used. Adding Actions to a custom GPT (GPTs) is, from a server perspective, similar to defining “tools/functions” that the model can call when needed.

A simplified function-calling flow looks like this:

  1. Assume a [user(client) ↔ server ↔ model(GPT)] architecture.
  2. When the server forwards the user’s request to the model, it also sends information about functions (name, description, parameters, etc.). (There can be multiple functions, all implemented on the server side.)
  3. If, while generating a response, the model decides a function call is needed, it returns the name of the function to call and the arguments to pass, instead of a final answer.
  4. The server calls the function (which may include algorithms and external API calls).
  5. The server sends the function result back to the model, and only then does the model generate the final user-facing answer.

Function calling and RAG are similar in that they both “bring in external information when needed,” but their goals and implementation differ. RAG typically retrieves relevant passages from a document/knowledge base and attaches them as context (so the answer is grounded in documents). Function calling invokes external systems (API / DB / business systems, etc.) to fetch real data or perform actions, then feeds the result back into the model to produce the final response.
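
To make the flow above concrete, here is a minimal sketch using the Responses API tool-calling format. The get_weather function, its parameter, and the user question are invented for illustration; in a real service the function would query an actual weather API or database.

import json
from openai import OpenAI

client = OpenAI()

# Hypothetical server-side function the model may ask us to call (step 4).
def get_weather(city: str) -> str:
  return f"Sunny, 21°C in {city}"  # real code would call a weather API here

# Step 2: describe the function (name, description, parameters) to the model.
tools = [{
  "type": "function",
  "name": "get_weather",
  "description": "Get the current weather for a city.",
  "parameters": {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
  },
}]

input_items = [{"role": "user", "content": "What's the weather like in Seoul right now?"}]

# Step 3: the model may answer directly, or ask for a function call.
response = client.responses.create(model="gpt-5-mini", input=input_items, tools=tools)

for item in response.output:
  if item.type == "function_call" and item.name == "get_weather":
    args = json.loads(item.arguments)  # arguments arrive as a JSON string
    result = get_weather(**args)       # step 4: the server runs the function
    input_items.append(item)           # keep the function call in the conversation
    input_items.append({               # step 5: return the result to the model
      "type": "function_call_output",
      "call_id": item.call_id,
      "output": result,
    })

# The model now writes the final, user-facing answer using the function result.
final = client.responses.create(model="gpt-5-mini", input=input_items, tools=tools)
print(final.output_text)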

Image generation#

There are APIs that generate images from a text prompt. Image generation uses dedicated image models (e.g., gpt-image-1.5, dall-e-3) rather than a text model like gpt-5-mini. The generated image can be returned as a URL or Base64. If you use URLs, be aware they can be short-lived, so downloading and storing the result is usually safer for production use.

from openai import OpenAI

client = OpenAI()

response = client.images.generate(
  model="gpt-image-1.5",
  prompt="a white siamese cat",
  size="1024x1024",
  quality="standard",
  n=1,
)

image_url = response.data[0].url
print(image_url)
[Example response]
[
  Image(
    b64_json=None,
    revised_prompt="A Siamese cat with a pr..(omitted)",
    url="https://oaidalleapiprodscus.blob.core..(omitted)",
  )
]

Running the code above gives you the URL of the generated image. (Prompt engineering matters if you want specific details.)

In addition to generating images from a text prompt (client.images.generate()), you can also create an edited image by providing an existing image and a mask (client.images.edit()), or create variations of an image (client.images.create_variation()).
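
As a rough sketch of what an edit call looks like (assuming the dall-e-2 edit endpoint, which takes square PNG inputs; cat.png and mask.png are placeholder files, and the transparent region of the mask marks the area to repaint):

from openai import OpenAI

client = OpenAI()

# Sketch only: "cat.png" and "mask.png" are placeholders you would provide.
response = client.images.edit(
  model="dall-e-2",
  image=open("cat.png", "rb"),
  mask=open("mask.png", "rb"),
  prompt="the same white siamese cat, now wearing a small red bow tie",
  n=1,
  size="1024x1024",
)
print(response.data[0].url)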

Vision#

You can provide an image along with a text prompt and receive a text response. Typical use cases include describing an image, checking whether a certain object exists, or comparing multiple images.

from openai import OpenAI

client = OpenAI()

img_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = client.responses.create(
  model="gpt-5-mini",
  input=[
    {
      "role": "user",
      "content": [
        {"type": "input_text", "text": "What's in this image?"},
        {"type": "input_image", "image_url": img_url},
      ],
    }
  ],
)
print(response.output_text)
[Example response]
This image depicts a picturesque scene of a wooden boardwalk path stretching through a grassy field. The sky is blue with some scattered clouds, and the surrounding vegetation is lush and green, suggesting a summer or late spring setting. Trees and shrubs can be seen in the background, adding to the serene and natural atmosphere of the landscape.
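
The example above passes a public URL, but you can also send a local image by encoding it as a Base64 data URL. This is a sketch; boardwalk.jpg is a placeholder filename.

import base64
from openai import OpenAI

client = OpenAI()

# Read a local file and encode it so it can be sent inline instead of via URL.
with open("boardwalk.jpg", "rb") as f:
  b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.responses.create(
  model="gpt-5-mini",
  input=[
    {
      "role": "user",
      "content": [
        {"type": "input_text", "text": "What's in this image?"},
        {"type": "input_image", "image_url": f"data:image/jpeg;base64,{b64_image}"},
      ],
    }
  ],
)
print(response.output_text)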

Text to speech#

There are APIs that generate an audio file from text. This feature uses dedicated TTS models (tts-1, tts-1-hd) rather than a text model. You can choose from preset voices (alloy, echo, fable, onyx, nova, shimmer) and output formats (mp3, opus, aac, flac, wav, pcm). For longer text, streaming is often useful.

from pathlib import Path
from openai import OpenAI

client = OpenAI()

speech_file_path = Path(__file__).parent / "6_text_to_speech_result.mp3"
with client.audio.speech.with_streaming_response.create(
  model="tts-1",
  voice="alloy",
  input="Today is a wonderful day to build something people love!",
) as response:
  response.stream_to_file(speech_file_path)
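
Switching to one of the other preset voices or output formats mentioned above is just a matter of parameters; for example (same pattern, different values):

from pathlib import Path
from openai import OpenAI

client = OpenAI()

speech_file_path = Path(__file__).parent / "6_text_to_speech_result.wav"
with client.audio.speech.with_streaming_response.create(
  model="tts-1-hd",
  voice="nova",            # any of the preset voices listed above
  input="Today is a wonderful day to build something people love!",
  response_format="wav",   # mp3, opus, aac, flac, wav, or pcm
) as response:
  response.stream_to_file(speech_file_path)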

Speech to text#

If you provide audio input, you can get a text transcript. Like text-to-speech, this supports many languages. This feature uses dedicated transcription models (e.g., whisper-1).

from pathlib import Path
from openai import OpenAI

client = OpenAI()

audio_file = Path("6_text_to_speech_result.mp3")
transcription = client.audio.transcriptions.create(
  model="whisper-1",
  file=audio_file,
)

print(transcription.text)
[Example response]
Today is a wonderful day to build something people love.

Some APIs can also provide timestamps (when each word is spoken), which makes it easier to map text back to a specific point in the audio. You can also pass a text prompt along with the audio to improve transcription quality. For example, if the audio includes a word like “DALL-E,” the model might be unsure whether it should be transcribed as “DALI” or “DALL-E.” Providing context such as “this audio is about AI and mentions DALL-E and GPT models” can reduce mistakes.
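
Here is a sketch of those two options together: word-level timestamps require response_format="verbose_json", and prompt passes the spelling hints described above.

from openai import OpenAI

client = OpenAI()

with open("6_text_to_speech_result.mp3", "rb") as audio_file:
  transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["word"],
    prompt="This audio is about AI and may mention DALL-E and GPT models.",
  )

print(transcription.text)
for word in transcription.words:
  # each entry has the word plus its start/end time in seconds
  print(word.word, word.start, word.end)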

Moderation#

There are APIs that check whether a text contains harmful content. They typically score or classify categories such as sexual content or hate content and return structured results.

from openai import OpenAI

client = OpenAI()

response = client.moderations.create(
  input="Hello world!!",
)
print(response)
[Example response]
ModerationCreateResponse(
  id="modr-...",
  model="...",
  results=[...],
)
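
In practice you usually look inside results rather than printing the whole object. Continuing from the call above:

result = response.results[0]

print(result.flagged)          # overall yes/no verdict
print(result.categories)       # per-category booleans (sexual, hate, violence, ...)
print(result.category_scores)  # per-category confidence scores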

Assistants#

So far we’ve looked at APIs for one-off requests. The Assistants API (exposed under client.beta in the Python SDK, as in the example below) is a more advanced set of APIs for building a chatbot experience, like ChatGPT, inside your own application. It supports workflows such as attaching files for analysis, searching within those files to answer questions, and running Python code. You could build a chatbot using only the APIs above, but Assistants packages these common chatbot building blocks in a more integrated way.

Below is a simple Python example of a chatbot using Assistants:

from openai import OpenAI
from typing_extensions import override
from openai import AssistantEventHandler

client = OpenAI()

assistant = client.beta.assistants.create(
  name="Math Tutor",
  instructions="You are a personal math tutor. Write and run code to answer math questions.",
  tools=[{"type": "code_interpreter"}],
  model="gpt-5-mini",
)

thread = client.beta.threads.create()

class EventHandler(AssistantEventHandler):
  @override
  def on_text_created(self, text) -> None:
    print("\nassistant > ", end="", flush=True)

  @override
  def on_text_delta(self, delta, snapshot):
    print(delta.value, end="", flush=True)

  def on_tool_call_created(self, tool_call):
    print(f"\nassistant > {tool_call.type}\n", flush=True)

  def on_tool_call_delta(self, delta, snapshot):
    if delta.type == "code_interpreter" and delta.code_interpreter:
      if delta.code_interpreter.input:
        print(delta.code_interpreter.input, end="", flush=True)
      if delta.code_interpreter.outputs:
        print("\n\noutput >", flush=True)
        for output in delta.code_interpreter.outputs:
          if output.type == "logs":
            print(f"\n{output.logs}", flush=True)

while True:
  user_input = input("\n\nuser > ")
  if user_input.lower() in ["exit", "quit"]:
    break

  client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=user_input,
  )

  with client.beta.threads.runs.stream(
    thread_id=thread.id,
    assistant_id=assistant.id,
    instructions="Please address the user as Ted Kim. The user has a premium account.",
    event_handler=EventHandler(),
  ) as stream:
    stream.until_done()
[Example run log]

user > Who are you?

assistant > Hello, Ted Kim! I'm your AI assistant, here to help you with information, tasks, and any questions you might have. How can I assist you today?

user > I need to solve the equation `3x + 11 = 14`. Can you help me?

assistant > Sure, Ted Kim! Let's solve the equation (3x + 11 = 14).

1. **Subtract 11 from both sides:**
   3x + 11 - 11 = 14 - 11
   3x = 3

2. **Divide both sides by 3:**
   x = 3/3
   x = 1

So, the solution to the equation (3x + 11 = 14) is (x = 1).

user > Please run the python code. `for i in range(0,3): print('i * 2 = %d' % (i * 2))`

assistant > code_interpreter

output >
i * 2 = 0
i * 2 = 2
i * 2 = 4

user > exit

About APIs provided by other big-tech AI companies#

APIs from other major AI providers, such as Google (Gemini) and Anthropic (Claude), follow a broadly similar shape: text generation, image generation, audio, multimodal input, and tool calling. The goal of this chapter is not to provide an API reference for every vendor, but to help you build an overall intuition for what these APIs enable, so we skip a detailed introduction to other vendors’ APIs here.

© 2026 Ted Kim. All Rights Reserved.