Usage

Intended Usage

From the outset, the goal of Subworkflow.ai has not been to replace your AI workflows but to enhance them. We do this by handling the difficult and tedious parts and leaving the fun stuff - prompting, models, data and structured outputs - to you.

Subworkflow.ai's core service offering, the Datasets API, is designed to remove the challenges of working with large documents (and, later, other file types) in LLM-based projects. It does not itself perform VLM-based OCR or document understanding; we believe such LLM-based tasks are best left to the developer to architect and refine for their particular project or use case.

1. Use Subworkflow.ai as part of your AI application

Have a customer-facing application that needs to handle a variety of differently sized documents in bulk? Use Subworkflow.ai as part of your backend service, invisibly and securely processing files on your behalf. Integrating with Subworkflow.ai frees your development team to focus on the key, higher-value features of your product.

2. Use Subworkflow.ai as part of internal automation workflows

Subworkflow.ai works great with internal reports, insurance and compliance policy documents, requests for proposals, questionnaires and legal documents, in many popular formats such as pdf, docx, pptx and xlsx. On our roadmap, we also plan to support audio and video files for use cases such as breaking down user interviews and online meetings.

3. Use Subworkflow.ai if you or your client doesn't have a dev team

As AI workflows grow, file processing requirements inevitably increase, and supporting that increase may require a significant investment in infrastructure. You may not have the resources, time or financial budget for such an investment, the return may take too long to realise for stakeholders, and there is the maintenance commitment afterwards. Subworkflow.ai provides a low-cost yet effective solution in this scenario: your project can use our infrastructure and service from day one.

Technical Usage

There are two ways to retrieve data from the Datasets API after uploading your file.

  • Using the Dataset Items API endpoint
  • Using the Search API endpoint

Using the Dataset Items API for LLMs

Once you've uploaded your document, the Datasets API allows you to retrieve a page, or a range of pages, together with their binary asset links, which you can then use as input to your LLM of choice.

info

Dataset binary data can only be accessed via the Datasets API and requires a token before it can be shared publicly. The share link is self-expiring (default: 10 minutes) and is compatible with all LLMs. This is arguably safer than sharing assets via a CDN, as it protects against unauthorized access resulting from overzealous caching or accidental leaking, e.g. in Google search results.

1. Query for a Dataset Item

curl 'https://api.subworkflow.ai/v1/datasets/<datasetId>/items?row=jpg&cols=1' \
  --header 'x-api-key: <YOUR-API-KEY>'
{
  "sort": ["-createdAt"],
  "offset": 0,
  "limit": 10,
  "total": 1,
  "data": [
    {
      "id": "dsx_B5bsOBDzsXsqfmLo",
      "valueType": "base64",
      "col": 1,
      "row": "jpg",
      "createdAt": 1761910649511,
      "share": {
        "url": "https://api.subworkflow.ai/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH",
        "token": "kCThsH",
        "expiresAt": 1761911418809
      }
    }
  ]
}
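
If you prefer to make this call from code, here is a minimal sketch in Python using the requests library. The endpoint, query parameters and response shape are taken from the example above; checking expiresAt (a millisecond epoch timestamp, as in the sample response) before using the link is our own addition based on the self-expiring behaviour described in the info box.

import time
import requests

API_KEY = "<YOUR-API-KEY>"
DATASET_ID = "<datasetId>"

# Query the Dataset Items endpoint for page 1 as a jpg asset,
# mirroring the curl example above.
resp = requests.get(
    f"https://api.subworkflow.ai/v1/datasets/{DATASET_ID}/items",
    params={"row": "jpg", "cols": 1},
    headers={"x-api-key": API_KEY},
)
resp.raise_for_status()

item = resp.json()["data"][0]
share = item["share"]

# expiresAt is in milliseconds; the share link self-expires
# (default 10 minutes), so re-query for a fresh token if needed.
if share["expiresAt"] / 1000 > time.time():
    print("Share URL ready for LLM input:", share["url"])
else:
    print("Share link expired; re-query the item for a fresh token.")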

2. Use as an Input for your LLM

curl https://api.openai.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4.1",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "what is in this image?"},
          {
            "type": "input_image",
            "image_url": "https://api.subworkflow.ai/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH"
          }
        ]
      }
    ]
  }'
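
The same request via the official OpenAI Python SDK might look like the sketch below. The share URL is the one returned in step 1, and the model name simply follows the curl example; substitute your own.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Pass the self-expiring share URL from step 1 straight to the model.
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "what is in this image?"},
                {
                    "type": "input_image",
                    "image_url": "https://api.subworkflow.ai/v1/share/dsx_B5bsOBDzsXsqfmLo?token=kCThsH",
                },
            ],
        }
    ],
)

print(response.output_text)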

Using the Search API Endpoint

The Datasets API includes a /search endpoint which performs a similarity search over the dataset items within a single dataset, and can also be expanded to span multiple datasets within the workspace.

To use the /search functionality, you must either upload your file using the /vectorize endpoint or trigger a vectorize job on an existing dataset; this generates the search index, without which search is unavailable.

Extract API vs Vectorize API

The /vectorize endpoint is optional and should be used only if you need the search functionality. Under the hood, the /vectorize endpoint actually runs the /extract process before it can generate the search index and therefore takes longer to complete. If you're only interested in the files, stick with the /extract api and trigger the /vectorize process later, when you need it.
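
As a rough illustration of that trade-off, a client might upload via /extract on day one and only trigger /vectorize once search becomes necessary. The endpoint paths come from this page, but the request shapes in this sketch (the multipart "file" field, the datasetId body parameter and the response fields) are assumptions for illustration and may differ from the real API.

import requests

API_KEY = "<YOUR-API-KEY>"
BASE = "https://api.subworkflow.ai/v1"
HEADERS = {"x-api-key": API_KEY}

# Day 1: extract only - faster, since no search index is built.
# NOTE: the multipart field name "file" is an assumption for this sketch.
with open("policy.pdf", "rb") as f:
    extracted = requests.post(
        f"{BASE}/extract", headers=HEADERS, files={"file": f}
    ).json()

dataset_id = extracted["datasetId"]  # assumed response field

# Later, when search is needed: vectorize the existing dataset to
# generate its search index.
# NOTE: passing datasetId in the body is likewise an assumption.
requests.post(f"{BASE}/vectorize", headers=HEADERS,
              json={"datasetId": dataset_id})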

1. Use Natural Language to Search for Matching Dataset Items

When using the /search endpoint, the query is matched on contextual similarity, not fulltext search, i.e. searching for exact text won't work, but searching with questions, topics and ideas will. Since we use image embeddings, the search functionality also has the added benefit of being able to match against photos, screenshots, charts, site plans and graphics within the dataset.

curl -X POST 'https://api.subworkflow.ai/v1/search' \
  --header 'x-api-key: <YOUR-API-KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
    "query": "Where does it mention about policy requirements?",
    "datasetIds": ["<datasetId>"]
  }'

The response is identical to the dataset items response but includes an additional score property. This value represents relevancy to the original query, where 1 is strongly relevant and 0 is not relevant. Note that, at the time of writing, we do not perform "re-ranking" on the results.

{
  "success": true,
  "total": 5,
  "data": [
    {
      "id": "dsx_IiOCCtl1MqtGH6bn",
      "col": 22,
      "row": "jpg",
      "createdAt": 1761933153684,
      "score": 0.22778621,
      "share": {
        "url": "https://api.subworkflow.ai/v1/share/dsx_IiOCCtl1MqtGH6bn?token=c9hHWj",
        "token": "c9hHWj",
        "expiresAt": 1761933532573
      }
    }
    // ... shortened for brevity
  ]
}
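
Since the results are not re-ranked, a simple client-side score cutoff can help keep only the most relevant pages before passing them to an LLM. Below is a minimal Python sketch assuming the request and response shapes shown above; the 0.2 threshold is an arbitrary value chosen for illustration, not a recommended setting.

import requests

API_KEY = "<YOUR-API-KEY>"

# Run a similarity search, mirroring the curl example above.
resp = requests.post(
    "https://api.subworkflow.ai/v1/search",
    headers={"x-api-key": API_KEY},
    json={
        "query": "Where does it mention about policy requirements?",
        "datasetIds": ["<datasetId>"],
    },
)
resp.raise_for_status()

# Scores are raw similarity values (no re-ranking), so apply a simple
# client-side cutoff; 0.2 here is an arbitrary illustrative threshold.
hits = [item for item in resp.json()["data"] if item["score"] >= 0.2]

for item in sorted(hits, key=lambda i: i["score"], reverse=True):
    # Each hit carries a self-expiring share URL, ready for LLM input.
    print(item["score"], item["share"]["url"])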