This vignette walks through one complete, reproducible workflow for the Search endpoint, ending with read_corpus(abstracts = TRUE).

The goal is to create a corpus that is ready for downstream vector/comparison workflows while preserving query-level partitions.

library(kagiPro)

conn <- kagi_connection(
  api_key = function() keyring::key_get("API_kagi")
)

q <- c(
  biodiversity_search = query_search(
    query = "biodiversity annual report",
    filetype = c("pdf", "docx"),
    expand = FALSE
  )[[1]],
  ecosystem_methods = query_search(
    query = "ecosystem services valuation methods",
    filetype = c("pdf", "docx"),
    expand = FALSE
  )[[1]]
)

Using a named list keeps query names stable throughout the JSON, parquet, and content partitions (query=<query_name>).

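As a minimal sketch of how query names map to partition folders, the base R snippet below builds the content directories for the two queries. It uses only the path pattern shown in this vignette; `query_names` and `root` are illustrative stand-ins for `names(q)` and the project folder, and nothing here calls kagiPro.

```r
# Stand-in for names(q); only the query names matter for partitioning.
query_names <- c("biodiversity_search", "ecosystem_methods")

# Illustrative project root, matching the folder used in this vignette.
root <- "tests_complex"

# One directory per query, following the query=<query_name> convention.
partition_dirs <- file.path(
  root, "search", "content",
  paste0("query=", query_names)
)

print(partition_dirs)
```

Renaming a query in the list therefore changes the folder name, so names are best fixed before the first fetch.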
project_folder <- "tests_complex"

kagi_fetch(
  connection = conn,
  query = q,
  project_folder = project_folder,
  overwrite = TRUE
)

This writes:
tests_complex/search/json
tests_complex/search/parquet

If you run kagi_request() manually instead of kagi_fetch(), convert JSON to parquet explicitly:

kagi_request_parquet(
  input_json = file.path(project_folder, "search", "json"),
  output = file.path(project_folder, "search", "parquet"),
  overwrite = TRUE
)

Download content for all queries in the search endpoint:

download_content(
  project_folder = project_folder,
  endpoint = "search",
  query_name = NULL, # all queries in `search`
  workers = 4
)

This writes binary/source files under:
tests_complex/search/content/query=<query_name>/...

Then convert the downloaded content to markdown:

content_markdown(
  project_folder = project_folder,
  endpoint = "search",
  query_name = NULL, # all queries in `search`
  workers = 4
)

This writes markdown files under:
tests_complex/search/markdown/query=<query_name>/...

Use OpenAI summarization:

markdown_abstract(
  project_folder = project_folder,
  endpoint = "search",
  query_name = NULL, # all queries in `search`
  summarizer_fn = summarize_with_openai,
  model = "gpt-4.1-mini",
  workers = 1 # sequential is recommended for OpenAI rate limits
)

Or use Kagi summarization over extracted text:

markdown_abstract(
  project_folder = project_folder,
  endpoint = "search",
  query_name = NULL, # all queries in `search`
  summarizer_fn = summarize_with_kagi,
  model = "cecil",
  connection = conn,
  workers = 4
)

Abstract parquet files are written to:
tests_complex/search/abstract/query=<query_name>/...

Read parquet only:

ds <- read_corpus(
  project_folder = project_folder,
  endpoint = "search",
  corpus = "parquet",
  abstracts = FALSE
)

Read parquet with linked abstracts (id + query):

ds_abs <- read_corpus(
  project_folder = project_folder,
  endpoint = "search",
  corpus = "parquet",
  abstracts = TRUE,
  silent = TRUE
)

tbl <- dplyr::collect(ds_abs)
names(tbl)

At this stage, tbl is a query-partitioned search corpus with an additional abstract column, ready for downstream modeling and comparison workflows.

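Before modeling, it can help to check how many documents per query actually received an abstract. The sketch below is one way to do that in base R. It runs on a small stand-in data frame (`tbl_demo`, invented for illustration) and assumes only the `id`, `query`, and `abstract` columns mentioned in this vignette; the real `tbl` may carry more columns.

```r
# Illustrative stand-in for the collected corpus; values are made up.
tbl_demo <- data.frame(
  id = c("a1", "a2", "b1"),
  query = c("biodiversity_search", "biodiversity_search", "ecosystem_methods"),
  abstract = c("Summary text.", NA, "Another summary."),
  stringsAsFactors = FALSE
)

# TRUE where an abstract is present and non-empty.
has_abstract <- !is.na(tbl_demo$abstract) & nzchar(tbl_demo$abstract)

# Per-query document count and abstract count.
n_docs <- tapply(has_abstract, tbl_demo$query, length)
n_abstracts <- tapply(has_abstract, tbl_demo$query, sum)

coverage <- data.frame(
  query = names(n_docs),
  n_docs = as.integer(n_docs),
  n_abstracts = as.integer(n_abstracts),
  stringsAsFactors = FALSE
)

print(coverage)
```

Replacing `tbl_demo` with `tbl` should give the same summary for the real corpus, provided the column names match.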
download_content(), content_markdown(), and markdown_abstract() all support selector expansion (endpoint = NULL and/or query_name = NULL).

read_corpus(abstracts = TRUE) expects the abstract parquet schema to contain a lowercase abstract column.
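
If abstract tables come from a custom summarizer, the column-name requirement can be checked before writing parquet. The helper below is a hypothetical sketch, not part of kagiPro: `normalize_abstract_column()` and the demo data are assumptions made for illustration, and the check runs on a plain data frame.

```r
# Hypothetical helper (not part of kagiPro): ensure a data frame has exactly
# one abstract column and that its name is lowercase `abstract`.
normalize_abstract_column <- function(df) {
  hit <- which(tolower(names(df)) == "abstract")
  if (length(hit) != 1L) {
    stop("Expected exactly one abstract column, found ", length(hit))
  }
  names(df)[hit] <- "abstract"
  df
}

# Stand-in data with a capitalised column name; values are made up.
abstracts_demo <- data.frame(
  id = c("a1", "b1"),
  query = c("biodiversity_search", "ecosystem_methods"),
  Abstract = c("Summary text.", "Another summary."),
  stringsAsFactors = FALSE
)

fixed <- normalize_abstract_column(abstracts_demo)
print(names(fixed))
```

The helper stops with an error when the column is missing or duplicated, which surfaces schema problems before read_corpus(abstracts = TRUE) is called.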