| Title: | Kagi API Client for R |
|---|---|
| Description: | User-friendly R client for the Kagi APIs (Search, Enrich, Summarizer, and FastGPT). Build endpoint-specific query objects, run single or batch requests with one reusable connection, and write reproducible JSON outputs for analysis pipelines. Includes optional graceful error handling with dummy outputs and JSON-to-parquet conversion for downstream workflows. |
| Authors: | Rainer M. Krug [aut, cre] |
| Maintainer: | Rainer M. Krug <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.4.1 |
| Built: | 2026-06-09 06:35:10 UTC |
| Source: | https://github.com/rkrug/kagiPro |
Remove JSON request data files from endpoint JSON folders in a project while
preserving per-query metadata files (_query_meta.json).
clean_request(project_folder, dry_run = FALSE, verbose = TRUE)
project_folder |
Root project folder containing endpoint subfolders. |
dry_run |
Logical. If |
verbose |
Logical. If |
This function is intended for reclaiming disk space while keeping enough
metadata for kagi_update_query() reruns.
A list with:
details: data frame with per-query deletion counts/bytes
totals: list with files and bytes
dry_run: logical flag
## Not run:
clean_request("kagi_project", dry_run = TRUE)
clean_request("kagi_project", dry_run = FALSE)
## End(Not run)
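To make the clean-up rule concrete, here is a minimal base R sketch of what a dry run could report: JSON files inside `<endpoint>/json` folders, with every `_query_meta.json` left alone. The helper `list_removable_json()` and the demo project are hypothetical and only illustrate the idea; this is not how `clean_request()` is implemented.

```r
# Hypothetical helper, not part of kagiPro: list the JSON files a clean-up
# would remove, keeping every _query_meta.json.
list_removable_json <- function(project_folder) {
  files <- list.files(
    project_folder,
    pattern = "\\.json$",
    recursive = TRUE,
    full.names = TRUE
  )
  # Only look inside <endpoint>/json folders and skip the metadata files.
  in_json_dir <- grepl("/json/", files, fixed = TRUE)
  is_meta <- basename(files) == "_query_meta.json"
  removable <- files[in_json_dir & !is_meta]
  data.frame(
    file = removable,
    bytes = file.size(removable),
    stringsAsFactors = FALSE
  )
}

# Build a throwaway project to try it on (made-up content).
project <- file.path(tempdir(), "kagi_clean_demo")
query_dir <- file.path(project, "search", "json", "query_1")
dir.create(query_dir, recursive = TRUE, showWarnings = FALSE)
writeLines("{}", file.path(query_dir, "search_1.json"))
writeLines("{}", file.path(query_dir, "_query_meta.json"))

removable <- list_removable_json(project)
basename(removable$file)
```

Only the response file is listed; the metadata file that a later rerun would need stays in place.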
Extract Downloaded Content to Markdown
content_markdown(
  project_folder,
  endpoint = NULL,
  query_name = NULL,
  text_root = "markdown",
  output_format = "markdown",
  workers = 4,
  verbose = FALSE,
  progress = interactive()
)
project_folder |
Root project folder containing endpoint subfolders. |
endpoint |
Optional endpoint selector (for example '"search"' or '"enrich_news"'). If 'NULL', all supported endpoints are considered. |
query_name |
Optional query selector. If 'NULL', all query partitions are considered. |
text_root |
Root folder name used for extracted text outputs. |
output_format |
Output format. Only '"markdown"' is supported. |
workers |
Number of parallel workers to use for extraction. |
verbose |
Logical indicating whether progress messages should be shown. |
progress |
Logical indicating whether a progress bar should be shown. |
A data frame with extraction status and diagnostics columns: 'endpoint', 'id', 'query', 'text_path', 'status', 'error'.
Download Endpoint Content for Abstract Generation
download_content(
  project_folder,
  endpoint = NULL,
  query_name = NULL,
  workers = 4,
  progress = interactive(),
  verbose = FALSE
)
project_folder |
Root project folder containing endpoint subfolders. |
endpoint |
Optional endpoint selector (for example '"search"' or '"enrich_news"'). If 'NULL', all supported endpoints are considered. |
query_name |
Optional query selector. If 'NULL', all query partitions are considered. |
workers |
Number of parallel workers to use for downloads. |
progress |
Logical indicating whether a progress bar should be shown. |
verbose |
Logical indicating whether progress messages should be shown. |
A data frame with download status and paths.
Build a typed S3 object of class kagi_connection which holds the
basic configuration required to talk to the Kagi API. This includes
the API base URL, authentication key, and retry settings.
kagi_connection(
  base_url = "https://kagi.com/api/v0",
  api_key = Sys.getenv("KAGI_API_KEY"),
  max_tries = 3
)
base_url |
Character scalar. Base URL for the Kagi API.
Defaults to |
api_key |
API key used for authentication. By default this is read
from the environment variable |
max_tries |
Integer scalar. Maximum number of retry attempts
for transient errors. Defaults to |
An object of class kagi_connection with components:
base_url: Base API URL.
api_key: API key (or a function to resolve it).
max_tries: Maximum retry attempts.
## Not run:
# Basic connection (API key from env var)
conn <- kagi_connection()
conn

# Explicit API key
conn2 <- kagi_connection(api_key = "my-key")

# Lazy API key via keyring
conn3 <- kagi_connection(api_key = function() keyring::key_get("API_kagi"))
## End(Not run)
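Because `api_key` may be either a string or a function that resolves the key, code that consumes a connection has to handle both forms. The sketch below shows one way to do that in base R. `resolve_api_key()` is a hypothetical helper written for this illustration, not a kagiPro function, and the validation rules are assumptions.

```r
# Hypothetical helper: accept a literal key or a function that returns one.
resolve_api_key <- function(api_key) {
  # Call the function lazily so the secret is only fetched when needed.
  key <- if (is.function(api_key)) api_key() else api_key
  if (!is.character(key) || length(key) != 1 || is.na(key) || !nzchar(key)) {
    stop("No usable API key found.", call. = FALSE)
  }
  key
}

literal_key <- resolve_api_key("my-key")
lazy_key <- resolve_api_key(function() "key-from-a-secret-store")
```

Deferring the lookup this way keeps the secret out of the stored object until a request is actually made.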
High-level helper that runs kagi_request() and kagi_request_parquet() in
sequence and writes outputs into endpoint-scoped project folders.
kagi_fetch(
  connection,
  query,
  project_folder = NULL,
  endpoint = NULL,
  overwrite = FALSE,
  workers = 1,
  limit = NULL,
  verbose = FALSE,
  error_mode = c("stop", "write_dummy")
)
connection |
A |
query |
A query object of class |
project_folder |
Root folder for endpoint-scoped outputs. If |
endpoint |
Optional endpoint override. One of |
overwrite |
Logical. If |
workers |
Number of workers for list requests. |
limit |
Optional integer limit used for search/enrich request calls. |
verbose |
Logical indicating whether progress messages should be shown. |
error_mode |
Error handling mode passed to |
Folder layout:
<project_folder>/<endpoint>/json
<project_folder>/<endpoint>/parquet
For a single endpoint, normalized parquet path. For mixed endpoint query lists, a named list of normalized parquet paths by endpoint.
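The folder layout described above can be written down as a small path helper. This is only an illustration of the layout in base R; `endpoint_paths()` is a hypothetical name and not part of the package.

```r
# Hypothetical helper: derive the two output folders for one endpoint.
endpoint_paths <- function(project_folder, endpoint) {
  list(
    json = file.path(project_folder, endpoint, "json"),
    parquet = file.path(project_folder, endpoint, "parquet")
  )
}

paths <- endpoint_paths("kagi_project", "search")
paths$json
paths$parquet
```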
## Not run:
conn <- kagi_connection(api_key = function() keyring::key_get("API_kagi"))
q <- query_search("biodiversity", expand = FALSE)
kagi_fetch(
  connection = conn,
  query = q,
  project_folder = "kagi_project"
)
## End(Not run)
Execute one or more kagiPro query objects against the Kagi API and write
raw JSON responses to disk. This function supports search, enrich (web/news),
summarize, and FastGPT query classes generated by query_search(),
query_enrich_web(), query_enrich_news(), query_summarize(),
and query_fastgpt().
kagi_request(
  connection,
  query,
  limit = NULL,
  output = NULL,
  overwrite = FALSE,
  append = FALSE,
  workers = 1,
  verbose = FALSE,
  error_mode = c("stop", "write_dummy"),
  metadata_request_args = list()
)
connection |
A |
query |
A query object of class |
limit |
Optional integer limit used for search and enrich endpoints. |
output |
Directory where JSON response files are written. |
overwrite |
Logical. If |
append |
Logical. If |
workers |
Number of parallel workers to use when |
verbose |
Logical indicating whether progress messages should be shown. |
error_mode |
Error handling mode. |
metadata_request_args |
Optional named list persisted in replay metadata
( |
If query is a list of query objects, requests are executed in parallel
(using workers) and each query is written into a named subdirectory under
output.
Files are written as {endpoint}_{page}.json (for example search_1.json).
Pagination is handled via meta$next_cursor when provided by the API.
Query replay metadata is written alongside JSON outputs:
per query folder: _query_meta.json
The normalized path to output.
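The combination of cursor-based pagination and the `{endpoint}_{page}.json` naming scheme can be sketched with a mock API. Everything below is illustrative: `fake_pages` and `fetch_page()` are made-up stand-ins for real HTTP calls, and the loop is an assumption about how such paging typically works, not the package's actual code.

```r
# Stand-in for the API: three pages linked by meta$next_cursor (made-up data).
fake_pages <- list(
  start = list(data = list("a", "b"), meta = list(next_cursor = "c2")),
  c2    = list(data = list("c"),      meta = list(next_cursor = "c3")),
  c3    = list(data = list("d"),      meta = list(next_cursor = NULL))
)

# Hypothetical fetcher: a real client would issue an HTTP request here.
fetch_page <- function(cursor) fake_pages[[cursor]]

# Follow next_cursor until it is missing and record one file name per page.
collect_page_files <- function(endpoint, first_cursor = "start") {
  files <- character(0)
  cursor <- first_cursor
  page <- 1L
  while (!is.null(cursor)) {
    response <- fetch_page(cursor)
    files <- c(files, sprintf("%s_%d.json", endpoint, page))
    cursor <- response$meta$next_cursor
    page <- page + 1L
  }
  files
}

page_files <- collect_page_files("search")
page_files
```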
Convert a directory of JSON files written by kagi_request() into an
Apache Parquet dataset. JSON files are processed one-by-one and written as
hive-partitioned parquet by query.
kagi_request_parquet(
  input_json = NULL,
  output = NULL,
  add_columns = list(),
  overwrite = FALSE,
  append = FALSE,
  verbose = TRUE,
  delete_input = FALSE
)
input_json |
Directory containing JSON files from |
output |
Output directory for the parquet dataset. Defaults to a temporary directory. |
add_columns |
List of additional fields to be added to the output. They
have to be provided as a named list, e.g. |
overwrite |
Logical indicating whether to overwrite |
append |
Logical indicating whether to append/update query partitions in
an existing |
verbose |
Logical indicating whether to print progress information.
Defaults to |
delete_input |
Determines if the |
The function uses DuckDB to read the JSON files and to create the
Apache Parquet files. It creates an in-memory DuckDB connection, reads each
JSON response, and writes endpoint-specific tabular data into the parquet
dataset. Files with data = null are skipped.
Output parquet rows include an id column for traceability:
Search: SEARCH_<hash> from normalized url when available.
Enrich web: ENRICH_WEB_<hash> from normalized url when available.
Enrich news: ENRICH_NEWS_<hash> from normalized url when available.
Summarize: SUMMARIZE_<hash> from request metadata.
FastGPT: FASTGPT_<hash> from request metadata.
Returns output invisibly if parquet files were written; otherwise
NULL.
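Hive-style partitioning conventionally stores each partition in a `key=value` directory. As a rough illustration, and assuming the partition key is named `query` (the text above only says the dataset is partitioned by query), the directory for one query could be derived like this. `hive_partition_dir()` is a hypothetical helper.

```r
# Hypothetical helper: hive-style partition directory for one query.
# The key name "query" is an assumption made for this sketch.
hive_partition_dir <- function(output, query_name) {
  file.path(output, paste0("query=", query_name))
}

partition_dir <- hive_partition_dir("kagi_project/search/parquet", "query_1")
partition_dir
```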
Update one query dataset by query_name using metadata written by
kagi_request(). The function scans per-query metadata files under
<project_folder>/<endpoint>/json/<query_name>/_query_meta.json,
re-runs all matching query definitions, and refreshes only the touched
parquet query partitions.
kagi_update_query(
  connection,
  project_folder,
  query_name,
  workers = 1,
  verbose = FALSE,
  error_mode = c("stop", "write_dummy")
)
connection |
A |
project_folder |
Root project folder containing endpoint subfolders. |
query_name |
Query name to update (for example |
workers |
Number of workers for request execution. |
verbose |
Logical indicating whether progress messages should be shown. |
error_mode |
Error handling mode passed to |
If the same query_name exists across multiple endpoints, all matching
endpoints are updated.
Named list of normalized parquet output paths by updated endpoint.
## Not run:
kagi_update_query(
  connection = conn,
  project_folder = "kagi_project",
  query_name = "query_1"
)
## End(Not run)
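The metadata scan described above, which looks for `_query_meta.json` under each endpoint's `json/<query_name>` folder, can be approximated in base R. `find_query_endpoints()` and the demo project are hypothetical; the sketch only shows how a query name can match several endpoints at once.

```r
# Hypothetical helper: which endpoints hold metadata for a given query name?
find_query_endpoints <- function(project_folder, query_name) {
  endpoints <- list.dirs(project_folder, recursive = FALSE, full.names = FALSE)
  meta <- file.path(
    project_folder, endpoints, "json", query_name, "_query_meta.json"
  )
  endpoints[file.exists(meta)]
}

# Throwaway project: query_1 exists for two endpoints, query_2 for a third.
project <- file.path(tempdir(), "kagi_update_demo")
for (ep in c("search", "enrich_news")) {
  dir.create(file.path(project, ep, "json", "query_1"),
             recursive = TRUE, showWarnings = FALSE)
  writeLines("{}", file.path(project, ep, "json", "query_1", "_query_meta.json"))
}
dir.create(file.path(project, "summarize", "json", "query_2"),
           recursive = TRUE, showWarnings = FALSE)
writeLines("{}", file.path(project, "summarize", "json", "query_2", "_query_meta.json"))

matched <- find_query_endpoints(project, "query_1")
matched
```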
Read markdown files generated for a specific endpoint/query and summarize each record with either OpenAI or Kagi text summarization. The result is written as a single parquet file per query under 'abstract/'.
markdown_abstract(
  project_folder,
  endpoint = NULL,
  query_name = NULL,
  workers = 4,
  progress = interactive(),
  verbose = FALSE,
  summarizer_fn = summarize_with_openai,
  model = "gpt-4.1-mini",
  connection = NULL,
  provider_args = list(),
  markdown_root = "markdown",
  abstract_root = "abstract"
)
project_folder |
Root project folder containing endpoint subfolders. |
endpoint |
Optional endpoint selector (for example '"search"' or '"enrich_news"'). If 'NULL', all supported endpoints are considered. |
query_name |
Optional query selector. If 'NULL', all query partitions are considered. |
workers |
Number of parallel workers to use for summarization. |
progress |
Logical indicating whether progress messages should be shown. |
verbose |
Logical indicating whether detailed messages should be shown. |
summarizer_fn |
Function with signature 'fn(text, model, ...) -> character(1) | NA_character_'. |
model |
Provider-specific model/engine. |
connection |
Optional [kagi_connection()] object. Used for [summarize_with_kagi()] when not supplied via 'provider_args'. |
provider_args |
Optional named list forwarded to 'summarizer_fn'. |
markdown_root |
Root folder name containing markdown files. |
abstract_root |
Root folder name for abstract parquet outputs. |
Invisibly returns a data frame with columns 'endpoint', 'id', 'query', 'abstract', 'status', 'error'.
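Since `summarizer_fn` only needs the signature `fn(text, model, ...)` and a return value of one string or `NA_character_`, a custom summarizer can be quite small. The function below is a hypothetical offline example that keeps the first sentences of the text. The name `summarize_first_sentences()` and its `n_sentences` argument are inventions for this sketch.

```r
# Hypothetical summarizer matching fn(text, model, ...) -> character(1) | NA.
# It ignores `model` and simply keeps the first `n_sentences` sentences.
summarize_first_sentences <- function(text, model = NULL, ..., n_sentences = 2) {
  if (length(text) != 1 || is.na(text) || !nzchar(trimws(text))) {
    return(NA_character_)
  }
  # Split on whitespace that follows sentence-ending punctuation.
  sentences <- unlist(strsplit(trimws(text), "(?<=[.!?])\\s+", perl = TRUE))
  paste(head(sentences, n_sentences), collapse = " ")
}

short <- summarize_first_sentences("One. Two. Three.", model = "none")
empty <- summarize_first_sentences("", model = "none")
```

A function of this shape could in principle be supplied as `summarizer_fn`, which is handy for testing a pipeline without calling an external provider.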
Open a Kagi search in the browser
open_search_query(query, session_token = NULL)
query |
A full query string (typically from [query_search()]). |
session_token |
Optional Kagi session token for private search (see your Kagi account's "Session Link"). |
Construct one or more query strings for the Kagi Search API by combining
free-text terms with structured operators such as filetype:, site:,
inurl:, and intitle:.
Use kagi_request() to execute the request
and obtain the JSON responses.
query_enrich_news(
  query,
  filetype = NULL,
  site = NULL,
  inurl = NULL,
  intitle = NULL,
  expand = TRUE
)
query |
Character vector of free-text query terms (required). These can include quoted phrases and boolean operators. |
filetype |
Optional character vector of file type extensions
(e.g. |
site |
Optional character vector of domains
(e.g. |
inurl |
Optional character vector of URL substrings that must be
present in the result URL. Each is prefixed with |
intitle |
Optional character vector of terms that must appear in
the page title. Each is prefixed with |
expand |
Logical, default |
This helper makes it easy to build reproducible, complex queries with
structured operators. Use expand = TRUE when you want all possible
combinations (useful in systematic search contexts). Use expand = FALSE
when you want a single combined query.
A named list containing query strings of class
kagi_query_enrich_news, to be used in kagi_request().
open_search_query(), kagi_request(), kagi_request_parquet()
## Not run:
# Single combined query
query_search(
  query = "biodiversity",
  filetype = c("pdf", "docx"),
  site = "example.com",
  expand = FALSE
)

# Expanded combinations
query_search(
  query = c("biodiversity", "ecosystem"),
  filetype = c("pdf", "docx"),
  site = c("example.com", "gov"),
  expand = TRUE
)

# Open a generated query manually in browser
open_search_query(query_search("openalex api", site = "docs.openalex.org")[[1]])
## End(Not run)
Construct one or more query strings for the Kagi Search API by combining
free-text terms with structured operators such as filetype:, site:,
inurl:, and intitle:.
Use kagi_request() to execute the request
and obtain the JSON responses.
query_enrich_web(
  query,
  filetype = NULL,
  site = NULL,
  inurl = NULL,
  intitle = NULL,
  expand = TRUE
)
query |
Character vector of free-text query terms (required). These can include quoted phrases and boolean operators. |
filetype |
Optional character vector of file type extensions
(e.g. |
site |
Optional character vector of domains
(e.g. |
inurl |
Optional character vector of URL substrings that must be
present in the result URL. Each is prefixed with |
intitle |
Optional character vector of terms that must appear in
the page title. Each is prefixed with |
expand |
Logical, default |
This helper makes it easy to build reproducible, complex queries with
structured operators. Use expand = TRUE when you want all possible
combinations (useful in systematic search contexts). Use expand = FALSE
when you want a single combined query.
A named list containing query strings of class
kagi_query_enrich_web, to be used in kagi_request().
open_search_query(), kagi_request(), kagi_request_parquet()
## Not run:
# Single combined query
query_search(
  query = "biodiversity",
  filetype = c("pdf", "docx"),
  site = "example.com",
  expand = FALSE
)

# Expanded combinations
query_search(
  query = c("biodiversity", "ecosystem"),
  filetype = c("pdf", "docx"),
  site = c("example.com", "gov"),
  expand = TRUE
)

# Open a generated query manually in browser
open_search_query(query_search("openalex api", site = "docs.openalex.org")[[1]])
## End(Not run)
Construct one or more FastGPT query payloads for POST /fastgpt.
Use kagi_request() to execute the request and obtain JSON responses.
query_fastgpt(query, cache = TRUE, web_search = TRUE)
query |
Character vector. Query text to answer. |
cache |
Logical. Whether cached responses are allowed. Default: |
web_search |
Logical. Whether to use web search enrichment. Default: |
The Kagi FastGPT API currently rejects web_search = FALSE (the option is out
of service), so this constructor enforces web_search = TRUE.
A named list of query objects of class kagi_query_fastgpt to be
used in kagi_request().
## Not run:
query_fastgpt("Python 3.11")
query_fastgpt(c("Python 3.11", "What is biodiversity?"))
## End(Not run)
Construct one or more query strings for the Kagi Search API by combining
free-text terms with structured operators such as filetype:, site:,
inurl:, and intitle:. Queries can either be concatenated into a
single string or expanded into a Cartesian product of all combinations.
query_search(
  query,
  filetype = NULL,
  site = NULL,
  inurl = NULL,
  intitle = NULL,
  expand = TRUE,
  open_in_browser = FALSE
)
query |
Character vector of free-text query terms (required). These can include quoted phrases and boolean operators. |
filetype |
Optional character vector of file type extensions
(e.g. |
site |
Optional character vector of domains
(e.g. |
inurl |
Optional character vector of URL substrings that must be
present in the result URL. Each is prefixed with |
intitle |
Optional character vector of terms that must appear in
the page title. Each is prefixed with |
expand |
Logical, default |
open_in_browser |
Logical, default |
This helper makes it easy to build reproducible, complex queries with
structured operators. Use expand = TRUE when you want all possible
combinations (useful in systematic search contexts). Use expand = FALSE
when you want a single combined query.
A named list containing the query strings of class kagi_query_search,
to be used in kagi_request().
open_search_query(), kagi_request(), kagi_request_parquet()
## Not run:
# Single combined query
query_search(
  query = "biodiversity",
  filetype = c("pdf", "docx"),
  site = "example.com",
  expand = FALSE
)

# Expanded combinations
query_search(
  query = c("biodiversity", "ecosystem"),
  filetype = c("pdf", "docx"),
  site = c("example.com", "gov"),
  expand = TRUE
)

# Immediately open in browser
query_search("openalex api", site = "docs.openalex.org", open_in_browser = TRUE)
## End(Not run)
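The difference between a Cartesian expansion and a single combined query string can be illustrated with base R's `expand.grid()`. The function `build_queries()` below is a hypothetical re-implementation for illustration only and is not kagiPro's code. It covers just `filetype` and `site`, and joining values with plain spaces when `expand = FALSE` is an assumption, since the exact joining rule is not stated here.

```r
# Hypothetical sketch of operator prefixing plus optional Cartesian expansion.
build_queries <- function(query, filetype = NULL, site = NULL, expand = TRUE) {
  # Turn c("pdf", "docx") into c("filetype:pdf", "filetype:docx").
  prefix <- function(values, operator) {
    if (is.null(values)) "" else paste0(operator, ":", values)
  }
  parts <- list(
    query = query,
    filetype = prefix(filetype, "filetype"),
    site = prefix(site, "site")
  )
  if (expand) {
    # One row per combination of query term, filetype and site.
    grid <- expand.grid(parts, stringsAsFactors = FALSE)
    out <- unname(apply(grid, 1, paste, collapse = " "))
  } else {
    # Assumption: all values are joined with spaces into one string.
    out <- paste(vapply(parts, paste, character(1), collapse = " "), collapse = " ")
  }
  # Drop the extra spaces left behind by unused operators.
  trimws(gsub(" +", " ", out))
}

expanded <- build_queries(c("biodiversity", "ecosystem"), filetype = c("pdf", "docx"))
combined <- build_queries("biodiversity", filetype = c("pdf", "docx"),
                          site = "example.com", expand = FALSE)
expanded
combined
```

With two query terms and two file types, the expanded form yields four query strings, while the combined form always yields one.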
Construct a typed S3 object of class kagi_summarize that describes a
Universal Summarizer request. Use kagi_request() to execute the request
and obtain the JSON responses.
query_summarize(
  url = NULL,
  text = NULL,
  engine = NULL,
  summary_type = NULL,
  target_language = NULL,
  cache = TRUE
)
url |
Optional character scalar. URL to be summarized. Mutually exclusive
with |
text |
Optional character scalar. Raw text to be summarized. Mutually
exclusive with |
engine |
Character scalar. Summarizer engine (options: |
summary_type |
Character scalar. Type of summary requested (options:
|
target_language |
Character scalar. Target language ISO code. Supported
codes: |
cache |
Logical. Whether to allow API-side caching. |
A named list of kagi_query_summarize objects to be passed to
kagi_request().
## Not run:
req <- query_summarize(text = "Lorem ipsum")
req
## End(Not run)
Modelled on 'openalexPro::read_corpus()' with an additional 'abstracts' switch. By default this opens an Arrow dataset from a parquet directory. When 'return_data = TRUE', the result is collected into memory.
read_corpus(
  project_folder,
  endpoint,
  corpus = "parquet",
  return_data = FALSE,
  abstracts = FALSE,
  silent = FALSE
)
project_folder |
Root project folder. |
endpoint |
Endpoint folder name under 'project_folder'. |
corpus |
Folder name under 'project_folder/endpoint' to read as parquet corpus. Defaults to '"parquet"'. |
return_data |
Logical; if 'TRUE', collect and return in-memory data. |
abstracts |
Logical; if 'TRUE', link sibling abstract data by 'id' and 'query'. |
silent |
Logical; if 'TRUE', suppress informative messages. |
If 'abstracts = TRUE', abstract data is read from the sibling 'abstract' folder and left-joined by 'id' + 'query'. If no abstract files are present, an 'abstract' column filled with 'NA' is added.
An Arrow dataset/query when 'return_data = FALSE', otherwise a data frame/tibble.
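The left join by `id` and `query`, and the all-`NA` fallback when no abstracts exist, behave like the following in-memory base R sketch. The data frames are made-up examples, and `merge()` stands in for whatever join the package performs on Arrow data.

```r
# Made-up corpus rows and a single matching abstract.
corpus <- data.frame(
  id = c("SEARCH_a1", "SEARCH_b2"),
  query = c("query_1", "query_1"),
  title = c("First hit", "Second hit"),
  stringsAsFactors = FALSE
)
abstracts <- data.frame(
  id = "SEARCH_a1",
  query = "query_1",
  abstract = "Short summary of the first hit.",
  stringsAsFactors = FALSE
)

# Left join: every corpus row is kept, unmatched rows get NA abstracts.
linked <- merge(corpus, abstracts, by = c("id", "query"), all.x = TRUE)

# No abstract data at all: add an all-NA column instead of joining.
unlinked <- corpus
unlinked$abstract <- NA_character_
```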
Summarize Text via Kagi Summarize Endpoint
summarize_with_kagi(
  text,
  model = "cecil",
  connection = NULL,
  api_key = NULL,
  base_url = NULL,
  summary_type = "summary",
  target_language = "EN",
  cache = TRUE,
  retry_max_tries = 5
)
text |
Plain text to summarize. |
model |
Kagi summarize engine ('"cecil"', '"agnes"', '"muriel"', '"daphne"'). |
connection |
Optional [kagi_connection()] object. |
api_key |
Optional Kagi API key override. |
base_url |
Optional Kagi API base URL override. |
summary_type |
Summarize mode ('"summary"' or '"takeaway"'). |
target_language |
Target language code. |
cache |
Cache flag forwarded to Kagi summarize endpoint. |
retry_max_tries |
Maximum number of HTTP retry attempts passed to [httr2::req_retry()]. |
A single summary string (or 'NA_character_').
Summarize Text via OpenAI Chat Completions
summarize_with_openai(
  text,
  model = "gpt-4.1-mini",
  api_key = Sys.getenv("API_openai", ""),
  base_url = "https://api.openai.com/v1",
  system_prompt = "Summarize input text in 4-6 concise sentences for literature review.",
  retry_max_tries = 5
)
text |
Plain text to summarize. |
model |
OpenAI model name. |
api_key |
OpenAI API key. Defaults to 'API_openai'. |
base_url |
OpenAI API base URL. |
system_prompt |
Prompt used to guide summarization behavior. |
retry_max_tries |
Maximum number of HTTP retry attempts passed to [httr2::req_retry()]. |
A single summary string (or 'NA_character_').