Data Loading Methods

FOVEA's Wikibase loader supports multiple methods for importing data from Wikidata into your local Wikibase instance.

Overview

Method | Use Case | Environment Variable
------ | -------- | --------------------
Test Data | Development/testing | (default)
Entity List | Known specific entities | WIKIDATA_ENTITIES
Type Filter | All entities of a type | WIKIDATA_TYPES
SPARQL Query | Complex selections | WIKIDATA_SPARQL_FILE
JSON Dump | Bulk import | WIKIDATA_DUMP_PATH
Config File | Combined methods | WIKIDATA_CONFIG_FILE

Method 1: Test Data (Default)

Loads a small representative set of entities for development and testing.

docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader

Default entities include:

  • Entity types: Q5 (human), Q515 (city), Q6256 (country)
  • Event types: Q198 (war), Q178651 (battle)
  • Sample entities: Q937 (Einstein), Q60 (NYC)

Method 2: Entity List

Import specific entities by their Q-IDs.

WIKIDATA_ENTITIES=Q5,Q515,Q937,Q60 \
WIKIDATA_DEPTH=2 \
docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader

Variable | Description
-------- | -----------
WIKIDATA_ENTITIES | Comma-separated Q-IDs
WIKIDATA_DEPTH | Recursion depth for related entities (default: 1)

Method 3: Type Filter (Instance-of)

Import all entities that are instances of specific types.

WIKIDATA_TYPES=Q5,Q515 \
docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader

This imports entities where P31 (instance-of) matches the specified types.

Examples:

  • Q5 - All humans
  • Q515 - All cities
  • Q6256 - All countries

Note: Large types can return thousands of entities. Consider using a SPARQL query (Method 4) for finer control; a sketch follows.
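
For example, a minimal sketch that caps a city import with a small query file. The query, limit, and file names are illustrative, and the file is assumed to be mounted at /data (see Mounting Data Files below):

cat > my-data/cities.sparql <<'EOF'
SELECT ?item WHERE { ?item wdt:P31 wd:Q515 . } LIMIT 500
EOF

WIKIDATA_SPARQL_FILE=/data/cities.sparql \
docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader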

Method 4: SPARQL Query

Import entities based on a SPARQL query file.

WIKIDATA_SPARQL_FILE=/data/queries/scientists.sparql \
docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader

Example query file (scientists.sparql):

SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 .                  # instance of: human
  ?item wdt:P106 wd:Q901 .               # occupation: scientist
  ?item wikibase:sitelinks ?sitelinks .
  FILTER(?sitelinks > 50)                # well-known entities only
} LIMIT 100

The query should return ?item bindings with Wikidata entity URIs.

Method 5: JSON Dump

Import from a standard Wikidata JSON dump file.

WIKIDATA_DUMP_PATH=/data/wikidata-20231201-all.json.gz \
docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader

Supports:

  • .json - Uncompressed JSON
  • .json.gz - Gzip compressed JSON

Warning: Full Wikidata dumps are over 100 GB compressed. Consider using a subset instead; a sketch for extracting one follows.
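
A minimal sketch for cutting a subset from a full dump, assuming the standard dump layout (a single JSON array: "[" on the first line, one entity object per line, every entity line except the last ending in a comma). The entity count and file names are illustrative:

# Keep "[" plus the first 1,000 entities, then drop the trailing comma on the last one
zcat wikidata-20231201-all.json.gz | head -n 1001 | sed -e '$ s/,$//' > subset.json
# Close the JSON array and recompress
echo "]" >> subset.json
gzip subset.json

The resulting subset.json.gz can then be passed to the loader via WIKIDATA_DUMP_PATH as shown above.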

Method 6: Configuration File

Combine multiple import methods in a single YAML file.

WIKIDATA_CONFIG_FILE=/app/config/my-dataset.yaml \
docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader

Example configuration (my-dataset.yaml):

version: "1.0"
name: "Event Annotation Dataset"

entities:
  # Direct entity IDs
  direct:
    - Q5        # Human
    - Q515      # City
    - Q178651   # Battle

  # By type with limits
  by_type:
    Q5:         # All humans
      limit: 100
    Q515:       # Cities
      limit: 50

  # SPARQL queries
  sparql_queries:
    - name: "Famous Scientists"
      query: |
        SELECT ?item WHERE {
          ?item wdt:P31 wd:Q5 .
          ?item wdt:P106 wd:Q901 .
        } LIMIT 50

# How deep to follow entity references
reference_depth: 1

# Languages to include
output:
  languages:
    - en
    - es

See wikibase/config/example-config.yaml for a complete example.

Mounting Data Files

To provide data files to the loader, mount a volume:

# In docker-compose.wikibase.yml, under the loader service (or via the CLI, as sketched below)
volumes:
  - ./my-data:/data:ro
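
The same mount can also be passed on the command line; a sketch using docker compose run's --volume flag (adjust the host path to wherever your data lives):

docker compose -f docker-compose.wikibase.yml --profile loader run --rm \
  -v "$(pwd)/my-data:/data:ro" wikibase-loader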

Then reference files as /data/filename:

WIKIDATA_DUMP_PATH=/data/my-dump.json.gz \
docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader

Incremental Loading

You can run the loader multiple times to add more data:

# First load core types
WIKIDATA_ENTITIES=Q5,Q515,Q6256 \
docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader

# Then add specific entities
WIKIDATA_ENTITIES=Q937,Q60,Q42 \
docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader

Performance Considerations

Factor | Impact | Recommendation
------ | ------ | --------------
Entity count | Linear time | Start with fewer than 1,000 entities
Depth | Exponential growth | Use a depth of 2 or less
SPARQL queries | Wikidata rate limits | Add delays between queries
Batch size | Memory usage | The default of 50 is usually optimal

ID Mapping

When the loader imports entities, Wikibase assigns new sequential IDs (Q1, Q2, Q3...) instead of preserving the original Wikidata IDs (Q42, Q937, Q60). The loader automatically generates a mapping file to track this relationship.

Mapping File Location

The ID mapping file is saved to wikibase/output/id-mapping.json:

{
  "Q42": "Q4",
  "Q937": "Q14",
  "Q60": "Q7",
  "Q5": "Q9"
}

This maps original Wikidata IDs (keys) to local Wikibase IDs (values).
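
To spot-check the mapping from the command line, a quick sketch using jq (the entity ID comes from the example mapping above):

jq -r '."Q42"' wikibase/output/id-mapping.json
# prints Q4 with the example mapping above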

How It's Used

  1. The backend reads the mapping file from WIKIBASE_ID_MAPPING_PATH
  2. The /api/config endpoint serves the mapping to the frontend
  3. The frontend translates local IDs back to original Wikidata IDs in search results
  4. Annotations store the original wikidataId for interoperability

Regenerating the Mapping

If you reset the Wikibase data (docker compose down -v), the old mapping becomes invalid. Re-run the loader to generate a new mapping:

docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader

Note: Entity IDs are assigned in order of import. Importing entities in a different order will result in different local IDs.

For more details on why ID mapping is necessary, see ID Mapping.

Logging

View loader logs:

docker compose -f docker-compose.wikibase.yml --profile loader logs wikibase-loader

Set log level:

LOG_LEVEL=DEBUG docker compose -f docker-compose.wikibase.yml --profile loader run --rm wikibase-loader