January 28, 2015

Ever wondered what the top subjects / predicates / objects are in DBpedia?

I recently came across this problem while trying to draw a random sample of nodes from DBpedia which follow a given degree distribution for my PhD.

Turns out this is actually more difficult than I expected, mostly because quad stores don’t optimize for such queries. This means that you can’t just ask a SPARQL endpoint (not even your local one) to give you the top subjects, predicates or objects with a query like this:

SELECT ?n (COUNT(*) AS ?c)
WHERE {
  ?n ?p ?o.
}
GROUP BY ?n
ORDER BY DESC(?c)
LIMIT 10

Try it yourself here if you don’t believe me… (I set it to time out after 15 seconds, and it will return a dangerously nonsensical result if you’re not aware that you might get partial answers).

Some Rant

So this led me to the fascinating conclusion that our beloved RDF query language doesn’t even allow us to answer simple questions such as “which node is most often used as a subject / predicate / object?” (we’re talking with a single SPARQL endpoint here, don’t even try dragging me into an open/closed world assumption discussion, …).

So, all is great, let’s just not ask those evil questions…

… said no (computer) scientist ever.

So let’s get our hands dirty and use some unix tool magic…

Working with Dumps in NT Format

Luckily, I already had all the dumps laid out locally as described here, and lucky again, they are in N-Triples format.

N-Triples is a line based format, which means we have exactly one triple per line. I don’t exactly know whom to thank for this, but should you ever read this (wait, why are you reading my blog?) THANK YOU. It means that neither subject nor predicate nor object can contain (unescaped) newlines. And this means that you can actually quite sanely sort and parse .nt files with standard unix tools that have been optimized by generations of smart people.

I think you see where this is going: a good old bash one-liner with grep, cut, sort and uniq, by far the fastest tools I know for the job.
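To get a feeling for why the one-triple-per-line property is so handy, here is a minimal sketch on a made-up sample. The example.org URIs and the helper name count_triples are my own placeholders, not anything from the dumps. Counting triples boils down to counting the lines that are neither comments nor blank:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the triples in N-Triples text on stdin,
# ignoring comment lines and blank lines.
count_triples() {
  grep -v -E '^[[:space:]]*(#|$)' | wc -l | tr -d ' '
}

# Made-up sample data (the example.org URIs are placeholders).
sample='# a comment
<http://example.org/a> <http://example.org/p> <http://example.org/b> .
<http://example.org/a> <http://example.org/p> "a literal with spaces" .
<http://example.org/b> <http://example.org/q> <http://example.org/a> .'

printf '%s\n' "$sample" | count_triples   # prints 3
```

No RDF parser is involved at any point, which is exactly what makes the approach scale to hundreds of millions of lines.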

A Word about Sort Orders and Locales

Sort orders depend on your locale! This means that files sorted with a locale such as en_US.UTF-8 are not properly sorted for someone with a locale such as de_DE.UTF-8. Hence it’s wise to always run this in a shell before working with sort:

export LC_ALL=C

It resets your locale to a classic C byte-wise one, having the nice side effect that it’s faster as well.
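A tiny sketch of what byte-wise ordering means in practice. The helper name bytewise_sort is just an illustration, and the four input lines are made up:

```shell
#!/usr/bin/env bash
# Hypothetical helper: sort stdin byte-wise, independent of the caller's locale.
bytewise_sort() {
  LC_ALL=C sort
}

printf '%s\n' b A a B | bytewise_sort
# With LC_ALL=C this prints A, B, a, b (one per line), because uppercase
# ASCII bytes are smaller than lowercase ones. Under a locale such as
# en_US.UTF-8 the order may differ.
```

The important part is not which order is "right", but that every sort, uniq and later merge step in a pipeline agrees on the same one.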

Deduplication

First, it turns out the DBpedia dumps actually contain quite an astonishing amount of duplicate triples. This is not a problem if loaded into a quad store as they’ll just count once, but for counting them like we will, it is a problem.

To separate the duplicates, let’s do the following. We pick up all the dump files that are loaded into our endpoint with pv, a handy little tool similar to cat that also shows a nice progress bar. Then we decompress with zcat, remove comments from the files with grep and call sort. We tell sort to use a ton of RAM (32 GB), but not even that is enough for the > 80 GB of decompressed dumps, so we need temp files. We can direct sort to put them onto an SSD instead of the default /tmp, and we can also compress those temp files on the fly. lzop is a very fast compression tool and the perfect fit for this (not compressing the temp files actually degrades performance, even at the SSD’s 300 MB/s write speed!).

After this we use tee to multiplex our stream into two channels. The first runs a plain uniq and compresses the result with pigz (like gzip but parallel, as gzipping > 80 GB otherwise becomes quite the bottleneck) into dbpedia_uniq.nt.gz. The second runs uniq -c -d, which outputs only the duplicated lines together with their counts, and gzips them into dbpedia_dups.txt.gz (single-threaded gzip is OK here, as the output isn’t sooo big).

pv /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/{dbpedia.org,ext.dbpedia.org,pagelinks.dbpedia.org,topicalconcepts.dbpedia.org}/* |
  zcat |  # decompress
  grep -v -E '^\s*#' |  # ignore comments in the nt files
  sort -S32G -T/ssd/tmp/ --compress-program=lzop |
  tee \
    >( uniq | pigz > dbpedia_uniq.nt.gz ) \
    >( uniq -c -d | gzip > dbpedia_dups.txt.gz ) \
    >/dev/null

As you can see from the first line, I include the external, pagelinks and topicalconcepts datasets, but the process is really the same no matter which datasets you pick.

After ~ 10 minutes we’re left with a 6.5 GB dbpedia_uniq.nt.gz (547,084,682 unique triples) and a 238 MB dbpedia_dups.txt.gz.
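If you want to see what the two uniq branches produce without waiting for 80 GB to sort, here is a small sketch on made-up data. To keep it easy to follow it runs the two branches one after the other instead of multiplexing with tee; the helper names and the example.org URIs are my own placeholders:

```shell
#!/usr/bin/env bash
export LC_ALL=C

# Made-up sample with one duplicated triple.
sample='<http://example.org/b> <http://example.org/p> "x" .
<http://example.org/a> <http://example.org/p> "x" .
<http://example.org/b> <http://example.org/p> "x" .'

# Branch 1: every distinct line exactly once.
unique_lines() { sort | uniq; }

# Branch 2: only the lines that occur more than once, prefixed with their count.
duplicate_counts() { sort | uniq -c -d; }

printf '%s\n' "$sample" | unique_lines      # 2 lines
printf '%s\n' "$sample" | duplicate_counts  # 1 line, with count 2
```

Both branches rely on the input being sorted, because uniq only compares adjacent lines.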

Top Duplicates

The top duplicates as acquired with

zcat dbpedia_dups.txt.gz | sort -n -r -S8G | head -n20

are (full file (238 MB)):

   4891 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_Slovenia.svg> .
   4891 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg?width=300> .
   4891 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_Slovenia.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_Slovenia.svg> .
   1520 <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Naval_Ensign_of_the_United_Kingdom.svg> .
   1520 <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg?width=300> .
   1520 <http://commons.wikimedia.org/wiki/Special:FilePath/Naval_Ensign_of_the_United_Kingdom.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Naval_Ensign_of_the_United_Kingdom.svg> .
   1195 <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Airplane_silhouette.svg> .
   1195 <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg?width=300> .
   1195 <http://commons.wikimedia.org/wiki/Special:FilePath/Airplane_silhouette.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Airplane_silhouette.svg> .
   1188 <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Med_1.png> .
   1188 <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png?width=300> .
   1188 <http://commons.wikimedia.org/wiki/Special:FilePath/Med_1.png> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Med_1.png> .
   1159 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_the_British_Army.svg> .
   1159 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg?width=300> .
   1159 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag_of_the_British_Army.svg> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Flag_of_the_British_Army.svg> .
    914 <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Cricket_no_pic.png> .
    914 <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png> <http://xmlns.com/foaf/0.1/thumbnail> <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png?width=300> .
    914 <http://en.wikipedia.org/wiki/Special:FilePath/Cricket_no_pic.png> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Cricket_no_pic.png> .
    885 <http://dbpedia.org/resource/List_of_Tachinidae_genera> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/List_of_Tachinidae_genera> .
    784 <http://commons.wikimedia.org/wiki/Special:FilePath/Illinois_-_outline_map.svg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://en.wikipedia.org/wiki/File:Illinois_-_outline_map.svg> .

Getting S,P,O Counts

OK, now let’s count the subject, predicate and object occurrences.
Subjects, predicates and objects are delimited by a single space (“ ”); everything after the second space we just count as the object (so the final “ .” counts as part of the object).
Similar to the above pipeline, we use tee again to multiplex the stream into three pipelines for subject, predicate and object counts.
Each of them is mostly based on cut, first to get the fields (-f1 for subject, -f2 predicate, -f3- object), then for limiting very long strings to only the first 1024 chars. While this actually introduces some false positive matches for long literals (two different literals that share their first 1024 characters are counted as one object), it’s probably safe for URIs, and reduces sort times and file sizes for the object chunk a lot. If you want very accurate counts you should probably re-run without the cut -c-1024 lines.
Afterwards in each pipeline the occurrences of a node in the s,p,o positions are sorted and counted with uniq -c, then gzipped with pigz.

pv dbpedia_uniq.nt.gz |
  zcat |
  tee \
    >( cut -f1 -d' ' |
       cut -c-1024 |
       sort -S16G -T/ssd/tmp/ --compress-program=lzop |
       uniq -c |
       pigz > dbpedia_1_subject_counts.txt.gz ) \
    >( cut -f2 -d' ' |
       cut -c-1024 |
       sort -S16G -T/ssd/tmp/ --compress-program=lzop |
       uniq -c |
       pigz > dbpedia_2_predicate_counts.txt.gz ) \
    >( cut -f3- -d' ' |
       cut -c-1024 |
       sort -S16G -T/ssd/tmp/ --compress-program=lzop |
       uniq -c | pigz > dbpedia_3_object_counts.txt.gz ) \
    >/dev/null
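To make the field splitting and the truncation caveat a bit more tangible, here is a sketch on a single made-up triple. The helper names are hypothetical, and I truncate at 10 characters instead of 1024 so the effect shows up on short strings:

```shell
#!/usr/bin/env bash
export LC_ALL=C

# Made-up triple whose object is a literal containing a space.
line='<http://example.org/s> <http://example.org/p> "two words"@en .'

subject_of()   { cut -f1  -d' '; }
predicate_of() { cut -f2  -d' '; }
object_of()    { cut -f3- -d' '; }   # keeps everything up to and including " ."

printf '%s\n' "$line" | subject_of     # <http://example.org/s>
printf '%s\n' "$line" | predicate_of   # <http://example.org/p>
printf '%s\n' "$line" | object_of      # "two words"@en .

# Truncation merges objects that share a prefix (10 chars here, not 1024):
# both literals are cut down to the same string and counted as one object.
printf '%s\n' '"aaaaaaaaaaaa1" .' '"aaaaaaaaaaaa2" .' |
  cut -c-10 | sort | uniq -c
```

The -f3- is what keeps literals with spaces in one piece; a plain -f3 would cut them off at the first space.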

After 15 minutes we’re left with 3 files:
dbpedia_1_subject_counts.txt.gz (214M), dbpedia_2_predicate_counts.txt.gz (387K), dbpedia_3_object_counts.txt.gz (1.9G)

As expected there are only relatively few different predicates and the objects actually take up quite a lot of data.

Before getting the tops it’s quite useful to exclude subjects and objects that occur fewer than 10 times with awk, which greatly reduces the file sizes and subsequent sort times:

zcat dbpedia_1_subject_counts.txt.gz | awk ' $1 > 9 { print } ' | pigz > dbpedia_1_subject_counts_o9.txt.gz
zcat dbpedia_3_object_counts.txt.gz | awk ' $1 > 9 { print } ' | pigz > dbpedia_3_object_counts_o9.txt.gz

dbpedia_1_subject_counts_o9.txt.gz (89M), dbpedia_3_object_counts_o9.txt.gz (30M).

As we can see from the size reduction already, there are actually way more objects occurring fewer than 10 times than subjects.
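The awk filter works directly on the output of uniq -c because awk splits on runs of whitespace, so the padded count is always field 1. A small sketch with a configurable threshold; the helper name min_count and the sample lines are made up:

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only "count node" lines whose count is at least $1.
min_count() {
  awk -v min="$1" '$1 + 0 >= min + 0 { print }'
}

# Made-up input in the shape produced by uniq -c (padded count, then the node).
printf '%s\n' \
  '     12 <http://example.org/a>' \
  '      9 <http://example.org/b>' \
  '     10 <http://example.org/c>' |
  min_count 10
# keeps the lines for a and c, drops b
```

min_count 10 is equivalent to the $1 > 9 condition used above; the + 0 just forces a numeric comparison.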

Similar to before the tops can be acquired with something like this:

for f in dbpedia_1_subject_counts_o9.txt.gz dbpedia_2_predicate_counts.txt.gz dbpedia_3_object_counts_o9.txt.gz ; do
  zcat "$f" | sort -n -r | pigz > "${f%.txt.gz}_tops.txt.gz"
done
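If you only care about the first few entries, a sorted copy on disk isn't strictly needed. A sketch of a top-N helper (top_n is a name I made up, as is the sample data):

```shell
#!/usr/bin/env bash
export LC_ALL=C

# Hypothetical helper: print the $1 lines with the highest leading count.
top_n() {
  sort -n -r | head -n "$1"
}

# Made-up input in uniq -c shape.
printf '%s\n' \
  '      5 <http://example.org/c>' \
  '     30 <http://example.org/list>' \
  '     12 <http://example.org/b>' |
  top_n 2
# prints the lines with counts 30 and 12, in that order
```

sort -n skips the leading blanks of the uniq -c output, so the padded counts compare as numbers.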

So here they are, the …

Top 100 Subjects:

dbpedia_1_subject_counts_o9_tops.txt.gz (95M)

   8118 <http://dbpedia.org/resource/Alphabetical_list_of_communes_of_Italy>
   7110 <http://dbpedia.org/resource/List_of_places_in_Afghanistan>
   6162 <http://dbpedia.org/resource/Index_of_Andhra_Pradesh-related_articles>
   5857 <http://dbpedia.org/resource/List_of_populated_places_in_Bosnia_and_Herzegovina>
   5712 <http://dbpedia.org/resource/2013_in_film>
   5550 <http://dbpedia.org/resource/List_of_municipalities_of_Brazil>
   5458 <http://dbpedia.org/resource/List_of_dialling_codes_in_Germany>
   5405 <http://dbpedia.org/resource/IUCN_Red_List_vulnerable_species_(Plantae)>
   5392 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_3_of_4>
   5392 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_2_of_4>
   5392 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_1_of_4>
   5182 <http://dbpedia.org/resource/IUCN_Red_List_vulnerable_species_(Animalia)>
   5152 <http://dbpedia.org/resource/Index_of_India-related_articles>
   5090 <http://dbpedia.org/resource/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States>
   5068 <http://dbpedia.org/resource/List_of_Social_Democratic_Party_of_Germany_members>
   4942 <http://dbpedia.org/resource/List_of_painters_in_the_Web_Gallery_of_Art>
   4873 <http://dbpedia.org/resource/List_of_stage_names>
   4829 <http://dbpedia.org/resource/List_of_CJK_Unified_Ideographs,_part_4_of_4>
   4795 <http://dbpedia.org/resource/List_of_Harvard_University_people>
   4743 <http://dbpedia.org/resource/List_of_OMIM_disorder_codes>
   4726 <http://dbpedia.org/resource/List_of_populated_places_in_Serbia>
   4698 <http://dbpedia.org/resource/List_of_populated_places_in_Serbia_(alphabetic)>
   4690 <http://dbpedia.org/resource/Index_of_philosophy_articles_(I%E2%80%93Q)>
   4603 <http://dbpedia.org/resource/List_of_molluscan_genera_represented_in_the_fossil_record>
   4493 <http://dbpedia.org/resource/List_of_American_television_programs_by_date>
   4457 <http://dbpedia.org/resource/List_of_biographical_films>
   4443 <http://dbpedia.org/resource/List_of_brachiopod_genera>
   4355 <http://dbpedia.org/resource/List_of_English_writers>
   4345 <http://dbpedia.org/resource/List_of_composers_by_name>
   4341 <http://dbpedia.org/resource/List_of_historical_German_and_Czech_names_for_places_in_the_Czech_Republic>
   4329 <http://dbpedia.org/resource/2012_in_film>
   4275 <http://dbpedia.org/resource/List_of_people_from_Illinois>
   4219 <http://dbpedia.org/resource/List_of_people_from_Texas>
   4218 <http://dbpedia.org/resource/List_of_village_development_committees_of_Nepal>
   4194 <http://dbpedia.org/resource/List_of_postal_codes_in_Portugal>
   4159 <http://dbpedia.org/resource/IUCN_Red_List_data_deficient_species_(Chordata)>
   4140 <http://dbpedia.org/resource/List_of_trilobite_genera>
   4137 <http://dbpedia.org/resource/List_of_aircraft_engines>
   4130 <http://dbpedia.org/resource/List_of_moths_of_Taiwan>
   4084 <http://dbpedia.org/resource/List_of_flora_of_the_Sonoran_Desert_Region_by_common_name>
   3992 <http://dbpedia.org/resource/List_of_film_score_composers>
   3984 <http://dbpedia.org/resource/List_of_marine_gastropod_genera_in_the_fossil_record>
   3930 <http://dbpedia.org/resource/List_of_performances_on_Top_of_the_Pops>
   3886 <http://dbpedia.org/resource/List_of_gliders>
   3873 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Romania>
   3839 <http://dbpedia.org/resource/List_of_20th-century_classical_composers>
   3768 <http://dbpedia.org/resource/Rosters_of_the_top_basketball_teams_in_European_club_competitions>
   3740 <http://dbpedia.org/resource/List_of_airports_by_ICAO_code:_K>
   3705 <http://dbpedia.org/resource/List_of_United_States_counties_and_county_equivalents>
   3659 <http://dbpedia.org/resource/List_of_Russian_people>
   3646 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Germany>
   3615 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Switzerland>
   3597 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Slovakia>
   3589 <http://dbpedia.org/resource/List_of_protected_areas_of_China>
   3583 <http://dbpedia.org/resource/List_of_Advanced_Dungeons_&_Dragons_2nd_edition_monsters>
   3541 <http://dbpedia.org/resource/Index_of_U.S._counties>
   3502 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Hungary>
   3499 <http://dbpedia.org/resource/The_opera_corpus>
   3466 <http://dbpedia.org/resource/List_of_German_Christian_Democratic_Union_politicians>
   3466 <http://dbpedia.org/resource/2012%E2%80%9313_UEFA_Europa_League_qualifying_phase_and_play-off_round>
   3439 <http://dbpedia.org/resource/List_of_viruses>
   3439 <http://dbpedia.org/resource/Google_Street_View_in_the_United_States>
   3432 <http://dbpedia.org/resource/2013%E2%80%9314_UEFA_Europa_League_qualifying_phase_and_play-off_round>
   3430 <http://dbpedia.org/resource/List_of_Lepidoptera_of_the_Czech_Republic>
   3392 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Greece>
   3378 <http://dbpedia.org/resource/List_of_surnames_in_Russia>
   3378 <http://dbpedia.org/resource/List_of_film_director_and_actor_collaborations>
   3377 <http://dbpedia.org/resource/2010%E2%80%9311_UEFA_Europa_League_qualifying_phase_and_play-off_round>
   3342 <http://dbpedia.org/resource/2009%E2%80%9310_UEFA_Europa_League_qualifying_phase_and_play-off_round>
   3327 <http://dbpedia.org/resource/Index_of_World_War_II_articles_(U)>
   3321 <http://dbpedia.org/resource/List_of_moths_of_Madagascar>
   3295 <http://dbpedia.org/resource/List_of_country_houses_in_the_United_Kingdom>
   3277 <http://dbpedia.org/resource/List_of_counties_by_U.S._state>
   3273 <http://dbpedia.org/resource/List_of_licensed_and_localized_editions_of_Monopoly:_Europe>
   3255 <http://dbpedia.org/resource/List_of_moths_of_North_America_(MONA_8322-11233)>
   3254 <http://dbpedia.org/resource/List_of_local_administrative_units_of_Romania>
   3236 <http://dbpedia.org/resource/Catalog_of_paintings_in_the_National_Gallery,_London>
   3233 <http://dbpedia.org/resource/IUCN_Red_List_endangered_species_(Animalia)>
   3232 <http://dbpedia.org/resource/IUCN_Red_List_near_threatened_species_(Animalia)>
   3209 <http://dbpedia.org/resource/List_of_Chopped_episodes>
   3201 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Poland>
   3200 <http://dbpedia.org/resource/List_of_directorial_debuts>
   3192 <http://dbpedia.org/resource/List_of_postal_codes_in_Germany>
   3175 <http://dbpedia.org/resource/2010_in_film>
   3163 <http://dbpedia.org/resource/Index_of_philosophy_articles_(R%E2%80%93Z)>
   3156 <http://dbpedia.org/resource/List_of_bannered_U.S._Routes>
   3136 <http://dbpedia.org/resource/Timeline_of_Google_Street_View>
   3135 <http://dbpedia.org/resource/Index_of_Byzantine_Empire-related_articles>
   3129 <http://dbpedia.org/resource/Index_of_Singapore-related_articles>
   3114 <http://dbpedia.org/resource/List_of_postal_codes_of_Switzerland>
   3107 <http://dbpedia.org/resource/2009_in_film>
   3088 <http://dbpedia.org/resource/List_of_University_of_Pennsylvania_people>
   3071 <http://dbpedia.org/resource/List_of_children's_television_series_by_country>
   3065 <http://dbpedia.org/resource/List_of_populated_places_in_the_Netherlands>
   3044 <http://dbpedia.org/resource/List_of_ZX_Spectrum_games>
   3039 <http://dbpedia.org/resource/October_2011_in_sports>
   3022 <http://dbpedia.org/resource/List_of_flora_of_Ohio>
   3019 <http://dbpedia.org/resource/List_of_PlayStation_2_games>
   3016 <http://dbpedia.org/resource/List_of_Lepidoptera_of_Bulgaria>
   3005 <http://dbpedia.org/resource/List_of_voice_actors>

Observations:

The top subjects are clearly dominated by list-like resources. Very big “normal” articles such as those of countries like dbpedia:United_States (1375 occurrences as subject) or dbpedia:Germany (1331 occurrences as subject) can only be found from ranks 1518 and 1673 downwards, respectively. Scrolling through the top subject counts, the number of “List” vs. non-“List” resources seems to slowly equalize around 1000 occurrences (rank 3800+), but even among subjects that “only” occur ~500 times (rank 21000+) about 1/4 still seem to be “Lists”.
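Ranks like these can be read off a sorted tops file with a few lines of awk. A sketch, assuming the input is already sorted by descending count and that the node is a URI without spaces (so it is exactly field 2); rank_of and the sample data are made up:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 1-based rank (line number) of the node $1
# in a "count node" listing on stdin that is sorted by descending count.
rank_of() {
  awk -v node="$1" '$2 == node { print NR; exit }'
}

# Made-up, already sorted input.
printf '%s\n' \
  '     30 <http://example.org/list>' \
  '     12 <http://example.org/b>' \
  '      5 <http://example.org/c>' |
  rank_of '<http://example.org/b>'   # prints 2
```

For nodes with tied counts the rank depends on how sort broke the tie, so treat it as approximate.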

Top 100 Predicates:

dbpedia_2_predicate_counts_tops.txt.gz (384K)

149707899 <http://dbpedia.org/ontology/wikiPageWikiLink>
86391520 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
33958849 <http://www.w3.org/2002/07/owl#sameAs>
18731754 <http://purl.org/dc/terms/subject>
13926391 <http://www.w3.org/2000/01/rdf-schema#label>
13494896 <http://dbpedia.org/ontology/wikiPageRevisionID>
13494875 <http://www.w3.org/ns/prov#wasDerivedFrom>
13494819 <http://dbpedia.org/ontology/wikiPageID>
10948106 <http://dbpedia.org/ontology/wikiPageOutDegree>
10948106 <http://dbpedia.org/ontology/wikiPageLength>
10948086 <http://xmlns.com/foaf/0.1/primaryTopic>
10948086 <http://xmlns.com/foaf/0.1/isPrimaryTopicOf>
10948086 <http://purl.org/dc/elements/1.1/language>
7081593 <http://dbpedia.org/ontology/wikiPageExternalLink>
6473988 <http://dbpedia.org/ontology/wikiPageRedirects>
5926272 <http://dbpedia.org/ontology/abstract>
5925778 <http://www.w3.org/2000/01/rdf-schema#comment>
4267352 <http://xmlns.com/foaf/0.1/name>
4041585 <http://dbpedia.org/property/hasPhotoCollection>
3781737 <http://dbpedia.org/property/name>
2342002 <http://purl.org/dc/elements/1.1/rights>
2268299 <http://www.w3.org/2004/02/skos/core#broader>
2084717 <http://purl.org/dc/elements/1.1/description>
1514496 <http://dbpedia.org/ontology/team>
1374565 <http://xmlns.com/foaf/0.1/depiction>
1374185 <http://dbpedia.org/ontology/thumbnail>
1363398 <http://dbpedia.org/ontology/wikiPageDisambiguates>
1289141 <http://dbpedia.org/property/title>
1231780 <http://dbpedia.org/property/subdivisionType>
1171004 <http://xmlns.com/foaf/0.1/thumbnail>
1122598 <http://www.w3.org/2004/02/skos/core#prefLabel>
1080114 <http://xmlns.com/foaf/0.1/givenName>
1058532 <http://www.georss.org/georss/point>
1052578 <http://dbpedia.org/property/shortDescription>
1052115 <http://xmlns.com/foaf/0.1/surname>
1005079 <http://dbpedia.org/ontology/birthPlace>
 995639 <http://dbpedia.org/ontology/birthDate>
 983813 <http://dbpedia.org/property/subdivisionName>
 973597 <http://dbpedia.org/ontology/birthYear>
 968085 <http://dbpedia.org/property/dateOfBirth>
 907869 <http://www.w3.org/2003/01/geo/wgs84_pos#lat>
 906919 <http://www.w3.org/2003/01/geo/wgs84_pos#long>
 861765 <http://dbpedia.org/property/goals>
 846283 <http://dbpedia.org/property/placeOfBirth>
 846182 <http://dbpedia.org/ontology/isPartOf>
 838381 <http://dbpedia.org/property/birthPlace>
 826348 <http://dbpedia.org/property/years>
 656559 <http://dbpedia.org/property/length>
 653929 <http://dbpedia.org/property/date>
 649375 <http://xmlns.com/foaf/0.1/homepage>
 643162 <http://dbpedia.org/ontology/careerStation>
 641528 <http://dbpedia.org/ontology/years>
 574296 <http://dbpedia.org/property/birthDate>
 556627 <http://dbpedia.org/property/genre>
 553122 <http://dbpedia.org/ontology/country>
 539366 <http://dbpedia.org/property/clubs>
 529649 <http://dbpedia.org/property/location>
 525787 <http://dbpedia.org/property/rd1Team>
 512507 <http://dbpedia.org/ontology/numberOfGoals>
 501875 <http://dbpedia.org/ontology/genre>
 492028 <http://dbpedia.org/ontology/numberOfMatches>
 453911 <http://dbpedia.org/ontology/deathDate>
 449759 <http://dbpedia.org/ontology/deathYear>
 448696 <http://dbpedia.org/property/dateOfDeath>
 448362 <http://www.w3.org/2002/07/owl#equivalentClass>
 446799 <http://dbpedia.org/property/caption>
 446238 <http://www.w3.org/2000/01/rdf-schema#subClassOf>
 440121 <http://dbpedia.org/property/votes>
 437797 <http://dbpedia.org/property/wordnet_type>
 435648 <http://dbpedia.org/property/type>
 431262 <http://dbpedia.org/property/caps>
 418391 <http://dbpedia.org/ontology/utcOffset>
 362378 <http://dbpedia.org/property/percentage>
 362327 <http://dbpedia.org/ontology/type>
 355814 <http://dbpedia.org/property/country>
 346584 <http://dbpedia.org/property/candidate>
 340788 <http://dbpedia.org/property/starring>
 338307 <http://dbpedia.org/ontology/location>
 327879 <http://dbpedia.org/ontology/family>
 326730 <http://dbpedia.org/property/longew>
 326699 <http://dbpedia.org/property/latns>
 315041 <http://dbpedia.org/property/writer>
 314566 <http://dbpedia.org/ontology/starring>
 312958 <http://dbpedia.org/property/label>
 310020 <http://dbpedia.org/property/rd2Team>
 306816 <http://dbpedia.org/property/settlementType>
 306271 <http://dbpedia.org/property/longd>
 306246 <http://dbpedia.org/property/latd>
 306195 <http://dbpedia.org/ontology/populationTotal>
 293237 <http://dbpedia.org/property/team>
 283282 <http://dbpedia.org/property/producer>
 279814 <http://dbpedia.org/ontology/occupation>
 278108 <http://dbpedia.org/ontology/order>
 277006 <http://dbpedia.org/property/episodenumber>
 275430 <http://dbpedia.org/property/longm>
 275416 <http://dbpedia.org/property/latm>
 272562 <http://dbpedia.org/ontology/deathPlace>
 267256 <http://dbpedia.org/ontology/class>
 265454 <http://dbpedia.org/property/timezone>
 264081 <http://dbpedia.org/ontology/viafId>

Observations:

The predicates are clearly dominated by dbpedia-owl:wikiPageWikiLink and rdf:type relations.
What’s a bit surprising to me is that dcterms:subject occurs less often than rdf:type, but my guess is that this is due to YAGO and also hierarchy materialization (an Athlete is also a Person). There’s a slight mismatch between the counts of dbpedia-owl:wikiPageRevisionID and prov:wasDerivedFrom. There are more dbpedia-owl:abstracts than rdfs:comments and more geo:lats than geo:longs.

Top 100 Objects:

dbpedia_3_object_counts_o9_tops.txt.gz (33M)

10948086 <http://xmlns.com/foaf/0.1/Document> .
10948086 "en"^^<http://www.w3.org/2001/XMLSchema#string> .
6239553 "1"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
2250659 <http://dbpedia.org/class/yago/PhysicalEntity100001930> .
2169386 <http://dbpedia.org/class/yago/Object100002684> .
2155200 <http://www.w3.org/2002/07/owl#Thing> .
1974654 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Agent> .
1974654 <http://dbpedia.org/ontology/Agent> .
1816213 <http://dbpedia.org/class/yago/YagoLegalActorGeo> .
1650316 <http://xmlns.com/foaf/0.1/Person> .
1649647 <http://wikidata.dbpedia.org/resource/Q5> .
1649647 <http://wikidata.dbpedia.org/resource/Q215627> .
1649647 <http://schema.org/Person> .
1649646 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#NaturalPerson> .
1649646 <http://dbpedia.org/ontology/Person> .
1621660 <http://dbpedia.org/class/yago/Whole100003553> .
1318799 <http://dbpedia.org/resource/Category:Living_people> .
1290718 <http://dbpedia.org/class/yago/YagoLegalActor> .
1257968 <http://dbpedia.org/class/yago/YagoPermanentlyLocatedEntity> .
1192248 <http://www.w3.org/2004/02/skos/core#Concept> .
1090313 <http://dbpedia.org/class/yago/LivingThing100004258> .
1090140 <http://dbpedia.org/class/yago/Organism100004475> .
1046726 <http://dbpedia.org/class/yago/Person100007846> .
1020287 <http://dbpedia.org/class/yago/CausalAgent100007347> .
 868376 <http://dbpedia.org/resource/United_States> .
 816854 <http://www.ontologydesignpatterns.org/ont/d0.owl#Location> .
 816837 <http://schema.org/Place> .
 816837 <http://dbpedia.org/ontology/Wikidata:Q532> .
 816837 <http://dbpedia.org/ontology/Place> .
 814269 <http://dbpedia.org/class/yago/YagoGeoEntity> .
 726965 <http://dbpedia.org/class/yago/Abstraction100002137> .
 658562 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Situation> .
 643162 <http://dbpedia.org/ontology/CareerStation> .
 561841 <http://dbpedia.org/class/yago/LivingPeople> .
 547827 <http://dbpedia.org/ontology/PopulatedPlace> .
 547037 "0"^^<http://www.w3.org/2001/XMLSchema#integer> .
 539993 "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
 531929 <http://www.opengis.net/gml/_Feature> .
 528794 <http://dbpedia.org/class/yago/Artifact100021939> .
 526256 <http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing> .
 524742 <http://dbpedia.org/class/yago/Location100027167> .
 505425 <http://dbpedia.org/class/yago/Region108630985> .
 476724 "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
 469006 <http://dbpedia.org/ontology/Settlement> .
 438713 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#InformationEntity> .
 425044 <http://schema.org/CreativeWork> .
 425044 <http://dbpedia.org/ontology/Work> .
 419234 <http://dbpedia.org/resource/List_of_sovereign_states> .
 401287 "N"@en .
 400317 <http://dbpedia.org/resource/Animal> .
 377252 "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
 358891 <http://dbpedia.org/class/yago/GeographicalArea108574314> .
 350209 <http://dbpedia.org/class/yago/District108552138> .
 347718 <http://dbpedia.org/class/yago/Group100031264> .
 336091 <http://dbpedia.org/ontology/Athlete> .
 335320 "4"^^<http://www.w3.org/2001/XMLSchema#integer> .
 321614 <http://dbpedia.org/class/yago/AdministrativeDistrict108491826> .
 313062 <http://dbpedia.org/class/yago/SocialGroup107950920> .
 302658 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#SocialPerson> .
 302658 <http://schema.org/Organization> .
 302658 <http://dbpedia.org/ontology/Organisation> .
 292395 "28"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 288074 "29"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 287957 "27"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 286256 "5"^^<http://www.w3.org/2001/XMLSchema#integer> .
 283702 <http://dbpedia.org/class/yago/Organization108008335> .
 279633 "yes"@en .
 279134 <http://dbpedia.org/ontology/SportsTeamMember> .
 279134 <http://dbpedia.org/ontology/OrganisationMember> .
 277439 "30"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 277025 "26"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 276000 "6"^^<http://www.w3.org/2001/XMLSchema#integer> .
 264578 "31"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 263773 <http://dbpedia.org/class/yago/Contestant109613191> .
 261435 <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Organism> .
 261435 <http://dbpedia.org/ontology/Species> .
 260007 "E"@en .
 256474 <http://dbpedia.org/ontology/Eukaryote> .
 255993 "25"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 252172 <http://dbpedia.org/resource/England> .
 249675 "32"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 240706 <http://dbpedia.org/resource/Iran_Standard_Time> .
 234043 "24"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 231236 "33"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 228539 "7"^^<http://www.w3.org/2001/XMLSchema#integer> .
 221419 "0".
 219180 <http://dbpedia.org/class/yago/Player110439851> .
 218573 "23"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 218297 <http://dbpedia.org/class/yago/PsychologicalFeature100023100> .
 217297 "34"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 215320 <http://dbpedia.org/class/yago/Athlete109820263> .
 214359 "18"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 208654 <http://dbpedia.org/class/yago/Tract108673395> .
 208383 <http://dbpedia.org/resource/Arthropod> .
 206076 "22"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 205888 "8"^^<http://www.w3.org/2001/XMLSchema#integer> .
 204693 <http://dbpedia.org/resource/Lepidoptera> .
 204220 "21"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
 203472 <http://dbpedia.org/class/yago/Instrumentality103575240> .
 202270 "35"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .

Observations:

The object counts are dominated at the top, with an order of magnitude difference, by foaf:Document and “en”. The non-negative “1” follows an order of magnitude ahead of the normal “0” and “1” ;) In between a lot of very useful types follow, and we can see that we have a lot of information about physical things, people, concepts and places. It’s also nice to see http://wikidata.dbpedia.org/resource/Q5 right under foaf:Person, even though the URI doesn’t resolve anymore(?) :(

The first “real” “A-Box” resource is dbpedia:United_States, followed by dbpedia:Animal, dbpedia:England, dbpedia:Iran_Standard_Time, dbpedia:Arthropod, dbpedia:Lepidoptera, dbpedia:Canada, dbpedia:Insect, dbpedia:France, dbpedia:United_Kingdom, dbpedia:India, dbpedia:Germany. In general it seems as if apart from ontology types many instances of types country, biological genus and city occur very often as objects.
The top literals seem to be numbers, especially years and single letters.

Conclusion

We’ve seen that it’s sadly not possible to get basic top-degree-counts for big datasets via SPARQL, as the endpoints don’t seem to be optimized for this kind of query. I hope this changes in the future as it’s quite useful to know degree distributions for all kinds of queries. Especially in the machine learning sector it seems quite essential to know if you’re dealing with a “normal” node or one of the exceptional top nodes that is several orders of magnitude bigger than the rest.
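As a pointer for anyone who wants the actual degree distribution rather than just the tops: the count files already contain everything needed. The following is only a sketch of one way to do it, not something I ran on the full data; degree_histogram is a made-up name and the input is made up as well:

```shell
#!/usr/bin/env bash
export LC_ALL=C

# Hypothetical helper: turn "count node" lines into "degree number_of_nodes"
# lines, i.e. how many nodes occur exactly that often.
degree_histogram() {
  awk '{ print $1 }' | sort -n | uniq -c | awk '{ print $2, $1 }'
}

# Made-up input in uniq -c shape.
printf '%s\n' \
  '      1 <http://example.org/a>' \
  '      1 <http://example.org/b>' \
  '      3 <http://example.org/c>' \
  '      1 <http://example.org/d>' |
  degree_histogram
# prints:
# 1 3
# 3 1
```

Note that this needs the unfiltered count files: the _o9 variants have already dropped everything that occurs fewer than 10 times.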

Hope you enjoyed. Feedback welcome, as always.

January 24, 2015

When trying to migrate from Windows 7 to 8.1 I learned that I would lose my installed applications. This was especially frustrating because I knew from the upgrade assistant that there is an option to keep them – it is just not available for my configuration. And even the official guide states that I would have to reinstall all apps.

The solution was: First migrate to Windows 8 (the option to keep your apps is available here) and then to 8.1 (which again allows you to keep your apps). My system is up and running and I did not have to reinstall anything. Voilà! The only question which remains is: why?

January 18, 2015

DATE FORMATS The GIT_AUTHOR_DATE, GIT_COMMITTER_DATE environment variables and the --date option support the following date formats: Git internal [...]

January 14, 2015

Create the file /usr/local/bin/error-check with the bash script below. #!/bin/bash LOGFILE=/var/log/apache2/error.log EMAIL=you@example.com errors=$(grep "PHP Fatal error" "$LOGFILE" | wc -l) warnings=$(grep "PHP Warning" "$LOGFILE" | wc -l) content=$(cat $LOGFILE) if [ ! -z "$errors" ] then mail -s "There [...]

January 11, 2015

Romain Schmitz

Deep insight of xdg-open

In a previous post I explained how to set the default browser with xdg-open but it’s not limited to this kind of application. To name a few examples where xdg-open is [...]

December 20, 2014

Weblog der Fachschaft Informatik

Fachschaft workshop

On January 10 of the new year, a Fachschaft workshop will take place.
We will meet at 10:00 in the Fachschaft Informatik (48-462).
There we will discuss how we can improve our work for the Fachschaft.

Topics so far:
- Events office and the introduction of regular events
- Krokurier and the Fachschaft newspaper
- General matters concerning the public relations office

You can collect further ideas in this pad.

In addition, we will start right away on the projects that come out of the discussion.
If you have great ideas or want to join the discussion, you are very welcome. Just drop by.
(We will of course gladly open the door for you on Saturday if you don't have a transponder.
Just call the Fachschaft at 0631/205-2553.)

December 17, 2014

Composer { ... "config": { ... "github-protocols": [ "https" ], [...]

December 08, 2014

Weblog der Fachschaft Informatik

General assembly of the Fachschaft and Christmas party

This coming Wednesday, 10 December 2014, this semester's general assembly will take place in seminar room 48-462. All students and interested members of the department are cordially invited!

The agenda:
- Accountability report of the FSR
- Information on the FBR, StuPa and Senate elections
- Introduction of the student FBR candidates
- Election of the cash auditors
- Handing out of sweets

The Christmas party starts directly afterwards, and the entire department is cordially invited! There will be mulled wine, coffee and tea, as well as homemade cookies. From 6 pm there will be something warm to eat.

Anyone who likes is welcome to bring a salad. Please sign up on the salad list for this.

There will also be a Schrottwichteln again:
We offer you the chance to take part in Schrottwichteln (a junk gift exchange): all you have to do is put your gift (0 to 5 €) into the box in the Fachschaft before the party and write your name on the list. After the meal, the gifts will be handed out to those who are on the list.

December 03, 2014

Weblog der Fachschaft Informatik

The VLU is running again

As you may have already noticed, the VLU has been running again since today. Anyone with an SCI account should already have received the participation token by e-mail; everyone else can register here with their university e-mail address and automatically receive a token. Manual registration in the SCI, as in previous years, is no longer necessary.

The VLU lets you rate your lectures anonymously and give feedback. We therefore ask all of you to take part and to motivate others to do the same. As additional motivation, prizes will be raffled among all participants as every year, e.g. a tablet PC sponsored by the FIT. You can take part at the following link: https://vlu.cs.uni-kl.de/

Do you like the Fachschaft's work on the VLU and want to support us? Or do you want to improve something? Our VLU team is looking for reinforcements! If you are interested, just contact our PR officer at pr@fachschaft.cs.uni-kl.de

November 17, 2014

Weblog der Fachschaft Informatik

Hitchhiker work meeting on Saturday

On Saturday, 22 November, starting at 10 am, a Hitchhiker work meeting will take place in the Fachschaft. The good old Hitchhiker is to be brought up to date for 2014 by incorporating the latest changes to the degree program and to spelling.

Interested in joining in? Then come by.

Weblog der Fachschaft Informatik

We are looking for you!

The Fachschaft needs you as

  1. Cash auditor (m, f)
  2. Student FBR member (m, f)

The Fachschaft wiki describes the job of the cash auditors:

The cash audit is carried out by two elected cash auditors before the general assembly at which the Fachschaft council is newly elected. The cash auditors are elected at the preceding general assembly and must not be members of the Fachschaft council [...]. In particular, the following should be audited:

  • Use of funds: Were the expenses sensible, and do they comply with the financial regulations?
  • Expenses: Do the (larger) expenses correspond (on a spot-check basis) to the funding requests approved by the Fachschaft council (or are they below the exemption limit)?
  • Bookkeeping: Are the cash book, the collection of receipts, the account statements, … kept cleanly, in order, and findable even without the treasurer present? Has the cash book been fixed in a non-modifiable way (by signatures/handwriting) and is it protected against subsequent changes?

To conclude the cash audit, the current financial status of all cash boxes must be recorded in the cash book, fixed by the signatures of the cash auditors (with date), and reported at the general assembly together with an overview of the questions above.

The department council (Fachbereichsrat, FBR) consists of three research staff members, one non-research staff member and nine professors, as well as four student members. The latter will be newly elected at the end of January 2015, and we are looking for candidates for this.

According to the department website, the FBR deliberates and decides on matters of the department. As a student member, your particular task is to keep an eye on study matters: module handbook, degree programs, long-term planning, etc. What this looks like in practice is hard to put into words, so just talk to the current FBR representatives or come along to the next FBR meeting. The meetings are open to the department, except for the last part, which deals with personnel matters.

If you want to do this, get in touch by 7 December 2014 with an e-mail to kasse.

November 16, 2014

Python-like generator functions, implemented as a library.

This is possible through two simple tricks:

  1. The language represents stack frames as objects on the heap.
  2. There's a native function for grabbing the current stack frame.

Turns out replacing the calling stack frame is really the only thing you need in order to implement coroutines and generators.

The implementation is only 23 lines long. On my screen, this weblog article, including the screenshot, is already longer than that up to this point.

November 13, 2014

I’ve been using powerline-shell for quite a while and like it a lot. I notice that every time I use a terminal which does not tell me which branch I’m on. A few days ago I stumbled upon promptline.vim and, as I’m also using vim-airline, I gave it a try. Promptline.vim exports a shell […]

November 10, 2014

So you’re the guy who is allowed to set up a local DBpedia mirror or more generally a local Linked Data mirror for your work group? OK, today is your lucky day and you’re in the right place. I hope you’ll be able to benefit from my many hours of trial and error. If anything goes wrong (or everything works fine), feel free to leave a comment below.

Versions of this guide

There are three older versions of this guide:

  • Oct. 2010: The first version focusing on DBpedia 3.5 – 3.6 and Virtuoso 6.1
  • May 2012: A bigger update to DBpedia 3.7 (new local language versions) and Virtuoso 6.1.5+ (with a lot of updates making pre-processing of the dumps easier)
  • Apr. 2014: Update to DBpedia 3.9 and Virtuoso 7

In this step by step guide I’ll tell you how to install a local Linked Data mirror of DBpedia 2014, hosting a combination of the regular English and (as an example) the i18n German datasets, adding up to over half a billion triples. If this isn’t enough you can also follow the links to the Freebase, DBLP, Yago, Umbel and Schema.org datasets / vocabularies, adding up to over 3.5 billion triples.

Let’s jump in.

Used Versions

  • DBpedia 2014
  • Virtuoso OpenSource 7.1.0
  • Ubuntu 14.04 LTS

Prerequisites

A strong machine with root access and enough RAM: We used a VM with 4 cores and 32 GB of RAM for DBpedia only. If you intend to also load Freebase and other datasets, I recommend at least 64 GB of RAM (we actually ended up using a 16 core, 256 GB RAM server). For installing, I recommend more than 128 GB of free disk space for DBpedia alone and 256 GB if you want to load Freebase as well. The space is needed for downloading and repacking the datasets, as well as for the growing database file when importing (mine grew to 50 GB for DBpedia and 180 GB with Freebase).

Let’s go

Download and install virtuoso

Go and download Virtuoso OpenSource from http://sourceforge.net/projects/virtuoso/ (make sure you get v7.1.0 as used in this guide, or a newer version).

Put the file in your home dir on the server, then extract it and switch to the directory:

cd ~
tar -xvzf virtuoso-7.1.0.tar.gz
cd virtuoso-opensource-7.1.0 # or newer, depending what you got

Now do the following to install the prerequisites and then build virtuoso:

sudo aptitude install libxml2-dev libssl-dev autoconf libgraphviz-dev \
     libmagickcore-dev libmagickwand-dev dnsutils gawk bison flex gperf

# NOTICE: the following will _not_ install into /usr/local but into /usr
# (so might clash with packages by your distribution if you install
# "the" virtuoso package)
# You'll find the db in /var/lib/virtuoso/db !
# check output for errors and FIX THEM! (e.g., install missing packages)
export CFLAGS="-O2 -m64"
./configure --with-layout=debian --enable-dbpedia-vad --enable-rdfmappers-vad

# the following will build with 5 processes in parallel
# choose something like your server's #CPUs + 1
make -j5

This will take about 5 min

sudo make install
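
The "#CPUs + 1" rule of thumb from the comment above does not have to be hard-coded. The following is only a minimal sketch: the helper name make_jobs is made up for this example, and it assumes that getconf knows _NPROCESSORS_ONLN (which is the case on a typical Linux system, but not guaranteed everywhere).

```shell
# make_jobs: print the number of online CPUs + 1 (hypothetical helper name).
# Falls back to assuming 1 CPU if getconf cannot report a usable number.
make_jobs() {
  cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cpus=1
  case "$cpus" in
    ''|*[!0-9]*) cpus=1 ;;
  esac
  echo $((cpus + 1))
}

# usage (instead of the fixed "make -j5"):
#   make -j"$(make_jobs)"
```

On the 4 core VM mentioned in the prerequisites this should end up with the same -j5 as above.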

Now change the following values in /var/lib/virtuoso/db/virtuoso.ini. The performance tuning settings follow http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning:

# note: virtuoso ignores lines starting with whitespace and stuff after a ;
[Parameters]
# you need to include the directory where your datasets will be downloaded
# to, in our case /usr/local/data/datasets:
DirsAllowed = ., /usr/share/virtuoso/vad, /usr/local/data/datasets
# IMPORTANT: for performance also do this
[Parameters]
# the following two are as suggested by comments in the original .ini
# file in order to use the RAM on your server:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
# each buffer caches an 8K page of data and occupies approx. 8700 bytes of
# memory. It's suggested to set this value to 65 % of ram for a db only server
# so if you have 32 GB of ram: 32*1000^3*0.65/8700 = 2390804
# default is 2000 which will use 16 MB ram ;)
# Make sure to remove whitespace if you uncomment existing lines!
[Database]
MaxCheckpointRemap = 625000
# set this to 1/4th of NumberOfBuffers
[SPARQL]
# I like to increase the ResultSetMaxrows, MaxQueryCostEstimationTime
# and MaxQueryExecutionTime drastically as it's a local store where we
# do quite complex queries... up to you (don't do this if a lot of people
# use it).
# In any case for the importer to be more robust add the following setting
# to this section:
ShortenLongURIs = 1
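
The arithmetic from the comments above (65 % of RAM divided by roughly 8700 bytes per buffer) can be scripted for other RAM sizes. This is just a sketch of that formula with made-up function names. Note that the concrete numbers in the listing above are not the output of this formula, so treat the results as a starting point and compare them with the suggestions in your own virtuoso.ini.

```shell
# number_of_buffers RAM_GB: 65 % of RAM (1 GB = 1000^3 bytes) divided by the
# approx. 8700 bytes each buffer occupies, as in the ini comment above.
number_of_buffers() {
  echo $(( $1 * 1000 * 1000 * 1000 * 65 / 100 / 8700 ))
}

# max_checkpoint_remap RAM_GB: a quarter of the value above, following the
# "1/4th of NumberOfBuffers" comment.
max_checkpoint_remap() {
  echo $(( $(number_of_buffers "$1") / 4 ))
}

# usage:
#   number_of_buffers 32      # the 32 GB example from the comment
#   max_checkpoint_remap 32
```

For 32 GB the first function reproduces the 2390804 from the comment (integer division simply drops the fraction).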

The next step installs an init-script (autostart) and starts the virtuoso server. (If you’ve changed directories to edit /var/lib/virtuoso/db/virtuoso.ini, go back to the virtuoso source dir!):

sudo cp debian/init.d /etc/init.d/virtuoso-opensource &&
sudo chmod a+x /etc/init.d/virtuoso-opensource &&
sudo bash debian/virtuoso-opensource.postinst.debhelper

You should now have a running virtuoso server.

DBpedia URIs (en) vs. DBpedia IRIs (i18n)

DBpedia 2014 consists of several datasets: one “standard” English version and several localized versions for other languages (i18n). The standard version mints URIs by going through all English Wikipedia articles. For all of these the Wikipedia cross-language links are used to extract corresponding labels in other languages for the en URIs (e.g., de/labels_en_uris_de.nt.bz2). This is problematic as for example articles which are only in the German Wikipedia won’t be extracted. To solve this problem the i18n versions exist and create IRIs in the form of de.dbpedia.org for every article in the German Wikipedia (e.g., de/labels_de.nt.bz2).

This approach has several implications. For backwards compatibility reasons the standard DBpedia makes statements about URIs such as http://dbpedia.org/resource/Gerhard_Schr%C3%B6der while the local chapters, like the German one, make statements about IRIs such as http://de.dbpedia.org/resource/Gerhard_Schröder (note the ö). In other words and as written above: the standard DBpedia uses URIs to identify things, while the localized versions use IRIs. This also means that http://dbpedia.org/resource/Gerhard_Schröder shouldn’t work. That said, clicking the link will actually work, as there is magic going on in your browser to give you what you probably meant. Using curl (curl -i -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Gerhard_Schröder) or SPARQLing the endpoint will nevertheless not be so nice/sloppy and can cause quite some headache: compare select * where { dbpedia:Gerhard_Schröder ?p ?o. } with select * where { <http://dbpedia.org/resource/Gerhard_Schr%C3%B6der> ?p ?o. }. In order to mitigate this historic problem a bit, DBpedia actually offers owl:sameAs links from IRIs to URIs (en/iri_same_as_uri_en), which you should load, so you at least have a link to what you want if someone tries to get info about an IRI.
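
To make the difference between the two forms a bit more tangible, here is a small sketch that percent-encodes the UTF-8 bytes of an IRI. The function name iri_to_uri is made up, the input is assumed to be UTF-8 without existing %XX escapes, and DBpedia's own encoding rules may leave further ASCII characters unescaped, so take it as an illustration and not as the extraction framework's method.

```shell
# iri_to_uri IRI: percent-encode every byte that is not an unreserved ASCII
# character (letters, digits, "-", ".", "_", "~") or one of ":" and "/".
# Hypothetical helper; assumes UTF-8 input without existing %XX escapes.
iri_to_uri() {
  printf '%s' "$1" | LC_ALL=C od -An -v -tu1 | tr -s ' ' '\n' |
    LC_ALL=C awk '
      NF {
        b = $1 + 0
        if ((b >= 48 && b <= 57) || (b >= 65 && b <= 90) ||
            (b >= 97 && b <= 122) || b == 45 || b == 46 || b == 95 ||
            b == 126 || b == 47 || b == 58)
          printf "%c", b      # keep the byte as it is
        else
          printf "%%%02X", b  # emit it as %XX
      }
      END { print "" }'
}

# usage:
#   iri_to_uri 'http://dbpedia.org/resource/Gerhard_Schröder'
```

For the example above the ö is stored as the two bytes C3 B6, which is where the %C3%B6 in the URI form comes from.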

As the standard DBpedia provides labels, abstracts and a couple other things in several languages, there are two types of files in the localized DBpedia folders: There are triples directly associating the English URIs with for example the German labels (de/labels_en_uris_de) and there are the localized triple files which associate for example the DE IRIs with the German labels (de/labels_de).

Downloading the DBpedia dump files & Repacking

For our group we decided that we wanted a reasonably complete mirror of the standard DBpedia (EN) (have a look at datasets loaded into the public DBpedia SPARQL Endpoint), but also the i18n versions for the German DBpedia loaded in separate graphs, as well as each of their pagelink datasets in another separate graph. For this we download the corresponding files in (NT) format as follows. If you need something different do so (and maybe report back if there were problems and how you solved them).

Another hint: Virtuoso can only import plain (uncompressed) or gzipped files, but the DBpedia dumps are bzipped, so you either repack them into gzip format or extract them. On our server the import was noticeably slower from extracted files than from gzipped ones (not to mention the vast amount of disk space wasted by the extracted files). File access becomes a bottleneck if you have a couple of cores idling. This is why I decided on repacking all the files from bz2 to gz. As you can see, I do the repacking per folder in parallel; if that’s not suitable for you, feel free to change it. You might also want to change this if you want to do it in parallel to downloading. The repacking process below took about 1 hour but was worth it in the end. The more CPUs you have, the more you can parallelize this process.

# see comment above, you could also get the all_language.tar or another DBpedia version...
mkdir -p /usr/local/data/datasets/dbpedia/2014
cd /usr/local/data/datasets/dbpedia/2014
wget -r -nc -nH --cut-dirs=1 -np -l1 -A '*.nt.bz2' -A '*.owl' -R '*unredirected*' http://downloads.dbpedia.org/2014/{en/,de/,links/,dbpedia_2014.owl}

# if you want to save space do this:
for d in */ ; do for i in "${d%/}"/*.bz2 ; do bzcat "$i" | gzip > "${i%.bz2}.gz" && rm "$i" ; done & done
# else do:
#bunzip2 */*.bz2 &

# notice that the extraction (and repacking) of *.bz2 takes quite a while (about 1 hour)
# gzipped data is reasonably packed, but still very fast to access (in contrast to bz2), so maybe this is the best choice.
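
If you are worried about deleting a .bz2 file after a repack that went wrong, a more defensive variant of the loop body could look like the following sketch. The function name repack_bz2 is made up, it works on one file at a time (no parallelism), and the extra integrity checks cost additional time.

```shell
# repack_bz2 FILE.bz2: recompress to FILE.gz, verify both archives, and only
# then delete the original (hypothetical helper, processes a single file).
repack_bz2() {
  src=$1
  dst="${src%.bz2}.gz"
  # bzip2 -t checks the source first, because the exit status of a pipeline
  # only reflects its last command (gzip), not the decompression.
  if bzip2 -t "$src" &&
     bzip2 -dc "$src" | gzip > "$dst.tmp" &&
     gzip -t < "$dst.tmp"; then
    mv "$dst.tmp" "$dst" && rm "$src"
  else
    rm -f "$dst.tmp"
    echo "repacking failed, keeping $src" >&2
    return 1
  fi
}

# usage:
#   for i in en/*.bz2 ; do repack_bz2 "$i" ; done
```

Writing to a temporary name first means an interrupted run never leaves a half-written .gz file that looks finished.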

Data Cleaning and The bulk loader scripts

In contrast to the previous versions of this article the virtuoso import will take care of shortening too long IRIs itself. Also it seems the bulk loader script is included in the more recent Virtuoso versions, so as a reference only: see the old version for the cleaning script and VirtBulkRDFLoaderExampleDbpedia and
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript
for info about the bulk loader scripts.

Importing DBpedia dumps into virtuoso

Now AFTER the re-/unpacking of the DBpedia dumps we will register all files in the dbpedia dir (recursively ld_dir_all) to be added to the dbpedia graph. If you use this method make sure that only files reside in the given subtree that you really want to import.
Also don’t forget to import the dbpedia_2014.owl file (first step in the script below)!
If you only want one directory’s files to be added (non recursive) use ld_dir('dir', '*.*', 'graph');.
If you manually want to add some files, use ld_add('file', 'graph');.
See the VirtBulkRDFLoaderScript file for details.

Be warned that it might be a bad idea to import the normal and i18n dataset into one graph if you didn’t select specific languages, as it might introduce a lot of duplicates.

In order to keep track (and easily reproduce) what was selected and imported into which graph, I actually link (ln -s) the repacked files into a directory structure beneath /usr/local/data/datasets/dbpedia/2014/importedGraphs/ and import from there instead. To make sure you think about this, I use that path below, so it won’t work if you didn’t pay attention. If you really want to import all downloaded files, just import /usr/local/data/datasets/dbpedia/2014/.

Also be aware that if you load certain parts of the dumps into different graphs (as I did with the pagelinks, as well as the i18n versions of the DE and FR datasets), only triples from the http://dbpedia.org graph will be shown when you visit the local pages with your browser (SPARQL is unaffected by this)!

So if you want to load the same datasets as loaded on the official endpoint (but restricted to the EN and DE ones), the following should do the trick to link them up for the next steps:

cd /usr/local/data/datasets/dbpedia/2014/
mkdir importedGraphs
cd importedGraphs

mkdir dbpedia.org
cd dbpedia.org
# ln -s ../../dbpedia_2014.owl ./ # see below!
ln -s ../../links/* ./

ln -s ../../en/article_categories_en.nt.gz ./
ln -s ../../en/category_labels_en.nt.gz ./
ln -s ../../en/disambiguations_en.nt.gz ./
ln -s ../../en/external_links_en.nt.gz ./
ln -s ../../en/freebase_links_en.nt.gz ./
ln -s ../../en/geo_coordinates_en.nt.gz ./
ln -s ../../en/geonames_links_en_en.nt.gz ./
ln -s ../../en/homepages_en.nt.gz ./
ln -s ../../en/images_en.nt.gz ./
ln -s ../../en/infobox_properties_en.nt.gz ./
ln -s ../../en/infobox_property_definitions_en.nt.gz ./
ln -s ../../en/instance_types_en.nt.gz ./
ln -s ../../en/instance_types_heuristic_en.nt.gz ./
ln -s ../../en/interlanguage_links_chapters_en.nt.gz ./
ln -s ../../en/iri_same_as_uri_en.nt.gz ./
ln -s ../../en/labels_en.nt.gz ./
ln -s ../../en/long_abstracts_en.nt.gz ./
ln -s ../../en/mappingbased_properties_cleaned_en.nt.gz ./
ln -s ../../en/page_ids_en.nt.gz ./
ln -s ../../en/persondata_en.nt.gz ./
ln -s ../../en/redirects_transitive_en.nt.gz ./
ln -s ../../en/revision_ids_en.nt.gz ./
ln -s ../../en/revision_uris_en.nt.gz ./
ln -s ../../en/short_abstracts_en.nt.gz ./
ln -s ../../en/skos_categories_en.nt.gz ./
ln -s ../../en/specific_mappingbased_properties_en.nt.gz ./
ln -s ../../en/wikipedia_links_en.nt.gz ./

ln -s ../../de/labels_en_uris_de.nt.gz ./
ln -s ../../de/long_abstracts_en_uris_de.nt.gz ./
ln -s ../../de/short_abstracts_en_uris_de.nt.gz ./

# note: the fr files below only exist if you also added fr/ to the wget call above
ln -s ../../fr/labels_en_uris_fr.nt.gz ./
ln -s ../../fr/long_abstracts_en_uris_fr.nt.gz ./
ln -s ../../fr/short_abstracts_en_uris_fr.nt.gz ./
cd ..


mkdir ext.dbpedia.org
cd ext.dbpedia.org
ln -s ../../en/genders_en.nt.gz ./
ln -s ../../en/out_degree_en.nt.gz ./
ln -s ../../en/page_length_en.nt.gz ./
cd ..

mkdir pagelinks.dbpedia.org
cd pagelinks.dbpedia.org
ln -s ../../en/page_links_en.nt.gz ./
cd ..

mkdir topicalconcepts.dbpedia.org
cd topicalconcepts.dbpedia.org
ln -s ../../en/topical_concepts_en.nt.gz ./
cd ..


mkdir de.dbpedia.org
cd de.dbpedia.org
ln -s ../../de/article_categories_de.nt.gz ./
ln -s ../../de/category_labels_de.nt.gz ./
ln -s ../../de/disambiguations_de.nt.gz ./
ln -s ../../de/external_links_de.nt.gz ./
ln -s ../../de/freebase_links_de.nt.gz ./
ln -s ../../de/geo_coordinates_de.nt.gz ./
ln -s ../../de/homepages_de.nt.gz ./
ln -s ../../de/images_de.nt.gz ./
ln -s ../../de/infobox_properties_de.nt.gz ./
ln -s ../../de/infobox_property_definitions_de.nt.gz ./
ln -s ../../de/instance_types_de.nt.gz ./
ln -s ../../de/interlanguage_links_chapters_de.nt.gz ./
ln -s ../../de/iri_same_as_uri_de.nt.gz ./
ln -s ../../de/labels_de.nt.gz ./
ln -s ../../de/long_abstracts_de.nt.gz ./
ln -s ../../de/mappingbased_properties_de.nt.gz ./
ln -s ../../de/out_degree_de.nt.gz ./
ln -s ../../de/page_ids_de.nt.gz ./
ln -s ../../de/page_length_de.nt.gz ./
ln -s ../../de/persondata_de.nt.gz ./
ln -s ../../de/pnd_de.nt.gz ./
ln -s ../../de/redirects_transitive_de.nt.gz ./
ln -s ../../de/revision_ids_de.nt.gz ./
ln -s ../../de/revision_uris_de.nt.gz ./
ln -s ../../de/short_abstracts_de.nt.gz ./
ln -s ../../de/skos_categories_de.nt.gz ./
ln -s ../../de/specific_mappingbased_properties_de.nt.gz ./
ln -s ../../de/wikipedia_links_de.nt.gz ./
cd ..

mkdir pagelinks.de.dbpedia.org
cd pagelinks.de.dbpedia.org
ln -s ../../de/page_links_de.nt.gz ./
cd ..
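
Since ln -s happily creates links to files that don't exist, a typo or a missing download in the list above goes unnoticed until the import. The following sketch wraps the linking in a small function that reports missing files. The name link_graph is made up, and unlike the listing above it creates absolute instead of relative links.

```shell
# link_graph GRAPH_DIR FILE...: create GRAPH_DIR and symlink every FILE that
# exists into it; missing files are reported on stderr instead of leaving
# dangling links behind. Returns non-zero if any file was missing.
link_graph() {
  graph_dir=$1
  shift
  mkdir -p "$graph_dir" || return 1
  missing=0
  for f in "$@"; do
    if [ -e "$f" ]; then
      # use an absolute target so the link resolves from any directory
      abs="$(cd "$(dirname "$f")" && pwd)/$(basename "$f")"
      ln -sf "$abs" "$graph_dir/"
    else
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# usage, run from /usr/local/data/datasets/dbpedia/2014/:
#   link_graph importedGraphs/ext.dbpedia.org \
#     en/genders_en.nt.gz en/out_degree_en.nt.gz en/page_length_en.nt.gz
```
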

This should have prepared your importedGraphs directory. From this directory you can run the following command, which prints out the necessary isql commands to register your graphs for importing:

for g in * ; do echo "ld_dir_all('$(pwd)/$g', '*.*', 'http://$g');" ; done
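
The loop above treats everything in the current directory as a graph, including stray files. A slightly more careful sketch, with the made-up name print_ld_dir_all, takes the directory as an argument and only looks at subdirectories:

```shell
# print_ld_dir_all BASE_DIR: print one ld_dir_all(...) statement per
# subdirectory of BASE_DIR, using the directory name as the graph host.
# Hypothetical helper; plain files in BASE_DIR are skipped.
print_ld_dir_all() {
  base=$(cd "$1" && pwd) || return 1
  for d in "$base"/*/; do
    [ -d "$d" ] || continue
    g=$(basename "$d")
    echo "ld_dir_all('$base/$g', '*.*', 'http://$g');"
  done
}

# usage:
#   print_ld_dir_all /usr/local/data/datasets/dbpedia/2014/importedGraphs
```

The output has the same form as that of the one-liner, so it can be pasted into isql in the same way.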

One more thing (thanks to Romain): In order for the DBpedia.vad package (which is installed at the end) to work correctly, the dbpedia_2014.owl file needs to be imported into graph http://dbpedia.org/resource/classes#.

Note: In the following I will assume that your Virtuoso isql command is called isql. If you lack such a command, it might be called isql-vt, but this usually means you installed Virtuoso using some other method than the one described here.

isql # enter virtuoso sql mode
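-- we are in sql mode now
-- note: the paths below are from our server, which keeps the datasets under
-- /usr/local/data/datasets/remote/; use the paths the loop above printed for you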
ld_add('/usr/local/data/datasets/remote/dbpedia/2014/dbpedia_2014.owl', 'http://dbpedia.org/resource/classes#');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org', '*.*', 'http://dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org', '*.*', 'http://de.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/ext.dbpedia.org', '*.*', 'http://ext.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/pagelinks.dbpedia.org', '*.*', 'http://pagelinks.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/pagelinks.de.dbpedia.org', '*.*', 'http://pagelinks.de.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/topicalconcepts.dbpedia.org', '*.*', 'http://topicalconcepts.dbpedia.org');

-- do the following to see which files were registered to be added:
SELECT * FROM DB.DBA.LOAD_LIST;
-- if unsatisfied use:
-- delete from DB.DBA.LOAD_LIST;
EXIT;

You can now also register other datasets like Freebase, DBLP, Yago, Umbel and Schema.org … that you want to be loaded. Our full DB.DBA.LOAD_LIST currently looks like this:

SELECT ll_graph, ll_file FROM DB.DBA.LOAD_LIST;
ll_graph                             ll_file
VARCHAR                              VARCHAR NOT NULL
____________________________________

http://dblp.l3s.de                   /usr/local/data/datasets/remote/dblp/l3s/2014-11-08/dblp.nt.gz
http://dbpedia.org/resource/classes# /usr/local/data/datasets/remote/dbpedia/2014/dbpedia_2014.owl
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/amsterdammuseum_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/article_categories_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/bbcwildlife_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/bookmashup_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/bricklink_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/category_labels_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/cordis_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/dailymed_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/dblp_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/dbtune_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/disambiguations_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/diseasome_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/drugbank_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/eunis_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/eurostat_linkedstatistics_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/eurostat_wbsg_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/external_links_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/factbook_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/flickrwrappr_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/freebase_links_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/gadm_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/geo_coordinates_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/geonames_links_en_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/geospecies_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/gho_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/gutenberg_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/homepages_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/images_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/infobox_properties_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/infobox_property_definitions_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/instance_types_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/instance_types_heuristic_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/interlanguage_links_chapters_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/iri_same_as_uri_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/italian_public_schools_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/labels_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/labels_en_uris_de.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/labels_en_uris_fr.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/linkedgeodata_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/linkedmdb_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/long_abstracts_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/long_abstracts_en_uris_de.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/long_abstracts_en_uris_fr.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/mappingbased_properties_cleaned_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/musicbrainz_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/nytimes_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/opencyc_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/openei_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/page_ids_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/persondata_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/redirects_transitive_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/revision_ids_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/revision_uris_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/revyu_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/short_abstracts_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/short_abstracts_en_uris_de.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/short_abstracts_en_uris_fr.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/sider_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/skos_categories_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/specific_mappingbased_properties_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/tcm_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/umbel_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/uscensus_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/wikicompany_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/wikipedia_links_en.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/wordnet_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/yago_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/yago_taxonomy.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/yago_type_links.nt.gz
http://dbpedia.org                   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/dbpedia.org/yago_types.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/article_categories_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/category_labels_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/disambiguations_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/external_links_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/freebase_links_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/geo_coordinates_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/homepages_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/images_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/infobox_properties_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/infobox_property_definitions_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/instance_types_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/interlanguage_links_chapters_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/iri_same_as_uri_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/labels_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/long_abstracts_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/mappingbased_properties_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/out_degree_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/page_ids_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/page_length_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/persondata_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/pnd_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/redirects_transitive_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/revision_ids_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/revision_uris_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/short_abstracts_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/skos_categories_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/specific_mappingbased_properties_de.nt.gz
http://de.dbpedia.org                /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/de.dbpedia.org/wikipedia_links_de.nt.gz
http://ext.dbpedia.org               /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/ext.dbpedia.org/genders_en.nt.gz
http://ext.dbpedia.org               /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/ext.dbpedia.org/out_degree_en.nt.gz
http://ext.dbpedia.org               /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/ext.dbpedia.org/page_length_en.nt.gz
http://pagelinks.dbpedia.org         /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/pagelinks.dbpedia.org/page_links_en.nt.gz
http://pagelinks.de.dbpedia.org      /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/pagelinks.de.dbpedia.org/page_links_de.nt.gz
http://topicalconcepts.dbpedia.org   /usr/local/data/datasets/remote/dbpedia/2014/importedGraphs/topicalconcepts.dbpedia.org/topical_concepts_en.nt.gz
http://rdf.freebase.com              /usr/local/data/datasets/remote/freebase/2014-11-02/freebase-rdf-2014-11-02-00-00.gz
http://schema.org                    /usr/local/data/datasets/remote/schema.org/2014-11-08/all.nt
http://umbel.org/umbel/rc/           /usr/local/data/datasets/remote/umbel/External Ontologies/dbpediaOntology.n3
http://umbel.org/umbel/rc/           /usr/local/data/datasets/remote/umbel/External Ontologies/schema.org.n3
http://umbel.org/umbel               /usr/local/data/datasets/remote/umbel/Ontology/umbel.n3
http://umbel.org/umbel/rc/           /usr/local/data/datasets/remote/umbel/Reference Structure/umbel_reference_concepts.n3
http://yago-knowledge.org/resource/  /usr/local/data/datasets/remote/yago/yago2/2012-12/yagoLabels.ttl.gz

114 Rows. -- 5 msec.

OK, now comes the fun (and long) part: about 1.5 hours for DBpedia alone (the new Virtuoso 7 is cool ;)), plus roughly 3 hours for Freebase. Now that we have registered the files to be added, let’s finally start the process. Fire up screen if you didn’t already. (For more detailed metering than below see VirtTipsAndTricksGuideLDMeterUtility.)

sudo aptitude install screen
screen isql
rdf_loader_run();
-- DO NOT USE THE DB BESIDES THE FOLLOWING COMMANDS:
-- depending on the amount of CPUs and your IO performance you can run
-- more rdf_loader_run(); commands in other isql sessions which will
-- speed up the import process.
-- you can watch the progress from another isql session with:
-- select * from DB.DBA.LOAD_LIST;
-- if you need to stop the loading for any reason: rdf_load_stop ();
-- if you want to force stopping: rdf_load_stop(1);
checkpoint;
commit WORK;
checkpoint;
EXIT;

After this:
Take a look at the /var/lib/virtuoso/db/virtuoso.log file. Should you find any errors in there… FIX THEM! You might still use the dump, but it’s incomplete then: any error aborts the loading of the corresponding file and the loader continues with the next one, so you only get the part of that file up to the place where the error occurred. (Should you find errors you can’t fix, please leave a comment.)
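
For a first pass over the log, grep can list the suspicious lines. This is only a rough sketch: the helper name find_load_errors is made up for illustration, and matching on the word "error" (case-insensitive) is an assumption about how problems show up in virtuoso.log, so skim the log by hand as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Virtuoso): print every line of a log file
# that mentions "error" (case-insensitive), prefixed with its line number.
# Assumption: failed loads leave a line containing the word "error".
find_load_errors() {
    local logfile="$1"
    # grep exits with status 1 when nothing matches; treat that as "no hits".
    grep -i -n 'error' "$logfile" || true
}

# Example usage (adjust the path to your installation):
# find_load_errors /var/lib/virtuoso/db/virtuoso.log
```

Empty output means grep found no matching line, which is a good sign but no proof that every file loaded completely.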

Final polishing

You can & should now install the DBpedia and RDF Mappers packages from the Virtuoso Conductor.
http://your-server:8890

login: dba
pw: dba

Go to System Admin / Packages. Install the dbpedia (v. 1.4.28) and rdf_mappers (v. 1.34.74) packages (takes about 5 minutes).

Testing your local mirror

Go to the SPARQL endpoint of your server at http://your-server:8890/sparql (or, in isql, prefix the query with SPARQL):

sparql SELECT COUNT(*) WHERE { ?s ?p ?o } ;

This shouldn’t take long in Virtuoso 7 anymore. For me it now returns 695,553,624 for DBpedia (en+de) alone, or 3,543,872,243 for DBpedia (en+de) together with Freebase, DBLP, Yago, Umbel and Schema.org.

I also like this query showing all the graphs and how many triples are in them:

sparql SELECT ?g COUNT(*) { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC 2;
g                                                           callret-1
LONG VARCHAR                                                LONG VARCHAR
____________________________________________________________

http://rdf.freebase.com                                     2760013365
http://dbpedia.org                                          375176108
http://pagelinks.dbpedia.org                                149707899
http://de.dbpedia.org                                       92508750
http://dblp.l3s.de                                          72519345
http://pagelinks.de.dbpedia.org                             55804533
http://ext.dbpedia.org                                      21900162
http://yago-knowledge.org/resource                          15372307
http://umbel.org/umbel/rc                                   403452
http://www.openlinksw.com/schemas/RDF_Mapper_Ontology/1.0/  256065
http://topicalconcepts.dbpedia.org                          149638
http://dbpedia.org/resource/classes                         27063
http://schema.org                                           8727
http://localhost:8890/DAV/                                  6187
http://www.openlinksw.com/schemas/virtrdf#                  2639
http://umbel.org/umbel                                      1702
http://OPEN.vocab.org/terms                                 1480
http://purl.org/ontology/bibo/                              1226
http://purl.org/goodrelations/v1                            937
http://purl.org/dc/terms/                                   857
http://www.openlinksw.com/schemas/opengraph                 804
http://www.openlinksw.com/schemas/linkedin                  741
http://www.openlinksw.com/schemas/googleplus                696
http://www.openlinksw.com/schemas/google-base               691
http://www.openlinksw.com/schemas/cv                        661
virtrdf-label                                               638
http://xmlns.com/foaf/0.1/                                  557
http://rdfs.org/sioc/ns#                                    553
http://www.openlinksw.com/schemas/evri                      482
http://www.openlinksw.com/schemas/crunchbase                444
http://bblfish.net/WORK/atom-owl/2006-06-06/                386
http://scot-project.org/scot/ns#                            332
http://www.openlinksw.com/schemas/zillow                    311
http://www.w3.org/2004/02/skos/core                         252
http://www.openlinksw.com/schemas/cnet                      225
http://www.openlinksw.com/schemas/tesco                     183
http://www.openlinksw.com/schemas/bestbuy                   172
http://www.w3.org/2002/07/owl#                              160
http://www.w3.org/2002/07/owl                               160
http://www.openlinksw.com/schemas/angel#                    144
http://www.openlinksw.com/schemas/amazon                    143
http://purl.org/dc/elements/1.1/                            139
http://www.w3.org/2007/05/powder-s#                         117
http://www.openlinksw.com/schemas/twitter                   103
http://www.openlinksw.com/schemas/stackoverflow#            102
http://www.openlinksw.com/schemas/klout                     90
http://www.w3.org/2000/01/rdf-schema#                       87
http://www.w3.org/1999/02/22-rdf-syntax-ns#                 85
http://www.openlinksw.com/schemas/ebay                      79
http://www.openlinksw.com/schema/attribution#               68
http://www.openlinksw.com/schemas/nyt                       41
http://www.openlinksw.com/schemas/wolframalpha#             32
http://www.openlinksw.com/schemas/oplbase                   26
http://www.openlinksw.com/schemas/cert#                     23
http://www.openlinksw.com/schemas/money                     21
http://www.openlinksw.com/schemas/dbpedia-spotlight#        21
http://localhost:8890/sparql                                14
http://dbpedia.org/schema/property_rules#                   12
dbprdf-label                                                6

59 ROWS. -- 61717 msec.
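
To cross-check the per-graph counts against the total from the first query, the last column can be summed up. A minimal sketch, assuming the isql output above was saved to a text file and that data rows are exactly the lines with at least two fields whose last field is a plain integer; the function name sum_graph_counts is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the triple counts in the last column of the
# "graph / count" listing. Header, separator and "N ROWS" lines are skipped
# because their last field is not a plain integer.
sum_graph_counts() {
    # Reads the files given as arguments, or stdin if there are none.
    # "%.0f" is used instead of "%d" because some awk versions clamp "%d"
    # to 32 bits, and the total here is well above 2^31.
    awk 'NF >= 2 && $NF ~ /^[0-9]+$/ { total += $NF }
         END { printf "%.0f\n", total }' "$@"
}

# Example usage (graph_counts.txt is an assumed file name):
# sum_graph_counts graph_counts.txt
```

The result should be close to the overall COUNT(*), though it need not match exactly if the store changed between the two queries.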

Congratulations, you just imported over half a billion triples (or over 3.5 billion triples if you also loaded Freebase and the other datasets).

Backing up this initial state

Now is a good moment to back up the whole DB (takes about half an hour):

sudo -i
cd /
/etc/init.d/virtuoso-opensource stop &&
tar -cvf - /var/lib/virtuoso | lzop > virtuoso-7.1.0-DBDUMP-$(date '+%F')-dbpedia-2014-en_de.tar.lzop &&
/etc/init.d/virtuoso-opensource start

Afterwards you might want to repack this with xz (lzma) like this:

# aptitude install xz-utils
# strip the ".lzop" suffix (including the dot) so the result is named *.tar.xz
for f in virtuoso-7.1.0-DBDUMP-*.tar.lzop ; do lzop -d -c "$f" | xz > "${f%.lzop}.xz" ; done

Yay, done ;)
As always, feel free to leave comments if I made a mistake or to tell us about your problems or how happy you are :D.

Our database dump file

In case you really want exactly the same state of the public datasets that we have loaded (as described above) you can download our database dump (57 GB, md5sum, including: DBpedia 2014 en,de,links,dbpedia_2014.owl, Freebase, DBLP, Yago, Umbel and Schema.org).

Thanks

Many thanks to the DBpedia team for their endless efforts in providing us all with a great dataset. Also many thanks to the Virtuoso crew for releasing an open source version of their DB.

Updates

  • 2014-11-11: Added link to our Dump-File
  • 2014-11-24: Thanks to Romain: Load dbpedia_2014.owl into graph http://dbpedia.org/resource/classes# for DBpedia.vad to find it when resolving http://your-server:8890/ontology/author for example.

November 08, 2014

Weblog der Fachschaft Informatik

Informatik Praxistag 2014

A quick reminder:
This year’s Informatik-Praxistag takes place on November 13, 2014.

There will be three panel discussions:

  • 11:00 to 11:45: Software development in Germany: will software still be produced in Germany 20 years from now?
  • 13:30 to 14:15: Mid-sized company versus large corporation: which employer suits me?
  • 14:15 to 15:00: Bachelor, Master, PhD: how do I start a successful career?

In addition, various companies will present themselves.

You can find all further information here at the FIT.

October 30, 2014

Weblog der Fachschaft Informatik

Discussion round on “The Cashless Future”

On Wednesday, November 5, 2014, a fishbowl discussion will take place on the topic “Happy without banknotes? On Bitcoins, digital barter deals and the cashless future”. There you will have the chance to learn about the topic and to talk to experts. In addition, students will present their seminar papers on the topic as posters and be available for questions. The event starts at 6 pm in the BIC; the discussion itself begins at 7 pm.

October 29, 2014

On this page you’ll find a list of Content-types. If you want to know how to determine them programmatically, read the article Get the mimetype for a file type. [...]

When receiving data from a server or sending data to it, the receiver should always get a hint of what type the data is. [...]

October 28, 2014

Weblog der Fachschaft Informatik

2nd FSR meeting of winter semester 14

A reminder here as well: on Wednesday, October 29, at 14:15 there is another FSR meeting. The preliminary agenda is as follows:

  1. Setting the agenda
  2. Minutes
  3. Announcements
  4. Crêpe stand at the Informatik Praxistag
  5. Christmas party and VV
  6. Towels GMF
  7. Miscellaneous

October 24, 2014

Weblog der Fachschaft Informatik

The EWoche is ending…

The current EWoche is slowly but surely coming to an end, and lectures resume on Monday. We thank the numerous helpers, without whom the events would not have been possible, and wish all first-semester students a good start into their “real” studies.

Today, as the crowning finale of our computer science prep course, a Robocode tournament took place in which our first-semester students competed in 19 teams to test their programming skills. Congratulations to the winners:

  1. Team “Fucking BUBBLES!!!” with the robot “Der Augenschmelzer”
  2. Team “The Destroyers of the Uni” with the robot “OLAF!!”
  3. Team “Schnitzel” with the robot “Penetrator”

Robocode tournament, EWoche WS 2014

Who’s joining in?