The Challenges of German Archival Document Categorization on Insufficient Labeled Data

Proceedings of the Third Workshop on Humanities in the Semantic Web (WHiSe 2020), co-located with 15th Extended Semantic Web Conference (ESWC 2020)

Hoppe, Fabian and Tietz, Tabea and Dessì, Danilo and Meyer, Nils and Sprau, Mirjam and Alam, Mehwish and Sack, Harald

Document exploration in archives is often challenging because the records are not organized into topic-based categories. Moreover, archival records provide only short texts, which are often insufficient for capturing their semantics. This paper proposes and explores a dataless categorization approach that uses word embeddings and TF-IDF to categorize archival documents. Additionally, it introduces a visual approach built on top of the word embeddings to enhance the exploration of the data. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.
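
As an illustration of the general idea named in the abstract, the following minimal sketch shows one common way to combine word embeddings with TF-IDF for dataless categorization: each document becomes a TF-IDF-weighted sum of its word vectors and is assigned to the category whose label embedding is closest by cosine similarity. This is not necessarily the authors' pipeline. The toy 3-dimensional vectors, the example tokens, the category labels, and all function names are hypothetical and chosen only for illustration; a real setup would load pretrained German embeddings instead.

```python
import math
from collections import Counter

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
EMBEDDINGS = {
    "steuer":   [0.9, 0.1, 0.0],
    "finanzen": [0.8, 0.2, 0.0],
    "haushalt": [0.7, 0.3, 0.1],
    "schule":   [0.0, 0.9, 0.1],
    "bildung":  [0.1, 0.8, 0.1],
    "lehrer":   [0.0, 0.9, 0.2],
    "bericht":  [0.4, 0.4, 0.4],
}


def idf_weights(docs):
    """Compute a smoothed inverse document frequency for every token."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}


def doc_vector(tokens, idf):
    """Sum the word vectors of a document, weighted by TF-IDF."""
    term_freq = Counter(t for t in tokens if t in EMBEDDINGS)
    dim = len(next(iter(EMBEDDINGS.values())))
    vec = [0.0] * dim
    for tok, count in term_freq.items():
        weight = count * idf.get(tok, 1.0)
        for i, x in enumerate(EMBEDDINGS[tok]):
            vec[i] += weight * x
    return vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(tokens, labels, idf):
    """Pick the label whose embedding is most similar to the document vector."""
    vec = doc_vector(tokens, idf)
    return max(labels, key=lambda label: cosine(vec, EMBEDDINGS[label]))


# Two short, already tokenized records (hypothetical).
docs = [
    ["bericht", "steuer", "haushalt"],
    ["bericht", "schule", "lehrer"],
]
idf = idf_weights(docs)
labels = ["finanzen", "bildung"]  # category names double as the only supervision
predictions = [categorize(doc, labels, idf) for doc in docs]
print(predictions)
```

No labeled training documents are used: the category names themselves carry all the supervision, which is why the quality of the embeddings limits what such an approach can achieve on short archival texts.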

@inproceedings{hoppe_challenges_2020,
  title = {The {Challenges} of {German} {Archival} {Document} {Categorization} on {Insufficient} {Labeled} {Data}},
  booktitle = {Proceedings of the {Third} {Workshop} on {Humanities} in the {Semantic} {Web} ({WHiSe} 2020), co-located with 15th {Extended} {Semantic} {Web} {Conference} ({ESWC} 2020)},
  author = {Hoppe, Fabian and Tietz, Tabea and Dessì, Danilo and Meyer, Nils and Sprau, Mirjam and Alam, Mehwish and Sack, Harald},
  year = {2020}
}