Corpus preparation

raw: .txt (Notepad)
- Upload one or more texts in .txt format
- You will be able to search using wildcards (find all words beginning with un-, ending in -ing, containing 'zz', etc.)
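As a rough illustration of what the wildcard searches above do, here is a minimal Python sketch. It uses the standard-library fnmatch module, where * stands for any run of characters. This is an assumption for illustration only: the corpus tool's own wildcard syntax may differ, and the helper name wildcard_search and the word list are hypothetical.

```python
from fnmatch import fnmatchcase

def wildcard_search(words, pattern):
    """Return the words that match a shell-style wildcard pattern (case-sensitive)."""
    return [w for w in words if fnmatchcase(w, pattern)]

# Hypothetical word list standing in for the tokens of a raw .txt corpus.
words = ["unhappy", "running", "puzzle", "undo", "sing", "jazz"]

begins_un = wildcard_search(words, "un*")      # words beginning with un-
ends_ing = wildcard_search(words, "*ing")      # words ending in -ing
contains_zz = wildcard_search(words, "*zz*")   # words containing 'zz'

print(begins_un, ends_ing, contains_zz)
```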
parallel: .xlsx
- Source-language sentence in column A, target-language sentence in column B

You will be able to search using wildcards, POS tags (find all verbs, adjectives, etc.) and lemmas (one search finds see, sees, saw, seeing, etc.)
The format is
- one word per line (vertical)
- each line = token tab tag tab lemma
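The vertical format described above can be sketched in Python as follows. The function name to_vertical, the sample sentence and the POS tags are illustrative assumptions. The tags your tagger produces depend on its tagset.

```python
def to_vertical(tagged_tokens):
    """Turn (token, tag, lemma) triples into one-word-per-line text.

    Each output line is: token <TAB> tag <TAB> lemma
    """
    lines = ["\t".join(triple) for triple in tagged_tokens]
    return "\n".join(lines) + "\n"

# Hypothetical tagged sentence; the tags are for illustration only.
sample = [
    ("She", "PP", "she"),
    ("saw", "VVD", "see"),
    ("dogs", "NNS", "dog"),
    (".", "SENT", "."),
]

vertical = to_vertical(sample)
print(vertical, end="")
```

Saving the resulting string to a .txt file gives one token per line, with the tag and lemma in the second and third tab-separated columns.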
You can use these apps to produce this format
- TreeTagger: download it to your local machine and follow the instructions in the readme file. For Indonesian, I created a Windows package so you can get the output with a single click!
- Tokeniser-Tagger: I created a web-based app supporting Indonesian, Japanese and English (it might be slow!), accessible here: https://tokeniser-tagger.streamlit.app/

You will be able to restrict your searches by metalinguistic information in the corpus (gender: male vs female; timeline of how words evolve across different periods; spoken vs written; pragmatic and sociolinguistic features; etc.)
You need to encode this information as XML attribute-value pairs. Learn here
One file can contain multiple pieces of metadata
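To make the idea of attribute-value pairs concrete, here is a minimal Python sketch using the standard-library xml.etree.ElementTree module. The element name text and the attribute names gender, mode and year are assumptions chosen for illustration. Use whatever names your corpus tool expects.

```python
import xml.etree.ElementTree as ET

def wrap_with_metadata(text, **attributes):
    """Wrap a text in an XML element whose attributes carry the metadata."""
    element = ET.Element("text", attributes)  # "text" is an assumed element name
    element.text = "\n" + text.strip() + "\n"
    return ET.tostring(element, encoding="unicode")

# Two hypothetical texts with different metadata, stored in the same file.
texts = [
    wrap_with_metadata("I reckon it will rain.", gender="female", mode="spoken", year="1998"),
    wrap_with_metadata("Rain is expected tomorrow.", gender="male", mode="written", year="2015"),
]

corpus_file = "\n".join(texts)
print(corpus_file)
```

Each attribute="value" pair in the opening tag is one piece of metadata that a search can later be restricted to, for example only texts with mode="spoken".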

combination of tagged and XML: you can search based on POS tag, lemma, wildcard, and metadata
You can link two separate corpora (source and target language). Each sentence must be marked up in this format: <s n="number"> sentence </s>, replacing number with the actual sentence number. Corresponding sentences in the two corpora must have the same number
- EN: <s n="20"> sentence </s>
- ID: <s n="20"> kalimat </s>
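The sentence markup above can be generated with a short Python sketch like the one below. It assumes you already have the source and target sentences as two lists in the same order. The function name number_sentences and the sample sentences are hypothetical.

```python
from xml.sax.saxutils import escape

def number_sentences(sentences, start=1):
    """Wrap each sentence as <s n="number"> sentence </s>, numbering from start."""
    return [
        f'<s n="{i}"> {escape(sentence)} </s>'  # escape &, < and > inside the sentence
        for i, sentence in enumerate(sentences, start)
    ]

# Hypothetical aligned sentences: position k in one list translates position k in the other.
english = ["Good morning.", "How are you?"]
indonesian = ["Selamat pagi.", "Apa kabar?"]

en_marked = number_sentences(english)
id_marked = number_sentences(indonesian)

for en_line, id_line in zip(en_marked, id_marked):
    print(en_line)
    print(id_line)
```

Because both lists are numbered with the same function and starting value, each sentence and its translation receive the same n value.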
