Upload a single text or multiple texts in .txt format
You will be able to search using wildcards (find all words beginning with un-, ending in -ing, containing 'zz', etc.)
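To give a feel for what a wildcard search does, here is a minimal Python sketch. It uses shell-style wildcards from the standard fnmatch module; the app's own wildcard syntax may differ, and the sample sentence and the function name wildcard_search are made up for illustration.

```python
import fnmatch
import re

def wildcard_search(text, pattern):
    """Return the words in text that match a shell-style wildcard pattern."""
    # Lowercase the text and pull out word tokens (letters, with inner ' or -)
    words = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    return [w for w in words if fnmatch.fnmatchcase(w, pattern)]

# Made-up sample text
sample = "The unhappy wizard kept buzzing around, undoing everything."

print(wildcard_search(sample, "un*"))   # words beginning with un-
print(wildcard_search(sample, "*ing"))  # words ending in -ing
print(wildcard_search(sample, "*zz*"))  # words containing 'zz'
```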
Source-language sentence in column A, target-language sentence in column B
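If your sentence pairs live in a script rather than a spreadsheet, a two-column layout can be sketched as below. This assumes a tab-separated export is acceptable, which is not stated here; check which file type the upload form actually accepts. The sentence pairs are made up.

```python
import csv
import io

# Made-up sentence pairs: (source sentence, target sentence)
pairs = [
    ("I read a book.", "Saya membaca buku."),
    ("She is sleeping.", "Dia sedang tidur."),
]

# Write one pair per row: first column = source, second column = target
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for source, target in pairs:
    writer.writerow([source, target])

parallel_text = buffer.getvalue()
print(parallel_text)
```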
You will be able to search by wildcard, POS (find all verbs, adjectives, etc.), or lemma (one search finds see, sees, saw, seeing, etc.)
The format is:
one word per line (vertical)
each line = token, tab, tag, tab, lemma
You can use these apps to produce this format:
TreeTagger: download it to your local machine and follow the instructions in the readme file. For Indonesian, I created a Windows package so you can click once and get the output!
Tokeniser-Tagger: I created a web-based app supporting Indonesian, Japanese and English (it might be slow!), accessible here: https://tokeniser-tagger.streamlit.app/
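The vertical format described above can be read back with a few lines of Python, which is a handy way to check a file before uploading. This is only a sketch: the sample tokens are made up, the tags are illustrative, and the names Token and parse_vertical are not part of any tool mentioned here.

```python
from collections import namedtuple

Token = namedtuple("Token", ["word", "tag", "lemma"])

# Made-up tagged sample: token<TAB>tag<TAB>lemma, one token per line
tagged = (
    "She\tPP\tshe\n"
    "saw\tVVD\tsee\n"
    "him\tPP\thim\n"
    "seeing\tVVG\tsee\n"
    "stars\tNNS\tstar\n"
)

def parse_vertical(text):
    """Parse vertical text into Token tuples, skipping malformed lines."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            tokens.append(Token(*parts))
    return tokens

tokens = parse_vertical(tagged)

# Lemma search: one query finds every form of "see"
forms_of_see = [t.word for t in tokens if t.lemma == "see"]

# POS search: every token whose tag starts with V (verbs in this sample)
verbs = [t.word for t in tokens if t.tag.startswith("V")]

print(forms_of_see, verbs)
```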
You will be able to restrict searches by the metalinguistic information in the corpus (gender: male vs female; period, to trace how words evolve over time; spoken vs written; pragmatic and sociolinguistic features, etc.)
You need to encode this information as XML attribute-value pairs. Learn here
One file can contain multiple pieces of metadata
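As a rough illustration of attribute-value pairs, the sketch below builds two text elements that each carry several pieces of metadata. The element names (corpus, text) and attribute names (gender, mode, year) are hypothetical; use whatever names the linked guide prescribes.

```python
import xml.etree.ElementTree as ET

# Hypothetical root element holding two texts with different metadata
corpus = ET.Element("corpus")

first = ET.SubElement(corpus, "text", {"gender": "female", "mode": "spoken", "year": "1998"})
first.text = "Example sentence from the first speaker."

second = ET.SubElement(corpus, "text", {"gender": "male", "mode": "written", "year": "2015"})
second.text = "Example sentence from the second writer."

xml_string = ET.tostring(corpus, encoding="unicode")
print(xml_string)

# Restricting by metadata: keep only the spoken texts
spoken = [el.text for el in corpus.iter("text") if el.get("mode") == "spoken"]
print(spoken)
```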
Combination of tagged and XML: you can search by POS tag, lemma, wildcard, and metadata
You can link two separate corpora (source and target language). Each sentence must be marked up in this format: <s n="number"> sentence </s>. Replace number with the actual sentence number. The numbering must be consistent across the two corpora, so that a sentence and its translation share the same n value.
EN: <s n="20"> sentence </s>
ID: <s n="20"> kalimat </s>
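If your sentences are already aligned one-to-one, the numbering can be generated rather than typed by hand. The sketch below assumes two equally long lists of sentences in the same order; the function name mark_up and the sample sentences are made up for illustration.

```python
from xml.sax.saxutils import escape

def mark_up(sentences):
    """Wrap each sentence in <s n="..."> ... </s>, numbering from 1."""
    return "\n".join(
        f'<s n="{i}"> {escape(s)} </s>'  # escape &, < and > inside the sentence
        for i, s in enumerate(sentences, start=1)
    )

# Made-up aligned sentences: same order, same count in both languages
english = ["I read a book.", "She is sleeping."]
indonesian = ["Saya membaca buku.", "Dia sedang tidur."]

# The numbers only line up if both lists have the same length
assert len(english) == len(indonesian)

en_marked = mark_up(english)
id_marked = mark_up(indonesian)

print(en_marked)
print(id_marked)
```

Save each result as its own file, one per language, so that the same n value points to the same sentence pair in both corpora.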