CrypText: Teext Perturbati0ns in the Wi1d


We welcome all feedback and comments regarding the system. Please provide us with your feedback and comments here.

General FAQ

What are "Text Perturbations in the Wild"?
=> They are human-written texts that do not conform to standard English. They are usually misspelled, yet they sound similar to the original words and carry the same meaning. Netizens often perturb words in a sentence in the hope of evading automatic filtering mechanisms online, such as toxic comment detectors.

What can I do with this website?
=> This website provides you with a toolbox to discover and monitor human-written text perturbations in the wild. Specifically, it provides (i) a dictionary to look up text perturbations for a specific word, (ii) a normalization function that transforms potential perturbations back to their original English words, (iii) a perturbation function that manipulates a given sentence using human-written perturbations, which can help you evaluate the robustness of your textual ML models against realistic noisy texts, and (iv) a monitoring interface where you can visualize the use of text perturbations on Reddit, Twitter, etc.
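
To make functions (i) to (iii) concrete, here is a minimal Python sketch. It is not the CrypText API: the dictionary entries and the names `PERTURBATIONS`, `lookup`, `normalize`, and `perturb` are hypothetical and chosen only for illustration, and the real system may work quite differently.

```python
import random
import re

# Hypothetical toy dictionary: original word -> perturbations seen "in the wild".
# These entries are made up for illustration and are not CrypText data.
PERTURBATIONS = {
    "vaccine": ["vacc1ne", "vaxxine"],
    "amazon": ["amaz0n", "amazzon"],
}

# Reverse index used for normalization: perturbation -> original word.
ORIGINAL = {p: word for word, perts in PERTURBATIONS.items() for p in perts}

# A token is a run of letters, digits, or underscores, so "vacc1ne" is one token.
TOKEN = re.compile(r"\w+")


def lookup(word):
    """(i) Return the known perturbations of a word (empty list if none)."""
    return PERTURBATIONS.get(word.lower(), [])


def normalize(sentence):
    """(ii) Replace known perturbations with their original words."""
    return TOKEN.sub(
        lambda m: ORIGINAL.get(m.group(0).lower(), m.group(0)), sentence
    )


def perturb(sentence, rng):
    """(iii) Replace each word that has known perturbations with a random one."""
    def swap(match):
        options = lookup(match.group(0))
        return rng.choice(options) if options else match.group(0)

    return TOKEN.sub(swap, sentence)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same noisy text
    noisy = perturb("the vaccine shipped from amazon", rng)
    print(noisy)
    print(normalize(noisy))
```

In a robustness evaluation, you would feed the output of `perturb` to your model and compare its predictions with those on the clean sentence.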

Do you publish any APIs for mass queries?
=> We are currently providing public APIs upon request. Please contact us for more details.

Where can I read more about how the system works?
=> We will release the technical paper soon. Please stay tuned.


Social Listen

What are the data sources?
=> We use the PushShift API and the Twitter API to crawl the data for analysis. All data are permanently deleted after analysis.

What is the date range from which the data are collected, and how many data points are collected?
=> By default, the system searches at most 10K Reddit comments or submissions posted since 2020 for each query. Tweets are collected without any specific date range, with up to 200 tweets for each token.
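
The defaults above can be pictured as a simple filter-and-cap step. The sketch below is only an illustration under stated assumptions: the constant and function names are hypothetical, each Reddit item is assumed to be a dict with a `created_utc` Unix timestamp, and "since 2020" is assumed to mean on or after 1 January 2020 (UTC). It does not call any real API.

```python
from datetime import datetime, timezone

# Default limits taken from the FAQ answer; the names are hypothetical.
REDDIT_MAX_ITEMS = 10_000
TWEETS_PER_TOKEN = 200

# Assumption: "since the year 2020" means on or after 2020-01-01 UTC.
REDDIT_SINCE = datetime(2020, 1, 1, tzinfo=timezone.utc).timestamp()


def cap_reddit(items):
    """Keep items posted since 2020, then cap at the per-query maximum."""
    recent = [item for item in items if item["created_utc"] >= REDDIT_SINCE]
    return recent[:REDDIT_MAX_ITEMS]


def cap_tweets(tweets):
    """Tweets have no date filter; only the per-token cap applies."""
    return tweets[:TWEETS_PER_TOKEN]
```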

Why is there no timeline chart for Twitter?
=> We rely on Twitter's Recent Search API (v2) to collect tweets. Because Twitter imposes rate limits, we cannot guarantee that we collect all the relevant tweets within a given timeframe. However, we still rely on the API to continually learn new tokens online.

The website's design is inspired by and adapted from https://gnsp.in
Contact: Thai Le (https://lethaiq.github.io/tql3/). Developed by Thai Le (while at PIKE Lab).