The Wayback Machine - https://web.archive.org/web/20201115121153/https://github.com/Ciphey/Ciphey/issues/438

Find a good English dictionary for us! #438

Open
bee-san opened this issue Sep 29, 2020 · 17 comments
Comments

@bee-san (Member) commented Sep 29, 2020

Hello spoooopyyy hackers 🎃

This is a Hacktoberfest only issue! 👻

This is also data-sciency!

The Problem

Our English dictionary contains words that aren't really English, yet is missing common English words.

Examples of non-common words in the dictionary:

"hlithskjalf",
"hlorrithi",
"hlqn",
"hm",
"hny",
"ho",
"hoactzin",
"hoactzines",

This is our current dictionary:

https://github.com/dwyl/english-words

What we want

An English dictionary, in JSON format, without the horrible non-words and with common English words included.

Ideas on how to achieve this

You'll likely need to do some data science: parse English text (such as books or stories), find uncommon words, and remove them. You could also potentially add more words.

I'm not the best data scientist in the world, so whatever approach you decide on will be fine.

You can also publish this work outside of Ciphey, such as in a separate GitHub repository -- so long as Ciphey can use it <3

While I'm not an expert data scientist, I have studied it -- so if you need help leave a comment :)
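The corpus-based filtering idea could be sketched like this; the sample corpus, toy word list, and MIN_COUNT are all placeholders for a real corpus and the full dictionary:

```python
# Sketch of the "parse English text and keep only common words" idea.
# The corpus, toy dictionary, and MIN_COUNT are placeholder values; a
# real run would use a large text corpus and the full word list.
import json
import re
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog the fox ran"
dictionary = ["the", "fox", "dog", "hlqn", "hoactzin"]  # toy word list

counts = Counter(re.findall(r"[a-z']+", corpus.lower()))
MIN_COUNT = 2  # tune against the real corpus size

# Keep the {word: 1} JSON shape the current dictionary uses.
filtered = {w: 1 for w in dictionary if counts[w] >= MIN_COUNT}
print(json.dumps(filtered))
```

With a big enough corpus, junk entries like "hlqn" never reach MIN_COUNT and drop out automatically.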

@issue-label-bot bot commented Sep 29, 2020

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.62. Please mark this comment with 👍 or 👎 to give our bot feedback!


@chetnaaggarwal-ca commented Sep 30, 2020

Hello @bee-san, can you please let me contribute to your code?

@bee-san (Member, Author) commented Sep 30, 2020

@chetnaaggarwal-ca Sure! Go right ahead :D

@sashreek1 (Contributor) commented Oct 2, 2020

I found a dictionary of words via https://pypi.org/project/english-words/, but it has only 25,487 words, which is almost negligible compared to the 479,826 in the UNIX default word list. However, all the words in this list are meaningful.
Here is a copy of that file:
https://pastebin.ubuntu.com/p/HzFSDs29SH/
If this dictionary seems fine to y'all, I'll open a PR.
(I was looking for a dictionary after @SkeletalDemise suggested a fix along these lines for issue #412.)

@sashreek1 (Contributor) commented Oct 2, 2020

I had tried to create a fix with the logic that @bee-san suggested in the issue description, using this tool: https://www.english-corpora.org/coca/. But it has a problem: the string "hm" apparently has a frequency almost 15 times higher than more meaningful words like "curate", and there are many more cases like it.

@Itamai (Contributor) commented Oct 2, 2020

@sashreek1 (Contributor) commented Oct 2, 2020

@Itamai the 300k dictionary looks good to me as well.

I also found the dictionary that the nltk library uses; it has 236,736 words and, as far as I can see, they all seem to be meaningful.

Which one shall I go with?

@Itamai (Contributor) commented Oct 2, 2020

> @Itamai the 300k dictionary looks good to me as well.
>
> I also found the dictionary that the nltk library uses; it has 236,736 words and, as far as I can see, they all seem to be meaningful.
>
> Which one shall I go with?

The GitHub user @chetnaaggarwal-ca got assigned to this issue, so we should let her do it. I was just giving some ideas :D
Another idea would be to merge the 300k one and the 235k one you found and then delete duplicates; that should be good enough. Or, if she knows how to write a scraper, just pull all the words from an online dictionary.
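The merge-and-dedupe step is a one-liner with Python sets; a minimal sketch, with tiny stand-ins for the 300k and nltk lists:

```python
# A minimal sketch of "merge the two lists and delete duplicates":
# set union does the dedup. These tiny lists stand in for the real
# 300k and 236k word lists.
words_300k = {"apple", "banana", "curate"}
words_nltk = {"banana", "curate", "dog"}

merged = sorted(words_300k | words_nltk)  # union drops duplicates
print(len(merged), merged)
```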

@sashreek1 (Contributor) commented Oct 2, 2020

Sure thing! I'll wait for that to finish so I can begin #412 :)

@bee-san (Member, Author) commented Oct 2, 2020

I tried to merge dictionaries previously and it didn't work out so well :( I also tried scraping the speeches from the Houses of Parliament :(

The dictionary is loaded as JSON, so it looks like {word: 1}. What do you think about scraping words from multiple sources and building a frequency distribution, like so:

{and: 0.21}

So the word "and" appears a lot. Then we add the frequencies together, and if a word's total goes over a threshold, we keep it?

We do a similar thing for stop words / top 1000 words, except this is for every word in the dictionary :)
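That summed-frequency idea could look something like the following; the per-source numbers and THRESHOLD are invented purely for illustration:

```python
# Sketch of the summed-frequency idea: each source maps words to a
# relative frequency; we add them up and keep words over a threshold.
# The source numbers and THRESHOLD are invented for illustration.
from collections import defaultdict

sources = [
    {"and": 0.21, "hm": 0.001, "curate": 0.004},
    {"and": 0.19, "curate": 0.006},
]

combined = defaultdict(float)
for freqs in sources:
    for word, freq in freqs.items():
        combined[word] += freq

THRESHOLD = 0.005
dictionary = {w: 1 for w, f in combined.items() if f >= THRESHOLD}
print(dictionary)  # "hm" falls below the threshold and is dropped
```

Summing across sources also dampens the COCA problem above: a filler like "hm" would need to score high in every source to survive.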

@sashreek1 (Contributor) commented Oct 2, 2020

Oh right.
But the issue there would be that we might miss several words that aren't used in today's colloquial lingo but are still meaningful, and someone may come across an encrypted version of one of them.
This also depends on the source we pick, but whatever source we pick could still miss some words, right?

@bee-san (Member, Author) commented Oct 2, 2020

> Oh right.
> But the issue there would be that we might miss several words that aren't used in today's colloquial lingo but are still meaningful, and someone may come across an encrypted version of one of them.
> This also depends on the source we pick, but whatever source we pick could still miss some words, right?

If we used English Wikipedia, I think it'd cover a significant number of words, and for the words that do not appear we could always do something like this:

if value == 0 but word is in dictionary:
    return plaintext
elif value > THRESHOLD:
    return plaintext

It's unlikely for any given text to be comprised entirely of rare words that have never appeared (or rarely appear) in English Wikipedia :)
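A runnable take on that check; the frequency table, dictionary, and THRESHOLD here are made-up stand-ins, and this version walks word by word rather than scoring the whole text at once:

```python
# Hypothetical stand-ins for the real data structures.
WIKI_FREQ = {"the": 0.05, "cat": 0.002}   # pretend Wikipedia frequencies
DICTIONARY = {"the", "cat", "hoactzin"}   # pretend full word list
THRESHOLD = 0.001

def looks_like_plaintext(words):
    for word in words:
        freq = WIKI_FREQ.get(word, 0.0)
        if freq == 0.0 and word in DICTIONARY:
            continue  # "value == 0 but word is in dict": rare but real
        if freq > THRESHOLD:
            continue  # common enough to count as plaintext
        return False
    return True

print(looks_like_plaintext(["the", "hoactzin"]))  # True
print(looks_like_plaintext(["the", "xqzzy"]))     # False
```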

@sashreek1 (Contributor) commented Oct 2, 2020

Awesome!
We could probably also add a manual review step to check whether a word has been falsely claimed to be "not common". That isn't likely, but it is still possible.
What do you think?

@sashreek1 (Contributor) commented Oct 4, 2020

I found https://github.com/IlyaSemenov/wikipedia-word-frequency, where they have conveniently calculated the frequency of English words on Wikipedia for us (https://github.com/IlyaSemenov/wikipedia-word-frequency/blob/master/results/enwiki-20190320-words-frequency.txt). Perhaps I could use this along with the dictionaries we currently use to figure out the "common words" and split them into two files, one of common words and one of uncommon words, which we can then manually filter to see if any words from the uncommon file belong in the common file.
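Assuming each line of that results file is "word count", the split could be sketched like this; the sample lines and CUTOFF are invented and would need tuning against the real file:

```python
# Sketch of the common/uncommon split. Assumes each line of the
# wikipedia-word-frequency results file is "word count"; the sample
# lines and CUTOFF are invented, so tune both against the real file.
lines = ["the 26400052", "curate 8113", "hlqn 2"]  # stand-in for the file

CUTOFF = 1000  # minimum Wikipedia occurrences to count as "common"
common, uncommon = [], []
for line in lines:
    word, count = line.split()
    (common if int(count) >= CUTOFF else uncommon).append(word)

print(common)    # goes straight into the dictionary
print(uncommon)  # candidates for manual review
```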

@sashreek1 (Contributor) commented Oct 4, 2020

I have also found a highly reliable source called Datamuse, which provides an awesome API that uses Google's n-grams to report the frequency of words across texts, books, and other material Google has access to: https://www.datamuse.com/api/.

For example: https://api.datamuse.com/words?sp=awesome&md=f&max=1

However, they have a cap of 100k requests per day.

We can get around this cap if 4-5 of us run a script to fetch the frequency of every single word in the dictionary we currently use, then merge the collected data into one file. We can then filter this data using whatever threshold we require.

This solution (in my opinion) sounds like the most viable one so far. If y'all agree, I can get the script done (including splitting the file so four of us can run it).
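Two helper pieces that script would need might look like this. It assumes the md=f metadata comes back as an "f:&lt;number&gt;" tag (frequency per million words), as in the API example above; the `sample` response and its number are canned, so no network call happens here:

```python
# Sketch of helpers for the Datamuse plan. `sample` is a canned,
# made-up response in the md=f tag format; a real script would fetch
# each word from https://api.datamuse.com/words?sp=<word>&md=f&max=1.
def parse_frequency(entry):
    """Pull the numeric frequency out of a Datamuse result's tags."""
    for tag in entry.get("tags", []):
        if tag.startswith("f:"):
            return float(tag[2:])
    return 0.0

def split_for_runners(words, n):
    """Deal the word list into n roughly equal chunks, one per person."""
    return [words[i::n] for i in range(n)]

sample = {"word": "awesome", "tags": ["f:19.438592"]}
print(parse_frequency(sample))
print(split_for_runners(["alpha", "beta", "gamma", "delta", "echo"], 4))
```

Striding with `words[i::n]` keeps the chunks within one word of each other in size, so each runner makes roughly the same number of requests.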

@bee-san (Member, Author) commented Oct 4, 2020

> Hello @bee-san, can you please let me contribute to your code?

Are you still working on this?

@bee-san (Member, Author) commented Oct 5, 2020

Reassigned issue due to no communication :)
