The Wayback Machine - https://web.archive.org/web/20201115121153/https://github.com/Ciphey/Ciphey/issues/438

Find a good English dictionary for us! #438

Open
bee-san opened this issue Sep 29, 2020 · 17 comments
Comments

@bee-san (Member) commented Sep 29, 2020

Hello spoooopyyy hackers 🎃

This is a Hacktoberfest only issue! 👻

This is also data-sciency!

The Problem

Our English dictionary contains words that aren't really English, yet is missing common English words.

Examples of non-common words in the dictionary:

"hlithskjalf",
"hlorrithi",
"hlqn",
"hm",
"hny",
"ho",
"hoactzin",
"hoactzines",

This is our current dictionary:

https://github.com/dwyl/english-words

What we want

An English dictionary, in JSON format, without the horrible non-words and with common English words included.

Ideas on how to achieve this

You'll likely need to do some data science: parse English text (such as books or stories), find uncommon words, and remove them. You could also potentially add more words.

I'm not the best data scientist in the world, so whatever approach you decide on will be fine.

You can also publish this work outside of Ciphey, such as in a separate GitHub repository -- so long as Ciphey can use it <3

While I'm not an expert data scientist, I have studied it -- so if you need help leave a comment :)
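The corpus-based filtering idea could be sketched like this; the sample corpus, toy word list, and MIN_COUNT are all placeholders for a real corpus and the full dictionary:

```python
# Sketch of the "parse English text and keep only common words" idea.
# The corpus, toy dictionary, and MIN_COUNT are placeholder values; a
# real run would use a large text corpus and the full word list.
import json
import re
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog the fox ran"
dictionary = ["the", "fox", "dog", "hlqn", "hoactzin"]  # toy word list

counts = Counter(re.findall(r"[a-z']+", corpus.lower()))
MIN_COUNT = 2  # tune against the real corpus size

# Keep the {word: 1} JSON shape the current dictionary uses.
filtered = {w: 1 for w in dictionary if counts[w] >= MIN_COUNT}
print(json.dumps(filtered))
```

With a big enough corpus, junk entries like "hlqn" never reach MIN_COUNT and drop out automatically.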

@issue-label-bot bot commented Sep 29, 2020

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.62. Please mark this comment with 👍 or 👎 to give our bot feedback!


@chetnaaggarwal-ca commented Sep 30, 2020

Hello @bee-san, can you please let me contribute to your code?

@bee-san (Member, Author) commented Sep 30, 2020

@chetnaaggarwal-ca Sure! Go right ahead :D

@sashreek1 (Contributor) commented Oct 2, 2020

I found a dictionary of words via https://pypi.org/project/english-words/, but it has only 25,487 words, which is almost negligible compared to the 479,826 in the UNIX default word list. However, all the words in this list are meaningful.
Here is a copy of that file:
https://pastebin.ubuntu.com/p/HzFSDs29SH/
If this dictionary seems fine to y'all, I'll open a PR.
(I was looking for a dictionary after @SkeletalDemise suggested a fix along these lines for issue #412.)

@sashreek1 (Contributor) commented Oct 2, 2020

I had tried to create a fix with the logic that @bee-san suggested in the issue description, using this tool: https://www.english-corpora.org/coca/. But it has a problem: the string "hm" apparently has a frequency almost 15 times higher than more meaningful words like "curate", and there are many more cases like it.

@Itamai (Contributor) commented Oct 2, 2020

@sashreek1 (Contributor) commented Oct 2, 2020

@Itamai the 300k dictionary looks good to me as well.

I also found the dictionary that the nltk library uses; it has 236,736 words and, as far as I can see, they all seem to be meaningful.

Which one shall I go with?

@Itamai (Contributor) commented Oct 2, 2020

> @Itamai the 300k dictionary looks good to me as well.
>
> I also found the dictionary that the nltk library uses; it has 236,736 words and, as far as I can see, they all seem to be meaningful.
>
> Which one shall I go with?

The GitHub user @chetnaaggarwal-ca got assigned to this issue, so we should let her do it. I was just giving some ideas :D
Another idea would be to merge the 300k one and the 235k one you found and then delete duplicates; that should be good enough. Or, if she knows how to write a scraper, just pull all the words from an online dictionary.
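The merge-and-dedupe step is a one-liner with Python sets; a minimal sketch, with tiny stand-ins for the 300k and nltk lists:

```python
# A minimal sketch of "merge the two lists and delete duplicates":
# set union does the dedup. These tiny lists stand in for the real
# 300k and 236k word lists.
words_300k = {"apple", "banana", "curate"}
words_nltk = {"banana", "curate", "dog"}

merged = sorted(words_300k | words_nltk)  # union drops duplicates
print(len(merged), merged)
```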

@sashreek1 (Contributor) commented Oct 2, 2020

Sure thing! I'll wait for that to finish so I can begin #412 :)

@bee-san (Member, Author) commented Oct 2, 2020

I tried to merge dictionaries previously and it didn't work out so well :( I also tried scraping the speeches from the Houses of Parliament :(

The dictionary is loaded as JSON, so it looks like {word: 1}. What do you think about scraping words from multiple sources and building a frequency distribution, like so:

{and: 0.21}

So the word "and" appears a lot. Then we add the frequencies together, and if a word's total goes over a threshold, we keep it?

We do a similar thing for stop words / top 1000 words, except this is for every word in the dictionary :)
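That summed-frequency idea could look something like the following; the per-source numbers and THRESHOLD are invented purely for illustration:

```python
# Sketch of the summed-frequency idea: each source maps words to a
# relative frequency; we add them up and keep words over a threshold.
# The source numbers and THRESHOLD are invented for illustration.
from collections import defaultdict

sources = [
    {"and": 0.21, "hm": 0.001, "curate": 0.004},
    {"and": 0.19, "curate": 0.006},
]

combined = defaultdict(float)
for freqs in sources:
    for word, freq in freqs.items():
        combined[word] += freq

THRESHOLD = 0.005
dictionary = {w: 1 for w, f in combined.items() if f >= THRESHOLD}
print(dictionary)  # "hm" falls below the threshold and is dropped
```

Summing across sources also dampens the COCA problem above: a filler like "hm" would need to score high in every source to survive.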

@sashreek1 (Contributor) commented Oct 2, 2020

Oh right.
But the issue there would be that we might miss several words that aren't used in today's colloquial lingo but are still meaningful, and someone may come across an encrypted version of one of them.
This also depends on the source we pick, but whatever source we pick could still miss some words, right?

@bee-san (Member, Author) commented Oct 2, 2020

> Oh right.
> But the issue there would be that we might miss several words that aren't used in today's colloquial lingo but are still meaningful, and someone may come across an encrypted version of one of them.
> This also depends on the source we pick, but whatever source we pick could still miss some words, right?

If we used English Wikipedia, I think it'd cover a significant number of words, and for the words that do not appear we could always do something like this:

if value == 0 but word is in dictionary:
    return plaintext
elif value > THRESHOLD:
    return plaintext

It's unlikely for any given text to be comprised entirely of rare words that have never appeared (or rarely appear) in English Wikipedia :)
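A runnable take on that check; the frequency table, dictionary, and THRESHOLD here are made-up stand-ins, and this version walks word by word rather than scoring the whole text at once:

```python
# Hypothetical stand-ins for the real data structures.
WIKI_FREQ = {"the": 0.05, "cat": 0.002}   # pretend Wikipedia frequencies
DICTIONARY = {"the", "cat", "hoactzin"}   # pretend full word list
THRESHOLD = 0.001

def looks_like_plaintext(words):
    for word in words:
        freq = WIKI_FREQ.get(word, 0.0)
        if freq == 0.0 and word in DICTIONARY:
            continue  # "value == 0 but word is in dict": rare but real
        if freq > THRESHOLD:
            continue  # common enough to count as plaintext
        return False
    return True

print(looks_like_plaintext(["the", "hoactzin"]))  # True
print(looks_like_plaintext(["the", "xqzzy"]))     # False
```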

@sashreek1 (Contributor) commented Oct 2, 2020

Awesome!
We could probably also add a manual review step to check whether a word has been falsely claimed to be "not common". That isn't likely, but it is still possible.
What do you think?

@sashreek1 (Contributor) commented Oct 4, 2020

I found https://github.com/IlyaSemenov/wikipedia-word-frequency, where they have conveniently calculated the frequency of English words on Wikipedia for us (https://github.com/IlyaSemenov/wikipedia-word-frequency/blob/master/results/enwiki-20190320-words-frequency.txt). Perhaps I could use this along with the dictionaries we currently use to figure out the "common words" and split them into two files, one of common words and one of uncommon words, which we can then manually filter to see if any words from the uncommon file belong in the common file.
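Assuming each line of that results file is "word count", the split could be sketched like this; the sample lines and CUTOFF are invented and would need tuning against the real file:

```python
# Sketch of the common/uncommon split. Assumes each line of the
# wikipedia-word-frequency results file is "word count"; the sample
# lines and CUTOFF are invented, so tune both against the real file.
lines = ["the 26400052", "curate 8113", "hlqn 2"]  # stand-in for the file

CUTOFF = 1000  # minimum Wikipedia occurrences to count as "common"
common, uncommon = [], []
for line in lines:
    word, count = line.split()
    (common if int(count) >= CUTOFF else uncommon).append(word)

print(common)    # goes straight into the dictionary
print(uncommon)  # candidates for manual review
```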

@sashreek1 (Contributor) commented Oct 4, 2020

I have also found a highly reliable source called Datamuse, which provides an awesome API that uses Google's n-grams to report the frequency of words across texts, books, and other material Google has access to: https://www.datamuse.com/api/.

For example: https://api.datamuse.com/words?sp=awesome&md=f&max=1

However, they have a cap of 100k requests per day.

We can get around this cap if 4-5 of us run a script to fetch the frequency of every single word in the dictionary we currently use, then merge the collected data into one file. We can then filter this data using whatever threshold we require.

This solution (in my opinion) sounds like the most viable one so far. If y'all agree, I can get the script done (including splitting the file so four of us can run it).
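Two helper pieces that script would need might look like this. It assumes the md=f metadata comes back as an "f:&lt;number&gt;" tag (frequency per million words), as in the API example above; the `sample` response and its number are canned, so no network call happens here:

```python
# Sketch of helpers for the Datamuse plan. `sample` is a canned,
# made-up response in the md=f tag format; a real script would fetch
# each word from https://api.datamuse.com/words?sp=<word>&md=f&max=1.
def parse_frequency(entry):
    """Pull the numeric frequency out of a Datamuse result's tags."""
    for tag in entry.get("tags", []):
        if tag.startswith("f:"):
            return float(tag[2:])
    return 0.0

def split_for_runners(words, n):
    """Deal the word list into n roughly equal chunks, one per person."""
    return [words[i::n] for i in range(n)]

sample = {"word": "awesome", "tags": ["f:19.438592"]}
print(parse_frequency(sample))
print(split_for_runners(["alpha", "beta", "gamma", "delta", "echo"], 4))
```

Striding with `words[i::n]` keeps the chunks within one word of each other in size, so each runner makes roughly the same number of requests.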

@bee-san (Member, Author) commented Oct 4, 2020

> Hello @bee-san, can you please let me contribute to your code?

Are you still working on this?

@bee-san (Member, Author) commented Oct 5, 2020

Reassigned issue due to no communication :)
